Finding Duplicates SQL table
Those who wish to practice the ‘Group By’ and ‘having’ clauses in SQL are via an interactive SQL Training course. The top ones have around 80 exercises with a hands on approach helping developers practice various SQL constructions especially:
- Single table queries.
- Via aggregation.
- Grouping data.
- Joining Data from multiple tables.
The best practices of databases often revolve around presence of unique constraints (the primary key) on a table to prevent two duplications during data consolidation and extraction.
Yet, professionals can find themselves working on datasets with duplicate rows. This is due to human error, an application problem or unsorted data extracted and merged from external sources among many.
Should duplicate values be fixed? They can create problematic calculations. Companies can incur expenses due to duplicate values, especially in the case of eCommerce companies where duplicate values may result in duplicate customer orders processed many times. This can impact the business’s core values.
How to find duplicates in SQL – here is how its done
The first step in finding duplicates SQL is defining rules and criteria for detecting them. It could be a combination of two or more columns having duplicate values, or could be a single column with duplicate values. To understand more of it, let’s explore two key examples which will explore both scenarios using a customer order database.
When it comes to the general approach for either situation, finding duplicate values in SQL consists of the following two steps:
- Utilizing the ‘Group By’ clause. This helps group all rows through the target column(s) parameter. This helps developers check for any duplicate values present.
- Utilizing the ‘count’ function in the clause ‘HAVING.’ This helps detect if any of the groups have more than a single entry. Those would be the groups having multiple values.
In case of duplicate values being present in one column
Let us now examine the search for duplicate values in an individual column. In this example, a simple orders table used in any typical eCommerce setup will be used. A sample will be shown below:
|Order ID||Customer ID||Employee ID||Date of Order Placement||Shipper ID|
In this example, duplication has been spotted in the OrderID column. On an ideal basis, each row must have a unique value corresponding to each OrderID. But in rows two and three, each individual order was not assigned its own value.
The following query is hence used for determining duplicate values:
GROUP BY OrderID
HAVING COUNT(OrderID) > 1
Number of Records found: 2
|Order ID||Customer ID|
As observed, OrderID 20261 and 20264 have duplicate values. These duplicate values were determined through the clauses ‘Group By’ and ‘Having’. Once developers have validated the duplicate rows, the ‘Delete’ statement can be used to remove them.
Determining duplicate in multiple columns
Most of the time SQL analysts are interested in finding the needed rows where combinations of some columns match. For this instance,the OrderDetails table will be used which will display such duplications:
SQL analysts will find entries where the values in the Order ID and ProductID are the same. This kind of duplicate values indicate a bug in the ordering system. Each order hence processes each product in that order in the shopping cart just once.
In case multiple quantities of that product were ordered, the quantity value would rise, and separate (duplicated) rows wouldn’t be made. This kind of glitch can impact business operations adversely if orders are being fulfilled, packaged and delivered automatically.
The following query (despite its similarity to the previous one) can be used to look for duplicates in multiple column values:
SELECT OrderID, ProductID, COUNT(*)
GROUP-BY OrderID, ProductID
HAVING COUNT(*) > 1
Number of Records found: 2
From the above example of SQL find duplicates, it can be confirmed that the ordering system does have a bug. It allows SQL programmers to look for errors in the system. Such instance shows that products were registered as new orders, despite the fact that the same customer added them to the same customer.
Hence, Quality assurance teams can take corrective actions to correct the bug in the order management system. This helps them ensure this does not take place again.
It should be noted that in the above function, the COUNT(*) command was used instead of a column-specific counter command like COUNT(OrderID). The COUNT(*) command counts all rows and the COUNT(Column) only counts non-null values in a specified column.
Yet in the above example there would have been no difference due to absence of null values in either of the two columns being grouped.
Finding duplicates in SQL is usually about checks in quality and rationality of data as well as validating it. Such kinds of checks are usually applied in daily operations of all kinds of businesses and companies alike.
It is wise for SQL analysts and data scientists to know each method of finding duplication in SQL tables, and making sure everything is done in the best manner possible. Each data set is unique and the criteria coders apply for quality and rationality checks varies too.