How to

Finding Duplicates SQL Values in a Table

Finding Duplicates SQL table

Those who wish to practice the ‘Group By’ and ‘having’ clauses in SQL are via an interactive SQL Training course. The top ones have around 80 exercises with a hands on approach helping developers practice various SQL constructions especially:

  • Single table queries.
  • Via aggregation.
  • Grouping data.
  • Joining Data from multiple tables.
  • Subqueries.

The best practices of databases often revolve around presence of unique constraints (the primary key) on a table to prevent two duplications during data consolidation and extraction. 

Yet, professionals can find themselves working on datasets with duplicate rows. This is due to human error, an application problem or unsorted data extracted and merged from external sources among many.

Should duplicate values be fixed? They can create problematic calculations. Companies can incur expenses due to duplicate values, especially in the case of eCommerce companies where duplicate values may result in duplicate customer orders processed many times. This can impact the business’s core values.

How to find duplicates in SQL – here is how its done

The first step in finding duplicates SQL is defining rules and criteria for detecting them. It could be a combination of two or more columns having duplicate values, or could be a single column with duplicate values. To understand more of it, let’s explore two key examples which will explore both scenarios using a customer order database.

When it comes to the general approach for either situation, finding duplicate values in SQL consists of the following two steps:

  • Utilizing the ‘Group By’ clause. This helps group all rows through the target column(s) parameter. This helps  developers check for any duplicate values present.
  • Utilizing the ‘count’ function in the clause ‘HAVING.’ This helps detect  if any of the groups have more than a single entry. Those would be the groups having multiple values.

In case of duplicate values being present in one column

Let us now examine the search for duplicate values in an individual column. In this example, a simple orders table used in any typical eCommerce setup will be used. A sample will be shown below:

Order ID Customer ID Employee ID Date of Order Placement Shipper ID
20260 11190 12AT88 2022/07/12 33#
20261 11191 KI12345 2022/08/88 22#
20261 11191 KI2345 2022/08/28 22#
20262 11192 JNHG88 2022/09/11 32#
20263 11193 KKLO90 2022/05/06 66#
20264 11194 9989MM 2022/03/18 92#
20264 11194 77TT99 2022/03/18 92#

In this example, duplication has been spotted in the OrderID column. On an ideal basis, each row must have a unique value corresponding to each OrderID. But in rows two and three, each individual order was not assigned its own value. 

The following query is hence used for determining duplicate values:

SELECT                     OrderID,

COUNT (OrderID)

FROM Orders

GROUP BY OrderID

HAVING COUNT(OrderID) > 1

RESULT

Number of Records found: 2

Order ID Customer ID
20261 11191
20264 11194

As observed, OrderID 20261 and 20264 have duplicate values. These duplicate values were determined through the clauses ‘Group By’ and ‘Having’. Once developers have validated the duplicate rows, the ‘Delete’ statement can be used to remove them.

Determining duplicate in multiple columns

Most of the time SQL analysts are interested in finding the needed rows where combinations of some columns match. For this instance,the OrderDetails table will be used which will display such duplications:

OrderDetailID OrderID ProductID Quantity
A1 222882 44 12
A2 222882 66 10
A3 222882 78 8
A4 222990 112 12
A5 222991 178 22
A6 222992 222 20

SQL analysts will find entries where the values in the Order ID and ProductID are the same. This kind of duplicate values indicate a bug in the ordering system. Each order hence processes each product in that order in the shopping cart just once.

In case multiple quantities of that product were ordered, the quantity value would rise, and separate (duplicated) rows wouldn’t be made. This kind of glitch can impact business operations adversely if orders are being fulfilled, packaged and delivered automatically.

The following query (despite its similarity to the previous one) can be used to look for duplicates in multiple column values:

SELECT OrderID, ProductID, COUNT(*)

FROM OrderDetails

GROUP-BY OrderID, ProductID

HAVING COUNT(*) > 1

RESULT

Number of Records found: 2

From the above example of SQL find duplicates, it can be confirmed that the ordering system does have a bug. It allows SQL programmers to look for errors in the system. Such instance shows that products were registered as new orders, despite the fact that the same customer added them to the same customer.

Hence, Quality assurance teams can take corrective actions to correct the bug in the order management system. This helps them ensure this does not take place again.

It should be noted that in the above function, the COUNT(*) command was used instead of a column-specific counter command like COUNT(OrderID). The COUNT(*) command counts all rows and the COUNT(Column) only counts non-null values in a specified column. 

Yet in the above example there would have been no difference due to absence of null values in either of the two columns being grouped.

Conclusion

Finding duplicates in SQL is usually about checks in quality and rationality of data as well as validating it. Such kinds of checks are usually applied in daily operations of all kinds of businesses and companies alike.

It is wise for SQL analysts and data scientists to know each method of finding duplication in SQL tables, and making sure everything is done in the best manner possible. Each data set is unique and the criteria coders apply for quality and rationality checks varies too.

 

Related Articles

Leave a Reply

Back to top button