SQL syntax Language  Index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 [1718 19 20 

How to delete duplicate rows in the table?

    by S.Moiseenko (2009-06-13)

As a rule this kind of a question arises when certain mistakes have been made in design of the tables, in particular, primary key is missing and there is data which impedes its creation. Thereby limitations of subject field require uniqueness of corresponding data.

Let us assume that there is the following table T:

name
John
Smith
John
Smith
Smith
Tom

To simplify this analysis I don't include other columns assuming that data in those is uniquely determined by the value in name column. It is necessary to specify uniqueness of name column (e.g. using primary key), preliminarily having deleted duplicates.

Widespread solution to this problem would be generation of auxiliary table of the needed design, into which are copied unique rows from table T with subsequent dropping the table T and renaming of the auxiliary table. Below this algorithm is implemented using T-SQL code.


CREATE TABLE Ttemp(name VARCHAR(50) NOT NULL PRIMARY KEY);
GO
INSERT INTO Ttemp
SELECT DISTINCT * FROM T;
GO
DROP TABLE T;
GO
EXEC sp_rename 'Ttemp', 'T';
GO
SELECT * FROM T;

Ultimately we will achieve the necessary result:

name
John
Smith
Tom

Thereby constraints imposed by primary key will prevent from occurrence of duplicates subsequently.

Is it possible to avoid generation of the new table? In fact, it is., e.g. making use of the following algorithm:
- to add new column of IDENTITY type, which enumerates all the rows in the table;
- to delete from each group of rows with equal values in name column all the rows apart from the row with maximal number (or minimal - it is all the same, as it goes about duplicates);
- to delete auxiliary column;
- to impose restriction.

Here is the example of implementation of this approach:


ALTER TABLE T
ADD id INT IDENTITY(1,1);
GO
DELETE FROM T
WHERE id < (SELECT MAX(id)
	     FROM T AS T1
              WHERE T.name = T1.name
              );
GO
ALTER TABLE T
DROP COLUMN id;
GO
ALTER TABLE T
ALTER COLUMN name VARCHAR(50) NOT NULL;
GO
ALTER TABLE T
ADD CONSTRAINT T_PK PRIMARY KEY(name);
GO

And can we do without creation of additional column? We can answer in the affirmative, but to do so we need to make use of new features of the language, specified in ANSI SQL-99 standard. The key idea is that one should not create actual column in the table, which should be deleted subsequently, it is sufficient to generate a virtual one (calculated). This column will be created using windowed functions, assigning rank to each row in the window determined by equality of the values in name column. Finally we will delete the rows the rank of which exceeds 1.

Let's closely consider generation of the query for deleting the duplicates employing this method.

1. Numeration of the rows

We cannot fulfill the ranking of the rows merely because there is no denominator for their ranking. The matter is that identical lines will be marked by identical rank. Therefore first we should numerate them, arranging them in accordance with name column.


SELECT *, ROW_NUMBER() OVER(ORDER BY name) num
FROM T

As a result we will obtain

name	num
John	1
John	2
Smith	3
Smith	4
Smith	5
Tom	6

2. Ranking of the rows within groups of duplicates

Unfortunately, it is forbidden (MS SQL Server) to use windowed functions within windowed functions. I.e. we cannot write the following code:


SELECT name,
RANK() OVER (PARTITION BY name ORDER BY ROW_NUMBER() OVER(ORDER BY name)) rnk
FROM T;

Therefore we will use the subquery:


SELECT name, RANK() OVER (PARTITION BY name ORDER BY num) rnk
FROM (SELECT *, ROW_NUMBER() OVER(ORDER BY name) num
	  FROM T
	  ) X;

Below is given the result of this subquery.

name	rnk
John	1
John	2
Smith	1
Smith	2
Smith	3
Tom	1

3. Deleting duplicates from virtual table

It is inadmissible to delete the rows from the query, i.e. we cannot solve the problem in such a way:


DELETE FROM (SELECT name, RANK() OVER (PARTITION BY name ORDER BY num) rnk
			 FROM (SELECT *, ROW_NUMBER() OVER(ORDER BY name) num
	  		 	   FROM T
				  ) X
WHERE rnk > 1;

In DELETE statement it is possible to use only the basic table or view. Therefore first we should create a view and then delete the entries from the latter. In fact, the entries are deleted from the underlying table, for which the view has been made. Thus we can solve the problem in the following way:


CREATE VIEW Tview
AS
SELECT name, RANK() OVER (PARTITION BY name ORDER BY num) rnk
FROM(SELECT *, ROW_NUMBER() OVER(ORDER BY name) num
	 FROM T
	) X
GO
DELETE FROM Tview
WHERE rnk > 1;

"Need I really generate anything again?" - you might ask. It is not absolutely needed and, in order to prove it, we can make use of common table expressions (CTE), which can be named also virtual views. CTE, although not stored in the database, might be used together with modification statements. As a result the solution boils down to the single query:


WITH CTE AS
	(SELECT name, RANK() OVER (PARTITION BY name ORDER BY num) rnk
	 FROM(SELECT *, ROW_NUMBER() OVER(ORDER BY name) num
     	  FROM T
		  ) X
	)
DELETE FROM CTE
WHERE rnk > 1;
GO

The main thing is not to forget to generate the primary key. :-)

    2009-10-06

I have casually found this article using inbound link and have decided to mind to itself. :-)

Here is this phrase: "We cannot fulfill the ranking of the rows merely because there is no denominator for their ranking."

Certainly, it is correct, but we can refuse ranking (to the detriment of training :-)), having done "pseudo-ranking". The matter is that there is an opportunity to make independent numbering for each group of rows if you use PARTITION BY feature in the OVER clause for ROW_NUMBER function. So, RANK function becomes unnecessary here, if to accomplish partitioning by name


SELECT name, ROW_NUMBER() OVER(PARTITION BY name ORDER BY name)
FROM T;

It will simplify all the subsequent queries, in particular, last solution to a problem of removal of duplicates can be rewritten in the form:


WITH CTE AS (
	SELECT name, ROW_NUMBER() OVER(PARTITION BY name ORDER BY name) rnk
	FROM T
	 )
DELETE FROM CTE
WHERE rnk > 1;

Dzone.com





Previous | Index | Next



Home SELECT exercises (rating stages) DML exercises Developers Rambler's Top100