Remove duplicates with fewer null values

I have a staff table that contains about 25 columns. There are a lot of duplicates right now, and I would like to try to get rid of some of these duplicates.

First, I want to find duplicates, that is, records that share the same first name, last name, employee number, company number and status:

    SELECT firstname, lastname, employeenumber, companynumber, statusflag
    FROM employeemaster
    GROUP BY firstname, lastname, employeenumber, companynumber, statusflag
    HAVING COUNT(*) > 1

This gives me the duplicates, but my goal is to find and keep the single best record in each group and delete the others. The "best single record" is the one with the fewest NULL values across all the other columns. How can I do this?

I am using SQL Server Management Studio with Microsoft SQL Server 2012.

Example:

[Image: example table with duplicate records]

Red: remove. Green: keep.

Note: the table has more columns than shown in the example.

3 answers

You can use the sys.columns catalog view to get the list of columns and build a dynamic query. The query returns a "KeepThese" rank for each record; within each group of duplicates, the record with KeepThese = 1 is the one to keep under your criteria.

    -- insert test data
    create table EmployeeMaster
    (
        Record int identity(1,1),
        FirstName varchar(50),
        LastName varchar(50),
        EmployeeNumber int,
        CompanyNumber int,
        StatusFlag int,
        UserName varchar(50),
        Branch varchar(50)
    );

    insert into EmployeeMaster
    (FirstName, LastName, EmployeeNumber, CompanyNumber, StatusFlag, UserName, Branch)
    values
    ('Jake','Jones',1234,1,1,'JJONES','PHX'),
    ('Jake','Jones',1234,1,1,NULL,'PHX'),
    ('Jake','Jones',1234,1,1,NULL,NULL),
    ('Jane','Jones',5678,1,1,'JJONES2',NULL);

    -- get records with most non-null values with dynamic sys.column query
    declare @sql varchar(max)

    select @sql = '
        select e.*,
               row_number() over(partition by e.FirstName, e.LastName, e.EmployeeNumber,
                                              e.CompanyNumber, e.StatusFlag
                                 order by n.NonNullCnt desc) as KeepThese
        from EmployeeMaster e
        cross apply (select count(n.value) as NonNullCnt
                     from (select ' +
            -- one "cast(<column> as varchar(50)) as value" branch per column of the table
            replace((
                select 'cast(' + c.name + ' as varchar(50)) as value union all select '
                from sys.columns c
                where c.object_id = t.object_id
                for xml path('')
            ) + '#', ' union all select #', '') + ')n)n'
    from sys.tables t
    where t.name = 'EmployeeMaster'

    exec(@sql)
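To make the dynamic part easier to follow, here is roughly what @sql expands to for the sample table above. This is a hand-expanded sketch, not output captured from a server. sys.columns is read without an ORDER BY, so the branches may appear in a different order; that does not change the count.

```sql
-- Hand-expanded form of @sql for the sample EmployeeMaster table (illustrative sketch).
select e.*,
       row_number() over(partition by e.FirstName, e.LastName, e.EmployeeNumber,
                                      e.CompanyNumber, e.StatusFlag
                         order by n.NonNullCnt desc) as KeepThese
from EmployeeMaster e
cross apply (
    -- count(value) skips NULLs, so this is the number of non-NULL columns in the row
    select count(n.value) as NonNullCnt
    from (select cast(Record as varchar(50)) as value
          union all select cast(FirstName as varchar(50)) as value
          union all select cast(LastName as varchar(50)) as value
          union all select cast(EmployeeNumber as varchar(50)) as value
          union all select cast(CompanyNumber as varchar(50)) as value
          union all select cast(StatusFlag as varchar(50)) as value
          union all select cast(UserName as varchar(50)) as value
          union all select cast(Branch as varchar(50)) as value) n
) n
```

With the sample data, the three Jake Jones rows have 8, 7 and 6 non-NULL columns, so Record 1 gets KeepThese = 1. Jane Jones is alone in her group and also gets KeepThese = 1.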

Try this.

    ;WITH cte
         AS (SELECT Row_number() OVER (
                        PARTITION BY firstname, lastname, employeenumber, companynumber, statusflag
                        ORDER BY (SELECT NULL)) rn,
                    firstname, lastname, employeenumber, companynumber, statusflag,
                    username, branch
             FROM   employeemaster),
         cte1
         AS (SELECT a.firstname, a.lastname, a.employeenumber, a.companynumber, a.statusflag,
                    Row_number() OVER (
                        PARTITION BY a.firstname, a.lastname, a.employeenumber,
                                     a.companynumber, a.statusflag
                        -- add the remaining columns in case statement
                        ORDER BY (CASE WHEN a.username IS NULL THEN 1 ELSE 0 END
                                + CASE WHEN a.branch IS NULL THEN 1 ELSE 0 END)) rn
             FROM   cte a
                    JOIN employeemaster b
                      ON a.firstname = b.firstname
                     AND a.lastname = b.lastname
                     AND a.employeenumber = b.employeenumber
                     AND a.companynumber = b.companynumber
                     AND a.statusflag = b.statusflag)
    SELECT *
    FROM   cte1
    WHERE  rn = 1
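The queries so far only identify the record to keep; none of them deletes anything. A minimal sketch of the delete step for SQL Server follows. It assumes the two sample columns (username, branch) stand in for all the non-key columns, and it uses the Record identity column from the first answer's test table as a tie-breaker. Both are assumptions, so adjust them to the real table.

```sql
-- Sketch: delete every duplicate except the row with the fewest NULLs in its group.
-- Assumes EmployeeMaster has a unique Record column (as in the test table above).
;WITH ranked AS (
    SELECT Row_number() OVER (
               PARTITION BY firstname, lastname, employeenumber, companynumber, statusflag
               -- fewest NULLs first; extend with one CASE per remaining column
               ORDER BY CASE WHEN username IS NULL THEN 1 ELSE 0 END
                      + CASE WHEN branch   IS NULL THEN 1 ELSE 0 END,
                        record          -- tie-breaker: keep the lowest Record (assumption)
           ) AS rn
    FROM   employeemaster
)
DELETE FROM ranked
WHERE  rn > 1;
```

Deleting through the CTE removes the underlying rows in employeemaster. It may be worth running the same CTE with SELECT * first, or wrapping the DELETE in a transaction, to check which rows are affected.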

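I tested this in MySQL, using string concatenation with NULL to find the best record. Concatenating any value with NULL yields NULL, so the length of the concatenation is NULL rather than a number; a length exists only when none of the concatenated columns is NULL. Note that || concatenates in MySQL only when the PIPES_AS_CONCAT SQL mode is enabled (by default it is a logical OR), whereas CONCAT() works in any mode. This is probably not a perfect solution.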

    create table EmployeeMaster
    (
        Record int auto_increment,
        FirstName varchar(50),
        LastName varchar(50),
        EmployeeNumber int,
        CompanyNumber int,
        StatusFlag int,
        UserName varchar(50),
        Branch varchar(50),
        PRIMARY KEY(record)
    );

    INSERT INTO EmployeeMaster
    (FirstName, LastName, EmployeeNumber, CompanyNumber, StatusFlag, UserName, Branch)
    VALUES
    ('Jake', 'Jones', 1234, 1, 1, 'JJONES', 'PHX'),
    ('Jake', 'Jones', 1234, 1, 1, NULL, 'PHX'),
    ('Jake', 'Jones', 1234, 1, 1, NULL, NULL),
    ('Jane', 'Jones', 5678, 1, 1, 'JJONES2', NULL);

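My query idea looks like this:

    SELECT e.*
    FROM employeemaster e
    JOIN (
        SELECT firstname, lastname, employeenumber, companynumber, statusflag,
               MAX(LENGTH(CONCAT(username, branch))) data_quality
        FROM employeemaster
        GROUP BY firstname, lastname, employeenumber, companynumber, statusflag
        HAVING COUNT(*) > 1
    ) g
      -- match each row to its own duplicate group
      ON  e.firstname = g.firstname
      AND e.lastname = g.lastname
      AND e.employeenumber = g.employeenumber
      AND e.companynumber = g.companynumber
      AND e.statusflag = g.statusflag
      -- CONCAT() is NULL if any argument is NULL, so only fully populated rows match
      AND LENGTH(CONCAT(e.username, e.branch)) = g.data_quality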
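Because a concatenation that contains any NULL is itself NULL, the length trick cannot tell a row with one NULL from a row with several. A different technique, not the one used in this answer, is to count the non-NULL columns directly. The sketch below relies on MySQL evaluating boolean expressions to 1 or 0 and covers only the two sample columns; the alias non_null_cnt is made up for illustration.

```sql
-- Sketch (MySQL): score each row by its number of non-NULL columns.
-- (x IS NOT NULL) evaluates to 1 or 0 in MySQL, so the terms can simply be added.
SELECT e.*,
       (e.UserName IS NOT NULL) + (e.Branch IS NOT NULL) AS non_null_cnt
FROM EmployeeMaster e
ORDER BY e.FirstName, e.LastName, e.EmployeeNumber, e.CompanyNumber, e.StatusFlag,
         non_null_cnt DESC;
```

With the sample data, records 1 to 4 score 2, 1, 0 and 1 respectively, so within the Jake Jones group record 1 sorts first.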

Source: https://habr.com/ru/post/980942/
