Remove duplicates, but keep the most complete iteration

I am trying to figure out how to remove duplicates based on three variables (id, key, and num). Within each set of duplicates, I would like to remove the rows with the fewest filled columns and keep the most complete one. If two rows have an equal number of filled columns, either one can be deleted. For instance,

Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7), 
v5=c(1,NA,5,5,NA,5,NA,7))

The output will be as follows:

Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))

My real data set is larger and contains mostly numeric variables, plus some character variables, and I could not determine the best way to do this. I previously used a program that did something similar through a duplicates command called check.all.

So far, my thought has been to use grepl to determine which cells contain anything at all.

Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))

Then, using the resulting logical matrix, I take rowSums and cbind the result to the original.

CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
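
A possible shortcut, assuming "filled" simply means "not NA": the same score can be sketched without a regular expression by counting non-missing cells per row. One caveat is that an empty string "" counts as filled here, whereas the grepl pattern above would treat it as empty, so the two scores can differ on character columns.

```r
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))

# Assumption: a cell is "filled" when it is not NA.
# !is.na() gives a logical matrix; rowSums counts the TRUEs in each row.
CompleteNess <- rowSums(!is.na(Original))
```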

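From there, I would like to use this score (CompleteNess) to decide which duplicate to keep: among rows that share the same id, key, and num, keep the one with the highest CompleteNess.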


One option is to order the rows by completeness and then drop duplicates starting from the bottom:

#Order by the degree of completeness    
Original<-Original[order(CompleteNess),]

#Starting from the bottom select the not duplicated rows 
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
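
For reference, here is a self-contained sketch of the order-then-duplicated idea, run on the example data and compared against the expected output. It assumes "filled" means "not NA"; the names key_cols, Sorted, and Result are made up for this illustration.

```r
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))
Finished <- data.frame(id  = c(1,2,3,4,5),
                       key = c(1,2,3,4,5),
                       num = c(1,1,1,1,1),
                       v4  = c(1,5,5,5,7),
                       v5  = c(1,5,5,5,7))

# Assumption: completeness = number of non-NA cells in the row
CompleteNess <- rowSums(!is.na(Original))
key_cols <- c("id", "key", "num")  # hypothetical name for the duplicate key

# Least complete rows first, most complete rows last
Sorted <- Original[order(CompleteNess), ]

# Scanning from the bottom, keep the first row seen for each key,
# which is the most complete row of that group
Result <- Sorted[!duplicated(Sorted[, key_cols], fromLast = TRUE), ]

# Restore a predictable row order and row names for comparison
Result <- Result[order(Result$id, Result$key, Result$num), ]
rownames(Result) <- NULL

isTRUE(all.equal(Result, Finished, check.attributes = FALSE))
```

Selecting the key columns by name rather than by position (1:3) keeps the sketch working if the column order changes.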



Another option is to compute a completeness score and aggregate by group with plyr:

Original <- data.frame(id= c(1,2,2,3,3,4,5,5), 
                       key=c(1,2,2,3,3,4,5,5),
                       num=c(1,1,1,1,1,1,1,1),
                       v4= c(1,NA,5,5,NA,5,NA,7), 
                       v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))

#get the score 
Original$present <- rowSums(Present)

#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")

library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
      Max = max(present))

To also carry over the values from the most complete row in each group:

Final <- ddply(Original,.(id.key.num),summarize,
      Max = max(present),
      v4 = v4[which.max(present)],
      v5 = v5[which.max(present)]
      )
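
If plyr is not available, roughly the same grouping idea can be sketched in base R with ave. This is an illustrative variant rather than the method above: it scores only v4 and v5, assumes "filled" means "not NA", and the names grp, group_max, and Best are made up for the example.

```r
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))

# Score each row by its number of non-NA value columns (assumption)
Original$present <- rowSums(!is.na(Original[, c("v4", "v5")]))

# Single grouping key, built the same way as id.key.num above
grp <- paste(Original$id, Original$key, Original$num, sep = "-")

# Highest score within each group, repeated for every row of that group
group_max <- ave(Original$present, grp, FUN = max)

# Keep rows that reach their group's maximum ...
Best <- Original[Original$present == group_max, ]

# ... and, if several rows tie, keep only the first of them
Best <- Best[!duplicated(Best[, c("id", "key", "num")]), ]
```

Unlike the summarize call, this keeps whole rows, so any additional columns come along without being listed one by one.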

Source: https://habr.com/ru/post/1652507/

