I am trying to figure out how to remove duplicates based on three variables (id, key, and num). Within each set of duplicates, I would like to remove the row with the fewest filled columns. If the rows have an equal number of filled columns, either one can be deleted. For instance,
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
The desired output is as follows:
Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))
My real data set is larger and contains mostly numeric variables along with some character variables, and I could not determine the best way to do this. I previously used a program that does something similar with a duplicates command called check.all.
So far, my thought has been to use grepl to determine which cells contain anything at all:
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting logical matrix, I take rowSums and cbind the counts to the original data:
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
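For reference, here is a minimal base R sketch of where I think this is heading. It assumes an unfilled cell is always NA (empty strings in character columns would need an extra check). The names ord, Sorted, and Result are only placeholders for illustration, and I use !is.na() in place of grepl for the completeness count.

```r
# Sketch only: assumes "unfilled" means NA in every column.
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))

# Number of non-missing cells in each row
CompleteNess <- rowSums(!is.na(Original))

# Sort by the three keys, most complete row first within each group
ord <- order(Original$id, Original$key, Original$num, -CompleteNess)
Sorted <- Original[ord, ]

# duplicated() marks every row after the first in each (id, key, num) group,
# so negating it keeps only the most complete row per group
Result <- Sorted[!duplicated(Sorted[, c("id", "key", "num")]), ]
rownames(Result) <- NULL
```

When two duplicates are equally complete, this simply keeps whichever one sorts first, which is fine here since either can be deleted.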
, ... , , (CompleteNess); , .
, id, key num - CompleteNess.
- , . !