Following my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to mark all rows of a dataset that are duplicated in terms of a given set of variables:
clear *
set obs 16
g f1 = _n
expand 104
bys f1: g f2 = _n
expand 2
bys f1 f2: g f3 = _n
expand 41
bys f1 f2 f3: g f4 = _n
des // describe the dataset in memory
preserve
sample 10 // draw a 10% random sample
tempfile sampledata
save `sampledata', replace
restore
// append the duplicate rows to the data
append using `sampledata'
sort f1-f4
duplicates tag f1-f4, generate(dupvar)
browse if dupvar == 1 // check that all duplicate rows have been tagged
Edit
Here's what Stata produces (added to @Arun's query):
f1 f2 f3 f4 dupvar
1 1 1 1 0
1 1 1 2 0
1 1 1 3 1
1 1 1 3 1
1 1 1 4 0
1 1 1 5 0
1 1 1 6 0
1 1 1 7 0
1 1 1 8 1
1 1 1 8 1
Note that for (f1, f2, f3, f4) = (1, 1, 1, 3) there are two rows, and both are marked dupvar = 1. Similarly, both rows that are duplicates at (f1, f2, f3, f4) = (1, 1, 1, 8) are tagged.
R:
The base function duplicated marks only the second and later occurrences of each duplicate, not the first. So, I wrote a function to replicate the Stata behavior in R using ddply.
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])
dfDup$data = rnorm(nrow(dfDup))
library(plyr) # provides ddply

fnDupTag = function(dfX, indexVars) {
  dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
    # tag every row of any group that contains more than one row
    if (nrow(x) > 1) x$dup = 1 else x$dup = 0
    return(x)
  })
  return(dfDupTag)
}
indexVars = paste0('f', 1:4) # paste0 has no sep argument, so sep = '' was redundant
dfTemp = fnDupTag(dfDup, indexVars)
But as with the related question, performance is a huge problem. Another possible solution is
dfDup$dup = duplicated(dfDup[, indexVars]) |
duplicated(dfDup[, indexVars], fromLast = TRUE)
dfDupSorted = dfDup[do.call(order, dfDup[indexVars]), ] # sort by all index variables
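To make the intended semantics concrete, here is a minimal sketch (on a toy data frame I made up, not the data above) of the duplicated-based tagging, which should match what duplicates tag does: every member of a duplicate group gets tagged, not just the later occurrences.

```r
# Toy data: the key combination (f1, f2) = (1, 3) appears twice,
# every other combination appears once.
toy = data.frame(f1 = c(1, 1, 1, 2), f2 = c(3, 3, 4, 3))
keys = c("f1", "f2")

# duplicated() marks later occurrences; fromLast = TRUE marks earlier ones.
# Their union marks all members of any duplicate group.
toy$dup = as.integer(duplicated(toy[, keys]) |
                     duplicated(toy[, keys], fromLast = TRUE))

toy$dup
# 1 1 0 0
```

Both rows with (f1, f2) = (1, 3) are tagged, mirroring the dupvar column in the Stata output above.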
My questions:
1. Is there a way to improve the performance of the ddply approach?
2. Is the duplicated-based approach correct, i.e., does it match duplicates tag in all cases?
3. How would I do this with data.table?
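For reference, the grouped-count sketch below is the kind of data.table approach I have in mind for question 3 (a sketch on a toy table, not benchmarked against the alternatives above): .N is the row count of each by-group, so tagging groups with .N > 1 should reproduce the duplicates tag semantics.

```r
library(data.table) # assumes data.table is installed

dtDup = data.table(f1 = c(1, 1, 1, 2), f2 = c(3, 3, 4, 3))
keys = c("f1", "f2")

# Within each key group, .N is the group size; tag groups of size > 1.
dtDup[, dupvar := as.integer(.N > 1), by = keys]

dtDup$dupvar
# 1 1 0 0
```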