Delete rows of data frame based on join between multiple columns

Given the following data frame:

# input 
a <- data.frame(
  X1=c("a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b"),
  X2=c(2,4,6,2,4,7,9,5,4,7,3,5,8,4,3,5,7,6,3,5),
  X3=c(5,6,1,4,7,5,5,4,4,2,5,4,5,2,4,7,3,5,3,7)
)

How can I delete any row that is smaller in both variable 2 (X2) and variable 3 (X3) than another row with the same value of variable 1 (X1)?

For example, if

a[1,1]==a[2,1] and
a[1,2]<a[2,2] and
a[1,3]<a[2,3], then a[1,] should be removed.

# output 

a <- data.frame( X1=c("a","a","a","a","b","b","b","b"), 
                 X2=c(4,4,7,9,8,5,6,5), 
                 X3=c(6,7,5,5,5,7,5,7) ) 
2 answers

The function isRemoved returns TRUE or FALSE for each row i, depending on whether the specified condition holds for it:

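# TRUE if some row j in the same group is strictly larger than row i
# in both column 2 and column 3
isRemoved = function(i, a) {
  out = logical(nrow(a))
  for(j in seq_len(nrow(a))) {
    # same X1, and row i smaller than row j in both X2 and X3
    out[j] = a[i,1]==a[j,1] & a[i,2]<a[j,2] & a[i,3]<a[j,3]
  }
  out = any(out)
  return(out)
}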

Then you can apply this to all rows:

remove = sapply(1:nrow(a), isRemoved, a=a)

and keep the desired rows:

a.new = a[!remove, ]

a.new 

   X1 X2 X3
2   a  4  6
5   a  4  7
6   a  7  5
7   a  9  5
13  b  8  5
16  b  5  7
18  b  6  5
20  b  5  7
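
The inner loop over j can also be replaced by a single vectorised comparison of row i against whole columns. The following is only a sketch of that idea, not part of the original answer; the names isDominated, dominated and a.vec are made up for illustration, and the data frame is repeated so the snippet runs on its own.

```r
# Input data from the question, repeated so the snippet is self-contained
a <- data.frame(
  X1 = c(rep("a", 11), rep("b", 9)),
  X2 = c(2,4,6,2,4,7,9,5,4,7,3,5,8,4,3,5,7,6,3,5),
  X3 = c(5,6,1,4,7,5,5,4,4,2,5,4,5,2,4,7,3,5,3,7)
)

# Hypothetical helper: TRUE if some row of the same group is strictly
# larger than row i in both column 2 and column 3
isDominated <- function(i, a) {
  # compare row i against all rows at once instead of looping over j
  any(a[[1]] == a[i, 1] & a[[2]] > a[i, 2] & a[[3]] > a[i, 3])
}

dominated <- vapply(seq_len(nrow(a)), isDominated, logical(1), a = a)
a.vec <- a[!dominated, ]
print(a.vec)
```

This still does one pass over the data per row, so the work grows quadratically with the number of rows, but it avoids the explicit loop and the name remove, which masks a base R function.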

If speed is not your main concern, this is, in my opinion, quite readable:

library(plyr)
ddply(a, "X1", function(x) {
  n <- seq_len(nrow(x))
  m <- outer(n, n, Vectorize(function(i,j) all(x[i, 2:3] < x[j, 2:3])))
  i <- rowSums(m) > 0L
  return(x[!i, ])
})

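Here m is a matrix of TRUE/FALSE values indicating, for each pair of rows i and j, whether row i is smaller than row j in both X2 and X3.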
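
To see what the comparison matrix looks like, here is a small sketch on a single made-up group of three rows (the values in x are illustrative and are not the question's data). It uses only base R, so plyr is not needed for this part.

```r
# Tiny illustrative group (hypothetical values)
x <- data.frame(X1 = "a", X2 = c(2, 4, 4), X3 = c(5, 6, 7))

n <- seq_len(nrow(x))
# m[i, j] is TRUE when row i is strictly smaller than row j in both X2 and X3
m <- outer(n, n, Vectorize(function(i, j) all(x[i, 2:3] < x[j, 2:3])))
print(m)

# A row is dropped when its row of m contains at least one TRUE
kept <- x[rowSums(m) == 0, ]
print(kept)
```

Row 1 (2, 5) is smaller than both other rows and is dropped; rows 2 and 3 tie on X2, so neither is strictly smaller than the other and both are kept.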


Source: https://habr.com/ru/post/1523731/

