R: links to .table data tables in cut function in jth position

Basically, I have the following 2 data.table s:

dt - Contains a value field ( y ) and a grouping field ( x )
bk - contains 4 "break" ( bn ) fields describing the bucket structure for the interval [1, inf) for each group x found in dt . Each bn represents the minimum value (inclusive) for the bucket and continues to the next bucket (for example: 4 buckets for x = 1 - [1,3], [3,5], [5, 10), [10, inf)). Note that bucket structures are not necessarily unique.

 > #4 groups (x), each with a bucket structure defined breaks (bn). > bk<- data.table(x=c(1:4), b1=c(1,1,1,1), b2=c(3,3,4,4), b3=c(5,5,7,8), b4=c(10,10,10,10), key="x") > bk x b1 b2 b3 b4 1: 1 1 3 5 10 2: 2 1 3 5 10 3: 3 1 4 7 10 4: 4 1 4 8 10 > dt<- data.table(x=rep(c(1:4),5), y=rep(c(1:10),2), key="x") > dt xy 1: 1 1 2: 1 5 3: 1 9 4: 1 3 5: 1 7 6: 2 2 7: 2 6 8: 2 10 9: 2 4 10: 2 8 11: 3 3 12: 3 7 13: 3 1 14: 3 5 15: 3 9 16: 4 4 17: 4 8 18: 4 2 19: 4 6 20: 4 10 

My goal is to add a field b to dt that indicates which bucket (1, 2, 3 or 4) is being written based on the bucket structure corresponding to group x . See Desired Result below:

  xyb 1: 1 1 1 #Buckets for x=1 2: 1 5 3 3: 1 9 3 4: 1 3 2 5: 1 7 3 6: 2 2 1 #Buckets for x=2 (same as 1) 7: 2 6 3 8: 2 10 4 9: 2 4 2 10: 2 8 3 11: 3 3 1 #Buckets for x=3 12: 3 7 3 13: 3 1 1 14: 3 5 2 15: 3 9 3 16: 4 4 2 #Buckets for x=4 17: 4 8 3 18: 4 2 1 19: 4 6 2 20: 4 10 4 

My initial idea was to join two data.table and use the cut function to return the bucket number for each record, however I am having problems with the break argument. First attempt shown below:

 > bkt[dt, .(x, y, b=cut(y, breaks=c(b1, b2, b3, b4, "inf"), include.lowest=TRUE, labels=c(1:4)))] Error in cut.default(y, breaks = c(b1, b2, b3, b4, "inf"), include.lowest = TRUE, : 'breaks' are not unique 

If I create a variable a to store the structure of the bucket (say, for x = 1, for example), then it works as I expected:

 > a<- c(1, 3, 5, 10, "inf") > bkt[dt, .(x, y, b=cut(y, breaks=a, include.lowest=TRUE, labels=c(1:4)))] xyb 1: 1 1 1 2: 1 5 2 3: 1 9 3 4: 1 3 1 5: 1 7 3 6: 2 2 1 7: 2 6 3 8: 2 10 3 9: 2 4 2 10: 2 8 3 11: 3 3 1 12: 3 7 3 13: 3 1 1 14: 3 5 2 15: 3 9 3 16: 4 4 2 17: 4 8 3 18: 4 2 1 19: 4 6 3 20: 4 10 3 

This is still not a practical solution for my applications, but I hope someone can help me figure out how to properly deliver information about the structure of the array to the breaks argument to get a similar result. I tried various combinations of c , list , unlist , as.numeric to pass the correct break argument, but no luck. Any help / understanding would be appreciated. Thanks!

Full disclosure, I'm new to R, and this is my first post to be enjoyable.

+5
source share
3 answers

Change the connection syntax a bit:

 dt[bk, v := cut(y, breaks = c(b1, b2, b3, b4, Inf), include.lowest = TRUE, labels = 1:4) , by=.EACHI] xyv 1: 1 1 1 2: 1 5 2 3: 1 9 3 4: 1 3 1 5: 1 7 3 6: 2 2 1 7: 2 6 3 8: 2 10 3 9: 2 4 2 10: 2 8 3 11: 3 3 1 12: 3 7 2 13: 3 1 1 14: 3 5 2 15: 3 9 3 16: 4 4 1 17: 4 8 2 18: 4 2 1 19: 4 6 2 20: 4 10 3 

These results differ from the desired result in the OP, but I think the error lies in interpreting the cut arguments (which I find confusing).

This approach is rather inelegant, since each b1, ..., b4 must be written; and it won’t scale as more breakpoints are added. I agree with @NathanWerth's suggestion to change the bk table. His approach also gives the desired OP result without using cut arguments.


Side Note: The correct Inf character is not "inf" .

+4
source

You can use melt.data.table to restructure the bk dataset into a lighter form:

 bk_long <- melt.data.table( bk, id.vars = 'x', measure.vars = paste0('b', 1:4), value.name = 'y' ) setkey(bk_long, x) bk_long[, variable := NULL] bk_long[, b := seq_len(.N), by = x] bk_long # xyb # 1: 1 1 1 # 2: 1 3 2 # 3: 1 5 3 # 4: 1 10 4 # 5: 2 1 1 # 6: 2 3 2 # 7: 2 5 3 # 8: 2 10 4 # 9: 3 1 1 # 10: 3 4 2 # 11: 3 7 3 # 12: 3 10 4 # 13: 4 1 1 # 14: 4 4 2 # 15: 4 8 3 # 16: 4 10 4 

Then make a sliding connection, as Frank suggested:

 bk_long[dt, on = c('x', 'y'), roll = TRUE] # xyb # 1: 1 1 1 # 2: 1 5 3 # 3: 1 9 3 # 4: 1 3 2 # 5: 1 7 3 # 6: 2 2 1 # 7: 2 6 3 # 8: 2 10 4 # 9: 2 4 2 # 10: 2 8 3 # 11: 3 3 1 # 12: 3 7 3 # 13: 3 1 1 # 14: 3 5 2 # 15: 3 9 3 # 16: 4 4 2 # 17: 4 8 3 # 18: 4 2 1 # 19: 4 6 2 # 20: 4 10 4 
+3
source

After several attempts, I finally got findInterval to work.

The method is similar to frank, except that by used for each group, and not in one connection. Using the built-in .BY value, you can iterate over the bk strings that are passed into the second argument (vec) of findInterval .

 dt[, b := findInterval(y, c(unlist(bk[.BY, b1:b4]), Inf), rightmost.closed=FALSE), by=x] 

which returns

 dt xyb 1: 1 1 1 2: 1 5 3 3: 1 9 3 4: 1 3 2 5: 1 7 3 6: 2 2 1 7: 2 6 3 8: 2 10 4 9: 2 4 2 10: 2 8 3 11: 3 3 1 12: 3 7 3 13: 3 1 1 14: 3 5 2 15: 3 9 3 16: 4 4 2 17: 4 8 3 18: 4 2 1 19: 4 6 2 20: 4 10 4 
+1
source

Source: https://habr.com/ru/post/1270288/


All Articles