Basically, I have the following 2 data.table s:
dt - Contains a value field ( y ) and a grouping field ( x )
bk - contains 4 "break" ( bn ) fields describing the bucket structure for the interval [1, inf) for each group x found in dt . Each bn represents the minimum value (inclusive) for the bucket and continues to the next bucket (for example: 4 buckets for x = 1 - [1,3], [3,5], [5, 10), [10, inf)). Note that bucket structures are not necessarily unique.
> #4 groups (x), each with a bucket structure defined breaks (bn). > bk<- data.table(x=c(1:4), b1=c(1,1,1,1), b2=c(3,3,4,4), b3=c(5,5,7,8), b4=c(10,10,10,10), key="x") > bk x b1 b2 b3 b4 1: 1 1 3 5 10 2: 2 1 3 5 10 3: 3 1 4 7 10 4: 4 1 4 8 10 > dt<- data.table(x=rep(c(1:4),5), y=rep(c(1:10),2), key="x") > dt xy 1: 1 1 2: 1 5 3: 1 9 4: 1 3 5: 1 7 6: 2 2 7: 2 6 8: 2 10 9: 2 4 10: 2 8 11: 3 3 12: 3 7 13: 3 1 14: 3 5 15: 3 9 16: 4 4 17: 4 8 18: 4 2 19: 4 6 20: 4 10
My goal is to add a field b to dt that indicates which bucket (1, 2, 3 or 4) is being written based on the bucket structure corresponding to group x . See Desired Result below:
xyb 1: 1 1 1 #Buckets for x=1 2: 1 5 3 3: 1 9 3 4: 1 3 2 5: 1 7 3 6: 2 2 1 #Buckets for x=2 (same as 1) 7: 2 6 3 8: 2 10 4 9: 2 4 2 10: 2 8 3 11: 3 3 1 #Buckets for x=3 12: 3 7 3 13: 3 1 1 14: 3 5 2 15: 3 9 3 16: 4 4 2 #Buckets for x=4 17: 4 8 3 18: 4 2 1 19: 4 6 2 20: 4 10 4
My initial idea was to join two data.table and use the cut function to return the bucket number for each record, however I am having problems with the break argument. First attempt shown below:
> bkt[dt, .(x, y, b=cut(y, breaks=c(b1, b2, b3, b4, "inf"), include.lowest=TRUE, labels=c(1:4)))] Error in cut.default(y, breaks = c(b1, b2, b3, b4, "inf"), include.lowest = TRUE, : 'breaks' are not unique
If I create a variable a to store the structure of the bucket (say, for x = 1, for example), then it works as I expected:
> a<- c(1, 3, 5, 10, "inf") > bkt[dt, .(x, y, b=cut(y, breaks=a, include.lowest=TRUE, labels=c(1:4)))] xyb 1: 1 1 1 2: 1 5 2 3: 1 9 3 4: 1 3 1 5: 1 7 3 6: 2 2 1 7: 2 6 3 8: 2 10 3 9: 2 4 2 10: 2 8 3 11: 3 3 1 12: 3 7 3 13: 3 1 1 14: 3 5 2 15: 3 9 3 16: 4 4 2 17: 4 8 3 18: 4 2 1 19: 4 6 3 20: 4 10 3
This is still not a practical solution for my applications, but I hope someone can help me figure out how to properly deliver information about the structure of the array to the breaks argument to get a similar result. I tried various combinations of c , list , unlist , as.numeric to pass the correct break argument, but no luck. Any help / understanding would be appreciated. Thanks!
Full disclosure, I'm new to R, and this is my first post to be enjoyable.