R: select groups by a conditional value in data.table

Hi, I want to select groups of rows in a data.table based on a value within each group.

In particular, I would like to select all rows, grouped by date and id, for which logret is positive in the row where e == 1.

 id       date e        logret
  7 2011-07-29 1 -0.0272275211
  7 2011-07-29 2  0.0034229025
  7 2011-07-29 3  0.0042622177
  8 2011-07-29 1  0.0035662770
  8 2011-07-29 2 -0.0015268474
  8 2011-07-29 3  0.0013333333
  7 2011-07-30 1  0.0044444444
  7 2011-07-30 2 -0.0001111111
  7 2011-07-30 3  0.0013333333

Here, all rows for id 8 on 2011-07-29 and all rows for id 7 on 2011-07-30 should be selected, because logret for e == 1 is > 0, whereas all rows for id 7 on 2011-07-29 are ignored, since their first logret (where e == 1) is < 0.

Ans:

  8 2011-07-29 1  0.0035662770
  8 2011-07-29 2 -0.0015268474
  8 2011-07-29 3  0.0013333333
  7 2011-07-30 1  0.0044444444
  7 2011-07-30 2 -0.0001111111
  7 2011-07-30 3  0.0013333333

In SQL I would use some kind of subselect for this. I would:

 1) Select the id and date where e = 1 and logret > 0
 2) Select * and join on the results of the subselect

I think data.table can do this too, but I find it hard to express in data.table terms. In particular, I can reproduce step 1, but I cannot do the join in step 2.

 pos <- DT[e==1][logret > 0] 

But I can't join the pos values back to my DT.

2 answers

I solved this in a roundabout way:

 pos <- DT[e==1][logret > 0, list(id,date)]
 ans <- DT[J(pos$id,pos$date)]

It would be interesting to hear of more elegant one-line ways to do this in data.table.
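For reference, here is a self-contained sketch of the two-step approach above. It assumes the sample data from the question, keeps date as a character column for simplicity, and sets the key to (id, date) with setkey(), since the J() lookup matches against the key.

```r
library(data.table)

# Sample data from the question (date kept as character; an assumption)
DT <- data.table(
  id     = c(7, 7, 7, 8, 8, 8, 7, 7, 7),
  date   = rep(c("2011-07-29", "2011-07-30"), times = c(6, 3)),
  e      = rep(1:3, times = 3),
  logret = c(-0.0272275211,  0.0034229025, 0.0042622177,
              0.0035662770, -0.0015268474, 0.0013333333,
              0.0044444444, -0.0001111111, 0.0013333333)
)

# J() looks rows up by key, so the key must be (id, date)
setkey(DT, id, date)

# Step 1: (id, date) pairs whose e == 1 row has a positive logret
pos <- DT[e == 1][logret > 0, list(id, date)]

# Step 2: keyed join of those pairs back to the full table
ans <- DT[J(pos$id, pos$date)]
print(ans)
```

Note that setkey() sorts DT by id and date, so the rows of ans come back in key order rather than in the original row order.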


EDIT from Matthew:

If key(DT) is already (id,date), then the one-liner is:

 DT[DT[e==1 & logret>0, list(id,date)]] 

It should be faster too. If you can rely on id and date being the first two columns of DT, it can be shortened to:

 DT[DT[e==1 & logret>0]] 
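If DT has no key, two possible alternatives are an ad hoc join using the on= argument and a grouped filter with by=. The sketch below is not part of the original answer; it assumes the question's sample data with date stored as character, and the names ans_join and ans_grp are only illustrative.

```r
library(data.table)

# Sample data from the question (date kept as character; an assumption)
DT <- data.table(
  id     = c(7, 7, 7, 8, 8, 8, 7, 7, 7),
  date   = rep(c("2011-07-29", "2011-07-30"), times = c(6, 3)),
  e      = rep(1:3, times = 3),
  logret = c(-0.0272275211,  0.0034229025, 0.0042622177,
              0.0035662770, -0.0015268474, 0.0013333333,
              0.0044444444, -0.0001111111, 0.0013333333)
)

# Ad hoc join: no key required, the join columns are named via on=
ans_join <- DT[DT[e == 1 & logret > 0, list(id, date)],
               on = c("id", "date")]

# Grouped filter: return a whole (id, date) group only if its
# e == 1 row has a positive logret; other groups return NULL
ans_grp <- DT[, if (any(logret[e == 1] > 0)) .SD, by = list(id, date)]

print(ans_join)
print(ans_grp)
```

The join is likely to scale better on large tables, while the grouped form reads closer to the way the question is phrased.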

It's not pretty, and it doesn't use data.table, but it looks like it would work:

 # Recreate your data
 df = read.table(header=TRUE, text="id date e logret
 7 2011-07-29 1 -0.0272275211
 7 2011-07-29 2 0.0034229025
 7 2011-07-29 2 0.0042622177
 8 2011-07-29 1 0.0035662770
 8 2011-07-29 2 -0.0015268474
 8 2011-07-29 3 0.0013333333")

 df[which(df$id != df$id[which(df$e == 1 & df$logret < 0)]),]
 #   id       date e       logret
 # 4  8 2011-07-29 1  0.003566277
 # 5  8 2011-07-29 2 -0.001526847
 # 6  8 2011-07-29 3  0.001333333
 #
 ## Or the equivalent in "positive" terms
 #
 # df[which(df$id == df$id[which(df$e == 1 & df$logret > 0)]),]

Update based on comments and new sample data

Off the top of my head (I have no experience with the data.table package; it's on my "to learn" list), here is a possible solution:

 temp = split(df, df$date)
 lapply(temp, function(x) x[which(x$id == x$id[which(x$e == 1 & x$logret > 0)]),])
 # $`2011-07-29`
 #   id       date e       logret
 # 4  8 2011-07-29 1  0.003566277
 # 5  8 2011-07-29 2 -0.001526847
 # 6  8 2011-07-29 3  0.001333333
 #
 # $`2011-07-30`
 #   id       date e        logret
 # 7  7 2011-07-30 1  0.0044444444
 # 8  7 2011-07-30 2 -0.0001111111
 # 9  7 2011-07-30 3  0.0013333333

Update 2

merge is also worth a try:

 merge(df, df[which(df$e == 1 & df$logret > 0), c(1, 2)])
 #   id       date e        logret
 # 1  7 2011-07-30 1  0.0044444444
 # 2  7 2011-07-30 2 -0.0001111111
 # 3  7 2011-07-30 3  0.0013333333
 # 4  8 2011-07-29 1  0.0035662770
 # 5  8 2011-07-29 2 -0.0015268474
 # 6  8 2011-07-29 3  0.0013333333
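One caveat with the approaches above: merge() sorts the result by the join columns by default, and the == comparisons against which() assume exactly one qualifying id per date. A base R variant that keeps the original row order and tolerates several qualifying ids is sketched below. It is an alternative technique, not taken from the answer; it assumes the question's nine-row sample data, and the names hit, keys and res are illustrative.

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "id date e logret
7 2011-07-29 1 -0.0272275211
7 2011-07-29 2 0.0034229025
7 2011-07-29 3 0.0042622177
8 2011-07-29 1 0.0035662770
8 2011-07-29 2 -0.0015268474
8 2011-07-29 3 0.0013333333
7 2011-07-30 1 0.0044444444
7 2011-07-30 2 -0.0001111111
7 2011-07-30 3 0.0013333333")

# Rows that decide whether a group qualifies
hit <- df$e == 1 & df$logret > 0

# Build a composite "id date" key for the qualifying groups
keys <- paste(df$id[hit], df$date[hit])

# Keep every row whose composite key is among the qualifying ones
res <- df[paste(df$id, df$date) %in% keys, ]
print(res)
```

Because %in% tests membership rather than element-wise equality, no vector recycling is involved when more than one group qualifies.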

Source: https://habr.com/ru/post/921393/

