Sum rows in a data.table, but count rows only once when some columns have the same value

I have a data.table dat with 4 columns, say (col1, col2, col3, col4).

Input data:

    structure(list(
      col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5),
      col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.6),
      col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 1.4),
      col4 = structure(c(1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L),
                       .Label = c("setosa", "versicolor", "virginica", "eer"),
                       class = "factor")),
      .Names = c("col1", "col2", "col3", "col4"),
      row.names = c(NA, -10L),
      class = c("data.table", "data.frame"))

        col1 col2 col3   col4
     1:  5.1  3.5  1.4 setosa
     2:  5.1  3.5  1.4 setosa
     3:  4.7  3.2  1.3 setosa
     4:  4.6  3.1  1.5 setosa
     5:  5.0  3.6  1.4 setosa
     6:  5.1  3.5  3.4    eer
     7:  5.1  3.5  3.4    eer
     8:  4.7  3.2  1.3    eer
     9:  4.6  3.1  1.5    eer
    10:  5.0  3.6  1.4    eer

I perform the following operation on col3 for each unique value of col4:

    dat[, r_new := sum(col3, na.rm = TRUE), by = .(col4)]  # syntax 1

The syntax above creates a new column r_new holding the sum of the col3 values over all rows that share the same col4. Thus each unique col4 value gets a single value in r_new, repeated on every row of that group.
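To make the behaviour of syntax 1 concrete, here is a minimal sketch on a made-up table. The name toy and its values are hypothetical and are not part of my data; it only assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table (illustration only, not the asker's data)
toy <- data.table(
  col3 = c(1, 2, 3, 10, 20),
  col4 = c("a", "a", "a", "b", "b")
)

# Grouped assignment by reference: every row receives the total of its group
toy[, r_new := sum(col3, na.rm = TRUE), by = col4]

print(toy)
```

Every row of group "a" ends up with the same total, and likewise for group "b", which is the pattern I want to keep while skipping repeated (col1, col2) pairs.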

Now I want to do the same as above, but without adding rows whose col1 and col2 values repeat those of another row in the group (pseudocode, something like below):

 dat[col1 is different OR col2 is different , r_new:= sum(col3, na.rm = T), .(col4)] 

In other words, when sum is evaluated within each col4 group, rows in which both col1 and col2 repeat the values of an earlier row should not be added again.

How can I include this condition in syntax 1?

Expected Result:

        col1 col2 col3   col4 r_new
     1:  5.1  3.5  1.4 setosa   5.6
     2:  5.1  3.5  1.4 setosa   5.6
     3:  4.7  3.2  1.3 setosa   5.6
     4:  4.6  3.1  1.5 setosa   5.6
     5:  5.0  3.6  1.4 setosa   5.6
     6:  5.1  3.5  3.4    eer   7.6
     7:  5.1  3.5  3.4    eer   7.6
     8:  4.7  3.2  1.3    eer   7.6
     9:  4.6  3.1  1.5    eer   7.6
    10:  5.0  3.6  1.4    eer   7.6

As you can see in the expected output, for setosa rows 1 and 2 have the same values for col1 and col2, and for eer rows 6 and 7 have the same values for col1 and col2, so these rows are not added twice (each pair is counted only once). Do not worry about col3 (it will take the same value whenever col1 and col2 take the same values).

EDIT: second dput:

    structure(list(
      col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5.1),
      col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.4),
      col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 3.4),
      col4 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
      count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
      r_new = c(5.6, 5.6, 5.6, 5.6, 5.6, 9.6, 9.6, 9.6, 9.6, 9.6)),
      .Names = c("col1", "col2", "col3", "col4", "count", "r_new"),
      row.names = c(NA, -10L),
      class = c("data.table", "data.frame"))

        col1 col2 col3 col4 count r_new
     1:  5.1  3.5  1.4    A     1   5.6
     2:  5.1  3.5  1.4    A     1   5.6
     3:  4.7  3.2  1.3    A     1   5.6
     4:  4.6  3.1  1.5    A     1   5.6
     5:  5.0  3.6  1.4    A     1   5.6
     6:  5.1  3.5  3.4    B     1   9.6
     7:  5.1  3.5  3.4    B     1   9.6
     8:  4.7  3.2  1.3    B     1   9.6
     9:  4.6  3.1  1.5    B     1   9.6
    10:  5.1  3.4  3.4    B     1   9.6

EDIT 2: Third dput

        col1 col2 col3 col4 count r_new
     1:  5.1  3.5  1.4    A     1   5.6
     2:  5.1  3.5  1.4    A     1   5.6
     3:  4.7  3.2  1.3    A     1   5.6
     4:  4.6  3.1  1.5    A     1   5.6
     5:  5.0  3.6  1.4    A     1   5.6
     6:  5.1  3.5  3.4    B     1   6.2
     7:  5.1  3.5  3.4    B     1   6.2
     8:  4.7  3.2  1.3    B     1   6.2
     9:  4.6  3.1  1.5    B     1   6.2
    10:  5.1  3.5  3.4    B     1   6.2

    structure(list(
      col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5.1),
      col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.5),
      col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 3.4),
      col4 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
      count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
      r_new = c(5.6, 5.6, 5.6, 5.6, 5.6, 6.2, 6.2, 6.2, 6.2, 6.2)),
      .Names = c("col1", "col2", "col3", "col4", "count", "r_new"),
      row.names = c(NA, -10L),
      class = c("data.table", "data.frame"))
2 answers

We can subset col3 inside j using duplicated() (see ?data.table::duplicated).

    dat[, r_new := sum(col3[!duplicated(.SD, by = c("col1", "col2"))], na.rm = TRUE), by = col4]
    dat
    #     col1 col2 col3 col4 count r_new
    #  1:  5.1  3.5  1.4    A     1   5.6
    #  2:  5.1  3.5  1.4    A     1   5.6
    #  3:  4.7  3.2  1.3    A     1   5.6
    #  4:  4.6  3.1  1.5    A     1   5.6
    #  5:  5.0  3.6  1.4    A     1   5.6
    #  6:  5.1  3.5  3.4    B     1   6.2
    #  7:  5.1  3.5  3.4    B     1   6.2
    #  8:  4.7  3.2  1.3    B     1   6.2
    #  9:  4.6  3.1  1.5    B     1   6.2
    # 10:  5.1  3.5  3.4    B     1   6.2
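To see what the duplicated() call contributes, here is a small sketch that isolates one group. The table grp_b is a hypothetical name; its values are group B of the third dput in the question, retyped by hand.

```r
library(data.table)

# Group B of the question's third example, retyped by hand (grp_b is a hypothetical name)
grp_b <- data.table(
  col1 = c(5.1, 5.1, 4.7, 4.6, 5.1),
  col2 = c(3.5, 3.5, 3.2, 3.1, 3.5),
  col3 = c(3.4, 3.4, 1.3, 1.5, 3.4)
)

# TRUE marks rows whose (col1, col2) pair already appeared in an earlier row
dup <- duplicated(grp_b, by = c("col1", "col2"))
print(dup)

# Sum col3 over the first occurrence of each (col1, col2) pair only
total <- sum(grp_b$col3[!dup], na.rm = TRUE)
print(total)
```

Inside the grouped call, .SD plays the role of grp_b for each value of col4, so the same logical vector is built per group and used to subset col3 before summing.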

Accept mtoto's answer, which is easier to read, but here is an alternative.

    dat[, r_new := unique(.SD, by = c("col1", "col2"))[, sum(col3, na.rm = TRUE)], by = col4]
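As a rough check that the two approaches agree, the sketch below runs both on the first example from the question. The column names r_dup and r_unq are hypothetical, and col4 is built as a character vector rather than a factor to keep the setup short.

```r
library(data.table)

# First example from the question; col4 is character here instead of factor (a simplification)
dat <- data.table(
  col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5),
  col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.6),
  col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 1.4),
  col4 = rep(c("setosa", "eer"), each = 5)
)

# Approach 1: drop repeated (col1, col2) pairs from col3 before summing
dat[, r_dup := sum(col3[!duplicated(.SD, by = c("col1", "col2"))], na.rm = TRUE), by = col4]

# Approach 2: de-duplicate the group's rows first, then sum col3
dat[, r_unq := unique(.SD, by = c("col1", "col2"))[, sum(col3, na.rm = TRUE)], by = col4]

# Both columns should hold the same per-group totals
print(all.equal(dat$r_dup, dat$r_unq))
print(dat)
```

Both versions keep the first row of each (col1, col2) pair within a group, so they should give the totals shown in the expected result of the question.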

Source: https://habr.com/ru/post/1245817/

