Rcpp function inside data.table join

Question

Why Rcpp my Rcpp function inside the data.table produce a different (& wrong) result compared to using outside the join?

Example

I have two data.table s, and I want to find the Euclidean distance between each pair of coordinates in both tables.

To calculate the distance, I defined two functions: one in the R base and the other in Rcpp .

 library(Rcpp) library(data.table) rEucDist <- function(x1, y1, x2, y2) return(sqrt((x2 - x1)^2 + (y2 - y1)^2)) cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1, NumericVector x2, NumericVector y2){ int n = x1.size(); NumericVector distance(n); for(int i = 0; i < n; i++){ distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2)); } return distance; }') dt1 <- data.table(id = rep(1, 6), seq1 = 1:6, x = c(1:6), y = c(1:6)) dt2 <- data.table(id = rep(1, 6), seq2 = 7:12, x = c(6:1), y = c(6:1)) 

At the first connection, then calculating the distance, both functions produce the same result

 dt_cpp <- dt1[ dt2, on = "id", allow.cartesian = T] dt_cpp[, dist := cppEucDistance(x, y, ix, iy)] dt_r <- dt1[ dt2, on = "id", allow.cartesian = T] dt_r[, dist := rEucDist(x, y, ix, iy)] all.equal(dt_cpp$dist, dt_r$dist) # [1] TRUE 

However, if I make a calculation in a compound, the results are different; The cpp version is incorrect.

 dt_cppJoin <- dt1[ dt2, { (cppEucDistance(x, y, ix, iy)) }, on = "id", by = .EACHI ] dt_rJoin <- dt1[ dt2, { (rEucDist(x, y, ix, iy)) }, on = "id", by = .EACHI ] all.equal(dt_cppJoin$V1, dt_rJoin$V1) # "Mean relative difference: 0.6173913" ## note that the R version of the join is correct all.equal(dt_r$dist, dt_rJoin$V1) # [1] TRUE 

What happens with an Rcpp implementation that forces the connection version to give a different result?

+5
source share
1 answer

TL; DR

  • Ultimately, it comes down to the fact that I donโ€™t understand how EACHI works.

  • My C ++ loop in its current form relies on x1 and x2 for the same length. This is not the case when EACHI used, so my vector subset in a C ++ function will be incorrect. Therefore, I will need a more complex C ++ function.


I believe that the problem that I see comes down to what EACHI doing. After reading Arun answer again and again, I think I understand.

In particular, EACHI "evaluates the j-expression for matching rows for each row in Y". Therefore, it takes table Y ( dt2 in my case) and evaluates it every row at a time. This can be seen if you use multiple prints inside the Rcpp function.

 ## I'm reducing the data sets so the 'prints' are readible dt1 <- data.table(id = rep(1, 2), seq1 = 1:2, x = c(1:2), y = c(1:2)) dt2 <- data.table(id = rep(1, 2), seq2 = 7:8, x = c(6:5), y = c(6:5)) cppFunction('NumericVector cppEucDistance(NumericVector x1, NumericVector y1, NumericVector x2, NumericVector y2){ int n = x1.size(); NumericVector distance(n); Rcpp::Rcout << "x1 size: " << x1.size() << std::endl; Rcpp::Rcout << "x2 size: " << x2.size() << std::endl; Rcpp::Rcout << "n size: " << n << std::endl; for(int i = 0; i < n; i++){ distance[i] = sqrt(pow((x2[i] - x1[i]), 2) + pow((y2[i] - y1[i]), 2)); } return distance; }') dt_cppJoin <- dt1[ dt2, {print(ix); (cppEucDistance(x, y, ix, iy)) }, on = "id", by = .EACHI ] # [1] 6 # x1 size: 2 # x2 size: 1 # n size: 2 # [1] 5 # x1 size: 2 # x2 size: 1 # n size: 2 

Note the output from the print instruction that x2 is only 1 length per iteration. While n is 2. Therefore, my x2[i] is obviously โ€œcrashingโ€ when i goes to the second element (remembering that C ++ arrays are indexed to 0)

Whereas EACHI has the same effect as the full join approach

 dt_cppJoin <- dt1[ dt2, {print(ix); (cppEucDistance(x, y, ix, iy)) }, on = "id" ] # [1] 6 6 5 5 # x1 size: 4 # x2 size: 4 # n size: 4 

+2
source

Source: https://habr.com/ru/post/1268683/


All Articles