In R, how do I retrieve rows based on another data frame when the values need not match exactly, only closely enough?

I have two data frames with the same columns:

df1:

v1   v2   v3   v4   v5 v6 v7 ......
500  40   5.2  z1   .....
500  40   7.2  z2   .....
500  40   9.0  z3   .....
500  40   3.5  z4   .....
500  40   4.2  z5   .....

df2:

v1   v2   v3   v4   v5 v6 v7 .....
500  40   5.1  m1   .....
500  40   7.9  m2   .....
500  20   8.6  m3   .....
500  40   3.7  m4   .....
500  40   4.0  m5   .....

I would like to use merge (or a similar function) so that my new df1 keeps only the rows that match df2 exactly on v1 and v2, while v3 only needs to match approximately. Is there a way to match v3 within a tolerance of +/- 0.2?

I would like the final df1 to look like this:

v1   v2   v3   v4   v5 v6 v7 .....
500  40   5.2  z1   .....
500  40   3.5  z4   .....
500  40   4.2  z5   .....

This is as far as I have got, but I'm not sure how to account for the variability in the v3 column.

hed <- c("v1", "v2", "v3") #original data didn't have header
df1_final <- merge(df1, df2[hed],by=hed)

If another language is better suited to this problem, I'm open to that as well, but this is only one part of a larger R script I'm working on.

2 answers

With the tidyverse, you can join on v1 and v2 and then filter with near() and a tolerance:

library(tidyverse)

df1 <- tibble(v1 = c(500, 500, 500, 500, 500),  # tibble() replaces the deprecated data_frame()
                  v2 = c(40, 40, 40, 40, 40),
                  v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
                  v4 = c("z1", "z2", "z3", "z4", "z5"))

df2 <- tibble(v1 = c(500, 500, 500, 500, 500),  # tibble() replaces the deprecated data_frame()
                  v2 = c(40, 40, 20, 40, 40),
                  v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
                  v4 = c("m1", "m2", "m3", "m4", "m5"))

df1 %>%
  full_join(df2, by = c("v1", "v2")) %>%    # join on v1 and v2
  filter(near(v3.x, v3.y, tol = 0.21)) %>%  # filter with a tolerance
  rename(v3 = v3.x, v4 = v4.x) %>%          # rename the columns
  select(v1:v4)                             # select em

# A tibble: 3 x 4
     v1    v2    v3 v4   
  <dbl> <dbl> <dbl> <chr>
1  500.   40.  5.20 z1   
2  500.   40.  3.50 z4   
3  500.   40.  4.20 z5 
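
The tolerance is 0.21 rather than 0.2 for a reason: near(x, y, tol) tests abs(x - y) < tol, and in binary floating point a difference such as 3.7 - 3.5 comes out marginally larger than 0.2. A minimal base R sketch of the effect follows; the padding value 1e-8 is an arbitrary choice for illustration, not something taken from the answer above.

```r
# Differences between the three v3 pairs that are expected to match
# (df1 values minus the corresponding df2 values).
diffs <- abs(c(5.2, 3.5, 4.2) - c(5.1, 3.7, 4.0))

# Same comparison that near() performs: abs(x - y) < tol.
exact_tol  <- diffs < 0.2          # cutoff at exactly 0.2
padded_tol <- diffs < 0.2 + 1e-8   # small padding (assumed value) absorbs rounding error

print(exact_tol)   # the 3.5/3.7 and 4.2/4.0 pairs fail the strict test
print(padded_tol)  # all three pairs pass once the tolerance is padded
```

Any tolerance comfortably between 0.2 and the next meaningful difference in the data works; 0.21 in the answer plays the same role as the padding here.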

Alternatively, in SQL, using the sqldf package:

library(sqldf)
df1 <- data.frame(v1 = c(500, 500, 500, 500, 500),
                  v2 = c(40, 40, 40, 40, 40),
                  v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
                  v4 = c("z1", "z2", "z3", "z4", "z5"))

df2 <- data.frame(v1 = c(500, 500, 500, 500, 500),
                  v2 = c(40, 40, 20, 40, 40),
                  v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
                  v4 = c("m1", "m2", "m3", "m4", "m5"))


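sqldf('
  select df1.*
  from df1
  join df2
    on df1.v1 = df2.v1             -- exact match on v1
    and df1.v2 = df2.v2            -- exact match on v2
    and df1.v3 <= df2.v3 + 0.2     -- v3 within +/- 0.2
    and df1.v3 >= df2.v3 - 0.2
')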
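
For comparison, the same join-then-filter idea can be sketched in base R, building on the merge() call from the question. The column name v3_ref and the padding of 1e-8 on the tolerance are assumptions made for illustration; the padding guards against floating-point differences that come out marginally above 0.2.

```r
df1 <- data.frame(v1 = rep(500, 5),
                  v2 = rep(40, 5),
                  v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
                  v4 = c("z1", "z2", "z3", "z4", "z5"))

df2 <- data.frame(v1 = rep(500, 5),
                  v2 = c(40, 40, 20, 40, 40),
                  v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
                  v4 = c("m1", "m2", "m3", "m4", "m5"))

tol <- 0.2 + 1e-8  # +/- 0.2 plus a small padding (assumed value)

# Keep only the key columns of df2 and rename its v3 (hypothetical name v3_ref)
# so it does not clash with df1's v3 after the merge.
df2_keys <- setNames(df2[c("v1", "v2", "v3")], c("v1", "v2", "v3_ref"))

# Exact match on v1 and v2: every df1 row is paired with every df2 row
# that shares the same v1 and v2.
candidates <- merge(df1, df2_keys, by = c("v1", "v2"))

# Approximate match on v3: keep pairs whose v3 values differ by less than tol,
# then drop the helper column and any duplicate df1 rows.
close_enough <- abs(candidates$v3 - candidates$v3_ref) < tol
df1_final <- unique(candidates[close_enough, names(df1)])
rownames(df1_final) <- NULL

print(df1_final)
```

Note that merge() does not promise to preserve the row order of df1, so sort the result afterwards if the order matters.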

Source: https://habr.com/ru/post/1695177/

