How to get around a nested loop?

Here is the situation: I have one data frame with about 100,000 rows. I am interested in one column, POS, and I would like to check whether each POS value falls between the Start and End values of any row in another data frame, and count how many POS values do.

For example, in my first data frame, I have something like

ID POS  
A   20  
B   533  
C   600 

And in my other data frame, I have things like

START      END  
123        150  
489        552  
590        600  

I want to know how many items in POS fall in any of the START-END ranges. In this case there are 2 (B and C). Also, if possible, can I get the IDs of the rows whose POS lies between Start and End?

How can I do this without using a nested loop?


One option is to join the two data frames on the range condition, using sqldf:

library(sqldf)

query <- "SELECT POS, ID FROM df1 INNER JOIN df2 "
query <- paste0(query, "ON df1.POS BETWEEN df2.START AND df2.END")
sqldf(query)

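Note that a POS value that falls in more than one range is returned once per matching range. To avoid the duplicates, replace SELECT POS with SELECT DISTINCT POS.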


Using a non-equi join in data.table:

library(data.table)
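# Both bounds are inclusive here, to match the other answers.
# Note that setDT() converts df1 to a data.table by reference.
setDT(df1)[df2, on = .(POS >= START, POS <= END)][, sum(!is.na(ID))]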
#[1] 2

Alternatively, using mapply in base R:

df1[mapply(function(x)any(x >= df2$START & x <= df2$END),df1$POS),]
#  ID POS
#2  B 533
#3  C 600

Data:

df1 <- read.table(text = 
"ID POS  
A   20  
B   533  
C   600", header = TRUE)


df2 <- read.table(text = 
"START      END  
123        150  
489        552  
590        600", header = TRUE)
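
The question also asks for the count and the IDs. A minimal base R sketch of how both could be taken from the same logical vector is shown below; the variable names in_any_range, n_in_range and ids_in_range are made up for illustration, and the data is rebuilt with data.frame() so the snippet runs on its own.

```r
# Same example data as above, rebuilt so the snippet is self-contained
df1 <- data.frame(ID = c("A", "B", "C"), POS = c(20, 533, 600),
                  stringsAsFactors = FALSE)
df2 <- data.frame(START = c(123, 489, 590), END = c(150, 552, 600))

# TRUE for each POS that lies inside at least one START-END range (inclusive)
in_any_range <- vapply(
  df1$POS,
  function(p) any(p >= df2$START & p <= df2$END),
  logical(1)
)

n_in_range   <- sum(in_any_range)      # how many POS values are in a range
ids_in_range <- df1$ID[in_any_range]   # which IDs they belong to
```

vapply is used here instead of mapply only because it guarantees a logical vector as the result; the comparison itself is the same.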

First data frame: main

ID POS  
A   20  
B   533  
C   600 

Second data frame: ran

START   END  
123     150  
489     552  
590     600

You can use sapply to count, for each POS, the ranges that contain it:

sapply(main$POS, function(x) { sum(x>=ran$START & x<=ran$END) })

Output:

[1] 0 1 1

To add the counts as a column of main:

main$count <- sapply(main$POS, function(x) sum(x >= ran$START & x <= ran$END))

  ID POS count
1  A  20     0
2  B 533     1
3  C 600     1
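
With 100,000 rows, calling a function once per POS value may become slow. One possible alternative, sketched here as an assumption rather than something taken from the answers above, is to sort the ranges by START and use base R's findInterval, which looks up all POS values in one vectorized call. The names ord, starts, max_end, idx, in_range, n_hits and hit_ids are illustrative only.

```r
# Example data (same values as in the question)
df1 <- data.frame(ID = c("A", "B", "C"), POS = c(20, 533, 600),
                  stringsAsFactors = FALSE)
df2 <- data.frame(START = c(123, 489, 590), END = c(150, 552, 600))

# Sort ranges by START; max_end[i] is the largest END among the first i ranges
ord     <- order(df2$START)
starts  <- df2$START[ord]
max_end <- cummax(df2$END[ord])

# idx = number of ranges whose START is <= POS (0 if POS is before all of them)
idx <- findInterval(df1$POS, starts)

# POS is inside some range if at least one range starts at or before it
# and the largest END among those ranges is >= POS
in_range <- idx > 0 & df1$POS <= max_end[pmax(idx, 1)]

n_hits  <- sum(in_range)      # number of POS values inside any range
hit_ids <- df1$ID[in_range]   # their IDs
```

This only answers whether a POS is in any range (inclusive bounds); unlike the sapply version, it does not count how many ranges contain it.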


Source: https://habr.com/ru/post/1695502/

