Efficient set intersection to get rows in a DataFrame

Question

I have a dataframe with 3 columns relevant to the issue: :ID, :Position, and :Probability. Each row is unique, but several rows can share the same ID. What I would like to do is get all the rows for a particular value of Position that share an ID with any row whose Probability is above a certain value at a different Position.

For example, let's say I have the following DataFrame (df):

1020692×8 DataFrames.DataFrame
│ Row     │ ID  │ Position      │ Probability │
├─────────┼─────┼───────────────┼─────────────┤
│ 1       │ 425 │ "first"       │ 0.02        │
│ 2       │ 425 │ "last"        │ 0.03        │
│ 3       │ 425 │ "penultimate" │ 0.02        │
│ 4       │ 425 │ "other"       │ 0.04        │
│ 5       │ 421 │ "first"       │ 0.44        │
│ 6       │ 421 │ "last"        │ 0.85        │
│ 7       │ 421 │ "second"      │ 0.59        │
│ 8       │ 421 │ "other"       │ 1.0         │
⋮

If I set the threshold to 0.8, I want to end up with all the rows where :Position == "first", provided that the same :ID has a row with :Position == "last" && :Probability > 0.8. In other words, I need row 5 because row 6 has :Probability > 0.8, but not row 1, since row 2 does not.

, . , :Position == "first" "last" , .

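My current approach is to collect the IDs whose "last" row has Probability > 0.8, then test each "first" row's ID against that list with in():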

firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = lasts[lasts[:Probability] .> 0.8, :ID]

final = firsts[[in(i, meetsthreshold) for i in firsts[:ID]], :]

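This gets very slow when the list of IDs is large (length(meetsthreshold) > 100k). Since what I really care about is the intersection of the IDs (e.g., intersect(Set(firsts[:ID]), Set(meetsthreshold))), is there a more efficient way to do this within the dataframe?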
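To make the set-intersection idea concrete, here is a minimal sketch in plain Julia (assuming Julia 1.x and no packages). The three vectors are copies of the first eight rows of the example table, and the names `ids`, `pos`, `probs`, `first_ids`, `hit_ids`, and `shared` are illustrative and do not appear in the original post.

```julia
# Columns copied from rows 1-8 of the example table (illustrative names).
ids   = [425, 425, 425, 425, 421, 421, 421, 421]
pos   = ["first", "last", "penultimate", "other", "first", "last", "second", "other"]
probs = [0.02, 0.03, 0.02, 0.04, 0.44, 0.85, 0.59, 1.0]

threshold = 0.8

# IDs that have at least one "first" row.
first_ids = Set(ids[pos .== "first"])

# IDs whose "last" row exceeds the threshold.
hit_ids = Set(ids[(pos .== "last") .& (probs .> threshold)])

# IDs present in both sets.
shared = intersect(first_ids, hit_ids)
```

Note that the intersection only yields the shared IDs, not the rows themselves, so a membership test per row is still needed to select rows from the dataframe.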

dataframe julia-lang

kevbonham 20 . '16 21:06

kevbonham · Accepted Answer · 2016-07-24T01:36:00+0000

Converting the list of threshold IDs to a Set solves this:

firsts = df[df[:Position] .== "first", :]
lasts = df[df[:Position] .== "last", :]
meetsthreshold = Set(lasts[lasts[:Probability] .> 0.8, :ID])

final = firsts[Vector{Bool}([in(i, meetsthreshold) for i in firsts[:ID]]), :]

Ran 1 .
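The reason a Set helps is that in() on a Vector scans the elements one by one, while in() on a Set is a hash lookup that takes roughly constant time on average. The sketch below shows the same filtering pattern in plain Julia (assuming Julia 1.x, with column vectors standing in for the dataframe). The helper `rows_matching` and its keyword arguments are hypothetical names introduced here for illustration; they are not part of DataFrames or of the original answer.

```julia
# Hypothetical helper (not from the original post): returns the row numbers
# whose position equals `target` and whose ID also has a `ref` row with a
# probability above `threshold`.
function rows_matching(ids, pos, probs; target="first", ref="last", threshold=0.8)
    # Build the Set once; each later membership test is a hash lookup.
    hit_ids = Set(ids[(pos .== ref) .& (probs .> threshold)])
    # Row numbers of the candidate rows.
    candidates = findall(pos .== target)
    # Ref(hit_ids) makes broadcasting treat the Set as a single value.
    mask = in.(ids[candidates], Ref(hit_ids))
    return candidates[mask]
end

# Columns copied from rows 1-8 of the example table.
ids   = [425, 425, 425, 425, 421, 421, 421, 421]
pos   = ["first", "last", "penultimate", "other", "first", "last", "second", "other"]
probs = [0.02, 0.03, 0.02, 0.04, 0.44, 0.85, 0.59, 1.0]

final_rows = rows_matching(ids, pos, probs)
```

On the sample data this selects row 5 only, matching the result described in the question.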
