Intersection of two columns of a pandas frame

Question

I have two pandas dataframes, dataframe1 and dataframe2, that look like this:

mydataframe1
Out[15]: 
    Start   End  
    100     200
    300     450
    500     700


mydataframe2
Out[16]:
  Start   End       Value     
  0       400       0  
  401     499       -1  
  500     1000      1  
  1001    1698      1

Each line corresponds to a segment (start-end). For each segment in dataframe1, I would like to assign a value depending on the values assigned to the segments in dataframe2.

For instance:

The first segment in dataframe1, 100 200, is included in the first segment of dataframe2, 0 400, so I have to assign the value 0.

The second segment in dataframe1, 300 450, spans both the first segment (0 400) and the second segment (401 499) of dataframe2. In this case, I need to split it into two segments and assign the two corresponding values, i.e. 300 400 -> value 0 and 401 450 -> value -1.

The final dataframe1 should look like this:

mydataframe1
Out[15]: 
    Start   End  Value
    100     200  0
    300     400  0
    401     450  -1
    500     700  1

I hope I was clear. Could you help me?

python pandas dataframe

gabboshow 08 . '17 14:14

Martin Valgur · Accepted Answer · 2017-03-08T16:28:18+0000

As far as I know, Pandas does not offer a direct way to do this. The intervaltree package, however, handles it well.

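IntervalTree.search() (renamed overlap() in intervaltree 3.x) returns the stored intervals that overlap the query in full, not just the overlapping part. That is why the helper intersect() is applied to each match.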
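To make the clipping step concrete, here is a minimal pure-Python sketch using the segments from the question. The names `overlaps` and `clip` are hypothetical helpers written only for this illustration and are not part of intervaltree. The overlap test assumes half-open intervals (start included, end excluded), which is the convention intervaltree documents.

```python
def overlaps(a, b):
    """True if half-open intervals [a0, a1) and [b0, b1) share any point."""
    return a[0] < b[1] and b[0] < a[1]

def clip(a, b):
    """Part of a that lies inside b (same arithmetic as intersect() below)."""
    return max(a[0], b[0]), min(a[1], b[1])

# Second segment of dataframe1 from the question.
query = (300, 450)

# Rows of dataframe2 as (Start, End, Value) tuples.
stored = [(0, 400, 0), (401, 499, -1), (500, 1000, 1), (1001, 1698, 1)]

# An overlap query alone returns the stored rows unchanged ...
matches = [s for s in stored if overlaps(query, s)]

# ... so each match still has to be clipped to the query segment.
pieces = [clip(query, s) + (s[2],) for s in matches]

print(matches)  # [(0, 400, 0), (401, 499, -1)]
print(pieces)   # [(300, 400, 0), (401, 450, -1)]
```

The linear scan over `stored` is only for illustration; the interval tree exists to avoid checking every stored row for every query.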

import pandas as pd
from intervaltree import Interval, IntervalTree

def intersect(a, b):
    """Intersection of two intervals."""
    intersection = max(a[0], b[0]), min(a[1], b[1])
    if intersection[0] > intersection[1]:
        return None
    return intersection

def interval_df_intersection(df1, df2):
    """Calculate the intersection of two sets of intervals stored in DataFrames.
    The intervals are defined by the "Start" and "End" columns.
    The data in the rest of the columns of df1 is included with the resulting
    intervals."""
    tree = IntervalTree.from_tuples(zip(
            df1.Start.values,
            df1.End.values,
            df1.drop(["Start", "End"], axis=1).values.tolist()
        ))

    intersections = []
    for row in df2.itertuples():
        i1 = Interval(row.Start, row.End)
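        # Slice lookup returns the stored intervals overlapping [Start, End);
        # it works in both intervaltree 2.x and 3.x. sorted() fixes the order,
        # since the lookup returns an unordered set.
        matches = sorted(tree[row.Start:row.End])
        intersections += [list(intersect(i1, i2)) + i2.data for i2 in matches]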

    # Make sure the column names are in the correct order
    data_cols = list(df1.columns)
    data_cols.remove("Start")
    data_cols.remove("End")
    return pd.DataFrame(intersections, columns=["Start", "End"] + data_cols)

interval_df_intersection(mydataframe2, mydataframe1)
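For a quick cross-check on small inputs, the same result can be reproduced without the extra dependency. The sketch below is not the interval-tree method above but a brute-force alternative: it pairs every row of one frame with every row of the other and keeps the pairs that overlap. `cross_join_intersection` is a hypothetical name, `how="cross"` assumes pandas 1.2 or later, and the sample frames are rebuilt from the question.

```python
import pandas as pd

# Sample frames copied from the question.
mydataframe1 = pd.DataFrame({"Start": [100, 300, 500], "End": [200, 450, 700]})
mydataframe2 = pd.DataFrame({
    "Start": [0, 401, 500, 1001],
    "End": [400, 499, 1000, 1698],
    "Value": [0, -1, 1, 1],
})

def cross_join_intersection(segments, labelled):
    """Clip every row of `segments` against every row of `labelled`."""
    # Cartesian product: len(segments) * len(labelled) candidate pairs.
    pairs = segments.merge(labelled, how="cross", suffixes=("_seg", "_lab"))
    start = pairs[["Start_seg", "Start_lab"]].max(axis=1)
    end = pairs[["End_seg", "End_lab"]].min(axis=1)
    keep = start < end  # pairs that do not overlap end up with start >= end
    out = pd.DataFrame({
        "Start": start[keep],
        "End": end[keep],
        "Value": pairs.loc[keep, "Value"],
    })
    return out.sort_values(["Start", "End"]).reset_index(drop=True)

result = cross_join_intersection(mydataframe1, mydataframe2)
print(result)
```

On the question's data this gives the four rows of the expected table. The number of candidate pairs grows with the product of the two frame lengths, so for large inputs the interval-tree version is likely the better choice.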
