Match rows in one Pandas data frame to another based on three columns

I have two pandas data frames, one quite large (30,000+ rows) and one much smaller (100+ rows).

dfA looks something like this:

         X   Y  ONSET_TIME  COLOUR
    0  104  78        1083       6
    1  172  78        1083      16
    2  240  78        1083      15
    3  308  78        1083       8
    4  376  78        1083       8
    5  444  78        1083      14
    6  512  78        1083      14
    ..  ..  ..         ...     ...

dfB looks something like this:

       TIME    X    Y
    0     7  512  350
    1  1722  512  214
    2  1906  376  214
    3  2095  376  146
    4  2234  308   78
    5  2406  172  146
    ..  ...  ...  ...

For every row in dfB, I want to find the row in dfA where the values of the columns X and Y are equal, and which is the first such row where dfB['TIME'] is greater than dfA['ONSET_TIME'], and return the value of dfA['COLOUR'] for that row.

dfA is a log of display updates, where X and Y are the coordinates of elements on the display; they therefore repeat for every ONSET_TIME (for each ONSET_TIME value there are 108 coordinate pairs).

There will be several rows where X and Y are equal across the two data frames, but I need the one that also matches on time.

I did this with for loops and if statements, just to see that it could be done, but obviously, given the size of the data, it takes a very long time:

    for s in range(0, len(dfA)):
        for r in range(0, len(dfB)):
            if (dfB.iloc[r, 1] == dfA.iloc[s, 0]) and (dfB.iloc[r, 2] == dfA.iloc[s, 1]) \
                    and (dfA.iloc[s, 2] <= dfB.iloc[r, 0] < dfA.iloc[s + 108, 2]):
                return dfA.iloc[s, 3]
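For reference, newer pandas versions (0.19+) provide pd.merge_asof(), which was built for exactly this kind of "latest earlier timestamp per group" lookup. A minimal sketch on illustrative data, using the column names from the question:

    import pandas as pd

    # Tiny frames mirroring the question's layout (values are illustrative).
    dfA = pd.DataFrame({'X': [308, 308, 308], 'Y': [78, 78, 78],
                        'ONSET_TIME': [1083, 2000, 3000],
                        'COLOUR': [8, 14, 14]})
    dfB = pd.DataFrame({'TIME': [2234], 'X': [308], 'Y': [78]})

    # merge_asof requires both frames to be sorted on the time keys.
    dfA = dfA.sort_values('ONSET_TIME')
    dfB = dfB.sort_values('TIME')

    # For each dfB row, take the dfA row with matching X and Y whose ONSET_TIME
    # is the largest value strictly below TIME (allow_exact_matches=False).
    matched = pd.merge_asof(dfB, dfA, left_on='TIME', right_on='ONSET_TIME',
                            by=['X', 'Y'], allow_exact_matches=False)
    print(matched)  # TIME 2234 picks ONSET_TIME 2000 -> COLOUR 14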
2 answers

There is probably an even more efficient way to do this, but here is a method without these slow loops:

    import pandas as pd

    dfB = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 2, 3], 'Time': [10, 20, 30]})
    dfA = pd.DataFrame({'X': [1, 1, 2, 2, 2, 3], 'Y': [1, 1, 2, 2, 2, 3],
                        'ONSET_TIME': [5, 7, 9, 16, 22, 28],
                        'COLOR': ['Red', 'Blue', 'Blue', 'red', 'Green', 'Orange']})

    # create one single table
    mergeDf = pd.merge(dfA, dfB, left_on=['X', 'Y'], right_on=['X', 'Y'])

    # remove rows where time is less than onset time
    filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]

    # take the max ONSET_TIME per (X, Y) -- the onset closest to (but before) Time
    groupedDf = filteredDf.groupby(['X', 'Y']).max()

    print(filteredDf)

        COLOR  ONSET_TIME  X  Y  Time
    0     Red           5  1  1    10
    1    Blue           7  1  1    10
    2    Blue           9  2  2    20
    3     red          16  2  2    20
    5  Orange          28  3  3    30

    print(groupedDf)

          COLOR  ONSET_TIME  Time
    X Y
    1 1     Red           7    10
    2 2     red          16    20
    3 3  Orange          28    30

The main idea is to merge the two tables so that both time columns sit in one table. Then I filter down to the records with the largest ONSET_TIME that is still below the Time from your dfB (i.e., closest to it). Let me know if you have any questions about this.
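One caveat not covered in this answer: groupby(...).max() takes the maximum of each column independently, so the COLOR it reports is the alphabetical maximum per group, which only coincidentally matches the row with the largest ONSET_TIME in this example. A safer row-wise pick, as a sketch continuing from the filteredDf above:

    # Keep the entire row that has the largest ONSET_TIME per (X, Y) group,
    # rather than taking column-wise maxima that can mix values across rows.
    groupedDf = filteredDf.loc[filteredDf.groupby(['X', 'Y'])['ONSET_TIME'].idxmax()]
    print(groupedDf)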


Use merge() - it works like a JOIN in SQL - and you have the first part done.

    d1 = '''  X   Y  ONSET_TIME  COLOUR
    104  78        1083       6
    172  78        1083      16
    240  78        1083      15
    308  78        1083       8
    376  78        1083       8
    444  78        1083      14
    512  78        1083      14
    308  78        3000      14
    308  78        2000      14'''

    d2 = ''' TIME    X    Y
        7  512  350
     1722  512  214
     1906  376  214
     2095  376  146
     2234  308   78
     2406  172  146'''

    import pandas as pd
    from io import StringIO  # Python 2: from StringIO import StringIO

    dfA = pd.read_csv(StringIO(d1), sep=r'\s+')
    #print(dfA)

    dfB = pd.read_csv(StringIO(d2), sep=r'\s+')
    #print(dfB)

    df1 = pd.merge(dfA, dfB, on=['X', 'Y'])
    print(df1)

result:

         X   Y  ONSET_TIME  COLOUR  TIME
    0  308  78        1083       8  2234
    1  308  78        3000      14  2234
    2  308  78        2000      14  2234

Then you can use it to filter the results.

    df2 = df1[df1['ONSET_TIME'] < df1['TIME']]
    print(df2)

result:

         X   Y  ONSET_TIME  COLOUR  TIME
    0  308  78        1083       8  2234
    2  308  78        2000      14  2234
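This answer stops at the filtering step; to finish the lookup described in the question (one COLOUR per dfB row), you could keep only the row with the largest ONSET_TIME per match. A sketch continuing from df2 above:

    # For each matched (X, Y, TIME), keep only the row with the latest onset,
    # i.e. the ONSET_TIME closest to (but still below) TIME.
    df3 = df2.loc[df2.groupby(['X', 'Y', 'TIME'])['ONSET_TIME'].idxmax()]
    print(df3)  # keeps row 2: 308 78 2000 14 2234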

Source: https://habr.com/ru/post/972177/

