Comparing values from one data frame with values from columns in another data frame and retrieving data from the third column

Question

The title is a bit confusing, but I will do my best to explain my problem here. I have two pandas DataFrames, a and b:

    >> print a
    id | value
    1  | 250
    2  | 150
    3  | 350
    4  | 550
    5  | 450

    >> print b
    low | high | class
    100 | 200  | 'A'
    200 | 300  | 'B'
    300 | 500  | 'A'
    500 | 600  | 'C'

I want to create a new column named class in table a that holds, for each value, the class of the range in table b that contains it. Here is the result I want:

    >> print a
    id | value | class
    1  | 250   | 'B'
    2  | 150   | 'A'
    3  | 350   | 'A'
    4  | 550   | 'C'
    5  | 450   | 'A'

I have the following code written that does what I want:

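    import pandas as pd

    a['class'] = pd.Series(dtype=object)  # empty placeholder column
    for i in range(len(a)):
        val = a['value'][i]
        # class of the first row of b whose [low, high] range contains val
        cl = (b['class'][ (b['low'] <= val) &
                          (b['high'] >= val) ].iat[0])
        a.at[i, 'class'] = cl  # Series.set_value was removed in pandas 1.0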

The problem is that this is only fast for tables of 10 or so rows, but I am trying to do this with 100,000+ rows in both a and b. Is there a faster way to do this using some function or attribute in pandas?

+5

python pandas

rbae Jul 28 '17 at 2:48

2 answers

Here is a way to do a range join, based on @piRSquared's solution:

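    import numpy as np
    import pandas as pd

    A = a['value'].values
    bh = b.high.values
    bl = b.low.values

    # i, j: positions of the (row of a, row of b) pairs where low <= value <= high
    i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh))

    # stack the matching rows of a and b side by side
    pd.DataFrame(
        np.column_stack([a.values[i], b.values[j]]),
        columns=a.columns.append(b.columns)
    )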

Output:

       id value  low high class
    0   1   250  200  300   'B'
    1   2   150  100  200   'A'
    2   3   350  300  500   'A'
    3   4   550  500  600   'C'
    4   5   450  300  500   'A'
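To make the broadcasting step easier to follow, here is a minimal self-contained sketch on the sample data from the question. The names mask and result are only illustrative, and the last step assumes every value falls in exactly one range (true for the sample data, not guaranteed in general).

```python
import numpy as np
import pandas as pd

# Sample frames from the question
a = pd.DataFrame({"id": [1, 2, 3, 4, 5], "value": [250, 150, 350, 550, 450]})
b = pd.DataFrame({
    "low": [100, 200, 300, 500],
    "high": [200, 300, 500, 600],
    "class": ["A", "B", "A", "C"],
})

A = a["value"].to_numpy()

# A[:, None] has shape (5, 1); comparing it with the shape-(4,) bounds
# broadcasts to a (5, 4) boolean mask: one row per value, one column per range.
mask = (A[:, None] >= b["low"].to_numpy()) & (A[:, None] <= b["high"].to_numpy())

# Positions of the True cells: i indexes rows of a, j indexes rows of b
i, j = np.where(mask)

# Assumption: each value matches exactly one range, so i has no duplicates
result = a.copy()
result["class"] = pd.Series(b["class"].to_numpy()[j], index=a.index[i])
print(result)
```

Note that the mask holds len(a) x len(b) booleans at one byte each, so with 100,000 rows in both tables it would need roughly 10 GB of memory. For inputs of that size, the comparison would probably have to be done in chunks of a.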

+5

Scott Boston Jul 28 '17 at 3:30

Bow · Accepted Answer · 2017-07-28T03:52:54+0000

Here's a solution that is admittedly less elegant than using Series.searchsorted, but it works really fast!

I extract the data from the pandas DataFrames and convert it to lists, then use np.where to populate a variable called "aclass" wherever the conditions are satisfied (in brute-force for loops). Then I write "aclass" back to the original DataFrame a.

The evaluation time was 0.07489705 s, so it is pretty fast, even with 200,000 data points!

    import time
    import numpy as np

    # create 200,000 fake a data points
    avalue = 100+600*np.random.random(200000) # assuming you extracted this from a with avalue = np.array(a['value'])
    blow = [100,200,300,500] # assuming you extracted this from b with list(b['low'])
    bhigh = [200,300,500,600] # assuming you extracted this from b with list(b['high'])
    bclass = ['A','B','A','C'] # assuming you extracted this from b with list(b['class'])

    aclass = [[]]*len(avalue) # initialize aclass; values outside every range stay []

    start_time = time.time() # this is just for timing the execution
    for i in range(len(blow)):
        # indices of all values inside range i
        for j in np.where((avalue>=blow[i]) & (avalue<=bhigh[i]))[0]:
            aclass[j]=bclass[i]

    # add the class column to the original a DataFrame
    a['class'] = aclass

    print("--- %s seconds ---" % np.round(time.time() - start_time,decimals = 8))
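For comparison, the searchsorted idea mentioned at the top of this answer could look roughly like the sketch below. It uses np.searchsorted (the NumPy counterpart of Series.searchsorted) and assumes that b is sorted by low and that its ranges do not overlap apart from shared endpoints. The names idx, safe_idx and matched are illustrative. A value that sits exactly on a shared endpoint, such as 200, is assigned to the first matching range, as in the loop from the question.

```python
import numpy as np
import pandas as pd

# Sample frames from the question
a = pd.DataFrame({"id": [1, 2, 3, 4, 5], "value": [250, 150, 350, 550, 450]})
b = pd.DataFrame({
    "low": [100, 200, 300, 500],
    "high": [200, 300, 500, 600],
    "class": ["A", "B", "A", "C"],
})

# Assumption: b is sorted by "low" and its ranges do not overlap,
# so the "high" column is sorted in ascending order as well.
low = b["low"].to_numpy()
high = b["high"].to_numpy()
values = a["value"].to_numpy()

# Binary search: position of the first range whose upper bound is >= value
idx = np.searchsorted(high, values, side="left")

# Values above the last upper bound get idx == len(b); clip so indexing is safe
safe_idx = np.minimum(idx, len(b) - 1)

# A real match also needs the value to be at or above that range's lower bound
matched = (idx < len(b)) & (low[safe_idx] <= values)

classes = b["class"].to_numpy()[safe_idx].astype(object)
classes[~matched] = None  # no range contains these values
a["class"] = classes
print(a)
```

This does one binary search per value instead of comparing every value with every range, so it should scale much better when b is large as well.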
