Comparing values ​​from one data frame with values ​​from columns in another data frame and retrieving data from the third column

The name is a bit confusing, but I will do my best to explain my problem here. I have 2 pandas dataframes, a and b:

>> print a id | value 1 | 250 2 | 150 3 | 350 4 | 550 5 | 450 >> print b low | high | class 100 | 200 | 'A' 200 | 300 | 'B' 300 | 500 | 'A' 500 | 600 | 'C' 

I want to create a new column named class in table a that contains the value class according to table b. Here is the result I want:

 >> print a id | value | class 1 | 250 | 'B' 2 | 150 | 'A' 3 | 350 | 'A' 4 | 550 | 'C' 5 | 450 | 'A' 

I have the following code written that does what I want:

 a['class'] = pd.Series() for i in range(len(a)): val = a['value'][i] cl = (b['class'][ (b['low'] <= val) \ (b['high'] >= val) ].iat[0]) a['class'].set_value(i,cl) 

The problem is that this is fast for tables 10 or so long, but I am trying to do this with a table size of 100,000+ for a and b. Is there a faster way to do this using some function / attribute in pandas?

+5
source share
2 answers

Here's a solution that is admittedly less elegant than using Series.searchsorted, but it works really fast!

I extract data from pandas DataFrames and convert it to lists, and then use np.where to populate a variable called "aclass" where the conditions are saturated (in brute force for loops). Then I write "aclass" in the original data frame a.

The evaluation time was 0.07489705 s, so it is pretty fast, even with 200,000 data points!

 # create 200,000 fake a data points avalue = 100+600*np.random.random(200000) # assuming you extracted this from a with avalue = np.array(a['value']) blow = [100,200,300,500] # assuming you extracted this from b with list(b['low']) bhigh = [200,300,500,600] # assuming you extracted this from b with list(b['high']) bclass = ['A','B','A','C'] # assuming you extracted this from b with list(b['class']) aclass = [[]]*len(avalue) # initialize aclass start_time = time.time() # this is just for timing the execution for i in range(len(blow)): for j in np.where((avalue>=blow[i]) & (avalue<=bhigh[i]))[0]: aclass[j]=bclass[i] # add the class column to the original a DataFrame a['class'] = aclass print("--- %s seconds ---" % np.round(time.time() - start_time,decimals = 8)) 
+1
source

Here is a way to create a range based on the @piRSquared solution :

 A = a['value'].values bh = b.high.values bl = b.low.values i, j = np.where((A[:, None] >= bl) & (A[:, None] <= bh)) pd.DataFrame( np.column_stack([a.values[i], b.values[j]]), columns=a.columns.append(b.columns) ) 

Output:

  id value low high class 0 1 250 200 300 'B' 1 2 150 100 200 'A' 2 3 350 300 500 'A' 3 4 550 500 600 'C' 4 5 450 300 500 'A' 
+5
source

Source: https://habr.com/ru/post/1270304/


All Articles