The fastest way to find which of the two column lists of each row is true in the pandas data frame

Question

I am looking for the fastest way to do the following:

We have pd.DataFrame:

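    import numpy as np
    import pandas as pd

    # Columns are listed in the order displayed below; the positional
    # indexing used in some of the answers depends on this order.
    df = pd.DataFrame({
        'High':  [1.3, 1.2, 1.1],
        'High1': [1.1, 1.1, 1.1],
        'High2': [1.2, 1.2, 1.2],
        'High3': [1.3, 1.3, 1.3],
        'Low':   [1.3, 1.2, 1.1],
        'Low1':  [1.3, 1.3, 1.3],
        'Low2':  [1.2, 1.2, 1.2],
        'Low3':  [1.1, 1.1, 1.1],
    })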

It looks like this:

    In [4]: df
    Out[4]:
       High  High1  High2  High3  Low  Low1  Low2  Low3
    0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1
    1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1
    2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1

For each row, I want to know which of the values High1, High2, High3 is the first one that is greater than or equal to High. If none of them is, the result should be np.nan.

The same goes for Low1, Low2, Low3, but here I want the first one that is less than or equal to Low. If none of them is, the result should be np.nan.

In the end, I need to know which of the two, Low or High, came first.

One way to solve this is clumsy and not very efficient:

    df['LowIs'] = np.nan
    df['HighIs'] = np.nan
    for i in range(1, 4):
        df['LowIs'] = np.where(
            (np.isnan(df['LowIs'])) & (df['Low'] >= df['Low' + str(i)]),
            i, df['LowIs'])
        df['HighIs'] = np.where(
            (np.isnan(df['HighIs'])) & (df['High'] <= df['High' + str(i)]),
            i, df['HighIs'])
    df['IsFirst'] = np.where(
        df.LowIs < df.HighIs, 'Low',
        np.where(df.LowIs > df.HighIs, 'High', 'None')
    )

This gives me:

    In [8]: df
    Out[8]:
       High  High1  High2  High3  Low  Low1  Low2  Low3  LowIs  HighIs IsFirst
    0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1    1.0     3.0     Low
    1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1    2.0     2.0    None
    2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1    3.0     1.0    High

I have to do this over and over again, in many iterations where High and Low differ each time, so performance is key.

I would not mind if High1, High2, High3 and Low1, Low2, Low3 lived in a separate DataFrame, possibly transposed, or in a dict or some other structure. The data preparation step may be slow and inconvenient, as long as the repeated lookup itself is as fast as possible.

One approach I was working on, but could not finish in vectorized form, and which also seems pretty slow:

    df.loc[(df.index == 0), 'HighIs'] = np.where(
        df.loc[(df.index == 0), ['High1', 'High2', 'High3']] >= 1.3
    )[1][0] + 1

That is, check in which columns the condition is true for the first row, and then take the first column index returned by np.where().

I am looking forward to suggestions and hope to learn something new! :)

+5

performance python vectorization numpy pandas

Marco Nov 22 '16 at 16:18

3 answers

If I understand the question correctly, this is a semi-vectorized version:

    df = pd.DataFrame({
        'High': [1.3, 1.7, 1.1], 'Low': [1.3, 1.2, 1.1],
        'High1': [1.1, 1.1, 1.1], 'High2': [1.2, 1.2, 1.2], 'High3': [1.3, 1.3, 1.3],
        'Low1': [1.3, 1.3, 1.3], 'Low2': [1.2, 1.2, 1.2], 'Low3': [1.1, 1.1, 1.1]})

    # Level columns are numbered from 1; there is no 'High0' column.
    highs = ['High{:d}'.format(x) for x in range(1, 4)]

    # Iterate from High3 down to High1 so that earlier columns overwrite later ones.
    for h in highs[::-1]:
        mask = df['High'] <= df[h]
        df.loc[mask, 'FirstHigh'] = h

It produces:

       High  High1  High2  High3  Low  Low1  Low2  Low3 FirstHigh
    0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1     High3
    1   1.7    1.1    1.2    1.3  1.2   1.3   1.2   1.1       NaN
    2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1     High1

Explanation: The key point is that we iterate over the columns in reverse order. We start with High3, check whether it is greater than or equal to High, and set FirstHigh accordingly. Then we move on to High2. If it also satisfies the condition, we simply overwrite the previous result; if not, the previous result stays as it is. Because we iterate in reverse, the value left at the end is the first column that is greater than or equal to High.
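The same reverse-order overwrite can be applied to the Low columns as well. The following is only a sketch under assumptions: the helper name `first_level` and its parameters are hypothetical and not part of the answer, and it returns the position (1 to 3) rather than the column name so that the two sides can be compared directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.7, 1.1], 'Low': [1.3, 1.2, 1.1],
    'High1': [1.1] * 3, 'High2': [1.2] * 3, 'High3': [1.3] * 3,
    'Low1': [1.3] * 3, 'Low2': [1.2] * 3, 'Low3': [1.1] * 3,
})

def first_level(frame, base, compare, n_levels=3):
    """Hypothetical helper: position of the first level column for which
    compare(base column, level column) holds, NaN where none does."""
    result = pd.Series(np.nan, index=frame.index)
    # Walk from the last level to the first, so earlier levels overwrite later ones.
    for i in range(n_levels, 0, -1):
        mask = compare(frame[base], frame[f'{base}{i}'])
        result[mask] = i
    return result

df['HighIs'] = first_level(df, 'High', lambda base, level: base <= level)
df['LowIs'] = first_level(df, 'Low', lambda base, level: base >= level)
```

Rows where no level satisfies the condition keep their initial NaN, which matches the behaviour the question asks for.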

+2

Aske doerge Nov 22 '16 at 16:44

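Compare the HighN columns against the High column. The output below corresponds to the DataFrame from the previous answer (High = [1.3, 1.7, 1.1]), with the columns in the order High, High1, High2, High3, ...:

    a = df.iloc[:, 1:4].ge(df.High, axis=0)
    a
    Out[67]:
       High1  High2  High3
    0  False  False   True
    1  False  False  False
    2   True   True   True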

Now replace False with np.nan and ask for the column label of the min or max. It does not matter which one you use: NaN is skipped, every remaining value is True, and so the first True column is returned:

    a.replace(False, np.nan).idxmax(1)
    0    High3
    1      NaN
    2    High1

Same principle for Low columns with le as the comparison operator.
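As a rough sketch of the Low side, the comparison can be written with `le` and the columns selected by name, which avoids depending on column order. Masking the result with `any(axis=1)` is a swapped-in alternative to the `replace(False, np.nan)` step, chosen here as an assumption because the handling of all-NaN rows in `idxmax` differs between pandas versions. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.7, 1.1], 'Low': [1.3, 1.2, 1.1],
    'High1': [1.1] * 3, 'High2': [1.2] * 3, 'High3': [1.3] * 3,
    'Low1': [1.3] * 3, 'Low2': [1.2] * 3, 'Low3': [1.1] * 3,
})

# Boolean frames: True where a level column satisfies the condition.
high_hits = df[['High1', 'High2', 'High3']].ge(df['High'], axis=0)
low_hits = df[['Low1', 'Low2', 'Low3']].le(df['Low'], axis=0)

# idxmax returns the first column holding the row maximum (True);
# rows without any True are blanked out afterwards.
first_high = high_hits.idxmax(axis=1).where(high_hits.any(axis=1))
first_low = low_hits.idxmax(axis=1).where(low_hits.any(axis=1))
```

Here `first_high` and `first_low` hold column labels, so an extra step is still needed to turn them into positions before deciding which side came first.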

+2

Boud Nov 22 '16 at 20:06

Divakar · Accepted Answer · 2016-11-22T20:47:10+0000

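Here's a vectorized approach with NumPy broadcasting:

    # Assumes the column order High, High1..High3, Low, Low1..Low3.
    a = df.values

    # argmax on a boolean array gives the index of the first True per row.
    # Note that it also returns 0 for a row with no True at all.
    out1 = (a[:, 1:4] >= a[:, 0, None]).argmax(1) + 1
    out2 = (a[:, 5:8] <= a[:, 4, None]).argmax(1) + 1

    df['LowIs'] = out2
    df['HighIs'] = out1
    df['IsFirst'] = np.where(out1 != out2, np.where(out1 > out2, 'Low', 'High'), None)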

Example output:

    In [195]: df
    Out[195]:
       High  High1  High2  High3  Low  Low1  Low2  Low3  LowIs  HighIs IsFirst
    0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1      1       3     Low
    1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1      2       2    None
    2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1      3       1    High
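One caveat with `argmax` is that a row with no match yields 0, which becomes position 1 after the `+ 1`. A possible way to get np.nan instead, as the question requests, is to mask with `any(axis=1)`. This is a minimal sketch, not part of the accepted answer: `first_true` is a hypothetical helper, the columns are selected by name as an assumption to avoid relying on column order, and the High value 1.7 is taken from another answer to force a no-match row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.7, 1.1], 'Low': [1.3, 1.2, 1.1],
    'High1': [1.1] * 3, 'High2': [1.2] * 3, 'High3': [1.3] * 3,
    'Low1': [1.3] * 3, 'Low2': [1.2] * 3, 'Low3': [1.1] * 3,
})

def first_true(hits):
    """Hypothetical helper: 1-based position of the first True per row, NaN if none."""
    pos = hits.argmax(axis=1) + 1.0
    return np.where(hits.any(axis=1), pos, np.nan)

# Broadcast each row's High/Low (shape (n, 1)) against its level columns (shape (n, 3)).
high = df['High'].to_numpy()[:, None]
low = df['Low'].to_numpy()[:, None]
high_is = first_true(df[['High1', 'High2', 'High3']].to_numpy() >= high)
low_is = first_true(df[['Low1', 'Low2', 'Low3']].to_numpy() <= low)

# Comparisons involving NaN are False, so a row with a missing side ends up
# as 'None', mirroring the loop in the question.
is_first = np.where(low_is < high_is, 'Low',
                    np.where(high_is < low_is, 'High', 'None'))
```

Whether a row with only one side matched should count as 'None' or as that side being first is a design choice; the sketch follows the question's original loop.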
