Efficiently replace values in one column with values from another Pandas DataFrame column

I have a Pandas DataFrame as shown below:

       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3   0     0.4   0.4
    4   0     0     0.3
    5   0     0     0
    6   0.1   0.4   0.4

I want to replace the values in col1 with the values in the second column (col2), but only where the col1 values are 0, and then (for the remaining zero values) do it again with the third column (col3). The desired result is as follows:

       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3   0.4   0.4   0.4
    4   0.3   0     0.3
    5   0     0     0
    6   0.1   0.4   0.4

I did this with the pd.replace function, but it seems too slow. I think there should be a faster way to accomplish this.

    df.col1.replace(0,df.col2,inplace=True)
    df.col1.replace(0,df.col3,inplace=True)

Is there a faster way to do this, perhaps using some other function instead of pd.replace?

+3
3 answers

Using np.where is faster. Following the same pattern you used with replace:

    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, using nested np.where is slightly faster:

    df['col1'] = np.where(df['col1'] == 0,
                          np.where(df['col2'] == 0, df['col3'], df['col2']),
                          df['col1'])
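To make the nested call easier to follow, here is a minimal self-contained sketch on the example data from the question. The intermediate name `fallback` is only for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Example data copied from the question (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Inner np.where: the replacement value for each row
# (col2, or col3 where col2 is also 0).
fallback = np.where(df["col2"] == 0, df["col3"], df["col2"])

# Outer np.where: take the replacement only where col1 is 0.
df["col1"] = np.where(df["col1"] == 0, fallback, df["col1"])

print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```

Row 5 stays at 0 because all three columns are 0 there, which matches the desired result in the question.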

Timings

Using the following setup to create a larger sample DataFrame and the functions to time:

    df = pd.concat([df]*10**4, ignore_index=True)

    def root_nested(df):
        df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
        return df

    def root_split(df):
        df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
        df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
        return df

    def pir2(df):
        df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
        return df

    def pir2_2(df):
        slc = (df.values != 0).argmax(axis=1)
        return df.values[np.arange(slc.shape[0]), slc]

    def andrew(df):
        df.col1[df.col1 == 0] = df.col2
        df.col1[df.col1 == 0] = df.col3
        return df

    def pablo(df):
        df['col1'] = df['col1'].replace(0,df['col2'])
        df['col1'] = df['col1'].replace(0,df['col3'])
        return df

I get the following timings:

    %timeit root_nested(df.copy())
    100 loops, best of 3: 2.25 ms per loop

    %timeit root_split(df.copy())
    100 loops, best of 3: 2.62 ms per loop

    %timeit pir2(df.copy())
    100 loops, best of 3: 6.25 ms per loop

    %timeit pir2_2(df.copy())
    1 loop, best of 3: 2.4 ms per loop

    %timeit andrew(df.copy())
    100 loops, best of 3: 8.55 ms per loop

I tried to time your method, but it ran for several minutes without finishing. For comparison, timing your method on just the 6-row example DataFrame (not the much larger one created above) took 12.8 ms.
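If you are not in IPython, a rough equivalent of the `%timeit` runs can be sketched with the standard `timeit` module. The names `small` and `big` are placeholders introduced here, and the repeat counts are arbitrary; the numbers you get will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Six-row example from the question, then repeated 10**4 times.
small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([small] * 10**4, ignore_index=True)


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


calls = 10  # calls per measurement (arbitrary choice)
for func in (root_nested, root_split):
    # Each call works on a fresh copy, as in the %timeit runs above,
    # so the copy cost is included in every measurement.
    runs = timeit.repeat(lambda: func(big.copy()), number=calls, repeat=3)
    print(f"{func.__name__}: {min(runs) / calls * 1e3:.2f} ms per call")
```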

+6

I'm not sure if this is faster, but you are right that you can slice a dataframe to get the desired result.

    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    print(df)

Output:

       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4

Alternatively, if you want it to be shorter (although I don’t know if it is faster), you can combine what you did with what I did.

    df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
    print(df)

Output:

       col1  col2  col3
    0   0.2   0.3   0.3
    1   0.2   0.3   0.3
    2   0.4   0.4   0.4
    3   0.3   0.0   0.3
    4   0.0   0.0   0.0
    5   0.1   0.4   0.4
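One caveat: `df.col1[mask] = ...` is chained indexing, which can emit a SettingWithCopyWarning in some pandas versions and, as far as I know, does not update the original DataFrame when Copy-on-Write is enabled. A minimal sketch of the same two-step idea using a single `.loc` assignment per step (a swapped-in variant, not the code from this answer):

```python
import pandas as pd

# Example data from the question, with the default 0..5 index.
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Step 1: where col1 is 0, take the value from col2.
# The right-hand Series is aligned on the index, so only masked rows change.
df.loc[df["col1"] == 0, "col1"] = df["col2"]

# Step 2: for rows that are still 0, take the value from col3.
df.loc[df["col1"] == 0, "col1"] = df["col3"]

print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```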
+3

Using pd.DataFrame.where and pd.DataFrame.bfill:

    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    df
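Broken into steps on the example data, the one-liner works roughly like this. The intermediate names `masked` and `filled` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# 1. Keep non-zero values and turn every 0 into NaN.
masked = df.where(df.ne(0), np.nan)

# 2. Back-fill along each row, so a NaN takes the next
#    non-NaN value found to its right (col2, then col3).
filled = masked.bfill(axis=1)

# 3. Rows where every column was 0 are still NaN; restore them to 0.
df["col1"] = filled["col1"].fillna(0)

print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```

Note that this relies on the column order: back-filling only looks to the right, so col1 must come before col2 and col3.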


Another approach using np.argmax

    def pir2(df):
        slc = (df.values != 0).argmax(axis=1)
        return df.values[np.arange(slc.shape[0]), slc]

I know that there is a better way to use numpy to slice. I just can't think of it at the moment.
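Here is a small worked sketch of the argmax idea on the example data. It also shows `np.take_along_axis` as one alternative to the fancy-indexing line; whether that is the "better way" meant above is only a guess on my part. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

values = df.to_numpy()

# argmax on a boolean array returns the position of the first True,
# i.e. the first non-zero column in each row. For an all-zero row it
# returns 0, so that row simply picks its own col1 value (0).
first_nonzero = (values != 0).argmax(axis=1)
print(first_nonzero.tolist())  # [0, 0, 1, 2, 0, 0]

# Pick one element per row with a (row, column) index pair.
picked = values[np.arange(len(values)), first_nonzero]

# Alternative: take_along_axis needs the indices as a column vector.
alt = np.take_along_axis(values, first_nonzero[:, None], axis=1).ravel()

print(picked.tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```

Like the function above, this returns a NumPy array rather than modifying the DataFrame, so you would still assign it with something like `df['col1'] = picked`.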

+1

Source: https://habr.com/ru/post/1258053/

