Pandas alternative to apply - to create a new column based on multiple columns

I have a Pandas DataFrame and I would like to add a new column based on the values of other columns. The following is a minimal example illustrating my problem.

import random
import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df

    a   b   c
---------------
0   4   5   19
1   1   2   0
2   2   5   9
3   8   2   5

x = df.sample(n=2)
x

    a   b   c
---------------
3   8   2   5
1   1   2   0

def get_new(row):
    a, b, c = row
    # Pick a random 'c' from rows with the same 'b' but a different 'a' and 'c'
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x

    a   b   c   new
--------------------
3   8   2   5   0
1   1   2   0   5

Note. The original DataFrame has ~4 million rows and ~6 columns. The number of rows in the sample varies from 50 to 500. I run this on a 64-bit machine with 8 GB of RAM.

The above works, but it is rather slow (it takes about 15 seconds for me). I also tried using x.itertuples() instead of apply, but the improvement was small.
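For reference, the itertuples variant looks roughly like this (a sketch, not my exact code):

```python
import random
import pandas as pd

df = pd.DataFrame([[4, 5, 19], [1, 2, 0], [2, 5, 9], [8, 2, 5]],
                  columns=['a', 'b', 'c'])
x = df.sample(n=2)

new = []
for row in x.itertuples(index=False):  # row is a namedtuple with fields a, b, c
    cands = df[(df['a'] != row.a) & (df['b'] == row.b) & (df['c'] != row.c)]['c'].values
    new.append(random.choice(cands))
x['new'] = new
```

This still filters the full DataFrame once per sampled row, which is why it is barely faster than apply.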

  • It seems that apply (with axis=1) is slow because it does not use vectorized operations. Is there a way to achieve the same result faster?

  • Can the lookup inside the function itself ( get_new ) be made faster?

  • Would rewriting this with numpy help?
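On the vectorization question, one direction I can imagine (a sketch, assuming the column names from the example above) is to merge the sample with the full frame on b and filter all candidates in bulk, so the Python-level work happens once per sample instead of once per row:

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 19], [1, 2, 0], [2, 5, 9], [8, 2, 5]],
                  columns=['a', 'b', 'c'])
x = df.sample(n=2)

# Pair every sampled row with every row of df that shares the same 'b'.
cand = x.reset_index().merge(df, on='b', suffixes=('', '_cand'))
# Keep only valid candidates: different 'a' and different 'c'.
cand = cand[(cand['a'] != cand['a_cand']) & (cand['c'] != cand['c_cand'])]
# Draw one random candidate 'c' per sampled row (aligned via the original index).
x['new'] = cand.groupby('index')['c_cand'].apply(lambda s: s.sample(n=1).iloc[0])
```

The merge can be large if many rows share a b value, so this trades memory for speed.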

Comment: note that the rows returned by df.sample() keep their original labels, so .iloc and .loc behave differently there; inside get_new, a single .loc call avoids chained indexing.


As suggested, replacing the chained indexing with a single .loc call gives a modest speedup:

import random, pandas as pd, numpy as np

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])

df = pd.concat([df]*1000000)  # scale up to 4M rows to mimic the real data

x = df.sample(n=2)

def get_new(row):
    a, b, c = row
    # Chained indexing: df[mask] builds an intermediate DataFrame, then ['c'] selects from it
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

def get_new2(row):
    a, b, c = row
    # Single .loc call: selects the filtered 'c' column in one step
    return random.choice(df.loc[(df['a'] != a) & (df['b'] == b) & (df['c'] != c), 'c'].values)


%timeit x.apply(lambda row: get_new(row), axis=1)   # 159ms
%timeit x.apply(lambda row: get_new2(row), axis=1)  # 119ms
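A bigger win than .loc alone (a sketch of an alternative, assuming the same columns) is to stop scanning the full frame once per row: group df by b once up front, so each lookup only filters the small matching group:

```python
import random
import pandas as pd

df = pd.DataFrame([[4, 5, 19], [1, 2, 0], [2, 5, 9], [8, 2, 5]],
                  columns=['a', 'b', 'c'])
x = df.sample(n=2)

# Build the candidate arrays once per value of 'b'.
groups = {b: g[['a', 'c']].to_numpy() for b, g in df.groupby('b')}

def get_new_grouped(row):
    a, b, c = row
    g = groups[b]                        # only rows with a matching 'b'
    mask = (g[:, 0] != a) & (g[:, 1] != c)
    return random.choice(g[mask][:, 1])  # random valid 'c'

x['new'] = x.apply(get_new_grouped, axis=1)
```

The groupby runs once regardless of sample size, so the per-row cost depends only on the size of each b group rather than on the 4 million total rows.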

Source: https://habr.com/ru/post/1694302/
