I have a pandas DataFrame and I would like to add a new column based on the values of other columns. The following is a minimal example illustrating my case.
import random
import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df
   a  b   c
---------------
0  4  5  19
1  1  2   0
2  2  5   9
3  8  2   5
x = df.sample(n=2)
x
   a  b  c
---------------
3  8  2  5
1  1  2  0
def get_new(row):
    a, b, c = row
    # Pick a random 'c' from rows of df with the same 'b' but a different 'a' and 'c'.
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)
y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x
   a  b  c  new
--------------------
3  8  2  5    0
1  1  2  0    5
Note: the original DataFrame has ~4 million rows and ~6 columns. The number of rows in the sample can vary from 50 to 500. I am running this on a 64-bit machine with 8 GB of RAM.
The above works, but it is rather slow (it takes about 15 seconds for me). I also tried using x.itertuples() instead of apply, but the improvement was small.
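For reference, here is a minimal sketch of one direction that might reduce the cost. It assumes most of the time goes into building three boolean masks over the full DataFrame for every sampled row. The idea is to split df by 'b' once and then filter only the matching group for each row. The function name add_new_column, the seed parameter, and the NaN fallback for rows without any candidate are my own assumptions for illustration (get_new above would raise an error in that case), and I have not timed this on the real data.

```python
import numpy as np
import pandas as pd

def add_new_column(df, x, seed=None):
    """Hypothetical helper: same selection rule as get_new, one group lookup per row."""
    rng = np.random.default_rng(seed)
    # Only rows whose 'b' occurs in the sample can ever be candidates.
    pool = df.loc[df['b'].isin(x['b'].unique()), ['a', 'b', 'c']]
    # Split the pool once by 'b' and keep plain NumPy arrays per group.
    groups = {b: (g['a'].to_numpy(), g['c'].to_numpy()) for b, g in pool.groupby('b')}
    empty = (np.array([]), np.array([]))
    new = []
    for a, b, c in zip(x['a'], x['b'], x['c']):
        cand_a, cand_c = groups.get(b, empty)
        # Remaining conditions: different 'a' and different 'c'.
        candidates = cand_c[(cand_a != a) & (cand_c != c)]
        # Assumption: use NaN when no row qualifies instead of raising.
        new.append(rng.choice(candidates) if len(candidates) else np.nan)
    out = x.copy()  # leave the sampled frame itself untouched
    out['new'] = new
    return out

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
x = df.loc[[3, 1]]  # the same two rows as in the sample shown above
out = add_new_column(df, x, seed=0)
print(out)
```

With the four-row example each sampled row has exactly one candidate, so the 'new' column comes out as 0 and 5, matching the output above. How much this helps on the full data depends on how many distinct values 'b' takes: with only a few, each group is still large and the gain may be modest.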