How to assign values randomly between data frames

I am trying to randomly assign values from one column of one data frame to another data frame, within 12 different categories (by agerange and gender). For example, I have two data frames; let's call one d1 and the other d2:

    d1:
    index  agerange  gender    income
        0         2       1     56700
        1         2       0     25600
        2         4       0      3000
        3         4       0    106000
        4         3       0       200
        5         3       0     43000
        6         4       0  10000000

    d2:
    index  agerange  gender
        0         3       0
        1         2       0
        2         4       0
        3         4       0

I want to group both data frames by agerange and gender (that is, gender 0 with ageranges 1 to 6 and gender 1 with ageranges 1 to 6), then randomly choose one of the incomes from the matching group in d1 and assign it to each row of d2.

d1 stays unchanged, and d2 gains an income column, for example:

    d2:
    index  agerange  gender    income
        0         3       0       200
        1         2       0     25600
        2         4       0  10000000
        3         4       0      3000
3 answers

Option 1
An approach with np.random.choice and pd.DataFrame.query.
I am making the implicit assumption that values are drawn with replacement, independently for each row.

    def take_one(x):
        # Build a query string that selects the rows of d1 in the same group as this row of d2
        q = 'agerange == {agerange} and gender == {gender}'.format(**x)
        return np.random.choice(d1.query(q).income)

    d2.assign(income=d2.apply(take_one, 1))

           agerange  gender  income
    index
    0             3       0     200
    1             2       0   25600
    2             4       0  106000
    3             4       0  106000

Option 2
An attempt to be more efficient by calling np.random.choice only once per group.

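    # One list of incomes per (agerange, gender) group in d1
    g = d1.groupby(['agerange', 'gender']).income.apply(list)
    # For each group of d2, draw len(x) incomes at once; fall back to zeros if d1 has no such group
    f = lambda x: pd.Series(np.random.choice(g.get(x.name, [0] * len(x)), len(x)), x.index)
    d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))

           agerange  gender    income
    index
    0             3       0       200
    1             2       0     25600
    2             4       0  10000000
    3             4       0    106000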

Debugging and setup

    import pandas as pd
    import numpy as np

    d1 = pd.DataFrame({
        'agerange': [2, 2, 4, 4, 3, 3, 4],
        'gender': [1, 0, 0, 0, 0, 0, 0],
        'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]
    }, pd.Index([0, 1, 2, 3, 4, 5, 6], name='index'))

    d2 = pd.DataFrame(
        {'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]},
        pd.Index([0, 1, 2, 3], name='index')
    )

    g = d1.groupby(['agerange', 'gender']).income.apply(list)
    f = lambda x: pd.Series(np.random.choice(g.loc[x.name], len(x)), x.index)
    d2.assign(income=d2.groupby(['agerange', 'gender'], group_keys=False).apply(f))

           agerange  gender  income
    index
    0             3       0     200
    1             2       0   25600
    2             4       0  106000
    3             4       0    3000
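
Since every run gives a different result, it may help to make the per-group draw reproducible. Below is a minimal sketch of the Option 2 idea using a seeded np.random.default_rng generator. The function name assign_income, the seed parameter, and the fallback value 0 for groups missing from d1 are assumptions made for illustration; they are not part of the answer above.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({
    'agerange': [2, 2, 4, 4, 3, 3, 4],
    'gender': [1, 0, 0, 0, 0, 0, 0],
    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
})
d2 = pd.DataFrame({'agerange': [3, 2, 4, 4], 'gender': [0, 0, 0, 0]})


def assign_income(d1, d2, seed=None):
    """Return a copy of d2 with an income drawn from the matching d1 group."""
    rng = np.random.default_rng(seed)  # seed=None gives a different draw each run
    keys = ['agerange', 'gender']
    # One array of candidate incomes per (agerange, gender) group in d1.
    pools = {key: grp.to_numpy() for key, grp in d1.groupby(keys)['income']}
    out = d2.copy()
    out['income'] = 0  # assumed default for groups that do not exist in d1
    for key, idx in d2.groupby(keys).groups.items():
        if key in pools:
            # Draw len(idx) values with replacement, one call per group.
            out.loc[idx, 'income'] = rng.choice(pools[key], size=len(idx))
    return out


result = assign_income(d1, d2, seed=0)
print(result)
```

Calling assign_income twice with the same seed should give the same frame, which makes the behaviour easier to test; d2 itself is left untouched.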

How about building a dictionary of incomes keyed by agerange and then mapping a random choice onto each row, i.e.

    # Based on unutbu's data
    df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4],
                        'gender': [1, 0, 0, 0, 0, 0, 0],
                        'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
                        'index': [0, 1, 2, 3, 4, 5, 6]})
    df2 = pd.DataFrame({'agerange': [3, 2, 4, 4],
                        'gender': [0, 0, 0, 0],
                        'index': [0, 1, 2, 3]})

    # Map each agerange to a tuple of the incomes observed for it
    age_groups = df1.groupby('agerange')['income'].agg(lambda x: tuple(x)).to_dict()
    df2['income'] = df2['agerange'].map(lambda x: np.random.choice(age_groups[x]))

Output:

       agerange  gender  index  income
    0         3       0      0   43000
    1         2       0      1   25600
    2         4       0      2  106000
    3         4       0      3  106000

If you also need to group by gender, you can use apply instead, and if you want to fill in 0 for keys that are not found, add a conditional, i.e.

    df2 = pd.DataFrame({'agerange': [3, 2, 6, 4],
                        'gender': [0, 0, 0, 0],
                        'index': [0, 1, 2, 3]})
    df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4],
                        'gender': [1, 0, 0, 0, 0, 0, 0],
                        'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000],
                        'index': [0, 1, 2, 3, 4, 5, 6]})

    # Keys are now (agerange, gender) tuples
    age_groups = df1.groupby(['agerange', 'gender'])['income'].agg(lambda x: tuple(x)).to_dict()
    df2['income'] = df2.apply(
        lambda x: np.random.choice(age_groups[x['agerange'], x['gender']])
        if (x['agerange'], x['gender']) in age_groups else 0,
        axis=1)

Output:

       agerange  gender  index  income
    0         3       0      0   43000
    1         2       0      1   25600
    2         6       0      2       0
    3         4       0      3  106000
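
The missing-key handling can also be pulled out of the lambda into a small helper, which makes the fallback value explicit and configurable. This is only a sketch: the names pools and draw_income and the default parameter are hypothetical, chosen here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4],
                    'gender': [1, 0, 0, 0, 0, 0, 0],
                    'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]})
df2 = pd.DataFrame({'agerange': [3, 2, 6, 4], 'gender': [0, 0, 0, 0]})

# List of incomes for every (agerange, gender) pair that occurs in df1.
pools = {key: grp.tolist()
         for key, grp in df1.groupby(['agerange', 'gender'])['income']}


def draw_income(row, pools, default=0):
    """Pick a random income from the row's group, or `default` if the group is absent."""
    candidates = pools.get((row['agerange'], row['gender']))
    if candidates is None:
        return default
    return np.random.choice(candidates)


df2['income'] = df2.apply(lambda row: draw_income(row, pools), axis=1)
print(df2)
```

Note that 0 cannot be told apart from a real income of 0; passing something like default=np.nan instead would keep the unmatched rows distinguishable, at the cost of turning the column into floats.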

Another option is to filter d1 for each row of d2 and sample one income from the matching rows:

    d2['income'] = d2.apply(
        lambda x: d1.loc[(d1.agerange == x.agerange) & (d1.gender == x.gender),
                         'income'].sample(n=1).max(),
        axis=1)

Output:

       index  agerange  gender  income
    0      0         3       0     200
    1      1         2       0   25600
    2      2         4       0    3000
    3      3         4       0  106000
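
Because the output is random, it will not match the tables shown here from run to run. One possible way to check a result, whichever approach produced it, is to verify that every assigned income really occurs in the same (agerange, gender) group of d1. The helper name incomes_match_groups is made up for this sketch, and the sample frame is the output shown above.

```python
import pandas as pd

d1 = pd.DataFrame({'agerange': [2, 2, 4, 4, 3, 3, 4],
                   'gender': [1, 0, 0, 0, 0, 0, 0],
                   'income': [56700, 25600, 3000, 106000, 200, 43000, 10000000]})
# d2 after assignment, copied from the output above.
d2_filled = pd.DataFrame({'agerange': [3, 2, 4, 4],
                          'gender': [0, 0, 0, 0],
                          'income': [200, 25600, 3000, 106000]})


def incomes_match_groups(d1, d2_filled, keys=('agerange', 'gender')):
    """Return True if every income in d2_filled occurs in d1 within the same group."""
    cols = list(keys) + ['income']
    # All distinct (agerange, gender, income) combinations that exist in d1.
    valid = d1[cols].drop_duplicates()
    # indicator=True adds a '_merge' column telling where each row was found.
    checked = d2_filled.merge(valid, on=cols, how='left', indicator=True)
    return bool((checked['_merge'] == 'both').all())


print(incomes_match_groups(d1, d2_filled))
```

A check like this does not handle a fallback value such as 0 for missing groups; those rows would need to be excluded first.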

Source: https://habr.com/ru/post/1270397/
