Complicated (for me) wide-to-long reshaping in Pandas

Persons (indexed from 0 to 5) choose between two locations: A and B. My data is in a wide format containing characteristics that vary by individual (ind_var) and characteristics that vary only by location (location_var).

For example, I have:

    In [281]: df_reshape_test = pd.DataFrame(
         ...:     {'location': ['A', 'A', 'A', 'B', 'B', 'B'],
         ...:      'dist_to_A': [0, 0, 0, 50, 50, 50],
         ...:      'dist_to_B': [50, 50, 50, 0, 0, 0],
         ...:      'location_var': [10, 10, 10, 14, 14, 14],
         ...:      'ind_var': [3, 8, 10, 1, 3, 4]})
         ...: df_reshape_test
    Out[281]:
       dist_to_A  dist_to_B  ind_var location  location_var
    0          0         50        3        A            10
    1          0         50        8        A            10
    2          0         50       10        A            10
    3         50          0        1        B            14
    4         50          0        3        B            14
    5         50          0        4        B            14

The location variable is the location the person has selected. dist_to_A is the distance to location A from the location selected by the individual (and likewise for dist_to_B).

I want my data to have the following form:

       choice  dist_S  ind_var location  location_var
    0       1       0        3        A            10
    0       0      50        3        B            14
    1       1       0        8        A            10
    1       0      50        8        B            14
    2       1       0       10        A            10
    2       0      50       10        B            14
    3       0      50        1        A            10
    3       1       0        1        B            14
    4       0      50        3        A            10
    4       1       0        3        B            14
    5       0      50        4        A            10
    5       1       0        4        B            14

where choice == 1 indicates that the person has chosen this location, and dist_S is the distance to this location from the selected location.

I read about .stack but couldn't figure out how to apply it in this case. Thanks for your time!

NOTE: This is a simple example. The datasets I'm working with have varying numbers of locations and of people per location, so I'm looking for a flexible solution, if possible.

3 answers

In fact, pandas has a wide_to_long function that can conveniently do what you intend to do.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {'location': ['A', 'A', 'A', 'B', 'B', 'B'],
         'dist_to_A': [0, 0, 0, 50, 50, 50],
         'dist_to_B': [50, 50, 50, 0, 0, 0],
         'location_var': [10, 10, 10, 14, 14, 14],
         'ind_var': [3, 8, 10, 1, 3, 4]})
    df['ind'] = df.index

    # `location` and `location_var` correspond to the choices:
    # record them as dictionaries and drop them
    # (just realized you had a cleaner way, copied from yours).
    ind_to_loc = dict(df['location'])
    loc_dict = dict(df.groupby('location').agg(lambda x: int(np.mean(x)))['location_var'])
    df.drop(['location_var', 'location'], axis=1, inplace=True)

    # now reshape; recent pandas versions need `suffix` here because
    # the column suffixes (A, B) are not numeric
    df_long = pd.wide_to_long(df, ['dist_to_'], i='ind', j='location', suffix=r'\w+')

    # use the dictionaries to get the variables `choice` and `location_var` back
    df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
    df_long['location_var'] = df_long.index.map(lambda x: loc_dict[x[1]])
    print(df_long.sort_index())

This gives you the table you requested:

                  ind_var  dist_to_ choice  location_var
    ind location
    0   A               3         0      A            10
        B               3        50      A            14
    1   A               8         0      A            10
        B               8        50      A            14
    2   A              10         0      A            10
        B              10        50      A            14
    3   A               1        50      B            10
        B               1         0      B            14
    4   A               3        50      B            10
        B               3         0      B            14
    5   A               4        50      B            10
        B               4         0      B            14

Of course, you can turn choice into an indicator variable that takes the values 0 and 1 if that is what you want.
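One way to build that indicator, as a rough sketch: compare the chosen label with the `location` level of the index. The small frame below is a hypothetical stand-in for `df_long`, and the column name `chosen` is an assumption introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the long frame: `choice` holds the label of the
# location each person picked, and `location` is the second index level.
df_long = pd.DataFrame(
    {"ind": [0, 0, 3, 3],
     "location": ["A", "B", "A", "B"],
     "choice": ["A", "A", "B", "B"]}
).set_index(["ind", "location"])

# 1 where the row's alternative is the one the person chose, else 0.
alternative = df_long.index.get_level_values("location").to_numpy()
df_long["chosen"] = (df_long["choice"].to_numpy() == alternative).astype(int)

print(df_long)
```

Comparing the underlying arrays sidesteps index alignment, which is convenient here because the frame is indexed by (ind, location) pairs.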


I'm a little curious why you want this format. There is probably a much better way to store your data. But here goes.

    In [137]: import numpy as np

    In [138]: import pandas as pd

    In [139]: df_reshape_test = pd.DataFrame(
         ...:     {'location': ['A', 'A', 'A', 'B', 'B', 'B'],
         ...:      'dist_to_A': [0, 0, 0, 50, 50, 50],
         ...:      'dist_to_B': [50, 50, 50, 0, 0, 0],
         ...:      'location_var': [10, 10, 10, 14, 14, 14],
         ...:      'ind_var': [3, 8, 10, 1, 3, 4]})

    In [140]: print(df_reshape_test)
       dist_to_A  dist_to_B  ind_var location  location_var
    0          0         50        3        A            10
    1          0         50        8        A            10
    2          0         50       10        A            10
    3         50          0        1        B            14
    4         50          0        3        B            14
    5         50          0        4        B            14

    In [141]: # Get the new axis separately:

    In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)

    In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)

    In [144]: print(df2)
       ind_var location  location_var
    0        3        A            10
    1        8        A            10
    2       10        A            10
    3        1        B            14
    4        3        B            14
    5        4        B            14
    0        3        A            10
    1        8        A            10
    2       10        A            10
    3        1        B            14
    4        3        B            14
    5        4        B            14

    In [145]: # Swap the location for the second half

    In [146]: # replace any 6 with len(df_reshape_test) if you have more rows.

    In [147]: df2['choice'] = [1] * 6 + [0] * 6  # may need to play with this.

    In [148]: # swap A and B on the rows that were not chosen (the second half)
         ...: df2['location'] = np.where(df2['choice'] == 1, df2['location'],
         ...:                            df2['location'].replace({'A': 'B', 'B': 'A'}))

    In [149]: df2 = df2.sort_index(kind='mergesort')  # stable sort keeps chosen rows first

    In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)

    In [151]: print(df2)
       ind_var location  location_var  choice  dist_S
    0        3        A            10       1       0
    0        3        B            10       0      50
    1        8        A            10       1       0
    1        8        B            10       0      50
    2       10        A            10       1       0
    2       10        B            10       0      50
    3        1        B            14       1       0
    3        1        A            14       0      50
    4        3        B            14       1       0
    4        3        A            14       0      50
    5        4        B            14       1       0
    5        4        A            14       0      50

This will not generalize well, but there are probably alternative (better) ways to get around the uglier parts, such as generating the choice column.
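As one possible alternative that avoids hard-coding the number of rows or the distance of 50, here is a sketch based on `DataFrame.melt`. It is not part of the answer above, and it assumes that every distance column is named `dist_to_<location>` and that `location_var` is constant within a location. The names `loc_var`, `alt`, and `long` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"location": ["A", "A", "A", "B", "B", "B"],
     "dist_to_A": [0, 0, 0, 50, 50, 50],
     "dist_to_B": [50, 50, 50, 0, 0, 0],
     "location_var": [10, 10, 10, 14, 14, 14],
     "ind_var": [3, 8, 10, 1, 3, 4]})

# Assumption: location_var depends only on the location, so one row per
# location is enough to build a lookup table.
loc_var = df.drop_duplicates("location").set_index("location")["location_var"]

# Every dist_to_<X> column becomes one row per person and alternative X.
dist_cols = [c for c in df.columns if c.startswith("dist_to_")]
long = (df.rename_axis("ind").reset_index()
          .melt(id_vars=["ind", "ind_var", "location"], value_vars=dist_cols,
                var_name="alt", value_name="dist_S"))
long["alt"] = long["alt"].str.replace("dist_to_", "", regex=False)

# choice is 1 on the row whose alternative is the location actually selected.
long["choice"] = (long["alt"] == long["location"]).astype(int)
long["location_var"] = long["alt"].map(loc_var)

long = (long.drop(columns="location")
            .rename(columns={"alt": "location"})
            .sort_values(["ind", "location"])
            .reset_index(drop=True))
print(long)
```

Because the list of alternatives comes from the column names, the same code should carry over to more than two locations under the stated assumptions.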


Well, it took longer than I expected, but here is a more general answer that works with an arbitrary number of options for each person. I'm sure there are simpler ways, so it would be great if someone could suggest something better for parts of the following code.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {'location': ['A', 'A', 'A', 'B', 'B', 'B'],
         'dist_to_A': [0, 0, 0, 50, 50, 50],
         'dist_to_B': [50, 50, 50, 0, 0, 0],
         'location_var': [10, 10, 10, 14, 14, 14],
         'ind_var': [3, 8, 10, 1, 3, 4]})

which gives

       dist_to_A  dist_to_B  ind_var location  location_var
    0          0         50        3        A            10
    1          0         50        8        A            10
    2          0         50       10        A            10
    3         50          0        1        B            14
    4         50          0        3        B            14
    5         50          0        4        B            14

Then do:

    df.index.names = ['ind']

    # Add choice var
    df['choice'] = 1

    # Create dictionaries we'll use later
    ind_to_loc = dict(df['location'])
    # gives ind_to_loc equal to {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
    ind_dict = dict(df['ind_var'])
    # gives {0: 3, 1: 8, 2: 10, 3: 1, 4: 3, 5: 4}
    loc_dict = dict(
        df.groupby('location').agg(lambda x: int(np.mean(x)))['location_var']
    )
    # gives {'A': 10, 'B': 14}

Now I create a MultiIndex and reindex to get the long form:

    df = df.set_index([df.index, df['location']])
    df.index.names = ['ind', 'location']

    # re-index to long shape
    loc_list = ['A', 'B']
    ind_list = [0, 1, 2, 3, 4, 5]
    new_shape = [(ind, loc) for ind in ind_list for loc in loc_list]
    idx = pd.Index(new_shape)
    df_long = df.reindex(idx, method=None)
    df_long.index.names = ['ind', 'loc']

The long form looks like this:

             dist_to_A  dist_to_B  ind_var location  location_var  choice
    ind loc
    0   A            0         50        3        A            10       1
        B          NaN        NaN      NaN      NaN           NaN     NaN
    1   A            0         50        8        A            10       1
        B          NaN        NaN      NaN      NaN           NaN     NaN
    2   A            0         50       10        A            10       1
        B          NaN        NaN      NaN      NaN           NaN     NaN
    3   A          NaN        NaN      NaN      NaN           NaN     NaN
        B           50          0        1        B            14       1
    4   A          NaN        NaN      NaN      NaN           NaN     NaN
        B           50          0        3        B            14       1
    5   A          NaN        NaN      NaN      NaN           NaN     NaN
        B           50          0        4        B            14       1
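The index construction above hard-codes `loc_list` and `ind_list`. A minimal sketch of deriving both from the data with `pd.MultiIndex.from_product` follows. The three-person frame is a made-up stand-in, and here `location` is moved into the index rather than also kept as a column, which differs slightly from the code above.

```python
import pandas as pd

# Hypothetical small input: three people, two of whom chose A and one B.
df = pd.DataFrame({"location": ["A", "A", "B"], "ind_var": [3, 8, 1]})
df.index.name = "ind"
df["choice"] = 1

# Every (person, location) pair, taken from the data instead of typed by hand.
locations = sorted(df["location"].unique())
idx = pd.MultiIndex.from_product([df.index, locations], names=["ind", "loc"])

# Rows for alternatives that were not chosen come back as NaN after reindexing.
df_long = df.set_index("location", append=True).reindex(idx)
df_long["choice"] = df_long["choice"].fillna(0).astype(int)
print(df_long)
```

This should scale to any number of people and locations, since both lists come from `df` itself.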

So now fill in the NaN values with the dictionaries:

    df_long['ind_var'] = df_long.index.map(lambda x: ind_dict[x[0]])
    df_long['location'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
    df_long['location_var'] = df_long.index.map(lambda x: loc_dict[x[1]])

    # Fill in choice
    df_long['choice'] = df_long['choice'].fillna(0)

Finally, all that remains is to create dist_S. I will cheat here and assume that I can create a nested dictionary like this:

    nested_loc = {'A': {'A': 0, 'B': 50},
                  'B': {'A': 50, 'B': 0}}

(That says: if you are at location A, then location A is at 0 km, and location B is at 50 km).

    def nested_f(x):
        # look up the distance from the chosen location to this row's alternative
        return nested_loc[x['location']][x['loc']]

    df_long = df_long.reset_index()
    df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)
    df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis=1)
    df_long

which gives the desired result:

        ind loc  ind_var  location_var  choice  dist_S
    0     0   A        3            10       1       0
    1     0   B        3            14       0      50
    2     1   A        8            10       1       0
    3     1   B        8            14       0      50
    4     2   A       10            10       1       0
    5     2   B       10            14       0      50
    6     3   A        1            10       0      50
    7     3   B        1            14       1       0
    8     4   A        3            10       0      50
    9     4   B        3            14       1       0
    10    5   A        4            10       0      50
    11    5   B        4            14       1       0
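The `nested_loc` dictionary was written out by hand. A sketch of how it might be derived from the wide data instead is shown below. It assumes the distance columns are named `dist_to_<location>` and that everyone who chose the same location has identical distances. The names `dist_cols` and `per_loc` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"location": ["A", "A", "A", "B", "B", "B"],
     "dist_to_A": [0, 0, 0, 50, 50, 50],
     "dist_to_B": [50, 50, 50, 0, 0, 0],
     "location_var": [10, 10, 10, 14, 14, 14],
     "ind_var": [3, 8, 10, 1, 3, 4]})

# Assumption: distances are identical for everyone at the same location,
# so the first row of each group stands in for the whole group.
dist_cols = [c for c in df.columns if c.startswith("dist_to_")]
per_loc = df.groupby("location")[dist_cols].first()

# Strip the prefix so the inner keys are plain location labels.
per_loc.columns = [c[len("dist_to_"):] for c in dist_cols]

# Outer key: chosen location; inner key: alternative; value: distance.
nested_loc = per_loc.to_dict(orient="index")
print(nested_loc)
```

If distances could differ between people at the same location, this lookup would no longer be valid and a per-person reshape would be needed instead.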

Source: https://habr.com/ru/post/949610/

