Assign a hash to a series of categorical data in pandas

Question


So, I have many pandas data frames with 3 columns of categorical variables:

    D  F  False
    T  F  False
    D  F  False
    T  F  False

The first and second columns can each take one of three values. The third is binary. Thus, there are 3 × 3 × 2 = 18 possible rows (not every combination appears in each data frame).

I would like to assign a number from 1 to 18 to each row, so that rows with the same combination of factors get the same number and rows with different combinations get different numbers (that is, without hash collisions).

What is the most efficient way to do this in pandas?

Here, all_combination_df is a data frame containing every possible combination of the factors. I am trying to turn a data frame such as big_df into a series of numbers, one unique number per combination.

    import pandas, itertools

    def expand_grid(data_dict):
        """Create a dataframe from every combination of given values."""
        rows = itertools.product(*data_dict.values())
        return pandas.DataFrame.from_records(rows, columns=data_dict.keys())

    all_combination_df = expand_grid(
        {'variable_1': ['D', 'A', 'T'],
         'variable_2': ['C', 'A', 'B'],
         'variable_3': [True, False]})

    big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])

+5

python pandas hash dataframe

user189035 Nov 05 '16 at 12:26


2 answers

Leaving performance aside, this will find the duplicate rows and give you a dictionary of the distinct ones (similar to the question here).

    import pandas as pd, numpy as np

    # Define data (NumPy stores the mixed array as strings, so False becomes 'False')
    d = np.array([["D", "T", "D", "T", "U"],
                  ["F", "F", "F", "J", "K"],
                  [False, False, False, False, True]])
    df = pd.DataFrame(d.T)

    # Find and remove duplicate rows
    df_nodupe = df[~df.duplicated()]

    # Make a dictionary of lists, keyed by the original row label
    df_nodupe.T.to_dict('list')

    {0: ['D', 'F', 'False'],
     1: ['T', 'F', 'False'],
     3: ['T', 'J', 'False'],
     4: ['U', 'K', 'True']}

Otherwise, you can use map, for example:

    import pandas as pd, numpy as np

    # Define data (NumPy stores the mixed array as strings, so False becomes 'False')
    d = np.array([["D", "T", "D", "T", "U"],
                  ["F", "F", "F", "J", "K"],
                  [False, False, False, False, True]])
    df = pd.DataFrame(d.T)
    df.columns = ['x', 'y', 'z']

    # Define your dictionary of interest
    dd = {('D', 'F', 'False'): 0,
          ('T', 'F', 'False'): 1,
          ('T', 'J', 'False'): 2,
          ('U', 'K', 'True'): 3}

    # Create a tuple from each row of interest
    # (list() is needed on Python 3, where zip returns an iterator)
    df['tupe'] = list(zip(df.x, df.y, df.z))

    # Create a new column based on the row values
    df['new_category'] = df.tupe.map(dd)
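Writing the dictionary by hand gets tedious with 18 combinations. A minimal sketch, assuming the level lists from the question's expand_grid call, builds the lookup with itertools.product and numbers the combinations 1 to 18. The sample rows and the names levels, combo_to_code and code are made up for illustration and do not come from the answer.

```python
import itertools
import pandas as pd

# Assumed level lists, copied from the question's expand_grid call.
levels = {
    "variable_1": ["D", "A", "T"],
    "variable_2": ["C", "A", "B"],
    "variable_3": [True, False],
}

# Enumerate all 18 combinations and number them 1..18 in a fixed order.
combo_to_code = {
    combo: code
    for code, combo in enumerate(itertools.product(*levels.values()), start=1)
}

# A small frame to label (hypothetical sample rows).
df = pd.DataFrame(
    {
        "variable_1": ["D", "T", "D", "A"],
        "variable_2": ["C", "B", "C", "A"],
        "variable_3": [True, False, True, False],
    }
)

# Build one tuple per row, then look each tuple up in the dictionary.
# Passing the dictionary's get method keeps this a plain per-row lookup.
row_tuples = pd.Series(list(zip(*(df[col] for col in levels))), index=df.index)
df["code"] = row_tuples.map(combo_to_code.get)
```

Because the numbering comes from the full grid rather than from the rows that happen to be present, the same combination gets the same number in every data frame.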

+2

p-robot Nov 05 '16 at 12:45


Maxu · Accepted Answer · 2016-11-05T13:09:54+0000

UPDATE: as @user189035 mentioned in the comments, it is much better to use the categorical dtype, as it will save a lot of memory.
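The memory claim can be checked roughly. The sketch below assumes 100,000 made-up integer codes in the range 0 to 17 (not data from the question) and compares the same values stored as int64 and as a categorical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random codes, one of 18 possible values each.
rng = np.random.default_rng(0)
codes = rng.integers(0, 18, size=100_000)

as_int64 = pd.Series(codes)                     # 8 bytes per row
as_category = pd.Series(pd.Categorical(codes))  # small integer codes + 18 categories

int64_bytes = as_int64.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(int64_bytes, category_bytes)
```

With so few distinct values, the categorical only needs a small integer code per row plus one copy of each category, so it should come out several times smaller here.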

I would try using the factorize method:

    In [112]: df['category'] = \
         ...:     pd.Categorical(
         ...:         pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
         ...:

    In [113]: df
    Out[113]:
       a  b      c category
    0  A  X   True        0
    1  B  Y  False        1
    2  A  X   True        0
    3  C  Z  False        2
    4  A  Z   True        3
    5  C  Z   True        4
    6  B  Y  False        1
    7  C  Z  False        2

    In [114]: df.dtypes
    Out[114]:
    a             object
    b             object
    c               bool
    category    category
    dtype: object
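The session above does not show how df was built. A self-contained sketch, with the frame reconstructed from the printed Out[113] (so the construction itself is an assumption), reproduces the same steps as a plain script:

```python
import pandas as pd

# Frame reconstructed from the printed output; the answer does not show its construction.
df = pd.DataFrame(
    {
        "a": list("ABACACBC"),
        "b": list("XYXZZZYZ"),
        "c": [True, False, True, False, True, True, False, False],
    }
)

# Glue the three columns into one string key per row, e.g. "A~X~1".
key = df.a + "~" + df.b + "~" + (df.c * 1).astype(str)

# factorize returns integer codes (numbered in order of first appearance)
# and the distinct keys they refer to.
codes, uniques = pd.factorize(key)
df["category"] = pd.Categorical(codes)
```

Here uniques holds the distinct glued keys, so uniques[code] recovers the combination behind any code.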

Explanation: this is a simple way to glue all the columns into a single series:

    In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
    Out[115]:
    0    A~X~1
    1    B~Y~0
    2    A~X~1
    3    C~Z~0
    4    A~Z~1
    5    C~Z~1
    6    B~Y~0
    7    C~Z~0
    dtype: object
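One caveat: factorize numbers the keys in order of first appearance, so two data frames can give different numbers to the same combination. A possible variation (not part of the answer) pins the numbering by gluing the keys of the full 18-row grid and using them as fixed categories. The sketch assumes the column names and level lists from the question; glue, label, frame_a and frame_b are hypothetical names.

```python
import itertools
import pandas as pd

# Assumed level lists, copied from the question's expand_grid call.
levels = {
    "variable_1": ["D", "A", "T"],
    "variable_2": ["C", "A", "B"],
    "variable_3": [True, False],
}

def glue(frame):
    """Join the three columns into one '~'-separated string key per row."""
    return (
        frame["variable_1"] + "~" + frame["variable_2"] + "~"
        + frame["variable_3"].astype(int).astype(str)
    )

# Keys for all 18 combinations, in a fixed order shared by every frame.
grid = pd.DataFrame(list(itertools.product(*levels.values())), columns=list(levels))
all_keys = glue(grid).tolist()

def label(frame):
    """Return codes 1..18; a key missing from the grid would come out as 0."""
    cat = pd.Categorical(glue(frame), categories=all_keys)
    return pd.Series(cat.codes.astype("int64") + 1, index=frame.index)

# Two hypothetical frames containing different subsets of the combinations.
frame_a = pd.DataFrame(
    {"variable_1": ["T", "D"], "variable_2": ["B", "C"], "variable_3": [False, True]}
)
frame_b = pd.DataFrame(
    {"variable_1": ["D"], "variable_2": ["C"], "variable_3": [True]}
)

labels_a = label(frame_a)
labels_b = label(frame_b)
```

The row D, C, True gets the same number in both frames, regardless of where it appears or which other rows are present.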
