How to correlate between categorical columns

I have a set of columns (col1, col2, col3) in dataframe df1 and another set of columns (col4, col5, col6) in dataframe df2. Suppose the two dataframes have the same number of rows.

How can I create a correlation table that holds the pairwise correlations between the columns of df1 and the columns of df2?

The table should look like this:

          col1  col2  col3
    col4   ..    ..    ..
    col5   ..    ..    ..
    col6   ..    ..    ..

I am using df1.corrwith(df2), but it does not seem to generate the table I need.

I asked a similar question here: "How to correlate between two data files with different column names", but now I am dealing with categorical columns.

If categorical columns are not directly comparable, is there a standard way to make them comparable (e.g. using get_dummies)? And is there a faster way to automatically process all fields (suppose all of them are categorical) and calculate their correlations?

1 answer

You are correct that pd.get_dummies is required to get the correlation. Below I create some fake data with two categorical columns and then use corrwith:

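    import numpy as np
    import pandas as pd

    # Two frames with the same column names, each filled with random categories
    df = pd.DataFrame({'col1': np.random.choice(list('abcde'), 100),
                       'col2': np.random.choice(list('xyz'), 100)}, dtype='category')
    df1 = pd.DataFrame({'col1': np.random.choice(list('abcde'), 100),
                        'col2': np.random.choice(list('xyz'), 100)}, dtype='category')

    # One-hot encode both frames
    dfa = pd.get_dummies(df)
    dfb = pd.get_dummies(df1)

    # Correlate each dummy column with the same-named column in the other frame
    dfa.corrwith(dfb)

    col1_a   -0.057735
    col1_b    0.002513
    col1_c    0.137956
    col1_d   -0.095050
    col1_e   -0.114022
    col2_x    0.022568
    col2_y   -0.081699
    col2_z   -0.128350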
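Note that corrwith only pairs columns that share a label, so it returns one value per dummy column rather than the full grid the question asks for. A minimal sketch of one way to get the full cross table, assuming both frames share the same row index: one-hot encode each frame, correlate the concatenated result, and slice out the block of df1 dummies against df2 dummies. The sample data and the names d1, d2, full and cross are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two frames with different categorical columns
rng = np.random.default_rng(0)
n = 100
df1 = pd.DataFrame({
    "col1": rng.choice(list("abc"), n),
    "col2": rng.choice(list("xy"), n),
})
df2 = pd.DataFrame({
    "col4": rng.choice(list("pq"), n),
    "col5": rng.choice(list("uvw"), n),
})

# One-hot encode each frame; dtype=float keeps the dummies numeric
d1 = pd.get_dummies(df1, dtype=float)
d2 = pd.get_dummies(df2, dtype=float)

# Pearson correlation of every dummy column with every other one,
# then keep only the (df1 dummies) x (df2 dummies) block
full = pd.concat([d1, d2], axis=1).corr()
cross = full.loc[d1.columns, d2.columns]

print(cross.round(3))
```

Each row of cross is one category level from df1 and each column is one category level from df2, so the table is per level rather than per original column. This relies on the dummy column names being unique across the two frames.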

Source: https://habr.com/ru/post/1263251/

