I am trying to combine multiple crosstabs into a single one. Note that the data provided is for testing purposes only. The real data set is much larger, so efficiency is very important to me.
The crosstabs are generated, collected in a list, and then merged on the word column with a lambda function. However, the result of this merge is not what I expect. I think the problem is that columns containing only NaN values are dropped from the crosstabs even when dropna = False is used, which then causes the merge to fail. First I will show the code, followed by the intermediate data and the error.
import pandas as pd
import numpy as np
import functools as ft

def main():
    df = pd.DataFrame(data=np.zeros((0, 3)), columns=['word','det','source'])
    df["word"] = ('banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant', 'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich')
    df["det"] = ('a', 'the', 'the', 'a', 'the', 'the', 'a', 'the', 'a', 'a', 'a', 'the')
    df["source"] = ('BE', 'BE', 'BE', 'NL', 'NL', 'NL', 'FR', 'FR', 'FR', 'FR', 'FR', 'FR')
    create_frequency_list(df)

def create_frequency_list(df):
    total = pd.crosstab(df.word, df.det, dropna = False)
    total.fillna(0)
    total.reset_index(inplace=True)
    total.columns = ['word', 'a', 'the']
    crosstabs = [total]
    first_index = [('total','total')]
    second_index = [('a','the')]
    for source, tempDf in df.groupby('source'):
        crosstab = pd.crosstab(tempDf.word, tempDf.det, dropna = False)
        crosstab.fillna(0)
        crosstab.reset_index(inplace=True)
        crosstab.columns = ['word', 'a', 'the']
        crosstabs.append(crosstab)
        first_index.extend((source,source))
        second_index.extend(('a','the'))
    for tab in crosstabs:
        print(tab)
    merged = ft.reduce(lambda left,right: pd.merge(left,right, on='word'), crosstabs).set_index('word')
    print(merged)
    arrays = [first_index, second_index]
    columns = pd.MultiIndex.from_arrays(arrays)
    df_freq = pd.DataFrame(data=merged.as_matrix(),
                           columns=columns,
                           index = crosstabs[0]['word'])
    print(df_freq)

main()
Individual crosstabs: not as expected. The NaN columns are dropped.
word a the
0 banana 2 1
1 elephant 1 2
2 mouse 2 2
3 ostrich 1 1
word a the
0 banana 1 1
1 elephant 0 1
word a the
0 banana 1 0
1 elephant 1 0
2 mouse 1 1
3 ostrich 1 1
word a the
0 elephant 0 1
1 mouse 1 1
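This means that when the dataframes are merged, only the words shared by all of them are kept.
Merged dataframe: not as expected, only elephant is left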
a_x the_x a_y the_y a_x the_x a_y the_y
word
elephant 1 2 0 1 1 0 0 1
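As a side note on why only elephant is left: pd.merge uses how="inner" unless told otherwise, so a word has to occur in every crosstab to survive the chain of merges. The sketch below illustrates this on two small, made-up tables (left and right are hypothetical names, not part of the code above) and compares the default with how="outer".

```python
import pandas as pd

# Two small, made-up frequency tables that share only the word "elephant".
left = pd.DataFrame({"word": ["banana", "elephant"], "a": [1, 0], "the": [1, 1]})
right = pd.DataFrame({"word": ["elephant", "mouse"], "a": [0, 1], "the": [1, 1]})

# Default how="inner": only words present in both tables are kept.
inner = pd.merge(left, right, on="word")

# how="outer": every word is kept; counts missing on one side become NaN
# and are replaced by 0 here. fillna returns a new frame, so the result
# has to be assigned.
outer = pd.merge(left, right, on="word", how="outer").fillna(0)

print(inner)
print(outer)
```

Note that after the outer merge the count columns are floats, because NaN was introduced before filling. Appending .astype(int) to the count columns would restore integers if that matters.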
The error is raised on this line:
columns = pd.MultiIndex.from_arrays(arrays)
What I expected to get instead is:
source     total        BE        FR        NL
det            a  the    a  the    a  the    a  the
word
0 banana       2    1    1    1    1    0    0    0
1 elephant     1    2    0    1    1    0    0    1
2 mouse        2    2    0    0    1    1    1    1
3 ostrich      1    1    0    0    1    1    0    0
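For reference, one possible way to arrive at a table with this layout is sketched below. It is only a sketch under the assumption that the test data above is representative. It skips the per-source loop and the merge, and instead passes both source and det as column keys to a single pd.crosstab call, then puts the overall counts in front under a "total" label. The names by_source and total are illustrative, and word ends up as the index rather than as a numbered column.

```python
import pandas as pd

# Same test data as in the question.
df = pd.DataFrame({
    "word": ["banana", "banana", "elephant", "mouse", "mouse", "elephant",
             "banana", "mouse", "mouse", "elephant", "ostrich", "ostrich"],
    "det": ["a", "the", "the", "a", "the", "the",
            "a", "the", "a", "a", "a", "the"],
    "source": ["BE", "BE", "BE", "NL", "NL", "NL",
               "FR", "FR", "FR", "FR", "FR", "FR"],
})

# One crosstab with a two-level column index (source, det).
# Word/source/det combinations that never occur are counted as 0.
by_source = pd.crosstab(df["word"], [df["source"], df["det"]])

# Overall counts per determiner, relabelled so the columns match (source, det).
total = pd.crosstab(df["word"], df["det"])
total.columns = pd.MultiIndex.from_product(
    [["total"], total.columns], names=["source", "det"]
)

# Place the totals in front of the per-source counts.
df_freq = pd.concat([total, by_source], axis=1)
print(df_freq)
```

Since everything is counted in one pass over the data, this may also scale better than merging one crosstab per source, although that has not been measured here.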