I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({"c0": list('ABC'),
"c1": [" ".join(list('ab')), " ".join(list('def')), " ".join(list('s'))],
"c2": list('DEF')})
c0 c1 c2
0 A a b D
1 B d e f E
2 C s F
I want to create a pivot table that looks like this:
c2
c0 c1
A a D
b D
B d E
e E
f E
C s F
So, records in c1are divided, and then processed as separate elements used in a multi-index.
I do it as follows:
newdf = pd.DataFrame()
for indi, rowi in df.iterrows():
n_elements = rowi['c1'].split()
if len(n_elements) == 1:
newdf = newdf.append(rowi)
else:
for eli in n_elements:
if not newdf.empty:
newdf = newdf.reset_index(drop=True)
newdf.index = -1 * newdf.index - 1
newdf = newdf.append(rowi)
newdf.loc[indi, 'c1'] = eli
print newdf.reset_index(drop=True)
what gives
c0 c1 c2
0 A a D
1 A b D
2 B d E
3 B e E
4 B f E
5 C s F
Then I can just call
pd.pivot_table(newdf, index=['c0', 'c1'], aggfunc=lambda x: ' '.join(set(str(v) for v in x)))
which gives me the desired result (see above).
For huge data frames, which can be quite slow, so I wonder if there is a more efficient way to do this.
source
share