I think this would be effective:

    # create Series with dictionaries
    >>> from collections import Counter
    >>> c = df["path"].str.split(',').apply(Counter)
    >>> c
    0    {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
    1                        {u'p2': 1, u'p1': 2}
    2              {u'p1': 1, u'p7': 1, u'p5': 2}
    3              {u'p2': 1, u'p3': 2, u'p1': 1}
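
For context, the snippets in this answer assume a frame with a comma-separated `path` column and a `wordlist` of the labels you want as columns. A minimal, self-contained setup reconstructed from the outputs shown here (the exact input rows are an assumption) could look like this, including the step that turns the Counter series into a frame of counts (the same dict comprehension used in the timings further down):

    # assumed sample data, reconstructed from the outputs in this answer
    import pandas as pd
    from collections import Counter

    df = pd.DataFrame({"path": ["p1,p2,p3,p4", "p1,p1,p2", "p1,p5,p5,p7", "p1,p2,p3,p3"]})
    wordlist = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]

    # one Counter per row
    c = df["path"].str.split(",").apply(Counter)

    # one column per word, missing words counted as 0
    counts = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
    print(counts)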
Update
Another way to do this:

    >>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
    >>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0
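
One caveat, depending on your pandas version: the cells that were NaN before `fillna(0)` can leave those columns as floats (0.0/1.0 instead of the integers shown above). If that happens, an explicit cast restores integer counts:

    # cast back to integers if fillna(0) left the columns as float
    counts = pd.DataFrame(dfN, columns=wordlist).fillna(0).astype(int)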
Update 2
Some rough performance tests:

    >>> dfL = pd.concat([df]*100)
    >>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})',
    ...        'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter',
    ...        number=100)
    0.7363274283027295
    >>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})',
    ...        'from __main__ import dfL, wordlist; import pandas as pd',
    ...        number=100)
    0.5305424618886718
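
For readability, here is the second timed variant written out as ordinary code; it is the same approach that sits inside the timeit string above (split once, then count each word in the resulting lists with `list.count`):

    # split each path once, then count every word per row
    splitted = dfL["path"].str.split(",")
    d = pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})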
Update 3
After reading this section, I found that Counter is very slow. You can optimize it a bit using defaultdict:

    >>> from collections import defaultdict
    >>> def create_dict(x):
    ...     d = defaultdict(int)
    ...     for c in x:
    ...         d[c] += 1
    ...     return d
    ...
    >>> c = df["path"].str.split(",").apply(create_dict)
    >>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0
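
A small caveat of my own (not from the original timings): indexing a defaultdict with a missing key inserts that key, so the `x[n]` lookup above silently grows every per-row dict to cover the full wordlist. If that matters, a non-mutating lookup gives the same counts:

    # same result, but without inserting missing keys into each defaultdict
    pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})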
And a timing test:

    >>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})',
    ...        'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict',
    ...        number=100)
    0.45942801555111146
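
If you want to see where the difference comes from without the DataFrame machinery, a tiny stand-alone comparison (my own sketch with made-up data, not part of the timings above) could look like this:

    from timeit import timeit
    from collections import Counter, defaultdict

    words = "p1,p2,p3,p3".split(",")  # one short row, similar to the data above

    def create_dict(x):
        d = defaultdict(int)
        for c in x:
            d[c] += 1
        return d

    # Counter has extra per-call construction/update overhead, which can matter
    # when each row's list is this short
    print(timeit(lambda: Counter(words), number=100000))
    print(timeit(lambda: create_dict(words), number=100000))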