Pandas: counting string values in a column

I have a list of words as shown below.

wordlist = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7']

And the dataframe is as follows.

df = pd.DataFrame({'id' : [1,2,3,4], 'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3"]}) 

output:

    id  path
    1   p1,p2,p3,p4
    2   p1,p2,p1
    3   p1,p5,p5,p7
    4   p1,p2,p3,p3

I want to count the path data to get the following output. Is it possible to get such a transformation?

    id  p1  p2  p3  p4  p5  p6  p7
    1    1   1   1   1   0   0   0
    2    2   1   0   0   0   0   0
    3    1   0   0   0   2   0   1
    4    1   1   2   0   0   0   0
3 answers

I think this would be efficient:

    # create Series with dictionaries
    >>> from collections import Counter
    >>> c = df["path"].str.split(',').apply(Counter)
    >>> c
    0    {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
    1                        {u'p2': 1, u'p1': 2}
    2              {u'p1': 1, u'p7': 1, u'p5': 2}
    3              {u'p2': 1, u'p3': 2, u'p1': 1}

    # create DataFrame
    >>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0

Update

Another way to do this:

    >>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
    >>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0
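As a side note, the per-element `apply(lambda x: pd.Series(...))` step is relatively slow; the frame can also be built directly from the list of per-row dictionaries. A minimal sketch of that idea, assuming the same `df` and `wordlist` as in the question:

```python
import pandas as pd
from collections import Counter

wordlist = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7']
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# One Counter per row, then construct the frame from the list of dicts in one go
counts = df["path"].str.split(",").apply(Counter)
result = (pd.DataFrame(counts.tolist())
            .reindex(columns=wordlist)  # adds never-seen columns such as p6
            .fillna(0)                  # words absent from a given row become 0
            .astype(int))
print(result)
```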

Update 2

Some rough performance tests:

    >>> dfL = pd.concat([df]*100)
    >>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
    0.7363274283027295
    >>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
    0.5305424618886718

    # now let's make wordlist larger
    >>> wordlist = wordlist + list(lowercase) + list(uppercase)
    >>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
    1.765344003293876
    >>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
    2.33328927599905

Update 3

After some further reading, I found that Counter is very slow. You can optimize it a bit using defaultdict:

    >>> def create_dict(x):
    ...     d = defaultdict(int)
    ...     for c in x:
    ...         d[c] += 1
    ...     return d
    >>> c = df["path"].str.split(",").apply(create_dict)
    >>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
       p1  p2  p3  p4  p5  p6  p7
    0   1   1   1   1   0   0   0
    1   2   1   0   0   0   0   0
    2   1   0   0   0   2   0   1
    3   1   1   2   0   0   0   0

and some tests:

    >>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
    0.45942801555111146

    # now let's make wordlist larger
    >>> wordlist = wordlist + list(lowercase) + list(uppercase)
    >>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
    1.5798653213942089
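One caveat worth knowing about the `x[n]` lookup used above: indexing a defaultdict inserts any missing key as a side effect, so each per-row dict silently grows by one entry per absent word (use `x.get(n, 0)` if that matters). A small demonstration of this standard Python behaviour:

```python
from collections import defaultdict

# Count tokens the same way create_dict does
d = defaultdict(int)
for token in "p1,p2,p1".split(","):
    d[token] += 1

print(d["p1"])    # counted twice
print(d["p9"])    # missing key: returns 0, but also inserts "p9" into d
print("p9" in d)  # the dict grew as a side effect of the lookup
```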

You can use the vectorized string method str.count() (see the docs), and do this for each item in the word list to build the new dataframe:

    In [4]: pd.DataFrame({name: df["path"].str.count(name) for name in wordlist})
    Out[4]:
        p1  p2  p3  p4  p5  p6  p7
    id
    1    1   1   1   1   0   0   0
    2    2   1   0   0   0   0   0
    3    1   0   0   0   2   0   1
    4    1   1   2   0   0   0   0

UPDATE: some responses to the comments. Indeed, this will not work if the words can be substrings of each other (the OP should clarify whether that can happen). If it can, the following will work (and it is also faster):

    splitted = df["path"].str.split(",")
    pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})
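To illustrate the substring issue, here is a hypothetical case where one word (`p1`) is a prefix of another (`p11`): the vectorized substring count sees two matches, while splitting first gives an exact per-token count:

```python
import pandas as pd

# "p1" occurs as a substring inside "p11", so the two methods disagree
s = pd.Series(["p1,p11"])

substring_counts = s.str.count("p1")                    # also matches inside "p11"
exact_counts = s.str.split(",").apply(lambda x: x.count("p1"))  # exact tokens only

print(substring_counts[0])  # 2
print(exact_counts[0])      # 1
```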

And some tests to support my claim that it is faster :-)
Of course, I don't know what a realistic use case is, but I made the dataframe a bit bigger (just repeated it 1000 times; the differences are larger then):

    In [37]: %%timeit
       ....: splitted = df["path"].str.split(",")
       ....: pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})
       ....:
    100 loops, best of 3: 17.9 ms per loop

    In [38]: %%timeit
       ....: pd.DataFrame({name: df["path"].str.count(name) for name in wordlist})
       ....:
    10 loops, best of 3: 23.6 ms per loop

    In [39]: %%timeit
       ....: c = df["path"].str.split(',').apply(Counter)
       ....: pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
       ....:
    10 loops, best of 3: 42.3 ms per loop

    In [40]: %%timeit
       ....: dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
       ....: pd.DataFrame(dfN, columns=wordlist).fillna(0)
       ....:
    1 loops, best of 3: 715 ms per loop

I also tested with many more elements in wordlist, and the conclusion is: if you have a larger dataframe with a relatively small wordlist, my approach is faster; with a large wordlist, the Counter-based approach from @RomanPekar can be faster (but only his last variant).
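For reference, in newer pandas versions (0.25+) the whole transformation can also be written with Series.explode and pd.crosstab. A sketch, assuming the same df and wordlist as in the question:

```python
import pandas as pd

wordlist = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7']
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# One token per row (index labels repeat), then cross-tabulate row vs. token
tokens = df["path"].str.split(",").explode()
result = (pd.crosstab(tokens.index, tokens)
            .reindex(columns=wordlist, fill_value=0))  # adds the never-seen p6
print(result)
```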


Something similar to this:

    df1 = pd.DataFrame([[path.count(p) for p in wordlist] for path in df['path']],
                       columns=['p1','p2','p3','p4','p5','p6','p7'])
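A self-contained version of this, reusing wordlist for the column names (note that plain `str.count` on a Python string also counts substring matches, so this shares the substring caveat discussed in the other answer):

```python
import pandas as pd

wordlist = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7']
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7", "p1,p2,p3,p3"]})

# One row of counts per path string; plain str.count counts substring occurrences
df1 = pd.DataFrame([[path.count(p) for p in wordlist] for path in df['path']],
                   columns=wordlist)
print(df1)
```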

Source: https://habr.com/ru/post/959308/
