Counting the number of duplicate words between two columns in python pandas

Question


Suppose I have the following table in pandas:

    friend_description    friend_definition
    James is dumb         dumb dude
    Jacob is smart        smart guy
    Jane is pretty        she looks pretty
    Susan is rich         she is rich

Here, in the first row, the word "dumb" appears in both columns. The second row has "smart" in both columns, the third row has "pretty", and the last row has both "is" and "rich". I want to create the following columns:

    friend_description    friend_definition    word_overlap    overlap_count
    James is dumb         dumb dude            dumb            1
    Jacob is smart        smart guy            smart           1
    Jane is pretty        she looks pretty     pretty          1
    Susan is rich         she is rich          is rich         2

I could write a for loop to build these columns manually, but I was wondering whether pandas has a function that makes this kind of operation smoother.

+5

python pandas

user98235 Dec 10 '17 at 22:22


3 answers

One liner ... because, why not? I was here to support @MaxU's answer anyway. I could also keep it for myself.

    df.join(
        df.applymap(lambda x: set(x.split())).pipe(
            lambda d: d.friend_definition - (d.friend_definition - d.friend_description)
        ).pipe(lambda s: pd.DataFrame(dict(word_overlap=s, overlap_count=s.str.len())))
    )

      friend_description friend_definition  overlap_count word_overlap
    0      James is dumb         dumb dude              1       {dumb}
    1     Jacob is smart         smart guy              1      {smart}
    2     Jane is pretty  she looks pretty              1     {pretty}
    3      Susan is rich       she is rich              2   {rich, is}
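The one-liner gets the shared words by subtracting twice rather than intersecting directly. A minimal plain-Python sketch of why that works, using the last row of the question as sample data (the variable names are only for illustration):

```python
# Words from the last row of the question's table.
definition = {"she", "is", "rich"}
description = {"Susan", "is", "rich"}

# Step 1: words that appear only in the definition.
only_in_definition = definition - description

# Step 2: removing those leaves exactly the words both sides share.
via_difference = definition - only_in_definition

# The double difference matches the ordinary set intersection.
assert via_difference == definition & description
print(sorted(via_difference))  # ['is', 'rich']
```

In the one-liner the same subtraction is applied element-wise to two columns that hold sets.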

+3

piRSquared Dec 10 '17 at 23:44


Easier to understand for mere mortals (like me)?

    >>> import pandas as pd
    >>> df = pd.read_csv('user98235.csv', sep='\t')
    >>> def f(columns):
    ...     f_desc, f_def = columns[0], columns[1]
    ...     common = set(f_desc.split()).intersection(set(f_def.split()))
    ...     return common, len(common)
    ...
    >>> df[['word_overlap', 'overlap_count']] = df.apply(f, axis=1, raw=True).apply(pd.Series)
    >>> df
      friend_description friend_definition word_overlap  overlap_count
    0      James is dumb         dumb dude       {dumb}              1
    1     Jacob is smart         smart guy      {smart}              1
    2     Jane is pretty  she looks pretty     {pretty}              1
    3      Susan is rich       she is rich   {is, rich}              2

+1

Bill Bell Dec 11 '17 at 15:52


MaxU · Accepted Answer · 2017-12-10T22:40:34+0000

A simple list comprehension is the fastest when working with strings like this:

    In [112]: df['word_overlap'] = [set(x[0].split()) & set(x[1].split()) for x in df.values]

    In [113]: df['overlap_count'] = df['word_overlap'].str.len()

    In [114]: df
    Out[114]:
      friend_description friend_definition word_overlap  overlap_count
    0      James is dumb         dumb dude       {dumb}              1
    1     Jacob is smart         smart guy      {smart}              1
    2     Jane is pretty  she looks pretty     {pretty}              1
    3      Susan is rich       she is rich   {rich, is}              2
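Note that the sets produced here are unordered, whereas the question shows word_overlap as a plain string ("is rich"). If that string form is wanted, one possible variation is sketched below. It assumes the words should follow their order in friend_description, which the question does not actually specify, and the helper name ordered_overlap is made up for this example:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})


def ordered_overlap(description, definition):
    """Hypothetical helper: shared words, in description order, no repeats."""
    definition_words = set(definition.split())
    # dict.fromkeys drops repeated words while keeping first-seen order.
    return [w for w in dict.fromkeys(description.split()) if w in definition_words]


overlaps = [ordered_overlap(a, b)
            for a, b in zip(df["friend_description"], df["friend_definition"])]
df["word_overlap"] = [" ".join(words) for words in overlaps]
df["overlap_count"] = [len(words) for words in overlaps]
print(df)
```

This keeps the plain-Python loop of the list comprehension approach and only changes how each row's result is stored.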

A single apply(..., axis=1):

    In [85]: df['word_overlap'] = df.apply(lambda r: set(r['friend_description'].split()) &
        ...:                                          set(r['friend_definition'].split()),
        ...:                               axis=1)

    In [86]: df['overlap_count'] = df['word_overlap'].str.len()

    In [87]: df
    Out[87]:
      friend_description friend_definition word_overlap  overlap_count
    0      James is dumb         dumb dude       {dumb}              1
    1     Jacob is smart         smart guy      {smart}              1
    2     Jane is pretty  she looks pretty     {pretty}              1
    3      Susan is rich       she is rich   {rich, is}              2

Chained apply().apply(..., axis=1):

    In [23]: df['word_overlap'] = (df.apply(lambda x: x.str.split(expand=False))
        ...:                         .apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
        ...:                                axis=1))

    In [24]: df['overlap_count'] = df['word_overlap'].str.len()

    In [25]: df
    Out[25]:
      friend_description friend_definition word_overlap  overlap_count
    0      James is dumb         dumb dude       {dumb}              1
    1     Jacob is smart         smart guy      {smart}              1
    2     Jane is pretty  she looks pretty     {pretty}              1
    3      Susan is rich       she is rich   {is, rich}              2

Timing against a 40,000-row DataFrame:

    In [104]: df = pd.concat([df] * 10**4, ignore_index=True)

    In [105]: df.shape
    Out[105]: (40000, 2)

    In [106]: %timeit [set(x[0].split()) & set(x[1].split()) for x in df.values]
    223 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [107]: %timeit df.apply(lambda r: set(r['friend_description'].split()) & set(r['friend_definition'].split()), axis=1)
    3.65 s ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    In [108]: %timeit df.apply(lambda x: x.str.split(expand=False)).apply(lambda r: set(r['friend_description']) & set(r['friend_definition']),
         ...:                                                                 axis=1)
    4.63 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
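For anyone without IPython's %timeit, a rough equivalent can be sketched with the standard timeit module. The function names with_comprehension and with_apply are invented for this sketch, and the frame is cut to 4,000 rows so it finishes quickly. Absolute numbers will differ from those above depending on the machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data copied from the question, repeated to 4,000 rows.
base = pd.DataFrame({
    "friend_description": ["James is dumb", "Jacob is smart",
                           "Jane is pretty", "Susan is rich"],
    "friend_definition": ["dumb dude", "smart guy",
                          "she looks pretty", "she is rich"],
})
big = pd.concat([base] * 1000, ignore_index=True)


def with_comprehension(frame):
    # Plain Python loop over the underlying NumPy rows.
    return [set(a.split()) & set(b.split()) for a, b in frame.values]


def with_apply(frame):
    # Row-wise apply, as in the second variant above.
    return frame.apply(
        lambda r: set(r["friend_description"].split())
        & set(r["friend_definition"].split()),
        axis=1,
    )


# Sanity check: both approaches should produce the same sets.
assert with_comprehension(big) == with_apply(big).tolist()

for fn in (with_comprehension, with_apply):
    # Best of three single runs, in seconds.
    seconds = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.3f} s")
```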
