Best way to compare two large string sets in Python

Question

I use Python (and have access to pandas, numpy, scipy).

I have two sets of strings, A and B. Each set contains c. 2000 elements (each element is a string). The strings are about 50-100 characters long and contain up to c. 20 words (the sets may become much larger).

I want to check if an element of the set A is also a member of the set B.

I think a naive implementation can be visualized as a matrix in which every element of A is compared with every element of B (for example, A1 == B1, A1 == B2, A1 == B3, etc.), with the Boolean result (0 or 1) of each comparison forming the corresponding matrix entry.

What is the best way to implement this efficiently?

Two further considerations:

(i) For larger sets, I am also thinking of using a Bloom filter (e.g. using PyBloom, pybloomfilter) to hash each string (i.e. I don't mind false positives so much...). Is this a good approach, or are there other strategies I should consider?

(ii) I'm thinking of incorporating the Levenshtein distance between strings (which, as far as I know, can be slow), because I may need fuzzy matches. Is there a way to combine this with the approach in (i), or otherwise make it more efficient?

Thanks in advance for your help!

+4

python string bloom-filter

user7289 Jun 23 '13 at 17:50

2 answers

As mentioned in the comments:

    def compare(A, B):
        return list(set(A).intersection(B))

+1

Runholt Sep 11 '14 at 13:19

michaelmeyer · Accepted Answer · 2013-06-23T18:51:38+0000

First, 2000 strings of 100 characters each is not that large; you can use a set directly.
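A minimal sketch of that set-based check, assuming A and B are plain Python lists of strings (the sample values below are made up for illustration):

```python
# Hypothetical sample data; the real inputs would hold ~2000 strings each.
A = ["the quick brown fox", "lorem ipsum dolor", "hello world"]
B = ["hello world", "another sentence", "the quick brown fox"]

# Building the set costs O(len(B)); each lookup is then O(1) on average.
B_set = set(B)

# One Boolean per element of A, in A's original order.
in_B = [a in B_set for a in A]

# Elements present in both collections (a set, so order is not guaranteed).
common = set(A) & B_set
```

This replaces the full A-by-B comparison matrix with one hash lookup per element of A.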

Second, if your strings are sorted, there is a quick way (which I found here) to compare them:

    def compare(E1, E2):
        i, j = 0, 0
        I, J = len(E1), len(E2)
        while i < I:
            if j >= J or E1[i] < E2[j]:
                print(E1[i], "is not in E2")
                i += 1
            elif E1[i] == E2[j]:
                print(E1[i], "is in E2")
                i, j = i + 1, j + 1
            else:
                j += 1

This is, of course, slower than using a set, but it does not require holding all the strings in memory (only two are needed at a time).
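To actually get the memory benefit, the same merge idea can be written over iterators instead of indexable lists. The following is only a sketch: `compare_sorted` is a hypothetical name, both inputs are assumed to be sorted in ascending order, and unlike the version above it does not consume the matching element of the second stream, so repeated items in the first stream are all reported as found.

```python
def compare_sorted(it1, it2):
    """Yield (item, found) for each item of it1; both inputs must be sorted ascending."""
    it2 = iter(it2)
    sentinel = object()          # marks exhaustion of the second stream
    current = next(it2, sentinel)
    for item in it1:
        # Advance the second stream until it catches up with item.
        while current is not sentinel and current < item:
            current = next(it2, sentinel)
        yield item, (current is not sentinel and current == item)


result = list(compare_sorted(["a", "c", "d"], ["b", "c", "e"]))
```

Because it only ever holds one element from each stream, the inputs could be generators or sorted files read line by line.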

As for Levenshtein distance, there is a C module available on PyPI that is pretty fast.
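For illustration only, here is a pure-Python stand-in for such a module, using the standard dynamic-programming recurrence. It is much slower than a C implementation. The helper `fuzzy_matches` and its `max_dist` threshold are assumptions, not part of any library. The cheap length check skips pairs that cannot possibly be within the threshold.

```python
def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_matches(a, B, max_dist=2):
    """Return the strings in B within max_dist edits of a (hypothetical helper)."""
    return [b for b in B
            if abs(len(a) - len(b)) <= max_dist  # length filter: distance >= length difference
            and levenshtein(a, b) <= max_dist]
```

One plausible way to combine the two ideas is to run the exact set lookup first and apply the fuzzy comparison only to strings that found no exact match.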
