I use Python (and have access to pandas, numpy, scipy).
I have two sets of strings, A and B. Each set contains c. 2000 elements (each element is a string). The strings are about 50-100 characters long and contain up to c. 20 words (the sets may become much larger).
For each element of set A, I want to check whether it is also a member of set B.
I think a naive implementation can be visualized as a matrix in which every element of A is compared with every element of B (for example, A1 == B1, A1 == B2, A1 == B3, etc.), and the Boolean results (0, 1) of those comparisons make up the matrix entries.
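To make the matrix idea concrete, here is a minimal sketch of what I mean. The lists `A` and `B` are hypothetical toy data standing in for my real sets, and the names `match` and `in_b` are only illustrative.

```python
# Hypothetical toy data standing in for the real sets A and B.
A = ["the quick brown fox", "hello world", "lorem ipsum"]
B = ["hello world", "foo bar", "the quick brown fox", "baz"]

# Naive all-pairs comparison: match[i][j] is 1 if A[i] == B[j], else 0.
# This performs len(A) * len(B) string comparisons.
match = [[int(a == b) for b in B] for a in A]

# An element of A is a member of B if its row contains at least one 1.
in_b = [any(row) for row in match]

for a, found in zip(A, in_b):
    print(f"{a!r}: {found}")
```

With about 2000 elements on each side this is roughly 4 million comparisons, which is why I suspect there is a better way.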
What is the best way to implement this efficiently?
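For exact matches, the one alternative I am aware of is Python's built-in `set`, which uses hashing so that each membership test takes roughly constant time on average. A minimal sketch, again with hypothetical toy data and illustrative variable names:

```python
# Hypothetical toy data standing in for the real sets A and B.
A = ["the quick brown fox", "hello world", "lorem ipsum"]
B = ["hello world", "foo bar", "the quick brown fox", "baz"]

# Build a hash-based set from B once, then test each element of A against it.
b_set = set(B)
in_b = [a in b_set for a in A]

# The same information expressed with set operations.
common = set(A) & b_set       # elements present in both A and B
only_in_a = set(A) - b_set    # elements of A that are missing from B

print(in_b)
print(sorted(common))
print(sorted(only_in_a))
```

This only covers exact equality, so it does not address the fuzzy matching I mention further down.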
Two further elaborations:
(i) I also think that for larger sets I could use a Bloom filter (e.g. using PyBloom or pybloomfilter) to hash each string (i.e. I don't mind false positives too much). Is this a good approach, or are there other strategies I should consider?
(ii) I'm thinking of including the Levenshtein distance between strings (which, as far as I know, can be slow), because I may need fuzzy matches. Is there a way to combine this with the approach in (i), or otherwise make it more efficient?
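To illustrate how I imagine (i) and (ii) fitting together, here is a rough sketch under several assumptions. `ToyBloom` is a hypothetical, hand-rolled stand-in and not the API of PyBloom or pybloomfilter; `levenshtein` is a plain dynamic-programming implementation; `classify` and the `max_dist` threshold are names I made up for the example. The idea is to use the Bloom filter as a cheap exact-match check and to run the expensive edit distance only on strings that miss it, skipping candidates whose length difference already exceeds the threshold.

```python
import hashlib


class ToyBloom:
    """Tiny illustrative Bloom filter (hypothetical, not a library API)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def levenshtein(s, t):
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(s) < len(t):
        s, t = t, s
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(A, B, max_dist=2):
    """Label each string in A as a probable exact, fuzzy, or non-match in B."""
    bloom = ToyBloom()
    for b in B:
        bloom.add(b)
    result = {}
    for a in A:
        if a in bloom:
            result[a] = "probably exact"
            continue
        # The length difference is a lower bound on the edit distance,
        # so candidates that differ too much in length can be skipped.
        candidates = (b for b in B if abs(len(a) - len(b)) <= max_dist)
        fuzzy = any(levenshtein(a, b) <= max_dist for b in candidates)
        result[a] = "fuzzy" if fuzzy else "no match"
    return result


# Hypothetical toy data.
A = ["hello world", "helo world", "something else entirely"]
B = ["hello world", "foo bar"]
result = classify(A, B)
print(result)
```

The fuzzy step is still quadratic in the worst case, so I am unsure whether this scales, which is really what I am asking.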
Thanks in advance for your help!