I am trying to build an algorithm in Python to filter a large block of RDF data.
I have one list of about 70 thousand items, each formatted as <"datum">.
Then I have about 6 GB of items (triples) formatted as <"A"> <"B"> <"C">.
I want to extract all triples that contain any element of the first list, and then extract all triples that contain any new element from that first extraction (the net effect is to pull out the section of the graph connected in one step to the seeds from the first list).
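To make the goal concrete, here is a correctness-only sketch of the two passes I mean, on toy data (the names seeds, first_pass, frontier, and second_pass are just for illustration, not my actual code):

    small_list = ["Mickey", "Mouse", "Minny", "Cat"]                    # the ~70k seed items
    triples = [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"]]   # the 6 GB, parsed

    seeds = set(small_list)

    # Pass 1: every triple that shares at least one element with the seeds.
    first_pass = [t for t in triples if any(e in seeds for e in t)]

    # The new elements those triples drag in (everything except the seeds themselves).
    frontier = {e for t in first_pass for e in t} - seeds

    # Pass 2: every triple that touches one of the new elements.
    second_pass = [t for t in triples if any(e in frontier for e in t)]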
I have not been able to come up with a good algorithm for this (the fact that I have no formal CS training does not help).
The best I have come up with so far is to first split each triple in the big list into a list of its three elements, [<"A">, <"B">, <"C">]. Then I broke the big list into chunks and used multiprocessing to create processes that each take the complete small list and one chunk of the big list, and run:

    for line in big_list:
        for item in small_list:
            if item in line:
                bucket.append(line)
                break  # one matching element is enough (see the clarifications below)
This algorithm takes a lot of time.
Is there a faster way to do this? If there is a specific algorithm, you can just give me a name and I will figure out how to implement it.
Thanks!
Edit, to clarify in response to the comments:
All data items are strings. For example, the small list might contain ["Mickey", "Mouse", "Minny", "Cat"], and the big list might be [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"], ...]
Only one element of a triple in the big list has to match an element of the small list for the triple to count.
All the items in the small list are actually unique, so I didn't think converting it to a set would gain anything. But I will try it.
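If I understand the suggestion, the point of a set is O(1) membership tests rather than de-duplication, so it should still help. A minimal sketch of what I plan to try (same toy data as above):

    small_list = ["Mickey", "Mouse", "Minny", "Cat"]
    big_list = [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"]]

    small_set = frozenset(small_list)

    # isdisjoint() is False exactly when the line shares an element with the set,
    # and each test is ~3 hash lookups instead of a scan of 70k items.
    bucket = [line for line in big_list if not small_set.isdisjoint(line)]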
I can create any intermediate structures that I want. I am experimenting with an inverted index built with shelve right now.
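Roughly what I mean by that, as a sketch: assume the big file has one whitespace-separated triple per line (triples.txt and index.db are made-up names):

    import shelve

    # Map each element to the byte offsets of the lines (triples) that contain it,
    # so a seed can later be looked up without rescanning the whole 6 GB file.
    with shelve.open('index.db') as index, open('triples.txt') as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            for element in line.split()[:3]:  # the three elements; drops any trailing "."
                postings = index.get(element, [])
                postings.append(offset)
                index[element] = postings     # shelve only persists on reassignment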