Search for a large list of words in another large list

I have a sorted list of 1,000,000 rows of protein names (maximum length 256 characters), each row with an associated identifier. I have another, unsorted list of 4,000,000,000 words (maximum length 256 characters) taken from articles, and each word has an identifier.

I want to find all the matches between the list of protein names and the list of words from the articles. Which algorithm should I use? Should I use some pre-built API?

It would be nice if the algorithm worked on a regular PC without special hardware.

Estimates of the time required by the algorithm would be good, but are not necessary.

+4
5 answers

4 billion lines is a lot of lines to search.

You could fit the protein list into an in-memory hash table for quick lookups, but most likely you will want to keep the whole list on roomier (but slower) disk, in which case the sorted order lends itself to a relatively efficient binary search.
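A minimal sketch of the in-memory hash approach, assuming the 1M protein names fit in RAM (the names, identifiers, and variable names below are made up for illustration):

```python
# Build a hash table from protein name -> protein identifier once,
# then stream the article words past it with O(1) average-case lookups.
proteins = {"BRCA1": "P001", "TP53": "P002", "EGFR": "P003"}

# Each article word carries its own identifier, per the question.
article_words = [("kinase", "A1"), ("TP53", "A2"), ("EGFR", "A3")]

matches = []
for word, article_id in article_words:
    protein_id = proteins.get(word)
    if protein_id is not None:
        matches.append((word, protein_id, article_id))

print(matches)
```

With 1,000,000 names of up to 256 characters, the dictionary is on the order of a few hundred MB, which is why it may or may not fit in RAM on a regular PC.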

If your binary search (or whatever lookup you use) is a function called find_string_in_articles(), the pseudocode is:

 foreach $protein_name ( @protein_names ) {
     if ( $article_id = find_string_in_articles( $protein_name ) ) {
         print( "$protein_name matches $article_id\n" );
     }
 }
+1

You can sort both lists and then do a "mergesort"-style pass that does not actually merge, but instead reports the duplicates/overlaps. Wikipedia has good material on this.
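A sketch of that merge-style pass, assuming both lists are already sorted (the function name and sample data are made up):

```python
def sorted_intersection(a, b):
    """Walk two sorted lists in lockstep, as in the merge step of
    mergesort, but collect common items instead of merging."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1          # advance the list with the smaller head
        else:
            j += 1
    return out

print(sorted_intersection(["ACTB", "BRCA1", "TP53"],
                          ["AKT1", "BRCA1", "TP53", "ZNF1"]))
```

The pass itself is linear in the total input size; the cost is in sorting the 4 billion words first.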

Sorting that amount of data probably requires more memory than is available. I don't know whether Unix sort (also available on Windows/Mac) can handle it, but any decent SQL database can.
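A sketch of letting an SQL engine do the sort and join, here using Python's built-in sqlite3 as one possible backend (table names, column names, and sample rows are made up; a real run would use an on-disk database, not :memory:):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE proteins (name TEXT PRIMARY KEY, protein_id TEXT)")
con.execute("CREATE TABLE words (word TEXT, article_id TEXT)")
con.executemany("INSERT INTO proteins VALUES (?, ?)",
                [("BRCA1", "P001"), ("TP53", "P002")])
con.executemany("INSERT INTO words VALUES (?, ?)",
                [("kinase", "A1"), ("TP53", "A2"), ("BRCA1", "A3")])

# The matching problem is a plain equi-join; the engine picks the
# sort/hash strategy and spills to disk as needed.
rows = con.execute(
    "SELECT w.word, p.protein_id, w.article_id "
    "FROM words w JOIN proteins p ON p.name = w.word "
    "ORDER BY w.article_id").fetchall()
print(rows)
```

An index on words.word (or the PRIMARY KEY on proteins.name, as here) is what keeps the join from being quadratic.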

Another possibility is to build a radix tree (trie) over your protein names: names starting with A go into bin A, names starting with B into bin B, and so on. Then simply loop over the 4 billion words and look for overlaps in the matching bin (you will probably need more than one level of radix binning to discard more proteins at a time).
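A sketch of one-level radix binning, assuming made-up names; with 1M proteins you would likely bin on the first two or three characters instead of one:

```python
from collections import defaultdict

def build_bins(protein_names):
    """Bin protein names by their first character."""
    bins = defaultdict(set)
    for name in protein_names:
        bins[name[0]].add(name)
    return bins

def find_overlaps(bins, words):
    """For each word, only the bin matching its first character
    needs to be consulted; all other proteins are discarded at once."""
    return [w for w in words if w and w in bins.get(w[0], ())]

bins = build_bins(["AKT1", "BRCA1", "BRAF", "TP53"])
print(find_overlaps(bins, ["BRAF", "kinase", "TP53"]))
```

Each bin here is a set, so the inner lookup is a hash probe; with deeper binning the per-bin structures shrink enough to stay cache-resident, which is the point of the scheme.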

+1

This is essentially a relational join. Assuming you haven't already sorted the article words, your basic algorithm would be:

 for word in article_words:
     if proteins.find(word):
         found_match(word)

proteins.find() is the tricky part, and you will need to experiment to get the best performance; with this much data, cache effects start to come into play. I would first try radix binning, since it is quite simple and most likely fast enough, but binary search and hashing are also alternatives.
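One of the alternatives mentioned, binary search, can be sketched with Python's bisect over the sorted protein list (the names below are illustrative; in practice the list holds ~1M sorted entries):

```python
import bisect

sorted_proteins = ["AKT1", "BRCA1", "EGFR", "TP53"]  # must stay sorted

def find(word):
    """proteins.find() as a binary search: O(log n) per word."""
    i = bisect.bisect_left(sorted_proteins, word)
    return i < len(sorted_proteins) and sorted_proteins[i] == word

matches = [w for w in ["EGFR", "ribosome", "TP53"] if find(w)]
print(matches)
```

At ~20 comparisons per lookup over 1M names, the 4 billion probes are feasible but memory-access bound, which is where the cache effects above show up.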

+1

It sounds like you might want a binary search tree.

0

I would do this in one of two ways.

  • Load it into an SQL database and pull out the data you need (slower but simpler).
  • Sort the lists, then do binary searches to find what you need (fast but complex).
0

Source: https://habr.com/ru/post/1305706/
