Search for a large list of words in another large list

I have a sorted list of 1,000,000 rows of protein names (maximum length 256 characters), each row with an associated identifier. I have another, unsorted list of 4,000,000,000 words (maximum length 256 characters) taken from articles, and each word has an identifier.

I want to find all the matches between the list of protein names and the list of words from the articles. Which algorithm should I use? Should I use some pre-built API?

It would be nice if the algorithm worked on a regular PC without special hardware.

Estimates of the time required by the algorithm would be good, but are not necessary.

+4
5 answers

4 billion lines is a lot of lines to search.

You could fit the protein list into an in-memory hash table for quick lookups, but most likely you will want to keep the whole list on roomier (but slower) disk, in which case the sorted order lends itself to a relatively efficient binary search.
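A minimal sketch of the in-memory hash approach, assuming the 1M protein names fit in RAM (the names, identifiers, and variable names below are made up for illustration):

```python
# Build a hash table from protein name -> protein identifier once,
# then stream the article words past it with O(1) average-case lookups.
proteins = {"BRCA1": "P001", "TP53": "P002", "EGFR": "P003"}

# Each article word carries its own identifier, per the question.
article_words = [("kinase", "A1"), ("TP53", "A2"), ("EGFR", "A3")]

matches = []
for word, article_id in article_words:
    protein_id = proteins.get(word)
    if protein_id is not None:
        matches.append((word, protein_id, article_id))

print(matches)
```

With 1,000,000 names of up to 256 characters, the dictionary is on the order of a few hundred MB, which is why it may or may not fit in RAM on a regular PC.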

If your binary search (or whatever lookup you use) is a function called find_string_in_articles(), the pseudocode is:

 foreach $protein_name ( @protein_names ) {
     if ( $article_id = find_string_in_articles( $protein_name ) ) {
         print( "$protein_name matches $article_id\n" );
     }
 }
+1

You can sort both lists and then do a "mergesort"-style pass that does not actually merge, but instead reports the duplicates/overlaps. Wikipedia has good material on this.
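A sketch of that merge-style pass, assuming both lists are already sorted (the function name and sample data are made up):

```python
def sorted_intersection(a, b):
    """Walk two sorted lists in lockstep, as in the merge step of
    mergesort, but collect common items instead of merging."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1          # advance the list with the smaller head
        else:
            j += 1
    return out

print(sorted_intersection(["ACTB", "BRCA1", "TP53"],
                          ["AKT1", "BRCA1", "TP53", "ZNF1"]))
```

The pass itself is linear in the total input size; the cost is in sorting the 4 billion words first.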

Sorting that amount of data probably requires more memory than is available. I don't know whether Unix sort (also available on Windows/Mac) can handle it, but any decent SQL database can.
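A sketch of letting an SQL engine do the sort and join, here using Python's built-in sqlite3 as one possible backend (table names, column names, and sample rows are made up; a real run would use an on-disk database, not :memory:):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE proteins (name TEXT PRIMARY KEY, protein_id TEXT)")
con.execute("CREATE TABLE words (word TEXT, article_id TEXT)")
con.executemany("INSERT INTO proteins VALUES (?, ?)",
                [("BRCA1", "P001"), ("TP53", "P002")])
con.executemany("INSERT INTO words VALUES (?, ?)",
                [("kinase", "A1"), ("TP53", "A2"), ("BRCA1", "A3")])

# The matching problem is a plain equi-join; the engine picks the
# sort/hash strategy and spills to disk as needed.
rows = con.execute(
    "SELECT w.word, p.protein_id, w.article_id "
    "FROM words w JOIN proteins p ON p.name = w.word "
    "ORDER BY w.article_id").fetchall()
print(rows)
```

An index on words.word (or the PRIMARY KEY on proteins.name, as here) is what keeps the join from being quadratic.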

Another possibility is to build a radix tree (trie) over your protein names: names starting with A go into bin A, names starting with B into bin B, and so on. Then simply loop over the 4 billion words and look for overlaps in the matching bin (you will probably need more than one level of radix binning to discard more proteins at a time).
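A sketch of one-level radix binning, assuming made-up names; with 1M proteins you would likely bin on the first two or three characters instead of one:

```python
from collections import defaultdict

def build_bins(protein_names):
    """Bin protein names by their first character."""
    bins = defaultdict(set)
    for name in protein_names:
        bins[name[0]].add(name)
    return bins

def find_overlaps(bins, words):
    """For each word, only the bin matching its first character
    needs to be consulted; all other proteins are discarded at once."""
    return [w for w in words if w and w in bins.get(w[0], ())]

bins = build_bins(["AKT1", "BRCA1", "BRAF", "TP53"])
print(find_overlaps(bins, ["BRAF", "kinase", "TP53"]))
```

Each bin here is a set, so the inner lookup is a hash probe; with deeper binning the per-bin structures shrink enough to stay cache-resident, which is the point of the scheme.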

+1

This is essentially a relational join. Assuming you haven't already sorted the article words, your basic algorithm would be:

 for word in article_words:
     if proteins.find(word):
         found_match(word)

proteins.find() is the tricky part, and you will need to experiment to get the best performance; with this much data, cache effects start to come into play. I would first try radix binning, since it is quite simple and most likely fast enough, but binary search and hashing are also alternatives.
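One of the alternatives mentioned, binary search, can be sketched with Python's bisect over the sorted protein list (the names below are illustrative; in practice the list holds ~1M sorted entries):

```python
import bisect

sorted_proteins = ["AKT1", "BRCA1", "EGFR", "TP53"]  # must stay sorted

def find(word):
    """proteins.find() as a binary search: O(log n) per word."""
    i = bisect.bisect_left(sorted_proteins, word)
    return i < len(sorted_proteins) and sorted_proteins[i] == word

matches = [w for w in ["EGFR", "ribosome", "TP53"] if find(w)]
print(matches)
```

At ~20 comparisons per lookup over 1M names, the 4 billion probes are feasible but memory-access bound, which is where the cache effects above show up.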

+1

It sounds like you might want a binary search tree.

0

I would do this in one of two ways.

  • Load it into an SQL database and pull out the data you need (slower but simpler).
  • Sort the lists, then do binary searches to find what you need (fast but complex).
0

Source: https://habr.com/ru/post/1305706/
