Python: how to check if an element is in a list efficiently?

I have a list of strings (like words), and when I parse the text, I need to check if the word belongs to the group of words in my current list.

However, my input is quite large (about 600 million lines), and checking for the presence of an item in a list is an O(n) operation according to the Python documentation.

My code looks something like this:

    words_in_line = []
    for word in line:
        if word in my_list:
            words_in_line.append(word)

Since it takes too much time (days, actually), I would like to improve the part that takes most of the time. I looked at Python's collections module, and more specifically at deque. However, a deque only gives O(1) access time to the head and tail of the list, not to the middle.

Does anyone have an idea of how to do this better?


+4
4 answers

You might consider a trie or a DAWG, or a database. There are several good Python implementations of each.
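To make the trie suggestion concrete, here is a minimal, illustrative sketch of a dict-based trie for exact membership testing (real implementations such as the trie/DAWG libraries mentioned above are far more memory-efficient; all names here are my own):

```python
class Trie:
    """A toy trie: each node is a dict mapping a character to a child node."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node


t = Trie()
for w in ("cat", "car", "dog"):
    t.add(w)
print("car" in t, "ca" in t)  # True False ("ca" is only a prefix)
```

Lookup cost is proportional to the length of the word, independent of how many words are stored, and shared prefixes are stored only once.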

Here are some relative timings for you to consider, set vs. list:

    import timeit
    import random

    with open('/usr/share/dict/words', 'r') as di:  # UNIX ~250k unique word list
        all_words_set = {line.strip() for line in di}

    all_words_list = list(all_words_set)  # slightly faster if this list is sorted...

    test_list = [random.choice(all_words_list) for i in range(10000)]
    test_set = set(test_list)

    def set_f():
        count = 0
        for word in test_set:
            if word in all_words_set:
                count += 1
        return count

    def list_f():
        count = 0
        for word in test_list:
            if word in all_words_list:
                count += 1
        return count

    def mix_f():
        # use a list for the source, a set for membership testing
        count = 0
        for word in test_list:
            if word in all_words_set:
                count += 1
        return count

    print("list:", timeit.Timer(list_f).timeit(1), "secs")
    print("set:", timeit.Timer(set_f).timeit(1), "secs")
    print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")

This prints:

    list: 47.4126560688 secs
    set: 0.00277495384216 secs
    mixed: 0.00166988372803 secs

That is, checking a set of 10,000 words against a set of 250,000 words is 17,085 times faster than checking a list of 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 times faster than an unsorted list alone.

For membership testing, a list is O(n), while sets and dicts are O(1) per lookup.

Conclusion: use the right data structure for 600 million lines of text!

+11

A list comprehension works well here:

 words_in_line = [word for word in line if word in my_list] 

which would be somewhat more efficient than the code you posted, although it is hard to say how much faster on a dataset this large.

+2

There are two improvements you can make here.

  • Back your list of words with a hash table. This gives you O(1) lookup when you check whether a word is in the word list. There are several ways to do this; the most suitable in this case is to convert your list into a set.
  • Use a more appropriate structure for your collection of matching words.
    • If you need to keep all the matches in memory at the same time, use a deque, since its append performance exceeds that of lists.
    • If you do not need all the matches in memory at once, consider using a generator. A generator iterates over the matched values according to the logic you specify, but only keeps part of the result in memory at a time. It can offer improved performance if you are running into I/O bottlenecks.

Below is an example implementation based on my suggestions (using a generator, since I can't imagine you need all those words in memory at once).

    from itertools import chain

    d = set(['a', 'b', 'c'])  # Load our dictionary
    f = open('c:\\input.txt', 'r')
    # Build a generator to get the words in the file
    all_words_generator = chain.from_iterable(line.split() for line in f)
    # Build a generator to filter out the non-dictionary words
    matching_words_generator = (word for word in all_words_generator
                                if word in d)
    for matched_word in matching_words_generator:
        # Do something with matched_word
        print(matched_word)
    # We're reading the file during the above loop, so don't close it too early
    f.close()

input.txt

    a b dog cat c dog poop maybe b cat dog

Output

    a
    b
    c
    b
0

I don't understand why you chose a list in the first place, but here are a few alternatives:

Using a set() is probably a good idea. It is very fast, though unordered, and sometimes that is exactly what you need.

If you need things kept in order and still searchable, you can use some kind of tree: http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
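If ordered data with reasonably fast lookup is enough, the standard library's bisect module over a sorted list already gives O(log n) membership tests; a small sketch (the linked tree implementations offer more, such as cheap insertion):

```python
import bisect

# Keep the word list sorted once up front.
words = sorted(["banana", "apple", "cherry"])

def contains(sorted_words, word):
    """Binary-search membership test: O(log n) per lookup."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

print(contains(words, "banana"), contains(words, "durian"))  # True False
```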

If set membership testing with a small number of false positives is acceptable here, you might check out a Bloom filter: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
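To illustrate the trade-off, here is a toy Bloom filter sketch (my own simplified construction, not the linked library's API; in practice use a tested implementation). The k hash functions are simulated by salting Python's built-in hash():

```python
class BloomFilter:
    """Toy Bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, size=10000, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, word):
        # Simulate k independent hash functions by salting one hash.
        for salt in range(self.hashes):
            yield hash((salt, word)) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = True

    def might_contain(self, word):
        # False means definitely absent; True means probably present.
        return all(self.bits[pos] for pos in self._positions(word))


bf = BloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))  # True
```

The appeal for a 600-million-line workload is memory: the filter stores a fixed bit array rather than the words themselves.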

Depending on what you are doing, a trie could also be very good.

0

Source: https://habr.com/ru/post/1274915/

