Check if a string is in a 2 GB list of strings in Python

I have a large 2 GB file (A.txt) containing a list of words, one per line: ['Question', 'Q1', 'Q2', 'Q3', 'Ans1', 'Format', 'links', ...].

Now I have another file (1 TB) whose lines contain those words in the second position.

For example:

    a, Question, b
    The, quiz, is
    This, Q1, Answer
    Here, Ans1, is
    King1, links, King2
    programming, language, drupal
    .....

I want to save (to another file) the lines whose second field matches one of the entries in A.txt. That is, I want to keep the following lines:

    a, Question, b
    This, Q1, Answer
    Here, Ans1, is
    King1, links, King2

I know how to do this with any() when the list in A.txt has, say, 100 entries. But I do not understand how to do it when the list in A.txt is 2 GB.

2 answers

Do not use a list; use a set instead.

Read the first file into a set:

    with open('A.txt') as file_a:
        words = {line.strip() for line in file_a}

0.5 GB of words is not too much to store in a set.

Now you can test membership in words in O(1) constant time:

    if second_word in words:
        # ....

Open the second file and process it line by line, possibly with the csv module if the fields are comma-separated.
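Putting the pieces together, a minimal sketch of this approach might look like the following. The file names B.txt and matches.txt are made up for illustration, and the tiny stand-in files here take the place of the real 2 GB and 1 TB inputs:

```python
import csv

# Demo setup: small stand-in files (the real A.txt is 2 GB,
# the real data file is 1 TB -- names and contents are illustrative).
with open('A.txt', 'w') as f:
    f.write('Question\nQ1\nAns1\nlinks\n')
with open('B.txt', 'w') as f:
    f.write('a,Question,b\nThe,quiz,is\nThis,Q1,Answer\n'
            'Here,Ans1,is\nKing1,links,King2\nprogramming,language,drupal\n')

# Load A.txt into a set for O(1) membership tests.
with open('A.txt') as file_a:
    words = {line.strip() for line in file_a}

# Stream the big file line by line; keep rows whose second field matches.
with open('B.txt', newline='') as infile, \
        open('matches.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        if len(row) > 1 and row[1].strip() in words:
            writer.writerow(row)
```

Because the big file is read one line at a time, only the set itself has to fit in memory.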

For an even larger set of words, use a database instead; Python ships with the sqlite3 library:

    import sqlite3
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE words (word UNIQUE)')

    with open('A.txt') as file_a, conn:
        cursor = conn.cursor()
        for line in file_a:
            cursor.execute('INSERT OR IGNORE INTO words VALUES (?)', (line.strip(),))

then test against it:

    cursor = conn.cursor()
    for line in second_file:
        second_word = hand_waving
        cursor.execute('SELECT 1 FROM words WHERE word=?', (second_word,))
        if cursor.fetchone():
            # ....

Even though I use a :memory: database here, SQLite is smart enough to store data in temporary files when you start filling up memory. A :memory: connection is basically a temporary, throwaway database. You can also use a real file path if you want to reuse the word database.
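A minimal end-to-end sketch of that last variant, using an on-disk database file (words.db is a made-up name) so the lookup table can be reused across runs; the tiny A.txt here stands in for the real 2 GB file:

```python
import os
import sqlite3

# Demo setup: a small stand-in for the real 2 GB A.txt.
with open('A.txt', 'w') as f:
    f.write('Question\nQ1\nAns1\nlinks\n')

# Start fresh for the demo; on a real reuse you would keep the file.
if os.path.exists('words.db'):
    os.remove('words.db')

conn = sqlite3.connect('words.db')  # a real path, so the table survives the run
conn.execute('CREATE TABLE IF NOT EXISTS words (word UNIQUE)')

# The connection as context manager commits the inserts on exit.
with open('A.txt') as file_a, conn:
    conn.executemany('INSERT OR IGNORE INTO words VALUES (?)',
                     ((line.strip(),) for line in file_a))

cursor = conn.cursor()

def is_known(word):
    """Return True if word is in the words table."""
    cursor.execute('SELECT 1 FROM words WHERE word = ?', (word,))
    return cursor.fetchone() is not None
```

Running the same script again would find words.db already populated, so the load step could be skipped entirely.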


Start with Martijn Pieters' answer. If it is too slow, you can use a Bloom filter to cut down on database lookups by eliminating lines that cannot possibly match any word in your list. Python comes with a built-in hash function that you can use for one of the filter's hashes, and you can add any number of others.
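A toy sketch of the idea, assuming two hash functions (Python's built-in hash plus an MD5-derived one) over a bit array; the class name, size, and sample words are all made up for illustration:

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter sketch: two hash functions over a bit array.

    A Bloom filter answers "definitely not present" or "maybe present";
    every "maybe" must still be confirmed against the real set/database.
    """

    def __init__(self, size=1 << 20):
        self.size = size
        self.bits = bytearray(size // 8)

    def _hashes(self, word):
        # Hash 1: Python's built-in hash. It is salted per process, so the
        # filter cannot be persisted across runs, but within one run it is
        # consistent, which is all we need here.
        h1 = hash(word) % self.size
        # Hash 2: the first 8 bytes of an MD5 digest.
        h2 = int.from_bytes(hashlib.md5(word.encode()).digest()[:8],
                            'big') % self.size
        return h1, h2

    def add(self, word):
        for h in self._hashes(word):
            self.bits[h // 8] |= 1 << (h % 8)

    def might_contain(self, word):
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._hashes(word))

bloom = BloomFilter()
for w in ('Question', 'Q1', 'Ans1', 'links'):
    bloom.add(w)
```

Only lines whose second word passes might_contain() would then be checked against the database, so most non-matching lines never touch it at all.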


Source: https://habr.com/ru/post/946120/
