How can I delete a large number of phrases in one pass from a large text file?

I was wondering: can I somehow remove a large number (100 thousand) of text phrases from a large (18 GB) text file in a single pass?

4 answers

Rabin-Karp works well for searching for multiple substrings at once, but I think it requires your phrases to be the same length.

If they differ in length, you can search for sub-phrases whose length is the minimum length over all phrases, and then extend the comparison when you find something.

Another thought: you can extend this to use a small set of q substring lengths, chosen according to your search phrases. You would modify Rabin-Karp to keep q rolling hashes instead of one, with q sets of hashes. This helps if you can split your phrases into q subsets of similar length.
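
To make the minimum-length idea concrete, here is a rough Python sketch. It is only an illustration under my own assumptions: the function name remove_phrases, the hash base and the modulus are all made up for the example, and the text is assumed to fit in memory (for an 18 GB file you would feed it one line or chunk at a time, which only works if no phrase spans a boundary). It hashes the first m characters of every phrase, rolls a single hash over the text, and verifies each hit with a real string comparison before deleting.

```python
def remove_phrases(text, phrases):
    """Delete phrase occurrences from text, scanning left to right.

    At each position the longest phrase that matches there is removed.
    """
    phrases = [p for p in set(phrases) if p]
    if not phrases:
        return text
    m = min(len(p) for p in phrases)      # shared prefix length
    n = len(text)
    if n < m:
        return text
    base, mod = 256, (1 << 61) - 1        # arbitrary choices for this sketch

    def poly_hash(s):
        v = 0
        for ch in s:
            v = (v * base + ord(ch)) % mod
        return v

    # hash of the first m characters -> phrases sharing it, longest first
    table = {}
    for p in phrases:
        table.setdefault(poly_hash(p[:m]), []).append(p)
    for bucket in table.values():
        bucket.sort(key=len, reverse=True)

    high = pow(base, m - 1, mod)          # weight of the outgoing character
    out = []
    skip_until = 0                        # characters before this index are deleted
    cur = poly_hash(text[:m])             # hash of text[i:i+m]
    for i in range(n):
        if i + m <= n and i >= skip_until and cur in table:
            for p in table[cur]:
                # a hash hit is only a candidate, so compare the real text
                if text.startswith(p, i):
                    skip_until = i + len(p)
                    break
        if i >= skip_until:
            out.append(text[i])
        if i + m < n:
            # roll the window one character to the right
            cur = ((cur - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return "".join(out)
```

Calling remove_phrases(line, phrases) on each line and writing the result out keeps it to one pass over the file. The q-hash variant would keep one table and one rolling hash per length class instead of a single pair.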

You can build a suffix tree from your list of phrases and walk the file with it. This lets you identify every occurrence of every phrase. The technique is often used to tag material, but you can also adapt it to delete the matches.
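
For illustration, here is a small Python sketch of this kind of tree-driven scan. Note that it swaps in an Aho-Corasick automaton (a trie of the phrases plus failure links) rather than a literal suffix tree, since that is the structure I know works for matching many phrases in one pass. The names build_automaton and remove_phrases are invented for the example, and the text is assumed to be processed one in-memory chunk at a time. It deletes every character covered by at least one match, so overlapping matches are all removed.

```python
from collections import deque


def build_automaton(phrases):
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    best = [0]    # longest phrase ending at this state (0 = none)
    for p in phrases:
        if not p:
            continue
        s = 0
        for ch in p:
            nxt = goto[s].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[s][ch] = nxt
                goto.append({})
                fail.append(0)
                best.append(0)
            s = nxt
        best[s] = len(p)
    # breadth-first pass: shallower states are finished before deeper ones
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # inherit a match that ends here via the failure link
            best[t] = max(best[t], best[fail[t]])
            queue.append(t)
    return goto, fail, best


def remove_phrases(text, phrases):
    goto, fail, best = build_automaton(phrases)
    keep = bytearray(b"\x01") * len(text)    # 1 = keep, 0 = delete
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        if best[s]:
            # the longest match ending at i covers any shorter one ending here
            keep[i - best[s] + 1:i + 1] = bytes(best[s])
    return "".join(ch for ch, k in zip(text, keep) if k)
```

The automaton is built once from the 100 thousand phrases and then reused for every chunk, so the cost per character of the file does not grow with the number of phrases.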

I'm going to go out on a limb here and suggest you use AWK, because it is very fast for this kind of task.

Are the phrases all the same? For example, is it a single word that you want to delete? If so, you could perhaps check each line with the "in" keyword inside a while loop and remove every instance of the word from that line. More information about the problem is needed.
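
Assuming the "in" keyword here means Python's substring test, the single-word case might look like the sketch below. The function name strip_word and its parameters are hypothetical; src and dst are any open text file objects.

```python
def strip_word(src, dst, word):
    """Copy src to dst line by line, deleting every instance of word."""
    line = src.readline()
    while line:                       # readline() returns "" at end of file
        if word in line:
            line = line.replace(word, "")
        dst.write(line)
        line = src.readline()
```

Reading line by line keeps memory use flat regardless of file size, but this only covers one word; looping it over 100 thousand phrases would rescan every line once per phrase.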


Source: https://habr.com/ru/post/1380541/

