What is the best way to search a large file?

I am looking to apply KMP (or similar) search to a large file (> 4 GB).

I expect this to cause problems: I cannot load the whole file into memory because there is not enough space.

My question is: what is the best way to do this search? If I just open a FILE * and search the file directly, should I copy blocks (say 4 KB) into memory and search within them, or do something else entirely?

+3
4 answers

If you are on a platform that supports it, you can use mmap(). Reading the file in chunks is also possible, but keep the buffer as large as you can to reduce I/O overhead, and be careful at the boundary between two chunks (the string may match but be split across the boundary).

As an alternative, I suggest you create some kind of index and use it to narrow your search. KMP search is not particularly efficient. This, of course, depends on the nature of your file, how it is generated, and so on.
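The boundary problem can be handled by carrying the last m - 1 bytes of each chunk over to the front of the next one, where m is the pattern length. Below is a minimal sketch of that idea, not a tuned implementation. The name `find_in_stream` and its parameters are made up for illustration, and the inner scan is a plain memcmp loop standing in for KMP or whichever matcher you prefer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the byte offset of the first occurrence of
 * pat (m bytes) in the stream, or -1 if it is absent or allocation fails.
 * chunk is the number of fresh bytes requested per read. */
long long find_in_stream(FILE *fp, const unsigned char *pat, size_t m,
                         size_t chunk)
{
    assert(m > 0 && chunk > 0);

    /* Room for one chunk plus the m - 1 bytes carried over. */
    unsigned char *buf = malloc(chunk + m - 1);
    if (buf == NULL)
        return -1;

    size_t keep = 0;        /* bytes carried over from the previous chunk */
    long long base = 0;     /* file offset corresponding to buf[0] */
    long long result = -1;
    int at_end = 0;

    while (!at_end && result < 0) {
        size_t got = fread(buf + keep, 1, chunk, fp);
        size_t len = keep + got;
        at_end = (got < chunk);     /* short read: EOF or read error */

        /* Naive scan of the buffer; swap in KMP here if desired. */
        for (size_t i = 0; i + m <= len; i++) {
            if (memcmp(buf + i, pat, m) == 0) {
                result = base + (long long)i;
                break;
            }
        }

        /* Keep the last m - 1 bytes so a match that straddles the
         * chunk boundary is still seen on the next pass. */
        size_t tail = (len < m - 1) ? len : m - 1;
        memmove(buf, buf + (len - tail), tail);
        base += (long long)(len - tail);
        keep = tail;
    }

    free(buf);
    return result;
}
```

In practice you would call it with a large chunk, for example `find_in_stream(fp, pat, m, 1 << 20)`. On a file larger than 4 GB the stream must be opened with large-file support on your platform. The function itself only reads sequentially and keeps the offset in a `long long`.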

+2

, . Unix. , , . , .

Boyer-Moore.
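For reference, here is a rough sketch of Horspool's simplification of Boyer-Moore, which uses only a bad-character shift table. It is not the full Boyer-Moore algorithm (there is no good-suffix rule), and the function name `horspool_search` is invented for this example. It searches an in-memory buffer, so for a large file it would be applied to each chunk or to a mapped region.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the first occurrence of pat
 * (m bytes) inside text (n bytes), or NULL if there is none. */
const unsigned char *horspool_search(const unsigned char *text, size_t n,
                                     const unsigned char *pat, size_t m)
{
    size_t shift[UCHAR_MAX + 1];

    if (m == 0 || n < m)
        return NULL;

    /* Default shift: the byte does not occur in pat[0 .. m-2], so the
     * window can jump by the whole pattern length. */
    for (size_t c = 0; c <= UCHAR_MAX; c++)
        shift[c] = m;

    /* For bytes that do occur, shift so that their last occurrence
     * (excluding the final pattern byte) lines up with the text. */
    for (size_t i = 0; i + 1 < m; i++)
        shift[pat[i]] = m - 1 - i;

    size_t pos = 0;
    while (pos <= n - m) {
        /* Compare the window right to left. */
        size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1])
            j--;
        if (j == 0)
            return text + pos;

        /* Shift by the table entry for the text byte under the last
         * pattern position. */
        pos += shift[text[pos + m - 1]];
    }
    return NULL;
}
```

The appeal over KMP is that many text bytes are never inspected when the pattern is reasonably long, because the window often jumps by up to m bytes at a time.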

+2

, . , , (SearchLength), , , SearchLength .

+1

- . , , .

However, as a rule, it is more efficient to index the file in some way so that you do not have to search the entire file linearly. For example, KMP is a string search algorithm: are you just looking for occurrences of a word? Then you can simply build a hash table (on disk) of the words and their locations in the file, and lookups become very fast.
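To make the word-index idea concrete, here is a small sketch that scans a stream once and records the byte offset of every word in a chained hash table. It makes several simplifying assumptions: the table lives in memory rather than on disk, a "word" is any run of alphanumeric bytes, and words longer than 63 bytes are truncated. All names (`word_index`, `index_build`, `index_lookup`, `index_free`) are invented for the example.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096
#define MAXWORD  64

struct entry {
    char word[MAXWORD];
    long long offset;           /* byte offset of one occurrence */
    struct entry *next;
};

struct word_index {
    struct entry *buckets[NBUCKETS];
};

/* FNV-1a hash reduced to a bucket number. */
static uint32_t hash_word(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h % NBUCKETS;
}

/* Records one occurrence; silently skips it if allocation fails. */
static void index_add(struct word_index *idx, const char *word, long long off)
{
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return;
    strncpy(e->word, word, MAXWORD - 1);
    e->word[MAXWORD - 1] = '\0';
    e->offset = off;
    uint32_t b = hash_word(e->word);
    e->next = idx->buckets[b];  /* push at the head of the chain */
    idx->buckets[b] = e;
}

/* Scans the stream once and indexes every alphanumeric word. */
struct word_index *index_build(FILE *fp)
{
    struct word_index *idx = calloc(1, sizeof *idx);
    if (idx == NULL)
        return NULL;

    char word[MAXWORD];
    size_t len = 0;
    long long pos = 0, start = 0;
    int c;

    do {
        c = getc(fp);
        if (c != EOF && isalnum(c)) {
            if (len == 0)
                start = pos;
            if (len < MAXWORD - 1)
                word[len++] = (char)c;
        } else if (len > 0) {   /* separator or EOF ends the word */
            word[len] = '\0';
            index_add(idx, word, start);
            len = 0;
        }
        pos++;
    } while (c != EOF);

    return idx;
}

/* Copies up to max offsets of word into out, most recent occurrence
 * first, and returns how many were copied. */
size_t index_lookup(const struct word_index *idx, const char *word,
                    long long *out, size_t max)
{
    size_t n = 0;
    const struct entry *e = idx->buckets[hash_word(word)];
    for (; e != NULL && n < max; e = e->next)
        if (strcmp(e->word, word) == 0)
            out[n++] = e->offset;
    return n;
}

void index_free(struct word_index *idx)
{
    for (size_t b = 0; b < NBUCKETS; b++) {
        struct entry *e = idx->buckets[b];
        while (e != NULL) {
            struct entry *next = e->next;
            free(e);
            e = next;
        }
    }
    free(idx);
}
```

For a multi-gigabyte file the index itself may not fit in memory, which is why the answer suggests keeping it on disk. The lookup logic stays the same, only the storage changes.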

+1

Source: https://habr.com/ru/post/1714055/