How to search multiple lines in a text file

I work in text files. I want to implement a search algorithm in Java. I have text files to look for.

If I want to find a single word, I can do this by simply placing all the text in a hash map and saving each word. But is there any algorithm if I want to search for two strings (or maybe more)? Do I have to hash the strings in a pair of two?

+6
source share
2 answers

It depends on the size of the text file. Usually you need to consider several cases:

  • Many requests for very short documents (web pages, essay texts, etc.). Spread text like a regular language. The simple O (n ^ 2) algorithm is good. To query for length n, just take a window of length n and slide it. Compare and move the window until you find a match. This algorithm does not care about words, so you just see the whole search as a large string (including spaces). This is probably what most browsers do. KMP or Boyer Moore are not worth the effort, as the case of O (n ^ 2) is very rare.

  • Several requests for one large document. Pre-process the document and save it pre-processed. Common storage options are suffix trees and inverted lists. If you have several documents, you can create one document from the moment of their combination and storage of documents. This is the way for document databases where the collection is almost constant.

  • If you have several documents in which you have high redundancy and your collections change frequently, use KMP or Boyer Moore. For example, if you want to find specific sequences in DNA data, and you often get new sequences to find also new DNA from experiments, the O (n ^ 2) part of the naive algorithm will kill your time.

There are probably many features that require different algorithms and data structures, so you should figure out which one is better in your case.

+3
source

Before proposing an approach, additional information is required:

Are you looking for just whole words or any substring?

Are you going to look for many different words in one unchanged file?

Do you know the words you want to find right away?

There are many efficient (linear) search algorithms for strings. If possible, I would suggest using one that has already been written for you.

http://en.wikipedia.org/wiki/String_searching_algorithm

One simple idea is to use a sliding window hash with a window of the same size as the search bar. Then, in one pass, you can quickly check where the window hash matches the hash of your search string. Where it fits, you double-check to see if you have a real match.

+1
source

Source: https://habr.com/ru/post/898583/


All Articles