Java: counting the occurrences of a word in a huge text file

I have a 115 MB text file containing about 20 million words. I need to treat this file as a word collection and look up how many times each user-supplied word occurs in it. This is a small part of my project. I need a fast and accurate way to determine the number of occurrences of given words, since I will call it repeatedly in a loop. Is there an API I can use, or some other way to perform this task faster? Any recommendations are appreciated.

+3
1 answer

This kind of thing is usually implemented with Lucene, especially if you intend to run your application repeatedly or do not have a lot of memory. Lucene also supports many other goodies.

However, if you want to "roll your own" code and you have enough memory (possibly 1 GB), your application could:

  • parse the file into a sequence of words,
  • filter out stop words,
  • create a "reverse index" (an inverted index) such as a HashMap<String, List<Integer>>, where the String keys are the unique words and the List<Integer> values give the offsets of each word's occurrences in the file.
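
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the class name WordIndex, the tiny stop-word list, the regex tokenizer, and the choice to record word positions (rather than byte offsets) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class: an in-memory index from word to word positions.
final class WordIndex {
    // Illustrative stop-word list (an assumption); use one suited to your text.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    private final Map<String, List<Integer>> index = new HashMap<>();
    private int position = 0; // running word position across the whole input

    /** Builds the index line by line; I/O errors surface as UncheckedIOException. */
    static WordIndex build(BufferedReader reader) {
        WordIndex result = new WordIndex();
        reader.lines().forEach(result::addLine);
        return result;
    }

    /** Splits one line into words and records the position of each kept word. */
    private void addLine(String line) {
        // Naive tokenizer: any run of characters that are not letters or digits separates words.
        for (String token : line.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the line starts with a separator
            }
            String word = token.toLowerCase(Locale.ROOT);
            int offset = position++; // stop words still advance the position
            if (STOP_WORDS.contains(word)) {
                continue;
            }
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(offset);
        }
    }

    /** Positions at which the word occurs, or an empty list if it is absent. */
    List<Integer> offsets(String word) {
        List<Integer> hits = index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
        return Collections.unmodifiableList(hits);
    }

    /** Number of occurrences of the word (case-insensitive). */
    int count(String word) {
        return offsets(word).size();
    }
}
```

The index is built once, for example from Files.newBufferedReader(Path.of("words.txt")) (the file name is hypothetical), and each later count(word) call is a single hash lookup. If only the counts matter and the positions are never used, a HashMap<String, Integer> updated with merge(word, 1, Integer::sum) needs far less memory than one boxed Integer per occurrence.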

+3

Source: https://habr.com/ru/post/1790884/

