Indexing Texts with Many Numbers in Lucene

Is it possible to avoid creating a term for each number in the text? Sample text:

I got 2295910 unique terms.

The numbers can be timestamps, port numbers, anything. Unique numbers lead to a very large number of unique terms, and it is easy to end up with as many unique terms as documents. Lucene's memory usage grows with the number of unique terms.

Is there a special analyzer or trick for texts containing numbers? StandardAnalyzer creates a term for each unique number.

Requirements:

  • The numbers should remain searchable.
  • A document may contain several numbers.
  • Memory usage is a problem. I have 800M documents in multiple index directories, and memory pressure forces me to close recently used IndexSearchers.

Untested ideas:

  • Use a special analyzer that splits numbers into pieces, so that 123456 becomes "123 456". The query parser would then use a phrase search to find a number.
  • Modify the Lucene code to use a larger termInfosIndexDivisor when handling numeric terms.
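
To make the first idea concrete, here is a minimal sketch of the splitting step in plain Java. It is not Lucene code: the class NumberChunker and its methods are hypothetical names used only for illustration, and wiring this into an actual analyzer or query parser is left out. It assumes pieces are cut left to right with a fixed size of three digits, as in the 123456 example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): splits a digit string into fixed-size pieces. */
final class NumberChunker {
    private NumberChunker() {}

    /** Cuts the digits left to right into pieces of at most chunkSize characters. */
    static List<String> chunk(String digits, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i < digits.length(); i += chunkSize) {
            int end = Math.min(digits.length(), i + chunkSize);
            pieces.add(digits.substring(i, end));
        }
        return pieces;
    }

    /** Builds the quoted phrase a query parser could search for instead of the whole number. */
    static String toPhraseQuery(String digits, int chunkSize) {
        return "\"" + String.join(" ", chunk(digits, chunkSize)) + "\"";
    }
}
```

With three-digit pieces there can be at most 1000 + 100 + 10 = 1110 distinct piece terms, regardless of how many distinct numbers the documents contain. The trade-off is that a phrase such as "123 456" may also match a longer number whose pieces happen to line up (for example 000123456), or two separate numbers 123 and 456 that are adjacent in the text.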

Maybe I'm reinventing the wheel. Has anyone already solved this?

3 answers


Do you need to search on these terms? If you do, then this is just the nature of your search index. There are a few tricks you can use if you don't need to match exact values (for example, if range searches are enough), but if you need exact matches, you are stuck with this.
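
The answer does not say which tricks it has in mind. One possibility, shown here only as an assumption, is to reduce precision before indexing: if timestamps only need to be found by range, they can be truncated to a coarser bucket so that many distinct values collapse into one term. The class TimestampBucketer below is a hypothetical plain-Java sketch, not a Lucene API.

```java
/** Hypothetical helper (not part of Lucene): truncates timestamps to a coarser bucket. */
final class TimestampBucketer {
    private TimestampBucketer() {}

    /** Rounds epochSeconds down to the start of its bucket of bucketSeconds length. */
    static long truncate(long epochSeconds, long bucketSeconds) {
        if (bucketSeconds <= 0) {
            throw new IllegalArgumentException("bucketSeconds must be positive");
        }
        // floorDiv keeps the rounding direction consistent for negative values too.
        return Math.floorDiv(epochSeconds, bucketSeconds) * bucketSeconds;
    }

    /** The string that would be indexed as a term in place of the exact timestamp. */
    static String toTerm(long epochSeconds, long bucketSeconds) {
        return Long.toString(truncate(epochSeconds, bucketSeconds));
    }
}
```

With a bucket of 3600 seconds, every timestamp within the same hour produces the same term, so the number of unique terms is bounded by the number of hours covered rather than the number of distinct timestamps. Exact-value search is lost, which is why this only helps in the case the answer describes.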

If you don't need to search on these terms, why index them at all?


Source: https://habr.com/ru/post/1786352/

