Is it possible to avoid creating a term for each number in the text? Sample text:
I got 2295910 unique terms.
The numbers can be timestamps, port numbers, anything. Unique numbers lead to a very large number of unique terms. It does not seem right to have as many unique terms as documents. Lucene's memory usage grows with the number of unique terms.
Is there a special analyzer or trick for texts with numbers? StandardAnalyzer creates a term for each unique number.
Requirements:
- The numbers should remain searchable. A document may contain several numbers.
- Memory usage is a problem. I have 800M documents spread across multiple index directories.
- Memory pressure forces me to close recently used IndexSearchers.
Untested ideas:
- Use a special analyzer that splits numbers into pieces, so 123456 becomes "123 456". The query parser would then use a phrase search to find a number.
- Modify the Lucene code to use a larger termInfosIndexDivisor for numeric terms.
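
To make the first idea concrete, here is a minimal sketch of the splitting step in plain Java, without any Lucene dependency. The class `NumberChunker` and its methods are hypothetical names of mine, not part of Lucene. It assumes the text is rewritten before it reaches the analyzer and that the query side builds a quoted phrase from the same pieces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class: splits digit runs into short pieces
// so the index holds a small, bounded set of numeric terms.
public class NumberChunker {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Split one digit string into fixed-size pieces: "123456" -> ["123", "456"].
    static List<String> chunk(String number, int size) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < number.length(); i += size) {
            parts.add(number.substring(i, Math.min(i + size, number.length())));
        }
        return parts;
    }

    // Index side: replace every digit run in the text with its space-separated pieces.
    static String rewriteText(String text, int size) {
        Matcher m = DIGITS.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append(String.join(" ", chunk(m.group(), size)));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Query side: turn a number into a quoted phrase over the same pieces.
    static String toPhraseQuery(String number, int size) {
        return "\"" + String.join(" ", chunk(number, size)) + "\"";
    }
}
```

With 3-digit pieces there can be at most 10 + 100 + 1000 = 1110 distinct numeric terms, whatever the number of documents. The catch is that a phrase such as "123 456" would also match the start of a longer number like 123456789, or two separate numbers 123 and 456 that happen to be adjacent, so some extra marker for the end of a number would probably be needed.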
Maybe I'm reinventing the wheel. Has anyone already solved this?