Indexing Texts with Many Numbers in Lucene

Question

Lucene creates a term for each number in the text. Sample text:

I got 2295910 unique terms.

The numbers can be timestamps, port numbers, anything. Unique numbers lead to a very large number of unique terms. It is undesirable to have as many unique terms as documents. Lucene's memory usage grows with the number of unique terms.

Is there a special analyzer or trick for texts with numbers? StandardAnalyzer creates a term for each unique number.

Requirements:

The numbers should remain searchable. A document may contain several numbers. Memory usage is the problem: I have 800M documents in multiple index directories, and the memory usage forces me to close recently used IndexSearchers.

Unconfirmed ideas:

Use a special analyzer. It would split numbers into pieces: 123456 would become "123 456". A query parser would then use a phrase search to find a number.
Modify the Lucene code to use a larger termInfosIndexDivisor when visiting numeric terms.
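
The first idea can be sketched in plain Java, without touching any Lucene API. This is only an illustration of the chunking step and of the phrase a query parser might build from it; the class NumberChunker, its methods, and the chunk size of 3 are hypothetical names and choices, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class NumberChunker {

    // Hypothetical helper: splits a digit string into chunks of at most
    // `size` digits, working from left to right.
    public static List<String> chunk(String digits, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < digits.length(); i += size) {
            int end = Math.min(i + size, digits.length());
            parts.add(digits.substring(i, end));
        }
        return parts;
    }

    // Hypothetical helper: builds the quoted phrase that a query parser
    // could search for instead of the single numeric term.
    public static String toPhrase(String digits, int size) {
        return "\"" + String.join(" ", chunk(digits, size)) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(chunk("123456", 3));     // [123, 456]
        System.out.println(toPhrase("2295910", 3)); // "229 591 0"
    }
}
```

With chunks of at most three digits there can be at most 10 + 100 + 1000 = 1110 distinct chunk terms, however many unique numbers the documents contain. The trade-off is precision: a phrase search for "123 456" would also match a longer number such as 1234567, which is chunked as "123 456 7", so some boundary marking would probably be needed.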

Maybe I'm reinventing the wheel. Has anyone already solved this?

+3

indexing lucene

Ivo Danihelka Jan 19 '11 at 9:47

3 answers

+1

Xodarap 19 . '11 15:15

Do you need to search on these terms? If you do, then this is just the nature of your search index. There are a few tricks available if you don't need exact values (for example, if range searches are enough), but if you need exact matches, you are stuck with this.

If you do not need to search for these terms, why index them?

+1

rfeak Jan 19 '11 at 16:01

bajafresh4life · Accepted Answer · 2011-01-19T14:33:27+0000
