See the Lucene documentation for its limitations. An index cannot contain more than:
- ~274 billion distinct terms,
- ~2.1 billion documents.
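As far as I recall from the older Lucene file-format documentation, both figures come from Lucene using a signed 32-bit Java int for document numbers and for the term index, with the term limit multiplied by the term index interval and applying per index segment. The interval of 128 below is my assumption (it was the default in older versions), so treat this as a rough sanity check of the numbers rather than an authoritative derivation:

```python
# Lucene addresses documents with a signed 32-bit Java int.
JAVA_INT_MAX = 2**31 - 1  # 2,147,483,647

# Assumption: term index interval of 128 (default in older Lucene versions).
TERM_INDEX_INTERVAL = 128

# Upper bound on document numbers in one index.
max_documents = JAVA_INT_MAX

# Upper bound on unique terms: one int-addressed term index entry
# per TERM_INDEX_INTERVAL terms.
max_terms = JAVA_INT_MAX * TERM_INDEX_INTERVAL

print(f"max documents: ~{max_documents / 1e9:.1f} billion")
print(f"max unique terms: ~{max_terms // 10**9} billion")
```

Newer Lucene versions may document slightly different limits, so check the documentation for the version you actually run.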
For such large datasets, it is generally recommended to use Lucene only for the inverted index and to store the actual document contents elsewhere. You can expect the index to be roughly 30% of the size of the original documents, provided they are ordinary text; documents containing program code, with its many unique terms, will produce a much larger index.
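For capacity planning, the 30% rule of thumb can be turned into a quick estimate. This is only a sketch: `estimate_index_size` is a hypothetical helper, not part of Lucene, and the default ratio is just the figure quoted above, so measure on a sample of your own data before relying on it.

```python
def estimate_index_size(corpus_bytes: int, ratio: float = 0.30) -> int:
    """Estimate index size in bytes from the raw corpus size.

    ratio is the rule-of-thumb index/corpus ratio for ordinary text;
    raise it for corpora with many unique terms, such as source code.
    """
    if corpus_bytes < 0:
        raise ValueError("corpus_bytes must be non-negative")
    return round(corpus_bytes * ratio)


# Example: a 2 TB corpus of ordinary text documents.
corpus = 2 * 10**12
print(f"estimated index size: {estimate_index_size(corpus) / 1e9:.0f} GB")
```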