I need to index about 10 GB of data. Each of my “documents” is quite small: basic product information in about 20 data fields, each just a few words. Only one field is indexed; the rest are only stored. I read the data from text files, so that part is pretty fast.
The current indexing speed is only about 40 MB per hour, and I have heard other people report speeds 100 times faster than that. For small files (about 20 MB), indexing is fairly quick (around 5 minutes). However, when I loop through all of my data files (about 50 files totaling 10 GB), the index growth seems to slow down a lot over time. Any ideas on how I can speed up indexing, or what a reasonable indexing speed would be?
On a side note, I noticed that the API of the .NET port does not seem to contain all the same methods as the original Java version ...
Update - as requested, here are the C# indexing code snippets. First, the IndexWriter setup:
directory = FSDirectory.GetDirectory(txtIndexFolder.Text, true);  // open the index folder, creating/overwriting it
iwriter = new IndexWriter(directory, analyzer, true);             // new index, using the analyzer chosen elsewhere
iwriter.SetMaxFieldLength(25000);
iwriter.SetMergeFactor(1000);                                     // number of segments that accumulate before a merge
iwriter.SetMaxBufferedDocs(Convert.ToInt16(txtBuffer.Text));      // docs buffered in RAM before a segment is flushed
Then I read from the tab-delimited data file:
using (System.IO.TextReader tr = System.IO.File.OpenText(File))
{
    string line;
    while ((line = tr.ReadLine()) != null)
    {
        string[] items = line.Split('\t');
Then I create the fields and add the document to the index:
fldName = new Field("Name", items[4], Field.Store.YES, Field.Index.NO);
doc.Add(fldName);
fldUPC = new Field("UPC", items[10], Field.Store.YES, Field.Index.NO);
doc.Add(fldUPC);
string Contents = items[4] + " " + items[5] + " " + items[9] + " " + items[10] + " " + items[11] + " " + items[23] + " " + items[24];
fldContents = new Field("Contents", Contents, Field.Store.NO, Field.Index.TOKENIZED);
doc.Add(fldContents);
...
iwriter.AddDocument(doc);
After everything has been indexed:
iwriter.Optimize();
iwriter.Close();
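To make the flow easier to follow, here is the same logic gathered into one self-contained sketch. The class and method names, the StandardAnalyzer, and the hard-coded buffer size are placeholders of mine, since the original code takes the index folder and buffer size from text boxes and does not show the analyzer type; I have also assumed the writer is created once and reused across all data files, as the loop above suggests.

using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

class ProductIndexer
{
    // Builds the index once, then feeds it every tab-delimited data file.
    // StandardAnalyzer and the buffer value of 1000 are assumptions.
    public static void IndexFiles(string indexFolder, IEnumerable<string> dataFiles)
    {
        FSDirectory directory = FSDirectory.GetDirectory(indexFolder, true);
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        iwriter.SetMaxFieldLength(25000);
        iwriter.SetMergeFactor(1000);
        iwriter.SetMaxBufferedDocs(1000);

        foreach (string dataFile in dataFiles)
        {
            using (TextReader tr = File.OpenText(dataFile))
            {
                string line;
                while ((line = tr.ReadLine()) != null)
                {
                    string[] items = line.Split('\t');

                    // One Lucene document per input line: stored-only fields for display,
                    // plus a single tokenized "Contents" field that gets searched.
                    Document doc = new Document();
                    doc.Add(new Field("Name", items[4], Field.Store.YES, Field.Index.NO));
                    doc.Add(new Field("UPC", items[10], Field.Store.YES, Field.Index.NO));
                    string contents = items[4] + " " + items[5] + " " + items[9] + " " +
                                      items[10] + " " + items[11] + " " + items[23] + " " + items[24];
                    doc.Add(new Field("Contents", contents, Field.Store.NO, Field.Index.TOKENIZED));

                    iwriter.AddDocument(doc);
                }
            }
        }

        iwriter.Optimize();
        iwriter.Close();
    }
}

This is only a consolidation of the snippets above, not a proposed fix for the slowdown.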