How to Optimize Lucene.Net Indexing

I need to index about 10 GB of data. Each of my “documents” is quite small: think basic product information, about 20 data fields of just a few words each. Only 1 column is indexed; the rest are stored. I read the data from text files, so that part is pretty fast.

The current indexing speed is only about 40 MB per hour. I have heard other people say they achieved speeds 100 times faster than this. For small files (about 20 MB), indexing is pretty fast (5 minutes). However, when I loop through all my data files (about 50 files totaling 10 GB), index growth seems to slow down a lot over time. Any ideas on how I can speed up indexing, or what indexing speed I should expect?
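
For scale, the figures above work out as follows. This is only a rough calculation, assuming 1 GB = 1024 MB and a constant indexing rate:

```latex
\[
\frac{10 \times 1024\ \text{MB}}{40\ \text{MB/h}} = 256\ \text{h} \approx 10.7\ \text{days},
\qquad
\frac{256\ \text{h}}{100} = 2.56\ \text{h}
\]
\[
\frac{20\ \text{MB}}{5\ \text{min}} = 240\ \text{MB/h} = 6 \times 40\ \text{MB/h}
\]
```

So the small-file rate is already about six times the overall rate, which is consistent with indexing slowing down as the index grows.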

On a side note, I noticed that the API in the .NET port does not seem to contain all the same methods as the Java original ...

Update: here are the C# indexing code snippets. First I set up the index writer:

            directory = FSDirectory.GetDirectory(@txtIndexFolder.Text, true);
            iwriter = new IndexWriter(directory, analyzer, true);
            iwriter.SetMaxFieldLength(25000);
            iwriter.SetMergeFactor(1000);
            iwriter.SetMaxBufferedDocs(Convert.ToInt16(txtBuffer.Text));

Then I read from the tab-delimited data file:

    using (System.IO.TextReader tr = System.IO.File.OpenText(File))
    {
        string line;
        while ((line = tr.ReadLine()) != null)
        {
            string[] items = line.Split('\t');

Then create the fields and add the document to the index:

                fldName = new Field("Name", items[4], Field.Store.YES, Field.Index.NO);
                doc.Add(fldName);
                fldUPC = new Field("UPC", items[10], Field.Store.YES, Field.Index.NO);
                doc.Add(fldUPC);
                string Contents = items[4] + " " + items[5] + " " + items[9] + " " + items[10] + " "  + items[11] + " " + items[23] + " " + items[24];
                fldContents = new Field("Contents", Contents, Field.Store.NO, Field.Index.TOKENIZED);
                doc.Add(fldContents);
                ...
                iwriter.AddDocument(doc);
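
The snippet above reads `items[24]`, so every line needs at least 25 tab-separated columns. A minimal sketch of building the `Contents` string with that check is shown below. `ProductLine` and `BuildContents` are hypothetical names introduced here for illustration; the column positions are copied from the concatenation above, and the code does not touch Lucene.Net at all.

```csharp
using System;

static class ProductLine
{
    // Column positions copied from the Contents concatenation in the question.
    static readonly int[] ContentColumns = { 4, 5, 9, 10, 11, 23, 24 };

    // Returns the space-joined searchable text for one tab-delimited line,
    // or null when the line has too few columns to be indexed safely.
    public static string BuildContents(string line)
    {
        string[] items = line.Split('\t');
        if (items.Length <= 24)
        {
            return null;
        }

        string[] parts = new string[ContentColumns.Length];
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = items[ContentColumns[i]];
        }
        return string.Join(" ", parts);
    }
}
```

Inside the read loop this could be called as `string contents = ProductLine.BuildContents(line);`, skipping the line when the result is null. It makes the loop tolerant of malformed lines; it is not a speed optimization by itself.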

After everything has been indexed:

    iwriter.Optimize();
    iwriter.Close();

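It turned out that I was using a 3-year-old version of Lucene. After I switched to a current build of Lucene and used the new DLL, the problem went away.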

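Here is how to get the latest Lucene.NET build. Use Subversion (for example, TortoiseSVN) to check out the Lucene.NET source from the Apache SVN repository. The solution is for Visual Studio 2005 and .NET 2.0; Visual Studio 2008 can convert it. Build it, then take the Lucene.Net DLL from the bin folder.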


Now, 3 years later, you can use the NuGet Package Manager in Visual Studio to install Lucene.NET, which gives you the current DLL.


Source: https://habr.com/ru/post/1768556/

