How to do parallel indexing with AzureDirectory and Lucene.net?

I am using Lucene.net 3.0.3 and AzureDirectory 2.0.4937.26631, installed via the Lucene.Net.Store.Azure NuGet package.

The description of the azuredirectory.codeplex.com project reads: "To be more specific: you can have 1..N worker roles adding documents to the index and 1..N web searchers searching the catalog in near real time." That suggests it is possible for multiple worker roles to write to the index simultaneously. However, when I try to do this, I get many "Lock obtain timed out: AzureLock@write.lock" exceptions.

My code follows the example given in the AzureDirectory documentation ( azuredirectory.codeplex.com/documentation ). Simplified for this question, it is approximately:

```csharp
var dbEntities = // load database entities here
var docFactory = // create the class that builds Lucene documents from the entities
var account =    // get the CloudStorageAccount
var directory = new AzureDirectory(account, "<my container name>");

using (var writer = new IndexWriter(directory,
    new StandardAnalyzer(Version.LUCENE_30),
    createEvenIfExists,
    IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var entity in dbEntities)
    {
        writer.AddDocument(docFactory.CreateDocument(entity));
    }
}
```

When run sequentially, this code works fine. However, if I run the same code in parallel on multiple threads/workers, I get many exceptions like:

```
Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: AzureLock@write.lock
   at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout) in d:\Lucene.Net\FullRepo\trunk\src\core\Store\Lock.cs:line 83
   at Lucene.Net.Index.IndexWriter.Init(Directory d, Analyzer a, Boolean create, IndexDeletionPolicy deletionPolicy, Int32 maxFieldLength, IndexingChain indexingChain, IndexCommit commit) in d:\Lucene.Net\FullRepo\trunk\src\core\Index\IndexWriter.cs:line 1228
   at Lucene.Net.Index.IndexWriter..ctor(Directory d, Analyzer a, Boolean create, MaxFieldLength mfl) in d:\Lucene.Net\FullRepo\trunk\src\core\Index\IndexWriter.cs:line 1018
```

I understand that a "write.lock" blob is created in blob storage, and that the lock is considered held while that blob carries its lock marker text. From what I have read, other users' problems were caused by write.lock not being cleared. That does not appear to be my problem, since the same code works correctly when executed sequentially, and the lock file is cleared in that case.

I see in the AzureDirectory documentation ( azuredirectory.codeplex.com/documentation ): "An index can only be updated by one process at a time, so it makes sense to push all Add/Update/Delete operations through an indexing role." However, this does not quite make sense, since any worker role you create should have several instances, and with several instances the index would be written to in parallel. In addition, the project description explicitly states that "you can have 1..N worker roles adding documents to the index". Note that this says "the index", not a shard of the index.

Question:

So, is the project description simply wrong? Or is there some way to have multiple indexers adding to the index in parallel? I do not see anything in the API that allows this. If possible, please provide a code snippet showing how to use AzureDirectory to "have 1..N worker roles adding documents to the index".

+4
1 answer

The most effective way I have found to do this:

1) Use the producer/consumer design pattern.

  • you can have any number of producer threads/tasks reading entities from the database
  • you can have any number of consumer threads/tasks, each with its own individual writer writing to its own index
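
As a rough sketch of this producer/consumer arrangement (a hedged example, not code from the answer: `Entity`, `LoadEntities`, `docFactory`, and `workPath` are hypothetical placeholders for your own data-access code; each consumer writes to a private local FSDirectory, so no write.lock is ever shared):

```csharp
// Assumes the usual usings: System.Collections.Concurrent, System.IO, System.Linq,
// System.Threading.Tasks, Lucene.Net.Analysis.Standard, Lucene.Net.Index,
// Lucene.Net.Store, and Version = Lucene.Net.Util.Version.

var queue = new BlockingCollection<Entity>(boundedCapacity: 1000);

// Producer(s): read entities from the database and feed the queue.
var producer = Task.Run(() =>
{
    foreach (var entity in LoadEntities())   // hypothetical data-access call
        queue.Add(entity);
    queue.CompleteAdding();
});

// Consumers: each one owns its OWN IndexWriter over its OWN local directory,
// so there is no contention on a shared write.lock.
var consumers = Enumerable.Range(0, 4).Select(i => Task.Run(() =>
{
    var dir = FSDirectory.Open(new DirectoryInfo(Path.Combine(workPath, "index-" + i)));
    using (var writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        foreach (var entity in queue.GetConsumingEnumerable())
            writer.AddDocument(docFactory.CreateDocument(entity));
    }
})).ToArray();

Task.WaitAll(consumers);
producer.Wait();
```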

2) For large indexes, have the producer/consumer pattern build separate indexes. For example, if I have 4 writers, I build 4 indexes, and then use the Lucene API to merge them.
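
The merge step can look roughly like this (a sketch assuming the 4 partial indexes were written to local folders `index-0` .. `index-3` under a hypothetical `workPath`; `AddIndexesNoOptimize` is the Lucene.Net 3.0.3 API for merging other directories into a writer's index):

```csharp
// Merge the 4 partial indexes into one final on-disk index.
// Assumes usings for System.IO, System.Linq, Lucene.Net.Analysis.Standard,
// Lucene.Net.Index, Lucene.Net.Store, and Version = Lucene.Net.Util.Version.

var mergedDir = FSDirectory.Open(new DirectoryInfo(Path.Combine(workPath, "merged")));
using (var writer = new IndexWriter(mergedDir,
    new StandardAnalyzer(Version.LUCENE_30),
    true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var parts = Enumerable.Range(0, 4)
        .Select(i => (Directory)FSDirectory.Open(
            new DirectoryInfo(Path.Combine(workPath, "index-" + i))))
        .ToArray();

    writer.AddIndexesNoOptimize(parts); // pulls the segments of all partial indexes in
    writer.Optimize();                  // optional: collapse to fewer segments
}
```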

3) After that, you have a good index on your local disk. The final step for using AzureDirectory is to use Lucene's Directory.Copy command, which copies your index from the FSDirectory (local disk) to the AzureDirectory.
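
That final upload step can be sketched as follows (reusing the hypothetical `workPath` and the `account`/container from the question's snippet; `Directory.Copy(src, dest, closeDirSrc)` is the static helper in Lucene.Net 3.0.3):

```csharp
// Copy the finished on-disk index up to blob storage via AzureDirectory.
// Assumes usings for System.IO and Lucene.Net.Store.

var fsDir = FSDirectory.Open(new DirectoryInfo(Path.Combine(workPath, "merged")));
var azureDir = new AzureDirectory(account, "<my container name>");

// Copies every file in fsDir to azureDir; the final 'true' closes the
// source directory when the copy completes.
Lucene.Net.Store.Directory.Copy(fsDir, azureDir, true);
```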

  • This is important because AzureDirectory internally uses metadata properties on the Azure blobs to determine the "last modified fingerprint" of the index
  • AzureDirectory also compresses the index files before uploading them. That is another reason I like the local-disk step before pushing to Azure blob storage: I can compress from disk using parallel streams. I modified my AzureDirectory implementation because the stock one does everything in memory, which is bad for a 20 GB index :)

I used this for an IaaS/PaaS offering in Azure and it works great. Keep in mind (as I have mentioned in previous posts) that AzureDirectory, in my opinion, is not ready for "enterprise" or serious production use. Things such as network retries, uploading large indexes, and compressing large indexes need to be resolved before I could call it a finished product. If you can use the Azure IaaS offering, you don't need AzureDirectory at all: just use a vanilla FSDirectory to build and serve your indexes.

+1

Source: https://habr.com/ru/post/1499175/
