Normalizing SOLR records for sharding: issues with _version_

As part of my DSpace instance, I have a SOLR repository containing 12 million usage statistics records. Some records were migrated through several SOLR updates and do not match the current schema: 5 million of these entries do not have the unique id field specified in my schema.

The DSpace system provides a mechanism to move old usage statistics records into a separate Solr shard (one core per year), using the following code.

DSPACE SHARDING LOGIC:

for (File tempCsv : filesToUpload) {
    //Upload the data in the csv files to our new solr core
    ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
    contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
    contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");
    statisticsYearServer.request(contentStreamUpdateRequest);
}
statisticsYearServer.commit(true, true);

When I tried to start this process, I received an error message for each entry that did not have a unique id field, and 5 million entries were deleted by the process.

I tried re-adding these 5 million entries to force a unique id field onto each one. Here is the code I ran for this update; the myQuery query selects several thousand records at a time.

MY RECORD REPAIR PROCESS:

ArrayList<SolrInputDocument> idocs = new ArrayList<SolrInputDocument>();
SolrQuery sq = new SolrQuery();
sq.setQuery(myQuery);
sq.setRows(MAX);
sq.setSort("time", ORDER.asc);
QueryResponse resp = server.query(sq);
SolrDocumentList list = resp.getResults();
if (list.size() > 0) {
    for (int i = 0; i < list.size(); i++) {
        SolrDocument doc = list.get(i);
        SolrInputDocument idoc = ClientUtils.toSolrInputDocument(doc);
        idocs.add(idoc);
    }
}
server.add(idocs);
server.commit(true, true);
server.deleteByQuery(myQuery);
server.commit(true, true);

After running this process, all entries in the repository have a unique identifier. The entries I touched now also have a _version_ field.

When I try to restart the sharding process shown above, an error occurs related to the value of the _version_ field, and the process terminates. If I try to set the _version_ field explicitly, I get the same error.

Here is the error message I encounter when invoking the shard process:

Exception: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)

My goal is to restore my records so that I can start the shard process provided by DSpace. Can you recommend any additional actions that I should take to restore these records?

3 answers

The sharding code in SolrLogger copies the entries to a new, empty core. The problem is that DSpace usage statistics documents created since approximately DSpace 3 contain the _version_ field, and this field is included in the copy during sharding.

When documents containing the _version_ field are added to a Solr index, this triggers Solr's optimistic concurrency check, which verifies whether a document with the same unique identifier already exists in the index. The logic goes something like this (see http://yonik.com/solr/optimistic-concurrency/ ):

  • _version_ > 1: The document version must match exactly
  • _version_ = 1: The document must exist
  • _version_ < 0: The document must not exist
  • _version_ = 0: Don't care (normal overwrite if it exists)

Usage statistics documents containing _version_ > 1 therefore force Solr to look for an existing document with the same unique identifier in the newly created yearly shard; no such document exists there yet, so the result is a version conflict.
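To make the failure mode concrete, here is a minimal SolrJ sketch (the core URL is a placeholder and the uid value is taken from the error above; this illustrates the rule, it is not DSpace code): adding a document whose _version_ is a positive number to a core that does not yet contain that uid fails with exactly this conflict, while dropping the field lets the add succeed.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class VersionConflictDemo {
    public static void main(String[] args) throws Exception {
        // placeholder URL for an empty yearly statistics core
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/statistics-2012");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("uid", "e8b7ba64-8c1e-4963-8bcb-f36b33216d69");
        // _version_ > 1: Solr requires a document with this uid to already exist at
        // exactly this version, so on an empty core the add is rejected ("actual=-1")
        doc.addField("_version_", 1484794833191043072L);
        try {
            solr.add(doc);
        } catch (Exception conflict) {
            System.out.println("rejected: " + conflict.getMessage());
        }

        // without the field (or with _version_ = 0) the check is skipped and the
        // document is simply added/overwritten
        doc.removeField("_version_");
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}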

The copy process during sharding creates temporary CSV files, which are then imported into the new core. Fortunately, the Solr CSV update handler can exclude certain fields from the import using the skip parameter: https://wiki.apache.org/solr/UpdateCSV#skip

How to change the sharding code:

 //Upload the data in the csv files to our new solr core
 ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
 contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
+contentStreamUpdateRequest.setParam("skip", "_version_");
 contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
 contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");

This skips the _version_ field, which in turn disables the optimistic concurrency check.
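The same principle can be applied from SolrJ if you re-add documents yourself, as in the repair loop from the question: drop the field from each copied SolrInputDocument before adding it. A minimal sketch, reusing the server, myQuery and MAX names from the question (assumptions, not a DSpace API):

SolrQuery sq = new SolrQuery(myQuery);
sq.setRows(MAX);
for (SolrDocument doc : server.query(sq).getResults()) {
    SolrInputDocument idoc = ClientUtils.toSolrInputDocument(doc);
    // drop the stored version so Solr treats this as a plain add/overwrite
    idoc.removeField("_version_");
    server.add(idoc);
}
server.commit(true, true);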

This is discussed at https://jira.duraspace.org/browse/DS-2212 with a pull request at https://github.com/DSpace/DSpace/pull/893 ; hopefully this will be included in DSpace 5.2.


Fixing the csv files would be easier.

Try adding the id into the csv by inserting a method call that does this before the upload loop shown in the first snippet:

FileUtils.copyInputStreamToFile(csvInputstream, csvFile);
// <-- a method call to a function that opens the csv file again and adds the required identifier to each line
filesToUpload.add(csvFile);
//Add 10000 and start again
yearQueryParams.put(CommonParams.START, String.valueOf((i + 10000)));
}

for (File tempCsv : filesToUpload) {
    (...)
}
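A minimal sketch of such a helper, assuming the uid is the first CSV column and the first line is a header (the addMissingIds name is made up, and the real DSpace export has many more columns):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// hypothetical helper: give every data row with an empty id column a generated UUID
static void addMissingIds(File csvFile) throws IOException {
    List<String> lines = Files.readAllLines(csvFile.toPath(), StandardCharsets.UTF_8);
    List<String> fixed = new ArrayList<String>(lines.size());
    for (int i = 0; i < lines.size(); i++) {
        String line = lines.get(i);
        if (i > 0 && line.startsWith(",")) {
            // empty first (uid) column -> prepend a fresh identifier
            fixed.add(UUID.randomUUID().toString() + line);
        } else {
            // the header and rows that already have an id stay unchanged
            fixed.add(line);
        }
    }
    Files.write(csvFile.toPath(), fixed, StandardCharsets.UTF_8);
}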


I tried to upgrade from 1.8.3 to 4.2 with 4 million entries, all missing uid and version. I wrote a script to read from Solr (in batches of 10,000), write copies, and finally delete the originals. The results looked good until I tried to shard, when I saw the same problem reported here.

The CSV files contain the correct version numbers. The exception reported was:

Exception: version conflict for 38dbd4db-240e-4c9b-a927-271fee5db750 expected=1490271991641407488 actual=-1
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for 38dbd4db-240e-4c9b-a927-271fee5db750 expected=1490271991641407488 actual=-1

The first entry in temp/temp.2012.0.csv starts with:

38dbd4db-240e-4c9b-a927-271fee5db750,1490271991641407488, ...

