Indexing a Wikipedia dump into elasticsearch fails with the error "XML document structures must start and end within the same entity"

I want to index Wikipedia in elasticsearch.

I tried stream2es with elasticsearch 2.0.0, and the Wikipedia River Plugin 2.6.0 with elasticsearch 1.6.0, to index the latest Wikipedia dump: https://dumps.wikimedia.org/enwiki/20151102/enwiki-20151102-pages-articles-multistream.xml.bz2

However, both failed with the same error message:

XML document structures must start and end within the same entity.
1 answer

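Instead of parsing the XML dump, you can use the dumps of the elasticsearch indices that wikimedia provides.

These dumps are in the format of the bulk API of elasticsearch. They are JSON, so they can be loaded into elasticsearch directly.

The steps are as follows: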

  • Fetch the index mapping: curl 'https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&format=json' > mapping.json
  • Create the index in elasticsearch: jq .content < mapping.json | curl -XPUT localhost:9200/enwiki_content --data @-
  • Load the dump: zcat enwiki-20151116-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @- > /dev/null'
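
A note on the -L 2 and -N 2000 options in the last command, as I read them: the bulk format puts every document on two lines (an action line followed by the document itself), so the stream has to be cut on two-line boundaries, and each curl call then receives 2000 such records. The sketch below shows the same idea on a tiny scale with plain split instead of parallel. The three sample records and the chunk size of 2 are made up for illustration, and nothing is sent to elasticsearch.

```shell
#!/bin/sh
# Sketch: split a bulk-format file on record boundaries (2 lines per record).
# The three sample records below are hypothetical, not taken from a real dump.
workdir=$(mktemp -d)

cat > "$workdir/sample.json" <<'EOF'
{"index":{"_type":"page","_id":"1"}}
{"title":"Hypothetical page one"}
{"index":{"_type":"page","_id":"2"}}
{"title":"Hypothetical page two"}
{"index":{"_type":"page","_id":"3"}}
{"title":"Hypothetical page three"}
EOF

# 2 records per chunk = 4 lines per chunk, so no record is cut in half.
records_per_chunk=2
split -l $((records_per_chunk * 2)) "$workdir/sample.json" "$workdir/chunk_"

chunk_count=0
odd_chunks=0
for chunk in "$workdir"/chunk_*; do
  lines=$(wc -l < "$chunk")
  chunk_count=$((chunk_count + 1))
  # An odd line count would mean an action line lost its document line.
  if [ $((lines % 2)) -ne 0 ]; then
    odd_chunks=$((odd_chunks + 1))
  fi
  # Each chunk could now be sent on its own, for example:
  # curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @"$chunk"
done

echo "chunks: $chunk_count, chunks with an odd line count: $odd_chunks"
```

If a chunk ever ends up with an odd number of lines, the split size is not a multiple of 2 and the bulk request for that chunk would most likely be rejected.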

Source: https://habr.com/ru/post/1615246/

