Efficiently extract WikiData entities from text

I have many texts (millions), from 100 to 4000 words. Texts are formatted as written works with punctuation and grammar. Everything is in English.

The problem is simple: how do I extract every WikiData entity mentioned in a given text?

An entity is defined as any noun, proper or common: i.e., the names of people, organizations, and places, as well as things such as stools, potatoes, etc.

So far I have tried the following:

It works, but I feel I can do better. One obvious improvement would be to cache the relevant parts of WikiData locally, which I plan to do. However, before I do this, I want to check if there are other solutions.
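To illustrate the caching idea mentioned above, here is a minimal, hypothetical sketch: a local dictionary mapping lowercased WikiData labels to QIDs, matched greedily against word n-grams of the text. The labels and QIDs shown are hand-picked examples, and a real cache would be built from a WikiData dump rather than hard-coded.

```python
# Hypothetical sketch: match candidate phrases against a local cache of
# WikiData labels (label -> QID). In practice the cache would be built
# from a WikiData dump; the entries below are illustrative only.
label_cache = {
    "potato": "Q10998",
    "stool": "Q1064858",
    "barack obama": "Q76",
}

def extract_entities(text, cache):
    """Greedy longest-match lookup of word n-grams in the label cache."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    found, i = [], 0
    while i < len(words):
        match = None
        for n in range(min(3, len(words) - i), 0, -1):  # try longer phrases first
            phrase = " ".join(words[i:i + n])
            if phrase in cache:
                match = (phrase, cache[phrase], n)
                break
        if match:
            found.append((match[0], match[1]))
            i += match[2]
        else:
            i += 1
    return found

print(extract_entities("Barack Obama sat on a stool eating a potato.", label_cache))
# → [('barack obama', 'Q76'), ('stool', 'Q1064858'), ('potato', 'Q10998')]
```

A dictionary like this broadcasts well in Spark (e.g. as a broadcast variable), which is one reason a local cache tends to beat per-document API calls at this scale.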

Suggestions?

I tagged the question with Scala because I use Spark for the task.

1 answer

Some suggestions:

  • compare Stanford NER with OpenNLP to see which performs better on your texts.
  • most entity names are ambiguous, so you will likely need entity disambiguation (linking), not just recognition.
  • I suspect you may lose information by splitting the task into separate steps.
  • although Wikidata is new, the task is not, so look at the literature on Freebase / DBpedia / Wikipedia entity recognition and disambiguation.
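On the disambiguation point: a common baseline is to score each candidate entity by the word overlap between the mention's context and the candidate's description. The sketch below uses made-up candidate data for the ambiguous surface form "jaguar"; real systems use richer features (link probability, embeddings), but the shape of the problem is the same.

```python
# Hypothetical sketch: disambiguate an ambiguous surface form by word
# overlap between the local context and each candidate's description.
# The QIDs are real WikiData-style IDs but the descriptions are made up.
candidates = {
    "jaguar": [
        ("Q35694", "large cat native to the americas"),
        ("Q26742", "british luxury car manufacturer"),
    ],
}

def disambiguate(surface, context, candidates):
    """Return the QID whose description shares the most words with the context."""
    ctx = set(context.lower().split())
    def overlap(item):
        _qid, desc = item
        return len(ctx & set(desc.split()))
    return max(candidates[surface], key=overlap)[0]

print(disambiguate("jaguar", "a fast car made by a british manufacturer", candidates))
# → Q26742
```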

In particular, DBpedia Spotlight is one system designed specifically for this task.
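DBpedia Spotlight exposes an `/annotate` HTTP endpoint that returns JSON with a `Resources` array of recognized mentions. The sketch below parses a hand-written, abbreviated sample of that response shape (real responses carry additional fields such as similarity scores); it makes no network call, so treat the exact sample values as illustrative.

```python
import json

# Hedged sketch: parse the kind of JSON DBpedia Spotlight's /annotate
# endpoint returns. The sample below is abbreviated and hand-written.
sample = json.loads("""
{
  "@text": "Berlin is the capital of Germany.",
  "Resources": [
    {"@URI": "http://dbpedia.org/resource/Berlin", "@surfaceForm": "Berlin", "@offset": "0"},
    {"@URI": "http://dbpedia.org/resource/Germany", "@surfaceForm": "Germany", "@offset": "25"}
  ]
}
""")

def annotations(response):
    """Extract (surface form, resource URI) pairs from a Spotlight-style response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in response.get("Resources", [])]

print(annotations(sample))
# → [('Berlin', 'http://dbpedia.org/resource/Berlin'), ('Germany', 'http://dbpedia.org/resource/Germany')]
```

Since the URIs are DBpedia resources, mapping to WikiData QIDs needs one extra step via the `owl:sameAs` links DBpedia publishes.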

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf


Source: https://habr.com/ru/post/1242255/
