Efficiently extract WikiData entities from text

I have many texts (millions), from 100 to 4000 words. Texts are formatted as written works with punctuation and grammar. Everything is in English.

The problem is simple: how do I extract every WikiData entity mentioned in a given text?

An entity is defined as any noun, proper or common: i.e., the names of people, organizations, and places, as well as things such as stools, potatoes, etc.

So far I have tried the following:

It works, but I feel I can do better. One obvious improvement would be to cache the relevant parts of WikiData locally, which I plan to do. However, before I do this, I want to check if there are other solutions.
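To illustrate the caching idea mentioned above, here is a minimal, hypothetical sketch: a local dictionary mapping lowercased WikiData labels to QIDs, matched greedily against word n-grams of the text. The labels and QIDs shown are hand-picked examples, and a real cache would be built from a WikiData dump rather than hard-coded.

```python
# Hypothetical sketch: match candidate phrases against a local cache of
# WikiData labels (label -> QID). In practice the cache would be built
# from a WikiData dump; the entries below are illustrative only.
label_cache = {
    "potato": "Q10998",
    "stool": "Q1064858",
    "barack obama": "Q76",
}

def extract_entities(text, cache):
    """Greedy longest-match lookup of word n-grams in the label cache."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    found, i = [], 0
    while i < len(words):
        match = None
        for n in range(min(3, len(words) - i), 0, -1):  # try longer phrases first
            phrase = " ".join(words[i:i + n])
            if phrase in cache:
                match = (phrase, cache[phrase], n)
                break
        if match:
            found.append((match[0], match[1]))
            i += match[2]
        else:
            i += 1
    return found

print(extract_entities("Barack Obama sat on a stool eating a potato.", label_cache))
# → [('barack obama', 'Q76'), ('stool', 'Q1064858'), ('potato', 'Q10998')]
```

A dictionary like this broadcasts well in Spark (e.g. as a broadcast variable), which is one reason a local cache tends to beat per-document API calls at this scale.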

Suggestions?

I tagged the question with Scala because I use Spark for the task.

1 answer

Some suggestions:

  • compare Stanford NER with OpenNLP to see which performs better on your texts.
  • most entity names are ambiguous, so you will likely need entity disambiguation (linking), not just recognition.
  • I suspect you may lose information by splitting the task into separate steps.
  • although Wikidata is new, the task is not, so look at the literature on Freebase / DBpedia / Wikipedia entity recognition and disambiguation.
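On the disambiguation point: a common baseline is to score each candidate entity by the word overlap between the mention's context and the candidate's description. The sketch below uses made-up candidate data for the ambiguous surface form "jaguar"; real systems use richer features (link probability, embeddings), but the shape of the problem is the same.

```python
# Hypothetical sketch: disambiguate an ambiguous surface form by word
# overlap between the local context and each candidate's description.
# The QIDs are real WikiData-style IDs but the descriptions are made up.
candidates = {
    "jaguar": [
        ("Q35694", "large cat native to the americas"),
        ("Q26742", "british luxury car manufacturer"),
    ],
}

def disambiguate(surface, context, candidates):
    """Return the QID whose description shares the most words with the context."""
    ctx = set(context.lower().split())
    def overlap(item):
        _qid, desc = item
        return len(ctx & set(desc.split()))
    return max(candidates[surface], key=overlap)[0]

print(disambiguate("jaguar", "a fast car made by a british manufacturer", candidates))
# → Q26742
```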

In particular, DBpedia Spotlight is one system designed specifically for this task.
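DBpedia Spotlight exposes an `/annotate` HTTP endpoint that returns JSON with a `Resources` array of recognized mentions. The sketch below parses a hand-written, abbreviated sample of that response shape (real responses carry additional fields such as similarity scores); it makes no network call, so treat the exact sample values as illustrative.

```python
import json

# Hedged sketch: parse the kind of JSON DBpedia Spotlight's /annotate
# endpoint returns. The sample below is abbreviated and hand-written.
sample = json.loads("""
{
  "@text": "Berlin is the capital of Germany.",
  "Resources": [
    {"@URI": "http://dbpedia.org/resource/Berlin", "@surfaceForm": "Berlin", "@offset": "0"},
    {"@URI": "http://dbpedia.org/resource/Germany", "@surfaceForm": "Germany", "@offset": "25"}
  ]
}
""")

def annotations(response):
    """Extract (surface form, resource URI) pairs from a Spotlight-style response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in response.get("Resources", [])]

print(annotations(sample))
# → [('Berlin', 'http://dbpedia.org/resource/Berlin'), ('Germany', 'http://dbpedia.org/resource/Germany')]
```

Since the URIs are DBpedia resources, mapping to WikiData QIDs needs one extra step via the `owl:sameAs` links DBpedia publishes.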

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf


Source: https://habr.com/ru/post/1242255/
