I have a large number of texts (millions), each between 100 and 4000 words long. The texts are formatted as written works, with proper punctuation and grammar. Everything is in English.
The problem is simple: how do I extract every WikiData entity from a given text?
An entity is defined as every noun, proper or common, i.e. the names of people, organizations, places, and things such as stools, potatoes, etc.
So far I have tried the following:
It works, but I feel I can do better. One obvious improvement would be to cache the relevant parts of WikiData locally, which I plan to do. However, before I do this, I want to check if there are other solutions.
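To make the local-cache idea concrete, here is a minimal sketch of what I have in mind, assuming the relevant WikiData labels have already been loaded into an in-memory map. The object name `LocalEntityLookup`, the sample map, and the crude tokenizer are all made up for illustration; `Q-PLACEHOLDER` is not a real ID, and a real cache would be built from a WikiData dump.

```scala
object LocalEntityLookup {
  // Hypothetical local cache: lower-cased label -> WikiData ID.
  // In practice this would be built once from a WikiData dump.
  val labelToId: Map[String, String] = Map(
    "douglas adams" -> "Q42",
    "london"        -> "Q84",
    "potato"        -> "Q-PLACEHOLDER" // placeholder, not a real ID
  )

  // Longest label, measured in tokens, so the scan knows how far to look ahead.
  val maxLabelTokens: Int = labelToId.keys.map(_.split(" ").length).max

  // Crude tokenizer: lower-case, then split on anything that is not a letter or digit.
  def tokenize(text: String): Vector[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toVector

  // Greedy longest-match scan: at each position, try the longest phrase first.
  def findEntities(text: String): Vector[(String, String)] = {
    val tokens = tokenize(text)
    val found = Vector.newBuilder[(String, String)]
    var i = 0
    while (i < tokens.length) {
      val maxLen = math.min(maxLabelTokens, tokens.length - i)
      val hit = (maxLen to 1 by -1).iterator
        .map(n => (n, tokens.slice(i, i + n).mkString(" ")))
        .collectFirst { case (n, phrase) if labelToId.contains(phrase) => (n, phrase) }
      hit match {
        case Some((n, phrase)) =>
          found += ((phrase, labelToId(phrase)))
          i += n // skip past the matched phrase
        case None =>
          i += 1
      }
    }
    found.result()
  }
}
```

Calling `LocalEntityLookup.findEntities(text)` returns the matched labels paired with their IDs, in order of appearance. The sketch does exact matching only, so inflected forms such as "potatoes" would need lemmatization first. On Spark, a map like this could be shipped to the executors as a broadcast variable, provided it fits in memory.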
Suggestions?
I tagged the question with Scala because I am using Spark for the task.