Extract face name from structure text

I have a collection of bills and invoices, so there is no context in the text (I mean that they don’t tell the story). I want to extract the names of people from these accounts. I tried OpenNLP, but the quality of the trained model is not very good, because I have no context. so the first question is: can I train a model containing only the names of people without context? and if possible, you can give me a good article on how I create this new model (most of the article I read did not explain the steps that I had to take to create a new model).

I have a database name with a name of more than 100,000 people (name, surname), therefore, if NER systems do not work in my case (because there is no context), then what is the best way to search for these candidates (I mean searching for each name with all other names?)

thanks.

+5
source share
2 answers

As for the "context", I think you mean that you do not have whole sentences, i.e. there are no previous / next tokens, in which case you are faced with a non-standard NER. I do not know about the available program or training data for this particular problem, if you did not find them, you will have to create your own corps for training and / or evaluation purposes.

Your name database is likely to help significantly, depending on how much of the account name is actually present in the database. You probably also have to rely on character-level name morphology as patterns (see Pattern Examples in [1]). When you have a training kit with functions (availability in the database, morphology, other account information) and solutions (actual names of annotated bills), using standard machine learning as an SVM will be quite simple (if you are not familiar with this, simply ask).

Some other suggestions:

  • You can probably also use other account information: company name, position, tax reference, etc.
  • You can also act selectively - if all accounts must indicate (exactly?) The name of one person, you can exclude all other texts (for example, amounts, tax names, positions, etc.) or accept in a separate model that among the whole text in the account, only one should be guessed as a name.

[1] Ranking algorithms for retrieving the desired object: magnification and voted perceptron (Michael Collins, 2002)

+2
source

I would start with some regular expressions, and then maybe increase this using a dictionary approach (i.e. a large list of names).

No matter what you do, it will not be perfect, so be sure to remember this.

+2
source

Source: https://habr.com/ru/post/1210530/


All Articles