Disambiguating Named Entities in Java

I have a list of strings (company names, in this case) and a Java program that extracts things that look like company names from unstructured text. I need to map each extracted name to an entry in the list. The catch: the unstructured text contains typos, and a company such as "Blah, Inc." may appear as just "blah". I tried Levenshtein edit distance, but it fails for predictable reasons. Are there known best practices for this problem, or is it back to manual data entry?

+3
3 answers

This is not a simple problem, and there are entire companies trying to solve it, both for restricted cases such as company names and for the general case.

If you can identify a finite set of patterns that match real company names and that the noise does not match, you can solve this with a series of regular expressions.
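A minimal sketch of the regular-expression route, assuming the main sources of variation are letter case, punctuation, and legal suffixes. The class name `CompanyNameNormalizer` and the suffix list are hypothetical and would need tuning against real data.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class CompanyNameNormalizer {
    // Hypothetical list of legal suffixes; extend it to fit your data.
    private static final Pattern LEGAL_SUFFIX = Pattern.compile(
            "\\b(incorporated|inc|corporation|corp|company|co|ltd|llc)\\b\\.?");
    // Any run of characters other than lowercase letters and digits.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-z0-9]+");

    // Lowercases, drops legal suffixes, and collapses punctuation and spaces.
    static String normalize(String raw) {
        String s = raw.toLowerCase(Locale.ROOT);
        s = LEGAL_SUFFIX.matcher(s).replaceAll(" ");
        s = NON_ALNUM.matcher(s).replaceAll(" ");
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Blah, Inc."));
        System.out.println(normalize("BLAH INC"));
        System.out.println(normalize("blah"));
    }
}
```

Normalizing both the extracted names and the reference list this way lets an exact string comparison catch the "Blah, Inc." versus "blah" case; it does nothing for typos.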

If the patterns are too complex or too numerous, you can try to develop a probabilistic model, perhaps something like a Bayesian network. You would take one subset of your data for training, and perhaps a second subset for validating and tuning the network. Methods may include genetic programming or building a neural network. This approach is not easy, and you should study your requirements carefully before going down this road.

+3

One option is Apache Stanbol, which includes NER-based enhancement engines.

The research task is covered by the TAC Knowledge Base Population track (Entity Linking), and related work appears at venues such as ACL, EMNLP, and SIGIR.

TAC , , "", .

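For a company such as "Apple Inc.", a list of known name variants like the following can be drawn from knowledge bases such as DBPedia and Freebase: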

  • AAPL
  • Apple
  • Apple Computer
  • Apple Computer Co.
  • Apple Computer Inc.
  • Apple Computer Incorporated
  • Apple Computer, Inc
  • Apple Computer, Inc.
  • Apple Inc
  • Apple Incorporate
  • Apple Incorporated
  • Apple compputer
  • Apple Computer Inc
  • Apple inc
  • Apple inc.
  • ...
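
A list of name variants like the one above can be loaded into a lookup table keyed by a normalized form. This is a minimal sketch, assuming the variants are already available in memory; the class name `AliasIndex` and the normalization rule are hypothetical, and fetching variants from a knowledge base is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class AliasIndex {
    // Maps a normalized variant to the canonical company name.
    private final Map<String, String> canonicalByAlias = new HashMap<>();

    // Registers the canonical name and every known variant of it.
    void add(String canonical, List<String> aliases) {
        canonicalByAlias.put(key(canonical), canonical);
        for (String alias : aliases) {
            canonicalByAlias.put(key(alias), canonical);
        }
    }

    // Returns the canonical name for a mention, if the variant is known.
    Optional<String> resolve(String mention) {
        return Optional.ofNullable(canonicalByAlias.get(key(mention)));
    }

    // Lowercases and collapses punctuation so "Apple inc." equals "Apple Inc".
    private static String key(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    public static void main(String[] args) {
        AliasIndex index = new AliasIndex();
        index.add("Apple Inc.", Arrays.asList(
                "AAPL", "Apple", "Apple Computer, Inc.", "Apple compputer"));
        System.out.println(index.resolve("apple computer inc"));
        System.out.println(index.resolve("Some Other Company"));
    }
}
```

A lookup like this only resolves variants that have been seen before, including recorded misspellings; unseen typos still need a fuzzy fallback.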
+4

, , . , , Python. Python , Python Java-. , . - , . (, , 80% , , "" "BLAH INC", "Blah Inc." )
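
The 80% figure and the "BLAH INC" / "Blah Inc." variants mentioned above suggest a thresholded similarity match. The following is a sketch of one way to do that in Java, not a reconstruction of this answer's method: it normalizes both strings, scores them with a length-normalized Levenshtein distance, and accepts the best candidate at or above a threshold. The class name `ThresholdMatcher`, the suffix list, and the scoring formula are all assumptions.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class ThresholdMatcher {
    // Lowercases, drops a hypothetical set of legal suffixes, strips punctuation.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("\\b(incorporated|inc|corp|co|ltd|llc)\\b", " ")
                .replaceAll("[^a-z0-9]+", " ")
                .trim();
    }

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 minus distance divided by the longer length.
    static double similarity(String a, String b) {
        String x = normalize(a), y = normalize(b);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) return 0.0; // nothing left to compare
        return 1.0 - (double) levenshtein(x, y) / longest;
    }

    // Returns the highest-scoring known name at or above the threshold.
    static Optional<String> bestMatch(String mention, List<String> known,
                                      double threshold) {
        String best = null;
        double bestScore = threshold;
        for (String candidate : known) {
            double score = similarity(mention, candidate);
            if (score >= bestScore) { best = candidate; bestScore = score; }
        }
        return Optional.ofNullable(best);
    }
}
```

Calling `ThresholdMatcher.bestMatch("blah", knownNames, 0.8)` against a list containing "Blah, Inc." returns that entry, because both sides normalize to the same string. Raw edit distance fails here because the suffix alone accounts for most of the difference between the two strings.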

+2

Source: https://habr.com/ru/post/1749228/

