Determining what a word is: token categorization

I am writing a bridge between the user and a search engine, not the search engine itself. Part of my added value will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorize the query, I can decide whether the user needs to see search results at all. If I can't, they will simply see the search results. I am currently designing this inference mechanism.

I am writing a parser; it must take any given token and assign it a category. Here are some hypothetical examples in English:

  • "denver" is USCITY and PLACENAME
  • "aapl" is NASDAQSYMBOL and STOCKTICKERSYMBOL
  • "555 555 5555" is USPHONENUMBER

I know that each of these cases is likely to require special handling, but I'm not sure where to start.

Ideally, I would get something simple:

    queryCategory = magicCategoryFinder( query )

    >print queryCategory
    >"SOMECATEGORY or a list"
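
As a rough sketch of what such a function might look like, here is a minimal rule-based version in Python. Everything in it is an assumption made for illustration: the lookup sets `US_CITIES` and `NASDAQ_SYMBOLS` are tiny hypothetical stand-ins for real data sources, and the phone pattern only covers simple 10-digit US formats.

```python
import re

# Hypothetical lookup tables; a real system would load these from data files.
US_CITIES = {"denver", "boston"}
NASDAQ_SYMBOLS = {"aapl", "msft"}

# Loose pattern for 10-digit US phone numbers such as "555 555 5555".
US_PHONE = re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")


def magic_category_finder(query):
    """Return every category that matches the query (possibly none)."""
    text = query.strip().lower()
    categories = []
    if text in US_CITIES:
        categories += ["USCITY", "PLACENAME"]
    if text in NASDAQ_SYMBOLS:
        categories += ["NASDAQSYMBOL", "STOCKTICKERSYMBOL"]
    if US_PHONE.match(text):
        categories.append("USPHONENUMBER")
    return categories


if __name__ == "__main__":
    for q in ("denver", "aapl", "555 555 5555", "something else"):
        print(q, magic_category_finder(q))
```

An empty list is the "can't classify" case, in which the user would simply be shown the ordinary search results.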
+3
5 answers

Natural language analysis is a complex topic. One of the problems here is that determining what a word is depends on context and implied knowledge. In addition, you are not so much interested in words as in groups of words. Take "New York City": it is a place, but it is three words, two of which ("new" and "city") have other meanings on their own.
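
One common way to handle groups of words is to try the longest known phrase first and fall back to shorter ones. The sketch below illustrates that idea only; the `PHRASES` table and the `UNKNOWN` label are hypothetical, and it does nothing about context-dependent meaning.

```python
# Hypothetical table of known phrases (as token tuples) and their categories.
PHRASES = {
    ("new", "york", "city"): "PLACENAME",
    ("new", "york"): "PLACENAME",
    ("denver",): "PLACENAME",
}
MAX_LEN = max(len(phrase) for phrase in PHRASES)


def group_tokens(query):
    """Greedily match the longest known phrase at each position."""
    tokens = query.lower().split()
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in PHRASES:
                result.append((" ".join(candidate), PHRASES[candidate]))
                i += n
                break
        else:
            # No phrase starts at this token; keep it as an unknown word.
            result.append((tokens[i], "UNKNOWN"))
            i += 1
    return result
```

With this table, "New York City" comes back as a single PLACENAME group rather than as three separate words.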

+3

Take a look at NLTK, the Natural Language ToolKit (Python).

+3

Although this may not help you significantly, you could use Cyc. It is a huge database of what things are, intended for use in AI applications (although I have not heard a single success story).

+1

Source: https://habr.com/ru/post/1730312/