Python NLP - determining the interest / topic of a text

I am trying to build a model that determines the interest category / topic of a given text. For instance:

"I used to play soccer."

resolves to the top-level category, for example:

"Sport".

I'm not sure what the correct terminology is for what I'm trying to achieve here, so Googling hasn't turned up any libraries that can help. With that in mind, my approach would be something like this:

  • Extract features from the text. Use tagging to classify each feature / identify names / places. I would probably use NLTK or Topia for this.
  • Run a Naive Bayes classifier for each interest category (Sports, Video Games, Politics, etc.) and get a % relevance for each category.
  • Determine which category has the highest % relevance and classify the text accordingly.

My approach most likely requires a separate corpus for each interest category, and I expect the accuracy to be fairly poor. I understand that it will never be that accurate.
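The three steps above could be sketched roughly as follows, using only the standard library. This is a minimal illustration, not a tested design: the training sentences, the category names, and the helper names (`extract_features`, `train`, `score_categories`, `classify`) are all hypothetical, and plain word tokens stand in for real tagging or entity extraction.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus: (text, category) pairs. A real corpus would be far larger.
TRAINING_DATA = [
    ("I used to play soccer every weekend", "Sport"),
    ("The team won the football match", "Sport"),
    ("She beat the final boss on her console", "Video Games"),
    ("The new game has great multiplayer levels", "Video Games"),
    ("The senate passed the election bill", "Politics"),
    ("Voters chose a new president", "Politics"),
]


def extract_features(text):
    """Step 1: lowercase word tokens (a stand-in for tagging / entity extraction)."""
    return re.findall(r"[a-z']+", text.lower())


def train(data):
    """Count words per category, documents per category, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in data:
        tokens = extract_features(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def score_categories(text, model):
    """Step 2: multinomial Naive Bayes with add-one smoothing, normalized to sum to 1."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in extract_features(text) if t in vocab]  # skip unseen words
    log_scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        log_p = math.log(doc_counts[category] / total_docs)  # class prior
        for token in tokens:
            log_p += math.log(
                (word_counts[category][token] + 1) / (total_words + len(vocab))
            )
        log_scores[category] = log_p
    # Convert log scores to normalized probabilities in a numerically stable way.
    highest = max(log_scores.values())
    unnormalized = {c: math.exp(s - highest) for c, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {c: v / norm for c, v in unnormalized.items()}


def classify(text, model):
    """Step 3: pick the category with the highest score."""
    scores = score_categories(text, model)
    return max(scores, key=scores.get)


model = train(TRAINING_DATA)
scores = score_categories("I used to play soccer.", model)
best = classify("I used to play soccer.", model)
print(best, scores)
```

Note that a single multi-class Naive Bayes model already yields one score per category, so separate classifiers per category may not be necessary.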

In general, I am looking for some advice regarding the viability of what I am trying to accomplish, but the essence of my question is: a) is my approach correct? b) are there any libraries / resources that might be useful?

+4
2 answers

You seem to know a lot of the right terminology. Try searching for "document classification". That is the general problem you are trying to solve. A classifier trained on a representative corpus will be more accurate than you might think.

  • (a) There is no single right approach. The approach that you outline will work, however.
  • (b) Scikit-learn is a great library for this kind of work.
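As a rough sketch of what document classification with scikit-learn might look like, assuming a tiny made-up training set (the sentences and labels below are hypothetical) and a TF-IDF plus multinomial Naive Bayes pipeline, which is one common choice rather than the only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples; a real corpus needs many documents per category.
texts = [
    "I used to play soccer every weekend",
    "The team won the football match",
    "She beat the final boss on her console",
    "The new game has great multiplayer levels",
    "The senate passed the election bill",
    "Voters chose a new president",
]
labels = ["Sport", "Sport", "Video Games", "Video Games", "Politics", "Politics"]

# Vectorize the text into TF-IDF features, then fit a Naive Bayes classifier.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

query = ["I used to play soccer."]
predicted = clf.predict(query)[0]

# Per-category probabilities, i.e. the "% relevance" for each category.
probabilities = dict(zip(clf.classes_, clf.predict_proba(query)[0]))
print(predicted, probabilities)
```

With a realistic corpus, it is worth holding out part of the data and measuring accuracy before trusting the probabilities.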

There is also a lot of other information on this topic online, including tutorials:

  • This Naive Bayes classifier on GitHub probably already does most of what you want to achieve.
  • This NLTK tutorial explains the topic in detail.
  • If you really want to get into it, I'm sure that a Google Scholar search will turn up thousands of academic articles in computer science and linguistics on this topic.
+5

You should check out Latent Dirichlet Allocation. It will give you unlabeled categories. As always, Ed Chen's blog is a good place to start.

+3

Source: https://habr.com/ru/post/1490031/

