I am trying to create a model that determines the interest category / topic of a given piece of text. For instance:
"I used to play soccer."
would resolve to a top-level category such as:
"Sport".
I'm not sure what the correct terminology is for what I'm trying to achieve here, so Google hasn't turned up any libraries that might help. With that in mind, my approach would be something like this:
- Extract features from the text. Use tagging to classify each feature / identify names / places. Probably use NLTK or Topia for this.
- Run a Naive Bayes classifier for each interest category (Sports, Video Games, Politics, etc.) and get a relevance % for each category.
- Determine which category has the highest relevance % and classify the text accordingly.
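
A minimal sketch of the steps above, in plain Python so it does not depend on any particular library. Note that it swaps in a single multinomial Naive Bayes model over all categories rather than one classifier per category, and uses plain word tokens instead of tagged features. The training sentences and the names `TRAINING_DATA`, `tokenize`, `train`, `relevance` and `classify` are made up for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Toy training data, invented purely for illustration; a real model would
# need a much larger labelled corpus per category.
TRAINING_DATA = [
    ("I used to play football every weekend", "Sport"),
    ("The team won the soccer match", "Sport"),
    ("She plays tennis and goes running", "Sport"),
    ("I finished the new game on my console", "Video Games"),
    ("The multiplayer level was hard to beat", "Video Games"),
    ("He streams console games online", "Video Games"),
    ("The government passed a new law", "Politics"),
    ("Voters elected a new president", "Politics"),
    ("Parliament debated the election results", "Politics"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (the 'features')."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    """Count documents per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in samples:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def relevance(text, model):
    """Return a dict mapping each category to a score; scores sum to 1."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category in doc_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        log_scores[category] = score
    # Normalise the log scores into values that sum to 1.
    highest = max(log_scores.values())
    exp_scores = {c: math.exp(s - highest) for c, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {c: v / total for c, v in exp_scores.items()}


def classify(text, model):
    """Pick the category with the highest relevance score."""
    scores = relevance(text, model)
    return max(scores, key=scores.get)


model = train(TRAINING_DATA)
scores = relevance("I used to play soccer.", model)
best = classify("I used to play soccer.", model)
```

With this toy data, `best` comes out as `"Sport"` and `scores` holds one relevance value per category. The scores are only as meaningful as the training corpus behind them, so they should be read as a ranking rather than a calibrated accuracy.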
My approach would most likely require a separate corpus for each interest category, and I am sure the accuracy would be fairly miserable; I understand that it will never be all that accurate.
In general, I am looking for some advice on the viability of what I am trying to accomplish, but the crux of my question is: a) is my approach correct? b) are there any libraries / resources that might be useful?