Help organize my data for this machine learning problem

I want to classify tweets within a specific set of categories, such as “sports”, “entertainment”, “love”, etc.

My idea is to use the term frequency of the most commonly used words to solve this problem. For example, the word "love" shows up most often in the love category, but it also shows up in sports and entertainment in the form of "I love this game" and "I love this movie."

To solve this, I pictured a 3-axis graph where the x values are all the words used in my tweets, the y values are the categories, and the z values are the term frequency (or some other type of score) of the word with respect to the category. I would then break a tweet up into its words, look each word up on the graph, and add up the z values within each category. The category with the highest total z value is most likely the correct one. I know this is confusing, so let me give you an example:

The word “watch” shows up a lot in sports and entertainment (“I watch a game” and “I watch my favorite show”), so it at least narrows things down to those two categories. But the word “game” does not often show up in entertainment, and “show” does not often show up in sports. The z value for "watch" + "game" will be highest for the sports category, and "watch" + "show" will be highest for entertainment.
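The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: the training tweets, the `build_table` and `classify` names, and the use of raw counts as the z value are all assumptions, not part of the original question.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (tweet text, category) pairs.
TRAINING = [
    ("i watch the game tonight", "sports"),
    ("great game by the home team", "sports"),
    ("i watch my favorite show", "entertainment"),
    ("that show was a great movie", "entertainment"),
]

def build_table(examples):
    """Map each category to a Counter of word frequencies (the z values)."""
    table = defaultdict(Counter)
    for text, category in examples:
        table[category].update(text.lower().split())
    return table

def classify(table, text):
    """Sum the z values of the tweet's words per category and pick the largest."""
    words = text.lower().split()
    scores = {
        category: sum(counts[word] for word in words)
        for category, counts in table.items()
    }
    return max(scores, key=scores.get), scores

table = build_table(TRAINING)
best, scores = classify(table, "watch game")
print(best, scores)
```

Raw counts favor categories that have more training text, so in practice the z values would probably need to be normalized, for example by the total number of words seen in each category.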

Now that you understand how my idea works, I need help organizing this data so that a machine learning algorithm can predict the category when I give it a word or a set of words. I have read a lot about SVMs and I think they are the way to go. I tried libsvm, but I can't come up with a good input set. Also, libsvm does not support non-numeric values, which adds more complexity.
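One common way around the non-numeric limitation is a bag-of-words encoding: every distinct word gets an integer index, and each tweet becomes a sparse vector of word counts. The sketch below writes such vectors in libsvm's sparse text format (`<label> <index>:<value> ...`, indices in ascending order). The helper names, the label numbering, and the sample tweets are assumptions made for illustration.

```python
def build_vocabulary(texts):
    """Assign each distinct word a 1-based integer index."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def to_libsvm_line(label_id, text, vocab):
    """Encode one tweet as '<label> <index>:<count> ...' with ascending indices."""
    counts = {}
    for word in text.lower().split():
        if word in vocab:  # words outside the vocabulary are ignored
            index = vocab[word]
            counts[index] = counts.get(index, 0) + 1
    features = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label_id} {features}".strip()

# Hypothetical numeric ids for the categories.
LABELS = {"sports": 1, "entertainment": 2}

tweets = [
    ("i watch the game", "sports"),
    ("i watch my favorite show", "entertainment"),
]

vocab = build_vocabulary(text for text, _ in tweets)
lines = [to_libsvm_line(LABELS[category], text, vocab) for text, category in tweets]
print("\n".join(lines))
```

The vocabulary built from the training tweets has to be reused unchanged when encoding new tweets, otherwise the feature indices will not line up with what the model was trained on.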

Any ideas? Do I need a different library, or should I hand-code the decision-making myself?

Thanks, and sorry, I know this was long.

2 answers

[The text of this answer did not survive extraction; the only recoverable reference is to Weka.]


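This is a text classification (document classification) problem: assigning each tweet to one of a fixed set of categories (sports, entertainment, etc.).

A Naive Bayes classifier or the Fisher method would be a reasonable starting point; both are easy to implement in Python.

Both are covered, with Python code, in chapter 6 (Document Filtering) of the book "Programming Collective Intelligence: Building Smart Web 2.0 Applications".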
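As a rough illustration of the Naive Bayes approach this answer mentions, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is not necessarily the formulation used in the book; the class name, the whitespace tokenization, and the training tweets are assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of tweets
        self.vocab = set()

    def train(self, text, category):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category in self.doc_counts:
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[category] / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best

# Hypothetical training tweets.
nb = NaiveBayes()
nb.train("i watch the game", "sports")
nb.train("love this game", "sports")
nb.train("i watch my favorite show", "entertainment")
nb.train("love this movie", "entertainment")

print(nb.classify("i love this game"))
print(nb.classify("watch show"))
```

Working in log space avoids numeric underflow on longer texts, and the smoothing keeps a single unseen word from forcing a category's probability to zero. A shared word such as "love" contributes about equally to every category, so the decision ends up resting on the more distinctive words like "game" or "show".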


Source: https://habr.com/ru/post/1782385/

