I want to classify tweets into a fixed set of categories, such as "sports", "entertainment", "love", etc.
My idea is to use the term frequency of the most commonly used words in each category. The difficulty is that words overlap between categories: the word "love" is most often found in the love category, but it also appears in sports and entertainment, as in "I love this game" and "I love this movie."
To handle this, I came up with a 3-axis graph where the x values are all the words used in my tweets, the y values are the categories, and the z values are the term frequency (or some other kind of score) of a word with respect to a category. I would then break a tweet into words, look each word up in the graph, and add up the z values per category. The category with the highest total z value is most likely the correct one. I know this is confusing, so let me give you an example:
The word "watch" shows up a lot in both sports and entertainment ("I watch a game" and "I watch my favorite show"), so it at least narrows things down to these two categories. But the word "game" does not often appear in entertainment, and "show" does not often appear in sports. So the z value for "watch" + "game" will be highest for the sports category, and the z value for "watch" + "show" will be highest for entertainment.
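To make the scoring concrete, here is a minimal Python sketch of what I have in mind. The training tweets, the category names, and the function names (`build_table`, `score`, `classify`) are all made up for illustration, and the z value is assumed to be a raw word count per category; a real version would need far more data and probably some normalization.

```python
from collections import Counter, defaultdict

# Tiny made-up training set of (tweet, category) pairs. Purely illustrative.
TRAINING = [
    ("I watch a game", "sports"),
    ("I love this game", "sports"),
    ("I watch my favorite show", "entertainment"),
    ("I love this movie", "entertainment"),
    ("I love you", "love"),
]

def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would do more."""
    return text.lower().split()

def build_table(examples):
    """Return table[category][word] = count of the word in that category (the z values)."""
    table = defaultdict(Counter)
    for text, category in examples:
        table[category].update(tokenize(text))
    return table

def score(table, text):
    """Sum the z values of the tweet's words, separately for each category."""
    words = tokenize(text)
    return {cat: sum(counts[w] for w in words) for cat, counts in table.items()}

def classify(table, text):
    """Pick the category with the highest total z value."""
    scores = score(table, text)
    return max(scores, key=scores.get)

table = build_table(TRAINING)
print(classify(table, "watch game"))
print(classify(table, "watch show"))
```

With this toy data, "watch" alone ties between sports and entertainment, and the second word breaks the tie.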
Now that you understand how my idea works, I need help organizing this data so that a machine learning algorithm can predict the category when I give it a word or a set of words. I have read a lot about SVMs and I think that is the way to go. I tried libsvm, but I can't come up with a good set of input features. In addition, libsvm does not accept non-numeric values, so the words would have to be encoded as numbers first, which adds more complexity.
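One encoding I am considering, as a rough sketch and not something I know to be the right approach: give every distinct word a numeric index and describe each tweet by its word counts (a bag of words), written in libsvm's sparse `label index:value` line format. The helper names (`build_vocab`, `to_libsvm_line`), the example tweets, and the label numbers below are my own assumptions.

```python
from collections import Counter

def build_vocab(texts):
    """Map each distinct word to a 1-based feature index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def to_libsvm_line(text, label_id, vocab):
    """Encode one tweet as 'label index:count ...' with indices in ascending order."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    features = sorted((vocab[w], n) for w, n in counts.items())
    return " ".join([str(label_id)] + [f"{i}:{n}" for i, n in features])

# Made-up examples; categories are mapped to arbitrary integer labels.
tweets = [
    ("I watch a game", "sports"),
    ("I watch my favorite show", "entertainment"),
]
labels = {"sports": 1, "entertainment": 2}

vocab = build_vocab(text for text, _ in tweets)
lines = [to_libsvm_line(text, labels[cat], vocab) for text, cat in tweets]
print("\n".join(lines))
```

Words that are not in the vocabulary are simply dropped here, which may or may not be acceptable for real tweets.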
Any ideas? Is there a library I should use, or do I have to write my own classifier?
Thank you. I know this was long, sorry.