BACKGROUND
I have a few lists of sample words, and each list has a category name (Places, Colors, Names).
['john','jay','dan','nathan','bob'] -> 'Names'
['yellow', 'red','green'] -> 'Colors'
['tokyo','beijing','washington','mumbai'] -> 'Places'
My goal is to build a model that takes a new input word and predicts which category it belongs to. For example, if the new input is "purple", it should predict "Colors" as the correct category; if the new input is "Calgary", it should predict "Places".
AN APPROACH
I did some research and came across Word2vec (I believe the gensim implementation). It provides a "similarity" function and a "most_similar" function that I could use. So, one brute-force approach I was thinking about is this:
- Take a new input.
- Calculate its similarity to each word in each list and take the average value.
So, for example, for the input "pink" I would compute its similarity to each word in the "Names" list and take the average, then do the same for the other two lists. The list that gives the highest average similarity should be the category the input belongs to.
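To make the idea concrete, here is a minimal sketch of that averaged-similarity scoring. The 2-D vectors below are invented purely for illustration; in practice you would replace them with real embeddings (e.g. a pretrained gensim `KeyedVectors` model and its `similarity` method), but the scoring logic is the same.

```python
import numpy as np

# Toy 2-D "embeddings", invented for illustration only.
# A real setup would look words up in a pretrained Word2vec model.
toy_vectors = {
    "yellow": [0.9, 0.1], "red": [0.8, 0.2], "green": [0.85, 0.15],
    "tokyo": [0.1, 0.9], "beijing": [0.2, 0.8], "mumbai": [0.15, 0.85],
    "pink": [0.88, 0.12], "calgary": [0.12, 0.88],
}

categories = {
    "Colors": ["yellow", "red", "green"],
    "Places": ["tokyo", "beijing", "mumbai"],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_category(word):
    """Score each category by the mean similarity of `word` to its members,
    then return the highest-scoring category."""
    scores = {
        name: np.mean([cosine(toy_vectors[word], toy_vectors[m])
                       for m in members])
        for name, members in categories.items()
    }
    return max(scores, key=scores.get)

print(predict_category("pink"))     # -> Colors (with these toy vectors)
print(predict_category("calgary"))  # -> Places (with these toy vectors)
```

With real embeddings you would also want to handle out-of-vocabulary words (skip members missing from the model, and decide what to do when the input itself is unknown).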
QUESTION
Given my limited knowledge of NLP and machine learning, I am not sure whether this is the best approach, so I am looking for help and suggestions on better ways to solve this problem. I am open to all suggestions, and please also point out any mistakes I may have made, as I am new to machine learning and the world of NLP.