Using word2vec to classify words into categories

BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john','jay','dan','nathan','bob']  -> 'Names'
['yellow','red','green'] -> 'Colors'
['tokyo','beijing','washington','mumbai'] -> 'Places'

My goal is to build a model that takes a new input and predicts which category it belongs to. For example, if the new input is “purple”, the model should predict “Colors” as the correct category; if the new input is “Calgary”, it should predict “Places”.

AN APPROACH

I did some research and came across word2vec. This library provides "similarity" and "most_similar" functions that I can use. So, one brute-force approach I was thinking about is this:

  • Take a new input.
  • Calculate its similarity to each word in each vector and take the average value.

So, for example, for the input “pink” I would calculate its similarity to each word in the “Names” vector and take the average, and then do the same for the other two vectors. The vector that gives the highest average similarity is the category the input should belong to.
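
To make this concrete, here is a minimal sketch of that averaged-similarity idea, assuming pre-trained vectors loaded with gensim (the file name is only an example):

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical pre-trained vectors in word2vec binary format.
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

data = {
  'Names': ['john','jay','dan','nathan','bob'],
  'Colors': ['yellow','red','green'],
  'Places': ['tokyo','beijing','washington','mumbai'],
}

def predict(query):
  averages = {}
  for category, words in data.items():
    # Skip words the model does not know to avoid KeyErrors.
    known = [w for w in words if w in kv]
    averages[category] = np.mean([kv.similarity(query, w) for w in known])
  return max(averages, key=averages.get)

print(predict('pink'))  # expected: 'Colors'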

QUESTION

Given my limited knowledge of NLP and machine learning, I am not sure this is the best approach, so I am looking for help and suggestions on the best ways to solve my problem. I am open to all suggestions, and please point out any mistakes I may have made, as I am new to machine learning and the world of NLP.

ANSWER

If you are looking for the simplest / fastest solution, I would suggest taking pre-trained word embeddings (Word2Vec or GloVe) and building a simple query system on top of them. The vectors have been trained on a huge corpus, so they are likely to contain a good enough approximation of your domain data.

Here is a script that does this:

import numpy as np

# Category -> words
data = {
  'Names': ['john','jay','dan','nathan','bob'],
  'Colors': ['yellow','red','green'],
  'Places': ['tokyo','beijing','washington','mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}

# Processing the query
def process(query):
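  # NOTE: a query that is missing from the GloVe vocabulary will raise
  # a KeyError here; guard with `if query not in embeddings_index` if needed.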
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
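    # Unnormalized dot product as the similarity score, averaged over the
    # category size so larger categories do not get inflated totals.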
    dist = query_embed.dot(embed)
    dist /= len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))
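
The score above is a raw dot product, so words with large embedding norms can dominate. If that skews your results, a cosine-normalized variant is a small change; a sketch reusing the structures defined above (my variation, not something the original scoring requires):

# Same averaging as process(), but on unit-length vectors (cosine similarity).
def process_cosine(query):
  query_embed = embeddings_index[query]
  query_embed = query_embed / np.linalg.norm(query_embed)
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    sim = query_embed.dot(embed / np.linalg.norm(embed))
    scores[category] = scores.get(category, 0) + sim / len(data[category])
  return scores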

Note that loading the full GloVe embedding matrix takes a while (careful, the file is about 800 MB!). Once it is loaded, running the process() queries above prints something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}

... which looks quite reasonable. And that's it! If you don't need such a big model, you can filter the words in glove according to their tf-idf score. Remember that the model size only depends on the data you have and the words you might want to query.
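
For example, one crude way to shrink the model is to keep only the embeddings for a vocabulary you expect to query; here kept_vocab is a placeholder for whatever a tf-idf (or other) filter produces:

# Keep only embeddings for an allow-list of words (placeholder vocabulary).
kept_vocab = set(categories) | {'pink', 'frank', 'moscow'}
small_index = {w: v for w, v in embeddings_index.items() if w in kept_vocab}
print('Kept %d of %d vectors.' % (len(small_index), len(embeddings_index)))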


Source: https://habr.com/ru/post/1682244/

