Word2vec - get the closest words

Reading the output of the word2vec model for TensorFlow, how can I output the words related to a specific word?

Reading the source at https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/word2vec/word2vec_basic.py shows how the image is plotted.

But is there a data structure (for example, a dictionary) created as part of model training that allows you to access the n words closest to a given word? For example, take the generated word2vec image:

[Image: visualization of word embeddings from the word2vec tutorial, src: https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html]

In this image, the words "to, he, it" are contained in one cluster. Is there a function that takes "to" as input and outputs "he, it" (in this case n = 2)?
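For reference, here is roughly the kind of lookup I mean, written against the final_embeddings, dictionary and reverse_dictionary objects that word2vec_basic.py builds (the helper itself is hypothetical, not something the tutorial provides):

import numpy as np

# Hypothetical helper: nearest neighbors by cosine similarity over the
# tutorial's embeddings. The rows of final_embeddings are already
# L2-normalized, so a dot product is the cosine similarity.
def nearest_words(word, final_embeddings, dictionary, reverse_dictionary, n=2):
    vec = final_embeddings[dictionary[word]]
    sims = np.dot(final_embeddings, vec)
    best = np.argsort(-sims)[1:n + 1]   # index 0 is the query word itself, skip it
    return [reverse_dictionary[i] for i in best]

# e.g. nearest_words('to', final_embeddings, dictionary, reverse_dictionary, n=2)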

2 answers

This approach applies to word2vec in general. If you can save the word vectors in a text or binary file, like the google/GloVe word vectors, then all you need is gensim.
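If your vectors come from the TensorFlow tutorial above, one way to get them into that format is to write them out yourself. A minimal sketch, assuming the final_embeddings array and reverse_dictionary dict from word2vec_basic.py (the helper name is mine):

# Sketch: dump embeddings in the plain-text word2vec format that
# load_word2vec_format understands: a "vocab_size dim" header line,
# then one "word v1 v2 ..." line per word.
def save_word2vec_text(path, final_embeddings, reverse_dictionary):
    vocab_size, dim = final_embeddings.shape
    with open(path, 'w') as f:
        f.write("%d %d\n" % (vocab_size, dim))
        for i in range(vocab_size):
            vec = " ".join("%.6f" % x for x in final_embeddings[i])
            f.write("%s %s\n" % (reverse_dictionary[i], vec))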

For installation: via GitHub.

Python Code:

from gensim.models import Word2Vec

# fname: path to word vectors saved in word2vec text/binary format
gmodel = Word2Vec.load_word2vec_format(fname)
ms = gmodel.most_similar('good', topn=10)   # the 10 words most similar to 'good'
for x in ms:
    print x[0], x[1]

However, this searches over all the words to produce the results. There are approximate nearest neighbor (ANN) methods that will give you the result faster, but with a trade-off in accuracy.

In recent gensim, Annoy is used to perform the ANN search; see the gensim Annoy tutorial notebooks for more information.
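As an illustration, a sketch of an approximate query through gensim's AnnoyIndexer, assuming a gensim version that ships it, the annoy package installed, and a model gmodel like the one loaded above (in gensim 4.x the import path is gensim.similarities.annoy rather than gensim.similarities.index):

from gensim.similarities.index import AnnoyIndexer

# Build the Annoy index once; more trees give better accuracy but a slower build.
indexer = AnnoyIndexer(gmodel, 100)
# Same query as before, answered from the Annoy index instead of a full scan.
print(gmodel.most_similar('good', topn=10, indexer=indexer))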

FLANN is another library for approximate nearest neighbor search.


Get gensim and use the similar_by_word method on the gensim.models.Word2Vec model.

similar_by_word takes 3 parameters:

  • the input word
  • topn - the number of top similar words to return (optional, default = 10)
  • restrict_vocab (optional, default = None)

Example

import gensim, nltk

class FileToSent(object):
    """A class to load a text file efficiently, line by line."""
    def __init__(self, filename):
        self.filename = filename
        # To remove stop words (optional)
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        for line in open(self.filename, 'r'):
            ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
            yield ll

Then, depending on your input sentences (sentence_file.txt),

sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, min_count=2, hs=1)
print model.similar_by_word('hack', 2)  # Get the two words most similar to 'hack'
# [(u'debug', 0.967338502407074), (u'patch', 0.952264130115509)] (output specific to my dataset)
