PCA on word2vec embeddings

Question

I am trying to reproduce the results of this paper: https://arxiv.org/pdf/1607.06520.pdf

In particular, this part:

To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure 6 shows, there is a single direction that explains the majority of variance in these vectors. The first eigenvalue is significantly larger than the rest.

I use the same set of word vectors as the authors (Google News corpus, 300 dimensions), which I load with gensim.

The "ten gender pair difference vectors" the authors refer to are computed from the word pairs listed in the pairs variable in the code below.

I computed the difference between the normalized vectors of each pair as follows:

    import gensim
    import numpy as np

    # Load the pretrained Google News vectors (300 dimensions)
    model = gensim.models.KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)
    model.init_sims()  # precompute the L2-normalized vectors

    pairs = [('she', 'he'), ('her', 'his'), ('woman', 'man'), ('Mary', 'John'),
             ('herself', 'himself'), ('daughter', 'son'), ('mother', 'father'),
             ('gal', 'guy'), ('girl', 'boy'), ('female', 'male')]

    # One row per pair: normalized vector of the first word minus that of the second
    difference_matrix = np.array(
        [model.word_vec(a[0], use_norm=True) - model.word_vec(a[1], use_norm=True)
         for a in pairs])

Then I run PCA on the resulting matrix with 10 components, as in the paper:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=10)
    pca.fit(difference_matrix)

However, I get very different results when I look at pca.explained_variance_ratio_:

    array([2.83391436e-01, 2.48616155e-01, 1.90642492e-01, 9.98411858e-02,
           5.61260498e-02, 5.29706681e-02, 2.75670634e-02, 2.21957722e-02,
           1.86491774e-02, 1.99108478e-32])

or, shown as a plot:

The first component accounts for less than 30% of the variance when it should be above 60%!
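
One thing that may be worth checking here is centering. scikit-learn's PCA subtracts the per-column mean before decomposing the data, so with only 10 rows the centered matrix has rank at most 9, which would be consistent with the last ratio above being about 1e-32. If the difference vectors share a common direction, centering also removes most of it. The following is a minimal numpy-only sketch on synthetic data, not on the real embeddings. The helper variance_ratios, the noise level 0.05, and the seed are assumptions made for illustration, and whether the paper's analysis was centered in this way is not stated in the passage quoted above.

```python
import numpy as np


def variance_ratios(matrix, center):
    """Explained-variance ratios from an SVD, optionally centering columns first."""
    data = matrix - matrix.mean(axis=0) if center else matrix
    singular_values = np.linalg.svd(data, compute_uv=False)
    energy = singular_values ** 2
    return energy / energy.sum()


# Synthetic stand-in for the 10 x 300 difference matrix (NOT real word2vec data):
# every row is one shared unit direction plus small independent noise.
rng = np.random.default_rng(0)
shared_direction = rng.normal(size=300)
shared_direction /= np.linalg.norm(shared_direction)
noise = 0.05 * rng.normal(size=(10, 300))
differences = shared_direction + noise

centered = variance_ratios(differences, center=True)     # centers columns, as sklearn's PCA does
uncentered = variance_ratios(differences, center=False)  # plain SVD of the raw rows

print("centered first ratio:  ", centered[0])
print("uncentered first ratio:", uncentered[0])
print("centered last ratio:   ", centered[-1])
```

If the uncentered ratios on the real difference_matrix look closer to the paper's figure, centering is a plausible explanation, though this sketch alone does not confirm it.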

The results I get are similar to what I get when I run PCA on randomly selected vectors, so I must be doing something wrong, but I can't figure out what.

Note: I also tried without normalizing the vectors, but I get the same results.
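
The random-vector comparison mentioned above can be sketched as follows. This is an assumed baseline that uses Gaussian random unit vectors rather than vectors sampled from the model, with the PCA step written out via numpy's SVD (column centering followed by squared singular values). The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10 random unit vectors in 300 dimensions (hypothetical stand-ins for word vectors)
random_vectors = rng.normal(size=(10, 300))
random_vectors /= np.linalg.norm(random_vectors, axis=1, keepdims=True)

# PCA by hand: center the columns, then take squared singular values
centered_random = random_vectors - random_vectors.mean(axis=0)
singular_values = np.linalg.svd(centered_random, compute_uv=False)
baseline_ratios = singular_values ** 2 / np.sum(singular_values ** 2)

print(np.round(baseline_ratios, 3))
```

With so few points in so many dimensions, the variance in this baseline is spread fairly evenly across the first nine components, which is the kind of profile the ratios in the question resemble.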

python scikit-learn nlp pca word2vec

user2969402 Dec 29 '17 at 8:45
