I am trying to reproduce the results of this article: https://arxiv.org/pdf/1607.06520.pdf
In particular, this part:
"To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure 6 shows, there is a single direction that explains the majority of variance in these vectors. The first eigenvalue is significantly larger than the rest."

I use the same set of word vectors as the authors (Google News corpus, 300 dimensions), which I load with gensim.
The "ten gender pair difference vectors" the authors refer to are computed from the following word pairs:

I calculated the differences between each normalized vector as follows:
    import gensim
    import numpy as np

    # Load the pretrained Google News vectors (gensim 3.x API).
    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    model.init_sims()  # precompute the L2-normalized vectors

    pairs = [('she', 'he'), ('her', 'his'), ('woman', 'man'), ('Mary', 'John'),
             ('herself', 'himself'), ('daughter', 'son'), ('mother', 'father'),
             ('gal', 'guy'), ('girl', 'boy'), ('female', 'male')]

    # One row per pair: normalized first word minus normalized second word.
    difference_matrix = np.array([model.word_vec(a[0], use_norm=True) - model.word_vec(a[1], use_norm=True)
                                  for a in pairs])
Then I perform PCA on the resulting matrix with 10 components, as in the paper:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=10)
    pca.fit(difference_matrix)
However, I get very different results when I look at pca.explained_variance_ratio_:
array([ 2.83391436e-01, 2.48616155e-01, 1.90642492e-01, 9.98411858e-02, 5.61260498e-02, 5.29706681e-02, 2.75670634e-02, 2.21957722e-02, 1.86491774e-02, 1.99108478e-32])
or, as a plot:

The first component accounts for less than 30% of the variance when it should be above 60%!
The results are similar to what I get when I run PCA on randomly selected vectors, so I must be doing something wrong, but I can't figure out what.
Note: I also tried without normalizing the vectors, but I get the same results.
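For reference, here is a minimal sketch of the random baseline I am comparing against. It is not the exact code I ran: it uses synthetic Gaussian vectors scaled to unit length as a stand-in for randomly selected word vectors (an assumption, so it runs without the model), and the helper `explained_variance_ratio` is a name I made up for this example. It mean-centers the rows and takes an SVD, which as far as I understand is what sklearn's PCA does internally.

```python
import numpy as np

def explained_variance_ratio(X):
    """Variance ratios of the principal components of X (rows are samples).

    The rows are mean-centered before the SVD, as sklearn's PCA also does.
    """
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Hypothetical stand-in for 10 random word pairs:
# Gaussian vectors in 300 dimensions, scaled to unit length.
a = rng.standard_normal((10, 300))
b = rng.standard_normal((10, 300))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)

random_differences = a - b
ratios = explained_variance_ratio(random_differences)
print(ratios)
```

With this baseline the variance is spread fairly evenly over the first nine components, with no dominant direction. Because the rows are centered, 10 samples can have at most 9 nonzero components, which would also explain why the last value in my output above is essentially zero.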