Using a pre-trained Wikipedia Word2Vec model

I need to use gensim to get vector representations of words, and I believe the best option is a word2vec model that has been pre-trained on an English Wikipedia corpus. Does anyone know where to download such a model, how to install it, and how to use gensim to get the vectors?

2 answers

@imanzabet provided useful links with pre-trained vectors, but if you want to train the models yourself using gensim, you need to do two things:

  • Acquire the Wikipedia data, which you can get here . It looks like the latest dump of English Wikipedia was made on the 20th, and it can be found here . I believe the other English-language wikis, for example Wikiquote, are dumped separately, so if you want to include them you will need to download those as well.

  • Parse the data and use it to train a model. This is a pretty broad topic, so I'll just point you to the excellent gensim documentation and the word2vec tutorial .

Finally, I will point out that there seems to be a blog post that describes exactly your use case.


I downloaded a pre-trained model from WebVectors (Word2Vec section). According to its readme, the model is distributed in gensim's plain-text word2vec format. Note that the vocabulary is annotated with part-of-speech (POS) tags, so looking up a bare word such as vacation raises a KeyError; you have to query vacation_NOUN instead. With that in mind (loading the model and running a couple of queries):

import gensim.models

# Path to the WebVectors model in word2vec text format
model_path = "./WebVectors/3/enwiki_5_ner.txt"

word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)

# Queries must use the POS-tagged keys, e.g. vacation_NOUN rather than vacation
print(word_vectors.most_similar("vacation_NOUN"))
print(word_vectors.most_similar(positive=['woman_NOUN', 'king_NOUN'], negative=['man_NOUN']))

▶ python3 wiki_model.py
[('vacation_VERB', 0.6829521656036377), ('honeymoon_NOUN', 0.6811978816986084), ('holiday_NOUN', 0.6588436365127563), ('vacationer_NOUN', 0.6212040781974792), ('resort_NOUN', 0.5720850825309753), ('trip_NOUN', 0.5585346817970276), ('holiday_VERB', 0.5482848882675171), ('week-end_NOUN', 0.5174300670623779), ('newlywed_NOUN', 0.5146450996398926), ('honeymoon_VERB', 0.5135983228683472)]
[('monarch_NOUN', 0.6679952144622803), ('ruler_NOUN', 0.6257176995277405), ('regnant_NOUN', 0.6217397451400757), ('royal_ADJ', 0.6212111115455627), ('princess_NOUN', 0.6133661866188049), ('queen_NOUN', 0.6015778183937073), ('kingship_NOUN', 0.5986001491546631), ('prince_NOUN', 0.5900266170501709), ('royal_NOUN', 0.5886058807373047), ('throne_NOUN', 0.5855424404144287)]

UPDATE:

Some sources of pre-trained word vectors:

  • Fasttext
  • Google Word2Vec
  • GloVe
  • WebVectors (models annotated with part-of-speech (POS) tags)

Source: https://habr.com/ru/post/1682242/

