Disabling Gensim's removal of punctuation, etc. when parsing a wiki

I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for this.

This works for me, but what I don't like about the resulting word2vec model is that named entities are split into separate tokens, which makes the model unsuitable for my particular application. The model I need should represent each named entity as a single vector.

That's why I planned to parse the Wikipedia articles with spaCy and merge entities such as "North Carolina" into "North_Carolina", so that word2vec represents them as a single vector. So far so good.
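To make the intended merging step concrete, here is a minimal sketch. The function name `merge_entities` and the example sentence are illustrative, not part of the question, and the entity spans are computed by hand here rather than by an NER model.

```python
def merge_entities(text, spans):
    """Replace spaces with underscores inside each (start, end) character span.

    `spans` must be non-overlapping (start_char, end_char) pairs.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])                 # text before the entity, unchanged
        pieces.append(text[start:end].replace(" ", "_"))  # entity becomes one token
        cursor = end
    pieces.append(text[cursor:])                          # remainder after the last entity
    return "".join(pieces)


sentence = "She moved from North Carolina to New York in 2010."
# Hand-built spans standing in for the output of an NER step (assumption).
example_spans = [
    (sentence.index(name), sentence.index(name) + len(name))
    for name in ("North Carolina", "New York")
]
merged = merge_entities(sentence, example_spans)
print(merged)
```

With spaCy, the spans could presumably come from `[(ent.start_char, ent.end_char) for ent in nlp(text).ents]`, assuming `nlp` is a loaded pipeline that includes an NER component.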

The spaCy parsing should be part of the pre-processing, which I originally did as recommended in the linked discussion, using:

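from gensim.corpora import WikiCorpus

...
# dictionary={} skips building a vocabulary over the whole dump
wiki = WikiCorpus(wiki_bz2_file, dictionary={})
for text in wiki.get_texts():
    # each article is a list of tokens; write one article per line
    article = " ".join(text) + "\n"
    output.write(article)
...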

This removes punctuation, stop words, numbers and capitalization, and saves each article on a separate line in the resulting output file. The problem is that spaCy's NER doesn't really work on this pre-processed text, since I guess it relies on punctuation and capitalization for NER (?).

Does anyone know if I can "disable" gensim's preprocessing so that it doesn't remove punctuation etc., but still parses the Wikipedia articles to text directly from the compressed Wikipedia dump? Or does anyone know a better way to accomplish this? Thanks in advance!
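One direction that might be worth trying, sketched under assumptions: this assumes a gensim version whose `WikiCorpus` accepts the `tokenizer_func` and `lower` arguments, and the helper names `keep_everything_tokenizer` and `iter_raw_articles` are made up for illustration. Wiki markup would still be stripped by gensim, and splitting on whitespace discards line and paragraph breaks.

```python
def keep_everything_tokenizer(text, token_min_len, token_max_len, lower):
    """Whitespace-only tokenizer.

    Deliberately ignores the length limits and the `lower` flag, so
    punctuation, digits and capitalization survive.
    """
    return text.split()


def iter_raw_articles(wiki_bz2_file):
    """Yield each article as one string with case and punctuation intact."""
    # Imported here so the tokenizer above can be tried without gensim installed.
    from gensim.corpora import WikiCorpus

    wiki = WikiCorpus(
        wiki_bz2_file,
        dictionary={},  # skip building a vocabulary
        tokenizer_func=keep_everything_tokenizer,  # must be a top-level function
        lower=False,
    )
    for tokens in wiki.get_texts():
        yield " ".join(tokens)
```

Each string yielded by `iter_raw_articles` could then be passed to spaCy before the text is written out for word2vec training. Note that `WikiCorpus` also drops very short articles by default, which this sketch does not change.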


, spacy . (, ..). spacy NER (, , POS Tagger ) .

gensim LSI - ( ). , gensim.

model.wv.vocab model = gensim.models.Word2Vec(...) . , , .


gensim word2vec spaCy, , :

  1. Gensim
  2. spaCy
  3. W2V ( SpaCy) (?)

, , spaCy , , NER... : https://www.youtube.com/watch?v=sqDHBH9IjRU

, , , :

  1. spaCy
  2. spaCy NER
  3. - ,
  4. gensim w2v spacy.load()
  5. w2v spaCy

, , gensim spaCy :

  1. wget [URL ]
  2. python -m spacy init-model [] [, ]

init-model: https://spacy.io/api/cli#init-model

, en_core_web_md, .txt,.zip .tgz.


Source: https://habr.com/ru/post/1675193/

