Using gensim to score features in each document, plus a Python memory issue

I am using gensim with a corpus of 50,000 documents and a dictionary of about 4,000 features. I also have an LSI model already trained on it.

Now I want to find the most relevant features for each of the given documents. To find the best features in a specific document, I run gensim's similarity module for each feature against all documents. This gives us a score for each feature that we can use later. But, as you can imagine, this is an expensive process: we have to run 4,000 similarity queries, each against an index of 50,000 documents.

I need a better way to do this, since I run out of my system's 8 GB of memory after about 1,000 iterations. There is really no reason for the memory to keep growing, since I only reassign variables during the iterations. Surprisingly, the memory starts to grow only after about 200 iterations.

  • Why is there a memory problem, and how can it be fixed?
  • Is there a better way to find the highest-scoring features in a particular document (rather than topics)?

Here is the snippet of code that runs out of memory:

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('features-dict.dict')
corpus = corpora.MmCorpus('corpus.mm')
lsi = models.LsiModel.load('model.lsi')
corpus_lsi = lsi[corpus]
index = similarities.MatrixSimilarity(list(corpus_lsi))
newDict = dict()

for feature in dictionary.token2id.keys():
  vec_bow = dictionary.doc2bow([feature])
  vec_lsi = lsi[vec_bow]
  sims = index[vec_lsi]
  li = sorted(enumerate(sims * 100), key=lambda item: -item[1])

  for data in li:
    newDict[data[0]] = (feature, data[1])  # store the feature and its score for each document


# Do something with the dict created above

EDIT:

The memory issue was resolved with the help of a memory profiler: something else inside that loop was causing the sharp growth.

To make the goal concrete, here is a toy corpus, where each document is a list of features:

corpus = [['Olive Oil', 'Tomato', 'Brocolli', 'Oregano'],
          ['Garlic', 'Olive Oil', 'Bread', 'Cheese', 'Oregano'],
          ['Avocado', 'Beans', 'Cheese', 'Lime'],
          ['Jalepeneo', 'Lime', 'Tomato', 'Tortilla', 'Sour Cream'],
          ['Chili Sauce', 'Vinegar', 'Mushrooms', 'Rice'],
          ['Soy Sauce', 'Noodles', 'Brocolli', 'Ginger', 'Vinegar']]

For each feature (say, Olive Oil), I want a score between 0 and 100 for every document, where a higher score means the feature is more relevant to that document. The loop above computes these scores, just far too slowly.


A few suggestions:

  • Instead of issuing a separate similarity query for every feature, you can get all feature-to-document scores at once from lsi.projection.u:

    # create #features x #corpus 2D matrix of weights
    doc_feature_matrix = numpy.dot(lsi.projection.u, index.index.T)
    

    " ", , .

  • The list() in list(lsi[corpus]) materializes the entire doc-topic matrix in RAM at once. Drop the list() and feed the transformed stream directly to the index: index = MatrixSimilarity(lsi[corpus], num_features=lsi.num_topics).

  • You do not seem to be using LSI in the standard way. LSI is meant to run on top of a weighted input (= weighted bag-of-words vectors): the usual pipeline is raw counts, then TF-IDF, then LSI.


Source: https://habr.com/ru/post/1537896/

