Gensim LdaMulticore not running in parallel?

When I run gensim's LdaMulticore model on a machine with 12 cores, using:

 lda = LdaMulticore(corpus, num_topics=64, workers=10) 

I get a logging message that says

 using serial LDA version on this node 

A few lines later, I see another logging message that says

 training LDA model using 10 processes 

When I run top, I see that 11 Python processes have been spawned, but 9 are sleeping, i.e. only one worker is active. The machine has 24 cores and is not overloaded by any means. Why isn't LdaMulticore running in parallel?

1 answer

First, make sure a fast BLAS library is installed, since most of the processing time is spent inside low-level linear algebra routines.

On my machine, gensim.models.ldamulticore.LdaMulticore can use all 20 CPU cores with workers=4 during training. Setting workers higher than this did not speed up training. One reason may be that the corpus iterator is too slow to keep LdaMulticore busy.
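
One way to check whether the corpus reader is the bottleneck is to time plain iteration over it, without any training. The sketch below is only an illustration: measure_corpus_throughput is a hypothetical helper (not part of gensim), and the toy in-memory corpus stands in for whatever streamed corpus you actually pass to LdaMulticore.

```python
import itertools
import time


def measure_corpus_throughput(corpus, max_docs=10000):
    """Return documents per second when iterating over `corpus`.

    Hypothetical helper, not part of gensim. `corpus` is any iterable of
    bag-of-words documents (lists of (token_id, weight) pairs).
    """
    start = time.perf_counter()
    # Read at most `max_docs` documents and count how many were seen.
    n_docs = sum(1 for _ in itertools.islice(corpus, max_docs))
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small corpora.
    return n_docs / max(elapsed, 1e-9)


# Toy in-memory corpus, used here only to show the call.
toy_corpus = [[(0, 1.0), (1, 2.0)], [(1, 1.0)], [(0, 3.0), (2, 1.0)]]
docs_per_second = measure_corpus_throughput(toy_corpus, max_docs=3)
print(f"{docs_per_second:.0f} docs/s")
```

If the rate measured on the real corpus is low while top shows most workers sleeping, the reader is probably what limits training speed, and a faster storage format is worth trying.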

You can try using ShardedCorpus to serialize and replace the corpus, which should be much faster to read and write. Simply compressing a large .mm file so that it takes up less space (and so needs less I/O) may also help. For example:

 import bz2
 import gensim

 mm = gensim.corpora.MmCorpus(bz2.BZ2File('enwiki-latest-pages-articles_tfidf.mm.bz2'))
 lda = gensim.models.ldamulticore.LdaMulticore(corpus=mm, id2word=id2word, num_topics=100, workers=4)

Source: https://habr.com/ru/post/1236865/

