Not exactly the same code path; partial_fit uses total_samples:
"total_samples : int, optional (default=1e6) — the total number of documents. Used only in the partial_fit method."
https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L184
(partial_fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L472
(fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L510
Just in case you are interested: partial_fit is a good candidate when your data set is really large. Instead of running into memory problems, you fit the model in small mini-batches, which is called incremental learning.
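A minimal sketch of that pattern, assuming a document-term count matrix `X` (here a random toy matrix standing in for a corpus you would normally stream from disk; `n_components` is the parameter name in recent scikit-learn releases, older ones called it `n_topics`):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix; in practice each mini-batch would be
# loaded and vectorized on the fly instead of living in memory at once.
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(2000, 100))

lda = LatentDirichletAllocation(n_components=10, random_state=0)

# Incremental learning: update the model one mini-batch at a time.
batch_size = 500
for start in range(0, X.shape[0], batch_size):
    lda.partial_fit(X[start:start + batch_size])

print(lda.components_.shape)  # (10, 100): one row of word weights per topic
```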
So, in your case, you should take into account that the default total_samples is 1e6 (1,000,000). If you do not change this number and your actual number of samples is larger, you will get different results from fit and partial_fit (see the sketch after the quotes below). The same applies if your partial_fit mini-batches do not cover all of the samples that you pass to the fit method. And even if you do everything right, you can still get different results, as the documentation points out:
- "The incremental student itself may not be able to handle the new / invisible target classes. In this case, you need to pass all possible classes to the first call to part_fit using the classes = parameter."
- "[...] choosing the right algorithm is that they all do not attach the same value to each example over time [...]"
Sklearn documentation: https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning