Text clustering and topic extraction

I am doing some text mining with the excellent scikit-learn module. I am trying to cluster and classify scientific theses.

I am looking for a way to cluster my set of tf-idf vectors without specifying the number of clusters in advance. I could not find a good algorithm that can do this and still handle large sparse matrices decently. I looked into plain scikit-learn k-means, but it has no way of determining the optimal number of clusters (e.g. via BIC). I also tried Gaussian mixture models (selecting the model with the best BIC score), but they are terribly slow.

Once the documents are clustered, I would like to be able to look into the topics of each cluster, i.e. the words that are typical for it. Is there a way to extract this information given the data matrix and the cluster labels? Maybe by taking the mean of the cluster's rows and mapping it back to terms via the tf-idf vectorizer? I previously tried chi-square and random forests to rank feature importance, but that does not tell me which label class uses which words.
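In code, I imagine something roughly like the following untested sketch, where X is the sparse tf-idf matrix, labels the cluster assignments as a NumPy array, and vectorizer the fitted TfidfVectorizer:

    import numpy as np

    def top_terms_per_cluster(X, labels, vectorizer, n_terms=10):
        # terms[i] is the word behind column i of the tf-idf matrix
        terms = vectorizer.get_feature_names_out()
        for c in np.unique(labels):
            # mean tf-idf weight of every term within cluster c
            centroid = np.asarray(X[labels == c].mean(axis=0)).ravel()
            top = centroid.argsort()[::-1][:n_terms]
            print(c, [terms[i] for i in top])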

I tried NMF decomposition (using just the sample code from scikit-learn) for topic detection. It did a great job and produced very meaningful topics very quickly. However, I did not find a way to use it to assign each document to a cluster, nor to determine the "optimal" number of clusters automatically. And that is exactly what I am looking for.
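For reference, what I ran was roughly the following (adapted from the scikit-learn topic extraction example; X and vectorizer are my tf-idf matrix and fitted vectorizer, and the number of topics is picked by hand, which is exactly my problem):

    from sklearn.decomposition import NMF

    n_topics = 20  # chosen manually
    nmf = NMF(n_components=n_topics, random_state=1).fit(X)

    terms = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(nmf.components_):
        top = topic.argsort()[::-1][:10]
        print("Topic %d: %s" % (topic_idx, " ".join(terms[i] for i in top)))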

I also read somewhere that it is possible to extract topic information directly from a fitted LDA model, but I don't understand how this is done. Since I have already implemented LDA as a baseline classifier and visualization tool, this might be an easy solution.

If I manage to produce meaningful clusters/topics, I am going to compare them to some human-made labels (not topic-based) to see how well they correspond. But that is a topic for another thread :-)

+7
3 answers

You can try TF-IDF with a low max_df, e.g. max_df=0.5, and then k-means (or MiniBatchKMeans). To find a good value for K, you can try one of these heuristics:

  • the gap statistic
  • prediction strength

Concise descriptions of both are given in this blog post: http://blog.echen.me/2011/03/19/counting-clusters/

Neither of these methods is implemented in sklearn. I would be very interested to hear whether either of them turns out to be useful for your problem. If so, it would be interesting to discuss how best to contribute a default implementation to scikit-learn.
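For the vectorization and clustering part, a minimal sketch could look like the following (documents is your list of raw texts; run it for several candidate values of K and score each clustering with one of the heuristics above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans

    vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
    X = vectorizer.fit_transform(documents)  # large sparse tf-idf matrix

    k = 30  # one candidate value of K
    km = MiniBatchKMeans(n_clusters=k, random_state=0)
    labels = km.fit_predict(X)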

+4

There are two ways to do this:

  • Clustering approach. Use the transformed feature set returned by NMF as input to a clustering algorithm. For example, if you use the k-means algorithm, you can set k to the number of topics (i.e. the new features/components) you have. I think this paper talks about something similar.

  • Tagging approach. This is the approach I used recently. It lets you tag each document with one or more topics. Use the transform() method of the fitted NMF model to get an n_documents * n_topics matrix. Then set a threshold per topic; for my data a value of 0.02 worked great. Assign a topic to a document if its value exceeds the threshold. Note that this means some documents will have more than one topic assigned to them, while others will have none at all. But I found that this approach gave very meaningful and interesting results; see the sketch after this list.
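A rough sketch of the tagging approach (nmf is the fitted scikit-learn NMF model, X the tf-idf matrix of the documents to tag; the 0.02 threshold is simply what worked for my data):

    import numpy as np

    W = nmf.transform(X)   # shape (n_documents, n_topics)
    threshold = 0.02       # tune this for your data

    # topics assigned to each document: possibly several, possibly none
    doc_topics = [np.where(row > threshold)[0] for row in W]

    # for the clustering approach instead, take the strongest topic per document
    hard_labels = W.argmax(axis=1)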

+2

For LDA, I found that this article provides a very good explanation.
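If LDA here means Latent Dirichlet Allocation, and assuming a scikit-learn version that ships LatentDirichletAllocation, extracting the topics from a fitted model usually comes down to reading its components_ attribute, roughly like this (lda is the fitted model and vectorizer the count vectorizer used to build the document-term matrix):

    terms = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(lda.components_):
        # highest-weighted terms of this topic
        top = topic.argsort()[::-1][:10]
        print("Topic %d: %s" % (topic_idx, ", ".join(terms[i] for i in top)))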

0

Source: https://habr.com/ru/post/946158/