I am doing text mining with the excellent scikit-learn module. I am trying to cluster and classify scientific theses.
I am looking for a way to cluster my set of tf-idf vectors without specifying the number of clusters in advance. I have not found a good algorithm that can do this and still handle large sparse matrices decently. I looked at simply using scikit-learn's KMeans, but it has no way to determine the optimal number of clusters (e.g. via BIC). I also tried Gaussian mixture models (selecting the model with the best BIC score), but they are terribly slow.
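Roughly what I tried, as a simplified sketch. The variable name `tfidf_matrix` is a placeholder for my document-term matrix, and the TruncatedSVD step is just how I densified the sparse data before fitting the GMM, so it is an assumption on top of the basic idea:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

# GMM needs dense, reasonably low-dimensional input, so reduce the sparse tf-idf matrix first
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(tfidf_matrix)

# pick the number of components by lowest BIC -- this loop is where it gets terribly slow
best_bic, best_gmm = np.inf, None
for k in range(2, 30):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_reduced)
    bic = gmm.bic(X_reduced)
    if bic < best_bic:
        best_bic, best_gmm = bic, gmm

labels = best_gmm.predict(X_reduced)
```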
After clustering the documents, I would like to be able to inspect the topics of each cluster, i.e. the words its documents typically use. Is there a way to extract this information given the data matrix and the cluster labels? Maybe by taking the mean vector of the cluster and inverse-transforming it with the tf-idf vectorizer? I previously tried chi-square and random forest to rank feature importance, but that doesn't tell me which label class uses which features.
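This is the kind of thing I had in mind, as a sketch: take each cluster's mean tf-idf vector and read off the highest-weighted terms. Here `vectorizer`, `tfidf_matrix` and `labels` are placeholders for my own fitted vectorizer, data matrix and cluster assignments:

```python
import numpy as np

# get_feature_names_out() in newer scikit-learn; older versions call it get_feature_names()
terms = np.array(vectorizer.get_feature_names_out())

for cluster_id in np.unique(labels):
    # mean tf-idf vector of the documents in this cluster (the "centroid")
    centroid = np.asarray(tfidf_matrix[labels == cluster_id].mean(axis=0)).ravel()
    top_terms = terms[np.argsort(centroid)[::-1][:10]]
    print(cluster_id, top_terms)
```

I am not sure whether the centroid of tf-idf weights is really the right quantity to look at, which is part of my question.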
I tried the NMF decomposition method (simply using the sample code from the scikit-learn documentation) for topic detection. It worked well and very quickly produced very meaningful topics. However, I did not find a way to use it to assign each document to a cluster, nor to automatically determine the "optimal" number of clusters. That is what I am looking for.
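Roughly what I used, adapted from the scikit-learn topic extraction example; `tfidf_matrix` and `vectorizer` are the same placeholders as above, and the hard assignment at the end is just one idea I considered, not something from the example:

```python
import numpy as np
from sklearn.decomposition import NMF

n_topics = 10  # exactly the number I would like to avoid fixing by hand
nmf = NMF(n_components=n_topics, random_state=0)
W = nmf.fit_transform(tfidf_matrix)   # document-topic weights
H = nmf.components_                   # topic-term weights

# top words per topic
terms = np.array(vectorizer.get_feature_names_out())
for topic_idx, topic in enumerate(H):
    print(topic_idx, terms[np.argsort(topic)[::-1][:10]])

# one idea for turning this into a hard clustering: assign each document to its
# strongest topic -- but this still requires choosing n_topics in advance
doc_cluster = W.argmax(axis=1)
```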
I also read somewhere that it is possible to extract topic information directly from a fitted LDA model, but I don't understand how this is done. Since I have already implemented LDA as a baseline classifier and visualization tool, this could be an easy solution.
If I manage to produce meaningful clusters/topics, I am going to compare them against some human-assigned labels (not topic-based) to see how well they correspond. But that is a topic for another thread :-)