I have 900 different text files loaded, about 3.5 million words in total. I am following the document clustering approach shown here, and I am running into problems with TfidfVectorizer. Here is the relevant code:
from sklearn.feature_extraction.text import TfidfVectorizer

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.4, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

store_matrix = {}
for key, value in speech_dict.items():
    tfidf_matrix = tfidf_vectorizer.fit_transform(value)  # fit the vectorizer to the texts
    store_matrix[key] = tfidf_matrix
This code runs until it fails with ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df. The error only goes away if I raise max_df to 0.99 and lower min_df to 0.01, but then it seems to run forever, since it ends up keeping basically all 3.5 million terms.
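If it helps, I think I can reproduce the same error on a tiny made-up corpus (this is not my real data), which makes me suspect the min_df/max_df thresholds themselves. My understanding is that, as floats, they are fractions of documents, so a term must appear in at least 40% but at most 80% of the documents to survive:

from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents: "the" and "cat" appear in all of them,
# the remaining words appear in only one document each.
docs = ["the cat sat", "the cat ran", "the cat slept"]

# max_df=0.8 prunes terms in more than 80% of documents ("the", "cat"),
# min_df=0.4 prunes terms in fewer than 40% of documents (everything else),
# so no terms remain and fit_transform raises the same ValueError.
vec = TfidfVectorizer(max_df=0.8, min_df=0.4)
vec.fit_transform(docs)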
How can I get around this?
My text files are stored in speech_dict, whose keys are file names and whose values are text.
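For reference, speech_dict looks roughly like this (the file names and text below are made up):

speech_dict = {
    'speech_001.txt': 'full text of the first file ...',
    'speech_002.txt': 'full text of the second file ...',
    # ... about 900 entries in total
}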