NLTK document clustering: no terms remain after pruning?

I have 900 different text files loaded into my console, about 3.5 million words in total. I am running the document clustering algorithm shown here, and I am having problems with TfidfVectorizer. Here is what I am looking at:

from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.4, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1,3))

store_matrix = {}
for key,value in speech_dict.items():
    tfidf_matrix = tfidf_vectorizer.fit_transform(value) #fit the vectorizer to synopses
    store_matrix[key] = tfidf_matrix

This code runs until it raises ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df. However, the code does not fail if I change max_df to 0.99 and lower min_df to 0.01. Then it seems to run forever, since it keeps essentially all 3.5 million terms.

How can I get around this?

My text files are stored in speech_dict, whose keys are file names and whose values are the text of each file.
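
To illustrate, here is a toy example (made-up documents, not my real data) that seems to reproduce the same error:

from sklearn.feature_extraction.text import TfidfVectorizer

# three toy documents with no shared vocabulary
toy_docs = ["apple banana", "cherry durian", "elderberry fig"]

# min_df=0.4 means a term must appear in at least 0.4 * 3 = 1.2 documents,
# i.e. in 2 of them; every term here appears in exactly 1, so all are pruned
vec = TfidfVectorizer(max_df=0.8, min_df=0.4)
vec.fit_transform(toy_docs)
# ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.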

According to the scikit-learn documentation for the TF-IDF vectorizer:

max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
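
To make the float-versus-int distinction concrete, here is a small sketch on a made-up four-document corpus (nothing here comes from the question):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran", "the dog ran", "the dog barked"]

# float thresholds are proportions of documents: keep terms appearing
# in at least 50% and at most 90% of the 4 documents
v_float = CountVectorizer(min_df=0.5, max_df=0.9)
v_float.fit(docs)
print(sorted(v_float.vocabulary_))  # ['cat', 'dog', 'ran'] -- 'the' is in 4/4 docs, above max_df

# integer thresholds are absolute document counts: keep terms appearing
# in at least 2 and at most 3 documents
v_int = CountVectorizer(min_df=2, max_df=3)
v_int.fit(docs)
print(sorted(v_int.vocabulary_))    # same vocabulary in this case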

Check how your terms are distributed in totalvocab_stemmed_body, and set these two thresholds accordingly. Two scenarios illustrate what can go wrong.

Scenario 1: number of documents = 20,000,000, min_df=0.5.

If your terms are spread thin (each appearing in only, say, 2 documents), then min_df=0.5 keeps only terms that appear in at least 10,000,000 documents (20,000,000 * 0.5), so virtually everything is pruned.

Scenario 2: number of documents = 200, max_df=0.95.

If your terms are very common (appearing in, say, all 200 documents), then max_df=0.95 removes every term that appears in more than 190 documents (200 * 0.95), treating them as corpus-specific stop words.

Pick min_df and max_df based on how your terms are actually distributed across your documents.
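
A rough sketch of how to inspect those document frequencies, assuming each value of speech_dict is the full text of one file and reusing tokenize_and_stem from the question:

from sklearn.feature_extraction.text import CountVectorizer

texts = list(speech_dict.values())  # one document per file

# binary=True counts each term at most once per document,
# so the column sums are document frequencies
dfv = CountVectorizer(tokenizer=tokenize_and_stem, binary=True)
X = dfv.fit_transform(texts)

doc_freq = X.sum(axis=0).A1  # document frequency of every term
n_docs = X.shape[0]

# how many terms would survive the thresholds from the question?
survivors = ((doc_freq >= 0.4 * n_docs) & (doc_freq <= 0.8 * n_docs)).sum()
print(n_docs, int(doc_freq.max()), int(survivors))

A quick histogram of doc_freq then shows which band of min_df / max_df values leaves a usable vocabulary.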

