How to accurately classify text with many potential values using scikit?

I have many blacklisted terms that I want to find in the body of text paragraphs. Each term is about 1-5 words long and contains certain keywords that I don't want in my document body. If a term, or anything similar to it, appears in the corpus, I want it removed.

Removal aside, I am struggling to accurately detect these terms in my corpus. I am using scikit-learn and have tried two different approaches:

  • A multinomial Naive Bayes classification approach using tf-idf vector features, with a mix of blacklisted terms and clean terms as training data (a rough sketch of this follows the list).

  • A OneClassSVM approach, in which only the blacklisted keywords are used as training data, and any text passed in that does not resemble the blacklisted terms is treated as an outlier.
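
For reference, here is a minimal sketch of what the multinomial NB approach looks like; the file name, column names, and 0/1 labels below are placeholders rather than my real data:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # hypothetical training file: a 'Keyword' column plus a 'Label'
    # column (1 = blacklisted, 0 = clean)
    train_df = pd.read_csv("keyword_training_labeled.csv")

    nb_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 5))),
        ('clf', MultinomialNB()),
    ])
    nb_pipeline.fit(train_df['Keyword'], train_df['Label'])

    # predicts 1 (blacklisted) or 0 (clean) for a candidate term
    print(nb_pipeline.predict(["some candidate term"]))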

Here is the code for my OneClassSVM approach:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import KFold
    from sklearn.pipeline import Pipeline
    from sklearn.svm import OneClassSVM

    df = pd.read_csv("keyword_training_blacklist.csv")
    keywords_list = df['Keyword']

    pipeline = Pipeline([
        # strings to token integer counts
        ('vect', CountVectorizer(analyzer='char_wb', max_df=0.75, min_df=1,
                                 ngram_range=(1, 5))),
        # integer counts to weighted TF-IDF scores
        ('tfidf', TfidfTransformer(use_idf=False, norm='l2')),
        # one-class SVM trained on the TF-IDF vectors
        ('clf', OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)),
    ])

    kf = KFold(n_splits=8)
    for train_index, test_index in kf.split(keywords_list):
        # make training and testing datasets
        X_train, X_test = keywords_list.iloc[train_index], keywords_list.iloc[test_index]
        pipeline.fit(X_train)  # train on blacklisted terms only
        predicted = pipeline.predict(X_test)
        # fraction of held-out blacklisted terms recognised as inliers (+1)
        print(predicted[predicted == 1].size / predicted.size)

    csv_df = pd.read_csv("corpus.csv")
    testCorpus = csv_df['Terms'].drop_duplicates()
    for s in testCorpus:
        if pipeline.predict([s])[0] == 1:
            print(s)

In practice, I get a lot of false positives when I apply the trained model to my corpus. My blacklisted training data is about 3,000 terms. Does my training data need to be increased, or am I missing something obvious?

1 answer

Try using difflib to find the closest match in the corpus for each of your blacklisted terms.

    import difflib
    from nltk.util import ngrams

    words = corpus.split(' ')  # split corpus into words on spaces (can be improved)

    words_ngrams = []  # ngrams from 1 to 5 words
    for n in range(1, 6):
        words_ngrams.extend(' '.join(gram) for gram in ngrams(words, n))

    to_delete = []  # will hold (index, length) tuples of matched spans to delete from corpus
    sim_rate = 0.8  # similarity rate
    max_matches = 4  # maximum number of matches for each term

    for term in terms:  # terms: your list of blacklisted terms
        matches = difflib.get_close_matches(term, words_ngrams,
                                            n=max_matches, cutoff=sim_rate)
        for match in matches:
            to_delete.append((corpus.index(match), len(match)))
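
The snippet above only records the spans to remove. One way to finish the job, as a rough sketch (my assumptions here: spans are deleted back-to-front so earlier indices stay valid, and overlapping matches are not merged):

    # delete matched spans from the end of the corpus towards the
    # beginning, so indices collected earlier remain valid
    for start, length in sorted(set(to_delete), reverse=True):
        corpus = corpus[:start] + corpus[start + length:]

Note that corpus.index(match) only finds the first occurrence of each match, so repeated terms would need extra handling.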

You can also use difflib.SequenceMatcher if you want a similarity score between the terms and the ngrams.
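
For example (a minimal sketch; the two strings are made up):

    import difflib

    # ratio() returns a similarity score between 0.0 and 1.0
    score = difflib.SequenceMatcher(None, "blacklisted term",
                                    "black listed terms").ratio()
    print(score)  # a value close to 1.0 means the strings are very similar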


Source: https://habr.com/ru/post/1244827/

