I have a large list of blacklisted terms that I want to find in the body of text paragraphs. Each term is about 1 to 5 words long and contains certain keywords that I don't want in my document body. If a term, or something similar to it, appears in the corpus, I want it removed from the corpus.
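Exact matching against the blacklist would be easy; the difficulty is that I also want to catch terms that are merely similar to the blacklisted ones. Just to illustrate the trivial case (the blacklist and paragraph below are made up):

import re

# Made-up illustration of the trivial exact-match case; my real problem is
# catching terms that are only *similar* to the blacklisted ones.
blacklist = ["forbidden phrase", "secret project name"]
paragraph = "This paragraph mentions a forbidden phrase among ordinary text."

for term in blacklist:
    if term in paragraph:
        paragraph = re.sub(re.escape(term), "", paragraph)
print(paragraph)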
Leaving the removal itself aside, I am struggling to reliably identify these terms in my corpus. I am using scikit-learn and have tried two different approaches:
A multinomial Naive Bayes classification approach using tf-idf vectors, trained on a combination of blacklisted terms and clean (non-blacklisted) terms (a rough sketch is shown right after this list).
A OneClassSVM approach, in which only the blacklisted keywords are used as training data, and any incoming text that does not resemble the blacklisted terms is treated as an outlier.
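For reference, the Naive Bayes variant is set up roughly along these lines (a minimal sketch; keyword_training_clean.csv and its Keyword column are placeholder names for my file of clean terms):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Placeholder file/column names for the two training sets
black_df = pd.read_csv("keyword_training_blacklist.csv")   # blacklisted terms
clean_df = pd.read_csv("keyword_training_clean.csv")       # clean (allowed) terms

X = pd.concat([black_df['Keyword'], clean_df['Keyword']], ignore_index=True)
y = [1] * len(black_df) + [0] * len(clean_df)               # 1 = blacklisted, 0 = clean

nb_pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer='char_wb', ngram_range=(1, 5))),  # character n-gram counts
    ('tfidf', TfidfTransformer()),                                      # tf-idf weighting
    ('clf', MultinomialNB()),                                           # binary blacklisted/clean classifier
])
nb_pipeline.fit(X, y)
print(nb_pipeline.predict(["some candidate term"]))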
Here is the code for my OneClassSVM approach:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import OneClassSVM
from sklearn.model_selection import KFold

df = pd.read_csv("keyword_training_blacklist.csv")
keywords_list = df['Keyword']

pipeline = Pipeline([
    # strings to character n-gram counts
    ('vect', CountVectorizer(analyzer='char_wb', max_df=0.75, min_df=1, ngram_range=(1, 5))),
    # counts to l2-normalised term frequencies (idf disabled)
    ('tfidf', TfidfTransformer(use_idf=False, norm='l2')),
    # one-class SVM trained only on blacklisted terms
    ('clf', OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)),
])

kf = KFold(n_splits=8)
for train_index, test_index in kf.split(keywords_list):
    # make training and testing datasets
    X_train, X_test = keywords_list.iloc[train_index], keywords_list.iloc[test_index]
    pipeline.fit(X_train)                  # fit the one-class SVM on the training folds (no labels)
    predicted = pipeline.predict(X_test)
    # fraction of held-out blacklist terms recognised as inliers (+1)
    print(predicted[predicted == 1].size / predicted.size)

# apply the pipeline (as fitted on the last fold) to the corpus terms
csv_df = pd.read_csv("corpus.csv")
testCorpus = csv_df['Terms'].drop_duplicates()
for s in testCorpus:
    if pipeline.predict([s])[0] == 1:      # +1 means it looks like the blacklisted training data
        print(s)
In practice, I get a lot of false positives when I apply the classifier to my corpus. The blacklisted training data is about 3,000 terms. Should I be increasing my training data, or am I missing something obvious?