Python: using scikit-learn to predict gives empty predictions

I work in customer support, and I use scikit-learn to predict tags for our tickets, given the set of training tickets (about 40,000 tickets in the training set).

I use a classification model based on this. It predicts just "()" as the tags for many of my test tickets, even though none of the tickets in the training set are untagged.

My training data for tags is a list of lists, for example:

tags_train = [['international_solved'], ['from_build_guidelines my_new_idea eligibility'], ['dropbox other submitted_faq submitted_help'], ['my_new_idea_solved'], ['decline macro_backer_paypal macro_prob_errored_pledge_check_credit_card_us loading_problems'], ['dropbox macro__turnaround_time other plq__turnaround_time submitted_help'], ['dropbox macro_creator__logo_style_guide outreach press submitted_help']] 

My training data for ticket descriptions, on the other hand, is just a list of strings, for example:

 descs_train = ['description of ticket one', 'description of ticket two', etc] 

Here is the relevant part of my code to build the model:

    import numpy as np
    import scipy
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # We have lists called tags_train, descs_train, tags_test, descs_test
    # with the test and train data
    X_train = np.array(descs_train)
    y_train = tags_train
    X_test = np.array(descs_test)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC(class_weight='auto')))])

    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)

However, "predicted" gives a list that looks like this:

 predicted = [(), ('account_solved',), (), ('images_videos_solved',), ('my_new_idea_solved',), (), (), (), (), (), ('images_videos_solved', 'account_solved', 'macro_launched__edit_update other tips'), ('from_guidelines my_new_idea', 'from_guidelines my_new_idea macro__eligibility'), ()] 

I don't understand why it predicts an empty () when no ticket in the training set is untagged. Shouldn't it predict the nearest tag? Can anyone recommend any improvements to the model I'm using?

Thank you so much for your help in advance!

2 answers

The problem is your tags_train variable. According to the OneVsRestClassifier documentation, the targets should be a "sequence of sequences of labels", whereas each of your targets is a one-element list.
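If the space-separated strings in the question's tags_train are meant to hold several tags each (that is an assumption on my part, the question does not say so explicitly), the targets can be reshaped with plain Python before fitting. A minimal sketch; raw_tags_train and split_tags are hypothetical names, and the sample rows are copied from the question:

```python
# Sample in the question's format: each target is a one-element list whose
# single string holds all tags, separated by spaces (assumed meaning).
raw_tags_train = [
    ['international_solved'],
    ['from_build_guidelines my_new_idea eligibility'],
    ['dropbox other submitted_faq submitted_help'],
]


def split_tags(raw_tags):
    """Turn [['a b c'], ...] into [('a', 'b', 'c'), ...]."""
    return [
        tuple(tag for entry in row for tag in entry.split())
        for row in raw_tags
    ]


tags_train = split_tags(raw_tags_train)
print(tags_train)
```

After this step every target is a tuple with one entry per tag, which is the "sequence of sequences" shape the documentation describes.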

Below is an edited, self-contained, working version of your code. Note the change in tags_train: each target is now a tuple of labels, and a ticket with a single tag gets a singleton tuple such as ('label',).

    import numpy as np
    import scipy
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # We have lists called tags_train, descs_train, tags_test, descs_test
    # with the test and train data
    tags_train = [('label',), ('international', 'solved'), ('international', 'open')]
    descs_train = ['description of ticket one', 'some other ticket two', 'label']

    X_train = np.array(descs_train)
    y_train = tags_train
    X_test = np.array(descs_train)  # predict on the training texts for the demo

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC(class_weight='auto')))])

    classifier = classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)
    print(predicted)

Output:

 [('international',), ('international',), ('international', 'open')] 
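For anyone reading this on a recent scikit-learn: as far as I know, newer releases no longer accept a sequence of label sequences as a multilabel target and expect a binary indicator matrix instead, and class_weight='auto' has been replaced by 'balanced'. A rough sketch of the same toy example under those assumptions, using MultiLabelBinarizer to convert the tuples; this is my adaptation, not something taken from the question:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

tags_train = [('label',), ('international', 'solved'), ('international', 'open')]
descs_train = ['description of ticket one', 'some other ticket two', 'label']

# Encode the label tuples as a 0/1 matrix with one column per distinct tag.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(tags_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='balanced')))])
classifier.fit(descs_train, Y_train)

# predict() returns an indicator matrix; map it back to tuples of tag names.
Y_pred = classifier.predict(descs_train)
predicted = mlb.inverse_transform(Y_pred)
print(predicted)
```

The exact predictions depend on the scikit-learn version and on the tiny training set, so I have not listed an expected output here.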

I am still getting () predictions, even after converting each target from a one-element list into a sequence of labels.
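One likely reason, as far as I understand it: OneVsRestClassifier fits a separate binary classifier per tag, and a ticket receives a tag only when that tag's classifier scores it positive. If every classifier scores a ticket negative, the prediction is an empty tuple, even though no training ticket was untagged. A possible workaround is to fall back to the highest-scoring tag whenever nothing is positive. The sketch below uses hard-coded, hypothetical scores and label names to show the idea; with a fitted pipeline the scores would be of the kind returned by decision_function, and predict_with_fallback is a made-up helper name:

```python
# Hypothetical tag names and per-tag decision scores (one row per ticket).
labels = ['account_solved', 'images_videos_solved', 'my_new_idea_solved']
scores = [
    [-0.8, -0.3, -1.2],  # every score negative: plain predict() would give ()
    [0.4, -0.9, 0.1],    # two positive scores: both tags are kept
]


def predict_with_fallback(scores, labels):
    """Keep all tags with a positive score; if none, keep the best-scoring one."""
    predictions = []
    for row in scores:
        chosen = tuple(label for label, s in zip(labels, row) if s > 0)
        if not chosen:
            best = max(range(len(row)), key=row.__getitem__)
            chosen = (labels[best],)
        predictions.append(chosen)
    return predictions


print(predict_with_fallback(scores, labels))
# [('images_videos_solved',), ('account_solved', 'my_new_idea_solved')]
```

This always returns at least one tag per ticket, which matches the "nearest tag" behaviour asked for in the question, at the cost of sometimes assigning a tag the model was not confident about.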


Source: https://habr.com/ru/post/946550/

