Multilabel learning for text data: ValueError with partial_fit

I am trying to build a multilabel text classifier. As described here, the idea is to read (large-scale) text data sets in batches and partially fit the classifiers on each batch. In addition, when instances have several labels, as described here, the idea is to build as many binary classifiers as there are classes in the data set, in a One-Vs-All manner.

When combining the MultiLabelBinarizer and OneVsRestClassifier classes of sklearn with partial fitting, I get the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The code is as follows:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]

mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)
clf = OneVsRestClassifier(MultinomialNB(alpha=0.01))

X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)
clf.partial_fit(X_train, Y_train, classes=categories)

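In practice, the last three lines would be applied to each mini-batch; I have left that loop out for simplicity.

If I remove OneVsRestClassifier and use MultinomialNB on its own, the code runs fine.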


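You are passing y_train as transformed by MultiLabelBinarizer, that is, in the form [[1, 1, 0], [0, 1, 0], [1, 1, 0]], but you are passing the categories as ['a','b','c']. Both then go through this check in the code: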

if np.setdiff1d(y, self.classes_):
    raise ValueError(("Mini-batch contains {0} while classes " +
                      "must be subset of {1}").format(np.unique(y),
                                                      self.classes_))

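The condition evaluates to an array with more than one element, such as [False, True, ..]. An if statement cannot reduce such an array to a single truth value, hence the error.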
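The failure can be reproduced with NumPy alone. The following is a minimal sketch, not the scikit-learn source: the numeric `classes` array is a hypothetical stand-in chosen so that both arrays share a dtype, and `leftover`, `message`, and `has_unknown` are illustrative names.

```python
import numpy as np

# Binary indicator matrix, as produced by MultiLabelBinarizer in the question.
Y_train = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 0]])

# Hypothetical numeric class ids (an assumption, used only for illustration).
classes = np.array([10, 11, 12])

# Values that occur in Y_train but not in classes.
leftover = np.setdiff1d(Y_train, classes)

# Using a multi-element array directly as an `if` condition raises ValueError.
try:
    if leftover:
        pass
    message = None
except ValueError as exc:
    message = str(exc)

# An explicit size check has a single, unambiguous truth value.
has_unknown = leftover.size > 0

print(leftover)
print(message)
print(has_unknown)
```

Testing the size of the difference, rather than the array itself, is one way to write such a check so that it always yields a single boolean.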

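First of all, you should pass the classes in the same numerical format as Y_train. However, even if you do that, the internal label_binarizer_ of OneVsRestClassifier will decide that the target is of type "multiclass" rather than multilabel, and will then refuse to transform the classes correctly. In my opinion, this is a bug in OneVsRestClassifier and/or LabelBinarizer.

Please submit an issue about partial_fit on the scikit-learn GitHub and see what happens.

Update: apparently, deciding between "multilabel" and "multiclass" from the target vector (y) is an ongoing issue in scikit-learn, because of the complications involved.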
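Until that is resolved, one possible workaround is to do the one-vs-rest split by hand: keep one binary MultinomialNB per label and call partial_fit on the matching column of the indicator matrix. This is a minimal sketch and not part of the answer above. It assumes a recent scikit-learn release, where HashingVectorizer takes alternate_sign=False in place of non_negative=True; `label_clfs`, `partial_fit_batch`, and `predict_labels` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['a', 'b', 'c']
mlb = MultiLabelBinarizer(classes=categories)
# alternate_sign=False keeps the hashed features non-negative for MultinomialNB.
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)

# One independent binary classifier per label (manual one-vs-rest).
label_clfs = [MultinomialNB(alpha=0.01) for _ in categories]

def partial_fit_batch(texts, labels):
    """Update every per-label classifier with one mini-batch."""
    X_batch = vectorizer.transform(texts)
    Y_batch = mlb.fit_transform(labels)  # classes are fixed, so columns are stable
    for j, clf in enumerate(label_clfs):
        # Each column is a plain binary target with classes 0 and 1.
        clf.partial_fit(X_batch, Y_batch[:, j], classes=[0, 1])

def predict_labels(texts):
    """Return a binary indicator matrix with one column per category."""
    X_new = vectorizer.transform(texts)
    return np.column_stack([clf.predict(X_new) for clf in label_clfs])

X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'], ['b'], ['a', 'b']]

partial_fit_batch(X, Y)  # in practice, call this once per mini-batch
pred = predict_labels(X)
print(pred)
```

Because every column is an ordinary binary problem, the classes argument is always [0, 1], which sidesteps the multiclass versus multilabel detection entirely. The price is that label correlations are ignored.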


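This may not be the answer you were expecting, but instead of OneVsRestClassifier you could consider scikit-multilearn, a library built on top of scikit-learn that offers more advanced multi-label classifiers than plain OneVsRest.

See the scikit-multilearn documentation for usage examples, and Tsoumakas's introduction to MLC for an overview of the approaches.

If your labels tend to co-occur, I would suggest a different classifier, for example Label Powerset combined with a division of the label space based on the label co-occurrence graph.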
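For readers unfamiliar with Label Powerset, the transformation itself is small: every distinct combination of labels becomes one class of an ordinary single-label problem. The sketch below shows only that mapping with NumPy. It is an illustration under that assumption, not the scikit-multilearn implementation, and `combos`, `y_single`, and `Y_back` are illustrative names.

```python
import numpy as np

# Indicator matrix for the toy data: rows are samples, columns are labels a, b, c.
Y_train = np.array([[1, 1, 0], [0, 1, 0], [1, 1, 0]])

# Each distinct row (label combination) is assigned one class id.
combos, y_single = np.unique(Y_train, axis=0, return_inverse=True)
y_single = y_single.reshape(-1)  # flatten, since the shape varies across NumPy versions

# A single-label classifier would be trained on y_single; its predictions are
# decoded back into label sets by looking up the combination for each class id.
Y_back = combos[y_single]

print(combos)
print(y_single)
```

The number of classes grows with the number of distinct label combinations, which is the reason for dividing the label space into smaller clusters first.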

Converted to scikit-multilearn, your code would look like this:

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset

categories = ['a', 'b', 'c']
X = ["This is a test", "This is another attempt", "And this is a test too!"]
Y = [['a', 'b'],['b'],['a','b']]

mlb = MultiLabelBinarizer(classes=categories)
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18, non_negative=True)

X_train = vectorizer.fit_transform(X)
Y_train = mlb.fit_transform(Y)

# base single-label classifier 
base_classifier = MultinomialNB(alpha=0.01)

# problem transformation from multi-label to single-label 
transformation_classifier = LabelPowerset(base_classifier)

# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True) 

# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)

clf.fit(X_train, Y_train)

Source: https://habr.com/ru/post/1669966/