Extract intermediate features from a pipeline in scikit-learn (Python)

I use a pipeline very similar to the one given in this example:

>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])

over which I run GridSearchCV to find the best estimator over a parameter grid.

However, I would like to get the column names of my training set from the get_feature_names() method of CountVectorizer(). Is this possible without setting up a separate CountVectorizer() outside the pipeline?

2 answers

Using the get_params() method, you can access the individual steps of the pipeline and their parameters. Here is an example that accesses the 'vect' step:

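 from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
 from sklearn.naive_bayes import MultinomialNB
 from sklearn.pipeline import Pipeline
 text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())])
 print(text_clf.get_params()['vect'])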

gives (for me)

 CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None) 

I did not fit the pipeline to any data in this example, so calling get_feature_names() at this point will raise an error.
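Once the pipeline has been fitted, the vectorizer step holds a vocabulary and the feature names can be read from it. The following is only a rough sketch: the toy corpus and labels are made up for illustration, and because newer scikit-learn releases replace get_feature_names() with get_feature_names_out(), the code checks which of the two is available rather than assuming a version.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented for this example only.
docs = ["the cat sat", "the dog barked", "cat and dog"]
labels = [0, 1, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(docs, labels)

# The fitted vectorizer is the same object that sits inside the pipeline.
vect = text_clf.named_steps["vect"]

# Newer releases expose get_feature_names_out(); older ones get_feature_names().
if hasattr(vect, "get_feature_names_out"):
    feature_names = [str(name) for name in vect.get_feature_names_out()]
else:
    feature_names = [str(name) for name in vect.get_feature_names()]

print(feature_names)
```

The names come back in column order, so feature_names[i] labels column i of the matrix that the 'vect' step passes on to 'tfidf'.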


For reference:

 The estimators of a pipeline are stored as a list in the steps attribute:

 >>> clf.steps[0]
 ('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

 and as a dict in named_steps:

 >>> clf.named_steps['reduce_dim']
 PCA(copy=True, n_components=None, whiten=False)

from http://scikit-learn.org/stable/modules/pipeline.html
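Applied to the GridSearchCV case from the question, the same named_steps lookup should work on the refitted best pipeline, which the search exposes as best_estimator_ when refit is left at its default. This is a sketch under assumptions: the corpus, the labels, the parameter grid and cv=2 are all invented here to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration; three documents per class.
docs = [
    "the cat sat on the mat",
    "cats purr and nap",
    "a cat chased the mouse",
    "the dog barked loudly",
    "dogs fetch the ball",
    "a dog chased the cat",
]
labels = [0, 0, 0, 1, 1, 1]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Hypothetical grid; parameters are addressed as <step name>__<parameter>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

grid = GridSearchCV(text_clf, param_grid, cv=2)
grid.fit(docs, labels)

# best_estimator_ is the pipeline refitted on all data with the best parameters.
best_vect = grid.best_estimator_.named_steps["vect"]

# Use whichever accessor this scikit-learn version provides.
if hasattr(best_vect, "get_feature_names_out"):
    names = [str(name) for name in best_vect.get_feature_names_out()]
else:
    names = [str(name) for name in best_vect.get_feature_names()]

print(grid.best_params_)
print(names)
```

No separate CountVectorizer() is needed outside the pipeline; the fitted one is read back from the search result.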


Source: https://habr.com/ru/post/1233534/

