Extract intermediate features from a pipeline in scikit-learn (Python)

Question

I use a pipeline very similar to the one given in this example:

    >>> text_clf = Pipeline([('vect', CountVectorizer()),
    ...                      ('tfidf', TfidfTransformer()),
    ...                      ('clf', MultinomialNB()),
    ... ])

over which I run GridSearchCV to find the best-scoring parameters on a parameter grid.

However, I would like to get the feature (column) names of my training set using the get_feature_names() method of CountVectorizer(). Is this possible without fitting a separate CountVectorizer() outside the pipeline?

+5

python scikit-learn pipeline

Tanguy Oct 12 '15 at 16:22

2 answers

For reference:

    The estimators of a pipeline are stored as a list in the steps attribute:

    >>> clf.steps[0]
    ('reduce_dim', PCA(copy=True, n_components=None, whiten=False))

    and as a dict in named_steps:

    >>> clf.named_steps['reduce_dim']
    PCA(copy=True, n_components=None, whiten=False)

from http://scikit-learn.org/stable/modules/pipeline.html
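Applied to the pipeline from the question, the same two lookups might look like the sketch below. It assumes the step names 'vect', 'tfidf' and 'clf' used in the question; nothing is fitted yet, so only the unfitted estimator objects are retrieved.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# steps is a list of (name, estimator) tuples in pipeline order.
first_name, first_step = text_clf.steps[0]

# named_steps maps each step name to the same estimator object.
vect = text_clf.named_steps['vect']

print(first_name)
print(vect is first_step)
```

Both lookups return the very same CountVectorizer instance, so whichever one is used after fitting will expose the fitted vocabulary.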

+2

Abtpst Dec 31 '15 at 15:33

NBartley · Accepted Answer · 2015-10-12T18:47:01+0000

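Using get_params(), you can access the individual steps of the pipeline and their parameters. Here is an example that accesses 'vect':

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])
    # get_params() returns a dict that includes each named step
    print(text_clf.get_params()['vect'])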

gives (for me)

 CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

I did not fit the pipeline to any data in this example, so calling get_feature_names() at this point will raise an error.
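To tie this back to the GridSearchCV setup in the question, here is a minimal sketch of fitting first and then reading the feature names from the vectorizer inside the best pipeline. The toy corpus, labels and parameter grid are invented for illustration. It also assumes a reasonably recent scikit-learn, where GridSearchCV lives in sklearn.model_selection and CountVectorizer offers get_feature_names_out(); the code falls back to get_feature_names() on older releases.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = [
    "the cat sat on the mat", "dogs chase cats", "cats purr softly",
    "stocks fell sharply today", "markets rallied on earnings", "bond yields rose",
]
labels = [0, 0, 0, 1, 1, 1]

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Hypothetical grid; keys follow the <step name>__<parameter> convention.
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': [0.1, 1.0]}

search = GridSearchCV(text_clf, param_grid, cv=3)
search.fit(docs, labels)

# With the default refit=True, best_estimator_ is the pipeline refitted
# on all the data, so its 'vect' step holds a fitted vocabulary.
fitted_vect = search.best_estimator_.named_steps['vect']

# Newer releases expose get_feature_names_out(); older ones only have
# get_feature_names().
if hasattr(fitted_vect, 'get_feature_names_out'):
    feature_names = list(fitted_vect.get_feature_names_out())
else:
    feature_names = fitted_vect.get_feature_names()

print(feature_names)
```

The vectorizer is never created or fitted outside the pipeline; it is only looked up after the search has fitted it.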
