Use FeatureUnion in scikit-learn to combine two pandas columns for TF-IDF

Question

I am using this as a model for spam classification, and I would like to use the Subject in addition to the body as a feature.

All of my features are in a pandas DataFrame. For example, the subject is df['Subject'], the body is df['body_text'], and the spam/ham label is df['ham/spam'].

I get the following error: TypeError: FeatureUnion object is not iterable

How can I use both df['Subject'] and df['body_text'] as features in a single run through a pipeline?

    from sklearn.pipeline import FeatureUnion

    features = df[['Subject', 'body_text']].values
    combined_2 = FeatureUnion(list(features))

    pipeline = Pipeline([
        ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
        ('tfidf_transformer', TfidfTransformer()),
        ('classifier', MultinomialNB())])

    pipeline.fit(combined_2, df['ham/spam'])

    k_fold = KFold(n=len(df), n_folds=6)
    scores = []
    confusion = numpy.array([[0, 0], [0, 0]])
    for train_indices, test_indices in k_fold:
        train_text = combined_2.iloc[train_indices]
        train_y = df.iloc[test_indices]['ham/spam'].values

        test_text = combined_2.iloc[test_indices]
        test_y = df.iloc[test_indices]['ham/spam'].values

        pipeline.fit(train_text, train_y)
        predictions = pipeline.predict(test_text)
        prediction_prob = pipeline.predict_proba(test_text)

        confusion += confusion_matrix(test_y, predictions)
        score = f1_score(test_y, predictions, pos_label='spam')
        scores.append(score)

pandas scikit-learn sklearn-pandas

BLodge Jan 10 '16 at 20:11

1 answer

David maust · Accepted Answer · 2016-01-10T20:26:12+0000

FeatureUnion is not meant to be used this way. It takes two or more feature extractors / vectorizers and applies each of them to the input. It does not take data in its constructor, as the code in the question does.
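To make the constructor's expectations concrete, here is a minimal sketch, assuming a toy list of strings (the `texts` data and the step names `words` and `chars` are made up for illustration). FeatureUnion receives (name, transformer) pairs, and the data only shows up when calling fit_transform.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Toy input (hypothetical): one sequence of strings shared by both extractors.
texts = ["win money now", "meeting at noon", "win a free prize"]

# FeatureUnion takes (name, transformer) pairs, not data.
union = FeatureUnion([
    ("words", CountVectorizer()),                 # word-level counts
    ("chars", CountVectorizer(analyzer="char")),  # character-level counts
])

# Every transformer sees the same input; the outputs are stacked column-wise.
features = union.fit_transform(texts)

# Vocabulary sizes of the two fitted vectorizers.
n_words = len(union.transformer_list[0][1].vocabulary_)
n_chars = len(union.transformer_list[1][1].vocabulary_)
```

The resulting matrix has one row per input string and as many columns as the two vocabularies combined.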

CountVectorizer expects a sequence of strings. The easiest way to provide one is to concatenate the two columns. Both texts are then passed to the same CountVectorizer, so they share one vocabulary.

 combined_2 = df['Subject'] + ' ' + df['body_text']
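As a rough sketch of how the concatenated column fits into the pipeline from the question, assuming a tiny made-up DataFrame (the rows below are invented; only the column names come from the question):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (hypothetical); column names follow the question.
df = pd.DataFrame({
    "Subject": ["Win money", "Lunch today", "Free prize", "Project update"],
    "body_text": ["claim your cash now", "see you at noon",
                  "click to claim your prize", "status report attached"],
    "ham/spam": ["spam", "ham", "spam", "ham"],
})

# One string per row: subject and body separated by a space.
combined_2 = df["Subject"] + " " + df["body_text"]

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# The pipeline now receives a sequence of strings, as CountVectorizer expects.
pipeline.fit(combined_2, df["ham/spam"])
predictions = pipeline.predict(combined_2)
```

If either column can contain missing values, the concatenation yields NaN for that row, so filling with empty strings first (for example with fillna('')) is a reasonable precaution.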

An alternative is to run CountVectorizer (and optionally TfidfTransformer) separately on each column, then stack the results horizontally.

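    import scipy.sparse as sp
    from sklearn.feature_extraction.text import CountVectorizer

    subject_vectorizer = CountVectorizer(...)
    subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

    body_vectorizer = CountVectorizer(...)
    # Fit the body vectorizer on the body column
    body_vectors = body_vectorizer.fit_transform(df['body_text'])

    # Stack the two sparse matrices side by side
    combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')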
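One thing this approach leaves to you is applying the same two fitted vectorizers, in the same order, to any new data. A small sketch under assumed toy data (the rows and the `new` frame are invented, and default CountVectorizer settings stand in for the elided arguments):

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Toy training data (hypothetical); column names follow the question.
df = pd.DataFrame({
    "Subject": ["Win money", "Lunch today", "Free prize"],
    "body_text": ["claim your cash now", "see you at noon",
                  "click to claim your prize"],
})

subject_vectorizer = CountVectorizer()
subject_vectors = subject_vectorizer.fit_transform(df["Subject"])

body_vectorizer = CountVectorizer()
body_vectors = body_vectorizer.fit_transform(df["body_text"])

combined_2 = sp.hstack([subject_vectors, body_vectors], format="csr")

# New rows: call transform (not fit_transform) and keep the column order.
new = pd.DataFrame({"Subject": ["Free money"], "body_text": ["claim now"]})
new_vectors = sp.hstack(
    [subject_vectorizer.transform(new["Subject"]),
     body_vectorizer.transform(new["body_text"])],
    format="csr",
)
```

Both matrices end up with the same number of columns, so a classifier trained on combined_2 can score new_vectors.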

A third option is to implement your own transformer that extracts a single DataFrame column.

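    from sklearn.base import BaseEstimator, TransformerMixin

    # BaseEstimator supplies get_params/set_params, which scikit-learn
    # needs whenever the transformer is cloned (e.g. in cross-validation).
    class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):

        def __init__(self, column):
            self.column = column

        def fit(self, X, y=None):
            # Nothing to learn; the transformer is stateless
            return self

        def transform(self, X, y=None):
            # Return the selected column as a pandas Series
            return X[self.column]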

You can then use FeatureUnion on two pipelines, each consisting of your custom transformer followed by CountVectorizer.

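    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline, make_union

    subj_pipe = make_pipeline(
        DataFrameColumnExtracter('Subject'),
        CountVectorizer()
    )

    body_pipe = make_pipeline(
        DataFrameColumnExtracter('body_text'),
        CountVectorizer()
    )

    feature_union = make_union(subj_pipe, body_pipe)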

This union of pipelines takes the DataFrame as input, and each pipeline processes its own column. The result is the column-wise concatenation of the term count matrices from the two columns.

  sparse_matrix_of_counts = feature_union.fit_transform(df)

This feature union can also be used as the first step in a larger pipeline.
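A hedged end-to-end sketch of that larger pipeline follows, assuming a small invented DataFrame and the TfidfTransformer and MultinomialNB steps from the question. It uses KFold(n_splits=...) from sklearn.model_selection, since the KFold(n=..., n_folds=...) form in the question comes from an older scikit-learn API; accuracy is used as the score purely for simplicity.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union


class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select a single column from a DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]


# Toy data (hypothetical); column names follow the question.
df = pd.DataFrame({
    "Subject": ["Win money", "Lunch today", "Free prize",
                "Project update", "Cheap loans", "Team meeting"],
    "body_text": ["claim your cash now", "see you at noon",
                  "click to claim your prize", "status report attached",
                  "low rates apply today", "agenda for monday"],
    "ham/spam": ["spam", "ham", "spam", "ham", "spam", "ham"],
})

# Each inner pipeline vectorizes its own column; the union stacks the counts.
feature_union = make_union(
    make_pipeline(DataFrameColumnExtracter("Subject"), CountVectorizer()),
    make_pipeline(DataFrameColumnExtracter("body_text"), CountVectorizer()),
)

# The union is the first step; TF-IDF weighting and the classifier follow.
model = make_pipeline(feature_union, TfidfTransformer(), MultinomialNB())

# The whole DataFrame is passed as X; each fold refits both vectorizers.
k_fold = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, df, df["ham/spam"], cv=k_fold,
                         scoring="accuracy")
```

Because the vectorizers live inside the pipeline, they are fitted on the training rows of each fold only, which avoids leaking vocabulary from the test rows.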
