FeatureUnion is not intended to be used this way. It takes a list of feature extractors / vectorizers and applies each of them to the input. It does not accept data in its constructor as shown.
CountVectorizer expects a sequence of strings. The easiest way to provide one is to concatenate the two columns row by row. This passes the text from both columns to the same CountVectorizer.
combined_2 = df['Subject'] + ' ' + df['body_text']
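As a minimal sketch of this first option, the concatenated text can be fed straight into a single CountVectorizer. The sample rows and the name `combined_text` below are made up for illustration; only the column names come from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data; only the column names match the question.
df = pd.DataFrame({
    'Subject': ['hello world', 'meeting today'],
    'body_text': ['world news', 'agenda for today'],
})

# Join the two text columns row by row into one Series of strings.
combined_text = df['Subject'] + ' ' + df['body_text']

# A single vectorizer learns one shared vocabulary for both columns.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(combined_text)
```

Note that with this approach a word that appears in both the subject and the body is counted in the same feature column, so the model cannot tell which column it came from.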
An alternative is to run CountVectorizer (and optionally TfidfTransformer) separately on each column and then stack the results horizontally.
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# One vectorizer per column, so each column gets its own vocabulary
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

# Stack the two sparse count matrices side by side
combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')
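The third option is to implement your own transformer that extracts a single column from the DataFrame.
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select one column of a DataFrame so a text vectorizer can consume it."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless
        return self

    def transform(self, X, y=None):
        return X[self.column]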
In this case, you can use FeatureUnion on two pipelines, each of which contains your custom transformer followed by a CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)
body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)
feature_union = make_union(subj_pipe, body_pipe)
This feature union takes a DataFrame as input, and each pipeline processes its own column. The result is the horizontal concatenation of the word count matrices from the two columns.
sparse_matrix_of_counts = feature_union.fit_transform(df)
This feature union can also be used as the first step in a larger pipeline.
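For example, here is a rough sketch of such a larger pipeline, repeating the custom transformer so the snippet is self-contained. The toy data, the labels, and the choice of LogisticRegression as the final step are assumptions for illustration only; any estimator that accepts sparse input could take its place.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union


class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select one column of a DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]


# Hypothetical toy data and labels, made up for this example.
df = pd.DataFrame({
    'Subject': ['win a prize', 'meeting today',
                'free prize inside', 'project update'],
    'body_text': ['claim your free prize now', 'agenda for the meeting',
                  'win money fast', 'status of the project'],
})
labels = [1, 0, 1, 0]

# Each inner pipeline extracts one column and vectorizes it.
feature_union = make_union(
    make_pipeline(DataFrameColumnExtracter('Subject'), CountVectorizer()),
    make_pipeline(DataFrameColumnExtracter('body_text'), CountVectorizer()),
)

# The union is the first step; the classifier (an assumption) is the last.
model = make_pipeline(feature_union, LogisticRegression())
model.fit(df, labels)
predictions = model.predict(df)
```

Because the whole thing is a single estimator, the vectorizers are fitted only on the data passed to `fit`, which keeps the vocabulary from leaking across folds during cross-validation.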