I am trying to classify text using a scikit-learn decision tree and a pandas DataFrame. First I built a DataFrame that looks like this:
   cat1  cat2  corpus                             title
0     0     1  Test Test Test                     erster titel
1     1     0  Test Super Super                   zweiter titel
2     0     1  Test Test Test                     dritter titel
3     0     1  Test Super Test                    vierter titel
4     1     0  Super Test Super                   fuenfter titel
5     1     1  Super einfacher Test Super         fuenfter titel
6     1     1  Super simple einfacher Test Super  fuenfter titel
Then I create a TF-IDF matrix:
_matrix = generate_tf_idf_matrix(training_df['corpus'].values)
This returns a SciPy CSR matrix (CountVectorizer -> TfidfTransformer).
For my classifier I would like to use:
train_X = _matrix
train_Y = training_df[['cat1','cat2']]
for multi-label classification.
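To make the intended setup concrete, here is a small self-contained sketch of fitting a tree on the sparse matrix with both label columns at once. It assumes DecisionTreeClassifier as the estimator and default vectorizer settings; the random_state value is arbitrary and only there for reproducibility:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier

# Same labels and corpus as in the DataFrame above (title column omitted).
training_df = pd.DataFrame({
    "cat1": [0, 1, 0, 0, 1, 1, 1],
    "cat2": [1, 0, 1, 1, 0, 1, 1],
    "corpus": [
        "Test Test Test",
        "Test Super Super",
        "Test Test Test",
        "Test Super Test",
        "Super Test Super",
        "Super einfacher Test Super",
        "Super simple einfacher Test Super",
    ],
})

# Sparse TF-IDF features; no conversion to a dense array is needed.
counts = CountVectorizer().fit_transform(training_df["corpus"].values)
train_X = TfidfTransformer().fit_transform(counts)

# Two label columns give a target of shape (n_samples, 2).
train_Y = training_df[["cat1", "cat2"]]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(train_X, train_Y)

# One prediction per document and per label column.
predictions = clf.predict(train_X)
print(predictions.shape)
```

The tree accepts the sparse matrix directly and treats the two columns as a multi-output target.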
My question is:
How can I split my DataFrame and my CSR matrix into a training set and a test set? If I split the DataFrame before creating the matrix, the resulting CSR matrices have different numbers of columns, because the documents in each part contain different features (terms).
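The mismatch can be reproduced with a short sketch. It assumes the first five rows are used for training and the last two for testing (an arbitrary split chosen for illustration), and that a fresh vectorizer is fitted on each part, as happens when the helper is called once per part:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def tf_idf(documents):
    # A new vocabulary is learned on every call, which is the root of the problem.
    counts = CountVectorizer().fit_transform(documents)
    return TfidfTransformer().fit_transform(counts)


# Assumed split: rows 0-4 for training, rows 5-6 for testing.
train_docs = [
    "Test Test Test",
    "Test Super Super",
    "Test Test Test",
    "Test Super Test",
    "Super Test Super",
]
test_docs = [
    "Super einfacher Test Super",
    "Super simple einfacher Test Super",
]

train_matrix = tf_idf(train_docs)
test_matrix = tf_idf(test_docs)

# The column counts differ because each part has its own vocabulary.
print(train_matrix.shape, test_matrix.shape)
```

The training part only contains the terms "test" and "super", while the test part also contains "einfacher" and "simple", so the two matrices cannot be used with the same classifier.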
Constraint: I do not want to convert my matrix to a dense array just so that I can split it easily.