Sklearn: split a Pandas DataFrame and a CSR matrix into training and test sets

I am trying to classify text using scikit-learn's DecisionTree and a pandas DataFrame. First I built a DataFrame that looks like this:

   cat1  cat2                             corpus           title
0     0     1                     Test Test Test    erster titel
1     1     0                   Test Super Super   zweiter titel
2     0     1                     Test Test Test   dritter titel
3     0     1                    Test Super Test   vierter titel
4     1     0                   Super Test Super  fuenfter titel
5     1     1         Super einfacher Test Super  fuenfter titel
6     1     1  Super simple einfacher Test Super  fuenfter titel

Then I create a TF-IDF matrix:

    _matrix = generate_tf_idf_matrix(training_df['corpus'].values)

which returns a CSR matrix (CountVectorizer -> TfidfTransformer).
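For context, a minimal sketch of what such a helper could look like (`generate_tf_idf_matrix` is the question's own name; this body is an assumption based on the "CountVectorizer -> TfidfTransformer" description):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def generate_tf_idf_matrix(corpus):
    """Hypothetical reconstruction: raw counts, then TF-IDF weighting."""
    counts = CountVectorizer().fit_transform(corpus)   # sparse term counts
    return TfidfTransformer().fit_transform(counts)    # sparse TF-IDF matrix

tfidf = generate_tf_idf_matrix(["Test Test Test", "Test Super Super"])
print(tfidf.shape)  # one row per document, one column per vocabulary term
```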

For my classifier I would like to use

    train_X = _matrix
    train_Y = training_df[['cat1','cat2']]

for multi-label classification.

My question is:

How can I split my DataFrame and my CSR matrix into a training and a test set? If I split my data file before creating the matrix, the two CSR matrices end up with different sizes, because each split of my documents produces a different set of features (vocabulary).

Limitation: I do not want to convert my matrix to a dense array just so I can split it easily.


scikit-learn handles this for you: `train_test_split` from `sklearn.cross_validation` (moved to `sklearn.model_selection` in newer releases; see the API docs) accepts SciPy sparse matrices as well as pandas objects, and splits all inputs row-wise in a consistent way:

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

If the class distribution in `y` is imbalanced, use `StratifiedShuffleSplit` instead, so the class proportions are preserved in both the training and the test set.

So with `X = _matrix` and `y = training_df[['cat1', 'cat2']]`, the scikit-learn functions split both the matrix and the DataFrame without converting anything to a dense array.
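A self-contained sketch of that split on toy data shaped like the question's (the frame here is an illustration, not the asker's real data; in current scikit-learn the import path is `sklearn.model_selection`):

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "corpus": ["Test Test Test", "Test Super Super", "Super Test Super",
               "Super einfacher Test", "Test Super Test", "einfacher Test"],
    "cat1": [0, 1, 1, 1, 0, 0],
    "cat2": [1, 0, 0, 1, 1, 1],
})

X = TfidfVectorizer().fit_transform(df["corpus"])  # CSR matrix, 6 rows
y = df[["cat1", "cat2"]]

# Rows of the sparse matrix and of the label frame are split consistently;
# X_train / X_test stay sparse throughout.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
print(X_train.shape[0], X_test.shape[0])
```

Because the TF-IDF matrix is built on the full corpus first and split afterwards, both halves share the same feature columns, which avoids the size mismatch described in the question.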


Source: https://habr.com/ru/post/1616744/
