Can I extract y-values (data labels) from within cross-validation in scikit-learn?

My text classification pipeline has these steps:

  • Chunking, with a custom transformer that takes several parameters (input: XML text file; output: a set of documents and the labels for those documents)
  • Vectorization, with TfidfVectorizer (input: list of documents; output: DxF matrix, where D is the number of documents and F is the number of features)
  • Sparse-to-dense matrix transformer (input: sparse matrix; output: dense matrix)
  • Dimensionality reduction, with PCA or a similar technique (input: DxF matrix; output: DxN matrix, where N is a parameter: the number of desired components)
  • Prediction, using GaussianMixture (input: DxN matrix; output: cluster assignments, i.e. a grouping of the documents)
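
For concreteness, here is a minimal sketch of steps 2 through 5 as a scikit-learn Pipeline. The custom chunker from step 1 is left out because its code is not shown; the toy documents, the helper names to_dense and build_pipeline, and all parameter values are assumptions made up for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array (hypothetical helper)."""
    return X.toarray()


def build_pipeline(n_components=2, n_clusters=2, random_state=0):
    """Assemble steps 2-5; the chunker (step 1) is assumed to have run already."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),                                  # docs -> sparse DxF
        ("dense", FunctionTransformer(to_dense, accept_sparse=True)),  # sparse -> dense
        ("pca", PCA(n_components=n_components,
                    random_state=random_state)),                       # DxF -> DxN
        ("gmm", GaussianMixture(n_components=n_clusters,
                                random_state=random_state)),           # DxN -> clusters
    ])


# Toy documents standing in for the chunker output (assumption).
docs = [
    "the cat sat on the mat",
    "a cat and a kitten on the mat",
    "cats chase mice around the mat",
    "stock prices fell on the market",
    "the market rallied as stock prices rose",
    "investors sold stock as the market fell",
]

pipeline = build_pipeline()
clusters = pipeline.fit_predict(docs)  # one cluster id per document
print(clusters)
```

The dense step is there only because PCA does not accept sparse input; a decomposition that works on sparse matrices directly would make it unnecessary.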

There are so many parameters for each of these steps that trying every combination by hand is impractical, so I am trying to run a grid search with cross-validation using GridSearchCV. It can use a scorer to compare the output groupings with the true groupings (the labels). (I am using the scorer metrics.adjusted_rand_score().)
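
To illustrate the grid search itself, here is a rough sketch under one big assumption: the true labels are already available as a separate list y that is passed to fit(). That is the easy case, not the one where the chunker produces the labels inside the pipeline. The toy data, the helper to_dense, and the parameter grid are all made up for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array (hypothetical helper)."""
    return X.toarray()


pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("pca", PCA(random_state=0)),
    ("gmm", GaussianMixture(n_components=2, random_state=0)),
])

# Toy documents and their true groups (assumption: labels known up front).
docs = [
    "the cat sat on the mat", "a cat and a kitten on the mat",
    "cats chase mice around the mat", "the kitten sleeps on the mat",
    "a cat chased the kitten", "mice hide from the cat",
    "stock prices fell on the market", "the market rallied as stock prices rose",
    "investors sold stock as the market fell", "stock investors watch the market",
    "prices rose and investors bought stock", "the market fell as prices dropped",
]
y = [0] * 6 + [1] * 6

# Illustrative grid only; real grids would cover the parameters of every step.
param_grid = {
    "pca__n_components": [2, 3],
    "gmm__covariance_type": ["diag", "spherical"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    # The scorer calls pipeline.predict(X_test) and compares it with y_test.
    scoring=make_scorer(adjusted_rand_score),
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(docs, y)  # y is ignored by the steps when fitting, used only for scoring
print(search.best_params_, search.best_score_)
```

The adjusted Rand score is invariant to how cluster ids are numbered, which is why it can compare cluster assignments against class labels directly.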

1, chunker, , 2, 2-4, . , , 1, , , 1. , 1 , .

Question: can GridSearchCV extract the y-values (data labels) from within the cross-validation?

: , , . ( .)


Source: https://habr.com/ru/post/1664069/

