My text classification pipeline has these steps:
- Chunking, with a custom transformer that takes several parameters (input: XML text file; output: a set of documents and the labels for those documents)
- Vectorization, with TfidfVectorizer (input: list of documents; output: DxF matrix, where D is the number of documents and F is the number of features)
- Sparse-to-dense matrix transformation (input: sparse matrix; output: dense matrix)
- Dimensionality reduction, with PCA or a similar technique (input: DxF matrix; output: DxN matrix, where N is a parameter: the number of desired components)
- Prediction, using GaussianMixture (input: DxN matrix; output: cluster assignments, i.e. a grouping of the documents)
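For concreteness, the vectorization-to-prediction steps could be wired together roughly as follows. This is only a sketch: the custom chunker is left out because its interface is not shown here, and the names `to_dense`, `build_pipeline`, the toy `docs` list, and the choice of `covariance_type="diag"` are illustrative assumptions, not part of my actual code.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert the sparse TF-IDF matrix to a dense array (PCA step expects dense here)."""
    return X.toarray()


def build_pipeline(n_components=2, n_clusters=2):
    """Hypothetical helper: vectorize -> densify -> reduce -> cluster."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
        ("pca", PCA(n_components=n_components, random_state=0)),
        # "diag" is an assumption that keeps the toy example stable on tiny data.
        ("gmm", GaussianMixture(n_components=n_clusters,
                                covariance_type="diag", random_state=0)),
    ])


# Toy stand-in for the chunker output (made-up documents).
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play on the mat",
    "the kitten chased the cat",
    "cats sleep on the warm mat",
    "stock prices fell on the market",
    "the market rallied as stock prices rose",
    "investors sold stock in the market",
    "bond and stock markets closed higher",
]

pipeline = build_pipeline()
assignments = pipeline.fit_predict(docs)  # one cluster index per document
print(assignments)
```

The step names (`tfidf`, `dense`, `pca`, `gmm`) matter later, because a grid search addresses parameters as `<step>__<parameter>`.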
Each of these steps has so many parameters that checking every combination by hand is impractical, so I am trying to run a cross-validated grid search with GridSearchCV(). It can use a scorer to compare the output groupings against the source groupings (the labels). (I am using metrics.adjusted_rand_score() as the scorer.)
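A minimal sketch of that grid search, again without the custom chunker: the documents, labels, parameter grid, fold setup, and `covariance_type="diag"` below are made-up assumptions chosen only so the example runs on tiny data.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, make_scorer
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a sparse matrix to a dense array."""
    return X.toarray()


pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("pca", PCA(random_state=0)),
    ("gmm", GaussianMixture(n_components=2, covariance_type="diag",
                            random_state=0)),
])

# Toy documents and their source groups (hypothetical data).
docs = [
    "the cat sat on the mat", "a kitten played with the cat",
    "cats sleep on the warm mat", "the cat chased a mouse",
    "my kitten likes the mat", "a cat and a mouse on the mat",
    "stock prices fell on the market", "the market rallied on stock news",
    "investors sold stock today", "bond and stock markets closed higher",
    "the market opened lower", "stock investors watch the market",
]
labels = [0] * 6 + [1] * 6

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "pca__n_components": [2, 3],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    # Compares predicted cluster assignments with the source groups.
    scoring=make_scorer(adjusted_rand_score),
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(docs, labels)  # labels are only used by the scorer, not by the GMM

print(search.best_params_)
print(search.best_score_)
```

The adjusted Rand index ignores how the clusters are numbered, so it does not matter that GaussianMixture's cluster indices are arbitrary relative to the source labels.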
1, chunker, , 2, 2-4, . , , 1, , , 1. , 1 , .
: , CVGridSearch , , ?
: , , . ( .)