How to convert new data into PCA components of my training data?

Suppose I have some text sentences that I want to cluster using KMeans.

    sentences = [
        "fix grammatical or spelling errors",
        "clarify meaning without changing it",
        "correct minor mistakes",
        "add related resources or links",
        "always respect the original author"
    ]

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans

    vectorizer = CountVectorizer(min_df=1)
    X = vectorizer.fit_transform(sentences)

    num_clusters = 2
    km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
    km.fit(X)

Now I can predict which cluster a new text would fall into:

    new_text = "hello world"
    vec = vectorizer.transform([new_text])
    print(km.predict(vec)[0])

However, let's say I use PCA to reduce 10,000 features down to 50.

    from sklearn.decomposition import RandomizedPCA

    pca = RandomizedPCA(n_components=50, whiten=True)
    X2 = pca.fit_transform(X)
    km.fit(X2)

I can no longer predict the cluster of a new text the same way, because the vectorizer's output still has the original dimensionality, while KMeans was fitted on the 50 reduced components:

    new_text = "hello world"
    vec = vectorizer.transform([new_text])
    print(km.predict(vec)[0])
    # ValueError: Incorrect number of features. Got 10000 features, expected 50

So, how can I convert my new text into the lower-dimensional space?

2 answers

You want to use pca.transform on your new data before feeding it into the model. This reduces its dimensionality with the same PCA model that was fitted when you ran pca.fit_transform on the original data. You can then use your fitted KMeans model to predict on the reduced data.
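
In the running example, that means one extra transform step before predicting (same variable names as in the question's snippets):

    new_text = "hello world"
    vec = vectorizer.transform([new_text])   # still 10,000-dimensional counts
    vec_reduced = pca.transform(vec)         # project into the same 50 components as X2
    print(km.predict(vec_reduced)[0])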

Basically, think of it as one large model built by stacking three smaller models. First there is a CountVectorizer model that determines how to turn raw text into feature vectors. Then a RandomizedPCA model performs the dimensionality reduction. Finally, a KMeans model does the clustering. When you fit, you go down the stack and fit each model in turn, and when you want to predict, you go down the stack the same way and apply each one.

    # Initialize models
    vectorizer = CountVectorizer(min_df=1)
    pca = RandomizedPCA(n_components=50, whiten=True)
    km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1)

    # Fit models
    X = vectorizer.fit_transform(sentences)
    X2 = pca.fit_transform(X)
    km.fit(X2)

    # Predict with models
    X_new = vectorizer.transform(["hello world"])
    X2_new = pca.transform(X_new)
    km.predict(X2_new)

Use a Pipeline:

    >>> from sklearn.cluster import KMeans
    >>> from sklearn.decomposition import TruncatedSVD
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> from sklearn.pipeline import make_pipeline
    >>> sentences = [
    ...     "fix grammatical or spelling errors",
    ...     "clarify meaning without changing it",
    ...     "correct minor mistakes",
    ...     "add related resources or links",
    ...     "always respect the original author"
    ... ]
    >>> vectorizer = CountVectorizer(min_df=1)
    >>> svd = TruncatedSVD(n_components=5)
    >>> km = KMeans(n_clusters=2, init='random', n_init=1)
    >>> pipe = make_pipeline(vectorizer, svd, km)
    >>> pipe.fit(sentences)
    Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word',
        binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>,
        encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,
        stop_words=None,...n_init=1, n_jobs=1, precompute_distances='auto',
        random_state=None, tol=0.0001, verbose=1))])
    >>> pipe.predict(["hello, world"])
    array([0], dtype=int32)

(I show TruncatedSVD here because RandomizedPCA will stop accepting term-frequency matrices in an upcoming release; on such sparse input it was really performing an SVD anyway, not a full PCA.)
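
If you prefer the manual style of the first answer, the same swap works outside a Pipeline too. A minimal sketch, reusing the question's sentences list (the variable names here are mine, not from the original post):

    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(min_df=1)
    svd = TruncatedSVD(n_components=5)
    km = KMeans(n_clusters=2, init='random', n_init=1)

    # TruncatedSVD accepts the sparse count matrix directly
    X_svd = svd.fit_transform(vectorizer.fit_transform(sentences))
    km.fit(X_svd)

    # New text goes through the same two transforms before predicting
    vec = vectorizer.transform(["hello, world"])
    print(km.predict(svd.transform(vec))[0])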

