“Making an object persistent” basically means dumping the binary representation of the object held in memory to a file on your hard drive, so that later in the same program, or in any other program, the object can be reloaded from that file.
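To make that concrete, here is a minimal sketch using only the stdlib pickle module (on Python 3, pickle already uses the C implementation that cPickle provided on Python 2). The `settings` dict is a made-up stand-in for any picklable object; it is not part of the original question.

```python
import pickle

# Hypothetical stand-in for any picklable object.
settings = {'k': 5000, 'labels': ['spam', 'ham']}

# Serialize: the in-memory object becomes a plain byte string.
blob = pickle.dumps(settings, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize: the bytes rebuild an equal, but distinct, object.
restored = pickle.loads(blob)
```

Writing `blob` to a file and reading it back later is all that persistence adds on top of this round trip.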
Either the joblib that comes with scikit-learn, or the stdlib pickle and cPickle modules, will do the job. I prefer cPickle because it is significantly faster. Using IPython's %timeit command:
>>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
>>> t = TFIDF()
>>> t.fit_transform(['hello world', 'this is a test'])
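Outside IPython, a similar measurement can be sketched with the stdlib timeit module. This does not reproduce the joblib versus cPickle comparison; it only shows the timing method. The `payload` dict is a hypothetical stand-in for a fitted vectorizer, and `seconds_per_call` is a helper name invented for this sketch.

```python
import pickle
import timeit

def seconds_per_call(func, repeats=20):
    """Average wall-clock seconds per call of func (a rough %timeit)."""
    return timeit.timeit(func, number=repeats) / repeats

# Hypothetical stand-in for a fitted model: a large vocabulary mapping.
payload = {'term%d' % i: i for i in range(10000)}

pickle_time = seconds_per_call(
    lambda: pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
print('pickle.dumps: %.6f s per call' % pickle_time)
```

To compare serializers, pass each one to `seconds_per_call` in turn with the same object, for example a lambda wrapping `joblib.dump(payload, path)`.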
Now, if you want to save several objects in one file, you can simply put them in a data structure and then dump that structure. This works with a tuple, list or dict. For example, taking the code from your question:
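from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# train (corpus and y_train come from the question)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)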
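The dump step itself might look like the sketch below. The file name `storage.bin` and the keys `'vectorizer'` and `'selector'` match the ones used when reloading. Everything else is an assumption: the two small dicts are hypothetical stand-ins for the fitted objects so that the sketch runs without scikit-learn, stdlib pickle stands in for cPickle (same `dump`/`load` interface), and the temporary directory is only there to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted vectorizer and selector.
vectorizer = {'vocabulary': {'hello': 0, 'world': 1}}
selector = {'k': 5000}

# Bundle everything to save into one data structure.
data_struct = {'vectorizer': vectorizer, 'selector': selector}

# Assumed location; in practice any writable path works.
path = os.path.join(tempfile.mkdtemp(), 'storage.bin')

# 'with' guarantees the file is closed; 'wb' because pickle writes bytes.
with open(path, 'wb') as f:
    pickle.dump(data_struct, f, protocol=pickle.HIGHEST_PROTOCOL)
```

A dict is convenient here because the objects are retrieved by name rather than by position, so adding another object later does not break existing loading code.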
Later, or in another program, the following statements load the data structure back into your program's memory:
import cPickle  # Python 2; on Python 3 use: import pickle as cPickle

# reload
with open('storage.bin', 'rb') as f:
    data_struct = cPickle.load(f)
vectorizer, selector = data_struct['vectorizer'], data_struct['selector']