How to store TfidfVectorizer for future use in scikit-learn?

I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection.

 vectorizer = TfidfVectorizer()
 X_train = vectorizer.fit_transform(corpus)
 selector = SelectKBest(chi2, k=5000)
 X_train_sel = selector.fit_transform(X_train, y_train)

Now I want to save these and use them in other programs. I do not want to re-run TfidfVectorizer() and the feature selector on the training dataset. How can I do it? I know how to make a model persistent using joblib , but I wonder whether the same approach works for the vectorizer and the selector.

3 answers

You can simply use the built-in pickle lib:

 pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
 pickle.dump(selector, open("selector.pickle", "wb"))

and load them back with:

 vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
 selector = pickle.load(open("selector.pickle", "rb"))

Pickle serializes the objects to disk and loads them back into memory when you need them.
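A minimal end-to-end sketch of this answer, tying it back to the question: fit the vectorizer and selector, pickle them, reload them, and transform unseen text. The corpus, labels, and k=2 are toy values chosen so the example runs standalone.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# toy training data (stand-ins for the question's corpus and y_train)
corpus = ["the cat sat", "the dog barked", "cats and dogs"]
y_train = [0, 1, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=2)  # small k for this toy corpus
X_train_sel = selector.fit_transform(X_train, y_train)

# persist both fitted objects
with open("vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)
with open("selector.pickle", "wb") as f:
    pickle.dump(selector, f)

# reload (e.g. in another program) and transform unseen text
with open("vectorizer.pickle", "rb") as f:
    vectorizer2 = pickle.load(f)
with open("selector.pickle", "rb") as f:
    selector2 = pickle.load(f)

new_vec = selector2.transform(vectorizer2.transform(["the cat barked"]))
print(new_vec.shape)  # (1, 2): one document, k=2 selected features
```

Using `with open(...)` (rather than the bare `open()` calls above) also guarantees the file handles are closed after dumping.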

pickle lib docs


Here is my answer using joblib:

 joblib.dump(vectorizer, 'vectorizer.pkl')
 joblib.dump(selector, 'selector.pkl')

Later I can load them and be ready to go:

 vectorizer = joblib.load('vectorizer.pkl')
 selector = joblib.load('selector.pkl')
 test = selector.transform(vectorizer.transform(['this is test']))

“Creating a persistent object” basically means flushing the in-memory binary representation of the object to a file on your hard drive, so that later, in the same program or in any other program, the object can be reloaded from that file back into memory.

Either scikit-learn's bundled joblib , or the stdlib pickle and cPickle , will do the job. I prefer cPickle because it is much faster. Using IPython's %timeit magic:

 >>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
 >>> t = TFIDF()
 >>> t.fit_transform(['hello world'], ['this is a test'])

 # generic serializer - deserializer test
 >>> def dump_load_test(tfidf, serializer):
 ...:     with open('vectorizer.bin', 'w') as f:
 ...:         serializer.dump(tfidf, f)
 ...:     with open('vectorizer.bin', 'r') as f:
 ...:         return serializer.load(f)

 # joblib has a slightly different interface
 >>> def joblib_test(tfidf):
 ...:     joblib.dump(tfidf, 'tfidf.bin')
 ...:     return joblib.load('tfidf.bin')

 # Now, time it!
 >>> %timeit joblib_test(t)
 100 loops, best of 3: 3.09 ms per loop
 >>> %timeit dump_load_test(t, pickle)
 100 loops, best of 3: 2.16 ms per loop
 >>> %timeit dump_load_test(t, cPickle)
 1000 loops, best of 3: 879 µs per loop

Now, if you want to save several objects in one file, you can simply create a data structure to hold them and then dump that structure. This works with a tuple , list , or dict . Applied to your question:

 # train
 vectorizer = TfidfVectorizer()
 X_train = vectorizer.fit_transform(corpus)
 selector = SelectKBest(chi2, k=5000)
 X_train_sel = selector.fit_transform(X_train, y_train)

 # dump as a dict
 data_struct = {'vectorizer': vectorizer, 'selector': selector}

 # use the 'with' keyword to automatically close the file after the dump
 with open('storage.bin', 'wb') as f:
     cPickle.dump(data_struct, f)

Later, or in another program, the following statements will load the data structure back into your program's memory:

 # reload
 with open('storage.bin', 'rb') as f:
     data_struct = cPickle.load(f)
 vectorizer, selector = data_struct['vectorizer'], data_struct['selector']

 # do stuff...
 vectors = vectorizer.transform(...)
 vec_sel = selector.transform(vectors)

Source: https://habr.com/ru/post/1232200/