Save classifier to disk in scikit-learn
Persisting a scikit-learn classifier is a breeze with joblib. Store classifier clf as 'model.joblib' for future use:
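A minimal sketch, where clf can be any fitted estimator (the LogisticRegression trained on toy data below is just a stand-in):

```python
from joblib import dump
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in training step -- swap in your own fitted classifier.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Persist the fitted classifier to disk.
dump(clf, 'model.joblib')
```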
When you need it later:
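Loading is the mirror image (assuming the 'model.joblib' file written above):

```python
import numpy as np
from joblib import load

# The loaded object behaves exactly like the original fitted classifier.
clf = load('model.joblib')
print(clf.predict(np.zeros((1, 4))))  # input must match the 4 training features
```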
With joblib, even huge numpy arrays can enjoy a ride, with the maximum compression level set to 9.
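For instance (compress accepts integer levels 0-9, or a (method, level) tuple):

```python
from joblib import dump

# Level 9 trades extra CPU time for the smallest file on disk.
dump(clf, 'model.joblib', compress=9)
```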
Handling classifier+vectorizer combo
Often, you're not just saving a classifier but a whole pipeline that includes a classifier and possibly a vectorizer (like TfidfVectorizer). The process is similar:
To save:
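A sketch with a hypothetical toy corpus; bundling TfidfVectorizer and SGDClassifier into one Pipeline lets a single dump call save both:

```python
from joblib import dump
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus and labels -- stand-ins for your real training data.
texts = ['good movie', 'awful movie', 'great film', 'terrible film']
labels = [1, 0, 1, 0]

pipeline = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
pipeline.fit(texts, labels)

# One dump call persists the vectorizer and the classifier together.
dump(pipeline, 'pipeline.joblib')
```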
To load:
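```python
from joblib import load

# The loaded pipeline vectorizes and classifies raw text in one call.
pipeline = load('pipeline.joblib')
print(pipeline.predict(['what a great movie']))
```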
Keeping both elements of your pipeline snug in their '.joblib' files ensures you're ready to transform and predict data when the need arises.
Model efficiency tips
Pickle is great, but joblib is greater when the object to be serialized contains large numpy arrays. It's all about pickling with style, eh?
Prep your model before saving. If clf is your classifier, it must be trained using clf.fit() first. To enhance storage efficiency, remove the stop_words_ attribute from vectorizers before saving.
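A sketch of that trimming step, reusing the pipeline from the previous section (the step name follows make_pipeline's lowercased-class-name convention); the scikit-learn docs note that stop_words_ exists only for introspection and can safely be cleared before pickling:

```python
from joblib import dump

vec = pipeline.named_steps['tfidfvectorizer']
vec.stop_words_ = None  # introspection-only attribute; clearing it shrinks the file
dump(pipeline, 'pipeline_slim.joblib')
```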
Sparsifying and versioning classifiers
Sparse it. For models like SGDClassifier with large coefficient matrices, convert the coefficients to a sparse matrix.
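scikit-learn's linear models ship a sparsify() method for exactly this; a sketch on toy data:

```python
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
clf = SGDClassifier(penalty='l1', random_state=0).fit(X, y)

# sparsify() converts coef_ to a SciPy sparse matrix in place; it pays off
# once regularization has zeroed out many coefficients.
clf.sparsify()
print(sparse.issparse(clf.coef_))  # True
```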
Generation gap is a real thing! Always record the scikit-learn version used to train your model. Pickles are not guaranteed to load across versions, and you don't want any 'version skew'.
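One lightweight approach is to bundle the version string into the saved artifact (a sketch, assuming the clf from earlier):

```python
import sklearn
from joblib import dump

# Store the training-time version next to the model for a sanity check on load.
dump({'model': clf, 'sklearn_version': sklearn.__version__}, 'model.joblib')
```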
Advanced feature handling
Pssst, ensemble methods can act prissy. If you're using RandomForest or the like with custom attributes bolted on, make sure that state is saved properly via the __getstate__() method (where it exists).
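As a purely hypothetical illustration, a subclass that drops an unpicklable attribute before saving (LoggingForest and _log_file are invented for the example):

```python
from sklearn.ensemble import RandomForestClassifier

class LoggingForest(RandomForestClassifier):
    """Hypothetical subclass that carries an unpicklable file handle."""

    def __getstate__(self):
        state = super().__getstate__()
        state.pop('_log_file', None)  # drop the unpicklable extra before saving
        return state

    def __setstate__(self, state):
        super().__setstate__(state)
        self._log_file = None  # reopen or rebuild after loading if needed
```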
Additional features to store:
- Consider storing metadata, say class names, training data characteristics.
- Cross-validation scores or grid search results can help keep track of model context (see the sketch below).
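A sketch of such a bundle (the class names, sample count, and scores below are placeholders):

```python
from joblib import dump

artifact = {
    'model': clf,
    'class_names': ['negative', 'positive'],  # placeholder labels
    'n_training_samples': 100,                # placeholder characteristic
    'cv_scores': [0.91, 0.89, 0.93],          # e.g. from cross_val_score
}
dump(artifact, 'artifact.joblib')
```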
Space-efficient storage
When disk space is a concern, compress=9 in dump comes to the rescue. A little more time, a lot fewer bytes!
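A quick way to see the trade-off for yourself (assuming the clf from earlier):

```python
import os
from joblib import dump

dump(clf, 'model_raw.joblib')
dump(clf, 'model_small.joblib', compress=9)
for path in ('model_raw.joblib', 'model_small.joblib'):
    print(path, os.path.getsize(path), 'bytes')
```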