
Save classifier to disk in scikit-learn

By Alex Kataev · Dec 25, 2024
TLDR

Persisting a scikit-learn classifier is a breeze with joblib:

Store classifier clf as 'model.joblib' for future use:

from joblib import dump

dump(clf, 'model.joblib', compress=9)  # Set maximum compression level to 9 because we love efficiency.

When you need it later:

from joblib import load

clf = load('model.joblib')  # Loading... hold on, almost there!

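joblib is built to serialize objects that hold huge numpy arrays, and compress=9 selects the highest compression level (the scale runs from 0 to 9), trading slower saves and loads for a smaller file.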
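To see the whole round trip in one place, here is a minimal sketch. The iris dataset, the LogisticRegression model, and the temporary directory are assumptions chosen only to keep the example self-contained; swap in your own trained clf and path.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative data and model (assumptions, not part of the original recipe).
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    dump(clf, model_path, compress=9)  # save the fitted classifier
    restored = load(model_path)        # read it back into a new object

# The restored classifier should predict exactly like the original one.
same_predictions = bool((clf.predict(X) == restored.predict(X)).all())
print(same_predictions)
```

Comparing predictions before and after is a cheap sanity check worth keeping in any save script.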

Handling classifier+vectorizer combo

Often, you're not just saving a classifier but a whole pipeline, which includes a classifier and possibly a vectorizer (like TfidfVectorizer). The process is similar:

To save:

dump(vectorizer, 'vectorizer.joblib')
dump(clf, 'classifier.joblib')  # One last selfie with clf before it goes for a long sleep!

To load:

vectorizer = load('vectorizer.joblib')  # Rise and shine, Vectorizer!
clf = load('classifier.joblib')  # Good morning, Clf. Ready for work?

Keeping both elements of your pipeline snug in their '.joblib' files ensures you're ready to transform and predict data when the need arises.
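If juggling two files feels fragile, one alternative is to wrap both steps in a scikit-learn Pipeline and dump that single object. This is a sketch of that variation, not the two-file approach shown above; the toy texts, labels, and file name are made up for illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical, for illustration only).
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then classifies it.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    pipe_path = os.path.join(tmp_dir, "text_pipeline.joblib")
    dump(pipe, pipe_path)
    restored_pipe = load(pipe_path)

# The restored pipeline accepts raw strings; no separate transform step needed.
original_out = list(pipe.predict(texts))
restored_out = list(restored_pipe.predict(texts))
```

With a single file, the vectorizer and classifier cannot drift out of sync between saves.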

Model efficiency tips

Pickle is great, but joblib is greater when the object to be serialized encases large numpy arrays. It's all about pickling with style, eh?

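Prep your model before saving. Train the classifier with clf.fit() first, because an unfitted estimator has nothing learned to persist. To shrink the file, drop the vectorizer's stop_words_ attribute before dumping (on scikit-learn versions that expose it): it exists only for introspection, can grow large, and is not needed by transform().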
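A minimal sketch of trimming stop_words_ before saving. The documents and the max_df setting are assumptions for illustration, and the hasattr guard is there because not every scikit-learn version exposes this attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical). "the" appears in every document, so max_df=0.5
# drops it from the vocabulary during fitting.
docs = ["the cat sat", "the dog ran", "the bird flew", "the fish swam"]
vectorizer = TfidfVectorizer(max_df=0.5).fit(docs)

before = vectorizer.transform(docs).toarray()

# Only touch the attribute if this scikit-learn version has it.
if hasattr(vectorizer, "stop_words_"):
    vectorizer.stop_words_ = None

# transform() relies on vocabulary_ and idf_, not on stop_words_.
after = vectorizer.transform(docs).toarray()
unchanged = bool((before == after).all())
```

After this step, dump(vectorizer, ...) works as usual, just with less baggage.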

Sparsifying and versioning classifiers

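Sparse it. For linear models like SGDClassifier with large coefficient matrices, call clf.sparsify() to convert coef_ to a scipy sparse matrix before saving. This only pays off when most coefficients are zero (for example after L1-penalized training); on a dense coef_ it can actually use more memory.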
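Here is a small sketch of sparsifying a linear model before persisting it. The synthetic dataset and the L1 penalty settings are assumptions picked to produce many zero coefficients; your own model may not be sparse enough to benefit.

```python
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data (illustrative): 50 features, only 5 of them informative.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

# An L1 penalty pushes many coefficients to exactly zero.
clf = SGDClassifier(penalty="l1", alpha=0.01, random_state=0).fit(X, y)

clf.sparsify()  # converts coef_ to a scipy.sparse matrix in place
is_sparse = sparse.issparse(clf.coef_)

# The sparsified model still predicts as usual.
predictions = clf.predict(X)
```

Calling clf.densify() reverses the conversion, which is needed before any further training.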

Generation gap is a real thing! Always record the scikit-learn version used to train your model. Loading a file with a different scikit-learn version is not guaranteed to work and may produce errors or silently different results, so you don't want any 'version-skew'.
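One way to keep the version next to the model is to dump a small dictionary instead of the bare estimator. The bundle layout and key names below are a hypothetical convention, not something joblib or scikit-learn prescribes.

```python
import os
import tempfile

import sklearn
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Hypothetical bundle layout: the model plus the version that trained it.
bundle = {"model": clf, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "model_bundle.joblib")
    dump(bundle, bundle_path)
    loaded = load(bundle_path)

# Refuse to use the model if the installed version differs from training time.
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError(
        f"Model trained with scikit-learn {loaded['sklearn_version']}, "
        f"but {sklearn.__version__} is installed."
    )
restored_clf = loaded["model"]
```

Raising an error is the strict choice; logging a warning instead is a reasonable softer policy.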

Advanced feature handling

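Pssst, ensemble methods can act prissy, mostly about file size. RandomForest and suchlike are saved with dump() like any other estimator, since scikit-learn estimators already define the __getstate__() they need. Custom estimators or wrappers that define their own __getstate__()/__setstate__() must make sure everything they hold can be pickled.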

Additional features to store:

  • Consider storing metadata, such as class names and training data characteristics.
  • Cross-validation scores or grid search results help you keep track of the context the model was trained in.
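The extras listed above can live in a small JSON file next to the model. This is only a sketch: the file names, the metadata keys, and the choice of iris plus LogisticRegression are all assumptions for illustration.

```python
import json
import os
import tempfile

from joblib import dump
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
clf.fit(X, y)

# Hypothetical metadata schema; use whatever keys suit your project.
metadata = {
    "class_names": iris.target_names.tolist(),
    "n_samples": int(X.shape[0]),
    "n_features": int(X.shape[1]),
    "cv_scores": cv_scores.tolist(),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    dump(clf, os.path.join(tmp_dir, "model.joblib"))
    meta_path = os.path.join(tmp_dir, "model_metadata.json")
    with open(meta_path, "w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2)
    with open(meta_path, encoding="utf-8") as fh:
        reloaded_metadata = json.load(fh)
```

JSON keeps the metadata readable without loading the model itself, which is handy when you only want to inspect what a file contains.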

Space-efficient storage

When disk space is a concern, compress=9 in dump comes to the rescue. Saving and loading take a little more time, but the file takes a lot fewer bytes!
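To check the trade-off on your own model, you can dump it at a few compression levels and compare file sizes. The sketch below assumes a small random forest on synthetic data; actual savings depend on the model and are not guaranteed to match.

```python
import os
import tempfile

from joblib import dump
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small forest (illustrative assumptions).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

sizes = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for level in (0, 3, 9):  # 0 = no compression, 9 = maximum
        path = os.path.join(tmp_dir, f"model_c{level}.joblib")
        dump(clf, path, compress=level)
        sizes[level] = os.path.getsize(path)

for level, size in sizes.items():
    print(f"compress={level}: {size} bytes")
```

A middle level such as 3 is often a sensible compromise when save and load speed matter too.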