
Saving an Object (Data persistence)

by Anton Shumikhin · Mar 5, 2025
TLDR

In Python, persist objects with the pickle module: pickle.dump(obj, file) stores an object, and pickle.load(file) reads it back. Pickle transmutes Python objects into a byte stream ready for storage, then restores that stream into objects.

Example:

import pickle

# Serialize: "Immortalize" objects to binary format
with open('data.pkl', 'wb') as file:
    pickle.dump({'key': 'value'}, file)  # R2-D2, is that you?

# Deserialize: Revive objects from their digital slumber
with open('data.pkl', 'rb') as file:
    print(pickle.load(file))  # {'key': 'value'}

Note: Only unpickle data you trust; it’s akin to accepting candy from strangers!

Deep dive into serialization

When dealing with multiple objects, it's practical to aggregate them in a list, tuple, or dictionary before serializing. For large data sets, pickle the objects one at a time and read them back through a generator function, so only one object needs to be in memory at once.

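Pass protocol=pickle.HIGHEST_PROTOCOL (or the shorthand -1) to use the newest protocol your interpreter supports, which is generally more compact and faster to dump and load. Keep in mind that files written with a newer protocol may not load on older Python versions.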

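import pickle

# Placeholder objects for illustration; substitute your own
obj1, obj2 = [1, 2, 3], {'a': 1}

# Saving multiple objects? Bundle them up!
data_to_save = {'object_1': obj1, 'object_2': obj2}
with open('data.pkl', 'wb') as file:
    pickle.dump(data_to_save, file, protocol=-1)  # "Engage light speed!"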
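For the large-data case, one option is to write objects to the file one at a time and read them back lazily with a generator. This is a minimal sketch: the helper names save_records and iter_records and the file name records.pkl are made up for illustration and are not part of the pickle API. The file goes to the system temp directory only to keep the demo tidy.

```python
import os
import pickle
import tempfile

def save_records(path, records):
    """Write each record as its own pickle, back to back, in one file."""
    with open(path, 'wb') as file:
        for record in records:
            pickle.dump(record, file, protocol=pickle.HIGHEST_PROTOCOL)

def iter_records(path):
    """Yield records one at a time instead of loading them all at once."""
    with open(path, 'rb') as file:
        while True:
            try:
                yield pickle.load(file)
            except EOFError:  # raised when no pickles remain in the file
                return

# Hypothetical demo file in the temp directory
records_path = os.path.join(tempfile.gettempdir(), 'records.pkl')

save_records(records_path, ({'id': i} for i in range(3)))
for record in iter_records(records_path):
    print(record)
```

Because each pickle.load() call consumes exactly one object from the stream, memory use stays proportional to the largest single record rather than to the whole file.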

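No extra step is needed to supercharge performance in Python 3: the standard pickle module automatically uses its C accelerator (_pickle) whenever it is available, so a plain import pickle already gets you the speed. Importing _pickle directly is unnecessary.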

Picking your persistence sidekick

The dill library is your companion for complex objects that pickle can't handle (lambdas and nested functions, for example), or when your task involves saving the state of the entire interpreter session. Setting up dill is easy-peasy-lemon-squeezy: pip install dill.

import dill

# Save session with dill
with open('session.pkl', 'wb') as file:
    dill.dump_session(file)  # Memory in a can, anyone?

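When working with pandas data structures, such as DataFrame or Series, make use of the to_pickle() method (for example, df.to_pickle('frame.pkl')) and load the result back with pd.read_pickle(). It's a swift way to pickle pandas objects while preserving their native structures and types.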

For those computations that are like deja-vu, anycache enables decorator-based caching, together with cache size management. Tailor your choice of Python libraries to your needs, factoring in ease of use, performance, and data complexity.

Extra tidbits: Beyond Pickle's reach

Transcending data with JSON and databases

While our primary tool is pickle, sometimes JSON or a SQLite database serves better, especially when the data must be read by other systems or languages. The json module is your handy toolkit for serializing the basic built-in types: dict, list, str, int, float, bool, and None. Sets, bytes, and instances of your own classes need to be converted first.
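A minimal round trip with the json module might look like the sketch below. The settings dictionary and the settings.json file name are invented for illustration, and the file is written to the system temp directory just to keep the demo self-contained.

```python
import json
import os
import tempfile

# Hypothetical data made only of JSON-friendly built-in types
settings = {'name': 'demo', 'retries': 3, 'tags': ['a', 'b'],
            'ratio': 0.5, 'enabled': True, 'owner': None}

settings_path = os.path.join(tempfile.gettempdir(), 'settings.json')

# Save as human-readable text
with open(settings_path, 'w', encoding='utf-8') as file:
    json.dump(settings, file, indent=2)

# Load it back
with open(settings_path, encoding='utf-8') as file:
    restored = json.load(file)

print(restored == settings)  # True

# JSON has no tuple type, so tuples come back as lists
print(json.loads(json.dumps((1, 2))))  # [1, 2]
```

Unlike a pickle file, the result can be opened in any text editor and parsed by practically any other language.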

SQLite, a lightweight database, provides structured persistence and can be interacted with using Python's sqlite3 module.
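As a rough sketch of what that looks like with sqlite3, the snippet below creates a table, stores a row, and reads it back. The items table, its columns, and the inventory.db file name are assumptions chosen for the example.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file in the temp directory
db_path = os.path.join(tempfile.gettempdir(), 'inventory.db')

conn = sqlite3.connect(db_path)
try:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            'CREATE TABLE IF NOT EXISTS items (name TEXT PRIMARY KEY, qty INTEGER)'
        )
        # Placeholders (?) keep values out of the SQL string itself
        conn.execute(
            'INSERT OR REPLACE INTO items (name, qty) VALUES (?, ?)',
            ('widget', 4),
        )
    rows = conn.execute('SELECT name, qty FROM items ORDER BY name').fetchall()
finally:
    conn.close()

print(rows)
```

The data lives in an ordinary file, so it survives restarts and can be queried with any SQLite client.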

Bridging objects and relational data using SQLAlchemy ORM

For complex database interactions, SQLAlchemy brings to the table Object-Relational Mapping (ORM), abstracting away SQL intricacies into Python objects.

Serialization for the web

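In a web environment or when dealing with APIs, it's common to serialize objects to JSON with json.dumps(), which returns a string ready to send as a request or response body. Types that JSON doesn't know about, such as dates or your own classes, have to be converted to basic types first.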
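One way to handle a custom object is to turn it into a dictionary and give json.dumps() a fallback for anything it can't encode. This is only a sketch: the User class and the to_jsonable helper are hypothetical names invented here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

@dataclass
class User:  # hypothetical example type
    name: str
    joined: date

def to_jsonable(value):
    """Fallback for json.dumps; called only for types json can't encode."""
    if isinstance(value, date):
        return value.isoformat()
    raise TypeError(f'Cannot serialize {type(value).__name__}')

user = User(name='Ada', joined=date(2025, 3, 5))
payload = json.dumps(asdict(user), default=to_jsonable)
print(payload)  # {"name": "Ada", "joined": "2025-03-05"}
```

The receiving side gets plain strings and numbers, so it is up to the consumer to parse the ISO date back into a date object if it needs one.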

Safe practices and words of caution

Take security seriously: never unpickle data from untrusted sources. A pickle stream can instruct the loader to import and call arbitrary functions, so loading a malicious file can run attacker-chosen code on your machine.
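If you must load pickles whose origin you don't fully control, one mitigation is to subclass pickle.Unpickler and override find_class() so that only an allow-list of globals can be resolved. The sketch below follows that pattern; the allow-list contents and the restricted_loads helper are illustrative choices, and this approach reduces the risk rather than eliminating it.

```python
import builtins
import io
import pickle

# Illustrative allow-list; adjust to what your data actually needs
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Resolve only explicitly allowed names from builtins
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f'global {module}.{name} is forbidden')

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

print(restricted_loads(pickle.dumps([1, 2, range(3)])))  # allowed

try:
    restricted_loads(pickle.dumps(eval))  # pickles a reference to builtins.eval
except pickle.UnpicklingError as error:
    print('Blocked:', error)
```

Plain containers, numbers, and strings never go through find_class(), so ordinary data still loads; only references to importable names are checked.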

Plain text formats (like JSON) are recommended for non-sensitive data, as they are human-readable and loading them does not execute code. For sensitive data, consider encrypting the serialized bytes before writing them to disk.

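To uphold data integrity, store a checksum or hash alongside your serialized data and verify it when deserializing. A plain hash catches accidental corruption; to detect deliberate tampering, use a keyed hash such as an HMAC, since an attacker cannot recompute it without the secret key.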
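A minimal sketch of the keyed-hash approach with the standard hmac and hashlib modules follows. The helper names dumps_signed and loads_signed and the SECRET_KEY placeholder are assumptions for the example; in real use the key would come from secure configuration, not from source code.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b'replace-with-a-real-secret'  # placeholder for illustration
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def dumps_signed(obj):
    """Pickle obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload

def loads_signed(blob):
    """Verify the signature first; unpickle only if it matches."""
    signature, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError('Signature mismatch: data corrupted or tampered with')
    return pickle.loads(payload)

blob = dumps_signed({'key': 'value'})
print(loads_signed(blob))  # {'key': 'value'}

# Flip one bit in the payload to simulate tampering
tampered = blob[:-1] + bytes([blob[-1] ^ 1])
try:
    loads_signed(tampered)
except ValueError as error:
    print('Rejected:', error)
```

Checking the signature before calling pickle.loads() matters: the payload is never deserialized unless it was produced by someone holding the key.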