Python 3 standard library: pickle object serialization

Time:2020-10-21

1. Pickle object serialization

The pickle module implements an algorithm for turning an arbitrary Python object into a series of bytes. This process is also called serializing the object. The byte stream representing the object can then be transmitted or stored, and later reconstructed to create a new object with the same characteristics.

1.1 Encoding and decoding data in strings

The first example uses dumps() to encode a data structure as a byte string, then prints the string to the console. It uses a data structure made up entirely of built-in types. Instances of any class can be pickled as well, as a later example illustrates.

import pickle
import pprint

data = [{'a': 'A', 'b': 2, 'c': 3.0}]
print('DATA:', end=' ')
pprint.pprint(data)

data_string = pickle.dumps(data)
print('PICKLE: {!r}'.format(data_string))

By default, the pickle is written in a binary format that is most compatible when shared between Python 3 programs.
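The format is selected through the protocol argument of dumps(). The following is a minimal sketch, not part of the original example, that serializes the same data with every protocol the running interpreter supports and checks that each one round-trips. The loop and the sizes dictionary are illustrative names; the actual sizes and default protocol depend on the Python version.

```python
import pickle

data = [{'a': 'A', 'b': 2, 'c': 3.0}]

print('default protocol:', pickle.DEFAULT_PROTOCOL)
print('highest protocol:', pickle.HIGHEST_PROTOCOL)

# Illustrative bookkeeping: payload size per protocol number.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(data, protocol=protocol)
    sizes[protocol] = len(payload)
    # Every protocol must reproduce an object equal to the original.
    assert pickle.loads(payload) == data
    print('protocol {}: {} bytes'.format(protocol, len(payload)))
```

loads() detects the protocol from the data itself, so the reader does not need to be told which one the writer used.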

After the data is serialized, it can be written to a file, socket, pipe, or other location. Later, the data can be read back and unpickled to construct a new object with the same values.

import pickle
import pprint

data1 = [{'a': 'A', 'b': 2, 'c': 3.0}]
print('BEFORE: ', end=' ')
pprint.pprint(data1)

data1_string = pickle.dumps(data1)

data2 = pickle.loads(data1_string)
print('AFTER : ', end=' ')
pprint.pprint(data2)

print('SAME? :', (data1 is data2))
print('EQUAL?:', (data1 == data2))

The newly constructed object is equal to the original object, but not the same object.

1.2 Working with streams

In addition to dumps() and loads(), pickle provides convenience functions for working with file-like streams. It is possible to write multiple objects to a stream and then read them back without knowing in advance how many objects were written or how large they are.

import io
import pickle

class SimpleObject:

    def __init__(self, name):
        self.name = name
        self.name_backwards = name[::-1]
        return

data = []
data.append(SimpleObject('pickle'))
data.append(SimpleObject('preserve'))
data.append(SimpleObject('last'))

# Simulate a file.
out_s = io.BytesIO()

# Write to the stream
for o in data:
    print('WRITING : {} ({})'.format(o.name, o.name_backwards))
    pickle.dump(o, out_s)
    out_s.flush()

# Set up a read-able stream
in_s = io.BytesIO(out_s.getvalue())

# Read the data
while True:
    try:
        o = pickle.load(in_s)
    except EOFError:
        break
    else:
        print('READ    : {} ({})'.format(
            o.name, o.name_backwards))

This example uses two BytesIO buffers to simulate streams. The first buffer receives the pickled objects, and its value is fed to a second buffer, which load() reads from. A simple database format could also use pickles to store objects. The shelve module is one such implementation.
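As a rough illustration of the shelve module mentioned above, the sketch below stores a value under a string key and reads it back after reopening the shelf. The file name 'objects' and the key 'config' are made up for the example, and the shelf is kept in a temporary directory so nothing is left behind.

```python
import os
import shelve
import tempfile

record = {'name': 'pickle', 'tags': ['a', 'b'], 'size': 3.0}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical shelf location; shelve may add its own file suffix.
    path = os.path.join(tmp, 'objects')

    # Values are pickled behind the scenes when assigned to a key.
    with shelve.open(path) as db:
        db['config'] = record

    # Reopening the shelf unpickles the value on access.
    with shelve.open(path) as db:
        loaded = db['config']

print('EQUAL?:', loaded == record)
print('SAME? :', loaded is record)
```

As with loads(), the value read from the shelf is an equal but separate object.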

In addition to storing data, pickles are handy for inter-process communication. For example, os.fork() and os.pipe() can be used to establish worker processes that read job instructions from one pipe and write the results to another pipe. The core code for managing the worker pool and sending jobs in and receiving responses can be reused, since the job and response objects do not have to be based on a particular class. When using pipes or sockets, do not forget to flush after dumping each object, to push the data through the connection to the other end. See the multiprocessing module for a reusable worker pool manager.
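The fork-and-pipe arrangement described above might look roughly like the following sketch. It assumes a Unix-like system (os.fork() is not available on Windows), uses a single worker, and treats each job as a string to reverse; run_worker() is a hypothetical helper, not part of pickle. Because the parent sends every job before reading any result, it is only suitable for small batches that fit in the pipe buffers.

```python
import os
import pickle


def run_worker(jobs):
    """Send jobs to a forked child over one pipe, read results from another."""
    job_r, job_w = os.pipe()
    res_r, res_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: close the pipe ends it does not use, then serve jobs.
        status = 1
        try:
            os.close(job_w)
            os.close(res_r)
            with os.fdopen(job_r, 'rb') as job_in, \
                    os.fdopen(res_w, 'wb') as res_out:
                while True:
                    try:
                        job = pickle.load(job_in)
                    except EOFError:
                        break
                    pickle.dump(job[::-1], res_out)
                    res_out.flush()  # push each result to the parent
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code

    # Parent: close the ends owned by the child.
    os.close(job_r)
    os.close(res_w)

    with os.fdopen(job_w, 'wb') as job_out:
        for job in jobs:
            pickle.dump(job, job_out)
            job_out.flush()  # push each job to the child
    # Closing job_out signals end-of-input to the child.

    results = []
    with os.fdopen(res_r, 'rb') as res_in:
        while True:
            try:
                results.append(pickle.load(res_in))
            except EOFError:
                break

    os.waitpid(pid, 0)
    return results


print(run_worker(['pickle', 'preserve', 'last']))
```

The job and result objects here are plain strings, but any picklable object could travel over the same pipes without changing run_worker().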

1.3 Problems reconstructing objects

When working with custom classes, the class being pickled must appear in the namespace of the process reading the pickle. Only the data for the instance is pickled, not the class definition. The class name is used to find the constructor that creates the new object when unpickling. The following example writes instances of a class to a file.

import pickle

class SimpleObject:

    def __init__(self, name):
        self.name = name
        l = list(name)
        l.reverse()
        self.name_backwards = ''.join(l)

if __name__ == '__main__':
    data = []
    data.append(SimpleObject('pickle'))
    data.append(SimpleObject('preserve'))
    data.append(SimpleObject('last'))

    with open('Test.py', 'wb') as out_s:
        for o in data:
            print('WRITING: {} ({})'.format(
                o.name, o.name_backwards))
            pickle.dump(o, out_s)

When you run this script, it creates a file containing the pickled objects. The file name, Test.py, is hard-coded in this version of the script.

A simplistic attempt to load the resulting pickled objects fails.

import pickle

with open('Test.py', 'rb') as in_s:
    while True:
        try:
            o = pickle.load(in_s)
        except EOFError:
            break
        else:
            print('READ: {} ({})'.format(
                o.name, o.name_backwards))

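This version fails because there is no SimpleObject class available in the reading script, so pickle raises an AttributeError when it tries to look up the class.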
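The failure can be reproduced inside a single script. The sketch below is an illustration rather than the original reader script: it pickles an instance, removes the class name from the module namespace to imitate a process that never defined it, and then restores the name. The variables payload, saved_class, and error are names chosen for the example.

```python
import pickle


class SimpleObject:

    def __init__(self, name):
        self.name = name
        self.name_backwards = name[::-1]


payload = pickle.dumps(SimpleObject('pickle'))

# The pickle records the class by name; it does not contain its code.
print('NAME IN PICKLE:', b'SimpleObject' in payload)

# Imitate a reader that has no SimpleObject in its namespace.
saved_class = SimpleObject
del SimpleObject
error = None
try:
    pickle.loads(payload)
except AttributeError as err:
    error = err
    print('LOAD FAILED:', err)

# Once the name is available again, the same bytes load correctly.
SimpleObject = saved_class
restored = pickle.loads(payload)
print('READ: {} ({})'.format(restored.name, restored.name_backwards))
```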

A corrected version imports SimpleObject from the original script, here the module demo. Adding this import statement to the end of the import list allows the script to find the class and construct the object.

from demo import SimpleObject

Running the modified script now produces the desired results.

1.4 Unpicklable objects

Not all objects can be pickled. Sockets, file handles, database connections, and other objects with runtime state that depends on the operating system or another process may not be able to be saved in a meaningful way. Objects that have unpicklable attributes can define __getstate__() and __setstate__() to return a subset of the state of the instance to be pickled.

The __getstate__() method must return an object containing the internal state of the object being pickled. One convenient way to represent that state is with a dictionary, but the value can be any picklable object. The state is stored, and then passed to __setstate__() when the object is loaded from the pickle.

import pickle

class State:

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return 'State({!r})'.format(self.__dict__)

class MyClass:

    def __init__(self, name):
        print('MyClass.__init__({})'.format(name))
        self._set_name(name)

    def _set_name(self, name):
        self.name = name
        self.computed = name[::-1]

    def __repr__(self):
        return 'MyClass({!r}) (computed={!r})'.format(
            self.name, self.computed)

    def __getstate__(self):
        state = State(self.name)
        print('__getstate__ -> {!r}'.format(state))
        return state

    def __setstate__(self, state):
        print('__setstate__({!r})'.format(state))
        self._set_name(state.name)

inst = MyClass('name here')
print('Before:', inst)

dumped = pickle.dumps(inst)

reloaded = pickle.loads(dumped)
print('After:', reloaded)

This example uses a separate State object to hold the internal state of MyClass. When an instance of MyClass is loaded from a pickle, __setstate__() is passed a State instance to use to initialize the object.
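The dictionary-based approach mentioned earlier is a common way to leave out an attribute that cannot be pickled. The following sketch assumes a hypothetical Counter class that holds a threading.Lock, which pickle refuses to serialize; __getstate__() drops the lock from a copy of the instance dictionary and __setstate__() creates a fresh one.

```python
import pickle
import threading


class Counter:
    """Hypothetical class with one unpicklable attribute."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self._lock = threading.Lock()  # locks cannot be pickled

    def increment(self):
        with self._lock:
            self.count += 1

    def __getstate__(self):
        # Copy the instance dictionary and leave out the lock.
        state = self.__dict__.copy()
        del state['_lock']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and build a new lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()


# Pickling the lock directly fails.
try:
    pickle.dumps(threading.Lock())
except TypeError as err:
    print('Lock:', err)

counter = Counter('hits')
counter.increment()

clone = pickle.loads(pickle.dumps(counter))
clone.increment()
print('original:', counter.name, counter.count)
print('clone   :', clone.name, clone.count)
```

Copying __dict__ before deleting the key matters: removing the entry from the live dictionary would strip the lock from the original instance as well.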

1.5 Circular references

The pickle protocol automatically handles circular references between objects, so complex data structures do not need any special handling.

import pickle

class Node:
    """A simple digraph
    """
    def __init__(self, name):
        self.name = name
        self.connections = []

    def add_edge(self, node):
        "Create an edge between this node and the other."
        self.connections.append(node)

    def __iter__(self):
        return iter(self.connections)

def preorder_traversal(root, seen=None, parent=None):
    """Generator function to yield the edges in a graph.
    """
    if seen is None:
        seen = set()
    yield (parent, root)
    if root in seen:
        return
    seen.add(root)
    for node in root:
        recurse = preorder_traversal(node, seen, root)
        for parent, subnode in recurse:
            yield (parent, subnode)

def show_edges(root):
    "Print all the edges in the graph."
    for parent, child in preorder_traversal(root):
        if not parent:
            continue
        print('{:>5} -> {:>2} ({})'.format(
            parent.name, child.name, id(child)))

# Set up the nodes.
root = Node('root')
a = Node('a')
b = Node('b')
c = Node('c')

# Add edges between them.
root.add_edge(a)
root.add_edge(b)
a.add_edge(b)
b.add_edge(a)
b.add_edge(c)
a.add_edge(a)

print('ORIGINAL GRAPH:')
show_edges(root)

# Pickle and unpickle the graph to create
# a new set of nodes.
dumped = pickle.dumps(root)
reloaded = pickle.loads(dumped)

print('\nRELOADED GRAPH:')
show_edges(reloaded)

The reloaded nodes are not the same objects as the originals, but the relationships between the nodes are maintained, and only one copy of an object with multiple references is reloaded. Both points can be verified by examining the id() values of the nodes before and after passing them through pickle.
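The same two properties can be checked directly with identity tests instead of comparing id() values by eye. This is a small sketch using only built-in containers; the names shared and graph are illustrative and unrelated to the Node class above.

```python
import pickle

shared = ['shared']
graph = {'left': shared, 'right': shared}
graph['self'] = graph  # a reference cycle

reloaded = pickle.loads(pickle.dumps(graph))

# The reloaded structure is made of new objects...
print('SAME GRAPH?  :', reloaded is graph)
print('SAME LIST?   :', reloaded['left'] is shared)

# ...but sharing and cycles inside it are preserved.
print('STILL SHARED?:', reloaded['left'] is reloaded['right'])
print('STILL CYCLE? :', reloaded['self'] is reloaded)
```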