A faster Python library that supports serialization and deserialization of more data types

Time:2022-5-13

orjson is a fast, correct JSON library for Python with native support for dataclass, datetime, and numpy types.

Recently, during development, we had to serialize and deserialize a large volume of complex JSON and found the built-in json module very slow. We looked for a third-party library to replace it and found orjson.

Next, let’s take a look at the advantages and disadvantages of orjson compared with other Python JSON libraries.

Serializing instances of data classes (dataclasses.dataclass) is 40-50 times faster than other libraries.

Serializes datetime, date, and time instances to RFC 3339 format, for example, "1970-01-01T00:00:00+00:00".

Serializes numpy.ndarray instances faster than other libraries, with about 0.3 times their memory usage.

Serializes to bytes rather than str, so it is not a drop-in replacement for the json module.

Serializes strings without escaping Unicode to ASCII, for example, "好" instead of "\u597d".

Serializing floating-point data is 10 times faster than other libraries, and deserializing is twice as fast as other libraries.

It conforms strictly to UTF-8 and to the JSON format, and is more correct than the standard library.

It does not provide load() or dump() functions for reading from or writing to file-like objects.
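For reference, the standard-library behavior that several of these points are contrasted with can be checked with the built-in json module alone. This is a minimal sketch that uses only the standard library and does not call orjson; the variable names are illustrative.

```python
import datetime
import json

# json.dumps returns str and, by default, escapes non-ASCII characters.
escaped = json.dumps("好")
print(type(escaped).__name__, escaped)

# Turning escaping off and encoding by hand yields the UTF-8 bytes that a
# bytes-returning serializer would emit for the same string.
utf8_bytes = json.dumps("好", ensure_ascii=False).encode("utf-8")
print(utf8_bytes)

# An RFC 3339 timestamp for the Unix epoch in UTC, built with isoformat().
epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
rfc3339 = epoch.isoformat()
print(rfc3339)
```

The first print shows the escaped form, the second the raw UTF-8 bytes, and the third the timestamp format quoted in the list above.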

Install

Install it directly with pip:

pip install orjson

Simple example

The following is an example including serialization and deserialization:

import datetime

import numpy
import orjson

data = {"emoji": "😂", "integer": 9527, "float": 9.527, "boolean": False,
        "list": ["😂", 9527, 9.527, False],
        "dict": {"key1": "value1", "key2": "value2"},
        "chinese": "您好", "japanese": "こんにちは",
        "created_at": datetime.datetime(1970, 1, 1),
        "status": "🆗", "payload": numpy.array([[1, 2], [3, 4]])}

# Serialize data
data_dumps = orjson.dumps(data, option=orjson.OPT_NAIVE_UTC | orjson.OPT_SERIALIZE_NUMPY)
print(data_dumps)

# Deserialize data_dumps
data_loads = orjson.loads(data_dumps)
print(data_loads)

Running the code above prints:

b'{"emoji":"\xf0\x9f\x98\x82","integer":9527,"float":9.527,"boolean":false,"list":["\xf0\x9f\x98\x82",9527,9.527,false],"dict":{"key1":"value1","key2":"value2"},"chinese":"\xe6\x82\xa8\xe5\xa5\xbd","japanese":"\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf","created_at":"1970-01-01T00:00:00+00:00","status":"\xf0\x9f\x86\x97","payload":[[1,2],[3,4]]}'

{'emoji': '😂', 'integer': 9527, 'float': 9.527, 'boolean': False, 'list': ['😂', 9527, 9.527, False], 'dict': {'key1': 'value1', 'key2': 'value2'}, 'chinese': '您好', 'japanese': 'こんにちは', 'created_at': '1970-01-01T00:00:00+00:00', 'status': '🆗', 'payload': [[1, 2], [3, 4]]}


Serialize

Parameter description

For serialization, you can specify the following two input parameters:

default: to serialize a subclass or an arbitrary type, set default to a callable that returns a supported type. The callable can also enforce rules for unsupported data types by raising an exception such as TypeError.

option: to modify how data is serialized, specify option. In orjson, each option is an integer constant. To specify multiple options, combine them with a bitwise OR, for example: option=orjson.OPT_STRICT_INTEGER | orjson.OPT_NAIVE_UTC.

Serialization default parameter

orjson raises an error when the input contains a type it does not support natively, such as decimal.Decimal.

import decimal
import orjson

orjson.dumps(decimal.Decimal("3.141592653"))

Running the above code will get the following output:

TypeError: Type is not JSON serializable: decimal.Decimal

To make orjson serialize decimal.Decimal values, we can write a function or lambda expression and pass it as the default parameter, as follows.

def default(obj):
    # Convert Decimal to str; reject every other unsupported type.
    if isinstance(obj, decimal.Decimal):
        return str(obj)
    raise TypeError

data = orjson.dumps(decimal.Decimal("0.0842389659712649442845"), default=default)
print(data)

Running the code above prints:

b'"0.0842389659712649442845"'

Serialization option parameter

OPT_APPEND_NEWLINE: append \n to the output.

OPT_INDENT_2: pretty-print the output with an indent of two spaces.

OPT_NAIVE_UTC: serialize datetime.datetime objects that have no tzinfo as UTC. It has no effect on datetime.datetime objects that have tzinfo set.

OPT_NON_STR_KEYS: serialize dict keys of types other than str. Keys may be str, int, float, bool, None, datetime.datetime, datetime.date, datetime.time, enum.Enum, and uuid.UUID.

OPT_OMIT_MICROSECONDS: do not serialize the microsecond field of datetime.datetime and datetime.time instances.

OPT_PASSTHROUGH_DATACLASS: pass dataclasses.dataclass instances through to default, so that their output can be customized.

OPT_PASSTHROUGH_DATETIME: pass datetime.datetime, datetime.date, and datetime.time instances through to default, so that their format can be customized.

OPT_SERIALIZE_NUMPY: serialize numpy.ndarray instances.

OPT_SORT_KEYS: serialize dict keys in sorted order. The default is to serialize in an unspecified order.

OPT_STRICT_INTEGER: enforce a 53-bit limit on integers instead of the default 64 bits.

The code example is as follows:

import dataclasses
import datetime
import uuid

import orjson

# Serialize a dict whose key is a uuid.UUID
orjson.dumps(
    {uuid.UUID("7202d115-7ff3-4c81-a7c1-2a1f067b1ece"): [1, 2]},
    option=orjson.OPT_NON_STR_KEYS,
)

# Serialize a dict whose key is a datetime.datetime
orjson.dumps(
    {datetime.datetime(2021, 1, 1, 0, 0, 0): [1, 2]},
    option=orjson.OPT_NON_STR_KEYS | orjson.OPT_NAIVE_UTC,
)

# Do not serialize the microsecond field
orjson.dumps(
    datetime.datetime(2021, 1, 1, 0, 0, 0, 1),
    option=orjson.OPT_OMIT_MICROSECONDS,
)

# When serializing a dataclass, customize the output through the default parameter
@dataclasses.dataclass
class User:
    id: str
    name: str
    password: str

def default(obj):
    # Emit only name and password; the id field is left out.
    if isinstance(obj, User):
        return {"name": obj.name, "password": obj.password}
    raise TypeError

orjson.dumps(
    User("123", "Tom", "123456"),
    option=orjson.OPT_PASSTHROUGH_DATACLASS,
    default=default,
)

# When serializing a datetime.datetime, customize the format through the default parameter
def default(obj):
    if isinstance(obj, datetime.datetime):
        return obj.strftime("%a, %d %b %Y %H:%M:%S GMT")
    raise TypeError

orjson.dumps(
    {"created_at": datetime.datetime(2021, 1, 1)},
    option=orjson.OPT_PASSTHROUGH_DATETIME,
    default=default,
)

Deserialization

loads() accepts bytes, bytearray, memoryview, and str input, and deserializes it into dict, list, int, float, str, bool, and None objects. If the input already exists as a memoryview, bytearray, or bytes object, pass it directly instead of creating an unnecessary str object; this reduces memory usage and latency. The input must be valid UTF-8.

File read / write

Because dumps() returns bytes, we can save the result to a file with the file object's write() method. The file must be opened in binary mode, that is, with "b" in the mode parameter.

import datetime

import numpy
import orjson

data = {"emoji": "😂", "integer": 9527, "float": 9.527, "boolean": False,
        "list": ["😂", 9527, 9.527, False],
        "dict": {"key1": "value1", "key2": "value2"},
        "chinese": "您好", "japanese": "こんにちは",
        "created_at": datetime.datetime(1970, 1, 1),
        "status": "🆗", "payload": numpy.array([[1, 2], [3, 4]])
        }

# "wb" opens the file for writing in binary mode
with open("example.json", "wb") as f:
    f.write(orjson.dumps(data, option=orjson.OPT_NAIVE_UTC | orjson.OPT_SERIALIZE_NUMPY))


Similarly, reading data from a file is simple, as follows:

# "rb" opens the file for reading in binary mode, so loads() receives bytes
with open("example.json", "rb") as f:
    json_data = orjson.loads(f.read())

print(json_data)

Running the code above prints:

{'emoji': '😂', 'integer': 9527, 'float': 9.527, 'boolean': False, 'list': ['😂', 9527, 9.527, False], 'dict': {'key1': 'value1', 'key2': 'value2'}, 'chinese': '您好', 'japanese': 'こんにちは', 'created_at': '1970-01-01T00:00:00+00:00', 'status': '🆗', 'payload': [[1, 2], [3, 4]]}

Finally, a performance test

Let's compare the serialization performance of json, ujson, and orjson with a simple test. The json library is built into Python, the ujson library is implemented in C, and the orjson library is implemented in Rust.

# -*- coding: utf-8 -*-
import json
import random
import string
import time

import orjson
import ujson

def cost_time(func):
    """Print how long each call to func takes."""
    def inner(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        stop_time = time.time()
        print("{0} time: {1}".format(func.__name__, stop_time - start_time))
        return result
    return inner

@cost_time
def json_dumps(obj):
    return json.dumps(obj)

@cost_time
def ujson_dumps(obj):
    return ujson.dumps(obj)

@cost_time
def orjson_dumps(obj):
    return orjson.dumps(obj)

if __name__ == '__main__':
    test = {}
    for i in range(1, 2000000):
        # Each value is 10 distinct random lowercase letters.
        test[str(i)] = ''.join(random.sample(string.ascii_lowercase, 10))
    json_dumps(test)
    ujson_dumps(test)
    orjson_dumps(test)

For the same dict of roughly 2 million key-value pairs, orjson is much faster than the other two libraries: in this run it is about 13 times as fast as json and about 5 times as fast as ujson.

json_dumps time: 1.1578669548034668

ujson_dumps time: 0.45979905128479004

orjson_dumps time: 0.09074163436889648
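A single time.time() measurement per library is sensitive to noise, so the figures above are best read as a rough indication. A slightly more robust variant repeats each call and keeps the fastest run, using time.perf_counter(). This is only a sketch: the helper name best_time, the smaller data set, and the repeat count are assumptions made for illustration, and ujson and orjson are skipped when they are not installed.

```python
import importlib
import json
import random
import string
import time


def best_time(func, arg, repeats=3):
    """Return the fastest of `repeats` wall-clock timings of func(arg)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Smaller than the 2 million entries above so the sketch finishes quickly.
test = {str(i): "".join(random.sample(string.ascii_lowercase, 10))
        for i in range(1, 200_000)}

candidates = {"json": json.dumps}
# ujson and orjson are third-party packages; skip any that are missing.
for name in ("ujson", "orjson"):
    try:
        candidates[name] = importlib.import_module(name).dumps
    except ImportError:
        pass

results = {name: best_time(dumps, test) for name, dumps in candidates.items()}
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}_dumps best of 3: {seconds:.4f} s")
```

Absolute numbers depend on the machine and library versions, so only the relative ordering is meaningful.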
