Python IO efficiency improvement — oriented to huge ndarray object

Time: 2021-05-04

Original address – my blog

If you work in data science / AI / machine learning, the speed of loading and dumping large data may well be a headache: the IO stage is often the most time-consuming one. We like to joke that there is not much room left to optimize Python, but in practice we can do better. I hope this article helps your programs.

The discussion of IO efficiency in this article is limited to the data-science setting: the IO of `numpy.ndarray` objects that represent large arrays (tensors, matrices). The approach is parallel writing/reading based on multiple threads or processes. Unlike network IO, or ordinary IO with small amounts of data, large matrix objects in data science are usually accompanied by operations such as slicing, whose memory behavior (copy or move?) is not obvious; they easily fall into redundant memory usage, which in turn hurts IO efficiency. This article covers the following topics:

  • Parallel read/write based on multiple threads/processes, and a performance comparison
  • Watching out for redundant memory copies during parallel IO
  • A summary of best practices

IO scenario

The IO scenario discussed here is very simple: load big data from disk, process it, then store the results. The situation is common in all kinds of machine learning frameworks, where loading and dumping data is the most basic problem to solve. The principles and techniques discussed below also apply to `pytorch`, `tensorflow`, and the like.

In the context of data science, to optimize the efficiency of reading and writing, we can start from the following directions:

  • Start from the file encoding format, e.g. use binary `pkl` encoding to speed up reads and writes
  • Start from the read/write interface, e.g. adopt DirectIO / zero-copy optimizations
  • Read and write in parallel, in chunks and batches; suitable when the data chunks are relatively independent

The first direction is easy to apply, but the resulting encoding is not convenient to share with other languages/tools. The second is a bit of overkill for Python: Python's IO interfaces are not as explicit as those of static languages, and although you can drop down to low-level interfaces such as `os.open()` with custom flags, optimizations like DirectIO [4] or `mmap` add design cost. The third involves multiple threads/processes but no communication or synchronization between them, so it is relatively simple in practice.
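
As a quick illustration of the first direction, here is a minimal sketch (file names are made up for the example) comparing text serialization with binary `pkl` encoding and numpy's own binary format:

import pickle
import numpy as np

data = np.random.rand(1_000_000, 10)

# Text encoding: portable and human-readable, but parsing dominates the cost.
np.savetxt("data.txt", data, delimiter="\t")

# Binary pickle encoding: dumps the underlying buffer, no text parsing.
with open("data.pkl", "wb") as fp:
    pickle.dump(data, fp, protocol=pickle.HIGHEST_PROTOCOL)
with open("data.pkl", "rb") as fp:
    loaded = pickle.load(fp)

# numpy's native binary format is just as simple:
np.save("data.npy", data)
loaded2 = np.load("data.npy")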

Multi thread / multi process parallel read / write

Parallel basic logic

The parallel read/write logic with multiple processes is very simple, and its main cost is the operating system's process management. What deserves a mention is the theoretical basis for parallel reading and writing with multiple threads (in CPython). The figure below [1] shows how the GIL handles threads in IO scenarios.
[Figure: GIL hand-off between threads during blocking IO]

The figure also shows that the main overhead of multithreading is the sum of the per-thread run phases plus the operating system's thread-management overhead.

For CPython multithreading, the following points still deserve attention (see the sketch after this list):

  • On Linux, CPython threads are native POSIX threads, i.e. scheduling uses a 1:1 user-to-kernel thread mapping
  • Threads in CPython share the interpreter state, so global variables are shared by default
  • A thread releases the GIL while the underlying blocking IO system call runs, and reacquires it afterwards
  • Thread scheduling and communication are built on semaphores and condition variables
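
To make the GIL release concrete, here is a minimal sketch (file names are made up): four threads each perform a large blocking write, and because the GIL is dropped around the underlying system call, the wall-clock time is far less than four sequential writes:

import time
from threading import Thread

def io_task(i):
    # write() is a blocking system call; CPython releases the GIL around it,
    # so the other threads keep running while this one waits on the kernel
    with open("gil_demo_%s.bin" % i, "wb") as fp:
        fp.write(b"\0" * (1 << 26))  # 64 MiB of zeros

start = time.time()
threads = [Thread(target=io_task, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("4 threads: %.2fs" % (time.time() - start))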

Standard library interface evaluation

We design a small experiment to test the efficiency of parallel file writing with the multithread/multiprocess interfaces provided by the CPython standard library.

import os
import numpy as np
import time
from multiprocessing import Process
from multiprocessing import Pool
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from threading import Thread
from memory_profiler import profile
# Time calculator
class Benchmark:
    def __init__(self, text):
        self.text = text
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, *args):
        self.end = time.time()
        print("%s: consume: %s" % (self.text, self.end - self.start))

# Base Task
def store_task(data: np.ndarray, output, index):
    fname = "%s_worker_%s.csv" % (output, index)
    np.savetxt(fname, data, delimiter='\t')

# main data source
worker_num = os.cpu_count()
big_data = np.random.rand(1000000, 10)
task_num = big_data.shape[0] // worker_num
os.makedirs('testdata', exist_ok=True)  # output directory used by all tasks below

# 1. multiprocessing.Process
@profile
def loop_mp():
    pool = []
    for i in range(worker_num):
        start = i * task_num
        end = (i+1) * task_num
        p = Process(target=store_task, args=(big_data[start: end], 'testdata/', i))
        p.start()
        pool.append(p)
    for p in pool:
        p.join()

# 2. threading.Thread
@profile
def mt_thread():
    pool = []
    for i in range(worker_num):
        start = i * task_num
        end = (i+1) * task_num
        t = Thread(target=store_task, args=(big_data[start: end], 'testdata/thread', i))
        t.start()
        pool.append(t)
    for p in pool:
        p.join()

# 3. multiprocessing.Pool
@profile
def mp_pool():
    with Pool(processes=worker_num) as pool:
        tasks = []
        for i in range(worker_num):
            start = i * task_num
            end = (i+1) * task_num
            tasks.append(
                pool.apply_async(store_task, (big_data[start: end], 'testdata/mp_pool', i)))
        pool.close()
        pool.join()

# 4. ProcessPoolExecutor
@profile
def loop_pool():
    with ProcessPoolExecutor(max_workers=worker_num) as exe:
        for i in range(worker_num):
            start = i * task_num
            end = (i+1) * task_num
            exe.submit(store_task, big_data[start: end], 'testdata/pool', i)

# 5. ThreadPoolExecutor
@profile
def loop_thread():
    with ThreadPoolExecutor(max_workers=worker_num) as exe:
        for i in range(worker_num):
            start = i * task_num
            end = (i+1) * task_num
            exe.submit(store_task, big_data[start: end], 'testdata/pool_thread', i)

# 6.  direct
@profile
def direct():
    store_task(big_data, 'testdata/all', 0)

if __name__ == '__main__':
    with Benchmark("loop mp"):
        loop_mp()
    with Benchmark("mt thread"):
        mt_thread()
    with Benchmark("mp pool"):
        mp_pool()
    with Benchmark("loop pool"):
        loop_pool()
    with Benchmark("direct"):
        direct()
    with Benchmark("Thread"):
        loop_thread()

Now analyze the efficiency of each interface in terms of time and memory (test environment: macOS, 2.2 GHz quad-core Intel Core i7):

Interface                  Time      Memory
multiprocessing.Process    5.14s     p.start() adds overhead and triggers a copy of the arguments
threading.Thread           10.34s    no extra cost
multiprocessing.Pool       4.18s     Pool() construction overhead; arguments are not copied
ProcessPoolExecutor        3.69s     arguments are not copied
ThreadPoolExecutor         10.82s    no extra cost
direct                     22.04s    no extra cost

Time cost analysis

Intuitively, the multiprocess interfaces give a 4-6x speedup, while multithreading roughly halves the time. Why multithreading is slower than multiprocessing is a complex question: in principle thread switching is cheaper than process switching, but here the threads also need scheduling communication among themselves, while the processes run fully independently. Interested readers can also compare `asyncio.tasks` and the multiplexing-based interfaces, though it is hard to find a suitable non-blocking interface for regular file reads and writes.

It is also worth noting the large speed gap between the two kinds of multiprocess interfaces: the plain `Process` approach is much slower than the pool-based ones because of the cost of copying data. The next section discusses why pooling avoids the copy.

Memory overhead analysis

Because of the abstractions in CPython's standard library, neither `threading` nor `multiprocessing` states explicitly whether argument data is copied. In principle `Thread()` needs no copy and `Process()` must copy. Yet, as the table above shows, the two pool-based interfaces, `multiprocessing.Pool` and `ProcessPoolExecutor`, do not copy the data.

In the code, `@profile` comes from the third-party memory-profiling library `memory_profiler`, but its output alone cannot fully explain what happens. The result for `Process` is:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    29    101.3 MiB    101.3 MiB           1   @profile
    30                                         def loop_mp():
    31    101.3 MiB      0.0 MiB           1       pool = []
    32    120.6 MiB      0.0 MiB           9       for i in range(worker_num):
    33    120.6 MiB      0.0 MiB           8           start = i * task_num
    34    120.6 MiB      0.0 MiB           8           end = (i+1) * task_num
    35    120.6 MiB      0.0 MiB           8           p = Process(target=store_task, args=(big_data[start: end], 'testdata/', i))
    36    120.6 MiB     19.3 MiB           8           p.start()
    37    120.6 MiB      0.0 MiB           8           pool.append(p)
    38    120.6 MiB      0.0 MiB           9       for p in pool:
    39    120.6 MiB      0.0 MiB           8           p.join()

Clearly the data is copied at `p.start()`, and what is copied is the actual content of `big_data[start: end]`. This is quite different from a bare `fork` system call, where CLONE flags are passed in explicitly to control how much the child shares with the parent.
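
One caveat of my own, not part of the original measurement: how much `p.start()` copies depends on the process start method. Python 3.8+ defaults to spawn on macOS, which pickles the arguments and rebuilds them in the child, while Linux defaults to fork, where the child starts out sharing the parent's pages copy-on-write:

import multiprocessing as mp

print(mp.get_start_method())   # "spawn" on macOS (3.8+), "fork" on Linux

# The method can be selected explicitly, once, at program start:
# mp.set_start_method("fork")

Now look at `ProcessPoolExecutor`: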

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    68    121.1 MiB    121.1 MiB           1   @profile
    69                                         def loop_pool():
    70    121.1 MiB      0.0 MiB           1       with ProcessPoolExecutor(max_workers=worker_num) as exe:
    71    121.2 MiB     -0.0 MiB           9           for i in range(worker_num):
    72    121.2 MiB      0.0 MiB           8               start = i * task_num
    73    121.2 MiB      0.0 MiB           8               end = (i+1) * task_num
    74    121.2 MiB      0.1 MiB           8               exe.submit(store_task, big_data[start: end], 'testdata/pool', i)

On the surface no copy happens, but is that really true? After all, `exe.submit()` does not directly spawn a `Process()`. To understand this we need to look into how the pool works.

Plenty of source-level analysis of CPython has been published already. From reference [2], the structure of `ProcessPoolExecutor` is:

|======================= In-process =====================|== Out-of-process ==|

+----------+     +----------+       +--------+     +-----------+    +---------+
|          |  => | Work Ids |    => |        |  => | Call Q    | => |         |
|          |     +----------+       |        |     +-----------+    |         |
|          |     | ...      |       |        |     | ...       |    |         |
|          |     | 6        |       |        |     | 5, call() |    |         |
|          |     | 7        |       |        |     | ...       |    |         |
| Process  |     | ...      |       | Local  |     +-----------+    | Process |
|  Pool    |     +----------+       | Worker |                      |  #1..n  |
| Executor |                        | Thread |                      |         |
|          |     +------------+     |        |     +-----------+    |         |
|          | <=> | Work Items | <=> |        | <=  | Result Q  | <= |         |
|          |     +------------+     |        |     +-----------+    |         |
|          |     | 6: call()  |     |        |     | ...       |    |         |
|          |     |    future  |     |        |     | 4, result |    |         |
|          |     | ...        |     |        |     | 3, except |    |         |
+----------+     +------------+     +--------+     +-----------+    +---------+

Does this look familiar? Right: it differs from the pool in my earlier article [[C++ wheel] A pthread-based thread pool](http://zhikai.pro/post/103) in that:

  • Tasks are maintained in queues
  • Creating the pool also creates idle worker processes
  • A dedicated management thread manages and monitors the pool

The parameter-copy question then comes down to whether `Queue.put()` and `Queue.get()` copy the data. `multiprocessing.Queue` is the main interface for inter-process communication. Note that it is not shared memory: objects are pickled and pushed through a pipe by a background feeder thread, then unpickled on the other side, so a transient copy does occur; it simply never shows up as a lasting increment in the parent's line-by-line profile. For huge `ndarray` objects even this transient copy matters, which motivates the shared-memory approach below.
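
A small sketch of my own makes the round-trip copy visible even within a single process:

from multiprocessing import Queue
import numpy as np

a = np.random.rand(1000, 10)
q = Queue()
q.put(a)     # pickled by a background feeder thread, written into a pipe
b = q.get()  # unpickled into a brand-new buffer

print(np.shares_memory(a, b))  # False: the queue transfer produced a copy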

Copying of ndarray objects

"Everything in Python is an object" – a famous saying in the Python world

With enterprise-scale big data, it is not easy to pin down why a Python program's memory / GPU-memory usage is high. Dynamic reference types plus GC make Python's memory management convenient, but unnecessary data copies should still be avoided as far as possible.

Slicing and concatenation

Slicing and concatenation are everyday operations in `numpy` and other vector/matrix/tensor libraries, and it is hard to tell whether they copy under the hood:

import numpy as np
A = np.random.rand(1 << 8, 1 << 8)
B = A[:16]   # basic slice: B is a view into A's buffer
del A        # does NOT release A's memory: B still references the buffer
print(A)     # NameError: the name A is gone, yet its memory lives on via B
C = np.random.rand(1 << 4, 1 << 8)
D = np.concatenate([B, C], axis=1)  # D is a fresh copy of B's and C's data

Whether `concatenate` copies depends mainly on the memory layout [6]:

00    04    08    0C    10    14    18    1C    20    24    28    2C
|     |     |     |     |     |     |     |     |     |     |     |
[data1     ][foo ][data2     ][bar ][concat(data1, data2)  ]

data1 and data2 live at different addresses, so concatenating them can only fill a newly allocated region.

Slicing likewise depends on the memory layout. Row-major and column-major arrangements behave differently; the `order` parameter ('C' or 'F') determines whether an array is laid out in memory by row or by column [7]. Another way to reason about it: if an indexing expression can be expressed as `slice(start, stop, step)`, the result is a view; if not, it is a copy. Most forms of fancy indexing, for instance, produce copies, while `[:]` is just `slice(None, None, None)` and therefore a view.
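
`numpy` itself can tell you which case you are in; a minimal check using `ndarray.base` and `np.shares_memory`:

import numpy as np

A = np.random.rand(8, 8)

view = A[::2]            # expressible as slice(0, None, 2): a view
fancy = A[[0, 2, 4, 6]]  # fancy indexing: a copy
full = A[:]              # slice(None, None, None): a view, not a copy

print(view.base is A)             # True, a view keeps a reference to its owner
print(fancy.base is A)            # False
print(np.shares_memory(A, full))  # True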

Whether a slice is a view or a copy hardly matters when the data is small, but once the data size approaches the memory limit, slicing large ndarrays must be handled with care.

Replication on process creation

We would like to slice the data and pass the slices to child processes without copying, letting every process share one big `ndarray`. From the previous chapter we know that creating children with `multiprocessing.Process(target=func, args=(ndarray[start:offset],))` copies the slice. The main technique here is `multiprocessing`'s shared memory.

Python 3.8 added `shared_memory`, a unified and simple interface over the various shared-memory mechanisms. Using `shared_memory` we can rewrite the code from the earlier section:

from multiprocessing import shared_memory
def store_task_sha_v2(start, end, output, index, sha_name, shape, dtype):
    fname = "%s_worker_%s.csv" % (output, index)
    exist_sham = shared_memory.SharedMemory(name=sha_name)
    data = np.ndarray(shape, dtype=dtype, buffer=exist_sham.buf)
    print(sha_name, data.shape, index)
    np.savetxt(fname, data[start: end], delimiter='\t')
    del data
    exist_sham.close()

@profile
def mp_pool_sha():
    shm = shared_memory.SharedMemory(create=True, size=big_data.nbytes)
    b = np.ndarray(big_data.shape, dtype=big_data.dtype, buffer=shm.buf)
    b[:] = big_data[:]
    print(b.shape)
    with ProcessPoolExecutor(max_workers=worker_num) as pool:
        tasks = []
        for i in range(worker_num):
            start = i * task_num
            end = (i+1) * task_num
            tasks.append(
                pool.submit(store_task_sha_v2, 
                    start, end, 'testdata/mp_pool_sha', i ,
                    shm.name, b.shape, b.dtype))
        for t in tasks:
            # Note: fetch the results so worker exceptions surface (recommended with ProcessPoolExecutor)
            try:
                print(t.result())
            except Exception as e:
                print(f'{e}')
    del b
    shm.close()
    shm.unlink()

The code looks involved, but the logic is simple: request a shared buffer > map a local ndarray onto it > put the data into the shared buffer > let other processes read/write it > close the buffer. `shared_memory` can also wrap a local variable for sharing at any point, as sketched below; note that `unlink()` should be called exactly once, by the process that owns the segment.
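
For completeness, a minimal sketch of that "share a local variable at any point" usage (variable names are made up):

from multiprocessing import shared_memory
import numpy as np

local = np.arange(16, dtype=np.int64)

shm = shared_memory.SharedMemory(create=True, size=local.nbytes)
shared = np.ndarray(local.shape, dtype=local.dtype, buffer=shm.buf)
shared[:] = local[:]   # one explicit copy in; afterwards any process attaches by name

print(shm.name)        # hand this name to other processes
shm.close()
shm.unlink()           # only the owning process unlinks, and only once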

Best practice summary

Loading an ndarray by reading files in parallel

If your training data is huge and needs stream processing (training), use PyTorch's data-loading modules (`torch.utils.data`) directly; they already encapsulate parallel stream processing, as sketched below.
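
For example, a minimal sketch (assuming a PyTorch installation; shapes and batch size are arbitrary):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == '__main__':   # required when worker processes are spawned
    ds = TensorDataset(torch.randn(100000, 10))
    # num_workers > 0 enables parallel, streamed loading out of the box
    loader = DataLoader(ds, batch_size=256, num_workers=4)
    for (batch,) in loader:
        pass  # training step goes here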

If you need to load everything into RAM at once (e.g. for a KNN algorithm), you can read in parallel chunks:

from multiprocessing import Manager, Process
import numpy as np

def load_from_txt(fname, index, start, n_rows, arr_list):
    # Worker: read one chunk of lines; record the chunk index, because the
    # completion order of processes is not the submission order.
    data = np.loadtxt(fname, skiprows=start, max_rows=n_rows)
    arr_list.append((index, data))

def parallize_load(file, total_num, worker_num):
    """Load an embedding file with parallel reads.
       @file: source filename
       @total_num: total number of lines
       @worker_num: number of worker processes
    return: np.ndarray
    """
    worker_load_num = total_num // worker_num
    pool = []
    with Manager() as manager:
        arr_list = manager.list([])
        for index in range(worker_num):
            s = index * worker_load_num
            if index != worker_num - 1:
                e = worker_load_num
            else:
                e = total_num - (worker_load_num * index)
            p = Process(target=load_from_txt, args=(file, index, s, e, arr_list))
            pool.append(p)
            p.start()
        for p in pool:
            p.join()
        # restore chunk order before concatenating
        chunks = sorted(arr_list, key=lambda pair: pair[0])
        arr = np.concatenate([data for _, data in chunks])
    return arr

source_total_num = sum(1 for line in open("source_big_file", "rb"))
source_emb_data = parallize_load("source_big_file", source_total_num, worker_num)

This gives roughly a `worker_num`-fold speedup.

Parallel writing practice

  • Try to avoid slicing and concatenating large ndarray objects.
  • Try to avoid `for` loops; prefer matrix operations.
  • Writing files from multiple processes is faster and the logic stays concise, but always make sure no data is copied between the processes.
  • Prefer the IO interfaces of third-party libraries, such as `np.savetxt` and `df.to_csv`; they may already optimize exception handling, chunked writing, and so on.
  • When writing strings, build them with joins such as `'\t'.join(items)` rather than calling `fp.write()` once per element in a loop (see the sketch after this list).
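
A minimal sketch of the last point (data is made up for the example):

rows = [("a", "1"), ("b", "2"), ("c", "3")]

# one big join, one write call
with open("out.tsv", "w") as fp:
    fp.write("\n".join("\t".join(row) for row in rows) + "\n")

# instead of one small write per element
with open("out_slow.tsv", "w") as fp:
    for name, value in rows:
        fp.write("%s\t%s\n" % (name, value))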

More work

The scope of this article is limited to the host side. For the increasingly common case of GPU memory, the interfaces of Python's many third-party libraries are painful: they usually hide the whole allocate / schedule / communicate / free cycle, and when an OOM exception occurs, troubleshooting can only rely on observing metrics. Best practices for GPU memory are worth studying further.

Finally, the content of this article may surprise you, because optimizing Python is often seen as a thankless job. Still, under real resource constraints these methods have solved many problems in the programs I work on. Granted, mainstream machine learning pipelines are stream-based and rarely load too much at once, but there are still places that need hand-written reading and writing, such as embedding IO.