# What is Mars, what can it do, and how does it do it

Time：2020-3-15

We recently presented our latest work, Mars, a matrix-based unified computing framework, at PyCon China 2018 in the Beijing main venue and the Chengdu and Hangzhou branch venues. In this article, we go through the content of that PyCon China talk again in written form.

Many people hearing about Mars for the first time ask the same three fundamental questions: what is Mars, what can it do, and how does it do it. Today we will answer them, starting from the background and an example.

# Background

First, an overview of the SciPy technology stack. NumPy is the foundation: it provides a multi-dimensional array data structure and a wide range of computations on it. Above it sit SciPy, which mainly targets scientific computing, and pandas, whose core concept is the DataFrame and which provides processing and cleaning of tabular data. One level up is a classic library, scikit-learn, one of the best-known machine learning frameworks. The top layer consists of libraries for specific domains, such as Astropy for astronomy and Biopython for biology.

The stack shows that NumPy occupies the core position: a large number of higher-level libraries build on its data structures and computations. Real-world data is not limited to two-dimensional tables. In many cases we face multidimensional data, as in image processing: first the number of images, then the height and width of each image, then the RGBA channels, which gives four-dimensional data. There are countless examples like this. With multidimensional processing we can handle much more complex problems, including those in scientific fields. And because multidimensional data includes two-dimensional data as a special case, we can also process tabular data.

In addition, if we want to explore the inner nature of the data, statistical operations on tables alone are far from enough. We need deeper mathematical methods, such as matrix multiplication and the Fourier transform, to analyze the data at a deeper level. Since NumPy is a numerical computing library with many higher-level libraries built on it, we think this stack is well suited to providing that ability. So why do we need the Mars project? Let's look at an example.

We will use the Monte Carlo method to estimate pi. The Monte Carlo method is very simple: it uses random numbers to solve a specific problem. As shown in the figure, we have a circle of radius 1 and a square of side length 2. We generate many random points using the formula in the lower right corner; the value of pi is then approximately 4 times the number of points inside the circle divided by the total number of points. The more random points we generate, the more accurate the estimate. This is very simple to implement in pure Python: loop n times, generate the x and y coordinates of a point, and check whether it falls inside the circle. Running 10 million points takes more than 10 seconds.

Cython is a common way to speed up Python code. It defines a superset of the Python language, translates that language into C/C++, and compiles it to speed up execution. Here we add types to a few variables, and the performance improves by 40% compared with pure Python.
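The pure Python version described above might look like the following minimal sketch. The function name `calc_pi` and the `seed` parameter are assumptions made for illustration and are not taken from the talk.

```python
import random


def calc_pi(n, seed=None):
    """Estimate pi from n random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # the seed only makes runs repeatable
    inside = 0
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        # A point lies inside the unit circle when x^2 + y^2 < 1.
        if x * x + y * y < 1:
            inside += 1
    # Area ratio circle/square is pi/4, so pi is about 4 * inside / n.
    return 4 * inside / n


print(calc_pi(100_000, seed=0))
```

The loop body runs once per point in the interpreter, which is why this version is the slowest of the variants discussed here.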

Cython has become standard equipment for Python projects, and the core Python libraries basically all use Cython to speed up their Python code. The data in our example all has the same type, so we can use a dedicated numerical computing library to speed up the task through vectorization. NumPy is the obvious choice. Using NumPy requires an array-oriented way of thinking: we should reduce the use of loops. Here we first use `numpy.random.uniform` to generate a two-dimensional array of shape n × 2, then `data ** 2` squares every element of the array, and `sum(axis=1)` sums along axis 1, that is, along each row. This gives a vector of length n. We then use `numpy.sqrt` to take the square root of each value of this vector, and comparing with `< 1` gives a Boolean vector that says whether each point is inside the circle. A final `sum` gives the total number of points inside the circle. NumPy may feel unfamiliar at first, but after using it for a while you will find this style convenient and, in fact, very intuitive.
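The vectorized steps above can be sketched as follows. This is an illustration rather than the code from the talk: the function name `calc_pi_numpy` is an assumption, and a seeded `numpy.random.Generator` is used in place of a direct `numpy.random.uniform` call only so that runs are repeatable.

```python
import numpy as np


def calc_pi_numpy(n, seed=None):
    """Vectorized Monte Carlo estimate of pi from n random points."""
    rng = np.random.default_rng(seed)
    data = rng.uniform(-1, 1, size=(n, 2))   # n points; columns are x and y
    dist = np.sqrt((data ** 2).sum(axis=1))  # distance of each point from the origin
    inside = (dist < 1).sum()                # True counts as 1 when summed
    return 4 * inside / n


print(calc_pi_numpy(1_000_000, seed=0))
```

Every step operates on whole arrays, so the per-point work happens in compiled NumPy routines instead of the Python interpreter.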

As you can see, with NumPy we have written simpler code and the performance has improved greatly: it is more than 10 times faster than pure Python. Can the NumPy code be optimized further? The answer is yes. A library called numexpr can fuse multiple NumPy operations into a single operation to speed up execution.

The code optimized with numexpr is more than 25 times faster than the pure Python code, which is already quite fast. If we have a GPU at hand, we can use the hardware to speed up the task further.

Here we have to recommend a library called CuPy. It provides an API consistent with NumPy, so by simply replacing the import, NumPy code can run on NVIDIA graphics cards.

Now the performance has improved by more than 270 times, which is a remarkable gain.

To make the Monte Carlo result more accurate, we increase the amount of computation by a factor of 1000. What happens? We get something we run into from time to time: OutOfMemory, a memory overflow. Worse, in Jupyter an out-of-memory error sometimes causes the process to be killed, and all previous results are lost.

The Monte Carlo method is relatively easy to deal with: we split the problem into 1000 pieces of 10 million points each, write a loop, and sum up the results. But the whole computation now takes more than 12 minutes, which is too slow. We also notice that during the whole run only one CPU core is working while the other cores sit idle. So how can we parallelize NumPy?
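The chunked workaround described above (split the problem, loop, sum) can be sketched like this. The names `count_inside` and `calc_pi_chunked` are hypothetical, and the chunk sizes in the usage line are kept small so the sketch runs quickly; the article uses 1000 chunks of 10 million points.

```python
import numpy as np


def count_inside(chunk_size, rng):
    """Count the points of one chunk that fall inside the unit circle."""
    data = rng.uniform(-1, 1, size=(chunk_size, 2))
    # x^2 + y^2 < 1 is equivalent to sqrt(x^2 + y^2) < 1, so sqrt is skipped.
    return int(((data ** 2).sum(axis=1) < 1).sum())


def calc_pi_chunked(n_chunks, chunk_size, seed=None):
    """Process one chunk at a time so only chunk_size points are in memory."""
    rng = np.random.default_rng(seed)
    inside = 0
    for _ in range(n_chunks):
        inside += count_inside(chunk_size, rng)
    return 4 * inside / (n_chunks * chunk_size)


print(calc_pi_chunked(n_chunks=10, chunk_size=100_000, seed=0))
```

Peak memory is bounded by a single chunk, but the chunks still run one after another on a single core.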

First of all, some NumPy operations are already parallel, such as `tensordot` for matrix multiplication, but most other operations cannot use multiple cores. To parallelize NumPy, we can:

1. Use multiple threads or multiple processes
2. Go distributed

It is very easy to rewrite the Monte Carlo estimate of pi as a multi-threaded or multi-process implementation. We write a function that processes 10 million points and submit it 1000 times through the `ThreadPoolExecutor` and `ProcessPoolExecutor` of `concurrent.futures` for multi-threaded and multi-process execution respectively. The performance improves by a factor of 2 and 3 respectively. However, the Monte Carlo estimate of pi is easy to parallelize by hand. Consider a more complex case.

```python
import numpy as np

a = np.random.rand(100000, 100000)
(a.dot(a.T) - a).std()
```

Here we create a 100000 × 100000 matrix `a`; the input is about 75 GB. We multiply `a` by its transpose, subtract `a` itself, and finally compute the standard deviation. The input data of this task can hardly fit into memory, and parallelizing the rest by hand is even harder. This raises the question: what kind of framework do we need?

1. A familiar interface: as with CuPy, simply replacing the import should make existing NumPy code run in parallel.
2. Scalability: on a single machine it should use multi-core parallelism; on a large cluster it should support thousands of machines working on distributed tasks together.
3. Hardware acceleration: support for GPUs and other hardware to speed up task execution.
4. Optimizations of various kinds, such as operation fusion, using existing libraries to speed up the fused operations.
5. Although we compute in memory, we do not want a task to fail because a single machine or the cluster runs out of memory. Data that is temporarily not needed should be spilled to disk or other storage so that the whole computation can complete even when memory is insufficient.

# What is Mars and what can it do

Mars is such a framework, and its goal is to solve these problems. At present, Mars includes tensor: distributed multidimensional matrix computing.

Estimating pi by Monte Carlo with 10 billion points is a problem of about 150 GB, which leads to OOM. With the Mars tensor API, you only need to replace `import numpy as np` with `import mars.tensor as mt`, and the subsequent computation stays exactly the same. There is one difference: a Mars tensor needs `execute` to trigger execution. The advantage is that the whole intermediate process can be optimized as much as possible, for example by operation fusion. However, this approach is not very friendly to debugging. Later we will provide an eager mode that triggers computation at every step, which makes the code fully consistent with NumPy code.

The computing time is about the same as for the hand-written parallel version, and the peak memory usage is only a little over 1 GB. Mars tensor can therefore both parallelize fully and save memory.

At present, Mars has implemented 70% of the common NumPy interfaces. See the complete list here. We keep working to provide more NumPy and SciPy interfaces, and we have just completed support for matrix inversion.

Mars tensor also supports GPUs and sparse matrices. `eye` creates an identity matrix, which has the value 1 only on the diagonal, so storing it densely would waste storage. At present, however, Mars tensor only supports two-dimensional sparse matrices.

# How does Mars achieve parallelism and save more memory

Like all dataflow frameworks, Mars has the concept of a computation graph. The difference is that Mars distinguishes a coarse-grained graph from a fine-grained graph. The code written by the user generates the coarse-grained graph on the client side. After it is submitted to the server side, a tile process transforms the coarse-grained graph into a fine-grained graph, and we then schedule the execution of the fine-grained graph.

The code written by the user is represented in memory as a coarse-grained graph composed of tensors and operators. When the user calls the `execute` method, the coarse-grained graph is serialized and sent to the server. After deserialization, we tile it into a fine-grained graph. For an input matrix of 1000 × 2000, assuming a chunk size of 500 on each dimension, it is tiled into 2 × 4, that is, 8 chunks in total.
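The chunk arithmetic in this example can be checked with a small sketch. The helpers `tile_shape` and `chunk_slices` are hypothetical names for illustration and are not part of the Mars API; they only show how a shape and a chunk size determine the chunk grid.

```python
import itertools
import math


def tile_shape(shape, chunk_size):
    """Number of chunks along each dimension."""
    return tuple(math.ceil(dim / chunk_size) for dim in shape)


def chunk_slices(shape, chunk_size):
    """Yield (chunk_index, slices) for every chunk of the array."""
    grid = tile_shape(shape, chunk_size)
    for index in itertools.product(*(range(g) for g in grid)):
        slices = tuple(
            slice(i * chunk_size, min((i + 1) * chunk_size, dim))
            for i, dim in zip(index, shape)
        )
        yield index, slices


grid = tile_shape((1000, 2000), 500)
chunks = list(chunk_slices((1000, 2000), 500))
print(grid, len(chunks))  # (2, 4) 8
```

Each chunk covers an independent 500 × 500 block, which is what lets separate cores or workers process the blocks in parallel.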

Next, for each operand, that is, operator, that we implement, we provide a tile operation that turns a coarse-grained graph into a fine-grained graph. At this point we can see that on a single machine with 8 cores, the whole fine-grained graph can be executed in parallel; in addition, the whole graph can be computed with 1/8 of the memory.

Before actual execution, however, we optimize the whole graph with fuse, that is, operation fusion. When the three operations are actually executed, they are merged into one operator. Depending on the execution target, we use the fuse support of numexpr or CuPy to execute on the CPU or the GPU respectively.

The examples above are all tasks we constructed to be easy to parallelize. For the example we mentioned earlier, the fine-grained graph generated after tiling is actually very complex, and real-world computing scenarios contain many such tasks. To schedule the execution of these complex fine-grained graphs well, we must follow some basic principles that make execution efficient enough.

First of all, the allocation of initial nodes is very important. For example, in the figure above, suppose we have two workers. If we assign 1 and 3 to one worker, and 2 and 4 to another worker, when 5 or 6 are scheduled, they need to trigger remote data pulling, so the execution efficiency will be greatly reduced. If we start by assigning 1 and 2 to one worker and 3 and 4 to another worker, the execution will be very efficient. The allocation of initial nodes has a great impact on the overall execution, which requires us to have a global grasp of the whole fine-grained graph, so we can achieve a better initial node allocation.
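The effect of initial node allocation can be illustrated with a toy count of remote data pulls. The graph below is an assumption reconstructed from the description (node 5 consumes 1 and 2, node 6 consumes 3 and 4), and `remote_pulls` is a hypothetical helper, not Mars scheduling code.

```python
from collections import Counter

# Assumed graph: 5 consumes 1 and 2, 6 consumes 3 and 4.
CONSUMERS = {5: [1, 2], 6: [3, 4]}


def remote_pulls(assignment, consumers=CONSUMERS):
    """Count input chunks that must be fetched from another worker."""
    pulls = 0
    for deps in consumers.values():
        holders = Counter(assignment[d] for d in deps)
        # Run the node on the worker that already holds most of its inputs.
        _, local = holders.most_common(1)[0]
        pulls += len(deps) - local
    return pulls


interleaved = {1: "w0", 3: "w0", 2: "w1", 4: "w1"}
grouped = {1: "w0", 2: "w0", 3: "w1", 4: "w1"}
print(remote_pulls(interleaved), remote_pulls(grouped))  # 2 0
```

With the grouped assignment every successor finds all of its inputs on one worker, so no data crosses the network.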

In addition, the strategy of depth first execution is also very important. Suppose that at this time, we only have one worker. After executing 1 and 2, if we schedule 3, the memory of 1 and 2 will not be released, because 5 has not been triggered yet. However, if we schedule 5 execution after 1 and 2 execution, the memory of 1 and 2 can be released after 5 execution, so the memory in the whole execution process will be the most economical.
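A toy simulation makes the memory argument concrete. It assumes the same graph as described (5 consumes 1 and 2, 6 consumes 3 and 4), treats every chunk as one unit of memory, and frees a chunk as soon as all of its consumers have run; `peak_live_chunks` is a hypothetical helper for illustration only.

```python
# Assumed graph: 5 consumes 1 and 2, 6 consumes 3 and 4.
DEPENDENCIES = {1: [], 2: [], 3: [], 4: [], 5: [1, 2], 6: [3, 4]}


def peak_live_chunks(order, deps=DEPENDENCIES):
    """Return the largest number of chunks held in memory at once."""
    # How many consumers of each chunk have not run yet.
    remaining = {node: 0 for node in deps}
    for inputs in deps.values():
        for d in inputs:
            remaining[d] += 1
    live, peak = set(), 0
    for node in order:
        assert all(d in live for d in deps[node]), "inputs must be ready"
        live.add(node)  # the output exists while the inputs are still held
        peak = max(peak, len(live))
        for d in deps[node]:
            remaining[d] -= 1
            if remaining[d] == 0:
                live.discard(d)  # no consumer left, release the chunk
    return peak


print(peak_live_chunks([1, 2, 3, 4, 5, 6]))  # 5
print(peak_live_chunks([1, 2, 5, 3, 4, 6]))  # 4
```

Running 5 right after 1 and 2 releases their memory before 3 and 4 are produced, so the depth-first order has the lower peak.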

Initial node allocation and depth-first execution are therefore the two most basic principles, but they alone are not enough. The execution scheduling of Mars involves many challenging tasks, which we will need to keep optimizing over the long term.

# Mars distributed

So Mars is essentially a scheduling system for fine-grained, heterogeneous graphs. We schedule fine-grained operators onto individual machines, and the actual execution calls libraries such as NumPy, CuPy, and numexpr. We make full use of mature, highly optimized single-machine libraries instead of reinventing the wheel in these fields.

In this process, we will encounter some difficulties:

1. Since we use a master-slave architecture, how do we avoid a single point of failure?
2. How can our workers avoid the limitations of Python's GIL (global interpreter lock)?
3. The master's control logic is intertwined and complex, and it is easy to write tightly coupled, long, and smelly code. How do we decouple the code?

Our solution is the actor model. The actor model defines a way of doing parallelism: every actor maintains its own internal state and has a mailbox. Actors communicate by passing messages; a received message is placed in the mailbox, and the actor takes messages from the mailbox and processes them. An actor can only process one message at a time, and the actor is the smallest unit of parallelism. Because an actor handles only one message at a time, you do not need to worry about concurrency inside an actor; concurrency is handled by the actor framework. Whether all actors are on the same machine becomes unimportant in the actor model: as long as messages can be delivered between machines, the actor model naturally supports distributed systems.

Because actor is the smallest parallel unit, when we write code, we can decompose the whole system into many actors. Each actor has a single responsibility, which is similar to the idea of object-oriented, so that our code can be decoupled.
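The mailbox idea can be sketched with `asyncio` from the standard library. This is not the Mars actors implementation: the `Actor` and `CounterActor` classes below are assumptions for illustration, and they only show a mailbox from which one message is processed at a time.

```python
import asyncio


class Actor:
    """Toy actor: private state plus a mailbox drained one message at a time."""

    def __init__(self):
        self.mailbox = asyncio.Queue()

    async def send(self, message):
        await self.mailbox.put(message)

    async def run(self):
        while True:
            message = await self.mailbox.get()
            if message is None:  # None is used here as a stop signal
                break
            self.on_receive(message)

    def on_receive(self, message):
        raise NotImplementedError


class CounterActor(Actor):
    """Single responsibility: add up the numbers it receives."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def on_receive(self, message):
        # Only one message is handled at a time, so no lock is needed.
        self.total += message


async def main():
    counter = CounterActor()
    worker = asyncio.create_task(counter.run())
    for value in (1, 2, 3):
        await counter.send(value)
    await counter.send(None)
    await worker
    return counter.total


print(asyncio.run(main()))  # 6
```

Senders never touch `total` directly; they only put messages into the mailbox, which is what keeps each actor's state free of races.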

In addition, once the master is decoupled into actors, we can distribute these actors across different machines, so the master is no longer a single point. We also allocate these actors by consistent hashing. In the future, if a scheduler machine goes down, its actors can be reallocated and recreated according to the consistent hash, which provides fault tolerance.
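Consistent hashing can be sketched with a simple hash ring. The `HashRing` class, the scheduler names, and the actor identifiers below are hypothetical and do not reflect how Mars implements its allocation; the sketch only shows that removing one node reassigns just the keys that lived on it.

```python
import bisect
import hashlib


def _hash(key):
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def lookup(self, key):
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["scheduler1", "scheduler2", "scheduler3"])
actors = [f"actor-{i}" for i in range(1000)]
before = {a: ring.lookup(a) for a in actors}
ring.remove("scheduler2")  # simulate a scheduler going down
after = {a: ring.lookup(a) for a in actors}
moved = [a for a in actors if before[a] != after[a]]
# Only actors that lived on the removed scheduler are reassigned.
print(all(before[a] == "scheduler2" for a in moved))  # True
```

Actors on the surviving schedulers keep their placement, so only the lost actors need to be recreated.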

Finally, our actors run in multiple processes, with many coroutines in each process. In this way, our workers are not limited by the GIL.

JVM languages such as Scala or Java can use Akka as the actor framework. For Python there is no standard practice. We decided that we only need a lightweight actor framework and do not need some of the advanced features of Akka, so we developed Mars actors, a lightweight actor framework. All the distributed schedulers and workers of Mars are built on top of the Mars actors layer.

This is the architecture diagram of Mars actors. When we start an actor pool, the main process starts several subprocesses according to the configured concurrency. The main process has a socket handler that accepts remote socket connections to deliver messages, and a dispatcher object that distributes messages according to their destination. All actors are created in the subprocesses. When an actor receives a message to process, we call its `Actor.on_receive(message)` method through a coroutine.

There are three situations in which one actor sends messages to the other.

1. They are in the same process, so the call is made directly through a coroutine.
2. They are in different processes on the same machine. The message is serialized and sent through a pipe to the dispatcher of the main process. The dispatcher unpacks the binary header to get the target's process ID and sends the message through the corresponding pipe to the corresponding subprocess, which triggers the message handling of the target actor through a coroutine.
3. They are on different machines. The current subprocess sends the serialized message through a socket to the main process of the target machine, and that machine's dispatcher forwards the message to the corresponding subprocess.
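The three cases above amount to a small routing decision, sketched below. `ActorRef`, its fields, and `choose_route` are hypothetical names for illustration and do not correspond to the Mars actors API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorRef:
    address: str        # "host:port" of the machine's main process (assumed format)
    process_index: int  # index of the subprocess that hosts the actor


def choose_route(sender, target):
    """Pick the transport for a message from sender to target."""
    if sender.address != target.address:
        return "socket"     # case 3: different machines
    if sender.process_index != target.process_index:
        return "pipe"       # case 2: same machine, routed by the dispatcher
    return "coroutine"      # case 1: same process, direct call


a = ActorRef("10.0.0.1:7000", 0)
print(choose_route(a, ActorRef("10.0.0.1:7000", 0)))  # coroutine
print(choose_route(a, ActorRef("10.0.0.1:7000", 2)))  # pipe
print(choose_route(a, ActorRef("10.0.0.2:7000", 0)))  # socket
```

Serialization is only needed on the pipe and socket paths, which is why the same-process case is the cheapest.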

Because we use coroutines as the parallelism model inside a subprocess, and coroutines perform very well at I/O, our actor framework also performs well at I/O.

The figure above uses Mars actors directly to estimate pi with the Monte Carlo method. Two actors are defined here. One is ChunkInside, which accepts the size of a chunk and counts the number of points that fall inside the circle. The other is PiCalculator, which accepts the total number of points and creates the ChunkInside actors. This example directly creates 1000 ChunkInside actors and triggers their computation by sending messages. Specifying an address when calling `create_actor` lets you place actors on different machines.

As you can see, using Mars actors directly is faster than the multi-process version.

To summarize: with Mars actors we can write distributed code without the GIL limitation, and I/O becomes more efficient. In addition, because actors decouple the logic, the code is easier to maintain.

Now let's look at the complete execution process of distributed Mars. There are 1 client, 3 schedulers, and 5 workers. The user creates a session, which creates a SessionActor object on the server side that is allocated to scheduler1 by consistent hashing. The user then runs a tensor. First, the SessionActor creates a GraphActor, which tiles the coarse-grained graph. Assuming the graph has three nodes, three OperandActors are created and allocated to different schedulers. Each OperandActor controls the submission of its operand, the monitoring of task status, and the release of memory. The OperandActors of 1 and 2 find that they have no dependencies and that cluster resources are sufficient, so they submit their tasks to the corresponding workers for execution. When execution completes, they notify 3. When 3 finds that both 1 and 2 are complete, it first triggers a data pull, because the data was produced on different workers, and then executes. The client polls the GraphActor, and once it learns that the task is complete, it triggers pulling the data to the local machine. The whole task is then done.

We ran two benchmarks on distributed Mars. The first multiplies each element of 3.6 billion data items by 2. In the figure, the red cross is the NumPy execution time. We are several times faster than NumPy, and the blue dotted line is the theoretical running time: our actual speedup is very close to the theoretical speedup.

In the second benchmark we increased the data volume to 14.4 billion items, adding 1 to each element and then multiplying by 2. A single NumPy process can no longer complete this task, while Mars still achieves a good speedup ratio on it.

# Future plans

Mars is already open source on GitHub, and we welcome more people to join us in building Mars: https://github.com/mars-project/mars.

In the Mars development plan, as mentioned above, we will support eager mode so that every step triggers execution, which improves the development and debugging experience for tasks that are not performance-sensitive. We will support more NumPy and SciPy interfaces. Very importantly, we will provide a 100% pandas-compatible interface; because it uses Mars tensor as its foundation, it can also support GPUs. We will provide scikit-learn-compatible machine learning support. We will also provide the ability to schedule custom functions and classes on fine-grained graphs to increase flexibility. Finally, because our client does not depend on Python and any language can serialize the coarse-grained graph, we could provide clients in multiple languages, but whether we do so will depend on demand.

In short, open source is very important to us. Parallelizing the huge SciPy technology stack cannot be done with our efforts alone; we need everyone to help us build it together.

# Photos from the venue

Finally, here are some photos from the venue. The audience had many questions about Mars. Let me summarize:

1. How does Mars perform in specific computations such as SVD decomposition? We have some test data from a joint project with users. The input is an 800 million × 32 matrix for SVD decomposition; after decomposition the factors are multiplied back together and compared with the original matrix. The whole computation uses 100 workers (8 cores) and takes 7 minutes to complete.
2. When will Mars be open-sourced? It already is: https://github.com/mars-project/mars
3. Will Mars go closed-source again after being open-sourced?
4. How do Mars actors work in detail?
5. Is Mars a static graph or a dynamic graph? It is currently a static graph. Once eager mode is complete, it will be able to support dynamic graphs.
6. Will Mars get involved in deep learning? Not at present.

Author: Jisheng