Spark source code learning — built-in RPC framework (1)

Time: 2020-10-18

Many places in Spark involve network communication, such as message exchange between Spark components, the upload of user files and jar packages, the shuffle process between nodes, and the replication and backup of block data. In Spark 0.x.x and 1.x.x, message communication between components relied mainly on Akka, which makes it easy to build powerful, highly concurrent, distributed applications. However, Akka was removed in Spark 2.0.0; the official Spark documentation describes this as: “Akka’s dependency has been removed, so users can program against any version of Akka.” Since Spark 2.0.0, the upload of user files and jar packages is handled by NettyStreamManager, which is based on Spark’s built-in RPC framework, while the shuffle process between nodes and the replication and backup of block data continue to use Netty directly. By redesigning the interfaces and implementation, the message exchange between components and the upload of user files and jar packages have been integrated into Spark’s RPC framework.

Let’s take a look at the basic architecture of the RPC framework

[Figure: basic architecture of Spark's built-in RPC framework]

TransportContext contains the configuration of the transport context, TransportConf, and the RpcHandler that processes client request messages. TransportConf is required when creating both TransportClientFactory and TransportServer, while RpcHandler is used only when creating TransportServer. TransportClientFactory is the factory class for RPC clients, and TransportServer is the implementation of the RPC server. The tokens in the figure have the following meanings.

Token (1) indicates that an instance of the transport client factory, TransportClientFactory, is created by calling the createClientFactory method of TransportContext. When an instance of TransportClientFactory is constructed, a list of client bootstraps, TransportClientBootstrap, is also passed in. In addition, TransportClientFactory maintains a connection pool, ClientPool, for each socket address. The connection pool cache is defined as follows:

private final ConcurrentHashMap<SocketAddress, ClientPool> connectionPool;

The ClientPool type is defined as follows:

private static class ClientPool {
  TransportClient[] clients;
  Object[] locks;  // one lock object per slot in the clients array

  ClientPool(int size) {
    clients = new TransportClient[size];
    locks = new Object[size];
    for (int i = 0; i < size; i++) {
      locks[i] = new Object();
    }
  }
}

As can be seen, ClientPool is essentially an array of TransportClients, and the objects in the locks array correspond one-to-one, by array index, to the TransportClients in the clients array. Giving each TransportClient its own lock reduces lock contention between concurrent threads, which reduces blocking and improves concurrency.
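
To illustrate, here is a minimal sketch of how a caller might pick a connection from such a pool. The random slot choice and the double-check inside the per-slot lock mirror the pattern used by TransportClientFactory.createClient, but this is a simplified illustration: getOrCreate, numConnectionsPerPeer, and newClient() are placeholder names for this sketch, not Spark's actual code.

// Hypothetical helper showing the per-slot locking pattern;
// newClient() stands in for the real connection-establishing code.
TransportClient getOrCreate(ClientPool pool, int numConnectionsPerPeer) {
  int index = new Random().nextInt(numConnectionsPerPeer);  // pick a random slot
  TransportClient cached = pool.clients[index];
  if (cached != null && cached.isActive()) {
    return cached;  // fast path: reuse a live connection without locking
  }
  // Synchronize only on this slot's lock, so threads working on other
  // slots of the same peer are not blocked while a connection is (re)built.
  synchronized (pool.locks[index]) {
    cached = pool.clients[index];
    if (cached == null || !cached.isActive()) {
      pool.clients[index] = newClient();  // hypothetical: open a new connection
    }
    return pool.clients[index];
  }
}

Locking per slot rather than on the whole pool is the design point: two threads asking for connections to the same peer only contend when the random index happens to collide.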

Token (2) indicates that an instance of the transport server, TransportServer, is created by calling the createServer method of TransportContext. Constructing a TransportServer instance requires the TransportContext, the host, the port, the RpcHandler, and a list of server bootstraps, TransportServerBootstrap.
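
As a concrete illustration of tokens (1) and (2), the sketch below wires the pieces together: a TransportConf, a minimal RpcHandler that simply echoes requests, and a TransportContext from which the server and client factory are created. This is a sketch against the Spark 2.x network-common API; the module name "rpc", the use of MapConfigProvider.EMPTY, and the echo handler are assumptions for this example, not Spark's production setup.

import java.nio.ByteBuffer;
import java.util.Collections;

import org.apache.spark.network.TransportContext;
import org.apache.spark.network.client.RpcResponseCallback;
import org.apache.spark.network.client.TransportClient;
import org.apache.spark.network.client.TransportClientFactory;
import org.apache.spark.network.server.OneForOneStreamManager;
import org.apache.spark.network.server.RpcHandler;
import org.apache.spark.network.server.StreamManager;
import org.apache.spark.network.server.TransportServer;
import org.apache.spark.network.util.MapConfigProvider;
import org.apache.spark.network.util.TransportConf;

public class RpcWiringSketch {
  public static void main(String[] args) {
    // TransportConf for a module named "rpc", with no config overrides.
    TransportConf conf = new TransportConf("rpc", MapConfigProvider.EMPTY);

    // A minimal RpcHandler that echoes every request back to the caller.
    RpcHandler echoHandler = new RpcHandler() {
      @Override
      public void receive(TransportClient client, ByteBuffer message,
          RpcResponseCallback callback) {
        callback.onSuccess(message);
      }

      @Override
      public StreamManager getStreamManager() {
        return new OneForOneStreamManager();
      }
    };

    TransportContext context = new TransportContext(conf, echoHandler);

    // Token (2): create the server (bound to an ephemeral port, no bootstraps).
    TransportServer server = context.createServer(Collections.emptyList());

    // Token (1): create the client factory (no client bootstraps).
    TransportClientFactory clientFactory =
        context.createClientFactory(Collections.emptyList());

    System.out.println("Server listening on port " + server.getPort());
  }
}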

With an understanding of the basic architecture of Spark’s built-in RPC framework, we will now formally introduce the components of Spark’s RPC framework.
TransportContext: the transport context, which contains the context information for creating TransportServer and TransportClientFactory, and which uses TransportChannelHandler to set up the pipeline of the SocketChannel provided by Netty.
TransportConf: the configuration of the transport context.
RpcHandler: handles messages sent via the sendRpc method of TransportClient.
MessageEncoder: encodes a message before it is put into the pipeline, to prevent packet loss and parsing errors when it is read at the other end.
MessageDecoder: parses the ByteBuf read from the pipeline, to prevent packet loss and parsing errors.
TransportFrameDecoder: splits the ByteBuf read from the pipeline into data frames.
RpcResponseCallback: the interface through which RpcHandler calls back after processing a request message.
TransportClientFactory: the factory class that creates TransportClients.
ClientPool: a pool of TransportClients maintained between two peers; ClientPool is an internal component of TransportClientFactory.
TransportClient: the client of the RPC framework, used to fetch consecutive chunks of a pre-negotiated stream. TransportClient is designed to allow efficient transfer of large amounts of data, split into chunks ranging from hundreds of KB to a few MB. While TransportClient handles fetching chunks from a stream, the actual setup of the stream is done outside the transport layer; the sendRpc method enables this setup through control-plane communication between the client and the server. A usage example follows this list.
TransportClientBootstrap: a bootstrap executed once on the client side when the server responds to a client connection.
TransportRequestHandler: a handler that processes client requests and writes chunk data back.
TransportResponseHandler: a handler that processes server responses to requests issued by the client.
TransportChannelHandler: delegates requests to TransportRequestHandler and responses to TransportResponseHandler, and adds transport-layer handling on top.
TransportServerBootstrap: a bootstrap executed once on the server side when a client connects to the server.
TransportServer: the server side of the RPC framework, providing efficient, low-level streaming services.
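
To tie several of these components together, a client obtained from the factory can send an RPC and receive the reply through an RpcResponseCallback. This sketch assumes the clientFactory and echo server from the earlier example, a host and port where that server listens, and the Spark 2.x ByteBuffer-based sendRpc signature; createClient may throw checked exceptions that are omitted here for brevity.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.spark.network.client.RpcResponseCallback;
import org.apache.spark.network.client.TransportClient;

// Assuming clientFactory was created as in the earlier sketch and the
// echo server is listening on host:port.
TransportClient client = clientFactory.createClient(host, port);

client.sendRpc(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)),
    new RpcResponseCallback() {
      @Override
      public void onSuccess(ByteBuffer response) {
        // Invoked on the client side once the server's reply arrives,
        // after it has passed through TransportResponseHandler.
        System.out.println("Got " + response.remaining() + " bytes back");
      }

      @Override
      public void onFailure(Throwable e) {
        e.printStackTrace();
      }
    });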

This blog is based on the book The Art of Spark Kernel Design: Architecture Design and Implementation.