[System Architecture] Understanding the Computing Unit of the Ascend Da Vinci Architecture

Time: 2022-02-21

Welcome to follow my official account for more notes.

O_o >_< o_O O_o ~_~ o_O

This article explains in detail the architecture and computing principles of the computing units in the Ascend Da Vinci architecture.

1. Da Vinci Architecture Overview

The Da Vinci architecture is a "domain-specific architecture" (DSA) chip design.

The computing core of the Ascend AI processor consists mainly of AI Cores, each of which contains three basic computing resources: the cube (matrix) unit, the vector unit, and the scalar unit, responsible for tensor, vector, and scalar calculations respectively. The matrix computing unit in the AI Core supports int8 and fp16 computation, and the vector computing unit supports fp16 and fp32 computation. The basic structure of the AI Core is as follows:

Here we mainly explain the computing units in the AI Core, that is, the yellow parts in the figure above (the matrix computing unit, vector computing unit, scalar computing unit, and accumulator module). The other modules will not be covered here.

2. Matrix Computing Unit

2.1 Matrix Multiplication

Because modern CNN algorithms rely heavily on matrix calculations, the Da Vinci architecture is deeply optimized for them; the matrix computing unit is the hardware unit that delivers high-throughput matrix computation.

For example, the following figure shows the multiplication of two matrices A and B, that is, C = A x B, where A has dimensions (M, K) and B has dimensions (K, N).

The above calculation process can be represented by the following code on the CPU:

// Naive triple loop: C (M x N) += A (M x K) * B (K x N), with C zero-initialized
for (int m = 0; m < M; m++)
    for (int n = 0; n < N; n++)
        for (int k = 0; k < K; k++)
            C[m][n] += A[m][k] * B[k][n];

The above code requires at least M x K x N clock cycles to complete on a single-issue CPU. During the computation, matrix A is scanned by rows and matrix B by columns. In the typical storage layout, both A and B are stored row by row (row-major). When reading data from memory, a whole row is opened and all the numbers in that row are read out. This access pattern is very friendly to matrix A but very unfriendly to matrix B. It would help if matrix B could instead be stored by columns, like this:

Therefore, in matrix multiplication, computational efficiency is often improved by changing the storage layout of one of the matrices.
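
As a rough illustration (the function and names here are hypothetical, not part of any Da Vinci API), pre-transposing B lets the inner loop scan both operands contiguously:

// Hypothetical sketch: GEMM with B pre-transposed, so the inner loop
// walks both operands row-major (contiguously) in memory.
// A is M x K row-major, Bt is N x K row-major (Bt[n][k] == B[k][n]).
void gemm_bt(int M, int N, int K,
             const float *A, const float *Bt, float *C)
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[m * K + k] * Bt[n * K + k];  // two contiguous scans
            C[m * N + n] = acc;
        }
}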

2.2 Calculation Method of the Matrix Computing Unit

Generally, when the matrices are large, the limited on-chip computing and storage resources require the matrices to be tiled into blocks, as shown in the figure below.

With limited cache resources, matrix B is divided into sub-matrices B0, B1, B2, and B3. Each sub-matrix is sized so that it can be stored in the on-chip cache in one shot and multiplied with matrix A to produce a result sub-matrix. The advantage of blocking is that it makes full use of the cache capacity and maximizes data locality during the computation, enabling large-scale matrix multiplication to be carried out efficiently; it is a common optimization technique.
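
A minimal sketch of this tiling idea (the tile size and names are illustrative; C is assumed zero-initialized and N, K multiples of the tile size):

#define TILE 16  // illustrative block size, not the actual cache-derived value

// Tiled GEMM sketch: each (k0, n0) block of B is brought into cache once
// and reused across every row of A before moving on to the next block.
void gemm_tiled(int M, int N, int K,
                const float *A, const float *B, float *C)
{
    for (int n0 = 0; n0 < N; n0 += TILE)
        for (int k0 = 0; k0 < K; k0 += TILE)  // pick one sub-matrix of B
            for (int m = 0; m < M; m++)
                for (int k = k0; k < k0 + TILE; k++)
                    for (int n = n0; n < n0 + TILE; n++)
                        C[m * N + n] += A[m * K + k] * B[k * N + n];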

In CNN networks, convolution is commonly accelerated by converting the convolution operation into a matrix operation. A GPU uses GEMM to accelerate matrix operations: to compute a 16 x 16 matrix multiplication, 256 parallel threads are launched, each independently computing one element of the output matrix. Assuming each thread completes one multiply-accumulate per clock cycle, the GPU needs 16 clock cycles to finish the whole matrix multiplication, and this latency is an unavoidable bottleneck for traditional GPUs (to be fair, this leaves Tensor Cores aside). The Ascend Da Vinci architecture is deeply optimized for exactly this problem: its matrix computing unit can complete the multiplication of two 16 x 16 matrices with a single instruction (16 cubed, which is also the origin of the name "cube"), equivalent to performing 4096 multiply-accumulate operations in a very short time.
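
Conceptually (this is a software model of the behavior just described, not the hardware interface), one cube instruction performs the work of this entire function at once:

// Software model of one cube instruction (fp16 in hardware; float here
// for simplicity): C = A x B for two 16 x 16 matrices, i.e.
// 16 * 16 * 16 = 4096 multiply-accumulates issued in one go.
void cube_mmad_16x16(const float A[16][16], const float B[16][16],
                     float C[16][16])
{
    for (int m = 0; m < 16; m++)
        for (int n = 0; n < 16; n++) {
            float acc = 0.0f;
            for (int k = 0; k < 16; k++)
                acc += A[m][k] * B[k][n];
            C[m][n] = acc;
        }
}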

Consider the same matrix multiplication example. When the Da Vinci matrix computing unit performs the A x B multiplication, it stores A in the input buffer by rows, B in the input buffer by columns, and the result matrix C in the output buffer by rows. The first element of C is obtained from the 16 elements of the first row of A and the 16 elements of the first column of B by a matrix-computation subcircuit performing 16 multiplications and 15 additions. The Da Vinci matrix computing unit contains 256 such subcircuits, so a single instruction computes all 256 elements of C in parallel. As the architecture overview showed, the matrix computing unit is followed by a group of accumulators, which serve the scenario of adding a bias after the matrix operation.

The matrix computing unit can quickly complete a 16 x 16 matrix multiplication, but when the input matrices are larger than 16 x 16, they must be processed in blocks, as shown in the following figure.

Matrix A uses the "big Z, small Z" layout (big Z means the blocks of A are ordered by rows, and small Z means the elements within each block are ordered by rows), matrix B uses "big Z, small N", and the result matrix C uses "big N, small Z".
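
For concreteness, here is a sketch of packing a row-major matrix into the "big Z, small Z" layout described above (the function name and the divisibility assumption are mine):

// Pack a row-major M x K matrix (M, K assumed multiples of 16) into
// "big Z, small Z" layout: 16 x 16 blocks ordered row by row across
// the matrix (big Z), elements ordered row by row within each block
// (small Z).
void pack_big_z_small_z(int M, int K, const float *src, float *dst)
{
    int idx = 0;
    for (int mb = 0; mb < M; mb += 16)
        for (int kb = 0; kb < K; kb += 16)          // blocks in row-major order
            for (int m = mb; m < mb + 16; m++)
                for (int k = kb; k < kb + 16; k++)  // rows within a block
                    dst[idx++] = src[m * K + k];
}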

In addition to fp16, the matrix computing unit also supports int8 precision. At int8 precision, it can complete the multiplication of a 16 x 32 matrix by a 32 x 16 matrix in one pass (8192 multiply-accumulates, double the fp16 throughput). The operating precision of the matrix computing unit can be adjusted to match the accuracy requirements of the neural network to obtain better performance.

2.3 Calculation Method of the Vector Computing Unit

The vector computing unit in the AI Core is mainly responsible for vector-related operations. It can perform calculations between two vectors and between a scalar and a vector, supporting fp32, fp16, int32, and int8 precision. The vector computing unit can quickly complete the computation of two fp16 vectors, as shown in the following figure.

Both the input and output of the vector computing unit reside in the output buffer. The input data need not be contiguous; this depends on the addressing mode used. The vector computing unit supports contiguous vector addressing and fixed-stride addressing; as a special case, for vectors at irregular addresses, vector address register addressing is provided to allow irregular access patterns. As the earlier Da Vinci architecture overview showed, the vector computing unit also serves as the data path between the matrix computing unit and the output buffer: while the matrix results are transferred to the output buffer, the vector computing unit can perform ReLU, pooling, and similar layer operations as well as format conversion. The processed data can be written back to the output buffer or fed to the matrix computing unit for the next computation. The vector unit thus complements the matrix computing unit and improves the AI Core's ability to compute on non-matrix data.
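
The three addressing modes can be modeled in C roughly as follows (a sketch, not the actual instruction set; the indexed variant corresponds to vector address register addressing):

// Contiguous addressing:   dst[i] = a[i] + b[i]
// Fixed-stride addressing: dst[i] = a[i * stride] + b[i * stride]
// Register (irregular) addressing, sketched below: the element
// addresses come from an index vector.
void vec_add_indexed(int n, const int *index,
                     const float *a, const float *b, float *dst)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[index[i]] + b[index[i]];  // irregular element access
}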

2.4 Calculation Method of the Scalar Computing Unit

The scalar computing unit is responsible for the scalar-related operations in the AI Core. It is equivalent to a miniature CPU and controls the activity of the entire AI Core: it controls program loops and makes branch decisions, and it can govern the execution of the other units in the AI Core by inserting synchronization symbols through the event synchronization module. It also computes data addresses and related parameters for the matrix and vector computing units, and can perform basic arithmetic operations. Other scalar operations of higher complexity are handled by a dedicated AI CPU.
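
As a loose illustration (all names here are invented; the real control flow lives in the AI Core's instruction stream), the division of labor looks like this:

// Hypothetical stand-in for a vector-unit operation: element-wise ReLU.
static void vector_relu(const float *src, float *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] > 0.0f ? src[i] : 0.0f;
}

// The scalar unit's role, modeled in C: loop control, branch decisions,
// and address arithmetic, with the heavy lifting issued to the vector unit.
void run_tiles(const float *base, float *out, int num_tiles, int tile_elems)
{
    for (int t = 0; t < num_tiles; t++) {                   // loop control: scalar unit
        const float *src = base + t * tile_elems;           // address arithmetic: scalar unit
        vector_relu(src, out + t * tile_elems, tile_elems); // work: vector unit
    }
}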

The scalar computing unit is surrounded by a number of general-purpose registers (GPRs) and special-purpose registers (SPRs). The general-purpose registers can hold variables or addresses, supply input operands for arithmetic and logic operations, and store intermediate results. The special-purpose registers support the special functions of certain instructions in the instruction set; in general they cannot be accessed directly, and only some of them can be read and written through instructions.

That’s all for the computing units of the Ascend Da Vinci architecture. See you in the next chapter~