[Experience sharing] Using memory padding in GPU CUDA to avoid bank conflicts

Time: 2022-11-25



This article shows how to use memory padding in GPU CUDA programming to avoid bank conflicts.

1. Shared memory

Shared memory is a small on-chip memory whose latency is far lower than that of global memory (roughly an order of magnitude or more). It can be used as a programmable cache. Its main roles, illustrated by the sketch after the list below, are as follows:

  • An intra-block communication channel: threads of the same block can exchange data through it;
  • A program-managed cache for global memory data;
  • Scratch pad memory for reshaping data, which reduces the number of global memory accesses.
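
As a minimal illustration of the first two roles, here is a sketch of a block-level sum reduction (hypothetical kernel and names, not from the original article; it assumes blockDim.x == 256 and an input length that is a multiple of 256). The block stages global data in shared memory, and its threads then cooperate through it:

__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];                  // one float per thread
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * 256 + tid];      // program-managed cache: stage global data
    __syncthreads();

    for (int s = 128; s > 0; s >>= 1)           // tree reduction: threads communicate
    {                                           // with each other through shared memory
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];               // thread 0 publishes the block's sum
}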

Shared memory can be allocated statically or dynamically, and can be declared either inside a kernel or at file scope, using the following qualifier:

__shared__ /// qualifier

__shared__ float tile[TILE_Y][TILE_X]; /// statically declares a 2D float array
                                       /// (TILE_Y and TILE_X are compile-time constants)

/// Dynamic declaration inside the kernel: the size is not known at compile time
extern __shared__ int tile[];

/// The size in bytes is then supplied as the third launch parameter
kernel<<<grid, block, isize * sizeof(int)>>>(...);
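
A minimal sketch of how such a dynamically sized array might be used inside the kernel (hypothetical body; it assumes blockDim.x >= n and isize >= n):

__global__ void kernel(const int *in, int *out, int n)
{
    extern __shared__ int tile[];       /// isize ints, sized at launch time
    int tid = threadIdx.x;

    if (tid < n)
        tile[tid] = in[tid];            /// stage the input in shared memory
    __syncthreads();

    if (tid < n)
        out[tid] = tile[n - 1 - tid];   /// read it back reversed, via shared memory
}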

To deliver high bandwidth, shared memory is divided into 32 equally sized memory banks, one per thread of a warp, which can all be accessed at the same time. Successive 32-bit words map to successive banks.
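
Because successive 4-byte words map to successive banks, the bank of any element can be computed by hand. A tiny helper (hypothetical, for illustration only) makes the mapping explicit:

/// Which bank does element [r][c] of a float tile fall in?
/// Assumes 32 banks, each 4 bytes wide (the default on modern NVIDIA GPUs).
int bankOf(int r, int c, int rowStride)
{
    int wordIndex = r * rowStride + c;  /// linear index in 4-byte words
    return wordIndex % 32;              /// successive words -> successive banks
}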


2. Using memory padding to avoid bank conflicts

Shared memory is as fast as registers if there are no bank conflicts.

Fast cases:

  • All threads in a warp access different banks: no conflict;
  • All threads in a warp read the same address: the broadcast mechanism is triggered and there is no conflict (both patterns are sketched after this list).
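
A minimal sketch of the two fast patterns (hypothetical kernel; it assumes a block of exactly one 32-thread warp):

__global__ void fastPatterns(float *out)
{
    __shared__ float s[32];
    s[threadIdx.x] = (float)threadIdx.x;  // each lane writes a different bank: no conflict
    __syncthreads();

    float a = s[threadIdx.x];             // 32 consecutive words -> 32 distinct banks
    float b = s[0];                       // all lanes read the same word: broadcast
    out[threadIdx.x] = a + b;
}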

Slow cases:

  • Bank conflict: multiple threads in a warp access different words that live in the same bank;
  • The conflicting accesses must be serialized;
  • The cost is determined by the maximum number of threads that hit the same bank at the same time: an n-way conflict is served in n conflict-free passes (a 2-way example is sketched after this list).
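
A minimal sketch of a 2-way conflict (hypothetical kernel; it assumes a block of exactly one 32-thread warp):

__global__ void twoWayConflict(float *out)
{
    __shared__ float s[64];
    s[threadIdx.x] = (float)threadIdx.x;        // initialize both halves of the array
    s[threadIdx.x + 32] = (float)threadIdx.x;
    __syncthreads();

    // Stride-2 read: lane i reads word 2*i, which sits in bank (2*i) % 32.
    // Lanes i and i+16 therefore hit the same bank at two different words:
    // a 2-way conflict, served in two serialized passes.
    out[threadIdx.x] = s[2 * threadIdx.x];
}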

To make bank conflicts concrete, consider a square tile of shared memory accessed by a 2D thread block.

First, the version with no bank conflict:

#define BLOCKSIZE 32 // tile width matches the warp size

__global__ void copyNoConflict(const float *matrix, float *matrixTest, int row, int col)
{
    int x_id = blockDim.x * blockIdx.x + threadIdx.x; // column coordinate
    int y_id = blockDim.y * blockIdx.y + threadIdx.y; // row coordinate
    int index = y_id * col + x_id;

    __shared__ float sData[BLOCKSIZE][BLOCKSIZE];

    if (x_id < col && y_id < row)
    {
        // Threads of a warp share threadIdx.y and have consecutive threadIdx.x,
        // so they touch 32 consecutive words, i.e. 32 different banks.
        sData[threadIdx.y][threadIdx.x] = matrix[index];
        __syncthreads();
        matrixTest[index] = sData[threadIdx.y][threadIdx.x];
    }
}

Now the version with a bank conflict:

__global__ void copyWithConflict(const float *matrix, float *matrixTest, int row, int col)
{
    int x_id = blockDim.x * blockIdx.x + threadIdx.x; // column coordinate
    int y_id = blockDim.y * blockIdx.y + threadIdx.y; // row coordinate
    int index = y_id * col + x_id;

    __shared__ float sData[BLOCKSIZE][BLOCKSIZE];

    if (x_id < col && y_id < row)
    {
        // Indices swapped: a warp now accesses words threadIdx.x * 32 + threadIdx.y,
        // all of which fall into the same bank.
        sData[threadIdx.x][threadIdx.y] = matrix[index];
        __syncthreads();
        matrixTest[index] = sData[threadIdx.x][threadIdx.y];
    }
}

The only change between the two versions is the order of the shared-memory indices, yet it turns a conflict-free access into a worst-case one: a warp's threads now access words threadIdx.x * 32 + threadIdx.y, and (threadIdx.x * 32 + threadIdx.y) % 32 = threadIdx.y, which is the same for every thread of the warp, so all 32 threads hit the same bank, producing a 32-way conflict. Let's see how to resolve it.

Continuing with this example, a single element of memory padding per row is enough to remove the conflict: widening each row of the tile by one element shifts the start of every row into a different bank.

In code, the conflicting version above becomes:

__global__ void copyPadded(const float *matrix, float *matrixTest, int row, int col)
{
    int x_id = blockDim.x * blockIdx.x + threadIdx.x; // column coordinate
    int y_id = blockDim.y * blockIdx.y + threadIdx.y; // row coordinate
    int index = y_id * col + x_id;

    __shared__ float sData[BLOCKSIZE][BLOCKSIZE + 1]; // memory padding

    if (x_id < col && y_id < row)
    {
        // The row stride is now 33 words, so the warp accesses words
        // threadIdx.x * 33 + threadIdx.y, which map to banks
        // (threadIdx.x + threadIdx.y) % 32: 32 distinct banks, no conflict.
        sData[threadIdx.x][threadIdx.y] = matrix[index];
        __syncthreads();
        matrixTest[index] = sData[threadIdx.x][threadIdx.y];
    }
}
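
For completeness, a possible launch for these kernels (hypothetical host code; row and col are assumed to be multiples of BLOCKSIZE so that every thread of a block reaches __syncthreads()):

dim3 block(BLOCKSIZE, BLOCKSIZE);             // 32 x 32 threads per block
dim3 grid(col / BLOCKSIZE, row / BLOCKSIZE);  // one tile per block
copyPadded<<<grid, block>>>(d_matrix, d_matrixTest, row, col);
cudaDeviceSynchronize();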

That is how memory padding can be used to avoid bank conflicts in GPU CUDA programming. I hope this sharing is of some help to your study.

