The “GPU” of STM32: a detailed explanation of DMA2D with examples


This article was first published in the RT-Thread community and may not be reproduced without authorization.

A GPU, or graphics processing unit, is the core of a modern graphics card. Before GPUs existed, the CPU did all the drawing: it computed the boundaries, colors and other data of each shape and wrote the results to video memory. That works for simple graphics, but as computers developed (games in particular), the images to be displayed became more and more complex and the CPU could no longer keep up. The GPU was introduced to relieve the CPU of this heavy graphics workload, and it greatly increased display speed.

MCUs have gone through a similar development. In early MCU applications there was little demand for graphical display, and where there was, the display was a simple device such as a 12864 module. The computation involved was small and the MCU’s CPU handled it easily. As embedded graphics developed, however, MCUs had to take on more and more graphics computation and display work, and the resolution and color depth of embedded displays rose sharply. Gradually the CPU could no longer cope. Starting with the STM32F429, ST therefore added a GPU-like peripheral to STM32 MCUs. ST calls it the Chrom-ART Accelerator; it is also known as DMA2D, which is the name used in this article. DMA2D accelerates many 2D drawing operations and plays a role similar to that of the GPU in a modern graphics card.

This “GPU” provides only 2D acceleration and is far simpler than the GPU in a PC, but it already covers most of the graphics acceleration needs of embedded development. Used well, DMA2D makes a smooth, attractive UI possible on a microcontroller.

This article introduces the role of DMA2D in embedded graphics development through examples. The aim is to help readers quickly grasp the basic concepts of DMA2D and learn its basic usage. To keep the content approachable, the article does not analyze the advanced functions and features of DMA2D in depth (for example, the DMA2D architecture or the full register set). If you need a more detailed and rigorous treatment, refer to the STM32H743 Chinese programming manual after reading this article.

Before reading this article, you should have some understanding of the TFT LCD controller (LTDC) in STM32 and of basic graphics concepts such as frame buffer, pixel and color format.

Besides ST, many other manufacturers offer peripherals with similar functions in their MCUs (for example, the PXP that NXP designed for the RT series), but these are outside the scope of this article. Interested readers can explore them on their own.
Hardware preparation

You can use any STM32 development board with a DMA2D peripheral to try the examples in this article, for example a board based on the STM32F429, STM32F746 or STM32H750. The board used here is the ART-Pi, a development board produced officially by RT-Thread. It pairs an STM32H750XB running at up to 480 MHz with 32 MB of SDRAM, and its on-board debugger (ST-LINK/V2-1) makes it convenient for verifying all kinds of technical solutions, so it is well suited as the hardware platform for this article.

The display can be any color TFT display; one with a 16-bit or 24-bit RGB interface is recommended. This article uses a 3.5-inch TFT LCD with an RGB666 interface and a resolution of 320×240 (QVGA). The color format configured in the LTDC is RGB565.

Development environment preparation

The content and code in this article work in any development environment you like, such as RT-Thread Studio, MDK or IAR.

Before starting the experiments, you need a basic project that drives the LCD using the framebuffer technique. DMA2D must also be enabled before running any of the code in this article.

DMA2D can be enabled with this macro (once, during hardware initialization):

//The DMA2D peripheral must be enabled before it is used
__HAL_RCC_DMA2D_CLK_ENABLE();

Introduction to dma2d

Let’s first look at how ST describes DMA2D:


At first glance the description seems a little obscure, but put plainly, DMA2D offers the following functions:

Color fill (rectangular area)
Image (memory) copy
Color format conversion (such as YCbCr to RGB, or RGB888 to RGB565)
Alpha blending

The first two are memory operations and the last two are computation. Alpha blending and color format conversion can be performed together with an image copy, which gives greater flexibility.

As its name suggests, ST positions DMA2D as a DMA enhanced for image processing. In practice, DMA2D is used in much the same way as a traditional DMA controller, and in some non-graphics situations it can even stand in for one.

Note that the DMA2D accelerators in different ST product lines differ slightly. For example, the DMA2D in STM32F4 series MCUs cannot convert between the ARGB and ABGR color formats. When you need a specific function, check the programming manual to confirm that it is supported.

This article covers only the DMA2D functions that are common to all platforms.
Operating modes of DMA2D

Just as a traditional DMA has three working modes (peripheral to peripheral, peripheral to memory and memory to peripheral), DMA2D, being a DMA, has the following four working modes:

Register to memory
Memory to memory
Memory to memory with pixel color format conversion
Memory to memory with pixel color format conversion and alpha blending

The first two modes are plain memory operations, while the last two perform color format conversion and/or alpha blending during the memory copy.
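As a compact reference, the four modes can be written down as values of the mode field in bits [17:16] of the DMA2D CR register. The values for register to memory (11), memory to memory (00) and blending (10) are the ones used later in this article. The value 01 for pixel format conversion, and all the names below (DemoDma2dMode, DemoModeToCr), are assumptions made for this sketch, so check them against the reference manual of your part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the four DMA2D modes, numbered as they are
 * encoded in CR[17:16]. */
typedef enum {
    DEMO_MODE_M2M       = 0x0, /* memory to memory */
    DEMO_MODE_M2M_PFC   = 0x1, /* memory to memory with pixel format conversion (assumed value) */
    DEMO_MODE_M2M_BLEND = 0x2, /* memory to memory with conversion and blending */
    DEMO_MODE_R2M       = 0x3  /* register to memory (color fill) */
} DemoDma2dMode;

/* Returns the value to write to CR to select a mode, with all other bits 0. */
static inline uint32_t DemoModeToCr(DemoDma2dMode mode) {
    assert((unsigned)mode <= 0x3u);
    return (uint32_t)mode << 16;
}
```

With these definitions, DemoModeToCr(DEMO_MODE_R2M) yields 0x00030000, the constant written to CR in the color fill example.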
DMA2D and the HAL library

In most cases, using the HAL library simplifies coding and improves portability. DMA2D is an exception. The biggest problems of the HAL library are its deep call nesting and numerous safety checks, which cost efficiency. For other peripherals this loss has little impact, but DMA2D exists for computation and acceleration, and its operations are called many times in each screen drawing cycle, so going through the HAL library seriously reduces the acceleration DMA2D can deliver.

Most of the time, therefore, we do not use the HAL functions to operate DMA2D. For efficiency we operate the registers directly, which maximizes the acceleration.

Because the working mode changes frequently in most DMA2D use cases, the graphical DMA2D configuration in CubeMX also loses its significance.
DMA2D example scenarios

1. Color fill

The following figure is a simple bar chart:


Let’s think about how to draw it.

First, we fill the screen with white as the background. This step cannot be skipped, otherwise whatever was previously on the screen would interfere with the chart. The chart itself consists of four blue rectangles and a line segment, and the line segment can be treated as a special rectangle with a height of 1. Drawing the figure therefore breaks down into a series of “rectangle fill” operations:

Fill a rectangle the size of the screen with white
Fill the four data bars with blue
Fill a line segment with a height of 1 with black

Drawing a rectangle of any size at any position on the canvas amounts to setting the pixels at the corresponding memory locations to the specified color. However, because the framebuffer is stored linearly in memory, the addresses of a rectangle that looks contiguous on screen are not contiguous in memory unless the rectangle is exactly as wide as the display area.

The following figure shows a typical memory layout. The numbers represent the memory address of each pixel in the frame buffer (expressed as an offset from the start address, and ignoring the fact that one pixel occupies several bytes). The blue area is the rectangle we want to fill. As you can see, its memory addresses are not contiguous.
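The layout described above can be checked with a few lines of C. This is a minimal sketch that assumes a row-major framebuffer; PixelIndex and RowGap are hypothetical helper names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index (in pixels) of pixel (x, y) in a linear framebuffer that is
 * screenWidth pixels wide. Multiply by the pixel size for a byte offset. */
static inline size_t PixelIndex(uint32_t x, uint32_t y, uint32_t screenWidth) {
    assert(x < screenWidth);
    return (size_t)y * screenWidth + x;
}

/* Number of pixels that lie between the last pixel of one rectangle row
 * and the first pixel of the next rectangle row. */
static inline size_t RowGap(uint32_t xs, uint32_t ys, uint32_t width,
                            uint32_t screenWidth) {
    size_t endOfRow    = PixelIndex(xs + width - 1, ys, screenWidth);
    size_t startOfNext = PixelIndex(xs, ys + 1, screenWidth);
    return startOfNext - endOfRow - 1;
}
```

For a 20-pixel-wide rectangle at (80, 80) on a 320-pixel-wide screen, RowGap returns 300: every row of the rectangle is followed by 300 pixels that must be skipped. Only when the rectangle spans the full screen width does the gap become 0.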


This property of the framebuffer means we cannot simply use an efficient operation such as memset to fill a rectangular area. Typically we fill a rectangle with the following double loop, where xs and ys are the screen coordinates of the upper left corner of the rectangle, width and height are its dimensions, and color is the fill color:

for(int y = ys; y < ys + height; y++){
    for(int x = xs; x < xs + width; x++){
        framebuffer[y][x] = color;
    }
}

The code is simple, but at run time a large share of CPU cycles goes to comparisons, address calculation and increments, and only a small share to the actual memory writes, so efficiency suffers.

This is where the register-to-memory mode of DMA2D comes in. DMA2D can fill a rectangular memory area at very high speed, even when that area is not contiguous in memory.

Still using the situation in this figure as the example, let’s see how it is done:


First, because we only fill memory and do not copy it, DMA2D must work in register-to-memory mode. This is done by setting bits [17:16] of the DMA2D CR register to 11:

DMA2D->CR = 0x00030000UL;

Next we tell DMA2D the properties of the rectangle to fill: where the area starts in memory, how many pixels wide it is and how many rows high.

The start address is the memory address of the first pixel in the upper left corner of the rectangle (the red pixel in the figure) and is held in the DMA2D OMAR register. The width and height are in pixels and are held in the high 16 bits (width) and low 16 bits (height) of the NLR register:

DMA2D->OMAR = (uint32_t)(&framebuffer[ys][xs]); // Sets the memory address of the first pixel of the filled area
DMA2D->NLR = (uint32_t)(width << 16) | (uint16_t)height; // Sets the width and height of the rectangular area

Because the rectangle is not contiguous in memory, we also tell DMA2D how many pixels to skip after filling each row (the length of the yellow area in the figure). This value is held in the OOR register. A simple way to compute it is to subtract the width of the rectangle from the width of the display area:

DMA2D->OOR = screenWidthPx - width; // Sets the row offset, that is, the number of pixels to skip
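The two register values computed so far can be tried on a PC before running on hardware. The helpers below are a sketch that follows the description in the text (width in the high half of NLR, height in the low half, OOR as screen width minus rectangle width); PackNlr and LineOffset are hypothetical names, and the real register fields may be narrower than 16 bits on a given part.

```c
#include <assert.h>
#include <stdint.h>

/* Packs width and height the way the text describes the NLR register:
 * width in the high 16 bits, height in the low 16 bits. Taking uint32_t
 * arguments avoids shifting a promoted 16-bit value into the sign bit. */
static inline uint32_t PackNlr(uint32_t width, uint32_t height) {
    assert(width <= 0xFFFFu && height <= 0xFFFFu);
    return (width << 16) | height;
}

/* Pixels to skip after each filled row, as written to the OOR register. */
static inline uint32_t LineOffset(uint32_t screenWidthPx, uint32_t width) {
    assert(width <= screenWidthPx);
    return screenWidthPx - width;
}
```

For a 20×120 data bar on a 320-pixel-wide screen, PackNlr(20, 120) gives 0x00140078 and LineOffset(320, 20) gives 300.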

Finally, we tell DMA2D which color to fill with and what the color format is. These are held in the OCOLR and OPFCCR registers respectively, and the color format is given by one of the LTDC_PIXEL_FORMAT_XXX macros:

DMA2D->OCOLR = color; // Sets the color used for the fill
DMA2D->OPFCCR = pixelFormat; // Sets the color format, for example LTDC_PIXEL_FORMAT_RGB565 for RGB565
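The fill color has to be expressed in the same format as the output. For RGB565, a value such as 0x001F can be derived from 8-bit components as sketched below. PackRgb565 is a hypothetical helper, and it assumes the common layout with red in the top 5 bits, green in the middle 6 bits and blue in the low 5 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Packs 8-bit red, green and blue components into one RGB565 value by
 * keeping the most significant 5, 6 and 5 bits of each component. */
static inline uint16_t PackRgb565(unsigned r, unsigned g, unsigned b) {
    assert(r <= 255u && g <= 255u && b <= 255u);
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Pure blue (0, 0, 255) packs to 0x001F and white (255, 255, 255) packs to 0xFFFF, the two colors used for the bars and the background of the chart.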

With everything set, DMA2D has all the information it needs to fill the rectangle. We start the transfer by setting bit 0 of the DMA2D CR register to 1:

DMA2D->CR |= DMA2D_CR_START; // Starts the DMA2D transfer; DMA2D_CR_START is a macro with the value 0x01

Once the transfer has started, we only need to wait for it to finish. When it finishes, bit 0 of the CR register is automatically cleared to 0, so we can wait with the following code:

while (DMA2D->CR & DMA2D_CR_START) {} // Wait for the DMA2D transfer to complete

Tips0: if you use an OS, you can enable the DMA2D transfer-complete interrupt. Create a semaphore, wait on it after starting the transfer, and release it in the DMA2D transfer-complete interrupt service routine. The CPU can then do other work while DMA2D runs instead of waiting here.

Tips1: in practice, a DMA2D memory fill is so fast that an OS task switch costs more time than the fill itself, so even with an OS we still choose to busy-wait :).

To keep the function general, the start address and line offset are computed outside the function and passed in. The complete function is as follows:

static inline void DMA2D_Fill( void * pDst, uint32_t width, uint32_t height, uint32_t lineOff, uint32_t pixelFormat, uint32_t color) {

    /*DMA2D configuration*/
    DMA2D->CR      = 0x00030000UL;                                  //  Configured as register to memory mode
    DMA2D->OCOLR   = color;                                         //  Set the fill color. Its format must match the configured color format
    DMA2D->OMAR    = (uint32_t)pDst;                                //  The starting memory address of the filled area
    DMA2D->OOR     = lineOff;                                       //  Row offset, that is, the number of pixels to skip. Note that it is in pixels
    DMA2D->OPFCCR  = pixelFormat;                                   //  Set the color format
    DMA2D->NLR     = (uint32_t)(width << 16) | (uint16_t)height;    //  Sets the width and height of the filled area in pixels

    /*Start transmission*/
    DMA2D->CR     |= DMA2D_CR_START;

    /*Wait for DMA2D transmission to complete*/
    while (DMA2D->CR & DMA2D_CR_START) {}
}

To make the code easier to write, we wrap it in a rectangle fill function that uses screen coordinates:

void FillRect(uint16_t x, uint16_t y, uint16_t w, uint16_t h, uint16_t color){
    void* pDist = &(((uint16_t*)framebuffer)[y*320 + x]);
    DMA2D_Fill(pDist, w, h, 320 - w, LTDC_PIXEL_FORMAT_RGB565, color);
}

Finally, we try to draw the example chart at the beginning of this section with code:

//Fill background color
FillRect(0, 0, 320, 240, 0xFFFF);
//Draw data bar
FillRect(80, 80, 20, 120, 0x001f);
FillRect(120, 100, 20, 100, 0x001f);
FillRect(160, 40, 20, 160, 0x001f);
FillRect(200, 60, 20, 140, 0x001f);
//Draw X axis
FillRect(40, 200, 240, 1, 0x0000);
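The drawing calls above can be replayed on a PC against a software model of the fill, which helps catch coordinate mistakes before flashing a board. This is only a sketch of the behavior described in this section (write width pixels, skip the line offset, repeat for each row), not a model of the real peripheral timing. The names fb, ModelFill, ModelFillRect and DrawChart are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCREEN_W 320
#define SCREEN_H 240

static uint16_t fb[SCREEN_W * SCREEN_H]; /* host-side stand-in for the framebuffer */

/* Software model of a register-to-memory fill: each row consists of
 * `width` written pixels followed by `lineOff` skipped pixels. */
static void ModelFill(uint16_t *pDst, uint32_t width, uint32_t height,
                      uint32_t lineOff, uint16_t color) {
    uint32_t stride = width + lineOff;
    for (uint32_t row = 0; row < height; row++) {
        for (uint32_t col = 0; col < width; col++) {
            pDst[(size_t)row * stride + col] = color;
        }
    }
}

/* Same interface as FillRect, but drawing into the host-side buffer. */
static void ModelFillRect(uint16_t x, uint16_t y, uint16_t w, uint16_t h,
                          uint16_t color) {
    assert(x + w <= SCREEN_W && y + h <= SCREEN_H);
    ModelFill(&fb[(size_t)y * SCREEN_W + x], w, h, SCREEN_W - w, color);
}

/* Replays the chart: white background, four blue bars, black X axis. */
static void DrawChart(void) {
    ModelFillRect(0, 0, 320, 240, 0xFFFF);
    ModelFillRect(80, 80, 20, 120, 0x001F);
    ModelFillRect(120, 100, 20, 100, 0x001F);
    ModelFillRect(160, 40, 20, 160, 0x001F);
    ModelFillRect(200, 60, 20, 140, 0x001F);
    ModelFillRect(40, 200, 240, 1, 0x0000);
}
```

After DrawChart() runs, the pixel at (80, 80) is blue, the pixel at (100, 199) between the first two bars is still white, and row 200 is black from x = 40 to x = 279.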

Code running effect:

2. Picture display (memory copy)

Suppose we are developing a game and want to show a flickering flame on the screen. Typically the artist draws each frame of the flame and packs all the frames into a single image, as shown in the following figure:


Displaying the frames one after another at a fixed interval then produces the “flickering flame” effect on screen.

Let’s skip loading the image file into memory and assume the image is already there. How do we display one of the frames? The usual approach is to compute the address of the frame’s data in memory and then copy that data to the corresponding position in the framebuffer. The code looks something like this:


/**
 * Copy a frame in the material to the corresponding position in the framebuffer
 * index is the index of the picture in the frame sequence
 */
static void General_DisplayFrameAt(uint16_t index) {

    //Macro description
    // #define FRAME_COUNTS 25       // number of frames
    // #define TILE_WIDTH_PIXEL 96   // the width of each frame (equal to the height)
    // #define TILE_COUNT_ROW 5      // how many frames are there in each line of the material

    //Calculate frame start address: skip whole rows of tiles first, then move to the tile column
    uint16_t *pStart = (uint16_t *) img_fireSequenceFrame;
    pStart += (index / TILE_COUNT_ROW) * (TILE_WIDTH_PIXEL * TILE_WIDTH_PIXEL * TILE_COUNT_ROW);
    pStart += (index % TILE_COUNT_ROW) * TILE_WIDTH_PIXEL;

    //Calculate material address offset
    uint32_t offlineSrc = (TILE_COUNT_ROW - 1) * TILE_WIDTH_PIXEL;
    //Calculate the framebuffer address offset (320 is the screen width)
    uint32_t offlineDist = 320 - TILE_WIDTH_PIXEL;

    //Copy data to framebuffer, one row of the frame at a time
    uint16_t* pFb = (uint16_t*) framebuffer;
    for (int y = 0; y < TILE_WIDTH_PIXEL; y++) {
        memcpy(pFb, pStart, TILE_WIDTH_PIXEL * sizeof(uint16_t));
        pStart += offlineSrc + TILE_WIDTH_PIXEL;
        pFb += offlineDist + TILE_WIDTH_PIXEL;
    }
}

As you can see, this takes a large number of memory copy operations. In embedded systems, hardware DMA is the most efficient way to copy large amounts of data, but it can only handle data at contiguous addresses. Here the data to be copied is non-contiguous both in the source image and in the framebuffer, which adds overhead (the same problem as in Section 1) and prevents us from using hardware DMA for an efficient copy.
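The address arithmetic for picking a frame out of the material image is easy to get wrong, so it is worth checking in isolation. The sketch below assumes the layout given in the macro comments (25 frames of 96×96 pixels, stored 5 per row, so the image is 480 pixels wide); FrameStartOffset and SourceLineOffset are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_COUNTS     25 /* number of frames */
#define TILE_WIDTH_PIXEL 96 /* width of each frame (equal to the height) */
#define TILE_COUNT_ROW   5  /* frames per row of the material image */

/* Pixel offset of the top-left pixel of frame `index` from the start of
 * the material image. */
static inline size_t FrameStartOffset(uint16_t index) {
    assert(index < FRAME_COUNTS);
    size_t sheetWidth = (size_t)TILE_COUNT_ROW * TILE_WIDTH_PIXEL; /* 480 pixels */
    size_t tileRow = index / TILE_COUNT_ROW;
    size_t tileCol = index % TILE_COUNT_ROW;
    return tileRow * TILE_WIDTH_PIXEL * sheetWidth + tileCol * TILE_WIDTH_PIXEL;
}

/* Pixels to skip in the source after each copied row of a frame. */
static inline size_t SourceLineOffset(void) {
    return (size_t)(TILE_COUNT_ROW - 1) * TILE_WIDTH_PIXEL;
}
```

Frame 4 (last in the first row) starts at pixel offset 384, while frame 5 (first in the second row) starts at 96 × 480 = 46080, one full row of tiles further down. The source line offset is 384 pixels for every frame.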

So although the goal is achieved, the efficiency is not high (or at least not the highest).

To move a block of data from the material image into the frame buffer as fast as possible, let’s see how to use DMA2D.

First, because this time data is copied within memory, we set the DMA2D working mode to memory to memory, by setting bits [17:16] of the DMA2D CR register to 00:

DMA2D->CR = 0x00000000UL;

Then we set the memory addresses of the source and the destination. Unlike in Section 1, the source data also has a line offset, so we set the offsets of both the source and the destination:

DMA2D->FGMAR = (uint32_t)pSrc; // source address
DMA2D->OMAR = (uint32_t)pDst; // Destination address
DMA2D->FGOR = OffLineSrc; // Source data offset (pixels)
DMA2D->OOR = OffLineDst; // Destination address offset (pixels)

Next we set the width and height of the image to copy and the color format, as in Section 1:

DMA2D->FGPFCCR = pixelFormat;
DMA2D->NLR = (uint32_t)(xSize << 16) | (uint16_t)ySize;

In the same way, we start the DMA2D transfer and wait for it to complete:

/*Start transmission*/
DMA2D->CR |= DMA2D_CR_START;

/*Wait for DMA2D transmission to complete*/
while (DMA2D->CR & DMA2D_CR_START) {}
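To check source and destination offsets on a PC, the transfer configured in the steps above can be modeled in plain C. This is a sketch of the behavior as described in the text (copy xSize pixels, then skip the line offset on each side, for ySize rows), assuming 16-bit pixels. ModelMemCopy is a hypothetical helper and is not part of any ST library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software model of a memory-to-memory rectangle copy with independent
 * line offsets (in pixels) for the source and the destination. */
static void ModelMemCopy(const uint16_t *pSrc, uint16_t *pDst,
                         uint32_t xSize, uint32_t ySize,
                         uint32_t offLineSrc, uint32_t offLineDst) {
    assert(pSrc != NULL && pDst != NULL);
    for (uint32_t row = 0; row < ySize; row++) {
        memcpy(pDst + (size_t)row * (xSize + offLineDst),
               pSrc + (size_t)row * (xSize + offLineSrc),
               xSize * sizeof(uint16_t));
    }
}
```

For example, copying a 2×2 block out of a 4-pixel-wide source into a 5-pixel-wide destination uses a source offset of 2 and a destination offset of 3.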

Finally, we extract the following function:

static void DMA2D_MemCopy(uint32_t pixelFormat, void * pSrc, void * pDst, int xSize, int ySize, int OffLineSrc, int OffLineDst)
{
    /*DMA2D configuration*/
    DMA2D->CR      = 0x00000000UL;                                 //  Memory to memory mode
    DMA2D->FGMAR   = (uint32_t)pSrc;                               //  Source address
    DMA2D->OMAR    = (uint32_t)pDst;                               //  Destination address
    DMA2D->FGOR    = OffLineSrc;                                   //  Source line offset (pixels)
    DMA2D->OOR     = OffLineDst;                                   //  Destination line offset (pixels)
    DMA2D->FGPFCCR = pixelFormat;                                  //  Color format
    DMA2D->NLR     = (uint32_t)(xSize << 16) | (uint16_t)ySize;    //  Width and height in pixels

    /*Start transmission*/
    DMA2D->CR     |= DMA2D_CR_START;

    /*Wait for DMA2D transmission to complete*/
    while (DMA2D->CR & DMA2D_CR_START) {}
}


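For convenience, we wrap it in a function that calls it:

static void DMA2D_DisplayFrameAt(uint16_t index){

    //pDist (destination address in the framebuffer) and offlineDist (destination line offset)
    //are not defined in this excerpt and are assumed to be defined elsewhere
    uint16_t *pStart = (uint16_t *)img_fireSequenceFrame;
    pStart += (index / TILE_COUNT_ROW) * (TILE_WIDTH_PIXEL * TILE_WIDTH_PIXEL * TILE_COUNT_ROW);
    pStart += (index % TILE_COUNT_ROW) * TILE_WIDTH_PIXEL;
    uint32_t offlineSrc = (TILE_COUNT_ROW - 1) * TILE_WIDTH_PIXEL;

    DMA2D_MemCopy(LTDC_PIXEL_FORMAT_RGB565, (void*) pStart, pDist, TILE_WIDTH_PIXEL, TILE_WIDTH_PIXEL, offlineSrc, offlineDist);
}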


Then we play the frames in sequence. The frame interval is set to 50 ms here, and the destination address is placed at the center of the framebuffer:

for(int i = 0; i < FRAME_COUNTS; i++){
    DMA2D_DisplayFrameAt(i);
    HAL_Delay(50);
}

Effect of final operation:

3. Picture gradient switching

Suppose we are developing a picture viewer. Switching directly from one picture to another looks abrupt, so we add a transition effect, and a cross-fade (fade in, fade out) is a commonly used and good-looking one.

We will use these two pictures:


First we need the basic concept of alpha blending. Blending requires a foreground and a background, and the result is what you would see looking at the background through the foreground. If the foreground is completely opaque, the background is not visible at all. Conversely, if the foreground is completely transparent, only the background is visible. If the foreground is translucent, the two are mixed according to the transparency of the foreground.

If 1 means completely transparent and 0 means opaque, the blending formula is as follows, where A is the background color, B is the foreground color and C is the result:

$$X(C) = (1 - \alpha)\,X(B) + \alpha\,X(A)$$

Because a color has three channels (R, G and B), we compute each channel separately and combine the results:

$$R(C) = (1 - \alpha)\,R(B) + \alpha\,R(A)$$
$$G(C) = (1 - \alpha)\,G(B) + \alpha\,G(A)$$
$$B(C) = (1 - \alpha)\,B(B) + \alpha\,B(A)$$

In a program, for efficiency (floating-point operations are slow on the CPU), we do not use values in the range 0 to 1. Instead we usually represent transparency with an 8-bit value from 0 to 255. Note that here a larger value means more opaque: 255 is completely opaque and 0 is completely transparent (which is why this value is also called opacity). This gives the final formula:

outColor = ((int) (fgColor * alpha) + (int) (bgColor) * (256 - alpha)) >> 8;
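A worked example makes the integer formula concrete. The function below applies it to a single color channel; BlendChannel is a hypothetical name, and the inputs are assumed to be channel values and opacity in the range 0 to 255.

```c
#include <assert.h>
#include <stdint.h>

/* Integer alpha blend of one color channel. alpha is the opacity of the
 * foreground: 0 shows only the background, 255 shows (almost) only the
 * foreground. The shift by 8 divides by 256. */
static inline uint8_t BlendChannel(unsigned fg, unsigned bg, unsigned alpha) {
    assert(fg <= 255u && bg <= 255u && alpha <= 255u);
    return (uint8_t)((fg * alpha + bg * (256u - alpha)) >> 8);
}
```

With a foreground of 200, a background of 100 and an opacity of 128, the result is (200·128 + 100·128) >> 8 = 150, halfway between the two. Because the formula divides by 256 rather than 255, full opacity is slightly approximate: a 5-bit foreground value of 31 over a background of 0 at opacity 255 gives 30, not 31.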

An implementation of alpha blending for a pixel in RGB565 format:

typedef struct{
    uint16_t r:5;
    uint16_t g:6;
    uint16_t b:5;
}RGB565Struct;

static inline uint16_t AlphaBlend_RGB565_8BPP(uint16_t fg, uint16_t bg, uint8_t alpha) {

    RGB565Struct *fgColor = (RGB565Struct*) (&fg);
    RGB565Struct *bgColor = (RGB565Struct*) (&bg);
    RGB565Struct outColor;

    //Blend each channel separately, then return the recombined pixel
    outColor.r = ((int) (fgColor->r * alpha) + (int) (bgColor->r) * (256 - alpha)) >> 8;
    outColor.g = ((int) (fgColor->g * alpha) + (int) (bgColor->g) * (256 - alpha)) >> 8;
    outColor.b = ((int) (fgColor->b * alpha) + (int) (bgColor->b) * (256 - alpha)) >> 8;

    return *((uint16_t*)&outColor);
}


Now that we understand alpha blending and can blend a single pixel, let’s see how to cross-fade between pictures.

Assume the whole transition takes 30 frames. We allocate a buffer in memory the size of the picture, take the first picture (the one currently displayed) as the background and the second picture (the one to display next) as the foreground. We then set an opacity for the foreground, blend every pixel and store the result in the buffer. When blending is done, we copy the buffer to the framebuffer to display one frame. We repeat this for the second frame, the third frame and so on, gradually increasing the opacity of the foreground until it is fully opaque, at which point the transition is complete.
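The “gradually increase the opacity” step can be expressed as a linear ramp over the frames of the transition. This is a minimal sketch of one reasonable schedule; FadeOpacity is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Opacity of the foreground for frame i of a transition that takes
 * `frames` frames: 0 on the first frame, 255 on the last, linear between. */
static inline uint8_t FadeOpacity(uint32_t i, uint32_t frames) {
    assert(frames >= 2 && i < frames);
    return (uint8_t)(255u * i / (frames - 1));
}
```

For a 30-frame transition this yields 0 for frame 0, 131 for frame 15 and 255 for frame 29, so the last frame shows the new picture fully opaque.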

Every frame requires blending every pixel of the two pictures, which is a huge amount of computation. Leaving it to the CPU would be unwise, so let’s hand it to DMA2D.

This time we use the blending function of DMA2D, so we enable memory-to-memory mode with blending. The corresponding value of bits [17:16] of the CR register is 10:

DMA2D->CR = 0x00020000UL; // Set the operating mode to memory to memory with color mixing

Then set the memory address and data transmission offset of the foreground, background and output data, and the width and height of the transmitted image respectively:

DMA2D->FGMAR = (uint32_t)pFg; // Set foreground data memory address
DMA2D->BGMAR = (uint32_t)pBg; // Set background data memory address
DMA2D->OMAR = (uint32_t)pDst; // Set data output memory address

DMA2D->FGOR = offlineFg; // Set foreground data transfer offset
DMA2D->BGOR = offlineBg; // Set background data transfer offset
DMA2D->OOR = offlineDist; // Set data output transmission offset

DMA2D->NLR = (uint32_t)(xSize << 16) | (uint16_t)ySize; // Set image data width and height (pixels)

Next, set the color formats. Take care when setting the foreground color format: with a format such as ARGB, the alpha channel in the color data itself would affect the blending result. We therefore configure DMA2D to ignore the foreground’s own alpha channel during blending and to use the opacity we specify instead. The background and output color formats are set as usual:

DMA2D->FGPFCCR = pixelFormat            //  Set foreground color format
        | (1UL << 16)                   //  Ignore the alpha channel in the foreground color data
        | ((uint32_t)opa << 24);        //  Set foreground color opacity

DMA2D->BGPFCCR = pixelFormat; // Set background color format
DMA2D->OPFCCR = pixelFormat; // Set output color format
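The value written to the foreground format register combines three fields, and composing it on a PC shows where each one lands. The sketch below mirrors the expression in the text; MakeFgpfccr is a hypothetical helper, and the value 2 used for RGB565 in the note is an assumption, so take the real constant from your device headers.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the foreground format register value described in the text:
 * color format in the low bits, bit 16 set so that the alpha channel of
 * the pixel data is replaced, and the fixed opacity in bits [31:24]. */
static inline uint32_t MakeFgpfccr(uint32_t pixelFormat, uint8_t opa) {
    assert(pixelFormat <= 0xFu);
    return (uint32_t)(pixelFormat | (1UL << 16) | ((uint32_t)opa << 24));
}
```

Assuming RGB565 is encoded as 2, a half-transparent foreground (opacity 128) gives 0x80010002: 0x80 in the top byte, the replace-alpha bit at position 16 and the format in the low bits.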

Tips0: sometimes we need to overlay a picture that has its own transparency channel on a background. In that case the alpha channel of the color data should not be disabled.

Tips1: in this mode we can convert color formats while blending, setting the foreground, background and output color formats as needed.

Finally, start the transfer:

/*Start transmission*/
DMA2D->CR |= DMA2D_CR_START;

/*Wait for DMA2D transmission to complete*/
while (DMA2D->CR & DMA2D_CR_START) {}

The complete code is as follows:

void _DMA2D_MixColors(void* pFg, void* pBg, void* pDst,
        uint32_t offlineFg, uint32_t offlineBg, uint32_t offlineDist,
        uint16_t xSize, uint16_t ySize,
        uint32_t pixelFormat, uint8_t opa) {

    DMA2D->CR    = 0x00020000UL;                //  Set the operating mode to memory to memory with color mixing

    DMA2D->FGMAR = (uint32_t)pFg;               //  Set foreground data memory address
    DMA2D->BGMAR = (uint32_t)pBg;               //  Set background data memory address
    DMA2D->OMAR  = (uint32_t)pDst;              //  Set data output memory address

    DMA2D->FGOR  = offlineFg;                   //  Set foreground data transfer offset
    DMA2D->BGOR  = offlineBg;                   //  Set background data transfer offset
    DMA2D->OOR   = offlineDist;                 //  Set data output transfer offset

    DMA2D->NLR = (uint32_t)(xSize << 16) | (uint16_t)ySize; //  Set image data width and height (pixels)

    DMA2D->FGPFCCR = pixelFormat                //  Set foreground color format
            | (1UL << 16)                       //  Ignore the alpha channel in the foreground color data
            | ((uint32_t)opa << 24);            //  Set foreground color opacity

    DMA2D->BGPFCCR = pixelFormat;               //  Set background color format
    DMA2D->OPFCCR  = pixelFormat;               //  Set output color format

    /*Start transmission*/
    DMA2D->CR |= DMA2D_CR_START;

    /*Wait for DMA2D transmission to complete*/
    while (DMA2D->CR & DMA2D_CR_START) {}
}


Write the test code. This time there is no need to wrap the function again:

void DMA2D_AlphaBlendDemo(){

    const uint16_t lcdXSize = 320, lcdYSize = 240;
    const uint8_t cnvFrames = 60; //  The switch completes in 60 frames
    const uint32_t interval = 33; //  About 30 frames per second
    uint32_t time = 0;

    //Calculate the memory address of the output location
    uint16_t distX = (lcdXSize - DEMO_IMG_WIDTH) / 2;
    uint16_t distY = (lcdYSize - DEMO_IMG_HEIGHT) / 2;
    uint16_t* pFb = (uint16_t*) framebuffer;
    uint16_t* pDist = pFb + distX + distY * lcdXSize; //  One framebuffer row is lcdXSize pixels long
    uint16_t offlineDist = lcdXSize - DEMO_IMG_WIDTH;

    uint8_t nextImg = 1;
    uint16_t opa = 0;
    void* pFg = 0;
    void* pBg = 0;
    while(1){
        //Toggle foreground / background picture
        if(nextImg){
            pFg = (void*)img_cat;
            pBg = (void*)img_fox;
        }
        else{
            pFg = (void*)img_fox;
            pBg = (void*)img_cat;
        }

        //Perform the switch, raising the foreground opacity frame by frame
        for(int i = 0; i < cnvFrames; i++){
            time = HAL_GetTick();
            opa = 255 * i / (cnvFrames-1);
            //The pictures are stored contiguously, so the foreground and background offsets are 0
            _DMA2D_MixColors(pFg, pBg, pDist,
                    0, 0, offlineDist,
                    DEMO_IMG_WIDTH, DEMO_IMG_HEIGHT,
                    LTDC_PIXEL_FORMAT_RGB565, opa);
            time = HAL_GetTick() - time;
            if(time < interval){
                HAL_Delay(interval - time);
            }
        }
        nextImg = !nextImg;
    }
}


Final effect:

Performance comparison

We have now looked at three examples from embedded graphics development, each implemented both in the traditional way and with DMA2D. Some readers will ask how much faster the DMA2D implementation is than the traditional one. Let’s measure it.

The common test conditions are as follows:

The framebuffer is placed in SDRAM, 320x240, RGB565
SDRAM operating frequency 100 MHz, CL2, 16-bit bus width
The MCU is an STM32H750XB clocked at 400 MHz, with I-Cache and D-Cache enabled
Code and resources are in internal flash, on a 64-bit AXI bus at 200 MHz
GCC compiler (version: arm-atollic-eabi-gcc 6.3.1)

Rectangle fill

Test method:

Draw the chart from Section 1 of the previous chapter 10000 times and record the total time

Test results:

| Drawing method | Time taken (-O0) | Time taken (-O3) |
|---|---|---|
| Software implementation | 39641 ms | 9930 ms |
| DMA2D | 9827 ms | 9817 ms |

Memory copy

Test method:

Draw 10000 sequence frames from Section 2 of the previous chapter and record the total time

Test results:

| Drawing method | Time taken (-O0) | Time taken (-O3) |
|---|---|---|
| Software implementation | 68787 ms | 48654 ms |
| DMA2D | 26201 ms | 26160 ms |

Transparency blending

Test method:

Cross-fade between the two pictures from Section 3 of the previous chapter 100 times, 30 frames each time, for a total of 3000 frames
The blended results are written directly to the framebuffer instead of going through an intermediate buffer

Test results:

| Drawing method | Time taken (-O0) | Time taken (-O3) |
|---|---|---|
| Software implementation | 20824 ms | 2617 ms |
| DMA2D | 681 ms | 681 ms |

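Dividing the software time by the DMA2D time in each of the three tests gives the speedup at each optimization level. The figures below are computed only from the measurements reported above and are rounded to one decimal place:

```latex
\begin{aligned}
\text{Rectangle fill:}       &\quad \frac{39641}{9827} \approx 4.0 \ (\text{-O0}),  &\quad \frac{9930}{9817} \approx 1.0 \ (\text{-O3}) \\
\text{Memory copy:}          &\quad \frac{68787}{26201} \approx 2.6 \ (\text{-O0}), &\quad \frac{48654}{26160} \approx 1.9 \ (\text{-O3}) \\
\text{Transparency blending:} &\quad \frac{20824}{681} \approx 30.6 \ (\text{-O0}),  &\quad \frac{2617}{681} \approx 3.8 \ (\text{-O3})
\end{aligned}
```

The blending test at -O0 shows the largest gap, and the DMA2D times change by less than 0.2% between -O0 and -O3 in every test.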
Performance test summary

As the test results show, DMA2D has at least two advantages:

First, it is faster. In some cases the DMA2D implementation is up to 30 times faster than a pure software implementation, and that is on the STM32H750 platform running at 400 MHz with L1 cache. On an STM32F4 platform, with no cache and a lower clock frequency, the gap will be even wider.

Second, its performance is more stable. The results show that the DMA2D implementation is barely affected by the compiler optimization level, which means you get the same performance with DMA2D whether you use IAR, GCC or MDK, and the same code will not show significant performance differences after being ported.

Beyond these two measurable results there is a third advantage: the code is simpler. DMA2D has few registers and they are intuitive, so in some cases it is considerably more convenient to use than a software implementation.

The three examples in this article are situations I often encounter in embedded graphics development. DMA2D has many other uses; if you are interested, refer to the relevant chapters of the STM32H743 Chinese programming manual. With this article as a foundation, reading them should be much easier.

The author’s knowledge is limited and the article cannot be guaranteed to be 100% correct. If you find errors, please point them out. Thank you.

Original link:https://club.rt-thread.org/as…
