Ruffian Heng embedded: performance comparison between the pure software implementation of the mbedtls algorithm library and the DCP and CAAM hardware accelerators on i.MXRT

Time:2022-7-23

Hello everyone, I'm Ruffian Heng, a serious technical ruffian. Today Ruffian Heng introduces the performance difference between the pure software implementation of the mbedtls algorithm library and the DCP and CAAM hardware accelerators on i.MXRT.

Recently, an i.MXRT customer integrating the OTA SBL project ran into a performance problem with the mbedtls library algorithms while implementing the product's 2nd bootloader. The customer wants to know how large the performance gap is between the pure software mbedtls implementation and the hardware accelerator in the i.MXRT chip. Taking the customer's question as an opportunity, today we measure the performance difference between the two approaches on i.MXRT.

The customer uses i.MXRT1170, whose hardware accelerator is CAAM, an upgrade over the DCP in the previous-generation i.MXRT10xx series. Today we test both DCP and CAAM.

1、 Introduction to mbedtls algorithm library

Mbedtls (formerly PolarSSL) is an open source SSL/TLS algorithm library. It was first open-sourced and maintained by Arm and has since been handed over to the TrustedFirmware community for maintenance. The Mbedtls open source repository address is:

Mbedtls is written in C and implements the SSL/TLS functions and various encryption algorithms with a minimal code footprint. It is easy to understand, use, integrate, and extend, which makes it convenient for developers to use SSL/TLS functions in embedded products.

Mbedtls software package mainly provides the following support:

1. Complete SSL v3, TLS v1.0, TLS v1.1 and TLS v1.2 protocol implementation

2、 Introduction to hardware accelerators on i.MXRT

2.1 DCP on the i.MXRT10xx series

DCP stands for Data Co-Processor. As the name suggests, it is a general-purpose data coprocessor. The i.MXRT1060 Security Reference Manual contains a diagram of the overall system security architecture that marks the main functions of the DCP module: the CRC-32 algorithm, the AES algorithm, hash algorithms, and DMA-like data movement. For further usage, see two earlier articles by Ruffian Heng: "Key precautions for i.MXRT10xx DCP use" and "Cache precautions for i.MXRT10xx DCP".

2.2 CAAM on the i.MXRT11xx series

CAAM stands for Cryptographic Acceleration and Assurance Module. It is a very full-featured security algorithm accelerator. The i.MXRT1170 Security Reference Manual contains a diagram of the overall system security architecture that marks the main functions of the CAAM module, which extends the existing DCP functions and adds support for more algorithms.

3、 Comparing the performance of software and hardware implementations of common algorithms

3.1 Introduction to the official SDK routines

To run mbedtls algorithms on an MCU, you normally need to port the mbedtls source code first. However, the port has already been done in NXP's official i.MXRT SDK package, and the source code is placed in \SDK_2.11.0_MIMXRT1xxx-EVK\middleware\mbedtls, so we skip the porting steps. Note: SDK version 2.11 ports mbedtls 2.27.0.

In addition, the official SDK provides the following two basic mbedtls routines: mbedtls_selftest runs through all algorithms to check that they execute correctly, and mbedtls_benchmark reports the actual performance data of all algorithms (encoding and decoding rate in KB/s).

\SDK_2.11.0_MIMXRT1xxx-EVK\boards\evkmimxrt1xxx\mbedtls_examples\mbedtls_selftest
\SDK_2.11.0_MIMXRT1xxx-EVK\boards\evkmimxrt1xxx\mbedtls_examples\mbedtls_benchmark
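
To make the KB/s figure concrete, here is a minimal sketch of how a throughput benchmark of this kind can be structured: process a fixed-size buffer repeatedly for a fixed time budget, then divide the data volume by the elapsed time. This is not the SDK's benchmark code. The names toy_digest and measure_kb_per_s are hypothetical, the workload is a stand-in checksum rather than an mbedtls call, and the host clock() function replaces the MCU timer.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define BUF_SIZE 1024u /* 1KB test buffer */

/* Hypothetical stand-in workload (rotate-and-xor checksum), NOT an mbedtls API. */
static uint32_t toy_digest(const uint8_t *buf, size_t len)
{
    uint32_t acc = 0x12345678u;
    for (size_t i = 0; i < len; i++) {
        acc = ((acc << 5) | (acc >> 27)) ^ buf[i];
    }
    return acc;
}

/* Run the workload for roughly `seconds` of CPU time and return KB/s. */
static double measure_kb_per_s(double seconds)
{
    static uint8_t buf[BUF_SIZE];
    volatile uint32_t sink = 0; /* keeps the optimizer from removing the loop */
    unsigned long iterations = 0;
    clock_t budget = (clock_t)(seconds * CLOCKS_PER_SEC);
    clock_t start = clock();
    clock_t now = start;

    do {
        buf[0] = (uint8_t)iterations; /* vary the input a little each pass */
        sink ^= toy_digest(buf, BUF_SIZE);
        iterations++;
        now = clock();
    } while ((now - start) < budget);

    double elapsed = (double)(now - start) / CLOCKS_PER_SEC;
    return ((double)iterations * BUF_SIZE / 1024.0) / elapsed;
}
```

Calling measure_kb_per_s(1.0) returns the throughput of the stand-in workload on the host. On the board, the same loop would wrap the algorithm under test and use a hardware timer instead of clock().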

3.2 Measured on i.MXRT1060

We now measure algorithm performance on the MIMXRT1060-EVK board using the mbedtls_benchmark routine. We choose the debug build, which runs the code from TCM, so that performance is at its best and memory speed does not become a bottleneck that distorts the algorithm performance data. In addition, the i.MXRT1060 core frequency is configured to its maximum of 600MHz.

By default, the mbedtls_benchmark routine enables the hardware accelerator DCP to implement the algorithms. Because we want to compare the pure software mbedtls implementation against the DCP hardware implementation, we also need to test the pure software mode, which requires temporarily setting the following macro to 0 in the project source file MIMXRT1062_features.h. The project may then fail to link (the code is linked into the 128KB ITCM), because the pure software code is much larger than the hardware driver code. In that case, you can comment out some of the algorithms in benchmark.c or ksdk_mbedtls_config.h to reduce the final code size (keep the algorithms you are interested in).

/* @brief DCP availability on the SoC. */
#define FSL_FEATURE_SOC_DCP_COUNT (0)
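
The reason a single macro can switch the whole build is compile-time selection: code that checks the feature count simply compiles the software path when the count is 0. The sketch below illustrates the idea only. DEMO_USE_DCP and demo_sha256_backend are hypothetical names, and the exact conditions used in the SDK sources are not reproduced here.

```c
/* Same macro and value as in the article: 0 hides the DCP from the build. */
#define FSL_FEATURE_SOC_DCP_COUNT (0)

/* Hypothetical helper macro: 1 when at least one DCP instance is declared. */
#if defined(FSL_FEATURE_SOC_DCP_COUNT) && (FSL_FEATURE_SOC_DCP_COUNT > 0)
#define DEMO_USE_DCP 1
#else
#define DEMO_USE_DCP 0
#endif

/* Hypothetical function: reports which implementation this build selected. */
static const char *demo_sha256_backend(void)
{
#if DEMO_USE_DCP
    return "HW-DCP";     /* a hardware driver call would be compiled here */
#else
    return "SW-mbedtls"; /* the pure software routine would be compiled here */
#endif
}
```

With the macro set to 0 as shown, demo_sha256_backend() returns "SW-mbedtls". Because the unused branch is never compiled, this also fits the code size difference between the two modes described above.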

Algorithm performance data also depends on the IDE and the compiler optimization options. Here we choose IAR and test two optimization levels: None, and High with Speed, no size constraints. Because there are so many algorithms, we pick the commonly used SHA and AES. The comparison results are as follows:

Test results (IAR v9.10)

| Test algorithm item | Opt-None, SW-mbedtls | Opt-HighSpeed, SW-mbedtls | Opt-None, HW-DCP | Opt-HighSpeed, HW-DCP |
|---|---|---|---|---|
| SHA-1 | 15967.90 KB/s, 36.02 cycles/byte | 19260.52 KB/s, 30.13 cycles/byte | 55207.68 KB/s, 10.09 cycles/byte | 66164.77 KB/s, 8.54 cycles/byte |
| SHA-256 | 6141.10 KB/s, 94.83 cycles/byte | 15473.87 KB/s, 37.57 cycles/byte | 60976.40 KB/s, 9.09 cycles/byte | 74910.71 KB/s, 7.51 cycles/byte |
| SHA-512 | 4723.55 KB/s, 123.51 cycles/byte | 7428.60 KB/s, 78.55 cycles/byte | 4720.28 KB/s, 123.61 cycles/byte | 7430.49 KB/s, 78.56 cycles/byte |
| AES-CBC-128 | 6731.48 KB/s, 86.55 cycles/byte | 10957.42 KB/s, 53.18 cycles/byte | 58411.12 KB/s, 9.52 cycles/byte | 61560.47 KB/s, 9.17 cycles/byte |
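
The gap is easier to read as a ratio. The following sketch divides the HW-DCP rate by the SW-mbedtls rate for the Opt-HighSpeed columns of the table above. The numbers are copied from the table. The type bench_row_t and the function speedup are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double sw_kbps; /* SW-mbedtls, KB/s */
    double hw_kbps; /* HW-DCP, KB/s */
} bench_row_t;

/* Opt-HighSpeed columns of the i.MXRT1060 table. */
static const bench_row_t rt1060_high_speed[] = {
    {"SHA-1",       19260.52, 66164.77},
    {"SHA-256",     15473.87, 74910.71},
    {"SHA-512",      7428.60,  7430.49},
    {"AES-CBC-128", 10957.42, 61560.47},
};

/* Hardware rate divided by software rate for one table row. */
static double speedup(const bench_row_t *row)
{
    return row->hw_kbps / row->sw_kbps;
}

/* Print one line per algorithm with its speedup ratio. */
static void print_speedups(void)
{
    size_t count = sizeof(rt1060_high_speed) / sizeof(rt1060_high_speed[0]);
    for (size_t i = 0; i < count; i++) {
        printf("%-12s %.2fx\n", rt1060_high_speed[i].name,
               speedup(&rt1060_high_speed[i]));
    }
}
```

For these rows the ratios come out at roughly 3.4x for SHA-1, 4.8x for SHA-256, 5.6x for AES-CBC-128, and 1.0x for SHA-512. The flat SHA-512 result is consistent with that algorithm running in software in both columns.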

3.3 Measured on i.MXRT1170

Using the same method as in the previous section, we measure on the MIMXRT1170-EVK board, again with the debug build of the mbedtls_benchmark routine. Note that i.MXRT1170 is a dual-core chip. We test on the Cortex-M7 core and set the core frequency to its maximum of 996MHz.

To test the software-only mode on i.MXRT1170, you only need to remove CRYPTO_USE_DRIVER_CAAM from the precompiled macros in the project options. Alternatively, you can temporarily set the following macro to 0 in MIMXRT1176_cm7_features.h. There is no code space concern this time, because the default ITCM on i.MXRT1170 is 256KB. The final test results are listed after the macro:

/* @brief CAAM availability on the SoC. */
#define FSL_FEATURE_SOC_CAAM_COUNT (0)

Test results (IAR v9.10)

| Test algorithm item | Opt-None, SW-mbedtls | Opt-HighSpeed, SW-mbedtls | Opt-None, HW-CAAM | Opt-HighSpeed, HW-CAAM |
|---|---|---|---|---|
| SHA-1 | 13156.48 KB/s, 72.45 cycles/byte | 14298.92 KB/s, 66.73 cycles/byte | 20981.07 KB/s, 44.78 cycles/byte | 27023.34 KB/s, 34.61 cycles/byte |
| SHA-256 | 7206.51 KB/s, 133.46 cycles/byte | 12208.04 KB/s, 78.36 cycles/byte | 20970.20 KB/s, 44.84 cycles/byte | 27007.46 KB/s, 34.62 cycles/byte |
| SHA-512 | 5897.39 KB/s, 163.43 cycles/byte | 8238.67 KB/s, 116.73 cycles/byte | 5894.95 KB/s, 163.57 cycles/byte | 8227.76 KB/s, 116.91 cycles/byte |
| AES-CBC-128 | 5419.23 KB/s, 178.02 cycles/byte | 6352.19 KB/s, 151.85 cycles/byte | 39786.80 KB/s, 22.96 cycles/byte | 41433.36 KB/s, 22.04 cycles/byte |
| AES-CBC-192 | 5059.84 KB/s, 190.79 cycles/byte | 6064.90 KB/s, 159.10 cycles/byte | 36596.29 KB/s, 25.08 cycles/byte | 38127.75 KB/s, 24.15 cycles/byte |
| AES-CBC-256 | 4745.47 KB/s, 203.54 cycles/byte | 5803.56 KB/s, 166.32 cycles/byte | 34012.50 KB/s, 27.11 cycles/byte | 35229.83 KB/s, 26.17 cycles/byte |

3.4 Performance test summary

  • Conclusion 1: Compared with the pure software mbedtls implementation, using the CAAM or DCP hardware accelerator module improves the performance of most algorithms, but the improvement ratio varies with the complexity of the algorithm itself.
  • Conclusion 2: The algorithms that benefit most from hardware acceleration are 3DES/DES (nearly 10 times), AES/ECDSA/ECDHE (nearly 7 times), RSA (3-5 times), and SHA-1/256 (nearly 2 times).
  • Conclusion 3: In hardware accelerator mode, for some algorithms, the longer the test data is (the default is a 1KB buffer; try adjusting it to 10KB, for example), the more obvious the performance improvement is.
  • Conclusion 4: The compiler optimization level has a certain impact on both the pure software mbedtls mode and the hardware accelerator mode.
  • Conclusion 5: The CAAM module supports far more algorithms than the DCP module, but its encoding and decoding speed is not significantly improved.
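
One simple model that fits Conclusion 3 is a fixed per-request cost (for example driver setup) added to a streaming cost that scales with the data length. This is an assumption made for illustration, not something measured in this article. The function name effective_rate and all the numbers used with it are hypothetical.

```c
/*
 * Effective throughput (bytes per microsecond) of one n-byte request, assuming
 * a fixed setup overhead in microseconds plus streaming at a peak rate.
 * Both parameters are made-up model inputs, not measured values.
 */
static double effective_rate(double n_bytes, double setup_us,
                             double peak_bytes_per_us)
{
    return n_bytes / (setup_us + n_bytes / peak_bytes_per_us);
}
```

With a made-up overhead of 10 us and a made-up peak of 100 bytes/us, a 1KB request reaches about 51 bytes/us while a 10KB request reaches about 91 bytes/us. The longer buffer spreads the fixed cost over more data, so the rate moves closer to the peak.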

That concludes the introduction to the performance difference between the pure software implementation of the mbedtls algorithm library and the DCP and CAAM hardware accelerators on i.MXRT. Where is the applause~~~
