A bug in the NDK clang compiler

Time: 2021-02-23

Problem code

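#include <arm_neon.h>

// Sums `count` floats: 16 elements per loop iteration with NEON, then a scalar tail.
float32_t Sum_float(float32_t *data, const int count)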
{
    float32x4_t res = vdupq_n_f32(0.0f);
    for(int i = 0; i < (count & (~15)); i += 16)
    {
        #if 01  // non-zero: single _x4 load; change to 0 for four separate loads
        float32x4x4_t v0 = vld1q_f32_x4(data + i);
        float32x4_t v00 = v0.val[0];
        float32x4_t v01 = v0.val[1];
        float32x4_t v02 = v0.val[2];
        float32x4_t v03 = v0.val[3];

        #else
        float32x4_t v00 = vld1q_f32(data + i);
        float32x4_t v01 = vld1q_f32(data + i + 4);
        float32x4_t v02 = vld1q_f32(data + i + 8);
        float32x4_t v03 = vld1q_f32(data + i + 12);

        #endif

        v00 = vaddq_f32(v00, v02);
        v01 = vaddq_f32(v01, v03);
        res = vaddq_f32(res, vaddq_f32(v00, v01));        

    }
    float32x2_t res1 = vadd_f32(vget_low_f32(res), vget_high_f32(res));

    float32_t v0[2];
    vst1_f32(v0, res1);
    v0[0] += v0[1];
    for(int i = count & (~15); i < count; ++i){
        v0[0] += data[i];
    }

    return v0[0];
}

Compile test

First of all, according to the Arm NEON intrinsics reference (https://static.docs.arm.com/ihi0073/c/IHI0073C_arm_neon_intrinsics_ref.pdf), the vld1q_f32_x4 intrinsic is supported on v7, A32 and A64.

The results differ between NDK versions. With every version tested, the #else block compiles successfully. With the #if 01 block, the results are as follows:

NDK     armeabi-v7a with -O1    armeabi-v7a with -O0    arm64-v8a
r20c    error (a)               ok                      ok
r19c    ok                      ok                      ok
r15c    error (b)               error (b)               ok

(a) clang++: error: clang frontend command failed due to signal (use -v to see invocation)
(b) error: use of undeclared identifier 'vld1q_f32_x4'

The problem is not limited to vld1q_f32_x4: similar intrinsics such as vld1_u8_x2 and vst1q_f32_x4 are affected in the same way.
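Since the #else variant compiled with every NDK version tested above, one way to sidestep the affected intrinsics is to wrap the four plain vld1q_f32 loads in a small helper. The following is only a sketch: load_f32x4x4 and sum_float_portable are hypothetical names that do not come from the NDK, and the scalar fallback for non-NEON hosts is an assumption added so the code also builds off-device.

```cpp
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1

// Hypothetical stand-in for vld1q_f32_x4: four separate 128-bit loads.
static inline float32x4x4_t load_f32x4x4(const float *p) {
    float32x4x4_t v;
    v.val[0] = vld1q_f32(p);
    v.val[1] = vld1q_f32(p + 4);
    v.val[2] = vld1q_f32(p + 8);
    v.val[3] = vld1q_f32(p + 12);
    return v;
}
#else
#define SKETCH_HAVE_NEON 0
#endif

// Sums `count` floats. Uses NEON for blocks of 16 when available,
// and a scalar loop for the tail (or for everything on non-NEON hosts).
float sum_float_portable(const float *data, std::size_t count) {
    std::size_t i = 0;
    float total = 0.0f;
#if SKETCH_HAVE_NEON
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 16 <= count; i += 16) {
        float32x4x4_t v = load_f32x4x4(data + i);
        float32x4_t a = vaddq_f32(v.val[0], v.val[2]);
        float32x4_t b = vaddq_f32(v.val[1], v.val[3]);
        acc = vaddq_f32(acc, vaddq_f32(a, b));
    }
    // Reduce the four lanes of the accumulator to one scalar.
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    total = vget_lane_f32(half, 0) + vget_lane_f32(half, 1);
#endif
    for (; i < count; ++i) {
        total += data[i];
    }
    return total;
}
```

Keeping the workaround behind a single helper means only one function has to change if a later NDK release makes the _x4 intrinsics usable again.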

Performance comparison

Test code:

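#include <arm_neon.h>
#include <cstdio>
#include <cstdlib>
#include <ctime>

// Sum_float is the function from the "Problem code" section above.
int main()
{
    const size_t len = 1024 * 1024 * 16;
    float32_t *data = new float32_t[len];
    for(size_t i = 0; i < len; ++i) {
        data[i] = std::rand() / 100.0;
    }

    clock_t t0 = std::clock();
    float32_t sum = Sum_float(data, static_cast<int>(len));
    printf("sum=%f , time cost=%f ms\n", sum, 1000.0 * (double)(std::clock() - t0) / CLOCKS_PER_SEC);
    delete[] data;
    return 0;
}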

All three NDK versions were tested on arm64-v8a, and r19c was also used to build for armeabi-v7a. In each case the #if and #else variants both took about 55 ms, with no significant difference.
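A single std::clock() measurement can be noisy when two variants are this close. As a rough sketch, the run could be repeated and the best wall-clock time kept; best_time_ms is a hypothetical helper, not part of the original test, and the repeat count is an arbitrary choice.

```cpp
#include <chrono>
#include <limits>

// Runs `fn` `repeats` times and returns the fastest run in milliseconds.
// Returns +infinity if `repeats` is zero or negative.
template <typename F>
double best_time_ms(F &&fn, int repeats) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best) {
            best = ms;
        }
    }
    return best;
}
```

For example, best_time_ms([&] { sum = Sum_float(data, count); }, 5) would time five calls and report the quickest one, assuming sum, data and count are already defined in the caller.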

A similar issue: https://github.com/mattgodbolt/compiler-explorer/issues/1906

Address alignment

Although the armeabi-v7a build compiles successfully with r19c, or with r20c when optimization is turned off, the program crashes at run time. The crash happens when a vldN(q)_type_xN instruction is executed on an address that is not aligned.

For the arm64-v8a build, printing every address passed to vldN(q)_type_xN shows that addresses such as 0x7350800001 occur, with the last hex digit ranging from 0 to e, yet no error is raised. Does this mean that, for this instruction, only armeabi-v7a has an address alignment requirement and arm64-v8a does not?

The conventional vldN(q)_type instructions, by contrast, do not require an aligned address, so it is best not to use vldN(q)_type_xN.
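If the _xN variants cannot be avoided entirely, checking the pointer before taking that path is one possible guard. The sketch below is an assumption rather than a documented requirement: is_aligned and is_aligned_addr are hypothetical helpers, and the alignment the generated instruction actually expects is not stated here, so it is left as a parameter.

```cpp
#include <cstddef>
#include <cstdint>

// True when `addr` is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
inline bool is_aligned_addr(std::uintptr_t addr, std::size_t alignment) {
    return (addr & (static_cast<std::uintptr_t>(alignment) - 1)) == 0;
}

// Pointer overload of the same check.
inline bool is_aligned(const void *p, std::size_t alignment) {
    return is_aligned_addr(reinterpret_cast<std::uintptr_t>(p), alignment);
}
```

With these helpers, is_aligned_addr(0xf0900001u, 4) evaluates to false, which matches the odd fault address reported in the crash log. A caller could fall back to the plain vldN(q)_type loads whenever the check fails.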

Crash log caused by the misaligned address:

libc    : Fatal signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001 in tid 27659 (ClarityOpt), pid 27659 (ClarityOpt)
crash_dump32: obtaining output fd from tombstoned, type: kDebuggerdTombstone
crash_dump32: performing dump of process 27659 (target tid = 27659)
DEBUG   : Process name is /data/local/tmp/ClarityOpt, not key_process
DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
DEBUG   : Build fingerprint: 'OPPO/PCCM00/OP4A7A:10/QKQ1.191222.002/1584699103:user/release-keys'
DEBUG   : Revision: '0'
DEBUG   : ABI: 'arm'
DEBUG   : Timestamp: 2020-05-09 15:15:16+0800
DEBUG   : pid: 27659, tid: 27659, name: ClarityOpt  >>> /data/local/tmp/ClarityOpt <<<
DEBUG   : uid: 0
crash_dump32: type=1400 audit(0.0:27044): avc: denied { read } for name="ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
crash_dump32: type=1400 audit(0.0:27045): avc: denied { open } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
crash_dump32: type=1400 audit(0.0:27046): avc: denied { getattr } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
crash_dump32: type=1400 audit(0.0:27047): avc: denied { map } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
DEBUG   : signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001
DEBUG   :     r0  00000043  r1  00000000  r2  a9a5ac6f  r3  00000003
DEBUG   :     r4  f0900001  r5  ffcb0a00  r6  ffcb0a40  r7  ffcb0b60
DEBUG   :     r8  f0900007  r9  00000001  r10 f0900000  r11 f0900000
DEBUG   :     ip  ffcb0500  sp  ffcb09f0  lr  00000004  pc  021d265e
DEBUG   : 
DEBUG   : backtrace:
DEBUG   :       #00 pc 0000365e  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
DEBUG   :       #01 pc 00004271  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
DEBUG   :       #02 pc 00004c9f  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
DEBUG   :       #03 pc 00004dd3  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
DEBUG   :       #04 pc 000513bb  /apex/com.android.runtime/lib/bionic/libc.so (__libc_init+66) (BuildId: 8e41d0dce7911ae25a51deb63aa9720c)
DEBUG   :       #05 pc 00002a98  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)