Crash tracking journey in iOS development (I)

Time:2022-5-6

Preface: Recently I ran into a painful crash incident in day-to-day development. After a release in early May, the crash rate of the app I work on jumped from about one per thousand to nearly two per thousand. The project lead happened to need to report the project's QA status to upper management just then, so I was stunned.

Problem

At the beginning of the year, we put a lot of effort into fully migrating the original OC-based project to Swift. Thanks to Swift's safety, the crash rate had stayed low, so the sudden jump was baffling. In Umeng's crash tracking, every report read "attempted to dereference garbage pointer 0x18ffd63d72d0". The symbolicated call stacks were just as confusing: the crash occurred on no fixed thread, and the app's own frames pointed to line 0 of a model class.
After going through the crash logs and the user behavior recorded in Umeng, I concluded that I was dealing with a memory problem or a wild pointer. Had this headache been just a few sporadic crashes, I would have put it off without hesitation. But a crash rate of two per thousand directly threatens my job, so I had to grit my teeth and solve it. It was also a chance to rekindle my pursuit of technical excellence.

Result

After sorting through Umeng's crash logs and behavior logs, together with the crash logs collected from users through App Store Connect, I could determine that the random crashes were caused by a wild pointer or memory problem. The clues were as follows:

  • A very high proportion of the random crashes occur after the app has been running for about 5 minutes
  • Crashes also occur during cold start (the recorded user trail contains only a view controller), which suggests the offending code runs during the launch phase
  • The crashes appear related to the change in how IDFA is obtained. Version 2.1.0 was rejected in review because of the IDFA policy change, and for the resubmission the risk control department changed IDFA collection inside their SDK, so the risk control SDK provided by the group is the prime suspect. I also recalled that when integrating it I found the SDK was written in C and C++, so a bridging header was added to use it from Swift.
  • Many of the crashes were collected on newer iOS versions and newer phones (arm64e). The exception types in the raw crash logs differ between older and newer devices: SIGSEGV on arm64e and SIGBUS on arm64

A complaint about Umeng here: its crash collection does not show the original error. Everything is summarized as "attempted to dereference garbage pointer", which makes troubleshooting and locating errors harder.

Solution steps:

  1. Pull the 2.1.0 release code
  2. Enable Address Sanitizer (ASan) in Xcode, then rebuild and run. ASan reports a bad address, which confirms a memory usage problem
  3. Comment out the initialization code of the risk control SDK. The problem disappears, which confirms the SDK is the cause
  4. Contact the risk control group for a replacement SDK and verify that it passes testing
  5. Wait for online verification

With that, the crash problem came to an end. During the tracking process, however, I revisited many underlying technologies and tools, so after solving the problem I sat down again to write an in-depth summary.

Technical points involved

  1. iOS memory management: there is plenty of material on OC and C++, which can be extended with a summary for Swift
  2. Symbol file parsing, advanced lldb debugging and plug-in writing, ASan
  3. iOS system crashes: exceptions (Mach, OC) and UNIX/BSD signals
  4. Principle and implementation of APM tools, based on Bugly
  5. PAC (Pointer Authentication Code): https://justinyan.me/post/4129

I will explore these technical points over several chapters.

Crashes in the iOS system

1. Classification of crashes

In a previous article on memory management, I explored one reason mobile apps crash: the mobile system gives up the swap mechanism in order to protect flash memory.

The main reason for a crash is that the app receives a signal it does not handle. The core operating system of iOS is Darwin, the Darwin kernel is XNU ("X is Not UNIX"), and XNU is a hybrid kernel based on Mach + BSD. The signals that cause crashes can therefore be divided into three types:

  1. Mach exception: Mach handles the lower-level work in XNU, so a Mach exception is a low-level, kernel-level exception. Developers in user mode can set the exception ports of a thread, task or host directly through the Mach API to catch Mach exceptions
  2. UNIX signal: also known as a BSD signal (sent by the BSD layer in XNU). If the developer does not catch the Mach exception, the host-layer function ux_exception() converts it into the corresponding UNIX signal, and threadsignal() delivers that signal to the faulting thread. The signal can be caught with signal(x, signalHandler)
  3. NSException: an application-level exception, which can also be regarded as an OC language-level exception. It causes the program to send itself a SIGABRT signal. The crash can be caught with try/catch or through the NSSetUncaughtExceptionHandler() mechanism

There is little material in the community on Swift's exception mechanism. Swift's error-handling mechanism is worth studying separately.

Of these three levels, language-level (OC) application crashes are the easiest to solve. Crashes caused by out-of-bounds array access, by the runtime's objc_msgSend message forwarding mechanism, or by OC language mechanisms such as KVC can be located quickly through the backtrace in the crash log. Crashes caused by Mach exceptions and UNIX signals are a real challenge even for senior developers.

2. Mach exception and UNIX signal

What is Mach exception? How does it connect with UNIX signals?

//Crash log header
Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0022000000000000 -> 0x0000000000000000 (possible pointer authentication failure)
  1. A Mach exception is a kernel-level exception raised while Mach, the microkernel core of XNU, is running. Each thread, task and host (see the note below on what the host is) has an array of exception ports. Some Mach APIs are exposed to user mode, so developers can set the exception ports of a thread, task or host directly through the Mach API to catch Mach exceptions.
  2. Every unhandled Mach exception is converted by ux_exception() into a UNIX signal, and threadsignal() delivers the signal to the faulting thread. The POSIX API of iOS is implemented by the BSD layer on top of Mach.

Note: the most basic object in Mach is the "host", the object that represents the machine itself.

For example, the crash log header above shows an EXC_BAD_ACCESS exception (access to invalid memory). Because it was not caught at the Mach layer, the host layer converted it into a SIGSEGV signal and delivered it to the faulting thread.
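
The conversion described above can be sketched as a small lookup function. This is a simplified illustration of the mapping as I understand it from the XNU sources, not the kernel's actual code: the DEMO_ constants and demo_signal_for_exception are hypothetical names, and the numeric values are copied from <mach/exception_types.h> and <mach/kern_return.h> so the sketch compiles without the Mach headers.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical DEMO_ names; values mirror <mach/exception_types.h>. */
enum {
    DEMO_EXC_BAD_ACCESS      = 1,
    DEMO_EXC_BAD_INSTRUCTION = 2,
    DEMO_EXC_ARITHMETIC      = 3,
    DEMO_EXC_BREAKPOINT      = 6
};

/* Hypothetical DEMO_ names; values mirror <mach/kern_return.h>. */
enum {
    DEMO_KERN_INVALID_ADDRESS    = 1,
    DEMO_KERN_PROTECTION_FAILURE = 2
};

/* Returns the UNIX signal for a Mach exception and its first code,
 * or 0 for exception types this sketch does not cover. */
int demo_signal_for_exception(int exception, int code) {
    assert(exception > 0); /* Mach exception types start at 1 */
    switch (exception) {
    case DEMO_EXC_BAD_ACCESS:
        /* An unmapped address becomes SIGSEGV; other bad accesses become SIGBUS. */
        return code == DEMO_KERN_INVALID_ADDRESS ? SIGSEGV : SIGBUS;
    case DEMO_EXC_BAD_INSTRUCTION:
        return SIGILL;
    case DEMO_EXC_ARITHMETIC:
        return SIGFPE;
    case DEMO_EXC_BREAKPOINT:
        return SIGTRAP;
    default:
        return 0;
    }
}
```

Under this mapping, an EXC_BAD_ACCESS with KERN_INVALID_ADDRESS yields SIGSEGV, while an EXC_BAD_ACCESS with KERN_PROTECTION_FAILURE yields SIGBUS.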

So:

  1. An unhandled Mach exception is converted into a UNIX signal. An unhandled application-level exception (NSException) ends up calling C's abort(), which sends the app a signal through __pthread_kill, so any uncaught exception is eventually turned into a UNIX signal.
  2. Signals that originate in hardware (through the CPU's trap mechanism: mach_msg_trap(); in Mach, a trap is the equivalent of a system call) are captured by Mach and then converted into UNIX signals.
  3. To keep the mechanism unified, Apple also turns signals generated by the operating system or by users (kill and pthread_kill) into Mach exceptions first, and finally into UNIX signals.

Crash generation principle and transmission process

4. Classification of Mach exceptions and UNIX signals

Common Mach exceptions

  • EXC_CRASH: abnormal process exit (SIGABRT), or the watchdog killing the app on timeout (SIGKILL)
  • EXC_BREAKPOINT (SIGTRAP)
  • EXC_BAD_ACCESS: invalid memory access
  • EXC_BAD_INSTRUCTION: the thread attempted to execute an illegal / invalid instruction or passed an invalid parameter (operand) to an instruction
  • EXC_ARITHMETIC: raised by division by 0 or integer overflow / underflow
  • EXC_SYSCALL and EXC_MACH_SYSCALL: raised when an application accesses kernel services (such as file I/O or network access)
  • Other Mach exceptions are defined in mach/exception_types.h. Processor-specific exceptions are defined in mach/(i386, ppc, …)/exception.h

The most common exception in development is probably EXC_BAD_ACCESS, such as the one tracked this time.

UNIX signal

A signal handler can be installed with the signal() call. If a handler is installed for a signal, the process intercepts the signal and the handler is called. If no handler is installed, the program can choose between two behaviors: ignore the signal (SIG_IGN) or use the default action (SIG_DFL). Two signals can never be intercepted or ignored: SIGKILL and SIGSTOP.
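
A minimal sketch of these rules in portable POSIX C follows. It uses sigaction(), the more predictable sibling of signal(). The names demo_handler, demo_install_handler and demo_last_caught are made up for illustration, and a real crash reporter would do much more than record the signal number.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written only from the handler; sig_atomic_t is safe to store there. */
static volatile sig_atomic_t demo_last_signal = 0;

/* Handlers should do as little as possible: only async-signal-safe work. */
static void demo_handler(int signo) {
    demo_last_signal = signo;
}

/* Installs demo_handler for signo. Returns 0 on success and -1 on failure,
 * for example when signo is SIGKILL or SIGSTOP, which cannot be caught. */
int demo_install_handler(int signo) {
    struct sigaction action;
    assert(signo > 0); /* signal numbers start at 1 */
    memset(&action, 0, sizeof(action));
    action.sa_handler = demo_handler;
    sigemptyset(&action.sa_mask);
    return sigaction(signo, &action, NULL);
}

/* Returns the number of the last signal seen by demo_handler, or 0. */
int demo_last_caught(void) {
    return (int)demo_last_signal;
}
```

For example, after demo_install_handler(SIGUSR1) succeeds, raise(SIGUSR1) runs the handler instead of terminating the process, while demo_install_handler(SIGKILL) fails. Calling signal(SIGUSR2, SIG_IGN) makes the process ignore SIGUSR2 entirely.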

Signal type:

  • SIGABRT: abort signal, sent when the program calls abort()
  • SIGALRM: timer (alarm) expiry signal
  • SIGFPE: arithmetic exception signal (the name says floating point, but it also covers integer division by zero)
  • SIGILL: illegal instruction signal
  • SIGHUP: terminal hangup signal
  • SIGINT: keyboard interrupt signal
  • SIGKILL: forced termination signal, cannot be caught
  • SIGTERM: termination request signal
  • SIGSTOP: stop (pause) signal, cannot be caught
  • SIGSEGV: invalid memory access signal
  • SIGBUS: bus error signal, for example a misaligned memory access
  • SIGPIPE: signal sent on writing to a pipe or socket whose other end is closed

5. Simulate Mach message sending and capturing Mach exceptions

5.1 Mach

Mach is the microkernel of XNU.
Several basic concepts of Mach:
Tasks: objects that own a set of system resources and in which threads execute
Threads: the basic unit of execution. A thread runs in the context of its task and shares the task's resources
Ports: protected message queues for communication between tasks. A task can send data to or receive data from a port if it holds the corresponding right
Messages: collections of typed data objects, which can only be sent to a port

5.2 simulate sending of Mach message
  1. Create a port and give it a send right
// Requires #import <mach/mach.h>
+ (mach_port_t)createPortAndListener {
    // In the Mach headers, mach_port_t is completely equivalent to mach_port_name_t
    // typedef unsigned int            __darwin_natural_t;
    // typedef __darwin_natural_t __darwin_mach_port_name_t; /* Used by mach */
    // typedef __darwin_mach_port_name_t __darwin_mach_port_t; /* Used by mach */
    // typedef __darwin_mach_port_t mach_port_t;
    // typedef natural_t mach_port_name_t;
    // typedef __darwin_natural_t      natural_t;
    
    mach_port_t server_port;
    kern_return_t kr = mach_port_allocate(mach_task_self(),
                                          MACH_PORT_RIGHT_RECEIVE,
                                          &server_port);
    assert(kr == KERN_SUCCESS);
    
    NSLog(@"Create a port: %d", server_port);
    
    kr = mach_port_insert_right(mach_task_self(),
                                server_port,
                                server_port,
                                MACH_MSG_TYPE_MAKE_SEND);
    assert(kr == KERN_SUCCESS);
    
    return server_port;
}
  2. Listen on the Mach port
+ (void)setMachPortListener:(mach_port_t)mach_port {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        // Receive buffer: the header plus room for a body and the trailer the kernel appends.
        // Receiving into a bare mach_msg_header_t while claiming 1024 bytes would overflow the stack.
        struct {
            mach_msg_header_t header;
            uint8_t body[1024 - sizeof(mach_msg_header_t)];
        } buffer;
        
        while (true) {
            memset(&buffer, 0, sizeof(buffer));
            
            mach_msg_return_t mr = mach_msg(&buffer.header,
                                            MACH_RCV_MSG | MACH_RCV_LARGE,
                                            0,
                                            sizeof(buffer),
                                            mach_port,
                                            MACH_MSG_TIMEOUT_NONE,
                                            MACH_PORT_NULL);
            if (mr != MACH_MSG_SUCCESS) {
                // With MACH_RCV_LARGE an oversized message stays queued, so stop instead of spinning.
                NSLog(@"mach_msg receive failed: 0x%x", mr);
                break;
            }
            
            mach_msg_id_t msg_id = buffer.header.msgh_id;
            mach_port_t remote_port = buffer.header.msgh_remote_port;
            mach_port_t local_port = buffer.header.msgh_local_port;
            
            NSLog(@"Receive a mach message: [%d], remote_port: %d, local_port: %d",
                  msg_id, remote_port, local_port);
        }
    });
}
  3. Send a message to the created Mach port
+ (void)sendMachPostMessage:(mach_port_t)mach_port {
    // Zero-initialize so that no header field is left with stack garbage.
    mach_msg_header_t msg_header = {0};
    msg_header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
    msg_header.msgh_size = sizeof(mach_msg_header_t);
    msg_header.msgh_remote_port = mach_port;
    msg_header.msgh_local_port = MACH_PORT_NULL;
    msg_header.msgh_id = 100;
    NSLog(@"Send a mach message: [%d]", msg_header.msgh_id);
    
    kern_return_t kr = mach_msg(&msg_header,
                                MACH_SEND_MSG,
                                msg_header.msgh_size,
                                0,
                                MACH_PORT_NULL,
                                MACH_MSG_TIMEOUT_NONE,
                                MACH_PORT_NULL);
    if (kr != MACH_MSG_SUCCESS) {
        NSLog(@"mach_msg send failed: 0x%x", kr);
    }
}

5.3 capturing exceptions in Mach layer

6. Signal registration and processing

7. PAC

During this crash investigation, I found that:

  1. On relatively new devices (which generally also run newer iOS versions), the probability of the crash is higher
  2. In the raw crash logs collected through App Store Connect, the Mach exception subtype and the converted signal differ between older and newer devices

Crash log header collected from an older device: UNIX signal -> SIGBUS

//Older phone: iPhone 8
Incident Identifier: 9DCFF105-1CBE-4947-B386-68E4375EC340
Hardware Model:      iPhone10,1
Process:             esport-app [15056]
Path:                /private/var/containers/Bundle/Application/86BE49B4-3975-45D9-AC97-CD9CABF4F7D0/esport-app.app/esport-app
Identifier:          com.wmzq.esportapp
Version:             2 (2.1.0)
AppStoreTools:       12E262
AppVariant:          1:iPhone10,1:13
Beta:                YES
Code Type:           ARM-64 (Native)
Role:                Foreground
Parent Process:      launchd [1]
Coalition:           com.wmzq.esportapp [2538]


Date/Time:           2021-05-06 15:08:30.2709 +0800
Launch Time:         2021-05-06 15:08:28.6146 +0800
OS Version:          iPhone OS 13.7 (17H35)
Release Type:        User
Baseband Version:    5.70.01
Report Version:      104

Exception Type:  EXC_BAD_ACCESS (SIGBUS)
Exception Subtype: KERN_PROTECTION_FAILURE at 0x000000016c1bfdc0
VM Region Info: 0x16c1bfdc0 is in 0x16c1bc000-0x16c1c0000;  bytes after start: 15808  bytes before end: 575
      REGION TYPE                      START - END             [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      Stack                  000000016c0ec000-000000016c1bc000 [  832K] rw-/rwx SM=COW  thread 21
--->  STACK GUARD            000000016c1bc000-000000016c1c0000 [   16K] ---/rwx SM=NUL  ...for thread 22
      Stack                  000000016c1c0000-000000016c248000 [  544K] rw-/rwx SM=COW  thread 22

Termination Signal: Bus error: 10
Termination Reason: Namespace SIGNAL, Code 0xa
Terminating Process: exc handler [15056]
Triggered by Thread:  22

Crash log header collected from a newer device: UNIX signal -> SIGSEGV

// iPhone XR
Incident Identifier: 95414C75-D357-4AFC-9951-2EAE098F31B3
Hardware Model:      iPhone11,8
Process:             esport-app [13681]
Path:                /private/var/containers/Bundle/Application/07BF60A9-D0A0-4B29-A9F2-C5E6C99D84EC/esport-app.app/esport-app
Identifier:          com.wmzq.esportapp
Version:             2105031 (2.1.0)
AppStoreTools:       12E262
AppVariant:          1:iPhone11,8:14
Beta:                YES
Code Type:           ARM-64 (Native)
Role:                Foreground
Parent Process:      launchd [1]
Coalition:           com.wmzq.esportapp [742]


Date/Time:           2021-05-08 09:41:52.5606 +0800
Launch Time:         2021-05-08 08:38:34.2531 +0800
OS Version:          iPhone OS 14.5 (18E199)
Release Type:        User
Baseband Version:    3.03.05
Report Version:      104

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0022000000000000 -> 0x0000000000000000 (possible pointer authentication failure)
VM Region Info: 0 is not in any region.  Bytes before following region: 4373348352
      REGION TYPE                 START - END      [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  
      __TEXT                   104ac0000-104b24000 [  400K] r-x/r-x SM=COW  ...pp/esport-app

Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [13681]
Triggered by Thread:  11
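
When comparing many such logs, the "Exception Type" line is the quickest thing to pull out. The following is a small sketch of how that line could be parsed in C. The type demo_exception_header and both function names are hypothetical helpers, not part of any crash reporting tool, and the sketch assumes the line has exactly the "Exception Type:  NAME (SIGNAL)" shape shown in the two logs above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the two fields of an "Exception Type" line. */
typedef struct {
    char exception[32];   /* e.g. "EXC_BAD_ACCESS" */
    char signal_name[16]; /* e.g. "SIGSEGV" */
} demo_exception_header;

/* Parses a line such as "Exception Type:  EXC_BAD_ACCESS (SIGSEGV)".
 * Returns 0 on success and -1 if the line does not have that shape. */
int demo_parse_exception_type(const char *line, demo_exception_header *out) {
    assert(line != NULL && out != NULL);
    /* Whitespace in the format matches any run of spaces in the input. */
    int fields = sscanf(line, "Exception Type: %31s (%15[^)])",
                        out->exception, out->signal_name);
    return fields == 2 ? 0 : -1;
}

/* Returns 1 if two headers carry different UNIX signals, else 0. */
int demo_signals_differ(const demo_exception_header *a,
                        const demo_exception_header *b) {
    return strcmp(a->signal_name, b->signal_name) != 0;
}
```

Fed the two headers above, this reports the same Mach exception (EXC_BAD_ACCESS) for both devices but different signals, which is exactly the difference noted at the start of this section.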

From the material I consulted, I learned that Apple has supported the arm64e instruction set since the A12 chip, which provides pointer signing, that is, PAC (short for Pointer Authentication Code).

7.1 What is PAC

PAC is a new feature in ARMv8.3. Although pointers are 64 bits wide, arm64 does not use the full 64-bit address space, so the unused upper bits are used to store a signature of the pointer.
With PAC, the CPU checks the signature in the upper bits against the actual address in the lower bits before the pointer is used. If verification fails, the access faults, which is what the "possible pointer authentication failure" note in the crash log above refers to.
To implement PAC, ARMv8.3 adds new instructions, for example:

  • PACIASP computes the signature for the return address and stores it in the pointer's upper bits
  • AUTIASP verifies the signature and restores the original pointer
7.2 example of PAC application
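
To make the bit layout concrete, here is a hedged sketch that splits a 64-bit pointer value into its address bits and its upper bits with plain masking. It is only a software illustration: real signing and authentication are done by the CPU with secret keys, the function names are hypothetical, and the 39-bit address width used in the example is an assumption for illustration, since the real width depends on how the OS configures the virtual address space.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed number of significant virtual-address bits (illustration only). */
#define DEMO_VA_BITS 39u

/* Keeps only the low va_bits bits, i.e. the part that is a real address. */
uint64_t demo_address_bits(uint64_t pointer, unsigned va_bits) {
    assert(va_bits > 0 && va_bits < 64);
    return pointer & ((UINT64_C(1) << va_bits) - 1);
}

/* Returns the bits above the address, where a signature would be stored. */
uint64_t demo_upper_bits(uint64_t pointer, unsigned va_bits) {
    assert(va_bits > 0 && va_bits < 64);
    return pointer >> va_bits;
}
```

Applied to the value from the newer device's log, demo_address_bits(0x0022000000000000, DEMO_VA_BITS) is 0 while demo_upper_bits of the same value is non-zero, which matches the "0x0022000000000000 -> 0x0000000000000000" line: the address part is null and only upper bits are set. The address from the older device's log, 0x000000016c1bfdc0, has no upper bits set at all.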

In this article I have mainly recorded the process of tracking down the crash and summarized how crashes are generated in iOS, how a crash travels from kernel mode up to user mode, and some concepts such as PAC.
In the following articles, I will summarize Xcode's memory diagnosis tools, the principles and use of zombie objects, Address Sanitizer and Malloc Scribble, and try to implement an APM tool that collects and locates memory problems, using code and wowcrash examples.

I had not studied any technology in depth for a long time. I once thought that investing in iOS technology was not cost-effective: much of today's development is just building pages, which makes developers too easy to replace, so I kept hesitating over whether to switch to back-end or web development. This crash investigation, however, made me feel that becoming a senior developer, or even an expert in this area, is still attractive to me. I enjoyed the whole path from crash log analysis -> use of reverse engineering tools -> understanding of the underlying principles -> solving the problem, and it helped me find my direction again. Time to push on and get out of the comfort zone.

