iOS binary rearrangement: launch speed improved by more than 15%

Time:2020-8-21

Background

Launch is the first impression an app makes on its users, so it matters a great deal for user experience. TikTok's business iterates quickly, and without constant attention the launch speed gets a little worse with every release. The TikTok iOS client team has done a lot of optimization work on this. Besides the traditional approach of modifying business code, we also explored some new ground and found that rearranging the layout of the binary can improve launch performance. After the scheme went live, TikTok's launch speed improved by about 15%.

Starting from the underlying principle, this article explains how to find the functions called at launch through static scanning and runtime tracing, and then how to change the build settings to rearrange the binary.

Principle

Page Fault

Letting a process access physical memory directly would be unsafe, so the operating system builds a layer of virtual memory on top of physical memory. To improve efficiency and simplify management, both virtual and physical memory are divided into pages. When a process accesses a virtual page that has no physical page behind it, a page fault is triggered: the kernel allocates physical memory and, if necessary, reads the data from the mmap-ed file on disk.

For apps distributed through the App Store, a page fault also performs signature verification, so a page fault takes longer than one might expect:

[Figure: Page Fault]

Rearrangement

When generating the binary, the linker by default writes object files (.o) in the order they are linked, and writes functions in the order they appear inside each object file.

A static library (.a) is an ar archive of a group of .o files. You can use ar -t to list all the .o files contained in a .a.

[Figure: Default layout]

To simplify the problem, suppose there are only two pages, page1 and page2, and the methods marked green, method1 and method3, must be called at launch. To execute their code, the system has to take two page faults.

But if method1 and method3 are placed next to each other, only one page fault is needed. This is the core principle of binary rearrangement.

[Figure: After rearrangement]

In our experience, each page fault eliminated speeds up launch by 0.6 ~ 0.8ms.
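The two-page example can be made concrete with a little arithmetic. The sketch below is illustrative only: the 16 KB page size, the offsets and sizes, and the names func_range and count_pages_touched are assumptions made for this example, not values or tools from the TikTok project. It counts how many distinct pages a set of launch-time functions occupies, which is the number of page faults needed to execute them when none of those pages is resident yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one function: it occupies
 * [offset, offset + size) inside the code section. */
typedef struct {
    uint64_t offset;
    uint64_t size;
} func_range;

/* Count the distinct pages covered by the given functions.
 * Assumes the ranges are sorted by offset and every size is non-zero. */
size_t count_pages_touched(const func_range *funcs, size_t n, uint64_t page_size) {
    assert(page_size > 0);
    size_t pages = 0;
    uint64_t next_uncounted = 0; /* index of the first page not counted yet */
    for (size_t i = 0; i < n; i++) {
        uint64_t first = funcs[i].offset / page_size;
        uint64_t last = (funcs[i].offset + funcs[i].size - 1) / page_size;
        if (first < next_uncounted)
            first = next_uncounted; /* part of this range was already counted */
        if (first <= last) {
            pages += (size_t)(last - first + 1);
            next_uncounted = last + 1;
        }
    }
    return pages;
}
```

With method1 and method3 on two different 16 KB pages the function returns 2; once they are placed back to back on the same page it returns 1.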

Core issues

To complete the rearrangement, several problems have to be solved:

  • How effective is the rearrangement: get the number of page faults during launch

  • Did the rearrangement succeed: get the function layout of the current binary

  • How to rearrange: make the linker generate the Mach-O in a specified order

  • What to rearrange: get the functions used at launch

System Trace

Time Profiler is without doubt the most commonly used performance analysis tool in daily development. However, Time Profiler is sampling-based and only accounts for the time a thread is actually running, whereas a thread is blocked while a page fault is handled. So we need a less commonly used but powerful tool: System Trace.

Select the main thread. In VM Activity, the count of File Backed Page In events is the number of page faults. Double-clicking shows, in chronological order, the stacks that caused the page faults.

[Figure: System Trace]

signpost

Now that Instruments can give us the number of page-ins within a time range, how do we map that range to launch?

Our answer is os_signpost.

os_signpost is a set of APIs introduced in iOS 12 that lets you mark a time interval in Instruments. The code is simple:

#include <os/signpost.h>

os_log_t logger = os_log_create("com.bytedance.tiktok", "performance");
// `sign` is a pointer that identifies this interval
os_signpost_id_t signPostId = os_signpost_id_make_with_pointer(logger, sign);
// mark the beginning of the interval
os_signpost_interval_begin(logger, signPostId, "Launch", "%{public}s", "");
// mark the end of the interval
os_signpost_interval_end(logger, signPostId, "Launch");

Generally, launch can be divided into four stages:

[Figure: Launch stages]

There are as many load and C++ static initialization stages as there are Mach-O images. Use the signpost API to mark each stage so that the optimization effect on each one can be tracked.

Linkmap

The linkmap is an intermediate product of an iOS build that records the layout of the binary. To generate it, enable Write Link Map File in Xcode's Build Settings.

[Figure: Build Settings]

For example, the following is the linkmap of a single-page demo project.

[Figure: linkmap]

A linkmap consists of three parts:

  • Object files: the paths and file numbers of the link units used to generate the binary

  • Sections: the address range of each segment / section of the Mach-O

  • Symbols: the address range of each symbol, in order
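Reading the Symbols part is the basis for the later steps, so here is a minimal parsing sketch. It assumes each symbol line has the shape of the samples shown in this article (address, size, bracketed file number, name); the struct linkmap_symbol and the function parse_symbol_line are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of the Symbols part. */
typedef struct {
    unsigned long long address;
    unsigned long long size;
    int file_index;
    char name[512];
} linkmap_symbol;

/* Parse a line such as
 *   0x100004A30    0x0000001C  [  5] _demo_constructor
 * Returns 1 on success, 0 if the line does not have that shape
 * (for example a "#" comment line). */
int parse_symbol_line(const char *line, linkmap_symbol *out) {
    assert(line != NULL && out != NULL);
    int consumed = 0;
    /* %llx accepts the 0x prefix; %n records where the name starts. */
    int fields = sscanf(line, " %llx %llx [ %d ] %n",
                        &out->address, &out->size, &out->file_index, &consumed);
    if (fields != 3 || consumed <= 0)
        return 0;
    /* The name is the rest of the line; Objective-C names contain spaces. */
    size_t len = strcspn(line + consumed, "\r\n");
    if (len == 0 || len >= sizeof out->name)
        return 0;
    memcpy(out->name, line + consumed, len);
    out->name[len] = '\0';
    return 1;
}
```

Taking the rest of the line as the name, rather than a single token, keeps names such as +[Foo load] intact.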

ld

Xcode uses a linker called ld, which has a rarely used parameter, -order_file. Its detailed documentation can be read with man ld:

Alters the order in which functions and data are laid out. For each section in the output file, any symbol in that section that are specified in the order file file is moved to the start of its section and laid out in the same order as in the order file file.

As you can see, the symbols in the order_file are laid out in order at the start of their section, which is exactly what we need.

Xcode's GUI also provides an Order File option:

[Figure: order_file]

What if a symbol in the order_file does not actually exist?

ld ignores such symbols. If the linker option -order_file_statistics is provided, the missing symbols are printed in the log as warnings.

Get the symbols

The last and most important problem is getting the function symbols used at launch.

First, we ruled out parsing the trace files of Instruments (Time Profiler / System Trace): they sample one specific scenario, so most symbols cannot be obtained from them. In the end we chose a scheme that combines static scanning with runtime tracing.

Load

The symbol name of an Objective-C method has the form +-[Class_name(category_name) method:name:], where + denotes a class method and - denotes an instance method.

As mentioned above, the linkmap records all symbol names, so scanning the __TEXT,__text entries of the linkmap with the regular expression "^\+\[.*\ load\]$" yields all the load method symbols.
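The same match can be written without a regex engine. The sketch below is one possible equivalent of the pattern above, assuming the symbol name has already been extracted from a linkmap line; the function name is_load_method_symbol is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if name starts with "+[" and ends with " load]",
 * e.g. "+[Foo load]" or "+[Foo(Bar) load]", otherwise 0. */
int is_load_method_symbol(const char *name) {
    assert(name != NULL);
    static const char suffix[] = " load]";
    size_t suffix_len = sizeof suffix - 1;
    size_t len = strlen(name);
    if (len < 2 + suffix_len)
        return 0;
    if (strncmp(name, "+[", 2) != 0)
        return 0; /* not a class method */
    return strcmp(name + len - suffix_len, suffix) == 0;
}
```

Requiring the space before load keeps selectors such as reload from matching, and the leading + excludes instance methods that happen to be named load.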

C++ static initialization

Unlike Objective-C, where most method calls are compiled into objc_msgSend, C++ has no single entry function that can be hooked at runtime.

-finstrument-functions can insert hooks at compile time, but because TikTok depends on many static libraries provided by other teams, this scheme would require changing the build process of those dependencies. With no industry experience of binary rearrangement to draw on and the benefit uncertain, we chose static scanning, which is imperfect but has the lowest cost.

1. Scan the __mod_init_func entries in the linkmap to find the file numbers of the object files that contain C++ static initializers.

// __mod_init_func
0x100008060    0x00000008  [  5] ltmp7
// [5]
[  5] …/Build/Products/Debug-iphoneos/libStaticLibrary.a(StaticLibrary.o)

2. Extract the .o using the file number.

➜  lipo libStaticLibrary.a -thin arm64 -output arm64.a
➜  ar -x arm64.a StaticLibrary.o

3. Get the symbol name of the static initializer, _demo_constructor, from the .o

➜  objdump -r --section=__mod_init_func StaticLibrary.o

StaticLibrary.o:    file format Mach-O arm64

RELOCATION RECORDS FOR [__mod_init_func]:
0000000000000000 ARM64_RELOC_UNSIGNED _demo_constructor

4. Using the symbol name and file number, find the address range of the symbol in the linkmap

0x100004A30    0x0000001C  [  5] _demo_constructor

5. Disassemble the code starting from that address

➜  objdump -d --start-address=0x100004A30 --stop-address=0x100004A4B demo_arm64

_demo_constructor:
100004a30:    fd 7b bf a9     stp x29, x30, [sp, #-16]!
100004a34:    fd 03 00 91     mov x29, sp
100004a38:    20 0c 80 52     mov w0, #97
100004a3c:    da 06 00 94     bl  #7016
100004a40:    40 0c 80 52     mov w0, #98
100004a44:    fd 7b c1 a8     ldp x29, x30, [sp], #16
100004a48:    d7 06 00 14     b   #7004

6. Scan for bl instructions to get the start address of each subroutine in the binary: here 0x100004a3c + 0x1b68 (0x1b68 is 7016 in decimal).

100004a3c:    da 06 00 94     bl  #7016

7. Using the start address, find the symbol name and end address in the linkmap, then repeat steps 5-7 to recursively find all the function symbols called by the subroutines.

A small pitfall

The STL generates initialization functions for std::string, so symbols with the same name appear in multiple .o files, for example:

__ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEC1IDnEEPKc

C++ has many duplicated symbols like this, so C / C++ symbols in the order_file should carry their .o information:

# order_file.txt
libDemoLibrary.a(object.o):__GLOBAL__sub_I_demo_file.cpp

Limitations

Besides bl / b, the branch instructions also include br / blr, which call subroutines indirectly through a register. Static scanning cannot cover these cases.

Local symbol

During the C++ static initialization scan, we found many symbols that looked like l002. After some investigation, it turned out that the providers of our dependencies had stripped the local symbols when building their static libraries, so __GLOBAL__sub_I_demo_file.cpp became l002.

When a static library is shipped, its local symbols need to be kept: the CI script should not run strip -x, and the Strip Style of the corresponding Xcode target should be changed to Debugging Symbols:

[Figure: Strip Style]

The local symbols kept in the static library are stripped before the host app generates the IPA, so they do not affect the final IPA size. Select All Symbols as the Strip Style of the host app, and Non-Global Symbols for the host's dynamic libraries.

Objective-C method

Most Objective-C method calls are compiled into objc_msgSend, so hooking this C function with fishhook (https://github.com/facebook/fishhook) yields the Objective-C symbols. Because objc_msgSend takes a variable number of arguments, the hook code has to be implemented in assembly:

// code adapted from InspectiveC
__attribute__((naked))
static void hook_Objc_msgSend() {
    save()
    __asm volatile ("mov x2, lr\n");
    __asm volatile ("mov x3, x4\n");
    call(blr, &before_objc_msgSend)
    load()
    call(blr, orig_objc_msgSend)
    save()
    call(blr, &after_objc_msgSend)
    __asm volatile ("mov lr, x0\n");
    load()
    ret()
}

When a subroutine is called, the argument registers must be saved and restored, so save and load push and pop x0 ~ x9 and q0 ~ q9 respectively. call invokes a function indirectly through a register:

#define save() \
__asm volatile ( \
"stp q6, q7, [sp, #-32]!\n" \
…

#define load() \
__asm volatile ( \
"ldp x0, x1, [sp], #16\n" \
…

#define call(b, value) \
__asm volatile ("stp x8, x9, [sp, #-16]!\n"); \
__asm volatile ("mov x12, %0\n" :: "r"(value)); \
__asm volatile ("ldp x8, x9, [sp], #16\n"); \
__asm volatile (#b " x12\n");

before_objc_msgSend saves lr on a stack, and after_objc_msgSend restores it. Since a trace file has to be generated, only function addresses are written in order to keep the file small, and only addresses inside the code segments of the current executable's Mach-O images (the app and its dynamic libraries) are recorded.

On iOS, because of ASLR (https://en.wikipedia.org/wiki/Address_space_layout_randomization), the slide offset must be subtracted from the address before it is written:

IMP imp = (IMP)class_getMethodImplementation(object_getClass(self), _cmd);
unsigned long imppos = (unsigned long)imp;
// subtract the ASLR slide to get the link-time address
unsigned long addr = imppos - macho_slide;

Get the address range of a binary's __text section:

#include <mach-o/getsect.h>

unsigned long size = 0;
unsigned long start = (unsigned long)getsectiondata(mhp, "__TEXT", "__text", &size);
unsigned long end = start + size;

Once the function address is known, the method's symbol name can be found by looking it up in the linkmap.

Block

A block is a special unit. The compiled body of a block is a C function, and it is called directly through a pointer rather than through objc_msgSend, so it needs a separate hook.

From the block source code, the memory layout of a block is as follows:

struct Block_layout {
    void *isa;
    int32_t flags; // contains ref count
    int32_t reserved;
    void *invoke;
    struct Block_descriptor1 *descriptor;
};
struct Block_descriptor1 {
    uintptr_t reserved;
    uintptr_t size;
};

The idea of the hook is to replace invoke with a custom implementation and save the original implementation in the descriptor's reserved field.

// adapted from https://github.com/youngsoft/YSBlockHook
if (layout->descriptor != NULL && layout->descriptor->reserved == 0)
{
    if (layout->invoke != (void *)hook_block_envoke)
    {
        // keep the original invoke in the otherwise unused reserved field
        layout->descriptor->reserved = (uintptr_t)layout->invoke;
        layout->invoke = (void *)hook_block_envoke;
    }
}

Because blocks have different function signatures, hook_block_envoke is again implemented in assembly:

__attribute__((naked))
static void hook_block_envoke() {
    save()
    __asm volatile ("mov x1, lr\n");
    call(blr, &before_block_hook);
    __asm volatile ("mov lr, x0\n");
    load()
    // call the original invoke: load block->descriptor (offset 24),
    // then descriptor->reserved, where the original address was saved
    __asm volatile ("ldr x12, [x0, #24]\n");
    __asm volatile ("ldr x12, [x12]\n");
    __asm volatile ("br x12\n");
}

The function address is obtained in before_block_hook (the slide must be subtracted here as well).

intptr_t before_block_hook(id block, intptr_t lr)
{
    Block_layout *layout = (Block_layout *)block;
    // layout->descriptor->reserved is the function address of the block
    return lr;
}

Similarly, the block's symbol can be found by looking up the function address in the linkmap.
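Both the objc_msgSend trace and the block trace end with the same step: mapping a recorded address back to a linkmap symbol. A minimal sketch of that lookup follows. It assumes the Symbols entries have been loaded into an array sorted by ascending address; symbol_entry and find_symbol are hypothetical names, and _example_next in the usage below is an invented neighbour symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one linkmap symbol. */
typedef struct {
    uint64_t address;
    uint64_t size;
    const char *name;
} symbol_entry;

/* Binary search for the symbol whose range [address, address + size)
 * contains addr. symbols must be sorted by ascending address.
 * Returns NULL if no symbol contains addr. */
const symbol_entry *find_symbol(const symbol_entry *symbols, size_t count,
                                uint64_t addr) {
    assert(symbols != NULL || count == 0);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < symbols[mid].address)
            hi = mid;
        else if (addr >= symbols[mid].address + symbols[mid].size)
            lo = mid + 1;
        else
            return &symbols[mid];
    }
    return NULL;
}
```

For example, with an entry for _demo_constructor at 0x100004A30 of size 0x1C, any address from 0x100004A30 to 0x100004A4B resolves to it, while 0x100004A4C already belongs to the next symbol.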

Bottlenecks

The scheme based on static scanning and runtime tracing still has a few bottlenecks:

  • initialize calls are not captured by the hook

  • Some blocks are not captured by the hook

  • C++ calls made indirectly through a register cannot be found by static scanning

The current rearrangement scheme covers 80% ~ 90% of the symbols. In the future, we will try approaches such as compile-time instrumentation to reach 100% symbol coverage and the optimal result.

Overall process

[Figure: Process flow]

  1. Set the conditions that trigger the process

  2. Inject the trace dynamic library into the project and build in Release mode to produce the .app, the linkmap and the intermediate products

  3. Run the app once until launch finishes; the trace dynamic library generates a trace log in the sandbox

  4. With the trace log, the intermediate products and the linkmap as input, run the script to generate the order_file
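The last step of the script boils down to emitting each traced symbol once, in the order it was first seen. The sketch below is an assumption about what such a script might do, not the team's actual tool; unique_in_first_seen_order and write_order_file are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy each distinct symbol from symbols[0..count) into out, keeping the
 * order of first appearance. out must have room for count pointers.
 * Returns the number of distinct symbols. Quadratic, which is fine for a
 * sketch; a real script would use a hash set. */
size_t unique_in_first_seen_order(const char *const *symbols, size_t count,
                                  const char **out) {
    assert(out != NULL || count == 0);
    size_t unique = 0;
    for (size_t i = 0; i < count; i++) {
        int seen = 0;
        for (size_t j = 0; j < unique && !seen; j++)
            seen = strcmp(symbols[i], out[j]) == 0;
        if (!seen)
            out[unique++] = symbols[i];
    }
    return unique;
}

/* Write one symbol per line, the layout the order_file examples above use. */
void write_order_file(FILE *file, const char *const *symbols, size_t count) {
    for (size_t i = 0; i < count; i++)
        fprintf(file, "%s\n", symbols[i]);
}
```

Keeping first-seen order means the functions end up laid out in roughly the order they run at launch.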

Summary

With no industry experience to draw on, we have verified the feasibility and stability of binary rearrangement in iOS app development. Based on it, we improved TikTok's launch speed on the iOS client by about 15%.

More abstractly, app development runs into a common problem: in some situations, running the app requires a large number of page faults, which slows down code execution. For now, binary rearrangement looks like a good solution to this general problem.

In the future, we will try binary rearrangement in more business scenarios.
