Background
Startup is the user's first impression of an app and matters a great deal to user experience. TikTok's business logic iterates quickly, and startup speed degrades steadily if left unchecked. The TikTok iOS client team has done a lot of optimization work. Besides the traditional approach of modifying business code, we also explored some new directions and found that changing the layout of the binary can improve startup performance. After this was rolled out in TikTok, startup speed improved by about 15%.
Starting from the principles, this article explains how to find the functions called at startup through static scanning and runtime tracing, and then how to modify the build parameters to rearrange the binary.
Principle
Page Fault
Letting a process access physical memory directly would be very unsafe, so the operating system builds a layer of virtual memory on top of physical memory. To improve efficiency and simplify management, both virtual and physical memory are divided into pages. When a process accesses a virtual memory page that has no corresponding physical page, a page fault is triggered to allocate physical memory and, if necessary, read the data from disk via mmap.
For apps distributed through the App Store, a page fault also performs signature verification, so a page fault takes longer than one might expect:
[Figure: Page Fault]
Rearrangement
When generating the binary, the linker by default writes object files (.o) in the order they are linked, and writes functions in the order they appear inside each object file.
A static library (.a) is an ar archive of a group of .o files. You can use ar -t to list all the .o files contained in a .a.
[Figure: Default layout]
To simplify the problem, suppose there are only two pages, page 1 and page 2, and that method1 and method3 (shown in green) are called at startup. To execute that code, the system must take two page faults.
But if method1 and method3 are placed next to each other, only one page fault is needed. This is the core principle of binary rearrangement.
[Figure: After rearrangement]
In our experience, each page fault eliminated saves 0.6 to 0.8 ms of startup time.
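The two-page example can be checked with a small calculation. The sketch below is only an illustration: the function sizes are made up so that two functions fill exactly one page, and the 16 KB page size is an assumption about arm64 iOS devices. It counts how many distinct pages contain code that runs at startup, before and after moving the startup functions to the front.

```python
PAGE_SIZE = 16 * 1024  # assumed page size (16 KB)


def pages_touched(layout, startup, page_size=PAGE_SIZE):
    """Return the set of page indexes holding at least one byte of a startup function.

    layout:  list of (name, size_in_bytes) in the order the linker emits them
    startup: set of function names that run during launch
    """
    pages = set()
    offset = 0
    for name, size in layout:
        if name in startup and size > 0:
            first = offset // page_size
            last = (offset + size - 1) // page_size
            pages.update(range(first, last + 1))
        offset += size
    return pages


# Hypothetical sizes: each function occupies half a page.
half = PAGE_SIZE // 2
default_layout = [("method1", half), ("method2", half),
                  ("method3", half), ("method4", half)]
startup = {"method1", "method3"}

# Reordered layout: startup functions first, the rest keep their relative order.
reordered = ([f for f in default_layout if f[0] in startup] +
             [f for f in default_layout if f[0] not in startup])

print(len(pages_touched(default_layout, startup)))  # 2
print(len(pages_touched(reordered, startup)))       # 1
```

Each page in the returned set corresponds to at most one file-backed page-in during launch, so the difference between the two counts is the number of page faults the rearrangement can save in this toy layout.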
Core issues
To complete the rearrangement, several problems have to be solved:
- How effective is the rearrangement? Get the number of page faults during the startup phase.
- Did the rearrangement succeed? Get the current function layout of the binary.
- How to rearrange: make the linker generate the Mach-O in a specified order.
- What to rearrange: get the functions used at startup.
System Trace
Time Profiler is the most commonly used performance analysis tool in daily development. However, Time Profiler is sampling-based and only counts the time a thread is actually running, and a thread is blocked while a page fault is being handled. We therefore need a less commonly used but powerful tool: System Trace.
Select the main thread. The number of File Backed Page In events under VM Activity is the number of page faults. Double-clicking also shows the stacks that caused the page faults in chronological order.
[Figure: System Trace]
signpost
Now that Instruments can give us the number of page-ins within a time range, how do we map that range to startup?
Our answer is os_signpost.
os_signpost is a set of APIs introduced in iOS 12 that lets you mark a time interval in Instruments. The code is simple:
#include <os/signpost.h>

os_log_t logger = os_log_create("com.bytedance.tiktok", "performance");
// sign is any pointer that identifies this interval
os_signpost_id_t signPostId = os_signpost_id_make_with_pointer(logger, sign);
// Mark the beginning of the interval
os_signpost_interval_begin(logger, signPostId, "Launch", "%{public}s", "");
// Mark the end of the interval
os_signpost_interval_end(logger, signPostId, "Launch");
Generally, startup can be divided into four stages:
[Figure: Startup phases]
There are as many load and C++ static initialization stages as there are Mach-O images. Use the signpost API to mark each phase so that the optimization effect on each stage can be tracked.
Linkmap
A linkmap is an intermediate product of the iOS build that records the layout of the binary. To generate it, enable Write Link Map File in Xcode's Build Settings.
[Figure: Build Settings]
For example, the following is the linkmap of a single-page demo project.
[Figure: linkmap]
A linkmap consists of three parts:
- Object files: the paths and file numbers of the link units used to generate the binary
- Sections: the address range of each segment / section of the Mach-O
- Symbols: the address range of each symbol, in order
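The three parts can be read with a few lines of scripting. The following is a minimal sketch, assuming the linkmap uses "# Object files:", "# Sections:" and "# Symbols:" as part headers and that symbol lines look like the ones quoted later in this article (address, size, file number in brackets, name). The sample text and the helper names are made up for illustration.

```python
import re
from typing import NamedTuple


class Symbol(NamedTuple):
    address: int
    size: int
    file_no: int
    name: str


OBJECT_RE = re.compile(r"^\[\s*(\d+)\]\s+(.+)$")
SYMBOL_RE = re.compile(
    r"^(0x[0-9A-Fa-f]+)\s+(0x[0-9A-Fa-f]+)\s+\[\s*(\d+)\]\s+(.+)$")


def parse_linkmap(text):
    """Return ({file_no: path}, [Symbol, ...]) parsed from linkmap text."""
    objects, symbols = {}, []
    part = None
    for line in text.splitlines():
        if line.startswith("# Object files:"):
            part = "objects"
        elif line.startswith("# Sections:"):
            part = "sections"
        elif line.startswith("# Symbols:"):
            part = "symbols"
        elif line.startswith("# Dead Stripped Symbols:"):
            part = None  # not part of the final layout
        elif part == "objects":
            m = OBJECT_RE.match(line)
            if m:
                objects[int(m.group(1))] = m.group(2)
        elif part == "symbols":
            m = SYMBOL_RE.match(line)
            if m:
                symbols.append(Symbol(int(m.group(1), 16), int(m.group(2), 16),
                                      int(m.group(3)), m.group(4)))
    return objects, symbols


# Made-up sample in the assumed format.
SAMPLE = """\
# Object files:
[  0] linker synthesized
[  5] libStaticLibrary.a(StaticLibrary.o)
# Sections:
# Address\tSize\tSegment\tSection
0x100004000\t0x00002000\t__TEXT\t__text
# Symbols:
# Address\tSize\tFile  Name
0x100004A30\t0x0000001C\t[  5] _demo_constructor
"""

objects, symbols = parse_linkmap(SAMPLE)
print(objects[5], symbols[0].name)
```

Keeping the file number on each symbol matters later: it is what ties a symbol back to the .o it came from.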
ld
Xcode uses a linker called ld, which has a rarely used option, -order_file. man ld shows the detailed documentation:
Alters the order in which functions and data are laid out. For each section in the output file, any symbol in that section that are specified in the order file file is moved to the start of its section and laid out in the same order as in the order file file.
As you can see, the symbols in the order file are laid out in order at the start of their section, which is exactly what we need.
Xcode's GUI also provides an Order File option:
[Figure: order_file]
What if a symbol in the order file does not actually exist?
ld ignores such symbols. If the link option -order_file_statistics is provided, the missing symbols are printed in the log as warnings.
Getting the symbols
The last and most important problem is obtaining the function symbols used at startup.
First, we ruled out parsing Instruments (Time Profiler / System Trace) trace files: they are based on sampling a specific scenario, so most symbols cannot be obtained from them. We finally chose to combine static scanning with runtime tracing.
Load
The symbol name of an Objective-C method has the form +-[Class_name(category_name) method:name:], where + denotes a class method and - denotes an instance method.
As mentioned above, the linkmap records all symbol names, so scanning the __TEXT,__text section of the linkmap with the regular expression "^\+\[.*\ load\]$" yields all load method symbols.
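As a minimal sketch, the regular expression above can be applied to a list of symbol names taken from the linkmap. The class names in the sample are invented, and the function name is only illustrative.

```python
import re

# The pattern quoted in the text: a class method whose selector is exactly "load".
LOAD_RE = re.compile(r"^\+\[.*\ load\]$")


def find_load_methods(symbol_names):
    """Return the symbols that are +load methods, keeping their linkmap order."""
    return [name for name in symbol_names if LOAD_RE.match(name)]


names = [
    "+[Foo load]",
    "+[Foo(Bar) load]",   # +load defined in a category
    "-[Foo load]",        # instance method, not a +load
    "+[Foo loadData]",    # different selector
    "+[Foo reload]",      # no space before "load"
]
print(find_load_methods(names))
```

Requiring the space before load and the closing bracket right after it is what keeps selectors such as loadData or reload out of the result.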
C++ static initialization
C++ is unlike Objective-C, where most method calls are compiled into objc_msgSend: there is no single entry function to hook at runtime.
Hooks can be inserted at compile time with -finstrument-functions, but because TikTok relies on many static libraries provided by other teams, that approach would require changing the build process of those dependencies. With no industry experience to draw on for binary rearrangement and uncertain benefits, we chose static scanning, which is imperfect but has the lowest cost.
1. Find the __mod_init_func entries in the linkmap and note the file number of each.
// __mod_init_func
0x100008060 0x00000008 [ 5] ltmp7
// [ 5]
[ 5] …/Build/Products/Debug-iphoneos/libStaticLibrary.a(StaticLibrary.o)
2. Extract the .o using the file number.
➜ lipo libStaticLibrary.a -thin arm64 -output arm64.a
➜ ar -x arm64.a StaticLibrary.o
3. Get the symbol name of the static initializer, _demo_constructor, from the .o.
➜ objdump -r -section=__mod_init_func StaticLibrary.o

StaticLibrary.o: file format Mach-O arm64

RELOCATION RECORDS FOR [__mod_init_func]:
0000000000000000 ARM64_RELOC_UNSIGNED _demo_constructor
4. Using the symbol name and file number, find the address range of the symbol in the linkmap.
0x100004A30 0x0000001C [ 5] _demo_constructor
5. Disassemble the code starting from the start address.
➜ objdump -d --start-address=0x100004A30 --stop-address=0x100004A4B demo_arm64

_demo_constructor:
100004a30: fd 7b bf a9 stp x29, x30, [sp, #-16]!
100004a34: fd 03 00 91 mov x29, sp
100004a38: 20 0c 80 52 mov w0, #97
100004a3c: da 06 00 94 bl #7016
100004a40: 40 0c 80 52 mov w0, #98
100004a44: fd 7b c1 a8 ldp x29, x30, [sp], #16
100004a48: d7 06 00 14 b #7004
6. Scan for bl instructions to get the start address of each called subroutine in the binary. Here it is 100004a3c + 1b68 (0x1b68 is 7016 in decimal).
100004a3c: da 06 00 94 bl #7016
7. From the start address, look up the symbol name and end address, then repeat steps 5 to 7 to recursively find all function symbols called by the subroutines.
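The address arithmetic in step 6 can also be done directly on the instruction bytes instead of parsing the disassembler's text. The sketch below assumes the standard AArch64 encoding of B and BL (a 6-bit opcode followed by a signed 26-bit word offset); the function name is made up, and the bytes are the ones from the listing above.

```python
def decode_branch(pc, raw):
    """Decode an AArch64 B or BL instruction.

    pc:  address of the instruction
    raw: the 4 instruction bytes in memory order (little-endian)
    Returns (mnemonic, target_address), or None for any other instruction.
    """
    word = int.from_bytes(raw, "little")
    opcode = word >> 26                 # top 6 bits select the instruction
    if opcode == 0b100101:
        mnemonic = "bl"
    elif opcode == 0b000101:
        mnemonic = "b"
    else:
        return None
    imm26 = word & 0x03FFFFFF           # signed offset, counted in 4-byte words
    if imm26 & (1 << 25):               # sign-extend negative offsets
        imm26 -= 1 << 26
    return mnemonic, pc + imm26 * 4


bl = decode_branch(0x100004A3C, bytes.fromhex("da 06 00 94"))
b = decode_branch(0x100004A48, bytes.fromhex("d7 06 00 14"))
print(bl[0], hex(bl[1]))  # bl 0x1000065a4
print(b[0], hex(b[1]))    # b 0x1000065a4
```

Both branches in _demo_constructor resolve to the same target, 0x1000065a4, which is the address to look up in the linkmap for the next round of scanning. Register-indirect branches (br / blr) carry no immediate and return None here.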
A small pitfall
The STL generates initialization functions for string, which leads to symbols with the same name in multiple .o files, for example:
__ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEC1IDnEEPKc
C++ has many duplicated symbols like this, so C / C++ symbols in the order file should carry their .o information:
// order_file.txt
libDemoLibrary.a(object.o):__GLOBAL__sub_I_demo_file.cpp
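A minimal sketch of how such entries could be emitted, assuming the script already knows, for each symbol, whether it needs an object-file prefix. The (name, object_file) pair format and the function name are assumptions made for illustration; the entry format follows the example above.

```python
def order_file_lines(symbols):
    """Build order-file lines from (name, object_file) pairs in launch order.

    object_file is None when the bare symbol name is unambiguous (for example
    an Objective-C method); otherwise it is prepended as "object_file:name".
    Duplicate entries are dropped, keeping the first occurrence.
    """
    seen = set()
    lines = []
    for name, object_file in symbols:
        line = f"{object_file}:{name}" if object_file else name
        if line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


launch_symbols = [
    ("+[Foo load]", None),
    ("__GLOBAL__sub_I_demo_file.cpp", "libDemoLibrary.a(object.o)"),
    ("+[Foo load]", None),  # seen twice during tracing
]
print("\n".join(order_file_lines(launch_symbols)))
```

Deduplicating on the full line rather than on the symbol name keeps same-named C++ symbols from different .o files as separate entries.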
Limitations
Besides bl / b, the branch instructions also include br / blr, which call subroutines indirectly through registers. Static scanning cannot cover this case.
Local symbols
While scanning C++ static initializers, we found many symbols with names like l002. After some investigation, we found that the dependency's local symbols had been stripped when its static library was built, which turned __GLOBAL__sub_I_demo_file.cpp into l002.
A static library that is shipped as a package should keep its local symbols: the CI script should not run strip -x, and the Strip Style of the corresponding Xcode target should be changed to Debugging Symbols:
[Figure: Strip Style]
The local symbols kept in the static library are stripped before the host app's IPA is generated, so the final IPA size is not affected. Choose All Symbols as the Strip Style for the host app, and Non-Global Symbols for the host's dynamic libraries.
Objective-C methods
Most Objective-C method calls are compiled into calls to objc_msgSend. We hook this C function with fishhook (https://github.com/facebook/fishhook) to obtain the Objective-C symbols. Because objc_msgSend takes a variable number of arguments, the hook has to be implemented in assembly:
// Code adapted from InspectiveC
__attribute__((naked))
static void hook_Objc_msgSend() {
    save()
    __asm volatile ("mov x2, lr\n");
    __asm volatile ("mov x3, x4\n");
    call(blr, &before_objc_msgSend)
    load()
    call(blr, orig_objc_msgSend)
    save()
    call(blr, &after_objc_msgSend)
    __asm volatile ("mov lr, x0\n");
    load()
    ret()
}
Calling a subroutine requires saving and restoring the argument registers, so save and load push and pop x0 ~ x9 and q0 ~ q9 respectively. call invokes a function indirectly through a register:
#define save() \
__asm volatile ( \
"stp q6, q7, [sp, #-32]!\n" \
…

#define load() \
__asm volatile ( \
"ldp x0, x1, [sp], #16\n" \
…

#define call(b, value) \
__asm volatile ("stp x8, x9, [sp, #-16]!\n"); \
__asm volatile ("mov x12, %0\n" :: "r"(value)); \
__asm volatile ("ldp x8, x9, [sp], #16\n"); \
__asm volatile (#b " x12\n");
before_objc_msgSend saves LR on a stack, and after_objc_msgSend restores it. Because a trace file has to be generated, only the function address is written in order to keep the file small, and only addresses inside the code segments of the current executable's Mach-O images (the app and its dynamic libraries) are written.
On iOS, because of ASLR (https://en.wikipedia.org/wiki/Address_space_layout_randomization), the slide offset has to be subtracted before writing:
IMP imp = (IMP)class_getMethodImplementation(object_getClass(self), _cmd);
unsigned long imppos = (unsigned long)imp;
unsigned long addr = imppos - macho_slide;
Get the address range of a binary's __text section:
unsigned long size = 0;
unsigned long start = (unsigned long)getsectiondata(mhp, "__TEXT", "__text", &size);
unsigned long end = start + size;
Once we have the function address, the symbol name of the method can be found by looking the address up in the linkmap.
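The lookup itself is a range search over the linkmap's symbol table. The sketch below is a minimal version, assuming the symbols are available as (start address, size, name) tuples and that the trace recorded runtime addresses together with the image's slide. The second symbol and the slide value are invented for the example.

```python
import bisect


def build_index(symbols):
    """Sort (start_address, size, name) tuples and return (starts, ordered)."""
    ordered = sorted(symbols)
    starts = [start for start, _, _ in ordered]
    return starts, ordered


def symbol_for(runtime_addr, slide, starts, ordered):
    """Map a runtime address to a linkmap symbol name, or None if not covered."""
    addr = runtime_addr - slide              # undo ASLR to get the linkmap address
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, size, name = ordered[i]
    return name if addr < start + size else None


starts, ordered = build_index([
    (0x100004A30, 0x1C, "_demo_constructor"),
    (0x100004B00, 0x40, "-[Foo bar]"),       # hypothetical neighbour
])
slide = 0x2C000                               # hypothetical ASLR slide
print(symbol_for(0x100004A38 + slide, slide, starts, ordered))
```

Addresses that fall between two symbols, or outside the table entirely, return None, which is a useful signal that the address belongs to another image or that the slide was wrong.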
Block
A block is a special unit. The compiled body of a block is a C function, and it is called directly through a pointer rather than through objc_msgSend, so it needs a separate hook.
From the block source code, the memory layout of a block is as follows:
struct Block_layout {
    void *isa;
    int32_t flags; // contains ref count
    int32_t reserved;
    void *invoke;
    struct Block_descriptor1 *descriptor;
};
struct Block_descriptor1 {
    uintptr_t reserved;
    uintptr_t size;
};
The idea of the hook is to replace invoke with a custom implementation and to save the original implementation in reserved.
// Reference: https://github.com/youngsoft/YSBlockHook
if (layout->descriptor != NULL && layout->descriptor->reserved == 0)
{
    if (layout->invoke != (void *)hook_block_envoke)
    {
        // Save the original invoke, then install the hook
        layout->descriptor->reserved = (uintptr_t)layout->invoke;
        layout->invoke = (void *)hook_block_envoke;
    }
}
Because the function signatures of blocks differ from one another, hook_block_envoke is again implemented in assembly:
__attribute__((naked))
static void hook_block_envoke() {
    save()
    __asm volatile ("mov x1, lr\n");
    call(blr, &before_block_hook);
    __asm volatile ("mov lr, x0\n");
    load()
    // Call the original invoke, i.e. the address saved in descriptor->reserved
    __asm volatile ("ldr x12, [x0, #24]\n");
    __asm volatile ("ldr x12, [x12]\n");
    __asm volatile ("br x12\n");
}
before_block_hook is where the function address is obtained (again, subtract the slide).
intptr_t before_block_hook(id block, intptr_t lr)
{
    struct Block_layout *layout = (struct Block_layout *)block;
    // layout->descriptor->reserved is the function address of the block
    return lr;
}
Similarly, the block's symbol can be found by looking the function address up in the linkmap.
Bottlenecks
The scheme based on static scanning and runtime tracing still has a few bottlenecks:
- initialize cannot be hooked
- Some blocks cannot be hooked
- Indirect C++ function calls through registers cannot be found by static scanning
The current rearrangement scheme covers 80% to 90% of the symbols. In the future, we will try approaches such as compile-time instrumentation to reach 100% symbol coverage and the best possible result.
Overall process
[Figure: Process flow]
- Set the conditions that trigger the process
- Inject the trace dynamic library into the project and build in Release mode to obtain the .app, the linkmap, and the intermediate products
- Run the app once until startup finishes; the trace dynamic library generates a trace log in the sandbox
- With the trace log, the intermediate products, and the linkmap as input, run the script to produce the order file
Summary
With no industry experience to draw on, we have verified the feasibility and stability of binary rearrangement in iOS app development. Based on binary rearrangement, we improved TikTok's startup speed on the iOS client by about 15%.
More abstractly, app development often runs into a common problem: in some cases, running the app triggers a large number of page faults, which slows down code execution. Binary rearrangement currently appears to be a good solution to this general problem.
In the future, we will try binary rearrangement in more business scenarios.