Principles and Practice of the Java Just-In-Time (JIT) Compiler

Date: 2021-09-27

1. Introduction

Compiled languages such as C++ typically compile source code directly into machine code the CPU can execute. To achieve "compile once, run anywhere", Java splits compilation into two stages: javac first compiles the source into a platform-independent intermediate form, bytecode, which an interpreter then translates into machine code instruction by instruction at run time. As a result, Java has traditionally lagged compiled languages such as C++ in performance.

To close this gap, the JVM adds a just-in-time (JIT) compiler alongside the interpreter. When a program starts, the interpreter runs first, so code can execute immediately. As the program keeps running, the JIT compiler gradually takes over, compiling and optimizing more and more of the code into native code for higher execution efficiency. The interpreter then doubles as a fallback: if an aggressive compiler optimization turns out to be invalid, the JVM can switch back to interpreted execution to keep the program running correctly.

The JIT compiler greatly improves the running speed of Java programs, and unlike static compilation it can selectively compile only hot code, saving a great deal of compilation time and space. JIT compilation is now very mature, and Java can even rival compiled languages in peak performance. Research in this area nevertheless continues, exploring how to combine different compilation strategies and use smarter heuristics to speed programs up further.

2. Java Execution Process

The overall execution of Java can be divided into two stages. First, javac compiles the source code into bytecode, performing lexical analysis, syntax analysis, and semantic analysis along the way; in compiler theory this is called front-end compilation. The bytecode is then interpreted and executed instruction by instruction, without further compilation. While interpreting, the virtual machine also collects information about the running program, and based on that information the JIT compiler gradually kicks in, performing back-end compilation: compiling bytecode into machine code. Not all code is compiled, however; only the code the JVM identifies as hot is.

What counts as hot code? The JVM maintains a threshold: when the number of invocations of a method or code block within a certain period exceeds it, the code is compiled and the result is stored in the CodeCache. The next time that code is executed, the machine code is read directly from the CodeCache and run, improving performance. The overall process is roughly shown in the figure below:

[Figure: overall Java execution and JIT compilation flow]
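The counter-and-cache mechanism described above can be sketched in a few lines of Java. This is a toy model for illustration only: the threshold value, the Map-based "code cache", and the class and method names are our own simplifications, not HotSpot internals.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of hot-code detection: count invocations per method,
// "compile" once a threshold is crossed, then reuse the cached result.
class HotCodeDetector {
    static final int HOT_THRESHOLD = 3;                    // real JVMs use far larger values
    final Map<String, Integer> invocationCount = new HashMap<>();
    final Map<String, String> codeCache = new HashMap<>(); // method -> "machine code"

    // Returns how the method would be executed on this call.
    String invoke(String method) {
        if (codeCache.containsKey(method)) {
            return "native";                               // reuse machine code from the cache
        }
        int count = invocationCount.merge(method, 1, Integer::sum);
        if (count >= HOT_THRESHOLD) {
            codeCache.put(method, "machine code of " + method); // "compile" the hot method
        }
        return "interpreted";
    }
}
```

The third call crosses the toy threshold and triggers "compilation", so from the fourth call on the cached "machine code" is used.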

1. Compilers in the JVM

The JVM integrates two compilers, the client compiler and the server compiler, which serve different purposes. The client compiler emphasizes startup speed and local optimizations, while the server compiler performs more global optimization and produces faster code, at the cost of a slower start due to the extra global analysis. The two compilers suit different scenarios, and both are active in the virtual machine at the same time.

Client Compiler

The HotSpot VM ships with a client compiler, the C1 compiler. C1 starts quickly, but the code it produces performs worse than the server compiler's. C1 does three things:

  • Simple, reliable local optimizations, such as basic optimizations on the bytecode, method inlining, and constant propagation, while giving up many time-consuming global optimizations.
  • It builds the bytecode into a high-level intermediate representation (HIR). HIR is platform-independent and usually graph-structured, which makes it easier for the JVM to optimize the program.
  • Finally, HIR is lowered to a low-level intermediate representation (LIR), on which register allocation, peephole optimization (a local optimization that rewrites small instruction sequences within one or a few basic blocks into faster equivalents, based on the characteristics of the CPU's instruction set), and other passes are performed.
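To make one of these local optimizations concrete, here is a minimal sketch of constant folding on a toy expression tree. The Expr/Const/Add types are our own invention for illustration, not C1's HIR: an addition whose operands are both constants is replaced by a single constant at compile time.

```java
// Toy constant folding: fold Add(Const, Const) into a single Const node.
class ConstFold {
    interface Expr {}
    record Const(int value) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}

    static Expr fold(Expr e) {
        if (e instanceof Add a) {
            Expr l = fold(a.left()), r = fold(a.right());
            if (l instanceof Const cl && r instanceof Const cr) {
                return new Const(cl.value() + cr.value()); // computed at "compile time"
            }
            return new Add(l, r);                          // cannot fold: keep the addition
        }
        return e;
    }
}
```

Folding `1 + 2` yields a single constant node `3`, so no addition remains in the generated code.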

Server Compiler

The server compiler focuses on global optimizations that take longer to compile, and it even performs speculative, aggressive optimizations based on profiling information. It starts more slowly and suits long-running background services; its peak performance is usually more than 30% higher than the client compiler's. HotSpot currently uses two kinds of server compiler: C2 and Graal.

C2 Compiler

In the hotspot VM, the default server compiler is the C2 compiler.

When optimizing, the C2 compiler uses a graph data structure that combines control flow and data flow, called the ideal graph. The ideal graph represents the data-flow and instruction dependencies of the current program. Relying on this structure, some optimization passes (especially those involving floating code blocks) become much less complex.

The ideal graph is built while parsing the bytecode: nodes are added to an initially empty graph as each instruction is processed. A node in the graph usually corresponds to an instruction block containing several related instructions. The JVM applies optimization techniques such as global value numbering and constant folding to these instructions during parsing, and after parsing it also eliminates dead code. Once the ideal graph has been generated, some global optimizations are performed on it using the collected profiling information; if the JVM judges that global optimization is unnecessary at this point, it skips this phase.

Whether or not global optimization is performed, the ideal graph is then lowered to a MachNode graph that is closer to the machine level, and the final machine code is generated from that graph. Before code generation there are further passes, including register allocation and peephole optimization. The ideal graph and the various global optimizations are described in detail in later chapters. The server compiler's optimization pipeline is shown in the figure below:

[Figure: server compiler optimization pipeline]

Graal Compiler

Starting with JDK 9, a new server compiler, the Graal compiler, has been integrated into the HotSpot VM. Compared with the C2 compiler, Graal has several key features:

  • As mentioned earlier, the JVM collects all kinds of profiling information while interpreting, and the compiler then performs speculative optimizations based on it, such as branch prediction: it selectively compiles the branches a program is most likely to take, according to the observed probability of each branch. Graal leans on such speculation more heavily than C2, so Graal's peak performance is usually better than C2's.
  • Graal is written in Java, which makes it friendlier to the Java language, especially to newer features such as lambdas and streams.
  • It performs deeper optimizations, such as virtual-function inlining and partial escape analysis.

The Graal compiler can be enabled with the JVM flags -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler. When enabled, it replaces C2 in HotSpot and serves the compilation requests C2 would otherwise handle.

2. Tiered compilation

Before Java 7, developers had to choose a compiler based on the nature of the service. For services that needed to start quickly or would not run long, C1 with its fast compilation could be selected with the -client flag; for long-running services or background services needing peak performance, C2 could be selected with the -server flag. Java 7 introduced tiered compilation, which combines the advantages of C1 and C2 and pursues a balance between startup speed and peak performance. Tiered compilation divides the JVM's execution state into five levels:

  • Level 0: interpreted execution.
  • Level 1: C1-compiled code with no profiling.
  • Level 2: C1-compiled code with only method-invocation and back-edge counters as profiling.
  • Level 3: C1-compiled code with full profiling.
  • Level 4: C2-compiled code.

Profiling means collecting data that reflects the program's execution state. The most basic statistics are the method invocation count and the loop back-edge execution count.

In general, C2 code executes more than 30% faster than C1 code. Among the C1 tiers, execution efficiency ranks level 1 > level 2 > level 3. Of the five levels, levels 1 and 4 are terminal states: once a method reaches a terminal state, the JVM will not issue another compilation request for it as long as the compiled code remains valid. At run time, the JVM chooses different compilation paths from interpretation to a terminal state depending on how the service behaves. The figure below shows several common compilation paths:

[Figure: common tiered-compilation paths]

  • Path ① in the figure is the common case: a hot method is interpreted, compiled at level 3 by C1, and finally compiled at level 4 by C2.
  • If a method is small (such as the getter/setter methods common in Java services) and level 3 profiling collects no valuable data, the JVM concludes that C1 and C2 code would perform equally for this method and takes path ②: after level 3 compilation, it skips C2 and drops to level 1 C1-compiled code.
  • When C1 is busy, path ③ is taken: the program is profiled during interpretation and then compiled directly at level 4 by C2.
  • As noted above, among the C1 tiers execution efficiency ranks level 1 > level 2 > level 3, and level 3 is typically more than 35% slower than level 2. So when C2 is busy, path ④ is taken: the method is first compiled at level 2 by C1 and later at level 3 by C1, reducing the time it spends at level 3.
  • If the compiler makes speculative optimizations, such as branch prediction, that prove wrong at run time, it deoptimizes and returns to interpreted execution; path ⑤ in the figure represents deoptimization.

Generally speaking, C1 compiles faster and C2 produces higher-quality code. The different paths of tiered compilation are the JVM's way of finding the best balance point for the current workload. Since JDK 8, the JVM enables tiered compilation by default.

3. Triggering JIT compilation

The Java virtual machine triggers JIT compilation based on method invocation counts and loop back-edge execution counts. A back edge is a concept from the control-flow graph; in a program it can be understood simply as an instruction that jumps backward, as in the following code:

Loop back edge

public void nlp(Object obj) {
  int sum = 0;
  for (int i = 0; i < 200; i++) {
    sum += i;
  }
}

Compiling the code above produces the bytecode below. The bytecode at offset 18 jumps back to offset 4. During interpretation, the Java virtual machine increments the method's back-edge counter by 1 each time this instruction runs.

Bytecode

public void nlp(java.lang.Object);
    Code:
       0: iconst_0
       1: istore_1
       2: iconst_0
       3: istore_2
       4: iload_2
       5: sipush        200
       8: if_icmpge     21
      11: iload_1
      12: iload_2
      13: iadd
      14: istore_1
      15: iinc          2, 1
      18: goto          4
      21: return

During JIT compilation, the compiler identifies the head and tail of the loop. In the bytecode above, the loop body's head and tail are the bytecodes at offsets 11 and 15, respectively. The compiler inserts back-edge counter increments at the tail of the loop body to count loop iterations.

When the sum of the invocation count and the back-edge count exceeds the threshold specified by the -XX:CompileThreshold parameter (default 1500 when C1 is used; default 10000 when C2 is used), JIT compilation is triggered.

With tiered compilation enabled, the -XX:CompileThreshold threshold no longer applies; compilation is instead triggered when one of the following conditions holds:

  • The method invocation count exceeds the threshold specified by the -XX:TierXInvocationThreshold parameter multiplied by a coefficient.
  • The method invocation count exceeds the threshold specified by the -XX:TierXMinInvocationThreshold parameter multiplied by the coefficient, and the sum of the invocation count and the back-edge count exceeds the threshold specified by the -XX:TierXCompileThreshold parameter multiplied by the coefficient.

Tiered compilation trigger condition formula

i > TierXInvocationThreshold * s || (i > TierXMinInvocationThreshold * s && i + b > TierXCompileThreshold * s)

where i is the invocation count, b is the back-edge count, and s is a coefficient.

Meeting either condition triggers JIT compilation, and the JVM dynamically adjusts the coefficient s according to the current number of compilation requests and compiler threads.
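The trigger condition above can be transcribed directly into code. The threshold values used below are illustrative only, not the JVM's real per-tier defaults.

```java
// Direct transcription of the tiered-compilation trigger formula:
// i = invocation count, b = back-edge count, s = dynamic scale coefficient.
class TierTrigger {
    static boolean shouldCompile(long i, long b, double s,
                                 long invocationThreshold,
                                 long minInvocationThreshold,
                                 long compileThreshold) {
        return i > invocationThreshold * s
            || (i > minInvocationThreshold * s && i + b > compileThreshold * s);
    }
}
```

Note how the second clause lets a loop-heavy method compile early: even with few invocations, a large back-edge count can push `i + b` over the compile threshold.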

3. Compilation Optimization

The JIT compiler performs a series of optimizations on running code: analysis during bytecode parsing, local optimizations on intermediate forms of the code during compilation, and global optimizations based on the program dependence graph, before finally generating machine code.

1. Intermediate representation

In compiler theory, a compiler is usually divided into a front end and a back end. The front end performs lexical, syntax, and semantic analysis and produces an intermediate representation (IR); the back end optimizes the IR and generates target code.

Java bytecode is itself a kind of IR, but its structure is complex and not well suited to global analysis and optimization. Modern compilers generally use graph-based IRs, and static single assignment (SSA) form is a commonly used one. Its defining property is that each variable is assigned exactly once, and may only be used after it has been assigned. For example:

SSA IR

{
  a = 1;
  a = 2;
  b = a;
}

In the code above, we can easily see that the assignment a = 1 is redundant, but a compiler cannot, at least not directly. A traditional compiler needs backward data-flow analysis to determine which variable values are overwritten. With SSA IR, however, the compiler can identify redundant assignments easily.

In SSA IR form, the pseudocode of the code above becomes:

SSA IR

{
  a_1 = 1;
  a_2 = 2;
  b_1 = a_2;
}

Since each variable can only be assigned once in SSA IR, a in the code is split into two variables, a_1 and a_2. By scanning these variables, the compiler can easily find that a_1 is never used after its assignment, so the assignment is redundant.
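A sketch of why SSA makes this easy: since every name is defined exactly once, a redundant assignment is simply a definition whose name never appears as a use. The string-based instruction format below is a toy of our own; a real compiler would also treat return values and escaping references as uses.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dead-store finder over SSA instructions of the form "target = expr".
class DeadStoreFinder {
    static List<String> deadStores(List<String> ssaInstructions) {
        Set<String> used = new HashSet<>();
        for (String insn : ssaInstructions) {
            String rhs = insn.split("=", 2)[1];
            // every identifier on the right-hand side counts as a use
            for (String tok : rhs.trim().split("\\W+")) used.add(tok);
        }
        List<String> dead = new ArrayList<>();
        for (String insn : ssaInstructions) {
            String target = insn.split("=", 2)[0].trim();
            if (!used.contains(target)) dead.add(target); // defined but never used
        }
        return dead;
    }
}
```

On the SSA pseudocode above, `a_1` is reported dead because its one definition is never read, while `a_2` survives because `b_1 = a_2` uses it.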

SSA IR is also very helpful to other optimizations, such as the following dead-code elimination example:

DeadCodeElimination

public void DeadCodeElimination() {
  int a = 2;
  int b = 0;
  if (2 > 1) {
    a = 1;
  } else {
    b = 2;
  }
  add(a, b);
}

The corresponding SSA IR pseudocode is:

DeadCodeElimination

a_1 = 2;
b_1 = 0;
if true:
  a_2 = 1;
else:
  b_2 = 2;
add(a_2, b_1);

By scanning the SSA form, the compiler finds that b_2 is never used after its assignment and that the else branch never executes. After deleting this dead code, we get:

DeadCodeElimination

public void DeadCodeElimination() {
  int a = 1;
  int b = 0;
  add(a, b);
}

We can regard each compiler optimization as a graph algorithm that receives an IR graph and outputs a transformed IR graph. The compiler's optimization process is a series of such graph transformations chained together.

Intermediate representation in C1

As mentioned earlier, the C1 compiler performs its optimizations on the high-level intermediate representation HIR and the low-level intermediate representation LIR, both of which are in SSA form.

HIR is a control-flow graph composed of many basic blocks, each of which contains SSA-form instructions. The structure of a basic block is shown in the figure below:

[Figure: structure of a C1 HIR basic block]

Here, predecessors holds the predecessor basic blocks (since there may be several, it is a BlockList, a growable array of BlockBegin nodes), and successors likewise holds the successor BlockEnd blocks. Between these two parts is the block body, which contains the instructions the program executes and a next pointer to the block to execute next.

Bytecode is turned into HIR by GraphBuilder. GraphBuilder first traverses the bytecode to construct all the basic blocks and stores them as a linked structure, but at this point the blocks contain only a BlockBegin and no concrete instructions. In a second pass, GraphBuilder uses a ValueStack as the operand stack and local variable table to simulate the execution of the bytecode, constructing the corresponding HIR and filling in the previously empty blocks. The following example shows this process for a simple block of bytecode:

Bytecode construction HIR

Bytecode         Local variables   Operand stack   HIR
5: iload_1       [i1, i2]          [i1]
6: iload_2       [i1, i2]          [i1, i2]
7: imul                                            i3: i1 * i2
8: istore_3      [i1, i2, i3]      [i3]

As you can see, when iload_1 executes, the variable i1 is pushed onto the operand stack; when iload_2 executes, i2 is pushed. When the multiplication instruction imul executes, the two values on top of the stack are popped to construct the HIR instruction i3: i1 * i2, and the resulting i3 is pushed onto the stack.
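The abstract interpretation just described can be sketched as follows. The opcode strings and the textual HIR output are our own simplification of what GraphBuilder and its ValueStack do; real HIR instructions are objects, not strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy HIR builder: simulate the operand stack and emit a new SSA value
// for each arithmetic instruction.
class HirBuilder {
    static List<String> build(List<String> bytecode) {
        Deque<String> stack = new ArrayDeque<>();
        List<String> hir = new ArrayList<>();
        int nextId = 3;                         // i1, i2 name the loaded locals
        for (String op : bytecode) {
            switch (op) {
                case "iload_1" -> stack.push("i1");
                case "iload_2" -> stack.push("i2");
                case "imul" -> {
                    String right = stack.pop(), left = stack.pop();
                    String result = "i" + nextId++;
                    hir.add(result + ": " + left + " * " + right);
                    stack.push(result);         // the product becomes a new SSA value
                }
                case "istore_3" -> stack.pop(); // store pops the top of stack
                default -> throw new IllegalArgumentException(op);
            }
        }
        return hir;
    }
}
```

Running it on the four bytecodes of the example reproduces the single HIR instruction `i3: i1 * i2` from the table above.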

Most C1 optimizations are performed on HIR. After optimization, HIR is converted into LIR, which, like HIR, is an IR used inside the compiler. The conversion eliminates some intermediate nodes through optimization, so LIR is simpler in form.

Sea-of-Nodes IR

The ideal graph in the C2 compiler uses an intermediate representation called sea-of-nodes, also in SSA form. Its biggest feature is that it removes the concept of variables and operates directly on values. To aid understanding, the IR visualization tool Ideal Graph Visualizer (IGV) can display concrete IR graphs. For example, for the following code:

example

public static int foo(int count) {
  int sum = 0;
  for (int i = 0; i < count; i++) {
    sum += i;
  }
  return sum;
}

The corresponding IR diagram is as follows:

[Figure: sea-of-nodes IR graph for foo]

Sequentially executed nodes in the figure are grouped into the same basic block, such as B0 and B1. In block B0, Start node 0 is the method entry, and Return node 21 in B3 is the method exit. Bold red lines are control flow, blue lines are data flow, and lines of other colors are special control or data flow. Control-flow edges connect fixed nodes; the rest are floating nodes (floating nodes can be placed in any position that satisfies their data dependencies; the process of placing them is called scheduling).

The graph uses a lightweight edge representation: an edge is represented only by a pointer from one node to another. A node is an instance of a subclass of Node and holds an array of pointers that specify its input edges. The advantage of this representation is that changing a node's input edge is fast: simply store a pointer to the new input node in the node's pointer array.

Relying on this graph structure and the collected profiling information, the JVM can schedule the floating nodes to obtain the best compilation result.
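A minimal sketch of the lightweight edge representation described above (the IrNode class is our own toy, not C2's Node hierarchy): each node holds an ordered list of input pointers, so rewiring an edge is a single store into that list.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sea-of-nodes node: an edge is just a pointer held in the input list.
class IrNode {
    final String op;
    final List<IrNode> inputs = new ArrayList<>();  // ordered input edges

    IrNode(String op, IrNode... in) {
        this.op = op;
        for (IrNode n : in) inputs.add(n);
    }

    // Changing an input edge only overwrites one pointer slot.
    void setInput(int index, IrNode newInput) {
        inputs.set(index, newInput);
    }
}
```

Replacing one operand of an `AddI` node, for example, is a single `setInput` call; no other node in the graph needs to be touched.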

Phi And Region Nodes

The ideal graph is an SSA IR. Since there is no concept of variables, a problem arises: different execution paths may assign different values to the same variable. In the following code, for example, the two branches of the if statement assign 5 and 6 respectively, and the value read afterwards depends on which path was taken.

example

int test(int x) {
  int a = 0;
  if (x == 1) {
    a = 5;
  } else {
    a = 6;
  }
  return a;
}

To solve this problem, the concept of Phi nodes is introduced: a Phi node selects a value according to the execution path taken. The code above can therefore be represented as the following figure:

[Figure: ideal graph with Phi and Region nodes]

A Phi node holds the values produced on the different paths; a Region node, based on the branch condition of each path, determines which of the Phi node's values the current execution should use. The SSA pseudocode with a Phi node is as follows:

Phi Nodes

int test(int x) {
  a_1 = 0;
  if(x == 1){
    a_2 = 5;
  }else {
    a_3 = 6;
  }
  a_4 = Phi(a_2,a_3);
  return a_4;
}

Global Value Numbering

Global value numbering (GVN) is an optimization technique that the sea-of-nodes IR makes very easy to apply.

GVN assigns a unique number to each computed value and then traverses the instructions looking for optimization opportunities; it can discover and eliminate equivalent computations. If the same multiplication with the same operands appears several times in a program, the JIT compiler can merge the occurrences into one, reducing the size of the generated machine code; if they occur on the same execution path, GVN also saves the redundant multiplication operations. In sea-of-nodes, because there are only values, the GVN algorithm becomes very simple: the JIT compiler only needs to check whether a floating node has the same opcode as an existing floating node and whether their input IR nodes match; if so, the two nodes can be merged into one. For example, the following code:

GVN

a = 1;
b = 2;
c = a + b;
d = a + b;
e = d;

GVN numbers values with a hash table. Computing a = 1 yields number 1, computing b = 2 yields number 2, and computing c = a + b yields number 3; each number is stored in the hash table. When d = a + b is computed, the compiler finds a + b already in the hash table, so it does not recompute it and takes the computed value directly from the table. The final e = d can likewise be found in the table and reused.

GVN can be understood as common subexpression elimination (CSE) on the IR graph. The difference is that GVN compares values directly for equality, while CSE uses the lexical analyzer to judge whether two expressions are the same.
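A minimal sketch of hash-based value numbering, assuming our own toy encoding of a computation as an operator string plus the value numbers of its inputs. Note this naive version does not handle commutativity: add(a, b) and add(b, a) would get different numbers.

```java
import java.util.HashMap;
import java.util.Map;

// Toy GVN: key each computation by (operator, input value numbers);
// identical keys reuse the same value number instead of recomputing.
class Gvn {
    private final Map<String, Integer> table = new HashMap<>();
    private int nextNumber = 0;

    int numberOf(String op, int... inputNumbers) {
        StringBuilder key = new StringBuilder(op);
        for (int n : inputNumbers) key.append(',').append(n);
        // a repeated computation finds its existing number in the table
        return table.computeIfAbsent(key.toString(), k -> nextNumber++);
    }
}
```

Numbering `c = a + b` and then `d = a + b` yields the same value number for c and d, which is exactly the merge described above.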

2. Method inlining

Method inlining is an optimization that, upon encountering a method call during compilation, brings the body of the target method into the compilation scope and replaces the original call. Most JIT optimizations build on inlining, which makes it a very important part of the JIT compiler.

Java services contain a large number of getter/setter methods. Without inlining, calling a getter/setter requires saving the current method's execution position, creating and pushing a stack frame for the getter/setter, accessing the field, popping the frame, and finally resuming the caller. With the call inlined, only the field access remains. In the C2 compiler, inlining happens while parsing the bytecode: when the compiler encounters a method-call bytecode, it decides, based on several threshold parameters, whether to inline the call; if so, it starts parsing the target method's bytecode. For example, the following code (taken from the web):

Method inline process

public static boolean flag = true;
public static int value0 = 0;
public static int value1 = 1;
public static int foo(int value) {
    int result = bar(flag);
    if (result != 0) {
        return result;
    } else {
        return value;
    }
}
public static int bar(boolean flag) {
    return flag ? value0 : value1;
}

IR graph of the bar method:

[Figure: IR graph of the bar method]

IR graph after inlining:

[Figure: IR graph after inlining]

Inlining not only copies the called method's IR graph nodes into the caller's IR graph, but also performs other operations:

  1. The called method's parameters are replaced with the arguments passed at the call site. In the example above, the P(0) node 1 in the bar method is replaced with the LoadField node 3 in the foo method.
  2. In the caller's IR graph, data dependencies on the call node become dependencies on the called method's return value; if there are multiple return nodes, a Phi node is generated to merge the return values and substitute for the original call node. In the figure, the edges from the == node 8 and the Return node 12 that pointed to the original Invoke node 5 are redirected to the newly generated Phi node 24.
  3. If the called method can throw an exception of some type, and the caller happens to have a handler for that exception type covering the call site, the JIT compiler connects the called method's exception-throwing path to the caller's exception handler.

Conditions for method inlining

Most compiler optimizations build on method inlining, so generally, the more methods are inlined, the more efficient the generated code. For the JIT compiler, however, more inlining also means longer compilation time, and the program reaches peak performance later.

The inlining depth can be adjusted with the JVM parameter -XX:MaxInlineLevel, and the depth of direct recursive inlining with -XX:MaxRecursiveInlineLevel. Some common inlining-related parameters are shown in the table below:

[Table: common inlining-related JVM parameters]

Virtual function inlining

Inlining is the JIT compiler's main means of improving performance, but virtual functions make inlining difficult, because at the inlining stage the compiler does not know which implementation will actually be called. Suppose, for example, we have a data-processing interface with three implementing methods: add, sub, and multi. The JVM stores all of a class's virtual functions in a virtual method table (VMT), and each instance of the class holds a pointer to its VMT. At run time, the instance object is loaded first, then the VMT is found through the instance, and the target method's address is found through the VMT. Virtual calls therefore perform worse than classic calls that jump directly to a method address. Unfortunately, in Java every call to a non-private member function is a virtual call.

The C2 compiler is smart enough to detect this situation and optimize virtual calls. Consider the following example:

virtual call

public class SimpleInliningTest
{
    public static void main(String[] args) throws InterruptedException {
        VirtualInvokeTest obj = new VirtualInvokeTest();
        VirtualInvoke1 obj1 = new VirtualInvoke1();
        for (int i = 0; i < 100000; i++) {
            invokeMethod(obj);
            invokeMethod(obj1);
        }
        Thread.sleep(1000);
    }
    public static void invokeMethod(VirtualInvokeTest obj) {
        obj.methodCall();
    }
    private static class VirtualInvokeTest {
        public void methodCall() {
            System.out.println("virtual call");
        }
    }
    private static class VirtualInvoke1 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }
}

After JIT optimization, disassembly yields the following assembly code:

0x0000000113369d37: callq  0x00000001132950a0  ; OopMap{off=476}
                                                ;* invokevirtual methodCall   // a virtual call
                                                ; - SimpleInliningTest::invokeMethod (line 18)
                                                ;   {optimized virtual_call}  // the virtual call has been optimized

You can see that the JIT compiler has turned the virtual call to methodCall into an optimized virtual_call, and a call optimized this way can be inlined. However, the C2 compiler's ability here is limited: for virtual calls with multiple implementations it is "powerless".

For example, in the following code we add another implementation:

Virtual call with multiple implementations

public class SimpleInliningTest
{
    public static void main(String[] args) throws InterruptedException {
        VirtualInvokeTest obj = new VirtualInvokeTest();
        VirtualInvoke1 obj1 = new VirtualInvoke1();
        VirtualInvoke2 obj2 = new VirtualInvoke2();
        for (int i = 0; i < 100000; i++) {
            invokeMethod(obj);
            invokeMethod(obj1);
            invokeMethod(obj2);
        }
        Thread.sleep(1000);
    }
    public static void invokeMethod(VirtualInvokeTest obj) {
        obj.methodCall();
    }
    private static class VirtualInvokeTest {
        public void methodCall() {
            System.out.println("virtual call");
        }
    }
    private static class VirtualInvoke1 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }
    private static class VirtualInvoke2 extends VirtualInvokeTest {
        @Override
        public void methodCall() {
            super.methodCall();
        }
    }
}

After disassembly, the following assembly code is obtained:

0x000000011f5f0a37: callq  0x000000011f4fd2e0  ; OopMap{off=28}
                                                ;* invokevirtual methodCall      // a virtual call
                                                ; - SimpleInliningTest::invokeMethod@1 (line 20)
                                                ;   {virtual_call}               // the virtual call was not optimized

You can see that the virtual call with multiple implementations has not been optimized and remains a plain virtual_call.

For this situation, the Graal compiler applies probability prediction: it collects execution information for this part of the code. For example, if over some period it finds that calls through an interface method go to add and sub 50% of the time each, then along the path where add is encountered the JVM inlines add, and along the path where sub is encountered it inlines sub, improving the execution efficiency of both hot paths. If some other, unprofiled case is later encountered, the JVM deoptimizes, marks the location, and switches back to interpreted execution when it happens again.
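Conceptually, this guarded inlining can be sketched at the Java source level. The class names and method bodies below are ours, purely for illustration; the real transformation happens on the compiler's intermediate representation, not on source code, and the slow path would trigger deoptimization rather than a plain virtual dispatch:

```java
// Conceptual sketch of guarded (profile-based) inlining.
public class GuardedInliningSketch {
    static class Base { public int call() { return 0; } }
    static class A extends Base { @Override public int call() { return 1; } }
    static class B extends Base { @Override public int call() { return 2; } }

    // What the compiler conceptually emits after profiling shows the
    // receiver is almost always A or B: one type guard per hot target
    // with the inlined method body behind it, plus an uncommon path.
    public static int devirtualized(Base obj) {
        if (obj.getClass() == A.class) {
            return 1;               // inlined body of A.call()
        } else if (obj.getClass() == B.class) {
            return 2;               // inlined body of B.call()
        }
        return obj.call();          // uncommon path: full virtual dispatch
    }

    public static void main(String[] args) {
        int sum = devirtualized(new A()) + devirtualized(new B())
                + devirtualized(new Base());
        System.out.println(sum);    // 1 + 2 + 0 = 3
    }
}
```

The `getClass()` comparisons stand in for the type checks the JIT emits; because each guard proves the exact receiver type, the call behind it can be inlined safely.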

3. Escape analysis

Escape analysis is "a static analysis that determines the dynamic scope of pointers - where in the program a pointer can be accessed". The Java virtual machine's just-in-time compiler performs escape analysis on newly created objects to determine whether an object escapes the thread or the method. The JIT compiler uses two criteria to judge whether an object escapes:

  1. Whether the object is stored in the heap (a static field, or an instance field of a heap object). Once the object is stored in the heap, other threads can obtain a reference to it, and the JIT compiler can no longer track all code locations that use it.
  2. Whether the object is passed into unknown code. The JIT compiler treats code it has not inlined as unknown, because it cannot confirm whether the callee stores the receiver or the passed arguments in the heap. In that case, the receiver and arguments of the call are conservatively considered to escape.
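The two criteria can be illustrated with a small sketch (all names here are ours, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeExamples {
    static Object globalRef;                 // a static field lives in the heap

    public static void storeToHeap() {
        Object o = new Object();
        globalRef = o;                       // criterion 1: stored in the heap -> o escapes
    }

    public static void passToUnknown(List<Object> sink) {
        Object o = new Object();
        sink.add(o);                         // criterion 2: handed to code the JIT may not
    }                                        // inline -> o is treated as escaping

    public static int noEscape() {
        int[] box = {42};                    // never leaves this method: a candidate
        return box[0];                       // for stack allocation / scalar replacement
    }

    public static void main(String[] args) {
        storeToHeap();
        passToUnknown(new ArrayList<>());
        System.out.println(noEscape());
    }
}
```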

Escape analysis is usually carried out on the basis of method inlining. Based on its results, the JIT compiler can perform optimizations such as lock elimination, stack allocation, and scalar replacement. The following code is an example of objects that do not escape:

public class Example {
  public static void main(String[] args) {
    example();
  }
  public static void example() {
    Foo foo = new Foo();
    Bar bar = new Bar();
    bar.setFoo(foo);
  }
}
class Foo {}
class Bar {
  private Foo foo;
  public void setFoo(Foo foo) {
    this.foo = foo;
  }
}

In this example, two objects, foo and bar, are created, and one of them is passed as an argument to a method of the other. The method setFoo() stores a reference to the received Foo object; if the Bar object were on the heap, the reference to foo would escape. But in this case the compiler can determine through escape analysis that the bar object itself does not escape the call to example(), which means the reference to foo cannot escape either. The compiler can therefore safely allocate both objects on the stack.

Lock elimination

Lock elimination comes up when learning Java concurrent programming, and it is based on escape analysis.

If the JIT compiler can prove that a lock object does not escape, locking and unlocking it is meaningless, because no other thread can ever obtain the lock object. In that case, the JIT compiler eliminates the locking and unlocking operations on the non-escaping lock object. In fact, the compiler only needs to prove that the lock object does not escape the thread to eliminate the lock; due to limitations in the Java virtual machine's just-in-time compilation, this condition is strengthened to proving that the lock object does not escape the method currently being compiled. That said, lock elimination based on escape analysis is rare in practice.
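A classic candidate is a StringBuffer that never leaves the method: its append() methods are synchronized, but since the buffer cannot be seen by any other thread, the JIT may elide the locks entirely (whether it actually does depends on the JVM version and flags such as -XX:+EliminateLocks). The method name here is ours:

```java
public class LockElision {
    public static String concat(String a, String b) {
        StringBuffer sb = new StringBuffer(); // sb does not escape this method
        sb.append(a);                         // synchronized, but the lock is elidable
        sb.append(b);                         // synchronized, but the lock is elidable
        return sb.toString();                 // only the resulting String escapes
    }

    public static void main(String[] args) {
        System.out.println(concat("lock ", "elision"));
    }
}
```

This is also why the old advice that "StringBuffer is slow because of synchronization" matters less when the buffer is method-local: after escape analysis, the uncontended locks can disappear altogether.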

On stack allocation

We all know that Java objects are allocated on the heap, and the heap is visible to all objects. At the same time, the JVM needs to manage the allocated heap memory and reclaim the memory occupied by the object when it is no longer referenced. If escape analysis can prove that some newly created objects do not escape, the JVM can allocate them to the stack, and automatically reclaim the allocated memory space by popping the stack frame of the current method when the method where the new statement is located exits.

In this way, we do not need to use the garbage collector to deal with objects that are no longer referenced. However, the hotspot virtual machine does not actually allocate on the stack, but uses the scalar replacement technology. Scalars are variables that can store only one value, such as basic types in Java code. On the contrary, aggregate may store multiple values at the same time. A typical example is Java objects. The compiler will decompose the unaccepted aggregate into multiple scalars within the method to reduce the allocation on the heap. The following is an example of scalar substitution:

scalar replacement

public class Example {
  @AllArgsConstructor            // Lombok: generates Cat(int age, int weight)
  static class Cat {
    int age;
    int weight;
  }
  public static void example() {
    Cat cat = new Cat(1, 10);
    addAgeAndWeight(cat.age, cat.weight);
  }
}

After escape analysis, the cat object does not escape the call to example(), so the aggregate cat can be decomposed into two scalars, age and weight. The pseudocode after scalar replacement:

public class Example {
  @AllArgsConstructor
  static class Cat {
    int age;
    int weight;
  }
  public static void example() {
    int age = 1;
    int weight = 10;
    addAgeAndWeight(age, weight);
  }
}

Partial escape analysis

Partial escape analysis is another of Graal's applications of probability prediction. Normally, once the JVM finds that an object escapes a method or thread, it skips the related optimizations. The Graal compiler, however, still analyzes the program's execution paths: building on escape analysis, it determines on which paths the object escapes and on which it does not, and then applies lock elimination and stack allocation on the paths where it does not escape.
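A minimal sketch of a partial escape (all names here are illustrative): the object escapes only on the rare branch, so a compiler with partial escape analysis can sink the allocation into that branch and work with scalar-replaced fields on the hot path:

```java
public class PartialEscape {
    static Object cache;                  // heap sink reached only on the rare path

    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    public static int sum(boolean rare) {
        Point p = new Point(1, 2);
        if (rare) {
            cache = p;                    // p escapes only on this branch
        }
        return p.x + p.y;                 // hot path: p can be scalar-replaced,
    }                                     // avoiding the allocation entirely

    public static void main(String[] args) {
        System.out.println(sum(false));   // common path, no escape
    }
}
```

C2 would see the assignment to cache and give up on the whole method; Graal can still remove the allocation from the `rare == false` path.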

4. Loop Transformations

In the section introducing the C2 compiler, it was mentioned that C2 performs many global optimizations after building the ideal graph, including loop transformations. The two most important are loop unrolling and loop peeling (loop separation).

Loop unrolling

Loop unrolling is a loop transformation technique that tries to optimize execution speed at the expense of the program's binary size - an optimization that trades space for time.

Loop unrolling reduces computational overhead by reducing or eliminating the instructions that control the loop, such as the arithmetic that advances to the next array index or instruction. If the compiler can compute these indices ahead of time and build them into the machine-code instructions, the program does not have to do that computation at run time. In other words, some loops can be rewritten as repeated, independent statements. Take the following loop:

Loop unrolling

public void loopRolling() {
  for (int i = 0; i < 200; i++) {
    delete(i);
  }
}

The code above calls delete() 200 times. Unrolling the loop yields the following code:

Loop unrolling

public void loopRolling() {
  for (int i = 0; i < 200; i += 5) {
    delete(i);
    delete(i + 1);
    delete(i + 2);
    delete(i + 3);
    delete(i + 4);
  }
}

Unrolling like this reduces the number of loop iterations, and the computations within each iteration can make better use of the CPU pipeline. Of course, this is just an example; in practice, the JVM evaluates the benefit unrolling would bring before deciding whether to unroll.

Loop peeling

Loop peeling is another loop transformation. It separates one or more special iterations of a loop and executes them outside the loop. For example, consider the following code:

Loop peeling

int a = 10;
for (int i = 0; i < 10; i++) {
  b[i] = x[i] + x[a];
  a = i;
}

Except in the first iteration, where a = 10, a in this code is always equal to i - 1. The special case can therefore be peeled off, giving the following code:

Loop peeling

b[0] = x[0] + x[10];
for(int i = 1;i<10;i++){
  b[i] = x[i] + x[i-1];
}

This equivalent transformation removes the need for the variable a inside the loop, reducing its overhead.

5. Peephole optimization and register allocation

Peephole optimization, mentioned above, is the last optimization step, after which the program is converted into machine code. Peephole optimization replaces small groups of adjacent instructions in the compiler-generated intermediate code (or object code) with more efficient instruction sequences, using techniques such as strength reduction and constant folding. The following is an example of strength reduction:

Strength reduction

y1 = x1 * 3    becomes, after strength reduction,    y1 = (x1 << 1) + x1

Here the compiler replaces the multiplication with a shift and an addition, reducing the operation's strength and using cheaper instructions.
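The identity behind the rewrite is easy to check in plain Java (class and method names here are ours). A second common case, multiplication by a power of two, reduces to a single shift:

```java
// Verifying the strength-reduction identities x*3 == (x<<1) + x
// and x*8 == x<<3 over a range of inputs.
public class StrengthReduction {
    public static int timesThree(int x) { return (x << 1) + x; } // 2x + x
    public static int timesEight(int x) { return x << 3; }       // x * 2^3

    public static void main(String[] args) {
        for (int x = -1000; x <= 1000; x++) {
            if (timesThree(x) != x * 3 || timesEight(x) != x * 8)
                throw new AssertionError("mismatch at " + x);
        }
        System.out.println("ok");
    }
}
```

In practice you should keep the multiplication in your source; spotting these patterns is exactly the peephole optimizer's job.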

Register allocation is another compilation optimization, used extensively in the C2 compiler. By keeping frequently used variables in registers, which the CPU can access far faster than memory, it improves the program's running speed.

Register allocation and peephole optimization are the last steps of program optimization. After register allocation and peephole optimization, the program will be converted into machine code and saved in codecache.

4、 Practice

The real-time compiler is complex, and there is little practical experience on the network. Here are some adjustment experiences of our team.

1. Important parameters related to compilation

  • -XX:+TieredCompilation: enables tiered compilation; on by default since JDK 8
  • -XX:CICompilerCount=N: number of compiler threads; once set, the JVM divides them between C1 and C2 automatically in a 1:2 ratio
  • -XX:TierXBackEdgeThreshold: back-edge threshold for OSR compilation at tier X
  • -XX:TierXInvocationThreshold: invocation threshold for tier X when tiered compilation is enabled
  • -XX:TierXCompileThreshold: compilation threshold for tier X when tiered compilation is enabled
  • -XX:ReservedCodeCacheSize: maximum size of the code cache
  • -XX:InitialCodeCacheSize: initial size of the code cache

-XX:TierXInvocationThreshold and its related flags are the threshold parameters that trigger compilation when tiered compilation is enabled. Tier X compilation is triggered when the method's invocation count exceeds the threshold given by -XX:TierXInvocationThreshold multiplied by a coefficient, or when the invocation count exceeds -XX:TierXMinInvocationThreshold multiplied by the coefficient and the sum of the invocation count and the back-edge (loop) count exceeds -XX:TierXCompileThreshold multiplied by the coefficient. The coefficient is determined from the current number of methods awaiting compilation and the number of compiler threads. Lowering the thresholds increases the number of compiled methods, allowing some frequently used methods that would otherwise never be compiled to be compiled and optimized, improving performance.
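The trigger condition can be sketched as a predicate. This is a simplification; the threshold constants below mirror HotSpot's tier-3 defaults, which you should verify on your own JVM with -XX:+PrintFlagsFinal, and the coefficient s is computed internally by the JVM:

```java
// Simplified sketch of the tier-X compile trigger described above.
// i = method invocation count, b = back-edge (loop) count,
// s = dynamic scale coefficient derived from compiler-queue load.
public class TierThreshold {
    static final long INVOCATION_THRESHOLD     = 200;   // TierXInvocationThreshold
    static final long MIN_INVOCATION_THRESHOLD = 100;   // TierXMinInvocationThreshold
    static final long COMPILE_THRESHOLD        = 2000;  // TierXCompileThreshold

    public static boolean shouldCompile(long i, long b, double s) {
        return i > INVOCATION_THRESHOLD * s
            || (i > MIN_INVOCATION_THRESHOLD * s
                && i + b > COMPILE_THRESHOLD * s);
    }

    public static void main(String[] args) {
        // A loop-heavy method: few invocations, many back edges.
        System.out.println(shouldCompile(150, 1900, 1.0)); // true
        // A rarely called, loop-free method.
        System.out.println(shouldCompile(150, 0, 1.0));    // false
    }
}
```

Note how a larger s (a busier compiler queue) raises every threshold proportionally, which is why the JVM is said to adjust these thresholds dynamically.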

Because compilation is complex, the JVM dynamically adjusts the relevant thresholds to keep itself performing well, so manually tuning compilation parameters is not recommended - except in specific cases, such as when the code cache fills up and compilation stops, where you can increase the code cache size appropriately, or when some very frequently used method is not being inlined and is hurting performance, where you can raise the inlining depth or the inlinable method size.

2. Analyze the compilation log through jitwatch

By adding the parameters -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining -XX:+PrintCodeCache -XX:+PrintCodeCacheOnCompilation -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=<logpath>, you can write compilation, inlining, and code cache information to a file. However, the printed compilation log is long and complex, and it is hard to extract information from it directly. The JITWatch tool can analyze the log for you: choose the log file via Open Log on the JITWatch home screen and click Start to begin the analysis.

(Figure: JITWatch's main analysis interface)

As shown in the figure above, area 1 lists the project's Java classes, including third-party dependencies. Area 2 is the function area: Timeline shows the JIT compilation timeline graphically, Histo shows histogram statistics, TopList ranks objects and data produced during compilation, Cache shows the free code cache space, NMethod shows the native methods, and Threads shows the JIT compilation threads. Area 3 is JITWatch's presentation of the log analysis results; under Suggestions it offers code optimization advice, for example as shown in the figure below:

(Figure: JITWatch's Suggestions panel)

We can see that a call to ZipInputStream's read method is not marked as a hot method and is "too large" to be inlined. The -XX:CompileCommand=inline directive can force a method to be inlined, but it should be used cautiously: unless you have determined that inlining will bring a substantial performance gain, it is not recommended, and heavy use puts great pressure on the compiler threads and the code cache.

The -allocs and -locks entries in area 3 show that, after escape analysis, the JVM has optimized the code with stack allocation, lock elimination, and so on.

3. Use Graal compiler

Because the JVM dynamically adjusts the compilation thresholds according to the current number of methods to compile and the number of compiler threads, there is little room for tuning them in an actual service; the JVM already does enough on its own.

To improve performance, we tried the latest Graal compiler in a service. Adding -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler starts the Graal compiler in place of the C2 compiler and lets it respond to C2's compilation requests. Note, however, that the Graal compiler is incompatible with ZGC and can only be used with G1.

As mentioned earlier, Graal is a just-in-time compiler written in Java, and it has shipped with the JDK since Java 9 as an experimental JIT compiler. The Graal compiler comes out of GraalVM, a high-performance execution environment supporting multiple programming languages. It can run on a traditional OpenJDK, be compiled ahead of time (AOT) into a standalone executable, or even be embedded in a database.

As mentioned several times above, Graal's optimizations are based on assumptions. When an assumption turns out to be wrong, the Java virtual machine uses its deoptimization mechanism to switch from the machine code produced by the JIT compiler back to interpreted execution; if necessary, it even discards the machine code and recompiles after collecting a fresh program profile.

These aggressive techniques give Graal better peak performance than C2, and Graal performs particularly well with languages such as Scala and Ruby. Twitter already uses Graal extensively in production; the enterprise edition of GraalVM improved the performance of Twitter's services by 22%.

Performance after using Graal compiler

In our online service, after enabling Graal compilation, TP9999 dropped by 10 ms, from 60 ms to 50 ms, a 16.7% decrease.

Peak performance in steady-state operation is also higher. The Graal compiler evidently brought a measurable performance improvement to this service.

Graal compiler issues

The Graal compiler's optimizations are more aggressive, so it performs more compilation at startup. The Graal compiler itself also needs to be JIT-compiled, so the service performs poorly right after startup.

Solutions we considered: JDK 9 introduced the AOT tool jaotc, and GraalVM's Native Image can greatly speed up service startup through static compilation. However, GraalVM then uses its own garbage collector, a very primitive copying collector whose performance does not compare well with excellent modern collectors such as G1 and ZGC (see "Exploration and practice of a new generation of garbage collector, ZGC"). GraalVM's support for some Java features is also limited; for example, reflection requires a JSON configuration file listing every class accessed reflectively, which is a large burden for services that use reflection heavily. We are still researching this area.

5、 Summary

This article has introduced the principles of JIT compilation, some practical experience at Meituan, and the results of using the most cutting-edge just-in-time compiler. As a technology for improving the performance of interpreted languages, JIT is already quite mature and is used in many languages. For Java services, the JVM itself already does a great deal, but we should keep deepening our understanding of JIT optimization principles and the latest compilation technology, so as to make up for JIT's shortcomings and continuously improve the performance of Java services.

6、 References

[1] Deep understanding of Java virtual machine

[2] Proceedings of the Java™ Virtual Machine Research and Technology Symposium, Monterey, California, USA, April 23-24, 2001

[3] "Visualization of Program Dependence Graphs", Thomas Würthinger

[4] "In-depth Disassembly of the Java Virtual Machine", Zheng Yudi

[5] JIT’s profile artifact jitwatch
