How LuaJIT works – JIT mode

Time: 2022-11-25

In the previous article, we said that JIT is LuaJIT's killer feature for performance. In this article, we introduce how the JIT works.

Just-in-time (JIT) compilation, as used in LuaJIT, means compiling Lua bytecode into machine instructions at run time, so the bytecode no longer needs to be interpreted: the machine instructions produced by the JIT compiler are executed directly.
In other words, interpretation mode and JIT mode consume the same input, Lua bytecode. Getting an obvious performance difference (an order of magnitude is fairly common) out of the same bytecode input still takes real skill.

JIT can be divided into several steps:

  1. Counting: find the hot code
  2. Recording: record the hot code path and generate SSA IR code
  3. Generating: optimize the SSA IR code and generate machine instructions
  4. Executing: run the newly generated machine instructions
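The whole pipeline can be watched in action. A small sketch, assuming the stock LuaJIT distribution (which ships the jit.v helper module):

```lua
-- Print one line per trace as it is started, compiled, or aborted.
-- (jit.dump can be used instead for bytecode/IR/machine-code dumps.)
require("jit.v").on()

local sum = 0
for i = 1, 1e6 do   -- hot loop: counted, then recorded, compiled, executed
  sum = sum + i
end
print(sum)
```

Run under `luajit`, the jit.v output shows the loop being picked up by the compiler once it becomes hot.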

The unit of JIT compilation

Before going any further, let's introduce a basic concept.
LuaJIT's JIT is trace-based. A trace is a flow of bytecode execution, and it can span functions.
By comparison, Java's JIT is method-based. It does have function inlining, but with fairly tight restrictions (only small functions are inlined into the JIT-compiled method).

Personally, I think a tracing JIT has more theoretical headroom than a method-based JIT, and on some benchmark cases it should indeed do better.
However, its engineering complexity is much higher, so the actual industrial outcome is hard to call (many other factors affect JIT effectiveness, such as the optimizer).

Take this small example:

local debug = false
local function bar()
  return 1
end

local function foo()
  if debug then
    print("some debug log works")
  end
  
  return bar() + 1
end

When foo() is JIT compiled, there are two distinct advantages:

  1. print("some debug log works") is never actually executed, so it is not included in the traced bytecode stream at all; no machine code is generated for it, and the generated machine code is therefore smaller (the smaller the machine code, the higher the CPU cache hit rate)
  2. bar() is compiled inline, without function-call overhead (yes, at the machine-instruction level, the cost of a function call is worth considering)
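Note that the trace only bets on the branch it saw; it does not freeze it forever. A sketch of the expected behavior, continuing the example above (illustrative, not verified output):

```lua
-- The hot trace for foo() records only the fast path and adds a guard
-- on the value of the `debug` upvalue.
for i = 1, 200 do
  foo()        -- hot: the trace assumes `debug` is false
end

debug = true   -- the guard on `debug` now fails inside the trace
foo()          -- exits the trace through a snapshot; the print branch
               -- then runs in the interpreter
```

So correctness is preserved: the guard catches the changed assumption and execution falls back to interpretation.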

Counting

Next, we introduce each phase of JIT one by one.
Counting is easy to understand. A major feature of JIT is that it compiles only hot code (compiling everything would make it AOT instead).

A JIT typically counts at two kinds of entry points:

  1. Function calls: when the number of executions of a function reaches a threshold, JIT compilation of that function is triggered
  2. Loop iterations: when the number of executions of a loop body reaches a threshold, JIT compilation of that loop body is triggered

That is, hot functions and hot loops are counted.

However, LuaJIT is trace-based, and a trace may exit midway. So there is a third counting point, trace exits:
if a trace frequently exits through the same snapshot (snapshots are introduced below), JIT compilation starts from that snapshot and generates a side trace.
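These thresholds are tunable from Lua via the standard jit.opt module. A sketch; the values shown are the documented LuaJIT 2.1 defaults, listed here for illustration:

```lua
local opt = require("jit.opt")
opt.start(
  "hotloop=56",  -- loop iterations before a loop is considered hot
  "hotexit=10"   -- exits through one snapshot before a side trace is compiled
)
```

Lowering these makes the compiler kick in sooner (useful for experiments); raising them makes it more conservative.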

Recording

When a function/loop becomes hot enough, the JIT compiler starts working.
The first step is recording. The core of recording is: generate IR code while interpreting.

The specific process is:

  1. Modify DISPATCH to install a hook on bytecode interpretation
  2. In the hook, generate the corresponding IR code for the bytecode currently being executed; the hook also decides whether to complete the recording or abort it early
  3. Continue interpreting the bytecode

Everything from the start of recording to its completion forms the basic unit, a trace; the bytecode flow interpreted during that period is the execution flow this trace is meant to accelerate.

Because recording captures one real execution flow, for branching code the trace naturally does not assume that every future execution will take the current branch; instead, a guard is added to the IR.
A snapshot is also recorded at appropriate points, containing some context information.
If a later execution exits through that snapshot, the context is restored from it.
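A small sketch of what a guard protects. The trace for this loop is recorded while t[i] is always a number, so the IR carries a "t[i] is a number" guard; the string at the end violates it, forcing an exit through the most recent snapshot back to the interpreter (which handles the string-to-number coercion):

```lua
local t = {}
for i = 1, 100 do t[i] = i end
t[100] = "100"       -- a string; arithmetic still works via coercion

local sum = 0
for i = 1, 100 do
  sum = sum + t[i]   -- trace guards "t[i] is a number"; i == 100 fails
end                  -- the guard and exits via the snapshot
print(sum)           --> 5050 either way; only the execution path differs
```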

Some additional details:
Not all bytecodes can be JIT compiled (for details, see LuaJIT NYI).
When LuaJIT meets an NYI bytecode, it can still stitch. For example, FUNCC supports stitching, so the code before and after a FUNCC call is recorded as two traces. The end result looks like this: JIT executes the machine code of trace 1 => FUNCC is interpreted => JIT executes the machine code of trace 2. Gluing the two traces together is the effect of stitching.
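A sketch of where stitching would apply, assuming os.clock is not itself compiled (it is a plain C function called via FUNCC):

```lua
local acc = 0
for i = 1, 1000 do
  acc = acc + i         -- belongs to trace 1
  local _ = os.clock()  -- C call, interpreted (FUNCC)
  acc = acc + 1         -- belongs to trace 2, stitched after trace 1
end
```

Without stitching, the uncompilable call in the middle would force the whole loop body back to the interpreter.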

Generating

With the IR code and associated information in hand, optimized machine code can be generated from it.

There are two steps here:

  1. Optimize the IR code
    LuaJIT's IR is in static single assignment form (SSA), a common optimizer intermediate representation. Many standard optimization passes can be applied, such as dead code elimination, loop-invariant code motion, and so on.
  2. Generate machine instructions from the IR
    This part has two main tasks: register allocation, and translating IR operations into machine instructions, e.g. translating an IR ADD into a machine ADD instruction.

For each guard in the IR, an "if ... jump" style instruction sequence is generated; the stub instructions at the jump target handle exiting through the corresponding snapshot.

At this point we can see why JIT-generated machine code can be more efficient:

  1. Based on the execution-flow assumptions made during recording, branch-prediction-friendly instructions can be generated; in the ideal case, the CPU effectively executes instructions sequentially
  2. The SSA IR code has been optimized
  3. Registers are used more efficiently (there is no longer the interpreter's own state-keeping burden, so more registers are available)

Executing

After the machine instructions are generated, the bytecode is patched: for example, FUNCF is changed to JFUNCF traceno.
The next time the interpreter executes this JFUNCF, it jumps to the machine instructions of traceno. That completes the switch from interpretation mode to JIT mode, and is also the main way JIT-compiled instructions are entered.
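The switch can also be controlled from Lua code through the standard jit module; a small sketch:

```lua
print(jit.status())  -- true/false plus the enabled optimization flags

local function hot(n)
  local s = 0
  for i = 1, n do s = s + i end
  return s
end

jit.off(hot)         -- keep this function in the interpreter
print(hot(1e6))      -- still correct, just never JIT compiled
jit.on(hot)          -- allow it to be compiled again
```

This is handy for isolating whether a perf problem comes from a specific function being (or not being) compiled.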

There are two ways to exit a trace:

  1. Normal completion, after which execution returns to interpretation mode
  2. A guard in the trace fails, so the trace is exited midway; the context is first restored from the corresponding snapshot, and then interpretation resumes

In addition, exits from a trace are also counted.
If the number of exits through a snapshot reaches the hot-side-exit threshold (the hotexit parameter), a side trace is generated starting from that snapshot.
The next time execution exits through that snapshot, it jumps directly to this side trace.

In this way, hot code with branches also ends up fully covered by the JIT, not all at once at the beginning, but step by step, as needed.

Finally

As a small embedded language, Lua itself is delicate and lightweight, and LuaJIT's implementation inherits these characteristics.
In a single-threaded environment, the JIT compiler eats into the time of the current workflow, so the efficiency of the JIT compiler itself matters a great deal.
Having the JIT compiler block the workflow for a long time would be unacceptable, so balance is important here too.

By comparison, Java's JIT compilation is done by separate JIT compiler threads and can afford deeper optimization; Java's C2 JIT compiler applies relatively heavy optimizations.

JIT is a very good piece of technology, and working through the basic process/principles of how it operates is quite a brain-teaser.

I hear that JS's V8 engine even has a deoptimization process, which I'm quite curious about; I'll look into it when I have time.