[Original] Linux Virtualization: KVM/QEMU Analysis (2): ARMv8 Virtualization



  • Read the fucking source code!–By Lu Xun
  • A picture is worth a thousand words.–By Gorky


  1. KVM version: 5.9.1
  2. QEMU version: 5.0.0
  3. Tools: Source Insight 3.5, Visio

1. Overview

  • KVM virtualization depends on support from the underlying hardware. This article introduces the virtualization support in ARMv8 processors, including memory virtualization, interrupt virtualization, I/O virtualization, and so on;
  • ARM processors are used mainly in mobile devices. In recent years they have been moving into the server market, and their virtualization support has matured accordingly;
  • The functions of Hypervisor software include: memory management, device emulation, device assignment, exception handling, instruction trapping, virtual exception management, interrupt controller management, scheduling, context switching, memory translation, management of multiple virtual address spaces, and so on;
  • The ARMv8 virtualization support described here is essential for understanding the code under arch/arm64/kvm. Reading architecture-specific code without knowing the hardware is just fooling around;

Start the journey!

2. ARMv8 virtualization

2.1 Exception Level

  • ARMv7 and earlier architectures define a set of processor exception modes, such as USR, FIQ, IRQ, SVC, ABT, UND, SYS, HYP, and MON. Each mode has its own privilege level; for example, USR mode has privilege level PL0 and is where user-mode programs run;
  • The processor mode can be switched actively by privileged software, for example by modifying the CPSR register, or passively, for example by switching to IRQ mode when an interrupt arrives;

The exception modes of an ARMv7 processor are shown in the following table:

In ARMv8, however, Exception Levels (EL) replace privilege levels. The processor exception modes map to Exception Levels as follows:

  • When an exception occurs, the processor changes Exception Level (the equivalent of a processor mode switch in ARMv7) to handle it;
  • As the figure shows, the Hypervisor runs at EL2 and the Guest OS runs at EL1. The Guest OS can use the HVC (Hypervisor Call) instruction to request a service from the Hypervisor, and responding to a virtualization request involves an Exception Level switch;

2.2 Stage 2 translation

Stage 2 translation is closely tied to memory virtualization. It covers not only ordinary memory-mapped accesses but also memory-mapped I/O (MMIO) accesses and memory accesses controlled by System Memory Management Units (SMMUs).

2.2.1 Memory mapping

Before accessing physical memory, the OS must set up page tables that maintain the mapping between virtual addresses and physical addresses. Readers of the earlier memory management analysis should be familiar with the following figure. This can be regarded as Stage 1 translation:

  • With a virtual machine the situation is different. For example, when QEMU runs on a Linux system it is just a user process of that system, so what the Guest OS believes to be a physical address is actually a virtual address of a Linux user process, and a further mapping is needed to reach the final physical address;
  • Through Stage 2 translation, the Hypervisor controls the virtual machine's view of memory and whether the virtual machine may access a given region of physical memory, which achieves isolation;

  • The whole address mapping is divided into two stages:

    1. Stage 1: VA (Virtual Address) -> IPA (Intermediate Physical Address); the operating system controls Stage 1 translation;
    2. Stage 2: IPA (Intermediate Physical Address) -> PA (Physical Address); the Hypervisor controls Stage 2 translation;
  • The Stage 2 translation mechanism is very similar to Stage 1. One difference is that in Stage 2 the memory type (Normal or Device) is encoded directly in the page table entry, rather than looked up through the MAIR_ELx register;
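
The two stages can be illustrated with a toy model. The sketch below is not how the MMU or KVM implements translation: it collapses each multi-level page table into one dictionary keyed by page number, and the names (`lookup`, `translate`, `stage1`, `stage2`) and all addresses are made up for illustration.

```python
PAGE_SHIFT = 12                      # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TranslationFault(Exception):
    """Raised when a stage has no mapping for the requested page."""


def lookup(table, addr, stage):
    """Translate addr through one flat table (page number -> page number)."""
    page, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    if page not in table:
        raise TranslationFault(f"stage {stage} fault at {addr:#x}")
    return (table[page] << PAGE_SHIFT) | offset


def translate(va, stage1, stage2):
    """VA -> IPA with the guest's table, then IPA -> PA with the hypervisor's."""
    ipa = lookup(stage1, va, stage=1)
    return lookup(stage2, ipa, stage=2)


# Hypothetical mappings: the guest maps VA page 0x400 to IPA page 0x80,
# and the hypervisor backs IPA page 0x80 with physical page 0x12345.
stage1 = {0x400: 0x80}
stage2 = {0x80: 0x12345}

print(hex(translate(0x400ABC, stage1, stage2)))
```

Removing the entry from `stage2` makes the same access raise a Stage 2 fault even though the guest's own table is unchanged, which is the isolation property described above.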

  • Each VM is assigned a VMID, which identifies the VM that a TLB entry belongs to. This allows translations for several VMs to coexist in the TLB;

  • The operating system assigns each application an ASID (Address Space Identifier), which is also used to tag TLB entries. All TLB entries of one application carry the same ASID, so different applications can share the same TLB. Each VM has its own ASID space, so VMID and ASID are normally used together;
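
As a rough sketch of why the tags matter, the toy TLB below keys every entry on (VMID, ASID, virtual page). The class and method names are hypothetical, and the model ignores details of real hardware such as global entries that match any ASID.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with (VMID, ASID, virtual page)."""

    def __init__(self):
        self.entries = {}

    def insert(self, vmid, asid, vpage, ppage):
        self.entries[(vmid, asid, vpage)] = ppage

    def lookup(self, vmid, asid, vpage):
        # A hit requires all three tags to match, so the TLB does not need
        # to be flushed when switching between VMs or between applications.
        return self.entries.get((vmid, asid, vpage))

    def invalidate_vm(self, vmid):
        """Drop every entry of one VM and leave the others cached."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != vmid}


tlb = TaggedTLB()
tlb.insert(vmid=1, asid=5, vpage=0x400, ppage=0x1000)
tlb.insert(vmid=2, asid=5, vpage=0x400, ppage=0x2000)   # same ASID, other VM
print(hex(tlb.lookup(1, 5, 0x400)), hex(tlb.lookup(2, 5, 0x400)))
```

The two VMs use the same ASID and the same virtual page, yet their entries do not collide because the VMID differs.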

  • The page tables of both Stage 1 and Stage 2 contain attribute settings, such as access permissions and memory types. The MMU combines the two into one final effective value by selecting the more restrictive attribute, as shown in the following figure:

  • In the figure, the Device attribute is the more restrictive one, so the Device type is selected;
  • If the Hypervisor wants to change the default combining behavior, it can do so through the HCR_EL2 (Hypervisor Configuration Register) register, for example to force Non-cacheable or Write-Back Cacheable;
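
The default "more restrictive wins" rule can be sketched as follows. This is a simplified model under stated assumptions: it uses only four memory types (the architecture defines more, plus shareability), the labels and function names are invented for the example, and it does not model the HCR_EL2 overrides.

```python
# Ordered from most to least restrictive; a simplified subset of the real types.
RESTRICTIVENESS = ["Device", "Normal-NC", "Normal-WT", "Normal-WB"]


def combine_memtype(stage1, stage2):
    """Pick whichever of the two memory types is more restrictive."""
    return min(stage1, stage2, key=RESTRICTIVENESS.index)


def combine_perms(stage1, stage2):
    """An access is allowed only if both stages allow it."""
    return stage1 & stage2


print(combine_memtype("Normal-WB", "Device"))
print(sorted(combine_perms({"r", "w"}, {"r"})))
```

In this model a guest that maps a region as Write-Back cacheable still ends up with Device behavior if the Hypervisor marked the same region as Device at Stage 2.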

2.2.2 MMIO(Memory-Mapped Input/Output)

The physical address space seen by the Guest OS is actually the IPA address space. Just as on a real physical machine, the IPA space is divided into a memory address space and an I/O address space:

  • Accesses to peripherals fall into two cases: 1) direct access to a real peripheral; 2) the access triggers a fault and the Hypervisor emulates the peripheral in software;
  • VTTBR_EL2 (Virtualization Translation Table Base Register) holds the base address of the Stage 2 translation page tables;
  • To emulate a peripheral, the Hypervisor needs to know which peripheral and which of its registers was accessed, whether the access was a read or a write, the access size, and which register was used to transfer the data. Stage 2 translation has a dedicated register, the Hypervisor IPA Fault Address Register, EL2 (HPFAR_EL2), which records the faulting address when a fault occurs during Stage 2 translation;

An example flow for software emulation of a peripheral:

  • 1) Software in the VM tries to access a serial device;
  • 2) The access is blocked at Stage 2 translation and triggers an abort exception that is routed to EL2. The exception handler reads ESR_EL2 (Exception Syndrome Register) for information about the exception, such as the access size, the target register, and whether it was a load or a store, and reads HPFAR_EL2 to obtain the IPA of the aborted access;
  • 3) The Hypervisor uses the information from ESR_EL2 and HPFAR_EL2 to emulate the virtual peripheral. When emulation is complete, it returns to the vCPU with the ERET instruction, and execution continues from the instruction after the one that raised the exception;
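
Step 2 can be made concrete by decoding the syndrome fields by hand. The sketch below is illustrative only: the function names are hypothetical, the field positions follow the ESR_EL2 data abort encoding (EC in bits [31:26], ISV bit 24, SAS bits [23:22], SRT bits [20:16], WnR bit 6), and `fault_ipa` assumes an IPA of at most 48 bits with the upper HPFAR_EL2 bits zero. The example values, including the device address, are made up.

```python
ESR_EC_SHIFT = 26
ESR_EC_DABT_LOW = 0x24       # Data Abort from a lower Exception Level


def decode_data_abort(esr):
    """Extract the fields a hypervisor needs to emulate an MMIO access."""
    if (esr >> ESR_EC_SHIFT) & 0x3F != ESR_EC_DABT_LOW:
        raise ValueError("not a data abort from a lower EL")
    if not (esr >> 24) & 1:                   # ISV: is the syndrome valid?
        raise ValueError("no valid instruction syndrome")
    return {
        "size": 1 << ((esr >> 22) & 0x3),     # SAS: access size in bytes
        "reg": (esr >> 16) & 0x1F,            # SRT: transfer register number
        "is_write": bool((esr >> 6) & 1),     # WnR: 1 = write, 0 = read
    }


def fault_ipa(hpfar, far):
    """HPFAR_EL2 holds the faulting IPA page with IPA[12] at bit 4;
    the offset within the page comes from the low 12 bits of FAR_EL2."""
    return ((hpfar & ~0xF) << 8) | (far & 0xFFF)


# Hypothetical fault: a 4-byte store from register 1 to IPA 0x09000018.
esr = (ESR_EC_DABT_LOW << ESR_EC_SHIFT) | (1 << 24) | (2 << 22) | (1 << 16) | (1 << 6)
print(decode_data_abort(esr))
print(hex(fault_ipa(hpfar=0x90000, far=0xFFFF000012345018)))
```

With these pieces the emulation code knows the device address, the access width, the direction, and which guest register holds (or should receive) the data.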

2.2.3 SMMUs(System Memory Management Units)

DMA controllers are another case of memory access.

Without virtualization, a DMA controller operates as follows:

  • The DMA controller is programmed by a kernel driver, which ensures that operating-system-level memory protection is not violated and that user programs cannot reach restricted regions through DMA;

Under virtualization, what goes wrong when a driver in the VM interacts with the DMA controller directly? As shown in the figure below:

  • The DMA controller is not subject to Stage 2 translation, which breaks VM isolation;
  • The Guest OS treats IPAs as physical addresses, while the DMA controller sees real physical addresses, so the two views are inconsistent. Solving this in software would require trapping every interaction between the VM and the DMA controller and translating the addresses. When memory is fragmented, this is inefficient and error-prone;

SMMUs can be used to solve this problem:

  • An SMMU, also called an IOMMU, provides MMU functionality for I/O components; virtualization is just one application of the SMMU;
  • The Hypervisor can take charge of programming the SMMU so that upstream controllers (such as the DMA controller) and the VM see memory from the same perspective, while isolation is still maintained;
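
A minimal sketch of the idea, assuming the SMMU simply looks up a per-device translation table: the class `ToySMMU`, its methods, and the stream ID and addresses used are all hypothetical, and a real SMMU uses multi-level tables and considerably more configuration.

```python
PAGE_SHIFT = 12                      # assume 4 KiB pages


class ToySMMU:
    """Maps a device's stream ID to the Stage 2 table of the VM that owns it."""

    def __init__(self):
        self.stream_tables = {}

    def attach(self, stream_id, stage2_table):
        self.stream_tables[stream_id] = stage2_table

    def dma_translate(self, stream_id, ipa):
        """Translate the address a device issues (an IPA) to a physical address."""
        table = self.stream_tables.get(stream_id)
        page = ipa >> PAGE_SHIFT
        if table is None or page not in table:
            raise PermissionError(f"DMA fault: stream {stream_id}, address {ipa:#x}")
        return (table[page] << PAGE_SHIFT) | (ipa & ((1 << PAGE_SHIFT) - 1))


smmu = ToySMMU()
smmu.attach(stream_id=7, stage2_table={0x80: 0x12345})
print(hex(smmu.dma_translate(7, 0x80010)))
```

The guest driver can hand the device an IPA directly, and a DMA to a page outside the VM's table faults instead of reaching another VM's memory.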

2.3 Trapping and emulation of Instructions

The Hypervisor also needs to trap certain operations. For example, when software in a VM needs to configure low-level processor features such as power management or cache coherency, the Hypervisor must trap the operation and emulate it so that isolation is preserved and other VMs are not affected. When an operation is configured to trap, executing it raises an exception to a higher Exception Level (for example EL2, where the Hypervisor runs), and the emulation is done in the corresponding exception handler.

Here’s an example:

  • On an ARM processor, executing the WFI (Wait For Interrupt) instruction puts the CPU into a low-power state;
  • When TWI==1 in HCR_EL2 (Hypervisor Configuration Register), a WFI executed by a vCPU triggers an exception to EL2, so the Hypervisor can emulate it, typically by scheduling another vCPU;
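
The WFI case can be sketched as a small decision function. TWI is bit 13 of HCR_EL2; everything else here (the function names, the string results, the pick-the-next-vCPU policy) is an assumption made for illustration and not KVM's actual scheduling logic.

```python
HCR_TWI = 1 << 13     # HCR_EL2.TWI: trap WFI executed at EL0/EL1 to EL2


def execute_wfi(hcr_el2):
    """Return what happens when a vCPU executes WFI under the given HCR_EL2."""
    if hcr_el2 & HCR_TWI:
        return "trap to EL2"          # the hypervisor gets a chance to reschedule
    return "enter low-power state"    # the physical CPU idles


def on_wfi_trap(current, runnable):
    """Hypothetical hypervisor policy: run another runnable vCPU, if any."""
    others = [v for v in runnable if v != current]
    return others[0] if others else current


print(execute_wfi(HCR_TWI))
print(on_wfi_trap("vcpu0", ["vcpu0", "vcpu1"]))
```

Without the trap, an idle guest would put the physical CPU to sleep even though another vCPU had work to do.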

Another use of traps is to present virtual register values to the Guest OS, as follows:

  • The ID_AA64MMFR0_EL1 register reports the processor's support for memory-system features. The system may read it during boot, and the Hypervisor can present a different, virtual value to the Guest OS;
  • When the vCPU reads the register, an exception is triggered; the Hypervisor sets a virtual value in its trap handler and returns it to the vCPU;
  • Virtualizing an operation through a trap is costly, because it involves raising the exception, trapping, emulating, and returning. ID_AA64MMFR0_EL1 is accessed infrequently, so this is not a problem. For registers that are read frequently, such as MIDR_EL1 and MPIDR_EL1, trapping to the Hypervisor should be avoided for performance reasons and another mechanism is used instead: the Hypervisor sets VPIDR_EL2 and VMPIDR_EL2 before entering the VM, and when the guest reads MIDR_EL1 and MPIDR_EL1 the hardware returns the values of VPIDR_EL2 and VMPIDR_EL2, so no trap is needed;
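
The difference between the two mechanisms can be shown with a toy model that counts round trips to EL2. The class `ToyVCPU` and the register values are invented for the example; only the register names come from the architecture.

```python
class ToyVCPU:
    """Counts how many guest register reads needed a round trip to EL2."""

    def __init__(self, virtual_id_regs, vpidr_el2, vmpidr_el2):
        self.virtual_id_regs = virtual_id_regs   # values the hypervisor emulates
        self.vpidr_el2 = vpidr_el2               # programmed before VM entry
        self.vmpidr_el2 = vmpidr_el2
        self.traps = 0

    def guest_read(self, reg):
        if reg == "MIDR_EL1":
            return self.vpidr_el2       # hardware substitutes VPIDR_EL2, no trap
        if reg == "MPIDR_EL1":
            return self.vmpidr_el2      # hardware substitutes VMPIDR_EL2, no trap
        self.traps += 1                 # other ID registers: trap and emulate
        return self.virtual_id_regs[reg]


# Made-up values, used only to tell the registers apart.
vcpu = ToyVCPU({"ID_AA64MMFR0_EL1": 0x1122}, vpidr_el2=0xAAAA, vmpidr_el2=0xBBBB)
for _ in range(1000):
    vcpu.guest_read("MIDR_EL1")
vcpu.guest_read("ID_AA64MMFR0_EL1")
print(vcpu.traps)
```

A thousand reads of MIDR_EL1 cost no traps at all, while the single read of ID_AA64MMFR0_EL1 costs one.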

2.4 Virtualizing exceptions

  • Handling interrupts in a Hypervisor is complex. To make it workable, the ARM architecture includes support for virtual interrupts (vIRQs, vFIQs, vSErrors);
  • The processor can receive virtual interrupts only while executing at EL0/EL1, not at EL2/EL3;
  • The Hypervisor controls the delivery of virtual interrupts to EL0/EL1 through the HCR_EL2 register. For example, to enable vIRQs it must set HCR_EL2.IMO; once set, physical interrupts are routed to EL2 and virtual interrupts are delivered to EL1;

There are two ways to generate a virtual interrupt: 1) through control bits in HCR_EL2 inside the processor; 2) through a GIC interrupt controller (GICv2 or later). Option 1 is simple to use, but it only provides a way to generate the interrupt; the Hypervisor must still emulate the interrupt controller for the VM, and the trap-and-emulate overhead makes this a suboptimal solution.

Now consider the GIC. Readers of the earlier series of articles on the interrupt subsystem should have seen the following figure:

  • The Hypervisor maps the Virtual CPU Interface into the VM, which lets software in the VM communicate with the GIC directly. The Hypervisor only needs to do the configuration, which reduces the overhead of virtual interrupts;

An example of a virtual interrupt:

  1. A peripheral raises an interrupt signal to the GIC;
  2. The GIC generates a physical IRQ or FIQ signal. If HCR_EL2.IMO/FMO is set, the signal is routed to the Hypervisor, which determines which vCPU the interrupt should be forwarded to;
  3. The Hypervisor configures the GIC to forward the physical interrupt to that vCPU as a virtual interrupt. While the processor is still running at EL2, the virtual interrupt signal is ignored;
  4. The Hypervisor returns control to the vCPU;
  5. Once the processor is running at EL0/EL1, the virtual interrupt is taken and handled;
  • Interrupt masking on an ARMv8 processor is controlled by bits in PSTATE (for example PSTATE.I). Under virtualization these bits behave somewhat differently. For example, setting HCR_EL2.IMO routes physical IRQs to EL2 and enables vIRQs for EL0/EL1, so while running at EL0/EL1 the PSTATE.I bit masks virtual vIRQs rather than physical pIRQs;
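
The routing and masking rules can be condensed into a small model. It uses the first generation method (the pending bit HCR_EL2.VI) rather than the GIC, and it ignores EL3 routing and the TGE bit. The bit positions (IMO is bit 4, VI is bit 7) follow the HCR_EL2 layout; the function names are hypothetical.

```python
HCR_IMO = 1 << 4      # route physical IRQs to EL2 and enable vIRQs for EL0/EL1
HCR_VI = 1 << 7       # a virtual IRQ is pending


def physical_irq_target(hcr_el2):
    """EL that takes a physical IRQ arriving while the guest runs at EL0/EL1."""
    return 2 if hcr_el2 & HCR_IMO else 1


def virq_taken(hcr_el2, current_el, pstate_i):
    """True if a pending virtual IRQ is taken right now."""
    if not (hcr_el2 & HCR_IMO and hcr_el2 & HCR_VI):
        return False                  # vIRQs disabled or nothing pending
    if current_el >= 2:
        return False                  # never taken at EL2/EL3
    return not pstate_i               # at EL0/EL1, PSTATE.I masks the vIRQ


hcr = HCR_IMO | HCR_VI
print(physical_irq_target(hcr))
print(virq_taken(hcr, current_el=2, pstate_i=False))
print(virq_taken(hcr, current_el=1, pstate_i=False))
```

The same pending vIRQ is not taken while the Hypervisor runs at EL2 and is taken after the return to the guest at EL1, which matches steps 3 to 5 of the list.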

2.5 Virtualizing the Generic Timers

Let's first look at the inside of the SoC:

After simplification, it is as follows:

  • Each processor in the ARM architecture contains a set of generic timers. The figure shows two modules, the Comparators and the Counter Module. When a Comparator's value is less than or equal to the system count value, an interrupt is generated. As we all know, the timer interrupt is the heartbeat of an operating system;

The figure below shows the timeline of the vCPUs:

  • Over 4 ms of physical time, each vCPU runs for 2 ms. If vCPU0 programs a timer at T=0 to fire after 3 ms, should the interrupt be raised after 3 ms of physical time, or after 3 ms of virtual time (that is, time during which vCPU0 was actually running)? The ARM architecture supports both;
  • Software running on a vCPU can access two timers: the EL1 physical timer and the EL1 virtual timer;

EL1 physical timer and EL1 virtual timer:

  • The EL1 physical timer compares against the count from the system counter module, which gives wall-clock time;
  • The EL1 virtual timer compares against a virtual counter, which is the physical counter minus an offset;
  • The Hypervisor specifies the offset for the vCPU that is currently scheduled, so that virtual time covers only the time during which that vCPU was actually running;

Here’s an example:

  • Over a 6 ms period, each vCPU runs for 3 ms. The Hypervisor can use the offset register to adjust the time seen by each vCPU to its actual running time;
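
One way the Hypervisor could maintain the offset is sketched below. The relationship "virtual count = physical count minus offset" is the one described above, and the offset register is CNTVOFF_EL2. The class `VCPUTime`, its bookkeeping, and the use of milliseconds instead of counter ticks are assumptions made to keep the example small.

```python
class VCPUTime:
    """Tracks a per-vCPU offset so virtual time advances only while running."""

    def __init__(self):
        self.cntvoff = 0          # value the hypervisor would load into CNTVOFF_EL2
        self.descheduled_at = 0   # every vCPU starts out descheduled at time 0

    def schedule_in(self, physical_now):
        # Time spent off the CPU is added to the offset, hiding it from the guest.
        self.cntvoff += physical_now - self.descheduled_at

    def schedule_out(self, physical_now):
        self.descheduled_at = physical_now

    def virtual_count(self, physical_now):
        # What the guest reads while running: physical counter minus the offset.
        return physical_now - self.cntvoff


# 6 ms of physical time: vCPU0 runs during 0-3 ms, vCPU1 during 3-6 ms.
vcpu0, vcpu1 = VCPUTime(), VCPUTime()
vcpu0.schedule_in(0)
vcpu0.schedule_out(3)
vcpu1.schedule_in(3)
vcpu1.schedule_out(6)
vcpu0.schedule_in(6)
print(vcpu0.virtual_count(6), vcpu1.cntvoff)
```

When vCPU0 is scheduled back in at 6 ms its virtual counter still reads 3, so a virtual timer programmed for 3 ms of guest time fires based on time the guest actually ran.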

2.6 Virtualization Host Extensions(VHE)

  • Let's start with a problem: usually the Host OS kernel runs at EL1 while the code that controls virtualization runs at EL2, which means extra context switches between the two, and that is clearly inefficient;
  • VHE is designed to support type-2 Hypervisors. The extension lets the kernel run directly at EL2, which reduces the number of system registers shared between host and guest and so reduces the overhead of virtualization;

VHE is controlled by two bits of the HCR_EL2 system register, E2H and TGE, as shown in the following figure:

With VHE, the virtual address space also has to be considered, as shown in the following figure:

  • As discussed in the memory subsystem analysis, the EL0/EL1 virtual address space is divided into two separate regions, user address space (EL0) and kernel address space (EL1). EL2, by contrast, has only one virtual address region, because a Hypervisor does not host applications and so does not need separate kernel and user spaces;
  • The EL0/EL1 virtual address space also supports ASIDs (Address Space Identifiers), while EL2 does not, for the same reason: a Hypervisor does not need to support applications;

From these two points it follows that, for the Host OS to run at EL2, a second address region and ASID support must be added. Setting the HCR_EL2.E2H register bit solves this.

Running the Host OS at EL2 raises another problem, register access redirection. The kernel accesses EL1 registers such as TTBR0_EL1. When the kernel runs at EL2, setting a register bit redirects these accesses so that the kernel code does not need to be modified, as shown in the following figure:

  • Redirecting register accesses introduces a new problem: in some cases the Hypervisor needs to access the real EL1 registers. The ARM architecture therefore introduces aliases whose names end in _EL12/_EL02. As the figure below shows, with E2H==1, software at EL2 uses the alias TTBR0_EL12 to access the real TTBR0_EL1;
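
The redirection and alias rules can be written down as a name-resolution function. This is a sketch under stated assumptions: `REDIRECTED` is a partial, illustrative list of registers (the architecture defines the full set), only the _EL12 aliases are modeled, and the function name is hypothetical.

```python
# Partial, illustrative list; the architecture defines the full set.
REDIRECTED = {"SCTLR_EL1", "TTBR0_EL1", "TTBR1_EL1", "TCR_EL1", "VBAR_EL1"}


def resolve_sysreg(name, current_el, e2h):
    """Return the register actually accessed for a given register name."""
    if current_el == 2 and e2h:
        if name.endswith("_EL12"):
            return name[:-2] + "1"      # alias reaches the real EL1 register
        if name in REDIRECTED:
            return name[:-1] + "2"      # EL1 name is redirected to the EL2 register
    return name


print(resolve_sysreg("TTBR0_EL1", current_el=2, e2h=True))
print(resolve_sysreg("TTBR0_EL12", current_el=2, e2h=True))
```

Kernel code written against TTBR0_EL1 therefore keeps working unchanged at EL2, while the world-switch code uses the _EL12 names when it needs to load or save the guest's EL1 state.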

Running the Host OS at EL2 also requires exception handling to be considered. As mentioned earlier, HCR_EL2.IMO/FMO/AMO control whether physical exceptions are routed to EL1 or EL2. When running at EL0 with TGE==1, all physical exceptions are routed to EL2 (except those that SCR_EL3 routes to EL3), because Host Apps run at EL0 and the Host OS runs at EL2.

2.7 Summary

  • This article covered memory virtualization (Stage 2 translation), I/O virtualization (including SMMUs, interrupts, and so on), interrupt virtualization, and instruction trap and emulation;
  • The basic pattern is that a request for a virtualization service is routed to EL2. If the hardware supports the operation, the hardware handles it; otherwise it is emulated in software;
  • Although this article did not analyze any code, after this rough pass the overall picture is clear. You may not believe it, but I am a little excited right now;

References

《ArmV8-A virtualization.pdf》
《ARM Cortex-A Series Programmer's Guide for ARMv8-A》
arm64: Virtualization Host Extension support
