
KVM API Integration


Description

The goal of this effort is to implement KVM support in MicroV.

  • For more information on KVM, please see this.
  • For more information on MicroV, please see this.

MicroV

High Level Components

KVM normally consists of a userspace application that performs emulation and handles some VMExits (not all). Traditionally this application is QEMU, but it could also be kvmtool or a rust-vmm based VMM such as Amazon's Firecracker (just to name the big ones). In phase 1, we will focus on QEMU, ensuring MicroV is executing VMs properly on AMD hardware. Future phases will add support for rust-vmm, which requires additional KVM APIs to be implemented to work properly.

The phase 1 high level components are:

  • Bareflank Microkernel: This is what will run in Ring 0 of VMX-root. Its only job is to load the MicroV extension and handle privileged operations such as memory management, state transitions, and policy enforcement. It doesn't actually implement a hypervisor, but it is aware of the HVE tasks that an extension needs help with.
  • MicroV: This is the Bareflank Microkernel extension that runs in userspace of VMX-root. It is the thing that implements the hypervisor (sort of, more on that later). MicroV's main job is to implement the MicroV ABI Spec, which defines how a VMX-nonroot kernel and userspace communicate with MicroV to cooperatively implement a complete VMM. For phase 1, the MicroV ABI Spec is a hypercall interface, most of which just calls into the microkernel to handle state reads/writes. MicroV will also trap on VMExits and, for the most part, return the VMExit information to the kernel in VMX-nonroot, which in turn will hand the information to userspace in VMX-nonroot to actually handle, meaning for most operations MicroV is just a pass-through mechanism. In future phases, MicroV will also have to handle LAPIC and IOAPIC emulation, which is needed to support rust-vmm and is generally better for performance.
  • KVM Driver: This is a wrapper driver designed to emulate the actual KVM driver. All of the existing KVM tools expect to run from VMX root and simply IOCTL into the Linux kernel to implement Guest VM support. With MicroV, Linux is running in VMX nonroot, and therefore the Linux kernel cannot handle most of the APIs userspace will be asking of it. Instead, this driver will simply forward the IOCTL to MicroV using the MicroV ABI Spec.

Step 1 (quick demo)

The first step will be to get the following working: https://zserge.com/posts/kvm/

First, we will verify that this example works with regular KVM. Once that is done, we can use it to ensure that MicroV can run the same code. Getting MicroV to run such a simple example will ensure all of the basics are in place and working. This includes the following KVM APIs:

  • SHIM_INIT
  • SHIM_FINI
  • KVM_CREATE_VM
  • KVM_CREATE_VCPU
  • KVM_SET_USER_MEMORY_REGION
  • KVM_GET_VCPU_MMAP_SIZE
  • KVM_GET_REGS
  • KVM_SET_REGS
  • KVM_GET_SREGS
  • KVM_SET_SREGS
  • KVM_RUN

When userspace calls an IOCTL, it will end up in the entry.c code in the shim driver. The entry.c code will call the appropriate dispatch_xxx function. These dispatch functions DO NOT IMPLEMENT the hypercall. All they do is call copy_from_user/copy_to_user to/from a struct on the stack and then call the appropriate handle_xxx function in the shim's src directory. For an example of the copy functions, see the following: https://elixir.bootlin.com/linux/v5.13-rc7/source/virt/kvm/kvm_main.c#L3504
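To make the split concrete, here is a minimal sketch of what a dispatch function might look like; dispatch_kvm_get_regs, handle_kvm_get_regs, and struct shim_vcpu_t are hypothetical names that simply follow the convention above:

```c
/*
 * A minimal sketch of the dispatch pattern. The dispatch layer only moves
 * data across the user/kernel boundary; it never makes hypercalls itself.
 */
static long dispatch_kvm_get_regs(struct shim_vcpu_t *vcpu, unsigned long arg)
{
    struct kvm_regs regs;    /* struct on the stack */

    /* the handler (in src/) does the real work, OS-agnostically */
    if (handle_kvm_get_regs(vcpu, &regs))
        return -EINVAL;

    /* kernel -> user copy; returns the number of bytes NOT copied */
    if (copy_to_user((void __user *)arg, &regs, sizeof(regs)))
        return -EFAULT;

    return 0;
}
```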

The handle_xxx functions actually implement the KVM IOCTLs. The handle_xxx functions CANNOT CALL LINUX APIS. This code is designed to be common between all operating systems. If a Linux API is needed, it should be reached through a platform_xxx function. If a platform_xxx function is missing, please reach out over Slack with what you think should be added; we will need to work together to determine what the API should look like to ensure it will work with other operating systems. In general, the handle_xxx functions will be making the actual calls to mv_xxx hypercalls, implementing KVM IOCTLs that are shim only (meaning there are no hypercalls to make), and handling KVM to MicroV ABI conversions as needed.
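For contrast, a handle_xxx function for a shim-only IOCTL might look like this sketch (the file and function names are assumed; note the complete absence of Linux APIs, which is what lets the same file compile on every supported OS):

```c
/*
 * src/handle_kvm_get_vcpu_mmap_size.c (hypothetical): a shim-only IOCTL,
 * meaning no hypercall is made. SHIM_SUCCESS is an assumed status code.
 */
int64_t handle_kvm_get_vcpu_mmap_size(uint64_t *const size)
{
    *size = sizeof(struct kvm_run);    /* size of the KVM_RUN struct */
    return SHIM_SUCCESS;
}
```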

For KVM_RUN, we will need the following exit reasons implemented:

  • KVM_EXIT_IO
  • KVM_EXIT_SHUTDOWN

From a MicroV point of view, we will need the following ABIs defined and implemented:

  • mv_id_op_version
  • mv_handle_op_open_handle
  • mv_handle_op_close_handle
  • mv_pp_op_get_shared_page_gpa
  • mv_pp_op_set_shared_page_gpa
  • mv_vm_op_create_vm
  • mv_vm_op_mmio_map
  • mv_vp_op_create_vp
  • mv_vs_op_create_vs
  • mv_vs_op_reg_get_list
  • mv_vs_op_reg_set_list
  • mv_vs_op_run
  • mv_vs_op_gla_to_gpa

SHIM_INIT:
This will call mv_id_op_version and make sure the version is correct. If it is, it will call mv_handle_op_open_handle to get a handle. It will then, on each PP, call mv_pp_op_set_shared_page_gpa to set the GPA of the PP's shared page. This shared page will be used to pass non-register based arguments between MicroV and the shim. For example, mv_vs_op_run takes a structure, as there is more data than can fit in the registers alone. The shared page is used for all of this.
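A sketch of this flow, under assumptions: g_handle, g_shared_pages, platform_num_online_cpus, MV_SPEC_ID1_VAL, and MV_INVALID_HANDLE are illustrative names, and in practice each mv_pp_op_set_shared_page_gpa call must execute on the PP it configures (the CPU pinning is elided here):

```c
int64_t shim_init(void)
{
    uint64_t pp;

    /* 1. make sure MicroV speaks a spec version we understand */
    if (MV_SPEC_ID1_VAL != mv_id_op_version())
        return SHIM_FAILURE;

    /* 2. open the handle that every later hypercall requires */
    g_handle = mv_handle_op_open_handle(MV_SPEC_ID1_VAL);
    if (MV_INVALID_HANDLE == g_handle)
        return SHIM_FAILURE;

    /* 3. register one shared page per PP for non-register arguments */
    for (pp = 0U; pp < platform_num_online_cpus(); ++pp) {
        uint64_t const gpa = platform_virt_to_phys(g_shared_pages[pp]);
        if (mv_pp_op_set_shared_page_gpa(g_handle, gpa))
            return SHIM_FAILURE;
    }

    return SHIM_SUCCESS;
}
```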

SHIM_FINI:
Calls mv_handle_op_close_handle

KVM_CREATE_VM:
Calls mv_vm_op_create_vm. This will return a VMID. The shim will have to create an FD for userspace software. Any time that FD is used, the shim will need to use the VMID associated with that FD for MicroV hypercalls.
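On Linux, the FD-to-VMID association can be made with anon inodes, the same mechanism KVM itself uses. A sketch, where struct shim_vm_t, shim_vm_fops, and handle_kvm_create_vm are hypothetical:

```c
static long dispatch_kvm_create_vm(void)
{
    struct shim_vm_t *vm;

    vm = kzalloc(sizeof(*vm), GFP_KERNEL);
    if (!vm)
        return -ENOMEM;

    /* handle_kvm_create_vm calls mv_vm_op_create_vm and stores the VMID */
    if (handle_kvm_create_vm(vm)) {
        kfree(vm);
        return -EINVAL;
    }

    /* later IOCTLs on this FD recover vm (and thus the VMID) from
     * file->private_data */
    return anon_inode_getfd("shim-vm", &shim_vm_fops, vm, O_RDWR | O_CLOEXEC);
}
```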

KVM_CREATE_VCPU:
Calls mv_vp_op_create_vp and mv_vs_op_create_vs. MicroV has a VP and a VS. Most of the time, you will just work with the VS. For this first step, think of them as the same thing. These will return a VPID and a VSID. The shim will have to create an FD for userspace software. Any time that FD is used, it will need to use the VPID or VSID associated with the FD for MicroV hypercalls.

KVM_SET_USER_MEMORY_REGION:
Calls mv_vm_op_mmio_map. Will have to perform GVA to GPA conversions using virt_to_phys from the Linux kernel. DO NOT USE mv_vs_op_gva_to_gla or mv_vs_op_gla_to_gpa; those hypercalls are only needed for integration testing (or if we ever need to implement KVM_TRANSLATE in the future) and should not be used by the shim. This IOCTL might also have to be broken into multiple hypercalls if the request cannot fit into a single one. Also, this hypercall might perform a continuation as it is slow, so the IOCTL might have to handle this. Here are a couple of important notes:

  • Userspace for a normal VM might ask the shim to map gigabytes of memory. The memory set here is the "physical" memory that the guest will use, so it is the guest's representation of RAM. Like actual MMIO on x86, there might be multiple regions. Some are RAM, some are memory mapped PCI devices, etc. So userspace may actually call this more than once. That is why KVM has this idea of "slots". We will have to implement this.
  • The memory region provided by userspace CANNOT BE PAGED OUT. This is important. The shim driver will have to "lock" this memory. Exactly how this is done from the Linux kernel is not yet known (likely get_user_pages() or pin_user_pages()); we need to determine what Linux APIs are used to take a userspace buffer of memory and tell the kernel that it cannot be paged out.
  • Userspace will provide a "userspace" virtual address and a size. This is not a "kernel" virtual address. It is a "userspace" virtual address (thanks to Meltdown, they are not the same). MicroV only talks "guest physical addresses (GPAs)". The shim driver is executing in the kernel, which is in a VM, so the kernel's idea of a physical address is a GPA. So, all the shim has to do is a userspace address (i.e., virtual address) to physical address (i.e., GPA) translation.
  • On Linux, the virt_to_phys function may or may not work. Linux sadly has a million ways to translate from virt to phys depending on how the memory was allocated. What the right APIs are will need to be figured out. Again, look at the KVM driver as it has to do this already. Since all of this code will be in the "src" directory, it has to be cross platform, so platform_virt_to_phys should be updated and used here.
  • The userspace provided memory region is virtually contiguous. This does not mean it is physically contiguous. This means that the shim driver will need to start with the userspace provided virtual address, look up its physical address, and record it in an MDL entry. It will then have to add 4k (i.e., 0x1000) to the virtual address and look up the next physical address. If that physical address is still physically contiguous, the MDL entry's size can be increased by 4k; if it is not, a new MDL entry will have to be created. This process is then repeated until the entire userspace memory buffer has been translated, 4k page by 4k page (see the sketch after this list). This ensures that the memory in the MDL that will be provided to MicroV is kept as physically contiguous as possible. With any luck, the MDL will actually describe 2M or larger physical pages, which can save memory on the MicroV side of things. If we want to get fancy, we could even loop through the MDL for every translation to ensure that the current page being translated does not already exist in an MDL entry. For example, if we see 0x1000, 0x3000, and 0x2000, these addresses look like they are NOT physically contiguous, but we know they are; they are just not in the right order. Once all of the MDL entries are recorded, we could loop through the entries and see if we can combine them, ensuring that we provide MicroV with an MDL with the least number of entries (which really ensures that physical memory is treated as contiguously as possible).
  • Since every virtual to physical translation could be completely random, we might end up with an MDL that has one entry for every 4k page being set, and the shim may have to call MicroV's hypercall several times, as only a limited number of MDL entries fit in the shared page. To handle this, the shim should reserve a buffer of memory to fit buffer/4k entries (which is the worst case). All of the MDL entries should be calculated and combined first. Once that is done, the shim can loop through the entries, adding them to the shared page. Once the shared page is full, mv_vm_op_mmio_map is called. From there, the shim starts at the beginning of the shared page and continues to add MDL entries, and the process is repeated until MicroV has been given all of them. The reason we should calculate all of the MDL entries first, and only then start making calls to MicroV, is to ensure that we have translated all of the virtual to physical addresses and combined as many as possible.
  • Mapping memory is slow. For phase 1, we will likely just deal with this. But in the future, MicroV will likely track how much time it is taking to complete this hypercall. If it takes too long, interrupts will begin to pile up and cause problems. The "Hyper-V Top Level Functional Specification" defines roughly how long this can take before bad things happen. To ensure the hypercall only takes a certain amount of time, once that time has elapsed, MicroV will return from the hypercall with a RETRY failure code. If the shim sees this failure code, it needs to execute the hypercall again with the same exact parameters; literally just run the hypercall again, and MicroV will continue where it left off. The reason MicroV returns with the error code is that once it returns, it is likely that the shim will not actually execute right away; instead, the root VM will have to handle a bunch of interrupts. Once these are done, the kernel will continue to execute the shim, and since the shim will see the RETRY status code, it will simply execute the hypercall again, and MicroV will continue where it left off. This provides MicroV with a means to give the root VM some time to do housekeeping during long running hypercalls.
  • The ABI actually provides a means for the shim to tell MicroV that it will handle a continuation when it wants to, using a hypercall flag. This allows the shim to make other hypercalls before performing the continuation in case it has housekeeping to do as well. This flag exists, but we don't have any plans to support it right now; we simply added it in case it is needed, as Xen currently has this.
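Pulling the notes above together, here is a sketch of the two-pass MDL approach. Everything below is illustrative: mv_mdl_entry_t, MV_MDL_MAX_ENTRIES, MV_STATUS_RETRY, platform_memcpy, shared_page_for_current_pp, and the mv_vm_op_mmio_map signature are all assumptions, the memory is assumed to already be locked, and the per-entry guest-GPA bookkeeping is omitted for brevity:

```c
struct mv_mdl_entry_t {
    uint64_t gpa;     /* start of a physically contiguous run */
    uint64_t size;    /* length of the run in bytes */
};

static int64_t shim_mmio_map(uint64_t vmid, uint8_t const *uva, uint64_t size,
                             struct mv_mdl_entry_t *mdl)
{
    uint64_t i;
    uint64_t num = 0U;

    /* pass 1: translate every 4k page, merging physically contiguous runs
     * so the MDL ends up with as few entries as possible */
    for (i = 0U; i < size; i += 0x1000U) {
        uint64_t const phys = platform_virt_to_phys(uva + i);
        if (num > 0U && (mdl[num - 1U].gpa + mdl[num - 1U].size) == phys) {
            mdl[num - 1U].size += 0x1000U;    /* still contiguous: grow */
        } else {
            mdl[num].gpa = phys;              /* discontinuity: new entry */
            mdl[num].size = 0x1000U;
            ++num;
        }
    }

    /* pass 2: hand MicroV the MDL one shared page at a time. If MicroV
     * runs out of time, it returns RETRY; re-issuing the identical
     * hypercall lets it continue where it left off. */
    for (i = 0U; i < num; i += MV_MDL_MAX_ENTRIES) {
        uint64_t n = num - i;
        if (n > MV_MDL_MAX_ENTRIES)
            n = MV_MDL_MAX_ENTRIES;

        platform_memcpy(shared_page_for_current_pp(), &mdl[i],
                        n * sizeof(struct mv_mdl_entry_t));

        while (MV_STATUS_RETRY == mv_vm_op_mmio_map(g_handle, vmid, n)) {
            /* continuation: same exact parameters */
        }
    }

    return SHIM_SUCCESS;
}
```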

KVM_GET_VCPU_MMAP_SIZE:
Returns the size of the KVM_RUN struct. No hypercalls are needed for this.

KVM_GET_REGS:
Calls mv_vs_op_reg_get_list. This hypercall allows you to fill in a list of the registers that you want, and it will return their values or an error code. Simply ask for the registers that KVM wants for this IOCTL.

KVM_SET_REGS:
Calls mv_vs_op_reg_set_list. This hypercall allows you to fill in a list of the registers that you want to set, and it will set their values or return an error code. Simply set the registers that KVM provides for this IOCTL.
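As a sketch of what the get side might look like in the shim, assuming a register descriptor list (RDL) layout in the shared page (the real layout, register IDs, and hypercall signatures come from the MicroV ABI Spec, so everything below is illustrative):

```c
struct mv_rdl_entry_t {
    uint64_t reg;    /* which register (MV_REG_RAX, etc., assumed) */
    uint64_t val;    /* the register's value (out for get, in for set) */
};

struct mv_rdl_t {
    uint64_t num_entries;
    struct mv_rdl_entry_t entries[];
};

int64_t handle_kvm_get_regs(struct shim_vcpu_t *vcpu, struct kvm_regs *regs)
{
    struct mv_rdl_t *const rdl = shared_page_for_current_pp();

    /* one entry per kvm_regs field (18 in total; two shown for brevity) */
    rdl->entries[0].reg = MV_REG_RAX;
    rdl->entries[1].reg = MV_REG_RBX;
    rdl->num_entries = 2U;

    if (mv_vs_op_reg_get_list(g_handle, vcpu->vsid))
        return SHIM_FAILURE;

    /* MicroV filled in the values; convert back to KVM's layout */
    regs->rax = rdl->entries[0].val;
    regs->rbx = rdl->entries[1].val;
    return SHIM_SUCCESS;
}
```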

KVM_GET_SREGS:
Calls mv_vs_op_reg_get_list and mv_vs_op_msr_get_list. These hypercalls allow you to fill in a list of the registers and MSRs that you want, and they will return their values or an error code. Simply ask for the registers and MSRs that KVM wants for this IOCTL.

KVM_SET_SREGS:
Calls mv_vs_op_reg_set_list and mv_vs_op_msr_set_list. These hypercalls allow you to fill in a list of the registers and MSRs that you want to set, and they will set their values or return an error code. Simply set the registers and MSRs that KVM provides for this IOCTL.

KVM_RUN:
Calls mv_vs_op_run. This IOCTL will have to translate between MicroV's ABI and KVM's. The following are some important notes:

  • The way that KVM_RUN works is that when the userspace application is ready to run the VM, it makes a call to KVM_RUN, which runs a VCPU. When guest SMP support is finally added, there would actually be more than one thread, one per VCPU, with KVM_RUN executed by each thread. When MicroV detects that there is something for the userspace app to complete, it will return from KVM_RUN with an exit reason and, in some cases, information needed to handle the exit. Once the exit has been handled by the userspace application, it executes KVM_RUN again, and the process repeats until it is time to kill the VM, or the VM kills itself.
  • MicroV will have something similar to KVM_EXIT_IO, but it will not have anything for KVM_EXIT_SHUTDOWN. KVM_EXIT_SHUTDOWN is a Linux specific thing that the shim will have to implement. How this is done is currently unknown. The application that runs the VM will be in an endless loop, so presumably a CTRL+C would be used to stop the VM. How the shim driver catches this and returns from KVM_RUN with KVM_EXIT_SHUTDOWN is unknown; look at the KVM driver to see how it handles this.
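For reference, this is what the loop looks like from userspace with plain KVM (and therefore the behavior the shim must preserve); the exit handling here is a sketch, and which port the guest writes to depends entirely on the demo's guest code:

```c
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* run points at the mmap'd kvm_run struct for this VCPU */
void run_loop(int vcpufd, struct kvm_run *run)
{
    for (;;) {
        ioctl(vcpufd, KVM_RUN, 0);  /* returns on each exit to userspace */

        switch (run->exit_reason) {
        case KVM_EXIT_IO:           /* forwarded by MicroV through the shim */
            if (run->io.direction == KVM_EXIT_IO_OUT)
                putchar(*((char *)run + run->io.data_offset));
            break;
        case KVM_EXIT_SHUTDOWN:     /* synthesized by the shim */
            return;
        default:
            return;
        }
    }
}
```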

Step 2 (QEMU demo)

The second step will be to get QEMU working without the need for MicroV to handle interrupts (meaning no IRQCHIP). https://github.com/qemu/qemu

Like step one, this will not include guest SMP support. Since QEMU will be handling guest interrupts and LAPIC, IOAPIC, PIC and PIT emulation, the guest will be slower than we would like, but this demo will provide the ability to run a full Ubuntu 18.04/20.04 Linux VM. This demo will include the ability to run a guest VM on any root VP, including root VP migration (meaning the root OS can move the guest VM from one root VP to another). The following additional IOCTLs will need to be implemented:

  • KVM_GET_API_VERSION
  • KVM_GET_MSR_INDEX_LIST
  • KVM_GET_MSR_FEATURE_INDEX_LIST
  • KVM_CHECK_EXTENSION
  • KVM_TRANSLATE
  • KVM_INTERRUPT
  • KVM_GET_MSRS
  • KVM_SET_MSRS
  • KVM_SET_CPUID
  • KVM_GET_CPUID2
  • KVM_SET_CPUID2
  • KVM_GET_FPU
  • KVM_SET_FPU
  • KVM_GET_DEBUGREGS
  • KVM_SET_DEBUGREGS
  • KVM_GET_MP_STATE
  • KVM_SET_MP_STATE
  • KVM_GET_XSAVE
  • KVM_SET_XSAVE
  • KVM_GET_XCRS
  • KVM_SET_XCRS
  • KVM_GET_SUPPORTED_CPUID
  • KVM_GET_TSC_KHZ
  • KVM_GET_ONE_REG
  • KVM_SET_ONE_REG
  • KVM_GET_EMULATED_CPUID

For KVM_RUN, we will need the following exit reasons implemented:

  • TBD

From a MicroV point of view, we will need the following ABIs defined and implemented:

  • mv_pp_op_msr_get_supported
  • mv_pp_op_msr_get_permissable
  • mv_pp_op_cpuid_get_supported
  • mv_pp_op_cpuid_get_emulated
  • mv_id_op_has_capability
  • mv_vs_op_interrupt
  • mv_vs_op_reg_get
  • mv_vs_op_reg_set
  • mv_vs_op_msr_get_list
  • mv_vs_op_msr_set_list
  • mv_vs_op_cpuid_get_list
  • mv_vs_op_cpuid_set_list
  • mv_vs_op_fpu_get_all
  • mv_vs_op_fpu_set_all
  • mv_vs_op_mp_state_get
  • mv_vs_op_mp_state_set
  • mv_vs_op_xsave_get_all
  • mv_vs_op_xsave_set_all
  • mv_pp_op_tsc_get_khz
  • mv_vm_op_mmio_unmap

KVM_GET_API_VERSION:
This simply needs to return a valid version number. We will need to pick a version that makes sense for rust-vmm and QEMU; ideally it is the latest version of KVM (stock KVM has returned 12 for years).

KVM_GET_MSR_INDEX_LIST:
Calls mv_pp_op_msr_get_supported. Will need to translate between MicroV and KVM as these ABIs will not be the same. KVM uses a list which could be larger than a page, while MicroV uses an MSR bitmap style ABI to fit everything into a single page.

KVM_GET_MSR_FEATURE_INDEX_LIST:
Calls mv_pp_op_msr_get_permissable. Will need to translate between MicroV and KVM as these ABIs will not be the same. KVM uses a list which could be larger than a page, while MicroV uses an MSR bitmap style ABI to fit everything into a single page.
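A sketch of the conversion the shim would do for both of these IOCTLs, assuming MicroV hands back a 4k page used as a bitmap with one bit per MSR index (the real encoding comes from the MicroV ABI Spec):

```c
/* convert one bitmap page into KVM-style MSR indices. "base" is the MSR
 * index that bit 0 of the page represents (e.g., 0x0 or 0xC0000000). */
static uint64_t msr_bitmap_to_list(uint8_t const *page, uint32_t base,
                                   uint32_t *indices, uint64_t max)
{
    uint64_t bit;
    uint64_t n = 0U;

    for (bit = 0U; bit < 0x1000U * 8U && n < max; ++bit) {
        if (page[bit / 8U] & (1U << (bit % 8U)))
            indices[n++] = base + (uint32_t)bit;
    }

    return n;    /* how many MSRs this page reported */
}
```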

KVM_CHECK_EXTENSION:
Calls mv_id_op_has_capability. Some of the capabilities will just be enabled by the shim driver without having to call MicroV as the ABI will just be there. But if there are ABIs that are optional for MicroV, the shim can call mv_id_op_has_capability and relay the information to userspace as needed.

KVM_TRANSLATE:
Calls mv_vs_op_gla_to_gpa. MicroV uses registers only for this, so the shim will have to translate from MicroV registers to the kvm_translation struct. It should be noted that KVM's docs say "virtual" address, but the struct takes a linear address. Since the API does not include a segment register, it has to be a GLA to GPA conversion, as you cannot convert a virtual address to a physical address without knowing what segment to use. Someone clearly figured this out when they updated the kvm_translation struct, but nobody updated the documentation. KVM in this case is assuming that a virtual address is the same thing as a linear address, and they are not: virtual addresses use segmentation, while linear addresses use paging. It just so happens that on most systems the segment base for CS, SS and DS is set to 0, so the virtual address and linear address appear the same.
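A sketch of the shim side, using the real kvm_translation struct from <linux/kvm.h> but an assumed mv_vs_op_gla_to_gpa signature and return encoding (GPA with flag bits in the low 12 bits):

```c
int64_t handle_kvm_translate(struct shim_vcpu_t *vcpu,
                             struct kvm_translation *tr)
{
    uint64_t gpa_and_flags;

    /* despite KVM's "virtual address" wording, linear_address is a GLA */
    gpa_and_flags = mv_vs_op_gla_to_gpa(g_handle, vcpu->vsid,
                                        tr->linear_address);
    if (!gpa_and_flags)
        return SHIM_FAILURE;

    tr->physical_address = gpa_and_flags & ~0xFFFULL;  /* strip flag bits */
    tr->valid = 1;
    return SHIM_SUCCESS;
}
```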

KVM_INTERRUPT:
Calls mv_vs_op_interrupt

KVM_GET_MSRS:
Calls mv_vs_op_msr_get_list

KVM_SET_MSRS:
Calls mv_vs_op_msr_set_list

KVM_GET_CPUID2:
Calls mv_vs_op_cpuid_get_list

KVM_SET_CPUID:
KVM_SET_CPUID2:
Calls mv_vs_op_cpuid_set_list

KVM_GET_FPU:
Calls mv_vs_op_fpu_get_all

KVM_SET_FPU:
Calls mv_vs_op_fpu_set_all

KVM_GET_DEBUGREGS:
Calls mv_vs_op_reg_get_debug

KVM_SET_DEBUGREGS:
Calls mv_vs_op_reg_set_debug

KVM_GET_MP_STATE:
Calls mv_vs_op_mp_state_get

KVM_SET_MP_STATE:
Calls mv_vs_op_mp_state_set

KVM_GET_XSAVE:
Calls mv_vs_op_xsave_get_all

KVM_SET_XSAVE:
Calls mv_vs_op_xsave_set_all

KVM_GET_XCRS:
Calls mv_vs_op_reg_get, asking for XCR0

KVM_SET_XCRS:
Calls mv_vs_op_reg_set, asking for XCR0

KVM_GET_SUPPORTED_CPUID:
Calls mv_pp_op_cpuid_get_supported

KVM_GET_TSC_KHZ:
Calls mv_pp_op_tsc_get_khz

KVM_GET_ONE_REG:
Calls mv_vs_op_reg_get

KVM_SET_ONE_REG:
Calls mv_vs_op_reg_set

KVM_GET_EMULATED_CPUID:
Calls mv_pp_op_cpuid_get_emulated

Step 3 (rust-vmm demo)

The third step will be to get rust-vmm working. This will require MicroV to handle interrupts, meaning MicroV will have to emulate the LAPIC, IOAPIC, PIC and PIT. https://github.com/rust-vmm/vmm-reference

Like the previous demos, this will not include guest SMP support. PCI pass-through will, however, be included to ensure that the emulated devices provided by MicroV are working properly. To start, PCI pass-through will only include NICs. To support PCI pass-through, MicroV will enable external interrupt exiting in the root VM. Future versions of MicroV might include a root VM driver to trap on PCI pass-through specific interrupts and forward them to the correct guest VMs as needed, allowing MicroV to disable external interrupt exiting for the root VM. But for now, external interrupt exiting will be used to simplify this demo. The following additional IOCTLs will need to be implemented:

  • KVM_CREATE_IRQCHIP
  • KVM_IRQ_LINE
  • KVM_GET_IRQCHIP
  • KVM_SET_IRQCHIP
  • KVM_SET_GSI_ROUTING
  • KVM_GET_LAPIC
  • KVM_SET_LAPIC
  • KVM_IOEVENTFD
  • KVM_SIGNAL_MSI
  • KVM_CREATE_PIT2
  • KVM_GET_PIT2
  • KVM_SET_PIT2
  • KVM_IRQFD

Step 4 (higher TRL demo)

The fourth step will be to raise the TRL of the previous demos and include lockdown support (i.e., deprivilege the root VM). This will include support for additional test systems and PCI pass-through devices. All unit and integration testing will be complete as well. To deprivilege the root VM, a complete analysis will need to be made as to what memory and additional guest state (e.g., general purpose and system registers) QEMU and rust-vmm require. All additional resources will be locked down to prevent access by the root VM. This lockdown should take place before the first mv_vs_op_run is called.

Future Steps

In the future, the following will also be added:

  • Guest SMP support
  • Support for Intel
  • Support for ARM (aarch64, ServerReady only)
  • Support for RISC-V (tbd)
  • Support for Windows guest VMs
  • Support for RTOS guest VMs
  • Support for Windows root VM
  • Support for RTOS root VM
  • Support for AMD nested virtualization
  • Support for Intel nested virtualization
  • Support for device domains

Additional Notes

There are some IOCTLs that we are not sure are needed. If they are, they should be moved from this list to the lists above as needed. These include:

  • KVM_GET_VCPU_EVENTS
  • KVM_SET_VCPU_EVENTS
  • KVM_ENABLE_CAP
  • KVM_CREATE_DEVICE
  • KVM_GET_DEVICE_ATTR
  • KVM_SET_DEVICE_ATTR
  • KVM_HAS_DEVICE_ATTR
  • KVM_SET_TSS_ADDR
  • KVM_SET_IDENTITY_MAP_ADDR

IOCTLs that we believe are not needed are as follows. Again, if these are needed, they should be removed from this list and added to the lists above. Keep in mind that if they are needed, it is possible that we are not providing the right capabilities to software, and we might simply need to tweak the capabilities instead. Finally, some of these might be needed if we add support for nested virtualization, SEV/TDX, or VM migration.

  • KVM_GET_DIRTY_LOG
  • KVM_SET_SIGNAL_MASK
  • KVM_XEN_HVM_CONFIG
  • KVM_GET_CLOCK
  • KVM_SET_CLOCK
  • KVM_SET_BOOT_CPU_ID
  • KVM_SET_TSC_KHZ
  • KVM_NMI
  • KVM_KVMCLOCK_CTRL
  • KVM_SET_GUEST_DEBUG
  • KVM_SMI
  • KVM_X86_GET_MCE_CAP_SUPPORTED
  • KVM_X86_SETUP_MCE
  • KVM_X86_SET_MCE
  • KVM_MEMORY_ENCRYPT_OP
  • KVM_MEMORY_ENCRYPT_REG_REGION
  • KVM_MEMORY_ENCRYPT_UNREG_REGION
  • KVM_HYPERV_EVENTFD
  • KVM_GET_NESTED_STATE
  • KVM_SET_NESTED_STATE
  • KVM_REGISTER_COALESCED_MMIO
  • KVM_UNREGISTER_COALESCED_MMIO
  • KVM_CLEAR_DIRTY_LOG
  • KVM_GET_SUPPORTED_HV_CPUID
  • KVM_SET_PMU_EVENT_FILTER