# Architecture

The principal characteristics of crosvm are:

- A process per virtual device, made using fork on Linux
- Each process is sandboxed using [minijail]
- Support for several CPU architectures, operating systems, and [hypervisors]
- Written in Rust for security and safety

A typical session of crosvm starts in `main.rs` where command line parsing is done to build up a
`Config` structure. The `Config` is used by `run_config` in `src/crosvm/sys/unix.rs` to set up and
execute a VM. Broken down into rough steps:

1. Load the Linux kernel from an ELF or bzImage file.
1. Create a handful of control sockets used by the virtual devices.
1. Invoke the architecture-specific VM builder `Arch::build_vm` (located in `x86_64/src/lib.rs`,
   `aarch64/src/lib.rs`, or `riscv64/src/lib.rs`).
1. `Arch::build_vm` will create a `RunnableLinuxVm` to represent a virtual machine instance.
1. `create_devices` creates every PCI device, including the virtio devices, that was configured in
   `Config`, along with matching [minijail] configs for each.
1. `Arch::assign_pci_addresses` assigns an address to each PCI device, prioritizing devices that
   report a preferred slot by implementing the `PciDevice` trait's `preferred_address` function.
1. `Arch::generate_pci_root`, using a list of every PCI device with optional `Minijail`, will
   finally jail the PCI devices and construct a `PciRoot` that communicates with them.
1. Once the VM has been built, it's contained within a `RunnableLinuxVm` object that is used by the
   VCPUs and control loop to service requests until shutdown.

## Forking

During the device creation routine, each device will be created and then wrapped in a `ProxyDevice`
which will internally `fork` (but not `exec`) and [minijail] the device, while dropping it for the
main process.
The only interaction that the device is capable of having with the main process is via
the proxied trait methods of `BusDevice`, shared memory mappings such as the guest memory, and file
descriptors that were specifically allowed by that device's security policy. This can lead to some
surprising behavior, such as file descriptors that were once valid now being invalid.

## Sandboxing Policy

Every sandbox is made with [minijail] and starts with `create_sandbox_minijail` in the `jail` crate,
which sets some very restrictive settings. Linux namespaces and seccomp filters are used for
sandboxing. Each seccomp policy can be found under `jail/seccomp/{arch}/{device}.policy` and should
start by `@include`-ing the `common_device.policy`. With the exception of architecture specific
devices (such as `Pl030` on ARM or `I8042` on x86_64), every device will need a different policy for
each supported architecture.

## The VM Control Sockets

For the operations that devices need to perform on the global VM state, such as mapping into guest
memory address space, there are the VM control sockets. There are a few kinds, split by the type of
request and response that the socket will process. This also provides basic security privilege
separation in case a device becomes compromised by a malicious guest. For example, a rogue device
that is able to allocate MSI routes would not be able to use the same socket to (de)register guest
memory. During the device initialization stage, each device that requires some aspect of VM control
will have a constructor that requires the corresponding control socket. The control socket will get
preserved when the device is sandboxed and the other side of the socket will be waited on in the
main process's control loop.

The socket exposed by crosvm with the `--socket` command line argument is another form of the VM
control socket.
Because the protocol of the control socket is internal and unstable, the only
supported way of using that resulting named unix domain socket is via crosvm command line
subcommands such as `crosvm stop` or programmatically via the [`crosvm_control`] library.

## GuestMemory

`GuestMemory` and its friends `VolatileMemory`, `VolatileSlice`, `MemoryMapping`, and
`SharedMemory`, are common types used throughout crosvm to interact with guest memory. Use the
following guidelines to decide which one to use where:

- `GuestMemory` is for sending around references to all of the guest memory. It can be cloned
  freely, but the underlying guest memory is always the same. Internally, it's implemented using
  `MemoryMapping` and `SharedMemory`. Note that `GuestMemory` is mapped into the host address space
  (for non-protected VMs), but it is non-contiguous. Device memory, such as mapped DMA-Bufs, is not
  present in `GuestMemory`.
- `SharedMemory` wraps a `memfd` and can be mapped using `MemoryMapping` to access its data.
  `SharedMemory` can't be cloned.
- `VolatileMemory` is a trait that exposes generic access to non-contiguous memory. `GuestMemory`
  implements this trait. Use this trait for functions that operate on a memory space but don't
  necessarily need it to be guest memory.
- `VolatileSlice` is analogous to a Rust slice, but unlike those, a `VolatileSlice` has data that
  can be changed asynchronously by all those that reference it. Exclusive mutability and data
  synchronization are not available when it comes to a `VolatileSlice`. This type is useful for
  functions that operate on contiguous shared memory, such as a single entry from a scatter gather
  table, or for safe wrappers around functions which operate on pointers, such as a `read` or
  `write` syscall.
- `MemoryMapping` is a safe wrapper around anonymous and file mappings. Provides RAII and does
  munmap after use.
  Access via Rust references is forbidden, but indirect reading and writing is
  available via `VolatileSlice` and several convenience functions. This type is most useful for
  mapping memory unrelated to `GuestMemory`.

See [memory layout](https://crosvm.dev/book/appendix/memory_layout.html) for details on how crosvm
arranges the guest address space.

## Device Model

### `Bus`/`BusDevice`

The root of the crosvm device model is the `Bus` structure and its friend the `BusDevice` trait. The
`Bus` structure is a virtual computer bus used to emulate the memory-mapped I/O bus and also I/O
ports for x86 VMs. On a read or write to an address on a VM's bus, the corresponding `Bus` object is
queried for a `BusDevice` that occupies that address. `Bus` will then forward the read/write to the
`BusDevice`. Because of this behavior, only one `BusDevice` may exist at any given address. However,
a `BusDevice` may be placed at more than one address range. Depending on how a `BusDevice` was
inserted into the `Bus`, the forwarded read/write will be relative to 0 or to the start of the
address range that the `BusDevice` occupies (which would be ambiguous if the `BusDevice` occupied
more than one range).

Only the base address of a multi-byte read/write is used to search for a device, so a device
implementation should be aware that the last address of a single read/write may be outside its
address range. For example, if a `BusDevice` was inserted at base address 0x1000 with a length of
0x40, a 4-byte read by a VCPU at 0x103e would be forwarded to that `BusDevice`, even though the
last two bytes of the read fall outside the range.

Each `BusDevice` is reference counted and wrapped in a mutex, so implementations of `BusDevice` need
not worry about synchronizing their access across multiple VCPUs and threads. Each VCPU will get a
complete copy of the `Bus`, so there is no contention for querying the `Bus` about an address.
Once the `BusDevice` is found, the `Bus` will acquire an exclusive lock to the device and forward
the VCPU's read/write. The implementation of the `BusDevice` will block execution of the VCPU that
invoked it, as well as any other VCPU attempting access, until it returns from its method.

Most devices in crosvm do not implement `BusDevice` directly, but some examples that do are `i8042`
and `Serial`. With the exception of PCI devices, all devices are inserted by architecture specific
code (which may call into the architecture-neutral `arch` crate). A `BusDevice` can be proxied to a
sandboxed process using `ProxyDevice`, which will create the second process using a fork, with no
exec.

### `PciConfigIo`/`PciConfigMmio`

In order to use the more complex PCI bus, there are a couple of adapters that implement `BusDevice`
and call into a `PciRoot` with higher level calls to `config_space_read`/`config_space_write`. The
`PciConfigMmio` is a `BusDevice` for insertion into the MMIO `Bus` for ARM devices. For x86_64,
`PciConfigIo` is inserted into the I/O port `Bus`. There is only one implementation of `PciRoot`
that is used by either of the `PciConfig*` structures. Because these devices are very simple, they
have very little code or state. They aren't sandboxed and are run as part of the main process.

### `PciRoot`/`PciDevice`/`VirtioPciDevice`

The `PciRoot`, analogous to `Bus` for `BusDevice`s, contains all the `PciDevice` trait objects.
Because of a shortcut (or hack), the `ProxyDevice` only supports jailing `BusDevice` traits.
Therefore, `PciRoot` only contains `BusDevice`s, even though they also implement `PciDevice`. In
fact, every `PciDevice` also implements `BusDevice` because of a blanket implementation
(`impl<T: PciDevice> BusDevice for T { … }`).
There are a few PCI related methods in `BusDevice` to
allow the `PciRoot` to still communicate with the underlying `PciDevice` (yes, this abstraction is
very leaky). Most devices will not implement `PciDevice` directly, instead using the
`VirtioPciDevice` implementation for virtio devices, but the xHCI (USB) controller is an example
that implements `PciDevice` directly. The `VirtioPciDevice` is an implementation of `PciDevice` that
wraps a `VirtioDevice`, which is how the virtio specified PCI transport is adapted to a transport
agnostic `VirtioDevice` implementation.

### `VirtioDevice`

The `VirtioDevice` is the most widely implemented trait among the device traits. Each of the
different virtio devices (block, rng, net, etc.) implements this trait directly and they follow a
similar pattern. Most of the trait methods are easily filled in with basic information about the
specific device, but `activate` will be the heart of the implementation. It's called by the virtio
transport after the guest's driver has indicated the device has been configured and is ready to run.
The virtio device implementation will receive the run time related resources (`GuestMemory`,
`Interrupt`, etc.) for processing virtio queues and associated interrupts via the arguments to
`activate`, but `activate` can't spend its time actually processing the queues. A VCPU will be
blocked as long as `activate` is running. Every device uses `activate` to launch a worker thread
that takes ownership of run time resources to do the actual processing. There is some subtlety in
dealing with virtio queues, so the smart thing to do is copy a simpler device and adapt it, such as
the rng device (`rng.rs`).

## Communication Framework

Because of the multi-process nature of crosvm, communication is done over several IPC primitives.
The common ones are shared memory pages, unix sockets, anonymous pipes, and various other file
descriptor variants (DMA-buf, eventfd, etc.). Standard methods (`read`/`write`) of using these
primitives may be used, but crosvm has developed some helpers which should be used where applicable.

### `WaitContext`

Most threads in crosvm will have a wait loop using a [`WaitContext`], which is a wrapper around
`epoll` on Linux and `WaitForMultipleObjects` on Windows. In either case, waitable objects can be
added to the context along with an associated token, whose type is the type parameter of
`WaitContext`. A call to the `wait` function will block until at least one of the waitable objects
has become signaled and will return a collection of the tokens associated with those objects. The
tokens used with `WaitContext` must be convertible to and from a `u64`. There is a custom derive
`#[derive(EventToken)]` which can be applied to an `enum` declaration that makes it easy to use your
own enum in a `WaitContext`.

#### Linux Platform Limitations

The limitations of `WaitContext` on Linux are the same as the limitations of `epoll`. The same FD
cannot be inserted more than once, and the FD will be automatically removed if the process runs out
of references to that FD. A `dup`/`fork` call will increment that reference count, so closing the
original FD will not actually remove it from the `WaitContext`. It is possible to receive tokens
from `WaitContext` for an FD that was closed because of a race condition in which an event was
registered in the background before the `close` happened. Best practice is to keep an FD open and
remove it from the `WaitContext` before closing it so that events associated with it can be reliably
eliminated.

### `serde` with Descriptors

Using raw sockets and pipes to communicate is very inconvenient for rich data types.
To help make
this easier and less error prone, crosvm uses the `serde` crate. To allow transmitting types with
embedded descriptors (FDs on Linux or HANDLEs on Windows), a module is provided for sending and
receiving descriptors alongside the plain old bytes that serde consumes.

## Code Map

Source code is organized into crates, each with its own unit tests.

- `./src/` - The top-level binary front-end for using crosvm.
- `aarch64` - Support code specific to 64-bit ARM architectures.
- `base` - Safe wrappers for system facilities which provide cross-platform-compatible interfaces.
- `cros_async` - Runtime for async/await programming. This crate provides a `Future` executor based
  on `io_uring` and one based on `epoll`.
- `devices` - Virtual devices exposed to the guest OS.
- `disk` - Library to create and manipulate several types of disks such as raw disk, [qcow], etc.
- `hypervisor` - Abstract layer to interact with hypervisors. For Linux, this crate is a wrapper of
  `kvm`.
- `e2e_tests` - End-to-end tests that run a crosvm VM.
- `infra` - Infrastructure recipes for continuous integration testing.
- `jail` - Sandboxing helper library for Linux.
- `jail/seccomp` - Contains minijail seccomp policy files for each sandboxed device. Because some
  syscalls vary by architecture, the seccomp policies are split by architecture.
- `kernel_loader` - Loads kernel images in various formats to a slice of memory.
- `kvm_sys` - Low-level (mostly) auto-generated structures and constants for using KVM.
- `kvm` - Unsafe, low-level wrapper code for using `kvm_sys`.
- `media/libvda` - Safe wrapper of [libvda], a ChromeOS HW-accelerated video decoding/encoding
  library.
- `net_sys` - Low-level (mostly) auto-generated structures and constants for creating TUN/TAP
  devices.
- `net_util` - Wrapper for creating TUN/TAP devices.
- `qcow_util` - A library and a binary to manipulate [qcow] disks.
- `sync` - Our version of `std::sync::Mutex` and `std::sync::Condvar`.
- `third_party` - Third-party libraries which we are maintaining on the ChromeOS tree or the AOSP
  tree.
- `tools` - Scripts for code health such as wrappers of `rustfmt` and `clippy`.
- `vfio_sys` - Low-level (mostly) auto-generated structures, constants and ioctls for [VFIO].
- `vhost` - Wrappers for creating vhost based devices.
- `virtio_sys` - Low-level (mostly) auto-generated structures and constants for interfacing with
  kernel vhost support.
- `vm_control` - IPC for the VM.
- `vm_memory` - VM-specific memory objects.
- `x86_64` - Support code specific to 64-bit x86 machines.

[hypervisors]: https://crosvm.dev/book/hypervisors.html
[libvda]: https://chromium.googlesource.com/chromiumos/platform2/+/refs/heads/main/arc/vm/libvda/
[minijail]: https://crosvm.dev/book/appendix/minijail.html
[qcow]: https://en.wikipedia.org/wiki/Qcow
[vfio]: https://www.kernel.org/doc/html/latest/driver-api/vfio.html
[`crosvm_control`]: https://crosvm.dev/book/running_crosvm/programmatic_interaction.html
[`waitcontext`]: https://crosvm.dev/doc/base/struct.WaitContext.html
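
As a rough illustration of the lookup rule described under `Bus`/`BusDevice`, the following sketch
models a bus that matches only the first address of an access and then takes an exclusive lock on
the device. It is not crosvm's implementation: `ToyBus`, `ToyDevice`, and `Constant` are
hypothetical names invented for this example, the trait signature is an assumption, and offsets are
always forwarded relative to the start of the range (one of the two modes mentioned above).

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a bus device trait (not crosvm's real `BusDevice` signature).
trait ToyDevice: Send {
    fn read(&mut self, offset: u64, data: &mut [u8]);
}

/// Example device: fills every read with one byte and remembers the last offset it saw.
struct Constant {
    value: u8,
    last_offset: Option<u64>,
}

impl ToyDevice for Constant {
    fn read(&mut self, offset: u64, data: &mut [u8]) {
        self.last_offset = Some(offset);
        data.fill(self.value);
    }
}

/// Toy bus: maps each range's base address to (length, shared device).
/// Cloning copies the map but shares the devices, so each VCPU could hold its own copy.
#[derive(Clone, Default)]
struct ToyBus {
    ranges: BTreeMap<u64, (u64, Arc<Mutex<dyn ToyDevice>>)>,
}

impl ToyBus {
    /// Inserts `device` at `[base, base + len)`. Returns false if the range is empty,
    /// wraps around, or overlaps an existing range (one device per address).
    fn insert(&mut self, base: u64, len: u64, device: Arc<Mutex<dyn ToyDevice>>) -> bool {
        let end = match base.checked_add(len) {
            Some(end) if len > 0 => end,
            _ => return false,
        };
        // Only the existing range with the greatest start below `end` can overlap.
        let overlaps = self
            .ranges
            .range(..end)
            .next_back()
            .map_or(false, |(start, entry)| start + entry.0 > base);
        if overlaps {
            return false;
        }
        self.ranges.insert(base, (len, device));
        true
    }

    /// Forwards a read to the device whose range contains `addr`.
    /// Only the first byte's address is checked; `data` may extend past the range's end.
    fn read(&self, addr: u64, data: &mut [u8]) -> bool {
        let (base, entry) = match self.ranges.range(..=addr).next_back() {
            Some((base, entry)) => (*base, entry),
            None => return false,
        };
        if addr - base >= entry.0 {
            return false;
        }
        // Exclusive lock: any other caller blocks here until this access returns.
        entry.1.lock().unwrap().read(addr - base, data);
        true
    }
}
```

With a `Constant` device inserted at base 0x1000 with length 0x40, a 4-byte `read` at 0x103e is
forwarded with offset 0x3e even though its last two bytes lie past the end of the range, while a
read starting at 0x1040 finds no device.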