xref: /aosp_15_r20/external/crosvm/ARCHITECTURE.md (revision bb4ee6a4ae7042d18b07a98463b9c8b875e44b39)
# Architecture

The principal characteristics of crosvm are:

- A process per virtual device, made using fork on Linux
- Each process is sandboxed using [minijail]
- Support for several CPU architectures, operating systems, and [hypervisors]
- Written in Rust for security and safety

A typical session of crosvm starts in `main.rs`, where command line parsing is done to build up a
`Config` structure. The `Config` is used by `run_config` in `src/crosvm/sys/unix.rs` to set up and
execute a VM. Broken down into rough steps:

1. Load the Linux kernel from an ELF or bzImage file.
1. Create a handful of control sockets used by the virtual devices.
1. Invoke the architecture-specific VM builder `Arch::build_vm` (located in `x86_64/src/lib.rs`,
   `aarch64/src/lib.rs`, or `riscv64/src/lib.rs`).
1. `Arch::build_vm` will create a `RunnableLinuxVm` to represent a virtual machine instance.
1. `create_devices` creates every PCI device, including the virtio devices, that were configured in
   `Config`, along with matching [minijail] configs for each.
1. `Arch::assign_pci_addresses` assigns an address to each PCI device, prioritizing devices that
   report a preferred slot by implementing the `PciDevice` trait's `preferred_address` function.
1. `Arch::generate_pci_root`, using a list of every PCI device with optional `Minijail`, will
   finally jail the PCI devices and construct a `PciRoot` that communicates with them.
1. Once the VM has been built, it's contained within a `RunnableLinuxVm` object that is used by the
   VCPUs and control loop to service requests until shutdown.

## Forking

During the device creation routine, each device will be created and then wrapped in a `ProxyDevice`
which will internally `fork` (but not `exec`) and [minijail] the device, while dropping it for the
main process. The only interaction that the device is capable of having with the main process is via
the proxied trait methods of `BusDevice`, shared memory mappings such as the guest memory, and file
descriptors that were specifically allowed by that device's security policy. This can lead to
surprising behavior, such as file descriptors that were valid before the fork becoming invalid
inside the forked device process.
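
The proxied-call pattern can be sketched with in-process stand-ins. Everything below is
illustrative: the message types are invented, and a thread stands in for the forked child so the
sketch stays runnable; the real `ProxyDevice` forks and marshals `BusDevice` calls over a socket
preserved across the fork.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message types standing in for the marshalled BusDevice calls.
enum Request {
    Read { offset: u64 },
}
enum Response {
    Data(u8),
}

// Stand-in for crosvm's ProxyDevice: the parent keeps only the channel ends;
// the device state lives entirely on the other side. A thread is used here
// instead of a fork so the sketch stays runnable and self-contained.
pub struct ProxyDevice {
    req_tx: mpsc::Sender<Request>,
    resp_rx: mpsc::Receiver<Response>,
}

impl ProxyDevice {
    pub fn new() -> ProxyDevice {
        let (req_tx, req_rx) = mpsc::channel();
        let (resp_tx, resp_rx) = mpsc::channel();
        thread::spawn(move || {
            // The "child" owns the device; it can only be reached via requests.
            let device_data = [0xab_u8; 16];
            while let Ok(Request::Read { offset }) = req_rx.recv() {
                let _ = resp_tx.send(Response::Data(device_data[offset as usize]));
            }
        });
        ProxyDevice { req_tx, resp_rx }
    }

    // A proxied BusDevice-style read: marshal the call, block for the reply.
    pub fn read(&self, offset: u64) -> u8 {
        self.req_tx.send(Request::Read { offset }).unwrap();
        let Response::Data(byte) = self.resp_rx.recv().unwrap();
        byte
    }
}
```

The parent never touches the device state directly; dropping the `ProxyDevice` closes the channel
and ends the worker, mirroring how the real child process only serves proxied requests.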

## Sandboxing Policy

Every sandbox is made with [minijail] and starts with `create_sandbox_minijail` in the `jail`
crate, which sets some very restrictive settings. Linux namespaces and seccomp filters are used for
sandboxing. Each seccomp policy can be found under `jail/seccomp/{arch}/{device}.policy` and should
start by `@include`-ing the `common_device.policy`. With the exception of architecture-specific
devices (such as `Pl030` on ARM or `I8042` on x86_64), every device will need a different policy
for each supported architecture.
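
An illustrative policy file might look as follows; the `@include` path and the syscall names here
are examples rather than any real device's allowlist:

```
# jail/seccomp/{arch}/{device}.policy (illustrative contents)
@include /usr/share/policy/crosvm/common_device.policy

# Syscalls this particular device needs beyond the common set.
openat: 1
ioctl: 1
```

Keeping the shared baseline in `common_device.policy` means each device policy only lists the extra
syscalls that device actually needs.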

## The VM Control Sockets

For the operations that devices need to perform on the global VM state, such as mapping into guest
memory address space, there are the VM control sockets. There are a few kinds, split by the type of
request and response that the socket will process. This also provides basic security privilege
separation in case a device becomes compromised by a malicious guest. For example, a rogue device
that is able to allocate MSI routes would not be able to use the same socket to (de)register guest
memory. During the device initialization stage, each device that requires some aspect of VM control
will have a constructor that requires the corresponding control socket. The control socket will get
preserved when the device is sandboxed and the other side of the socket will be waited on in the
main process's control loop.

The socket exposed by crosvm with the `--socket` command line argument is another form of the VM
control socket. Because the protocol of the control socket is internal and unstable, the only
supported way of using that resulting named unix domain socket is via crosvm command line
subcommands such as `crosvm stop` or programmatically via the [`crosvm_control`] library.
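
The shape of one control-socket kind can be sketched as below. The type and function names are
illustrative (crosvm's real types live in the `vm_control` crate, with a different API), and std
channels stand in for the socket pairs so the sketch is self-contained:

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative request/response pair for one control-socket kind. A device
// holding only this socket cannot issue requests of any other kind, which is
// the privilege separation described above.
#[derive(Debug)]
pub enum VmMemoryRequest {
    RegisterMemory { size: u64 },
}

#[derive(Debug, PartialEq)]
pub enum VmMemoryResponse {
    RegisteredMemory { slot: u32 },
}

// Main-process side: a control loop servicing requests against VM state.
pub fn spawn_control_loop() -> (mpsc::Sender<VmMemoryRequest>, mpsc::Receiver<VmMemoryResponse>) {
    let (req_tx, req_rx) = mpsc::channel();
    let (resp_tx, resp_rx) = mpsc::channel();
    thread::spawn(move || {
        let mut next_slot = 0;
        // Serve requests until the device side hangs up.
        while let Ok(VmMemoryRequest::RegisterMemory { size: _ }) = req_rx.recv() {
            let _ = resp_tx.send(VmMemoryResponse::RegisteredMemory { slot: next_slot });
            next_slot += 1;
        }
    });
    (req_tx, resp_rx)
}
```

The device-side constructor would take the `Sender`/`Receiver` pair, matching the pattern of each
device requiring its control socket at construction time.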

## GuestMemory

`GuestMemory` and its friends `VolatileMemory`, `VolatileSlice`, `MemoryMapping`, and
`SharedMemory` are common types used throughout crosvm to interact with guest memory. Use the
following guidelines to determine which one to use in a given place:

- `GuestMemory` is for sending around references to all of the guest memory. It can be cloned
  freely, but the underlying guest memory is always the same. Internally, it's implemented using
  `MemoryMapping` and `SharedMemory`. Note that `GuestMemory` is mapped into the host address space
  (for non-protected VMs), but it is non-contiguous. Device memory, such as mapped DMA-Bufs, is not
  present in `GuestMemory`.
- `SharedMemory` wraps a `memfd` and can be mapped using `MemoryMapping` to access its data.
  `SharedMemory` can't be cloned.
- `VolatileMemory` is a trait that exposes generic access to non-contiguous memory. `GuestMemory`
  implements this trait. Use this trait for functions that operate on a memory space but don't
  necessarily need it to be guest memory.
- `VolatileSlice` is analogous to a Rust slice, but unlike an ordinary slice, the data in a
  `VolatileSlice` may be changed asynchronously by anyone else holding a reference to it. Exclusive
  mutability and data synchronization are not available with a `VolatileSlice`. This type is useful
  for functions that operate on contiguous shared memory, such as a single entry from a scatter
  gather table, or for safe wrappers around functions which operate on pointers, such as a `read`
  or `write` syscall.
- `MemoryMapping` is a safe wrapper around anonymous and file mappings. It provides RAII and does
  munmap after use. Access via Rust references is forbidden, but indirect reading and writing is
  available via `VolatileSlice` and several convenience functions. This type is most useful for
  mapping memory unrelated to `GuestMemory`.

See [memory layout](https://crosvm.dev/book/appendix/memory_layout.html) for details on how crosvm
arranges the guest address space.
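
As a rough illustration of why `VolatileMemory` is a trait, here is a much-simplified stand-in: the
trait and its methods below are invented for this sketch (the real trait hands out `VolatileSlice`s
rather than copying bytes), but it shows how a function can operate on "some memory space" without
requiring it to be guest memory:

```rust
// Simplified stand-in for the VolatileMemory idea: generic access to a
// memory space, so callers don't care whether it is guest memory.
pub trait VolatileMemory {
    fn write_at(&mut self, offset: usize, data: &[u8]);
    fn read_at(&self, offset: usize, out: &mut [u8]);
}

// Any backing store can implement it; a Vec stands in for a mapping here.
pub struct VecMemory(pub Vec<u8>);

impl VolatileMemory for VecMemory {
    fn write_at(&mut self, offset: usize, data: &[u8]) {
        self.0[offset..offset + data.len()].copy_from_slice(data);
    }
    fn read_at(&self, offset: usize, out: &mut [u8]) {
        out.copy_from_slice(&self.0[offset..offset + out.len()]);
    }
}

// A function that needs "a memory space", not guest memory specifically.
pub fn checksum<M: VolatileMemory>(mem: &M, offset: usize, len: usize) -> u32 {
    let mut buf = vec![0u8; len];
    mem.read_at(offset, &mut buf);
    buf.iter().map(|&b| u32::from(b)).sum()
}
```

A `GuestMemory`-like type could implement the same trait, letting `checksum` run unchanged against
real guest memory.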

## Device Model

### `Bus`/`BusDevice`

The root of the crosvm device model is the `Bus` structure and its friend the `BusDevice` trait. The
`Bus` structure is a virtual computer bus used to emulate the memory-mapped I/O bus and also I/O
ports for x86 VMs. On a read or write to an address on a VM's bus, the corresponding `Bus` object is
queried for a `BusDevice` that occupies that address. `Bus` will then forward the read/write to the
`BusDevice`. Because of this behavior, only one `BusDevice` may exist at any given address. However,
a `BusDevice` may be placed at more than one address range. Depending on how a `BusDevice` was
inserted into the `Bus`, the forwarded read/write will be relative to 0 or to the start of the
address range that the `BusDevice` occupies (which would be ambiguous if the `BusDevice` occupied
more than one range).

Only the base address of a multi-byte read/write is used to search for a device, so a device
implementation should be aware that the last address of a single read/write may be outside its
address range. For example, if a `BusDevice` was inserted at base address 0x1000 with a length of
0x40, a 4-byte read by a VCPU at 0x103F would be forwarded to that `BusDevice`.
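
The lookup rule can be sketched as follows; this is a simplified stand-in, not crosvm's actual
`Bus` implementation:

```rust
use std::collections::BTreeMap;

// Minimal sketch of the Bus lookup rule: devices are keyed by base address,
// and only the base address of an access is consulted, so a multi-byte
// access near the end of a range can run past it.
pub struct Bus {
    ranges: BTreeMap<u64, (u64, &'static str)>, // base -> (len, device name)
}

impl Bus {
    pub fn new() -> Bus {
        Bus { ranges: BTreeMap::new() }
    }

    pub fn insert(&mut self, base: u64, len: u64, name: &'static str) {
        self.ranges.insert(base, (len, name));
    }

    // Resolve an access at `addr` to a device and the offset from its base.
    pub fn resolve(&self, addr: u64) -> Option<(&'static str, u64)> {
        // Find the device with the largest base <= addr, then range-check it.
        let (&base, &(len, name)) = self.ranges.range(..=addr).next_back()?;
        if addr < base + len {
            Some((name, addr - base))
        } else {
            None
        }
    }
}
```

A 4-byte read at the last byte of a range still resolves, because only the base address of the
access participates in the lookup.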

Each `BusDevice` is reference counted and wrapped in a mutex, so implementations of `BusDevice` need
not worry about synchronizing their access across multiple VCPUs and threads. Each VCPU will get a
complete copy of the `Bus`, so there is no contention for querying the `Bus` about an address. Once
the `BusDevice` is found, the `Bus` will acquire an exclusive lock to the device and forward the
VCPU's read/write. The implementation of the `BusDevice` will block execution of the VCPU that
invoked it, as well as any other VCPU attempting access, until it returns from its method.

Most devices in crosvm do not implement `BusDevice` directly; some that do are `i8042` and
`Serial`. With the exception of PCI devices, all devices are inserted by architecture-specific code
(which may call into the architecture-neutral `arch` crate). A `BusDevice` can be proxied to a
sandboxed process using `ProxyDevice`, which will create the second process using a fork, with no
exec.

### `PciConfigIo`/`PciConfigMmio`

In order to use the more complex PCI bus, there are a couple of adapters that implement `BusDevice`
and call into a `PciRoot` with higher-level calls to `config_space_read`/`config_space_write`. The
`PciConfigMmio` is a `BusDevice` for insertion into the MMIO `Bus` for ARM devices. For x86_64,
`PciConfigIo` is inserted into the I/O port `Bus`. There is only one implementation of `PciRoot`
that is used by either of the `PciConfig*` structures. Because these devices are very simple, they
have very little code or state. They aren't sandboxed and are run as part of the main process.

### `PciRoot`/`PciDevice`/`VirtioPciDevice`

The `PciRoot`, analogous to `BusDevice` for `Bus`s, contains all the `PciDevice` trait objects.
Because of a shortcut (or hack), the `ProxyDevice` only supports jailing `BusDevice` traits.
Therefore, `PciRoot` only contains `BusDevice`s, even though they also implement `PciDevice`. In
fact, every `PciDevice` also implements `BusDevice` because of a blanket implementation
(`impl<T: PciDevice> BusDevice for T { … }`). There are a few PCI related methods in `BusDevice` to
allow the `PciRoot` to still communicate with the underlying `PciDevice` (yes, this abstraction is
very leaky). Most devices will not implement `PciDevice` directly, instead using the
`VirtioPciDevice` implementation for virtio devices, but the xHCI (USB) controller is an example
that implements `PciDevice` directly. The `VirtioPciDevice` is an implementation of `PciDevice` that
wraps a `VirtioDevice`, which is how the virtio-specified PCI transport is adapted to a
transport-agnostic `VirtioDevice` implementation.

### `VirtioDevice`

The `VirtioDevice` is the most widely implemented trait among the device traits. Each of the
different virtio devices (block, rng, net, etc.) implements this trait directly, and they follow a
similar pattern. Most of the trait methods are easily filled in with basic information about the
specific device, but `activate` will be the heart of the implementation. It's called by the virtio
transport after the guest's driver has indicated the device has been configured and is ready to run.
The virtio device implementation will receive the run time related resources (`GuestMemory`,
`Interrupt`, etc.) for processing virtio queues and associated interrupts via the arguments to
`activate`, but `activate` can't spend its time actually processing the queues. A VCPU will be
blocked as long as `activate` is running. Every device uses `activate` to launch a worker thread
that takes ownership of run time resources to do the actual processing. There is some subtlety in
dealing with virtio queues, so the smart thing to do is copy a simpler device and adapt it, such as
the rng device (`rng.rs`).
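
The `activate` pattern described above can be sketched with stand-in types; the names below are
invented for illustration, and a channel stands in for the queue events the real worker would wait
on:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-ins for the runtime resources passed to `activate` (the real ones
// are `GuestMemory`, `Interrupt`, and the virtio `Queue`s plus their events).
pub struct QueueEvents {
    pub rx: mpsc::Receiver<u16>, // pretend queue-available notifications
}

pub struct Worker {
    pub handle: thread::JoinHandle<u32>,
}

// The `activate` pattern: take ownership of the resources, hand them to a
// worker thread, and return immediately so the calling VCPU is not blocked.
pub fn activate(events: QueueEvents) -> Worker {
    let handle = thread::spawn(move || {
        let mut descriptors_processed = 0;
        // Worker loop: a real device would wait on queue events with a
        // WaitContext and pop descriptor chains from the virtio queue.
        while events.rx.recv().is_ok() {
            descriptors_processed += 1;
        }
        descriptors_processed
    });
    Worker { handle }
}
```

The key property is that `activate` only moves resources and spawns; all queue processing happens
on the worker thread after `activate` has already returned.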

## Communication Framework

Because of the multi-process nature of crosvm, communication is done over several IPC primitives.
The common ones are shared memory pages, unix sockets, anonymous pipes, and various other file
descriptor variants (DMA-buf, eventfd, etc.). Standard methods (`read`/`write`) of using these
primitives may be used, but crosvm has developed some helpers which should be used where applicable.

### `WaitContext`

Most threads in crosvm will have a wait loop using a [`WaitContext`], which is a wrapper around an
`epoll` on Linux and `WaitForMultipleObjects` on Windows. In either case, waitable objects can be
added to the context along with an associated token, whose type is the type parameter of
`WaitContext`. A call to the `wait` function will block until at least one of the waitable objects
has become signaled and will return a collection of the tokens associated with those objects. The
tokens used with `WaitContext` must be convertible to and from a `u64`. There is a custom derive
`#[derive(EventToken)]` which can be applied to an `enum` declaration that makes it easy to use your
own enum in a `WaitContext`.
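
What the derive provides can be approximated by hand; the method names below mirror the idea of a
`u64` round trip but are illustrative rather than `base`'s exact API:

```rust
// A hand-written sketch of roughly what `#[derive(EventToken)]` generates:
// a round-trippable mapping between the enum and a u64. Variants with data
// would pack that data into the remaining bits; plain variants shown here.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Token {
    Exit,
    QueueAvailable,
    ChildSignal,
}

impl Token {
    pub fn as_raw_token(self) -> u64 {
        match self {
            Token::Exit => 0,
            Token::QueueAvailable => 1,
            Token::ChildSignal => 2,
        }
    }

    pub fn from_raw_token(raw: u64) -> Option<Token> {
        match raw {
            0 => Some(Token::Exit),
            1 => Some(Token::QueueAvailable),
            2 => Some(Token::ChildSignal),
            _ => None,
        }
    }
}
```

A wait loop then matches on the recovered token to decide which waitable object fired.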

#### Linux Platform Limitations

The limitations of `WaitContext` on Linux are the same as the limitations of `epoll`. The same FD
cannot be inserted more than once, and the FD will be automatically removed if the process runs out
of references to that FD. A `dup`/`fork` call will increment that reference count, so closing the
original FD will not actually remove it from the `WaitContext`. It is possible to receive tokens
from `WaitContext` for an FD that was closed because of a race condition in which an event was
registered in the background before the `close` happened. Best practice is to keep an FD open and
remove it from the `WaitContext` before closing it so that events associated with it can be reliably
eliminated.

### `serde` with Descriptors

Using raw sockets and pipes to communicate is very inconvenient for rich data types. To help make
this easier and less error prone, crosvm uses the `serde` crate. To allow transmitting types with
embedded descriptors (FDs on Linux or HANDLEs on Windows), a module is provided for sending and
receiving descriptors alongside the plain old bytes that serde consumes.
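
The underlying pattern can be sketched without `serde` itself; the types below are invented for
illustration, showing descriptors being split into an out-of-band table during serialization:

```rust
use std::convert::TryInto;

// Sketch of the pattern (not crosvm's actual module): each descriptor is
// pulled out into a side table and replaced in the byte stream by its index;
// the descriptor list then travels out-of-band (on Linux, as SCM_RIGHTS
// ancillary data on the same socket).
pub type RawDescriptor = i32;

pub struct Message {
    pub id: u32,
    pub fd: RawDescriptor,
}

// "Serialize": plain bytes go in the buffer, the descriptor goes in `fds`.
pub fn serialize(msg: &Message, fds: &mut Vec<RawDescriptor>) -> Vec<u8> {
    let index = fds.len() as u32;
    fds.push(msg.fd);
    let mut bytes = msg.id.to_le_bytes().to_vec();
    bytes.extend_from_slice(&index.to_le_bytes());
    bytes
}

// "Deserialize": indices in the byte stream are looked up in the received
// descriptor table to rebuild the typed message.
pub fn deserialize(bytes: &[u8], fds: &[RawDescriptor]) -> Message {
    let id = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
    let index = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
    Message { id, fd: fds[index as usize] }
}
```

Splitting the descriptors out lets the serializer handle only plain bytes, while the transport
delivers the descriptors by the platform-specific mechanism.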

## Code Map

Source code is organized into crates, each with their own unit tests.

- `./src/` - The top-level binary front-end for using crosvm.
- `aarch64` - Support code specific to 64-bit ARM architectures.
- `base` - Safe wrappers for system facilities which provide cross-platform-compatible interfaces.
- `cros_async` - Runtime for async/await programming. This crate provides a `Future` executor based
  on `io_uring` and one based on `epoll`.
- `devices` - Virtual devices exposed to the guest OS.
- `disk` - Library to create and manipulate several types of disks such as raw disk, [qcow], etc.
- `hypervisor` - Abstract layer to interact with hypervisors. For Linux, this crate is a wrapper of
  `kvm`.
- `e2e_tests` - End-to-end tests that run a crosvm VM.
- `infra` - Infrastructure recipes for continuous integration testing.
- `jail` - Sandboxing helper library for Linux.
- `jail/seccomp` - Contains minijail seccomp policy files for each sandboxed device. Because some
  syscalls vary by architecture, the seccomp policies are split by architecture.
- `kernel_loader` - Loads kernel images in various formats to a slice of memory.
- `kvm_sys` - Low-level (mostly) auto-generated structures and constants for using KVM.
- `kvm` - Unsafe, low-level wrapper code for using `kvm_sys`.
- `media/libvda` - Safe wrapper of [libvda], a ChromeOS HW-accelerated video decoding/encoding
  library.
- `net_sys` - Low-level (mostly) auto-generated structures and constants for creating TUN/TAP
  devices.
- `net_util` - Wrapper for creating TUN/TAP devices.
- `qcow_util` - A library and a binary to manipulate [qcow] disks.
- `sync` - Our version of `std::sync::Mutex` and `std::sync::Condvar`.
- `third_party` - Third-party libraries which we are maintaining on the ChromeOS tree or the AOSP
  tree.
- `tools` - Scripts for code health such as wrappers of `rustfmt` and `clippy`.
- `vfio_sys` - Low-level (mostly) auto-generated structures, constants and ioctls for [VFIO].
- `vhost` - Wrappers for creating vhost based devices.
- `virtio_sys` - Low-level (mostly) auto-generated structures and constants for interfacing with
  kernel vhost support.
- `vm_control` - IPC for the VM.
- `vm_memory` - VM-specific memory objects.
- `x86_64` - Support code specific to 64-bit x86 machines.

[hypervisors]: https://crosvm.dev/book/hypervisors.html
[libvda]: https://chromium.googlesource.com/chromiumos/platform2/+/refs/heads/main/arc/vm/libvda/
[minijail]: https://crosvm.dev/book/appendix/minijail.html
[qcow]: https://en.wikipedia.org/wiki/Qcow
[vfio]: https://www.kernel.org/doc/html/latest/driver-api/vfio.html
[`crosvm_control`]: https://crosvm.dev/book/running_crosvm/programmatic_interaction.html
[`waitcontext`]: https://crosvm.dev/doc/base/struct.WaitContext.html