.. SPDX-License-Identifier: GPL-2.0

Hibernating Guest VMs
=====================

Background
----------
Linux supports the ability to hibernate itself in order to save power.
Hibernation is sometimes called suspend-to-disk, as it writes a memory
image to disk and puts the hardware into the lowest possible power
state. Upon resume from hibernation, the hardware is restarted and the
memory image is restored from disk so that it can resume execution
where it left off. See the "Hibernation" section of
Documentation/admin-guide/pm/sleep-states.rst.

Hibernation is usually done on devices with a single user, such as a
personal laptop. For example, the laptop goes into hibernation when
the cover is closed, and resumes when the cover is opened again.
Hibernation and resume happen on the same hardware, and the Linux
kernel code orchestrating the hibernation steps assumes that the
hardware configuration does not change while in the hibernated state.

Hibernation can be initiated within Linux by writing "disk" to
/sys/power/state or by invoking the reboot system call with the
appropriate arguments. This functionality may be wrapped by user space
commands such as "systemctl hibernate" that are run directly from a
command line or in response to events such as the laptop lid closing.
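
For illustration, a minimal user-space sketch that requests hibernation
through the sysfs interface might look like the following. It is not
part of the kernel tree, must run as root, and assumes hibernation is
enabled on the system; invoking reboot() with RB_SW_SUSPEND (declared
in <sys/reboot.h>) is the system call equivalent::

  /*
   * Minimal illustrative sketch: request hibernation through sysfs.
   * Must run as root, and hibernation must be enabled in the guest.
   * The write does not complete until the system resumes (or fails).
   */
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      FILE *f = fopen("/sys/power/state", "w");

      if (!f) {
          fprintf(stderr, "/sys/power/state: %s\n", strerror(errno));
          return 1;
      }
      /* Writing "disk" asks the kernel to hibernate (suspend-to-disk). */
      if (fputs("disk", f) == EOF || fclose(f) == EOF) {
          fprintf(stderr, "hibernate request failed: %s\n", strerror(errno));
          return 1;
      }
      return 0;
  }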

Considerations for Guest VM Hibernation
---------------------------------------
Linux guests on Hyper-V can also be hibernated, in which case the
hardware is the virtual hardware provided by Hyper-V to the guest VM.
Only the targeted guest VM is hibernated, while other guest VMs and
the underlying Hyper-V host continue to run normally. While the
underlying Windows Hyper-V and the physical hardware on which it is
running might also be hibernated using hibernation functionality in
the Windows host, host hibernation and its impact on guest VMs are not
in scope for this documentation.

Resuming a hibernated guest VM can be more challenging than with
physical hardware because VMs make it very easy to change the hardware
configuration between hibernation and resume. Even when the resume is
done on the same VM that hibernated, the memory size might be changed,
or virtual NICs or SCSI controllers might be added or removed. Virtual
PCI devices assigned to the VM might be added or removed. Most such
changes cause the resume steps to fail, though adding a new virtual
NIC, SCSI controller, or vPCI device should work.

Additional complexity can ensue because the disks of the hibernated VM
can be moved to another newly created VM that otherwise has the same
virtual hardware configuration. While it is desirable for resume from
hibernation to succeed after such a move, there are challenges. See
details on this scenario and its limitations in the "Resuming on a
Different VM" section below.

Hyper-V also provides ways to move a VM from one Hyper-V host to
another. Hyper-V tries to ensure processor model and Hyper-V version
compatibility using VM Configuration Versions, and prevents moves to
a host that isn't compatible. Linux adapts to host and processor
differences by detecting them at boot time, but such detection is not
done when resuming execution in the hibernation image. If a VM is
hibernated on one host, then resumed on a host with a different
processor model or Hyper-V version, settings recorded in the
hibernation image may not match the new host. Because Linux does not
detect such mismatches when resuming the hibernation image, undefined
behavior and failures could result.

Enabling Guest VM Hibernation
-----------------------------
Hibernation of a Hyper-V guest VM is disabled by default because
hibernation is incompatible with memory hot-add, as provided by the
Hyper-V balloon driver. If hot-add is used and the VM hibernates, it
hibernates with more memory than it started with. But when the VM
resumes from hibernation, Hyper-V gives the VM only the originally
assigned memory, and the memory size mismatch causes resume to fail.

To enable a Hyper-V VM for hibernation, the Hyper-V administrator must
enable the ACPI virtual S4 sleep state in the ACPI configuration that
Hyper-V provides to the guest VM. Such enablement is accomplished by
modifying a WMI property of the VM, the steps for which are outside
the scope of this documentation but are available on the web.
Enablement is treated as the indicator that the administrator
prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V
balloon driver in Linux disables hot-add. Enablement is indicated when
/sys/power/disk lists "platform" as an option. The enablement is also
visible in /sys/bus/vmbus/hibernation. See function
hv_is_hibernation_supported().

Linux supports ACPI sleep states on x86, but not on arm64, so Linux
guest VM hibernation is not available on Hyper-V for arm64.
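
As noted above, whether the guest is enabled for hibernation can be
checked from user space. The following sketch is only an illustration
and assumes the standard sysfs layout; it reports whether
/sys/power/disk lists the "platform" option::

  /*
   * Minimal sketch: report whether the "platform" option is present in
   * /sys/power/disk, which indicates the ACPI S4 state is enabled.
   */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      char buf[256] = "";
      FILE *f = fopen("/sys/power/disk", "r");

      if (!f) {
          perror("/sys/power/disk");
          return 2;
      }
      if (!fgets(buf, sizeof(buf), f)) {
          fclose(f);
          return 2;
      }
      fclose(f);

      /*
       * The file lists the available modes with the selected one in
       * brackets, e.g. "[platform] shutdown reboot suspend".
       */
      if (strstr(buf, "platform")) {
          printf("guest hibernation enabled (platform mode available)\n");
          return 0;
      }
      printf("platform mode not available; guest hibernation not enabled\n");
      return 1;
  }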

Initiating Guest VM Hibernation
-------------------------------
Guest VMs can self-initiate hibernation using the standard Linux
methods of writing "disk" to /sys/power/state or the reboot system
call. As an additional layer, Linux guests on Hyper-V support the
"Shutdown" integration service, via which a Hyper-V administrator can
tell a Linux VM to hibernate using a command outside the VM. The
command generates a request to the Hyper-V shutdown driver in Linux,
which sends the uevent "EVENT=hibernate". See kernel functions
shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule
must be provided in the VM that handles this event and initiates
hibernation.
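
The following user-space sketch illustrates how the "EVENT=hibernate"
uevent could be consumed programmatically with libudev. It is a
hypothetical illustration, not the recommended mechanism -- a udev rule
normally does this job -- and it assumes the event is delivered on the
"vmbus" subsystem and that running "systemctl hibernate" is the desired
response::

  /*
   * Hypothetical illustration: watch for the "EVENT=hibernate" uevent
   * from the Hyper-V shutdown driver and react to it.  In practice a
   * udev rule normally does this; the subsystem name "vmbus" and the
   * use of "systemctl hibernate" are assumptions for this sketch.
   * Build with: gcc -o hv-hibernate-watch hv-hibernate-watch.c -ludev
   */
  #include <libudev.h>
  #include <poll.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
      struct udev *udev = udev_new();
      struct udev_monitor *mon;

      if (!udev)
          return 1;
      mon = udev_monitor_new_from_netlink(udev, "udev");
      if (!mon)
          return 1;
      udev_monitor_filter_add_match_subsystem_devtype(mon, "vmbus", NULL);
      udev_monitor_enable_receiving(mon);

      for (;;) {
          struct pollfd pfd = {
              .fd = udev_monitor_get_fd(mon),
              .events = POLLIN,
          };
          struct udev_device *dev;
          const char *event;

          if (poll(&pfd, 1, -1) <= 0)
              continue;
          dev = udev_monitor_receive_device(mon);
          if (!dev)
              continue;
          /* The shutdown driver sets the EVENT property to "hibernate". */
          event = udev_device_get_property_value(dev, "EVENT");
          if (event && strcmp(event, "hibernate") == 0)
              system("systemctl hibernate");
          udev_device_unref(dev);
      }
  }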

Handling VMBus Devices During Hibernation & Resume
--------------------------------------------------
The VMBus bus driver, and the individual VMBus device drivers,
implement suspend and resume functions that are called as part of the
Linux orchestration of hibernation and of resuming from hibernation.
The overall approach is to leave in place the data structures for the
primary VMBus channels and their associated Linux devices, such as
SCSI controllers and others, so that they are captured in the
hibernation image. This approach allows any state associated with the
device to be persisted across the hibernation/resume. When the VM
resumes, the devices are re-offered by Hyper-V and are connected to
the data structures that already exist in the resumed hibernation
image.

VMBus devices are identified by class and instance GUID. (See section
"VMBus device creation/deletion" in
Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation,
the resume functions expect that the devices offered by Hyper-V have
the same class/instance GUIDs as the devices present at the time of
hibernation. Having the same class/instance GUIDs allows the offered
devices to be matched to the primary VMBus channel data structures in
the memory of the now resumed hibernation image. If any devices are
offered that don't match primary VMBus channel data structures that
already exist, they are processed normally as newly added devices. If
primary VMBus channels that exist in the resumed hibernation image are
not matched with a device offered in the resumed VM, the resume
sequence waits for 10 seconds, then proceeds. But the unmatched device
is likely to cause errors in the resumed VM.

When resuming existing primary VMBus channels, the newly offered
relids might be different because relids can change on each VM boot,
even if the VM configuration hasn't changed. The VMBus bus driver
resume function matches the class/instance GUIDs, and updates the
relids in case they have changed.

VMBus sub-channels are not persisted in the hibernation image. Each
VMBus device driver's suspend function must close any sub-channels
prior to hibernation. Closing a sub-channel causes Hyper-V to send a
RESCIND_CHANNELOFFER message, which Linux processes by freeing the
channel data structures so that all vestiges of the sub-channel are
removed. By contrast, primary channels are marked closed and their
ring buffers are freed, but Hyper-V does not send a rescind message,
so the channel data structure continues to exist. Upon resume, the
device driver's resume function re-allocates the ring buffer and
re-opens the existing channel. It then communicates with Hyper-V to
re-open sub-channels from scratch.

The Linux ends of Hyper-V sockets are forced closed at the time of
hibernation. The guest can't force the host end of a socket to close,
but any subsequent host-side operations on that end will produce an
error.

VMBus devices use the same suspend function for the "freeze" and the
"poweroff" phases, and the same resume function for the "thaw" and
"restore" phases. See the "Entering Hibernation" section of
Documentation/driver-api/pm/devices.rst for the sequencing of the
phases.
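
The per-device-driver pattern described in this section can be
sketched as follows. This is an illustrative skeleton rather than code
from a real driver: the example_* names and EXAMPLE_RING_SIZE are
placeholders, while struct hv_driver, vmbus_open(), vmbus_close(), and
hv_get_drvdata() are the actual VMBus interfaces declared in
include/linux/hyperv.h::

  #include <linux/hyperv.h>

  #define EXAMPLE_RING_SIZE (16 * 4096)

  /* Placeholder driver-private state and helpers; a real driver (e.g.
   * storvsc or netvsc) has its own equivalents. */
  struct example_device;
  static void example_on_channel_callback(void *context);
  static void example_close_sub_channels(struct example_device *ed);
  static int example_request_sub_channels(struct example_device *ed);

  /* Called for both the "freeze" and "poweroff" phases. */
  static int example_suspend(struct hv_device *hv_dev)
  {
      struct example_device *ed = hv_get_drvdata(hv_dev);

      /* Close all sub-channels: Hyper-V rescinds them, so nothing
       * about them is left to be captured in the hibernation image. */
      example_close_sub_channels(ed);

      /* Mark the primary channel closed and free its ring buffer;
       * the channel data structure itself stays in memory so it is
       * captured in the hibernation image. */
      vmbus_close(hv_dev->channel);
      return 0;
  }

  /* Called for both the "thaw" and "restore" phases. */
  static int example_resume(struct hv_device *hv_dev)
  {
      struct example_device *ed = hv_get_drvdata(hv_dev);
      int ret;

      /* Re-allocate the ring buffer and re-open the preserved
       * primary channel. */
      ret = vmbus_open(hv_dev->channel, EXAMPLE_RING_SIZE,
                       EXAMPLE_RING_SIZE, NULL, 0,
                       example_on_channel_callback, ed);
      if (ret)
          return ret;

      /* Sub-channels were removed entirely at suspend time; ask
       * Hyper-V to offer and open them again from scratch. */
      return example_request_sub_channels(ed);
  }

  static struct hv_driver example_drv = {
      .name    = "example_vmbus_drv",
      .suspend = example_suspend,
      .resume  = example_resume,
      /* .probe, .remove, .id_table, etc. omitted from this sketch */
  };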

Detailed Hibernation Sequence
-----------------------------
1. The Linux power management (PM) subsystem prepares for
   hibernation by freezing user space processes and allocating
   memory to hold the hibernation image.
2. As part of the "freeze" phase, Linux PM calls the "suspend"
   function for each VMBus device in turn. As described above, this
   function removes sub-channels, and leaves the primary channel in
   a closed state.
3. Linux PM calls the "suspend" function for the VMBus bus, which
   closes any Hyper-V socket channels and unloads the top-level
   VMBus connection with the Hyper-V host.
4. Linux PM disables non-boot CPUs, creates the hibernation image in
   the previously allocated memory, then re-enables non-boot CPUs.
   The hibernation image contains the memory data structures for the
   closed primary channels, but no sub-channels.
5. As part of the "thaw" phase, Linux PM calls the "resume" function
   for the VMBus bus, which re-establishes the top-level VMBus
   connection and requests that Hyper-V re-offer the VMBus devices.
   As offers are received for the primary channels, the relids are
   updated as previously described.
6. Linux PM calls the "resume" function for each VMBus device. Each
   device re-opens its primary channel, and communicates with Hyper-V
   to re-establish sub-channels if appropriate. The sub-channels
   are re-created as new channels since they were previously removed
   entirely in Step 2.
7. With VMBus devices now working again, Linux PM writes the
   hibernation image from memory to disk.
8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff"
   phase. VMBus channels are closed and the top-level VMBus
   connection is unloaded.
9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state
   S4. Hibernation is now complete.

Detailed Resume Sequence
------------------------
1. The guest VM boots into a fresh Linux OS instance. During boot,
   the top-level VMBus connection is established, and synthetic
   devices are enabled. This happens via the normal paths that don't
   involve hibernation.
2. Linux PM hibernation code reads swap space to find and read the
   hibernation image into memory. If there is no hibernation image,
   then this boot becomes a normal boot.
3. If this is a resume from hibernation, the "freeze" phase is used
   to shut down VMBus devices and unload the top-level VMBus
   connection in the running fresh OS instance, just like Steps 2
   and 3 in the hibernation sequence.
4. Linux PM disables non-boot CPUs, and transfers control to the
   read-in hibernation image. In the now-running hibernation image,
   non-boot CPUs are restarted.
5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6
   from the hibernation sequence. The top-level VMBus connection is
   re-established, and offers are received and matched to primary
   channels in the image. Relids are updated. VMBus device resume
   functions re-open primary channels and re-create sub-channels.
6. Linux PM exits the hibernation resume sequence and the VM is now
   running normally from the hibernation image.

Key-Value Pair (KVP) Pseudo-Device Anomalies
--------------------------------------------
The VMBus KVP device behaves differently from other pseudo-devices
offered by Hyper-V. When the KVP primary channel is closed, Hyper-V
sends a rescind message, which causes all vestiges of the device to be
removed. But Hyper-V then re-offers the device, causing it to be newly
re-created. The removal and re-creation occur during the "freeze"
phase of hibernation, so the hibernation image contains the re-created
KVP device. Similar behavior occurs during the "freeze" phase of the
resume sequence while still in the fresh OS instance. But in both
cases, the top-level VMBus connection is subsequently unloaded, which
causes the device to be discarded on the Hyper-V side. So no harm is
done and everything still works.

Virtual PCI devices
-------------------
Virtual PCI devices are physical PCI devices that are mapped directly
into the VM's physical address space so the VM can interact directly
with the hardware. vPCI devices include those accessed via what Hyper-V
calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC
Virtual Function (VF) devices. See Documentation/virt/hyperv/vpci.rst.

Hyper-V DDA devices are offered to guest VMs after the top-level VMBus
connection is established, just like VMBus synthetic devices. They are
statically assigned to the VM, and their instance GUIDs don't change
unless the Hyper-V administrator makes changes to the configuration.
DDA devices are represented in Linux as virtual PCI devices that have
a VMBus identity as well as a PCI identity. Consequently, Linux guest
hibernation first handles DDA devices as VMBus devices in order to
manage the VMBus channel. But then they are also handled as PCI
devices using the hibernation functions implemented by their native
PCI driver.

SR-IOV NIC VFs also have a VMBus identity as well as a PCI identity,
and overall are processed similarly to DDA devices. A difference is
that VFs are not offered to the VM during its initial boot. Instead,
the VMBus synthetic NIC driver first starts operating and communicates
to Hyper-V that it is prepared to accept a VF, and then the VF offer
is made. However, the VMBus connection might later be unloaded and
then re-established without the VM being rebooted, as happens in Steps
3 and 5 in the Detailed Hibernation Sequence above and in the Detailed
Resume Sequence. In such a case, the VFs were likely already added to
the VM during the initial boot, so when the VMBus connection is
re-established, they are offered on the re-established connection
without intervention by the synthetic NIC driver.

UIO Devices
-----------
A VMBus device can be exposed to user space using the Hyper-V UIO
driver (uio_hv_generic.c) so that a user space driver can control and
operate the device. However, the VMBus UIO driver does not support the
suspend and resume operations needed for hibernation. If a VMBus
device is configured to use the UIO driver, hibernating the VM fails
and Linux continues to run normally. The most common use of the Hyper-V
UIO driver is for DPDK networking, but there are other uses as well.

Resuming on a Different VM
--------------------------
This scenario occurs in the Azure public cloud, where a hibernated
customer VM exists only as saved configuration and disks -- the VM no
longer exists on any Hyper-V host. When the customer VM is resumed, a
new Hyper-V VM with identical configuration is created, likely on a
different Hyper-V host. That new Hyper-V VM becomes the resumed
customer VM, and the steps the Linux kernel takes to resume from the
hibernation image must work in that new VM.

While the disks and their contents are preserved from the original VM,
the Hyper-V-provided VMBus instance GUIDs of the disk controllers and
other synthetic devices would typically be different. The difference
would cause the resume from hibernation to fail, so several things are
done to solve this problem:

* For VMBus synthetic devices that support only a single instance,
  Hyper-V always assigns the same instance GUIDs. For example, the
  Hyper-V mouse, the shutdown pseudo-device, the time sync pseudo
  device, etc., always have the same instance GUID, both for local
  Hyper-V installs as well as in the Azure cloud.

* VMBus synthetic SCSI controllers may have multiple instances in a
  VM, and in the general case instance GUIDs vary from VM to VM.
  However, Azure VMs always have exactly two synthetic SCSI
  controllers, and Azure code overrides the normal Hyper-V behavior
  so these controllers are always assigned the same two instance
  GUIDs. Consequently, when a customer VM is resumed on a newly
  created VM, the instance GUIDs match. But this guarantee does not
  hold for local Hyper-V installs.

* Similarly, VMBus synthetic NICs may have multiple instances in a
  VM, and the instance GUIDs vary from VM to VM. Again, Azure code
  overrides the normal Hyper-V behavior so that the instance GUID
  of a synthetic NIC in a customer VM does not change, even if the
  customer VM is deallocated or hibernated, and then re-constituted
  on a newly created VM. As with SCSI controllers, this behavior
  does not hold for local Hyper-V installs.

* vPCI devices do not have the same instance GUIDs when resuming
  from hibernation on a newly created VM. Consequently, Azure does
  not support hibernation for VMs that have DDA devices such as
  NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the
  VF from the VM before it hibernates so that the hibernation image
  does not contain a VF device. When the VM is resumed it
  instantiates a new VF, rather than trying to match against a VF
  that is present in the hibernation image. Because Azure must
  remove any VFs before initiating hibernation, Azure VM
  hibernation must be initiated externally from the Azure Portal or
  Azure CLI, which in turn uses the Shutdown integration service to
  tell Linux to do the hibernation. If hibernation is self-initiated
  within the Azure VM, VFs remain in the hibernation image, and are
  not resumed properly.

In summary, Azure takes special actions to remove VFs and to ensure
that VMBus device instance GUIDs match on a new/different VM, allowing
hibernation to work for most general-purpose Azure VM sizes. While
similar special actions could be taken when resuming on a different VM
on a local Hyper-V install, orchestrating such actions is not provided
out-of-the-box by local Hyper-V and so requires custom scripting.