1 2## Basic Definitions for Patching 3 4**Binary**: Executable image and data. Binaries may persist in an archive 5(e.g., chrome.7z), and need to be periodically updated. Formats for binaries 6include {PE files EXE / DLL, ELF, DEX}. Architectures binaries include 7{x86, x64, ARM, AArch64, Dalvik}. A binary is also referred to as an executable 8or an image file. 9 10**Patching**: Sending a "new" file to clients who have an "old" file by 11computing and transmitting a "patch" that can be used to transform "old" into 12"new". Patches are compressed for transmission. A key performance metric is 13patch size, which refers to the size of compressed patch file. For our 14experiments we use 7z. 15 16**Patch generation**: Computation of a "patch" from "old" and "new". This can be 17expensive (e.g., ~15-20 min for Chrome, using 1 GB of RAM), but since patch 18generation is a run-once step on the server-side when releasing "new" binaries, 19the expense is not too critical. 20 21**Patch application**: Transformation from "old" binaries to "new", using a 22(downloaded) "patch". This is executed on client side on updates, so resource 23constraints (e.g., time, RAM, disk space) is more stringent. Also, fault- 24tolerance is important. This is usually achieved by an update system by having 25a fallback method of directly downloading "new" in case of patching failure. 26 27**Offset**: Position relative to the start of a file. 28 29**Local offset**: An offset relative to the start of a region of a file. 30 31**Element**: A region in a file with associated executable type, represented by 32the tuple (exe_type, offset, length). Every Element in new file is associated 33with an Element in old file and patched independently. 34 35**Reference**: A directed connection between two offsets in a binary. For 36example, consider jump instructions in x86: 37 38 00401000: E9 3D 00 00 00 jmp 00401042 39 40Here, the 4 bytes `[3D 00 00 00]` starting at address `00401001` point to 41address `00401042` in memory. This forms a reference from `offset(00401001)` 42(length 4) to `offset(00401042)`, where `offset(addr)` indicates the disk 43offset corresponding to `addr`. A reference has a location, length (implicitly 44determined by reference type), body, and target. 45 46**Location**: The starting offset of bytes that store a reference. In the 47preceding example, `offset(00401001)` is a location. Each location is the 48beginning of a reference body. 49 50**Body**: The span of bytes that encodes reference data, i.e., 51[location, location + length) = 52[location, location + 1, ..., location + length - 1]. 53In the preceding example, `length = 4`, so the reference body is 54`[00401001, 00401001 + 4) = [00401001, 00401002, 00401003, 00401004]`. 55All reference bodies in an image must not overlap, and often regions boundaries 56are required to not straddle a reference body. 57 58**Target**: The offset that's the destination of a reference. In the preceding 59example, `offset(00401042)` is the target. Different references can share common 60targets. For example, in 61 62 00401000: E9 3D 00 00 00 jmp 00401042 63 00401005: EB 3B jmp 00401042 64 65we have two references with different locations and bodies, but same target 66of `00401042`. 67 68Because the bytes that encode a reference depend on its target, and potentially 69on its location, they are more likely to get modified from an old version of a 70binary to a newer version. This is why "naive" patching does not do well on 71binaries. 72 73**Target Key**: An alternative representation of a Target for a fixed pool, as its 74index in the sorted list of Target offsets. Keys are useful since: 75 * Their numerical index are smaller than offsets, allowing more efficient 76 storage of target correction data in patch. 77 * They simplify association from Targets to Labels. 78 79**Disassembler**: Architecture specific data and operations, used to extract and 80correct references in a binary. 81 82**Type of reference**: The type of a reference determines the binary 83representation used to encode its target. This affects how references are parsed 84and written by a disassembler. There can be many types of references in the same 85binary. 86 87A reference is represented by the tuple (disassembler, location, target, type). 88This tuple contains sufficient information to write the reference in a binary. 89 90**Pool of targets**: Collection of targets that is assumed to have some semantic 91relationship. Each reference type belong to exactly one reference pool. Targets 92for references in the same pool are shared. 93 94For example, the following describes two pools defined for Dalvik Executable 95format (DEX). Both pools spawn multiple types of references. 96 971. Index in string table. 98 - From bytecode to string index using 16 bits. 99 - From bytecode to string index using 32 bits. 100 - From field item to string index using 32 bits. 1012. Address in code. 102 - Relative 16 bits pointer. 103 - Relative 32 bits pointer. 104 105Boundaries between different pools can be ambiguous. Having all targets belong 106to the same pool can reduce redundancy, but will use more memory and might 107cause larger corrections to happen, so this is a trade-off that can be resolved 108with benchmarks. 109 110**Abs32 references**: References whose targets are adjusted by the OS during 111program load. In an image, a **relocation table** typically provides locations 112of abs32 references. At each abs32 location, the stored bytes then encode 113semantic information about the target (e.g., as RVA). 114 115**Rel32 references**: References embedded within machine code, in which targets 116are encoded as some delta relative to the reference's location. Typical examples 117of rel32 references are branching instructions and instruction pointer-relative 118memory access. 119 120**Equivalence**: A (src_offset, dst_offset, length) tuple describing a region of 121"old" binary, at an offset of |src_offset|, that is similar to a region of "new" 122binary, at an offset of |dst_offset|. 123 124**Raw delta unit**: Describes a raw modification to apply on the new image, as a 125pair (copy_offset, diff), where copy_offset describes the position in new file 126as an offset in the data that was copied from the old file, and diff is the 127bytewise difference to apply. 128 129**Associated Targets**: A target in "old" binary is associated with a target in 130"new" binary if both targets: 1311. are part of similar regions from the same equivalence, and 1322. have the same local offset (relative to respective start regions), and 1333. are not part of any larger region from a different equivalence. 134Not all targets are necessarily associated with another target. 135 136**Target Affinity**: Level of confidence in the association between two targets. 137The affinity between targets that are potentially associated is measured based 138on surrounding content, as well as reference type. 139 140**Label**: An integer assigned for each Target in "old" and "new" binary as part 141of generating a patch, and used to alias targets when searching for similar 142regions that will form equivalences. Labels are assigned such that 143associated targets in old and new binaries share the same Label. Unmatched 144Targets have a Label of 0. For example, given 145 * "Old" targets = [0x1111, 0x3333, 0x5555, 0x7777], 146 * "New" targets = [0x2222, 0x4444, 0x6666, 0x8888], 147to represent matchings 0x1111 <=> 0x6666, 0x3333 <=> 0x2222, we'd assign 148 * Label 1 to 0x1111 (in "old") and 0x6666 (in "new"), 149 * Label 2 to 0x3333 (in "old") and 0x2222 (in "new"). 150 Represented as arrays indexed over Target Keys, we'd have: 151 * "Old" labels = [1, 2, 0 ,0], 152 * "New" labels = [2, 0, 1, 0]. 153 154**Encoded Image**: The result of projecting the content of an image to scalar 155values that describe content on a higher level of abstraction, masking away 156undesirable noise in raw content. Notably, the projection encodes references 157based on their associated label. 158 159## Interfaces 160 161zucchini_lib: Core Zucchini library that operate on buffers to generate and 162apply patches. 163 164zucchini_io: Wrapper on zucchini_lib that handles file I/O, using memory-mapped 165I/O to interface with zucchini_lib. 166 167zucchini: Stand-alone executable that parses command-line arguments, and passes 168the results to zucchini_io. Also implements various helper flows. 169 170## Zucchini Ensemble Patch Format 171 172### Types 173 174**int8**: 8-bit unsigned int. 175 176**uint32**: 32-bit unsigned int, little-endian. 177 178**int32**: 32-bit signed int, little-endian. 179 180**Varints**: This is a generic variable-length encoding for integer quantities 181that strips away leading (most-significant) null bytes. 182The Varints format is borrowed from protocol-buffers, see 183[documentation](https://developers.google.com/protocol-buffers/docs/encoding#varints) 184for more info. 185 186**varuint32**: A uint32 encoded using Varints format. 187 188**varint32**: A int32 encoded using Varints format. 189 190### File Layout 191 192Name | Format | Description 193--- | --- | --- 194header | PatchHeader | The header. 195elements_count | uint32 | Number of patch units. 196elements | PatchElement[elements_count] | List of all patch elements. 197 198Position of elements in new file is ascending. 199 200### Structures 201 202**PatchHeader** 203 204Name | Format | Description 205--- | --- | --- 206magic | uint32 = kMagic | Magic value. 207major_version | uint16 | Major version number indicating breaking changes. 208minor_version | uint16 | Minor version number indicating possibly breaking changes. 209old_size | uint32 | Size of old file in bytes. 210old_crc | uint32 | CRC32 of old file. 211new_size | uint32 | Size of new file in bytes. 212new_crc | uint32 | CRC32 of new file. 213 214**kMagic** == `'Z' | ('u' << 8) | ('c' << 16) | ('c' << 24)` 215 216**PatchElement** 217Contains all the information required to produce a single element in new file. 218 219Name | Format | Description 220--- | --- | --- 221header | PatchElementHeader | The header. 222equivalences | EquivalenceList | List of equivalences. 223raw_deltas | RawDeltaList | List of raw deltas. 224reference_deltas | ReferenceDeltaList | List of reference deltas. 225pool_count | uint32 | Number of pools. 226extra_targets | ExtraTargetList[pool_count] | Lists of extra targets. 227 228**PatchElementHeader** 229Describes a correspondence between an element in old and in new files. Some 230redundancy arise from storing |new_offset|, but it is necessary to make 231PatchElement self contained. 232 233Name | Format | Description 234--- | --- | --- 235old_offset | uint32 | Starting offset of the element in old file. 236old_length | uint32 | Length of the element in old file. 237new_offset | uint32 | Starting offset of the element in new file. 238new_length | uint32 | Length of the element in new file. 239exe_type | uint32 | Executable type for this unit, see `enum ExecutableType`. 240version | uint16 | Version specific to the executable type for this unit. 241 242**EquivalenceList** 243Encodes a list of equivalences, where dst offsets (in new image) are ascending. 244 245Name | Format | Description 246--- | --- | --- 247src_skip | Buffer<varint32> | Src offset for each equivalence, delta encoded. 248dst_skip | Buffer<varuint32> | Dst offset for each equivalence, delta encoded. 249copy_count | Buffer<varuint32> | Length for each equivalence. 250 251**RawDeltaList** 252Encodes a list of raw delta units, with ascending copy offsets. 253 254Name | Format | Description 255--- | --- | --- 256raw_delta_skip | Buffer<varuint32> | Copy offset for each delta unit, delta encoded and biased by -1. 257raw_delta_diff | Buffer<int8> | Bytewise difference for each delta unit. 258 259**ReferenceDeltaList** 260Encodes a list of reference deltas, in the order they appear in the new 261image file. A reference delta is a signed integer representing a jump through a 262list of targets. 263 264Name | Format | Description 265--- | --- | --- 266reference_delta | Buffer<varuint32> | Vector of reference deltas. 267 268**ExtraTargetList** 269Encodes a list of additional targets in the new image file, in ascending 270order. 271 272Name | Format | Description 273--- | --- | --- 274pool_tag | uint8_t | Unique identifier for this pool of targets. 275extra_targets | Buffer<varuint32> | Additional targets, delta encoded and biased by -1. 276 277**Buffer<T>** 278A generic vector of data. 279 280Name | Format | Description 281--- | --- | --- 282size |uint32 | Size of content in bytes. 283content |T[] | List of integers. 284 285# Format Changelog 286All breaking changes to zucchini patch format will be documented in this 287section. 288 289The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/). 290 291## [Unreleased] 292 293## [1.0] - 2021-10-27 294### Added 295Major/Minor version is encoded in PatchHeader 296Disassembler version associated with an element version is encoded in PatchElementHeader. 297