1# Puffin: A deterministic deflate re-compressor (for patching purposes) 2 3## Table of Contents 4 5[TOC] 6 7## Puffin 8 9Puffin is a deterministic deflate recompressor. It is mainly used for patching 10deflate compressed images (zip, gzip, etc.) because patches created from deflate 11files/streams are usually large (deflate has a bit-aligned format, hence, 12changing one byte in the raw data can cause the entire deflate stream to change 13drastically.) 14 15Puffin has two tools (operations): `puffdiff` and `puffpatch` (shown 16[here](puffin).) The purpose of `puffdiff` operation is to create a patch 17between a source and a target file with one or both of them having some deflate 18streams. This patch is used in the `puffpatch` operation to generate the target 19file from the source file deterministically. The patch itself is created by 20`bsdiff` library (but can be replaced by any other diffing mechanism). But, 21before it uses `bsdiff` to create the patch, we have to transform both the 22source and target files into a specific format so `bsdiff` can produce smaller 23patches. We define `puff` operation to perform such a transformation. The 24resulting stream is called a `puff` stream. The reverse transformation (from 25`puff` stream to deflate stream) is called a `huff` operation. `huff` is used in 26the client to transform the `puff` stream back into its original deflate stream 27deterministically. 28 29 30 31For information about deflate format see [RFC 1951]. 32 33## puff and huff 34 35`puff` is an operation that decompresses only the Huffman part of the deflate 36stream and keeps the structure of the LZ77 coding unchanged. This is roughly 37equivalent of decompressing ‘half way’. 38 39`huff` is the exact opposite of `puff` and it deterministically converts the 40`puff` stream back to its original deflate stream. This deterministic conversion 41is based on two facts: 42 43* There is no need to perform LZ77 algorithm. This means the deflate stream 44 could be built by any LZ77 algorithm. 45* The dynamic Huffman tables can be recreated uniquely from the code length 46 array stored inside the `puff` stream. This means the deflate stream can be 47 reconstructed deterministically using the Huffman tables. 48 49The inclusion of Huffman tables in the `puff` stream has minuscule space burden 50(on average maximum 320 bytes for each block. There is roughly one block per 5132KB of uncompressed data). 52 53`bsdiff` of two `puffed` streams has much smaller patch in comparison to their 54deflate streams, but it is larger than uncompressed streams. 55 56**The major benefits** 57 58* Size of the patch is smaller than deflate stream’s patch. 59* `puff` and `huff` are deterministic operations. 60* `huff` is in order of 10X faster than full recompression. It is even faster 61 than Huffman algorithm because it already has the Huffman tables and does 62 not need to reconstruct it. 63* Both algorithms have very low memory footprint. 64* The underlying LZ77 algorithm can be anything (as long as it is deflate 65 compatible). This includes google’s [zopfli] 66 67**The drawbacks** 68 69* The need to define a file format for the puffed stream and stay with this 70 format forever. If format needs to be changed in the future, then some 71 versioning mechanism should be there to handle it and backward compatibility 72 should be maintained. 73* The need to define a payload format and stay with it forever. Similarly 74 needs to be versioned if required later change. 75* Does not reduces the patch size as full recompression. 76 77 78## puffdiff and puffpatch 79 80A deflate compressed file (gzip, zip, etc.) contains multiple deflate streams at 81different locations. When this file is puffed, the resulting `puff` stream 82contains both puffs and the raw data that existed between the deflates in the 83compressed file. When performing `huff` operation, the location of puffs in the 84`puff` stream and deflates in the deflate stream should be passed upon. This is 85necessary as `huff` operation has to know exactly where the locations of both 86puffs and deflates are. (See the following [image](puffin-stream)) 87 88 89 90Similarly `puffpatch` requires deflates and puffs locations in both the source 91and target streams. These location information is saved using Protobufs in the 92patch generated by `bsdiff`. One requirement for these two operations are that 93`puffpatch` has to be efficient in the client. Client devices have limited 94memory and CPU bandwidth and it is necessary that each of these operations are 95performed with the most efficiency available. In order to achieve this 96efficiency a new operation can be added to `bspatch`, that reads and writes into 97a `puff` streams using special interfaces for puffing and huffing on the fly. 98 99 100## Puffin Patch Format 101 102* Magic (4 bytes) - The string literal "PUF1". 103* Header Length (4 bytes) - The size of the header (length of the generated 104 Protobuf). 105* Header - Lengths and locations of deflate and puffs streams in source and 106 target files in a Protobuf format. See [puffin.proto]. 107* Patch - This is a binary array directly generated by the `bsdiff` program. 108 109## Puffin Stream Format 110 111* [Block Header](#block-header-format) (3+ bytes) - Defines the type of the 112 block. 113* Data - A mix of literals list and copy length/distances. 114* End of Block (2 bytes) - The end of block symbol. Similar to the symbols for 115 copy length/distance but without the distance bytes. The actual copy length 116 value is 259 (0x81FF). 117 118### Block Header Format 119 120* Length (2 Bytes) - The size of the block header excluding the two Length 121 bytes itself - 1. 122* Final Block (1 Bit) - Whether the block is the last one or not. 123 * 1 - Final block 124 * 0 - Middle block 125* Type (2 Bits) 126 * 0 - Uncompressed. Immediately after the header, zero or one literals 127 list is present which defines the content of the uncompressed blocks. 128 * 1 - Fixed Huffman table. The list of literals and length/distances comes 129 immediately after the header. Fixed Huffman table is defined the deflate 130 specification and will not change. 131 * 2 - Dynamic Huffman table. The dynamic Huffman table comes at the end of 132 the block header. 133 * 3 - Undefined. This is an error. 134* Skip Bits (5 Bits) - Used only for uncompressed blocks (For other types its 135 value is zero). In an uncompressed block, the [RFC 1951] skips any bits 136 after reading final block and type bits until the byte boundary in the 137 input. However, it does not define what the value of these bits should 138 be. Most implementations assume 0, but some implementations may use any 139 value. This segment contains those bits as a five-bits integer. When writing 140 the block header back to the deflate format, the actual number of bits which 141 where skipped will be calculated easily. 142* Huffman Table - It only comes for dynamic Huffman table. 143 144#### Dynamic Huffman Table Format 145 146Depending on the scheme for storing the Huffman tables, the payload size can 147change. We discovered that the following scheme does not produce the smallest 148payload, but it is the most deterministic one. In a deflate stream, Huffman 149tables for literals/length and distances are themselves Huffman coded. In this 150format, we also `puff` the Huffman tables into the `puff` stream instead of 151completely decompressing them. 152 153There are three tables stored in this structure very similar to the one defined 154in [RFC 1951]. A Huffman table can be defined as an array of unsigned integer 155code length values. Three Puffed Huffman tables appear like the following 156scheme. The first table (codes) is the Huffman table used to decode the next two 157Huffman tables. The second Huffman table is used to decode literals and lengths, 158and the third Huffman table is used to decode distances. 159 160 161* Literal/Length Count (1 byte) - Number of alphabets used for literal/length 162 Huffman codes - 257. 163* Distance Count (1 byte) - Number of alphabets used for distance Huffman 164 codes - 1. 165* Alphabet Count (1 byte) - Number of alphabets for coding two previous 166 Huffman tables - 4. 167* Code Lengths ((Alphabet Count + 1) / 2 bytes) - A list of codes for reading 168 the next two Huffman tables. Each byte contains two codes. If the number of 169 codes is odd, the last four Bits will be zero. 170* Literal/Length/Distance Code Lengths - List of code lengths for encoding 171 literals/lengths/distance followed The encoding is as follows: 172 * [0..15] - Represent code [0..15] 173 * [16..19] - Copy the last code length [3..6] times. 174 * [20..27] - Repeat code length of 0 for [3..10] times. 175 * [28..155] - Repeat code length of 0 for [11..138] times. 176 177### Literals List 178Literals lists are constructed by a “length” value followed by “length” bytes of 179literals. The Puffer should try to merge adjacent literals lists as much as 180possible into one literals list in the `puff` stream. This Is a length value 181followed by length bytes of literals (Even if there is only one literal.) 182 183* Tag (1 Bit) - The value is 0 indicating that this is a list of literals (not 184 a copy length/distance). 185* Length - The number of literals that would follow in the list. 186 * (7 Bits) Length = [1 .. 127], The value is: Length - 1 187 * (7 Bits + 2 Bytes) Length = [128 .. 65663], The values are: 127, 188 Length - 127. 189 Conserves size by using only one byte if the number of upcoming 190 literals is smaller or equal to 127 (About 99% of literals length in a 191 normal deflate stream fit into this category.) We should never have zero 192 length literals. Otherwise it will use three bytes. 193* Literals (Length Bytes) - A sequence of Length number of literals. 194 195### Copy Length/Distance 196 197This Is a Length value followed by a Distance value. 198 199* Tag (1 Bit) - The value is 1 indicating that this is a copy length/distance 200 field. 201* Length - Conserves size by using only one byte if the length value is 202 smaller or equal to 129. Otherwise two bytes are used. 203 * (7 Bits) - Length = [3 .. 129], The value is: Length - 3 204 * (7 Bites + 1 Byte) Length = [130 .. 258], The value is: 127, Length - 205 130: 206* Distance (2 Bytes) - Distance = [1 .. 32768], The value is: 207 Distance - 1. The distance value as an unsigned integer. 208 209## Building Puffin 210Currently Puffin is being used in both Android and Chrome OS and is built 211differently in each of them. There is also a Makefile build, but it is not 212comprehensive. 213 214## Usage 215 216Puffin builds an executable `puffin` which can be used to perform diffing and 217patching algorithms using Puffin format. To get the list of options available 218run: 219 220```shell 221puffin --help 222``` 223 224It can also be used as a library (currently used by update_engine) that provides 225different APIs. 226 227## References 228 229[RFC 1951]: https://www.ietf.org/rfc/rfc1951.txt 230[puffin.proto]: /src/puffin.proto 231[zopfli]: https://en.wikipedia.org/wiki/Zopfli 232