1*07fb1d06SElliott Hughes# Puffin: A deterministic deflate re-compressor (for patching purposes) 2*07fb1d06SElliott Hughes 3*07fb1d06SElliott Hughes## Table of Contents 4*07fb1d06SElliott Hughes 5*07fb1d06SElliott Hughes[TOC] 6*07fb1d06SElliott Hughes 7*07fb1d06SElliott Hughes## Puffin 8*07fb1d06SElliott Hughes 9*07fb1d06SElliott HughesPuffin is a deterministic deflate recompressor. It is mainly used for patching 10*07fb1d06SElliott Hughesdeflate compressed images (zip, gzip, etc.) because patches created from deflate 11*07fb1d06SElliott Hughesfiles/streams are usually large (deflate has a bit-aligned format, hence, 12*07fb1d06SElliott Hugheschanging one byte in the raw data can cause the entire deflate stream to change 13*07fb1d06SElliott Hughesdrastically.) 14*07fb1d06SElliott Hughes 15*07fb1d06SElliott HughesPuffin has two tools (operations): `puffdiff` and `puffpatch` (shown 16*07fb1d06SElliott Hughes[here](puffin).) The purpose of `puffdiff` operation is to create a patch 17*07fb1d06SElliott Hughesbetween a source and a target file with one or both of them having some deflate 18*07fb1d06SElliott Hughesstreams. This patch is used in the `puffpatch` operation to generate the target 19*07fb1d06SElliott Hughesfile from the source file deterministically. The patch itself is created by 20*07fb1d06SElliott Hughes`bsdiff` library (but can be replaced by any other diffing mechanism). But, 21*07fb1d06SElliott Hughesbefore it uses `bsdiff` to create the patch, we have to transform both the 22*07fb1d06SElliott Hughessource and target files into a specific format so `bsdiff` can produce smaller 23*07fb1d06SElliott Hughespatches. We define `puff` operation to perform such a transformation. The 24*07fb1d06SElliott Hughesresulting stream is called a `puff` stream. The reverse transformation (from 25*07fb1d06SElliott Hughes`puff` stream to deflate stream) is called a `huff` operation. `huff` is used in 26*07fb1d06SElliott Hughesthe client to transform the `puff` stream back into its original deflate stream 27*07fb1d06SElliott Hughesdeterministically. 28*07fb1d06SElliott Hughes 29*07fb1d06SElliott Hughes 30*07fb1d06SElliott Hughes 31*07fb1d06SElliott HughesFor information about deflate format see [RFC 1951]. 32*07fb1d06SElliott Hughes 33*07fb1d06SElliott Hughes## puff and huff 34*07fb1d06SElliott Hughes 35*07fb1d06SElliott Hughes`puff` is an operation that decompresses only the Huffman part of the deflate 36*07fb1d06SElliott Hughesstream and keeps the structure of the LZ77 coding unchanged. This is roughly 37*07fb1d06SElliott Hughesequivalent of decompressing ‘half way’. 38*07fb1d06SElliott Hughes 39*07fb1d06SElliott Hughes`huff` is the exact opposite of `puff` and it deterministically converts the 40*07fb1d06SElliott Hughes`puff` stream back to its original deflate stream. This deterministic conversion 41*07fb1d06SElliott Hughesis based on two facts: 42*07fb1d06SElliott Hughes 43*07fb1d06SElliott Hughes* There is no need to perform LZ77 algorithm. This means the deflate stream 44*07fb1d06SElliott Hughes could be built by any LZ77 algorithm. 45*07fb1d06SElliott Hughes* The dynamic Huffman tables can be recreated uniquely from the code length 46*07fb1d06SElliott Hughes array stored inside the `puff` stream. This means the deflate stream can be 47*07fb1d06SElliott Hughes reconstructed deterministically using the Huffman tables. 48*07fb1d06SElliott Hughes 49*07fb1d06SElliott HughesThe inclusion of Huffman tables in the `puff` stream has minuscule space burden 50*07fb1d06SElliott Hughes(on average maximum 320 bytes for each block. There is roughly one block per 51*07fb1d06SElliott Hughes32KB of uncompressed data). 52*07fb1d06SElliott Hughes 53*07fb1d06SElliott Hughes`bsdiff` of two `puffed` streams has much smaller patch in comparison to their 54*07fb1d06SElliott Hughesdeflate streams, but it is larger than uncompressed streams. 55*07fb1d06SElliott Hughes 56*07fb1d06SElliott Hughes**The major benefits** 57*07fb1d06SElliott Hughes 58*07fb1d06SElliott Hughes* Size of the patch is smaller than deflate stream’s patch. 59*07fb1d06SElliott Hughes* `puff` and `huff` are deterministic operations. 60*07fb1d06SElliott Hughes* `huff` is in order of 10X faster than full recompression. It is even faster 61*07fb1d06SElliott Hughes than Huffman algorithm because it already has the Huffman tables and does 62*07fb1d06SElliott Hughes not need to reconstruct it. 63*07fb1d06SElliott Hughes* Both algorithms have very low memory footprint. 64*07fb1d06SElliott Hughes* The underlying LZ77 algorithm can be anything (as long as it is deflate 65*07fb1d06SElliott Hughes compatible). This includes google’s [zopfli] 66*07fb1d06SElliott Hughes 67*07fb1d06SElliott Hughes**The drawbacks** 68*07fb1d06SElliott Hughes 69*07fb1d06SElliott Hughes* The need to define a file format for the puffed stream and stay with this 70*07fb1d06SElliott Hughes format forever. If format needs to be changed in the future, then some 71*07fb1d06SElliott Hughes versioning mechanism should be there to handle it and backward compatibility 72*07fb1d06SElliott Hughes should be maintained. 73*07fb1d06SElliott Hughes* The need to define a payload format and stay with it forever. Similarly 74*07fb1d06SElliott Hughes needs to be versioned if required later change. 75*07fb1d06SElliott Hughes* Does not reduces the patch size as full recompression. 76*07fb1d06SElliott Hughes 77*07fb1d06SElliott Hughes 78*07fb1d06SElliott Hughes## puffdiff and puffpatch 79*07fb1d06SElliott Hughes 80*07fb1d06SElliott HughesA deflate compressed file (gzip, zip, etc.) contains multiple deflate streams at 81*07fb1d06SElliott Hughesdifferent locations. When this file is puffed, the resulting `puff` stream 82*07fb1d06SElliott Hughescontains both puffs and the raw data that existed between the deflates in the 83*07fb1d06SElliott Hughescompressed file. When performing `huff` operation, the location of puffs in the 84*07fb1d06SElliott Hughes`puff` stream and deflates in the deflate stream should be passed upon. This is 85*07fb1d06SElliott Hughesnecessary as `huff` operation has to know exactly where the locations of both 86*07fb1d06SElliott Hughespuffs and deflates are. (See the following [image](puffin-stream)) 87*07fb1d06SElliott Hughes 88*07fb1d06SElliott Hughes 89*07fb1d06SElliott Hughes 90*07fb1d06SElliott HughesSimilarly `puffpatch` requires deflates and puffs locations in both the source 91*07fb1d06SElliott Hughesand target streams. These location information is saved using Protobufs in the 92*07fb1d06SElliott Hughespatch generated by `bsdiff`. One requirement for these two operations are that 93*07fb1d06SElliott Hughes`puffpatch` has to be efficient in the client. Client devices have limited 94*07fb1d06SElliott Hughesmemory and CPU bandwidth and it is necessary that each of these operations are 95*07fb1d06SElliott Hughesperformed with the most efficiency available. In order to achieve this 96*07fb1d06SElliott Hughesefficiency a new operation can be added to `bspatch`, that reads and writes into 97*07fb1d06SElliott Hughesa `puff` streams using special interfaces for puffing and huffing on the fly. 98*07fb1d06SElliott Hughes 99*07fb1d06SElliott Hughes 100*07fb1d06SElliott Hughes## Puffin Patch Format 101*07fb1d06SElliott Hughes 102*07fb1d06SElliott Hughes* Magic (4 bytes) - The string literal "PUF1". 103*07fb1d06SElliott Hughes* Header Length (4 bytes) - The size of the header (length of the generated 104*07fb1d06SElliott Hughes Protobuf). 105*07fb1d06SElliott Hughes* Header - Lengths and locations of deflate and puffs streams in source and 106*07fb1d06SElliott Hughes target files in a Protobuf format. See [puffin.proto]. 107*07fb1d06SElliott Hughes* Patch - This is a binary array directly generated by the `bsdiff` program. 108*07fb1d06SElliott Hughes 109*07fb1d06SElliott Hughes## Puffin Stream Format 110*07fb1d06SElliott Hughes 111*07fb1d06SElliott Hughes* [Block Header](#block-header-format) (3+ bytes) - Defines the type of the 112*07fb1d06SElliott Hughes block. 113*07fb1d06SElliott Hughes* Data - A mix of literals list and copy length/distances. 114*07fb1d06SElliott Hughes* End of Block (2 bytes) - The end of block symbol. Similar to the symbols for 115*07fb1d06SElliott Hughes copy length/distance but without the distance bytes. The actual copy length 116*07fb1d06SElliott Hughes value is 259 (0x81FF). 117*07fb1d06SElliott Hughes 118*07fb1d06SElliott Hughes### Block Header Format 119*07fb1d06SElliott Hughes 120*07fb1d06SElliott Hughes* Length (2 Bytes) - The size of the block header excluding the two Length 121*07fb1d06SElliott Hughes bytes itself - 1. 122*07fb1d06SElliott Hughes* Final Block (1 Bit) - Whether the block is the last one or not. 123*07fb1d06SElliott Hughes * 1 - Final block 124*07fb1d06SElliott Hughes * 0 - Middle block 125*07fb1d06SElliott Hughes* Type (2 Bits) 126*07fb1d06SElliott Hughes * 0 - Uncompressed. Immediately after the header, zero or one literals 127*07fb1d06SElliott Hughes list is present which defines the content of the uncompressed blocks. 128*07fb1d06SElliott Hughes * 1 - Fixed Huffman table. The list of literals and length/distances comes 129*07fb1d06SElliott Hughes immediately after the header. Fixed Huffman table is defined the deflate 130*07fb1d06SElliott Hughes specification and will not change. 131*07fb1d06SElliott Hughes * 2 - Dynamic Huffman table. The dynamic Huffman table comes at the end of 132*07fb1d06SElliott Hughes the block header. 133*07fb1d06SElliott Hughes * 3 - Undefined. This is an error. 134*07fb1d06SElliott Hughes* Skip Bits (5 Bits) - Used only for uncompressed blocks (For other types its 135*07fb1d06SElliott Hughes value is zero). In an uncompressed block, the [RFC 1951] skips any bits 136*07fb1d06SElliott Hughes after reading final block and type bits until the byte boundary in the 137*07fb1d06SElliott Hughes input. However, it does not define what the value of these bits should 138*07fb1d06SElliott Hughes be. Most implementations assume 0, but some implementations may use any 139*07fb1d06SElliott Hughes value. This segment contains those bits as a five-bits integer. When writing 140*07fb1d06SElliott Hughes the block header back to the deflate format, the actual number of bits which 141*07fb1d06SElliott Hughes where skipped will be calculated easily. 142*07fb1d06SElliott Hughes* Huffman Table - It only comes for dynamic Huffman table. 143*07fb1d06SElliott Hughes 144*07fb1d06SElliott Hughes#### Dynamic Huffman Table Format 145*07fb1d06SElliott Hughes 146*07fb1d06SElliott HughesDepending on the scheme for storing the Huffman tables, the payload size can 147*07fb1d06SElliott Hugheschange. We discovered that the following scheme does not produce the smallest 148*07fb1d06SElliott Hughespayload, but it is the most deterministic one. In a deflate stream, Huffman 149*07fb1d06SElliott Hughestables for literals/length and distances are themselves Huffman coded. In this 150*07fb1d06SElliott Hughesformat, we also `puff` the Huffman tables into the `puff` stream instead of 151*07fb1d06SElliott Hughescompletely decompressing them. 152*07fb1d06SElliott Hughes 153*07fb1d06SElliott HughesThere are three tables stored in this structure very similar to the one defined 154*07fb1d06SElliott Hughesin [RFC 1951]. A Huffman table can be defined as an array of unsigned integer 155*07fb1d06SElliott Hughescode length values. Three Puffed Huffman tables appear like the following 156*07fb1d06SElliott Hughesscheme. The first table (codes) is the Huffman table used to decode the next two 157*07fb1d06SElliott HughesHuffman tables. The second Huffman table is used to decode literals and lengths, 158*07fb1d06SElliott Hughesand the third Huffman table is used to decode distances. 159*07fb1d06SElliott Hughes 160*07fb1d06SElliott Hughes 161*07fb1d06SElliott Hughes* Literal/Length Count (1 byte) - Number of alphabets used for literal/length 162*07fb1d06SElliott Hughes Huffman codes - 257. 163*07fb1d06SElliott Hughes* Distance Count (1 byte) - Number of alphabets used for distance Huffman 164*07fb1d06SElliott Hughes codes - 1. 165*07fb1d06SElliott Hughes* Alphabet Count (1 byte) - Number of alphabets for coding two previous 166*07fb1d06SElliott Hughes Huffman tables - 4. 167*07fb1d06SElliott Hughes* Code Lengths ((Alphabet Count + 1) / 2 bytes) - A list of codes for reading 168*07fb1d06SElliott Hughes the next two Huffman tables. Each byte contains two codes. If the number of 169*07fb1d06SElliott Hughes codes is odd, the last four Bits will be zero. 170*07fb1d06SElliott Hughes* Literal/Length/Distance Code Lengths - List of code lengths for encoding 171*07fb1d06SElliott Hughes literals/lengths/distance followed The encoding is as follows: 172*07fb1d06SElliott Hughes * [0..15] - Represent code [0..15] 173*07fb1d06SElliott Hughes * [16..19] - Copy the last code length [3..6] times. 174*07fb1d06SElliott Hughes * [20..27] - Repeat code length of 0 for [3..10] times. 175*07fb1d06SElliott Hughes * [28..155] - Repeat code length of 0 for [11..138] times. 176*07fb1d06SElliott Hughes 177*07fb1d06SElliott Hughes### Literals List 178*07fb1d06SElliott HughesLiterals lists are constructed by a “length” value followed by “length” bytes of 179*07fb1d06SElliott Hughesliterals. The Puffer should try to merge adjacent literals lists as much as 180*07fb1d06SElliott Hughespossible into one literals list in the `puff` stream. This Is a length value 181*07fb1d06SElliott Hughesfollowed by length bytes of literals (Even if there is only one literal.) 182*07fb1d06SElliott Hughes 183*07fb1d06SElliott Hughes* Tag (1 Bit) - The value is 0 indicating that this is a list of literals (not 184*07fb1d06SElliott Hughes a copy length/distance). 185*07fb1d06SElliott Hughes* Length - The number of literals that would follow in the list. 186*07fb1d06SElliott Hughes * (7 Bits) Length = [1 .. 127], The value is: Length - 1 187*07fb1d06SElliott Hughes * (7 Bits + 2 Bytes) Length = [128 .. 65663], The values are: 127, 188*07fb1d06SElliott Hughes Length - 127. 189*07fb1d06SElliott Hughes Conserves size by using only one byte if the number of upcoming 190*07fb1d06SElliott Hughes literals is smaller or equal to 127 (About 99% of literals length in a 191*07fb1d06SElliott Hughes normal deflate stream fit into this category.) We should never have zero 192*07fb1d06SElliott Hughes length literals. Otherwise it will use three bytes. 193*07fb1d06SElliott Hughes* Literals (Length Bytes) - A sequence of Length number of literals. 194*07fb1d06SElliott Hughes 195*07fb1d06SElliott Hughes### Copy Length/Distance 196*07fb1d06SElliott Hughes 197*07fb1d06SElliott HughesThis Is a Length value followed by a Distance value. 198*07fb1d06SElliott Hughes 199*07fb1d06SElliott Hughes* Tag (1 Bit) - The value is 1 indicating that this is a copy length/distance 200*07fb1d06SElliott Hughes field. 201*07fb1d06SElliott Hughes* Length - Conserves size by using only one byte if the length value is 202*07fb1d06SElliott Hughes smaller or equal to 129. Otherwise two bytes are used. 203*07fb1d06SElliott Hughes * (7 Bits) - Length = [3 .. 129], The value is: Length - 3 204*07fb1d06SElliott Hughes * (7 Bites + 1 Byte) Length = [130 .. 258], The value is: 127, Length - 205*07fb1d06SElliott Hughes 130: 206*07fb1d06SElliott Hughes* Distance (2 Bytes) - Distance = [1 .. 32768], The value is: 207*07fb1d06SElliott Hughes Distance - 1. The distance value as an unsigned integer. 208*07fb1d06SElliott Hughes 209*07fb1d06SElliott Hughes## Building Puffin 210*07fb1d06SElliott HughesCurrently Puffin is being used in both Android and Chrome OS and is built 211*07fb1d06SElliott Hughesdifferently in each of them. There is also a Makefile build, but it is not 212*07fb1d06SElliott Hughescomprehensive. 213*07fb1d06SElliott Hughes 214*07fb1d06SElliott Hughes## Usage 215*07fb1d06SElliott Hughes 216*07fb1d06SElliott HughesPuffin builds an executable `puffin` which can be used to perform diffing and 217*07fb1d06SElliott Hughespatching algorithms using Puffin format. To get the list of options available 218*07fb1d06SElliott Hughesrun: 219*07fb1d06SElliott Hughes 220*07fb1d06SElliott Hughes```shell 221*07fb1d06SElliott Hughespuffin --help 222*07fb1d06SElliott Hughes``` 223*07fb1d06SElliott Hughes 224*07fb1d06SElliott HughesIt can also be used as a library (currently used by update_engine) that provides 225*07fb1d06SElliott Hughesdifferent APIs. 226*07fb1d06SElliott Hughes 227*07fb1d06SElliott Hughes## References 228*07fb1d06SElliott Hughes 229*07fb1d06SElliott Hughes[RFC 1951]: https://www.ietf.org/rfc/rfc1951.txt 230*07fb1d06SElliott Hughes[puffin.proto]: /src/puffin.proto 231*07fb1d06SElliott Hughes[zopfli]: https://en.wikipedia.org/wiki/Zopfli 232