xref: /aosp_15_r20/external/puffin/README.md (revision 07fb1d065b7cfb4729786fadd42a612532d2f466)
1*07fb1d06SElliott Hughes# Puffin: A deterministic deflate re-compressor (for patching purposes)
2*07fb1d06SElliott Hughes
3*07fb1d06SElliott Hughes## Table of Contents
4*07fb1d06SElliott Hughes
5*07fb1d06SElliott Hughes[TOC]
6*07fb1d06SElliott Hughes
7*07fb1d06SElliott Hughes## Puffin
8*07fb1d06SElliott Hughes
9*07fb1d06SElliott HughesPuffin is a deterministic deflate recompressor. It is mainly used for patching
10*07fb1d06SElliott Hughesdeflate compressed images (zip, gzip, etc.) because patches created from deflate
11*07fb1d06SElliott Hughesfiles/streams are usually large (deflate has a bit-aligned format, hence,
12*07fb1d06SElliott Hugheschanging one byte in the raw data can cause the entire deflate stream to change
13*07fb1d06SElliott Hughesdrastically.)
14*07fb1d06SElliott Hughes
15*07fb1d06SElliott HughesPuffin has two tools (operations): `puffdiff` and `puffpatch` (shown
16*07fb1d06SElliott Hughes[here](puffin).)  The purpose of `puffdiff` operation is to create a patch
17*07fb1d06SElliott Hughesbetween a source and a target file with one or both of them having some deflate
18*07fb1d06SElliott Hughesstreams. This patch is used in the `puffpatch` operation to generate the target
19*07fb1d06SElliott Hughesfile from the source file deterministically. The patch itself is created by
20*07fb1d06SElliott Hughes`bsdiff` library (but can be replaced by any other diffing mechanism). But,
21*07fb1d06SElliott Hughesbefore it uses `bsdiff` to create the patch, we have to transform both the
22*07fb1d06SElliott Hughessource and target files into a specific format so `bsdiff` can produce smaller
23*07fb1d06SElliott Hughespatches. We define `puff` operation to perform such a transformation. The
24*07fb1d06SElliott Hughesresulting stream is called a `puff` stream. The reverse transformation (from
25*07fb1d06SElliott Hughes`puff` stream to deflate stream) is called a `huff` operation. `huff` is used in
26*07fb1d06SElliott Hughesthe client to transform the `puff` stream back into its original deflate stream
27*07fb1d06SElliott Hughesdeterministically.
28*07fb1d06SElliott Hughes
29*07fb1d06SElliott Hughes![puffin](images/puffin.png "Puffin architecture")
30*07fb1d06SElliott Hughes
31*07fb1d06SElliott HughesFor information about deflate format see [RFC 1951].
32*07fb1d06SElliott Hughes
33*07fb1d06SElliott Hughes## puff and huff
34*07fb1d06SElliott Hughes
35*07fb1d06SElliott Hughes`puff` is an operation that decompresses only the Huffman part of the deflate
36*07fb1d06SElliott Hughesstream and keeps the structure of the LZ77 coding unchanged. This is roughly
37*07fb1d06SElliott Hughesequivalent of decompressing ‘half way’.
38*07fb1d06SElliott Hughes
39*07fb1d06SElliott Hughes`huff` is the exact opposite of `puff` and it deterministically converts the
40*07fb1d06SElliott Hughes`puff` stream back to its original deflate stream. This deterministic conversion
41*07fb1d06SElliott Hughesis based on two facts:
42*07fb1d06SElliott Hughes
43*07fb1d06SElliott Hughes*   There is no need to perform LZ77 algorithm. This means the deflate stream
44*07fb1d06SElliott Hughes    could be built by any LZ77 algorithm.
45*07fb1d06SElliott Hughes*   The dynamic Huffman tables can be recreated uniquely from the code length
46*07fb1d06SElliott Hughes    array stored inside the `puff` stream. This means the deflate stream can be
47*07fb1d06SElliott Hughes    reconstructed deterministically using the Huffman tables.
48*07fb1d06SElliott Hughes
49*07fb1d06SElliott HughesThe inclusion of Huffman tables in the `puff` stream has minuscule space burden
50*07fb1d06SElliott Hughes(on average maximum 320 bytes for each block. There is roughly one block per
51*07fb1d06SElliott Hughes32KB of uncompressed data).
52*07fb1d06SElliott Hughes
53*07fb1d06SElliott Hughes`bsdiff` of two `puffed` streams has much smaller patch in comparison to their
54*07fb1d06SElliott Hughesdeflate streams, but it is larger than uncompressed streams.
55*07fb1d06SElliott Hughes
56*07fb1d06SElliott Hughes**The major benefits**
57*07fb1d06SElliott Hughes
58*07fb1d06SElliott Hughes*   Size of the patch is smaller than deflate stream’s patch.
59*07fb1d06SElliott Hughes*   `puff` and `huff` are deterministic operations.
60*07fb1d06SElliott Hughes*   `huff` is in order of 10X faster than full recompression. It is even faster
61*07fb1d06SElliott Hughes    than Huffman algorithm because it already has the Huffman tables and does
62*07fb1d06SElliott Hughes    not need to reconstruct it.
63*07fb1d06SElliott Hughes*   Both algorithms have very low memory footprint.
64*07fb1d06SElliott Hughes*   The underlying LZ77 algorithm can be anything (as long as it is deflate
65*07fb1d06SElliott Hughes    compatible). This includes google’s [zopfli]
66*07fb1d06SElliott Hughes
67*07fb1d06SElliott Hughes**The drawbacks**
68*07fb1d06SElliott Hughes
69*07fb1d06SElliott Hughes*   The need to define a file format for the puffed stream and stay with this
70*07fb1d06SElliott Hughes    format forever. If format needs to be changed in the future, then some
71*07fb1d06SElliott Hughes    versioning mechanism should be there to handle it and backward compatibility
72*07fb1d06SElliott Hughes    should be maintained.
73*07fb1d06SElliott Hughes*   The need to define a payload format and stay with it forever. Similarly
74*07fb1d06SElliott Hughes    needs to be versioned if required later change.
75*07fb1d06SElliott Hughes*   Does not reduces the patch size as full recompression.
76*07fb1d06SElliott Hughes
77*07fb1d06SElliott Hughes
78*07fb1d06SElliott Hughes## puffdiff and puffpatch
79*07fb1d06SElliott Hughes
80*07fb1d06SElliott HughesA deflate compressed file (gzip, zip, etc.) contains multiple deflate streams at
81*07fb1d06SElliott Hughesdifferent locations. When this file is puffed, the resulting `puff` stream
82*07fb1d06SElliott Hughescontains both puffs and the raw data that existed between the deflates in the
83*07fb1d06SElliott Hughescompressed file. When performing `huff` operation, the location of puffs in the
84*07fb1d06SElliott Hughes`puff` stream and deflates in the deflate stream should be passed upon. This is
85*07fb1d06SElliott Hughesnecessary as `huff` operation has to know exactly where the locations of both
86*07fb1d06SElliott Hughespuffs and deflates are. (See the following [image](puffin-stream))
87*07fb1d06SElliott Hughes
88*07fb1d06SElliott Hughes![puffin-stream](images/puffin-stream.png "Puffin stream")
89*07fb1d06SElliott Hughes
90*07fb1d06SElliott HughesSimilarly `puffpatch` requires deflates and puffs locations in both the source
91*07fb1d06SElliott Hughesand target streams. These location information is saved using Protobufs in the
92*07fb1d06SElliott Hughespatch generated by `bsdiff`. One requirement for these two operations are that
93*07fb1d06SElliott Hughes`puffpatch` has to be efficient in the client. Client devices have limited
94*07fb1d06SElliott Hughesmemory and CPU bandwidth and it is necessary that each of these operations are
95*07fb1d06SElliott Hughesperformed with the most efficiency available. In order to achieve this
96*07fb1d06SElliott Hughesefficiency a new operation can be added to `bspatch`, that reads and writes into
97*07fb1d06SElliott Hughesa `puff` streams using special interfaces for puffing and huffing on the fly.
98*07fb1d06SElliott Hughes
99*07fb1d06SElliott Hughes
100*07fb1d06SElliott Hughes## Puffin Patch Format
101*07fb1d06SElliott Hughes
102*07fb1d06SElliott Hughes*   Magic (4 bytes) - The string literal "PUF1".
103*07fb1d06SElliott Hughes*   Header Length (4 bytes) - The size of the header (length of the generated
104*07fb1d06SElliott Hughes    Protobuf).
105*07fb1d06SElliott Hughes*   Header - Lengths and locations of deflate and puffs streams in source and
106*07fb1d06SElliott Hughes    target files in a Protobuf format. See [puffin.proto].
107*07fb1d06SElliott Hughes*   Patch - This is a binary array directly generated by the `bsdiff` program.
108*07fb1d06SElliott Hughes
109*07fb1d06SElliott Hughes## Puffin Stream Format
110*07fb1d06SElliott Hughes
111*07fb1d06SElliott Hughes*   [Block Header](#block-header-format) (3+ bytes) - Defines the type of the
112*07fb1d06SElliott Hughes    block.
113*07fb1d06SElliott Hughes*   Data - A mix of literals list and copy length/distances.
114*07fb1d06SElliott Hughes*   End of Block (2 bytes) - The end of block symbol. Similar to the symbols for
115*07fb1d06SElliott Hughes    copy length/distance but without the distance bytes. The actual copy length
116*07fb1d06SElliott Hughes    value is 259 (0x81FF).
117*07fb1d06SElliott Hughes
118*07fb1d06SElliott Hughes### Block Header Format
119*07fb1d06SElliott Hughes
120*07fb1d06SElliott Hughes*   Length (2 Bytes) - The size of the block header excluding the two Length
121*07fb1d06SElliott Hughes    bytes itself - 1.
122*07fb1d06SElliott Hughes*   Final Block (1 Bit) - Whether the block is the last one or not.
123*07fb1d06SElliott Hughes    *   1 - Final block
124*07fb1d06SElliott Hughes	*   0 - Middle block
125*07fb1d06SElliott Hughes*   Type (2 Bits)
126*07fb1d06SElliott Hughes    *   0 - Uncompressed. Immediately after the header, zero or one literals
127*07fb1d06SElliott Hughes        list is present which defines the content of the uncompressed blocks.
128*07fb1d06SElliott Hughes	*   1 - Fixed Huffman table. The list of literals and length/distances comes
129*07fb1d06SElliott Hughes        immediately after the header. Fixed Huffman table is defined the deflate
130*07fb1d06SElliott Hughes        specification and will not change.
131*07fb1d06SElliott Hughes	*   2 - Dynamic Huffman table. The dynamic Huffman table comes at the end of
132*07fb1d06SElliott Hughes        the block header.
133*07fb1d06SElliott Hughes	*   3 - Undefined. This is an error.
134*07fb1d06SElliott Hughes*   Skip Bits (5 Bits) - Used only for uncompressed blocks (For other types its
135*07fb1d06SElliott Hughes    value is zero). In an uncompressed block, the [RFC 1951] skips any bits
136*07fb1d06SElliott Hughes    after reading final block and type bits until the byte boundary in the
137*07fb1d06SElliott Hughes    input. However, it does not define what the value of these bits should
138*07fb1d06SElliott Hughes    be. Most implementations assume 0, but some implementations may use any
139*07fb1d06SElliott Hughes    value. This segment contains those bits as a five-bits integer. When writing
140*07fb1d06SElliott Hughes    the block header back to the deflate format, the actual number of bits which
141*07fb1d06SElliott Hughes    where skipped will be calculated easily.
142*07fb1d06SElliott Hughes*   Huffman Table - It only comes for dynamic Huffman table.
143*07fb1d06SElliott Hughes
144*07fb1d06SElliott Hughes#### Dynamic Huffman Table Format
145*07fb1d06SElliott Hughes
146*07fb1d06SElliott HughesDepending on the scheme for storing the Huffman tables, the payload size can
147*07fb1d06SElliott Hugheschange. We discovered that the following scheme does not produce the smallest
148*07fb1d06SElliott Hughespayload, but it is the most deterministic one. In a deflate stream, Huffman
149*07fb1d06SElliott Hughestables for literals/length and distances are themselves Huffman coded. In this
150*07fb1d06SElliott Hughesformat, we also `puff` the Huffman tables into the `puff` stream instead of
151*07fb1d06SElliott Hughescompletely decompressing them.
152*07fb1d06SElliott Hughes
153*07fb1d06SElliott HughesThere are three tables stored in this structure very similar to the one defined
154*07fb1d06SElliott Hughesin [RFC 1951].  A Huffman table can be defined as an array of unsigned integer
155*07fb1d06SElliott Hughescode length values. Three Puffed Huffman tables appear like the following
156*07fb1d06SElliott Hughesscheme. The first table (codes) is the Huffman table used to decode the next two
157*07fb1d06SElliott HughesHuffman tables. The second Huffman table is used to decode literals and lengths,
158*07fb1d06SElliott Hughesand the third Huffman table is used to decode distances.
159*07fb1d06SElliott Hughes
160*07fb1d06SElliott Hughes
161*07fb1d06SElliott Hughes*   Literal/Length Count (1 byte) - Number of alphabets used for literal/length
162*07fb1d06SElliott Hughes    Huffman codes - 257.
163*07fb1d06SElliott Hughes*   Distance Count (1 byte) - Number of alphabets used for distance Huffman
164*07fb1d06SElliott Hughes    codes - 1.
165*07fb1d06SElliott Hughes*   Alphabet Count (1 byte) - Number of alphabets for coding two previous
166*07fb1d06SElliott Hughes    Huffman tables - 4.
167*07fb1d06SElliott Hughes*   Code Lengths ((Alphabet Count + 1) / 2 bytes) - A list of codes for reading
168*07fb1d06SElliott Hughes    the next two Huffman tables. Each byte contains two codes. If the number of
169*07fb1d06SElliott Hughes    codes is odd, the last four Bits will be zero.
170*07fb1d06SElliott Hughes*   Literal/Length/Distance Code Lengths - List of code lengths for encoding
171*07fb1d06SElliott Hughes    literals/lengths/distance followed The encoding is as follows:
172*07fb1d06SElliott Hughes	*   [0..15] - Represent code [0..15]
173*07fb1d06SElliott Hughes	*   [16..19] - Copy the last code length [3..6] times.
174*07fb1d06SElliott Hughes	*   [20..27] - Repeat code length of 0 for [3..10] times.
175*07fb1d06SElliott Hughes	*   [28..155] - Repeat code length of 0 for [11..138] times.
176*07fb1d06SElliott Hughes
177*07fb1d06SElliott Hughes### Literals List
178*07fb1d06SElliott HughesLiterals lists are constructed by a “length” value followed by “length” bytes of
179*07fb1d06SElliott Hughesliterals. The Puffer should try to merge adjacent literals lists as much as
180*07fb1d06SElliott Hughespossible into one literals list in the `puff` stream.  This Is a length value
181*07fb1d06SElliott Hughesfollowed by length bytes of literals (Even if there is only one literal.)
182*07fb1d06SElliott Hughes
183*07fb1d06SElliott Hughes*   Tag (1 Bit) - The value is 0 indicating that this is a list of literals (not
184*07fb1d06SElliott Hughes    a copy length/distance).
185*07fb1d06SElliott Hughes*   Length - The number of literals that would follow in the list.
186*07fb1d06SElliott Hughes    *   (7 Bits) Length = [1 .. 127], The value is: Length - 1
187*07fb1d06SElliott Hughes	*   (7 Bits + 2 Bytes) Length = [128 .. 65663], The values are: 127,
188*07fb1d06SElliott Hughes        Length - 127.
189*07fb1d06SElliott Hughes	Conserves size by using only one byte if the number of upcoming
190*07fb1d06SElliott Hughes    literals is smaller or equal to 127 (About 99% of literals length in a
191*07fb1d06SElliott Hughes    normal deflate stream fit into this category.) We should never have zero
192*07fb1d06SElliott Hughes    length literals. Otherwise it will use three bytes.
193*07fb1d06SElliott Hughes*   Literals (Length Bytes) - A sequence of Length number of literals.
194*07fb1d06SElliott Hughes
195*07fb1d06SElliott Hughes### Copy Length/Distance
196*07fb1d06SElliott Hughes
197*07fb1d06SElliott HughesThis Is a Length value followed by a Distance value.
198*07fb1d06SElliott Hughes
199*07fb1d06SElliott Hughes*   Tag (1 Bit) - The value is 1 indicating that this is a copy length/distance
200*07fb1d06SElliott Hughes    field.
201*07fb1d06SElliott Hughes*   Length - Conserves size by using only one byte if the length value is
202*07fb1d06SElliott Hughes    smaller or equal to 129. Otherwise two bytes are used.
203*07fb1d06SElliott Hughes    *   (7 Bits) - Length = [3 .. 129], The value is: Length - 3
204*07fb1d06SElliott Hughes	*   (7 Bites + 1 Byte) Length = [130 .. 258], The value is: 127, Length -
205*07fb1d06SElliott Hughes        130:
206*07fb1d06SElliott Hughes*   Distance (2 Bytes) - Distance = [1 .. 32768], The value is:
207*07fb1d06SElliott Hughes    Distance - 1. The distance value as an unsigned integer.
208*07fb1d06SElliott Hughes
209*07fb1d06SElliott Hughes## Building Puffin
210*07fb1d06SElliott HughesCurrently Puffin is being used in both Android and Chrome OS and is built
211*07fb1d06SElliott Hughesdifferently in each of them. There is also a Makefile build, but it is not
212*07fb1d06SElliott Hughescomprehensive.
213*07fb1d06SElliott Hughes
214*07fb1d06SElliott Hughes## Usage
215*07fb1d06SElliott Hughes
216*07fb1d06SElliott HughesPuffin builds an executable `puffin` which can be used to perform diffing and
217*07fb1d06SElliott Hughespatching algorithms using Puffin format. To get the list of options available
218*07fb1d06SElliott Hughesrun:
219*07fb1d06SElliott Hughes
220*07fb1d06SElliott Hughes```shell
221*07fb1d06SElliott Hughespuffin --help
222*07fb1d06SElliott Hughes```
223*07fb1d06SElliott Hughes
224*07fb1d06SElliott HughesIt can also be used as a library (currently used by update_engine) that provides
225*07fb1d06SElliott Hughesdifferent APIs.
226*07fb1d06SElliott Hughes
227*07fb1d06SElliott Hughes## References
228*07fb1d06SElliott Hughes
229*07fb1d06SElliott Hughes[RFC 1951]: https://www.ietf.org/rfc/rfc1951.txt
230*07fb1d06SElliott Hughes[puffin.proto]: /src/puffin.proto
231*07fb1d06SElliott Hughes[zopfli]: https://en.wikipedia.org/wiki/Zopfli
232