xref: /aosp_15_r20/external/lz4/doc/lz4_Block_format.md (revision 27162e4e17433d5aa7cb38e7b6a433a09405fc7f)
1*27162e4eSAndroid Build Coastguard WorkerLZ4 Block Format Description
2*27162e4eSAndroid Build Coastguard Worker============================
3*27162e4eSAndroid Build Coastguard WorkerLast revised: 2022-07-31 .
4*27162e4eSAndroid Build Coastguard WorkerAuthor : Yann Collet
5*27162e4eSAndroid Build Coastguard Worker
6*27162e4eSAndroid Build Coastguard Worker
7*27162e4eSAndroid Build Coastguard WorkerThis specification is intended for developers willing to
8*27162e4eSAndroid Build Coastguard Workerproduce or read LZ4 compressed data blocks
9*27162e4eSAndroid Build Coastguard Workerusing any programming language of their choice.
10*27162e4eSAndroid Build Coastguard Worker
11*27162e4eSAndroid Build Coastguard WorkerLZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format.
12*27162e4eSAndroid Build Coastguard WorkerThere is no entropy encoder back-end nor framing layer.
13*27162e4eSAndroid Build Coastguard WorkerThe latter is assumed to be handled by other parts of the system
14*27162e4eSAndroid Build Coastguard Worker(see [LZ4 Frame format]).
15*27162e4eSAndroid Build Coastguard WorkerThis design is assumed to favor simplicity and speed.
16*27162e4eSAndroid Build Coastguard Worker
17*27162e4eSAndroid Build Coastguard WorkerThis document describes only the Block Format,
18*27162e4eSAndroid Build Coastguard Workernot how the compressor nor decompressor actually work.
19*27162e4eSAndroid Build Coastguard WorkerFor more details on such topics, see later section "Implementation Notes".
20*27162e4eSAndroid Build Coastguard Worker
21*27162e4eSAndroid Build Coastguard Worker[LZ4 Frame format]: lz4_Frame_format.md
22*27162e4eSAndroid Build Coastguard Worker
23*27162e4eSAndroid Build Coastguard Worker
24*27162e4eSAndroid Build Coastguard Worker
25*27162e4eSAndroid Build Coastguard WorkerCompressed block format
26*27162e4eSAndroid Build Coastguard Worker-----------------------
27*27162e4eSAndroid Build Coastguard WorkerAn LZ4 compressed block is composed of sequences.
28*27162e4eSAndroid Build Coastguard WorkerA sequence is a suite of literals (not-compressed bytes),
29*27162e4eSAndroid Build Coastguard Workerfollowed by a match copy operation.
30*27162e4eSAndroid Build Coastguard Worker
31*27162e4eSAndroid Build Coastguard WorkerEach sequence starts with a `token`.
32*27162e4eSAndroid Build Coastguard WorkerThe `token` is a one byte value, separated into two 4-bits fields.
33*27162e4eSAndroid Build Coastguard WorkerTherefore each field ranges from 0 to 15.
34*27162e4eSAndroid Build Coastguard Worker
35*27162e4eSAndroid Build Coastguard Worker
36*27162e4eSAndroid Build Coastguard WorkerThe first field uses the 4 high-bits of the token.
37*27162e4eSAndroid Build Coastguard WorkerIt provides the length of literals to follow.
38*27162e4eSAndroid Build Coastguard Worker
39*27162e4eSAndroid Build Coastguard WorkerIf the field value is smaller than 15,
40*27162e4eSAndroid Build Coastguard Workerthen it represents the total nb of literals present in the sequence,
41*27162e4eSAndroid Build Coastguard Workerincluding 0, in which case there is no literal.
42*27162e4eSAndroid Build Coastguard Worker
43*27162e4eSAndroid Build Coastguard WorkerThe value 15 is a special case: more bytes are required to indicate the full length.
44*27162e4eSAndroid Build Coastguard WorkerEach additional byte then represents a value from 0 to 255,
45*27162e4eSAndroid Build Coastguard Workerwhich is added to the previous value to produce a total length.
46*27162e4eSAndroid Build Coastguard WorkerWhen the byte value is 255, another byte must be read and added, and so on.
47*27162e4eSAndroid Build Coastguard WorkerThere can be any number of bytes of value `255` following `token`.
48*27162e4eSAndroid Build Coastguard WorkerThe Block Format does not define any "size limit",
49*27162e4eSAndroid Build Coastguard Workerthough real implementations may feature some practical limits
50*27162e4eSAndroid Build Coastguard Worker(see more details in later chapter "Implementation Notes").
51*27162e4eSAndroid Build Coastguard Worker
52*27162e4eSAndroid Build Coastguard WorkerNote : this format explains why a non-compressible input block is expanded by 0.4%.
53*27162e4eSAndroid Build Coastguard Worker
54*27162e4eSAndroid Build Coastguard WorkerExample 1 : A literal length of 48 will be represented as :
55*27162e4eSAndroid Build Coastguard Worker
56*27162e4eSAndroid Build Coastguard Worker  - 15 : value for the 4-bits High field
57*27162e4eSAndroid Build Coastguard Worker  - 33 : (=48-15) remaining length to reach 48
58*27162e4eSAndroid Build Coastguard Worker
59*27162e4eSAndroid Build Coastguard WorkerExample 2 : A literal length of 280 will be represented as :
60*27162e4eSAndroid Build Coastguard Worker
61*27162e4eSAndroid Build Coastguard Worker  - 15  : value for the 4-bits High field
62*27162e4eSAndroid Build Coastguard Worker  - 255 : following byte is maxed, since 280-15 >= 255
63*27162e4eSAndroid Build Coastguard Worker  - 10  : (=280 - 15 - 255) remaining length to reach 280
64*27162e4eSAndroid Build Coastguard Worker
65*27162e4eSAndroid Build Coastguard WorkerExample 3 : A literal length of 15 will be represented as :
66*27162e4eSAndroid Build Coastguard Worker
67*27162e4eSAndroid Build Coastguard Worker  - 15 : value for the 4-bits High field
68*27162e4eSAndroid Build Coastguard Worker  - 0  : (=15-15) yes, the zero must be output
69*27162e4eSAndroid Build Coastguard Worker
70*27162e4eSAndroid Build Coastguard WorkerFollowing `token` and optional length bytes, are the literals themselves.
71*27162e4eSAndroid Build Coastguard WorkerThey are exactly as numerous as just decoded (length of literals).
72*27162e4eSAndroid Build Coastguard WorkerReminder: it's possible that there are zero literals.
73*27162e4eSAndroid Build Coastguard Worker
74*27162e4eSAndroid Build Coastguard Worker
75*27162e4eSAndroid Build Coastguard WorkerFollowing the literals is the match copy operation.
76*27162e4eSAndroid Build Coastguard Worker
77*27162e4eSAndroid Build Coastguard WorkerIt starts by the `offset` value.
78*27162e4eSAndroid Build Coastguard WorkerThis is a 2 bytes value, in little endian format
79*27162e4eSAndroid Build Coastguard Worker(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
80*27162e4eSAndroid Build Coastguard Worker
81*27162e4eSAndroid Build Coastguard WorkerThe `offset` represents the position of the match to be copied from the past.
82*27162e4eSAndroid Build Coastguard WorkerFor example, 1 means "current position - 1 byte".
83*27162e4eSAndroid Build Coastguard WorkerThe maximum `offset` value is 65535. 65536 and beyond cannot be coded.
84*27162e4eSAndroid Build Coastguard WorkerNote that 0 is an invalid `offset` value.
85*27162e4eSAndroid Build Coastguard WorkerThe presence of a 0 `offset` value denotes an invalid (corrupted) block.
86*27162e4eSAndroid Build Coastguard Worker
87*27162e4eSAndroid Build Coastguard WorkerThen the `matchlength` can be extracted.
88*27162e4eSAndroid Build Coastguard WorkerFor this, we use the second `token` field, the low 4-bits.
89*27162e4eSAndroid Build Coastguard WorkerSuch a value, obviously, ranges from 0 to 15.
90*27162e4eSAndroid Build Coastguard WorkerHowever here, 0 means that the copy operation is minimal.
91*27162e4eSAndroid Build Coastguard WorkerThe minimum length of a match, called `minmatch`, is 4.
92*27162e4eSAndroid Build Coastguard WorkerAs a consequence, a 0 value means 4 bytes.
93*27162e4eSAndroid Build Coastguard WorkerSimilarly to literal length, any value smaller than 15 represents a length,
94*27162e4eSAndroid Build Coastguard Workerto which 4 (`minmatch`) must be added, thus ranging from 4 to 18.
95*27162e4eSAndroid Build Coastguard WorkerA value of 15 is special, meaning 19+ bytes,
96*27162e4eSAndroid Build Coastguard Workerto which one must read additional bytes, one at a time,
97*27162e4eSAndroid Build Coastguard Workerwith each byte value ranging from 0 to 255.
98*27162e4eSAndroid Build Coastguard WorkerThey are added to total to provide the final match length.
99*27162e4eSAndroid Build Coastguard WorkerA 255 value means there is another byte to read and add.
100*27162e4eSAndroid Build Coastguard WorkerThere is no limit to the number of optional `255` bytes that can be present,
101*27162e4eSAndroid Build Coastguard Workerand therefore no limit to representable match length,
102*27162e4eSAndroid Build Coastguard Workerthough real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below).
103*27162e4eSAndroid Build Coastguard Worker
104*27162e4eSAndroid Build Coastguard WorkerNote: this format has a maximum achievable compression ratio of about ~250.
105*27162e4eSAndroid Build Coastguard Worker
106*27162e4eSAndroid Build Coastguard WorkerDecoding the `matchlength` reaches the end of current sequence.
107*27162e4eSAndroid Build Coastguard WorkerNext byte will be the start of another sequence, and therefore a new `token`.
108*27162e4eSAndroid Build Coastguard Worker
109*27162e4eSAndroid Build Coastguard Worker
110*27162e4eSAndroid Build Coastguard WorkerEnd of block conditions
111*27162e4eSAndroid Build Coastguard Worker-------------------------
112*27162e4eSAndroid Build Coastguard WorkerThere are specific restrictions required to terminate an LZ4 block.
113*27162e4eSAndroid Build Coastguard Worker
114*27162e4eSAndroid Build Coastguard Worker1. The last sequence contains only literals.
115*27162e4eSAndroid Build Coastguard Worker   The block ends right after the literals (no `offset` field).
116*27162e4eSAndroid Build Coastguard Worker2. The last 5 bytes of input are always literals.
117*27162e4eSAndroid Build Coastguard Worker   Therefore, the last sequence contains at least 5 bytes.
118*27162e4eSAndroid Build Coastguard Worker   - Special : if input is smaller than 5 bytes,
119*27162e4eSAndroid Build Coastguard Worker     there is only one sequence, it contains the whole input as literals.
120*27162e4eSAndroid Build Coastguard Worker     Even empty input can be represented, using a zero byte,
121*27162e4eSAndroid Build Coastguard Worker     interpreted as a final token without literal and without a match.
122*27162e4eSAndroid Build Coastguard Worker3. The last match must start at least 12 bytes before the end of block.
123*27162e4eSAndroid Build Coastguard Worker   The last match is part of the _penultimate_ sequence.
124*27162e4eSAndroid Build Coastguard Worker   It is followed by the last sequence, which contains _only_ literals.
125*27162e4eSAndroid Build Coastguard Worker   - Note that, as a consequence,
126*27162e4eSAndroid Build Coastguard Worker     blocks < 12 bytes cannot be compressed.
127*27162e4eSAndroid Build Coastguard Worker     And as an extension, _independent_ blocks < 13 bytes cannot be compressed,
128*27162e4eSAndroid Build Coastguard Worker     because they must start by at least one literal,
129*27162e4eSAndroid Build Coastguard Worker     that the match can then copy afterwards.
130*27162e4eSAndroid Build Coastguard Worker
131*27162e4eSAndroid Build Coastguard WorkerWhen a block does not respect these end conditions,
132*27162e4eSAndroid Build Coastguard Workera conformant decoder is allowed to reject the block as incorrect.
133*27162e4eSAndroid Build Coastguard Worker
134*27162e4eSAndroid Build Coastguard WorkerThese rules are in place to ensure compatibility with
135*27162e4eSAndroid Build Coastguard Workera wide range of historical decoders
136*27162e4eSAndroid Build Coastguard Workerwhich rely on these conditions for their speed-oriented design.
137*27162e4eSAndroid Build Coastguard Worker
138*27162e4eSAndroid Build Coastguard WorkerImplementation notes
139*27162e4eSAndroid Build Coastguard Worker-----------------------
140*27162e4eSAndroid Build Coastguard WorkerThe LZ4 Block Format only defines the compressed format,
141*27162e4eSAndroid Build Coastguard Workerit does not tell how to create a decoder or an encoder,
142*27162e4eSAndroid Build Coastguard Workerwhich design is left free to the imagination of the implementer.
143*27162e4eSAndroid Build Coastguard Worker
144*27162e4eSAndroid Build Coastguard WorkerHowever, thanks to experience, there are a number of typical topics that
145*27162e4eSAndroid Build Coastguard Workermost implementations will have to consider.
146*27162e4eSAndroid Build Coastguard WorkerThis section tries to provide a few guidelines.
147*27162e4eSAndroid Build Coastguard Worker
148*27162e4eSAndroid Build Coastguard Worker#### Metadata
149*27162e4eSAndroid Build Coastguard Worker
150*27162e4eSAndroid Build Coastguard WorkerAn LZ4-compressed Block requires additional metadata for proper decoding.
151*27162e4eSAndroid Build Coastguard WorkerTypically, a decoder will require the compressed block's size,
152*27162e4eSAndroid Build Coastguard Workerand an upper bound of decompressed size.
153*27162e4eSAndroid Build Coastguard WorkerOther variants exist, such as knowing the decompressed size,
154*27162e4eSAndroid Build Coastguard Workerand having an upper bound of the input size.
155*27162e4eSAndroid Build Coastguard WorkerThe Block Format does not specify how to transmit such information,
156*27162e4eSAndroid Build Coastguard Workerwhich is considered an out-of-band information channel.
157*27162e4eSAndroid Build Coastguard WorkerThat's because in many cases, the information is present in the environment.
158*27162e4eSAndroid Build Coastguard WorkerFor example, databases must store the size of their compressed block for indexing,
159*27162e4eSAndroid Build Coastguard Workerand know that their decompressed block can't be larger than a certain threshold.
160*27162e4eSAndroid Build Coastguard Worker
161*27162e4eSAndroid Build Coastguard WorkerIf you need a format which is "self-contained",
162*27162e4eSAndroid Build Coastguard Workerand also transports the necessary metadata for proper decoding on any platform,
163*27162e4eSAndroid Build Coastguard Workerconsider employing the [LZ4 Frame format] instead.
164*27162e4eSAndroid Build Coastguard Worker
165*27162e4eSAndroid Build Coastguard Worker#### Large lengths
166*27162e4eSAndroid Build Coastguard Worker
167*27162e4eSAndroid Build Coastguard WorkerWhile the Block Format does not define any maximum value for length fields,
168*27162e4eSAndroid Build Coastguard Workerin practice, most implementations will feature some form of limit,
169*27162e4eSAndroid Build Coastguard Workersince it's expected for such values to be stored into registers of fixed bit width.
170*27162e4eSAndroid Build Coastguard Worker
171*27162e4eSAndroid Build Coastguard WorkerIf length fields use 64-bit registers,
172*27162e4eSAndroid Build Coastguard Workerthen it can be assumed that there is no practical limit,
173*27162e4eSAndroid Build Coastguard Workeras it would require a single continuous block of multiple petabytes to reach it,
174*27162e4eSAndroid Build Coastguard Workerwhich is unreasonable by today's standard.
175*27162e4eSAndroid Build Coastguard Worker
176*27162e4eSAndroid Build Coastguard WorkerIf length fields use 32-bit registers, then it can be overflowed,
177*27162e4eSAndroid Build Coastguard Workerbut requires a compressed block of size > 16 MB.
178*27162e4eSAndroid Build Coastguard WorkerTherefore, implementations that do not deal with compressed blocks > 16 MB are safe.
179*27162e4eSAndroid Build Coastguard WorkerHowever, if such a case is allowed,
180*27162e4eSAndroid Build Coastguard Workerthen it's recommended to check that no large length overflows the register.
181*27162e4eSAndroid Build Coastguard Worker
182*27162e4eSAndroid Build Coastguard WorkerIf length fields use 16-bit registers,
183*27162e4eSAndroid Build Coastguard Workerthen it's definitely possible to overflow such register,
184*27162e4eSAndroid Build Coastguard Workerwith less than < 300 bytes of compressed data.
185*27162e4eSAndroid Build Coastguard Worker
186*27162e4eSAndroid Build Coastguard WorkerA conformant decoder should be able to detect length overflows when it's possible,
187*27162e4eSAndroid Build Coastguard Workerand simply error out when that happens.
188*27162e4eSAndroid Build Coastguard WorkerThe input block might not be invalid,
189*27162e4eSAndroid Build Coastguard Workerit's just not decodable by the local decoder implementation.
190*27162e4eSAndroid Build Coastguard Worker
191*27162e4eSAndroid Build Coastguard WorkerNote that, in order to be compatible with the larger LZ4 ecosystem,
192*27162e4eSAndroid Build Coastguard Workerit's recommended to be able to read and represent lengths of up to 4 MB,
193*27162e4eSAndroid Build Coastguard Workerand to accept blocks of size up to 4 MB.
194*27162e4eSAndroid Build Coastguard WorkerSuch limits are compatible with 32-bit length registers,
195*27162e4eSAndroid Build Coastguard Workerand prevent overflow of 32-bit registers.
196*27162e4eSAndroid Build Coastguard Worker
197*27162e4eSAndroid Build Coastguard Worker#### Safe decoding
198*27162e4eSAndroid Build Coastguard Worker
199*27162e4eSAndroid Build Coastguard WorkerIf a decoder receives compressed data from any external source,
200*27162e4eSAndroid Build Coastguard Workerit is recommended to ensure that the decoder is resilient to corrupted input,
201*27162e4eSAndroid Build Coastguard Workerand made safe from buffer overflow manipulations.
202*27162e4eSAndroid Build Coastguard WorkerAlways ensure that read and write operations
203*27162e4eSAndroid Build Coastguard Workerremain within the limits of provided buffers.
204*27162e4eSAndroid Build Coastguard Worker
205*27162e4eSAndroid Build Coastguard WorkerOf particular importance, ensure that the nb of bytes instructed to copy
206*27162e4eSAndroid Build Coastguard Workerdoes not overflow neither the input nor the output buffers.
207*27162e4eSAndroid Build Coastguard WorkerEnsure also, when reading an offset value, that the resulting position to copy
208*27162e4eSAndroid Build Coastguard Workerdoes not reach beyond the beginning of the buffer.
209*27162e4eSAndroid Build Coastguard WorkerSuch a situation can happen during the first 64 KB of decoded data.
210*27162e4eSAndroid Build Coastguard Worker
211*27162e4eSAndroid Build Coastguard WorkerFor more safety, test the decoder with fuzzers
212*27162e4eSAndroid Build Coastguard Workerto ensure it's resilient to improbable sequences of conditions.
213*27162e4eSAndroid Build Coastguard WorkerCombine them with sanitizers, in order to catch overflows (asan)
214*27162e4eSAndroid Build Coastguard Workeror initialization issues (msan).
215*27162e4eSAndroid Build Coastguard Worker
216*27162e4eSAndroid Build Coastguard WorkerPay some attention to offset 0 scenario, which is invalid,
217*27162e4eSAndroid Build Coastguard Workerand therefore must not be blindly decoded:
218*27162e4eSAndroid Build Coastguard Workera naive implementation could preserve destination buffer content,
219*27162e4eSAndroid Build Coastguard Workerwhich could then result in information disclosure
220*27162e4eSAndroid Build Coastguard Workerif such buffer was uninitialized and still containing private data.
221*27162e4eSAndroid Build Coastguard WorkerFor reference, in such a scenario, the reference LZ4 decoder
222*27162e4eSAndroid Build Coastguard Workerclears the match segment with `0` bytes,
223*27162e4eSAndroid Build Coastguard Workerthough other solutions are certainly possible.
224*27162e4eSAndroid Build Coastguard Worker
225*27162e4eSAndroid Build Coastguard WorkerFinally, pay attention to the "overlap match" scenario,
226*27162e4eSAndroid Build Coastguard Workerwhen `matchlength` is larger than `offset`.
227*27162e4eSAndroid Build Coastguard WorkerIn which case, since `match_pos + matchlength > current_pos`,
228*27162e4eSAndroid Build Coastguard Workersome of the later bytes to copy do not exist yet,
229*27162e4eSAndroid Build Coastguard Workerand will be generated during the early stage of match copy operation.
230*27162e4eSAndroid Build Coastguard WorkerSuch scenario must be handled with special care.
231*27162e4eSAndroid Build Coastguard WorkerA common case is an offset of 1,
232*27162e4eSAndroid Build Coastguard Workermeaning the last byte is repeated `matchlength` times.
233*27162e4eSAndroid Build Coastguard Worker
234*27162e4eSAndroid Build Coastguard Worker#### Compression techniques
235*27162e4eSAndroid Build Coastguard Worker
236*27162e4eSAndroid Build Coastguard WorkerThe core of a LZ4 compressor is to detect duplicated data across past 64 KB.
237*27162e4eSAndroid Build Coastguard WorkerThe format makes no assumption nor limits to the way a compressor
238*27162e4eSAndroid Build Coastguard Workersearches and selects matches within the source data block.
239*27162e4eSAndroid Build Coastguard WorkerFor example, an upper compression limit can be reached,
240*27162e4eSAndroid Build Coastguard Workerusing a technique called "full optimal parsing", at high cpu and memory cost.
241*27162e4eSAndroid Build Coastguard WorkerBut multiple other techniques can be considered,
242*27162e4eSAndroid Build Coastguard Workerfeaturing distinct time / performance trade-offs.
243*27162e4eSAndroid Build Coastguard WorkerAs long as the specified format is respected,
244*27162e4eSAndroid Build Coastguard Workerthe result will be compatible with and decodable by any compliant decoder.
245