LZ4 Frame Format Description
============================

### Notices

Copyright (c) 2013-2020 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

1.6.4 (28/12/2023)


Introduction
------------

The purpose of this document is to define a lossless compressed data format
that is independent of CPU type, operating system,
file system and character set, suitable for
file compression, pipe and streaming compression
using the [LZ4 algorithm](http://www.lz4.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the LZ4 compression method,
and an optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into LZ4 format and/or decompress data from LZ4 format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn't need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore checksums.
Whenever it does not support a specific parameter within the compressed stream,
it must produce a non-ambiguous error code
and associated error message explaining which parameter is unsupported.


General Structure of LZ4 Frame format
-------------------------------------

| MagicNb | F. Descriptor | Data Block | (...) | EndMark | C. Checksum |
|:-------:|:-------------:| ---------- | ----- | ------- | ----------- |
| 4 bytes |  3-15 bytes   |            |       | 4 bytes | 0-4 bytes   |

__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2204

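Because the magic number is stored little-endian, the bytes on disk appear in the order `04 22 4D 18`. The following is a minimal sketch of a detection helper; the function name `has_lz4_frame_magic` is hypothetical and not part of any LZ4 API.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # value given by this specification


def has_lz4_frame_magic(data: bytes) -> bool:
    """Return True if `data` starts with the LZ4 frame magic number."""
    if len(data) < 4:
        return False
    # "<I" reads one unsigned 32-bit integer in little-endian byte order.
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == LZ4_FRAME_MAGIC
```

Comparing against the byte string `b"\x04\x22\x4d\x18"` directly would work equally well; decoding to an integer keeps the code aligned with the value as written in the text.
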
__Frame Descriptor__

3 to 15 Bytes, to be detailed in its own paragraph,
as it is the most important part of the spec.

The combined _Magic_Number_ and _Frame_Descriptor_ fields are sometimes
called ___LZ4 Frame Header___. Its size varies between 7 and 19 bytes.

__Data Blocks__

To be detailed in its own paragraph.
That’s where compressed data is stored.

__EndMark__

The flow of blocks ends when the last data block is followed by
the 32-bit value `0x00000000`.

__Content Checksum__

_Content_Checksum_ verifies that the full content has been decoded correctly.
The content checksum is the result of [xxHash-32 algorithm]
digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag
is set in the frame descriptor.
Content Checksum validates the result:
that all blocks were fully transmitted in the correct order and without error,
and also that the encoding/decoding process itself generated no distortion.
Its usage is recommended.

The combined _EndMark_ and _Content_Checksum_ fields might sometimes be
referred to as ___LZ4 Frame Footer___. Its size varies between 4 and 8 bytes.

__Frame Concatenation__

In some circumstances, it may be preferable to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame has its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file
is left outside of this specification.
As an example, the reference lz4 command line utility behavior is
to decode all concatenated frames in their sequential order.


Frame Descriptor
----------------

| FLG     | BD      | (Content Size) | (Dictionary ID) | HC      |
| ------- | ------- |:--------------:|:---------------:| ------- |
| 1 byte  | 1 byte  |  0 - 8 bytes   |   0 - 4 bytes   | 1 byte  |

The descriptor uses a minimum of 3 bytes,
and up to 15 bytes depending on optional parameters.

__FLG byte__

|  BitNb  |  7-6  |   5   |    4     |  3   |    2     |    1     |   0  |
| ------- |-------|-------|----------|------|----------|----------|------|
|FieldName|Version|B.Indep|B.Checksum|C.Size|C.Checksum|*Reserved*|DictID|


__BD byte__

|  BitNb  |     7    |     6-5-4     |  3-2-1-0 |
| ------- | -------- | ------------- | -------- |
|FieldName|*Reserved*| Block MaxSize |*Reserved*|

In the tables, bit 7 is highest bit, while bit 0 is lowest.

__Version Number__

2-bit field, must be set to `01`.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

__Block Independence flag__

If this flag is set to “1”, blocks are independent.
If this flag is set to “0”, each block depends on previous ones
(up to LZ4 window size, which is 64 KB).
In such a case, it’s necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks.
On the other hand, it makes random access or multi-threaded decoding impossible.

__Block checksum flag__

If this flag is set, each data block will be followed by a 4-byte checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Block checksum usage is optional.

__Content Size flag__

If this flag is set, the uncompressed size of data included within the frame
will be present as an 8-byte unsigned little-endian value, after the flags.
Content Size usage is optional.

__Content checksum flag__

If this flag is set, a 32-bit content checksum will be appended
after the EndMark.

__Dictionary ID flag__

If this flag is set, a 4-byte Dict-ID field will be present,
after the descriptor flags and the Content Size.

__Block Maximum Size__

This information is useful to help the decoder allocate memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one of the values in the following table:

|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |

The decoder may refuse to allocate block sizes above any system-specific size.
Unused values may be used in a future revision of the spec.
A decoder conformant with the current version of the spec
is only able to decode block sizes defined in this spec.

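The table above can be expressed as a lookup on bits 6-4 of the BD byte. This is a minimal sketch: the name `block_max_size` is hypothetical, and it assumes KB and MB denote binary multiples (1 KB = 1024 bytes), which the text does not state explicitly.

```python
# Bits 6-5-4 of the BD byte select the maximum uncompressed block size.
# Assumption: KB and MB are binary multiples (1 KB = 1024 bytes).
BLOCK_MAX_SIZES = {
    4: 64 * 1024,
    5: 256 * 1024,
    6: 1024 * 1024,
    7: 4 * 1024 * 1024,
}


def block_max_size(bd: int) -> int:
    """Return the maximum block size, in bytes, encoded in the BD byte."""
    code = (bd >> 4) & 0b111
    if code not in BLOCK_MAX_SIZES:
        # Values 0-3 are marked N/A in the current version of the spec.
        raise ValueError(f"unsupported Block Maximum Size code: {code}")
    return BLOCK_MAX_SIZES[code]
```

A decoder could use the returned value to size its output buffer once per frame, or to reject frames whose block size exceeds a local limit.
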
__Reserved bits__

Value of reserved bits **must** be 0 (zero).
Reserved bits might be used in a future version of the specification,
typically enabling new optional features.
When this happens, a decoder respecting the current specification version
shall not be able to decode such a frame.

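The FLG layout, together with the version and reserved-bit rules, can be sketched as a small decoder. The names `Flags`, `parse_flg` and `check_bd_reserved` are hypothetical, and raising `ValueError` is an assumption of this sketch; the specification only asks for an unambiguous error.

```python
from typing import NamedTuple


class Flags(NamedTuple):
    block_independence: bool
    block_checksum: bool
    content_size: bool
    content_checksum: bool
    dict_id: bool


def parse_flg(flg: int) -> Flags:
    """Decode the FLG byte, rejecting unknown versions and reserved bits."""
    if ((flg >> 6) & 0b11) != 0b01:
        raise ValueError("unsupported frame format version")
    if flg & 0b0000_0010:  # bit 1 is reserved and must be zero
        raise ValueError("reserved bit set in FLG byte")
    return Flags(
        block_independence=bool(flg & 0x20),  # bit 5
        block_checksum=bool(flg & 0x10),      # bit 4
        content_size=bool(flg & 0x08),        # bit 3
        content_checksum=bool(flg & 0x04),    # bit 2
        dict_id=bool(flg & 0x01),             # bit 0
    )


def check_bd_reserved(bd: int) -> None:
    """Raise if any reserved bit (bit 7, bits 3-0) of the BD byte is set."""
    if bd & 0b1000_1111:
        raise ValueError("reserved bit set in BD byte")
```

For instance, `parse_flg(0x64)` describes a frame with independent blocks and a content checksum, and no other optional feature.
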
__Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is stored as an unsigned 8-byte value, for a maximum of 16 Exabytes.
Format is Little endian.
This value is informational, typically for display or memory allocation.
It can be skipped by a decoder, or used to validate content correctness.

__Dictionary ID__

A dictionary is useful to compress short input sequences.
When present, the compressor can take advantage of dictionary's content
as a kind of “known prefix” to encode the input in a more compact manner.

When the frame descriptor defines independent blocks,
every block is initialized with the same dictionary.
If the frame descriptor defines linked blocks,
the dictionary is only used once, at the beginning of the frame.

The compressor and the decompressor must employ exactly the same dictionary for the data to be decodable.

The Dict-ID field is offered as a way to help the decoder determine
which dictionary must be used to correctly decode the compressed frame.
Dict-ID is only present if the associated flag is set.
It's an unsigned 32-bit value, stored using little-endian convention.
Within a single frame, only a single Dict-ID field can be defined.

Note that the Dict-ID field is optional.
Knowledge of which dictionary to employ can also be passed out-of-band,
for example, it could be implied by the context of the application.

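The two optional fields sit between the BD byte and the header checksum, in the order Content Size then Dict-ID. The following sketch reads them from a buffer that starts at the FLG byte; the function name `parse_optional_fields` and its return shape are hypothetical.

```python
import struct
from typing import Optional, Tuple


def parse_optional_fields(
    descriptor: bytes,
) -> Tuple[Optional[int], Optional[int], int]:
    """Return (content_size, dict_id, offset_of_header_checksum).

    `descriptor` must start at the FLG byte. Absent fields are None.
    """
    if len(descriptor) < 3:
        raise ValueError("frame descriptor needs at least 3 bytes")
    flg = descriptor[0]
    offset = 2  # skip FLG and BD
    content_size = None
    dict_id = None
    if flg & 0x08:  # Content Size flag: 8 bytes, little-endian
        (content_size,) = struct.unpack_from("<Q", descriptor, offset)
        offset += 8
    if flg & 0x01:  # Dictionary ID flag: 4 bytes, little-endian
        (dict_id,) = struct.unpack_from("<I", descriptor, offset)
        offset += 4
    return content_size, dict_id, offset
```

The returned offset points at the HC byte, so the descriptor length is `offset + 1`, which ranges from 3 to 15 as stated earlier. A truncated buffer makes `struct.unpack_from` raise `struct.error`.
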
__Header Checksum__

One-byte checksum of combined descriptor fields, including optional ones.
The value is the second byte of `xxh32()` : ` (xxh32()>>8) & 0xFF `
using zero as a seed, and the full Frame Descriptor as an input
(including optional fields when they are present).
A wrong checksum indicates that the descriptor is erroneous.


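As an illustration of the header checksum rule, the sketch below pairs a compact pure-Python XXH32, written from the published xxHash-32 algorithm description, with a hypothetical `header_checksum` helper. It is meant for experimentation, not as a substitute for a tested xxHash library. The input is assumed to be the descriptor bytes from FLG up to, but not including, the HC byte.

```python
import struct

_P1, _P2, _P3, _P4, _P5 = (
    0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F, 0x165667B1)
_MASK = 0xFFFFFFFF


def _rotl(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & _MASK


def xxh32(data: bytes, seed: int = 0) -> int:
    """Pure-Python XXH32 (illustrative, unoptimized)."""
    n, i = len(data), 0
    if n >= 16:
        acc = [(seed + _P1 + _P2) & _MASK, (seed + _P2) & _MASK,
               seed & _MASK, (seed - _P1) & _MASK]
        while i + 16 <= n:
            # Each 16-byte stripe feeds one 32-bit lane per accumulator.
            lanes = struct.unpack_from("<4I", data, i)
            acc = [(_rotl((a + w * _P2) & _MASK, 13) * _P1) & _MASK
                   for a, w in zip(acc, lanes)]
            i += 16
        h = (_rotl(acc[0], 1) + _rotl(acc[1], 7)
             + _rotl(acc[2], 12) + _rotl(acc[3], 18)) & _MASK
    else:
        h = (seed + _P5) & _MASK
    h = (h + n) & _MASK
    while i + 4 <= n:  # remaining 4-byte words
        (w,) = struct.unpack_from("<I", data, i)
        h = (_rotl((h + w * _P3) & _MASK, 17) * _P4) & _MASK
        i += 4
    while i < n:       # remaining single bytes
        h = (_rotl((h + data[i] * _P5) & _MASK, 11) * _P1) & _MASK
        i += 1
    # Final avalanche.
    h ^= h >> 15
    h = (h * _P2) & _MASK
    h ^= h >> 13
    h = (h * _P3) & _MASK
    h ^= h >> 16
    return h


def header_checksum(descriptor_without_hc: bytes) -> int:
    """Second byte of xxh32() over the descriptor, seed 0."""
    return (xxh32(descriptor_without_hc, 0) >> 8) & 0xFF
```

A decoder would compare the returned byte with the HC byte read from the stream and reject the frame on mismatch.
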
Data Blocks
-----------

| Block Size |  data  | (Block Checksum) |
|:----------:| ------ |:----------------:|
|  4 bytes   |        |   0 - 4 bytes    |


__Block Size__

This field uses 4 bytes, in little-endian format.

If the highest bit is set (`1`), the block is uncompressed.

If the highest bit is not set (`0`), the block is LZ4-compressed,
using the [LZ4 block format specification](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).

All other bits give the size, in bytes, of the data section.
The size does not include the block checksum if present.

_Block_Size_ shall never be larger than _Block_Maximum_Size_.
Compression could otherwise produce such an outcome for non-compressible sources.
In such a case, the data block **must** be stored in uncompressed format.

A value of `0x00000000` is invalid, and signifies an _EndMark_ instead.
Note that this is different from a value of `0x80000000` (highest bit set),
which is an uncompressed block of size 0 (empty),
which is valid, and therefore doesn't end a frame.
Note that, if _Block_checksum_ is enabled,
even an empty block must be followed by a 32-bit block checksum.

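The interpretation of the Block Size field can be sketched as follows. The function name `parse_block_size` is hypothetical, and returning `None` for the EndMark is a choice made for this illustration.

```python
import struct
from typing import Optional, Tuple


def parse_block_size(field: bytes) -> Optional[Tuple[bool, int]]:
    """Decode a 4-byte Block Size field.

    Returns None for the EndMark, else (is_uncompressed, data_size).
    """
    if len(field) != 4:
        raise ValueError("Block Size field must be exactly 4 bytes")
    (raw,) = struct.unpack("<I", field)
    if raw == 0:
        return None  # 0x00000000 is the EndMark, not a block
    is_uncompressed = bool(raw & 0x80000000)  # highest bit
    size = raw & 0x7FFFFFFF                   # remaining 31 bits
    return is_uncompressed, size
```

Note that `0x80000000` decodes to `(True, 0)`, an empty uncompressed block, and does not terminate the frame.
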
__Data__

This is where the actual data to decode is stored.
It might be compressed or not, depending on previous field indications.

When compressed, the data must respect the [LZ4 block format specification](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md).

Note that a block is not necessarily full.
Uncompressed size of data can be any size __up to__ _Block_Maximum_Size_,
so it may contain less data than the maximum block size.

__Block checksum__

Only present if the associated flag is set.
This is a 4-byte checksum value, in little endian format,
calculated by using the [xxHash-32 algorithm] on the __raw__ (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors)
before decoding.

_Block_checksum_ can be cumulative with _Content_checksum_.

[xxHash-32 algorithm]: https://github.com/Cyan4973/xxHash/blob/release/doc/xxhash_spec.md


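Putting the three fields together, a decoder walks the block sequence until it meets the EndMark. The generator below is a sketch under stated assumptions: `iter_blocks` is a hypothetical name, `body` is assumed to start immediately after the frame descriptor, and the code neither decompresses the data nor verifies checksums.

```python
from typing import Iterator, Optional, Tuple


def iter_blocks(
    body: bytes, has_block_checksum: bool, block_max_size: int
) -> Iterator[Tuple[bool, bytes, Optional[int]]]:
    """Yield (is_uncompressed, raw_data, stored_checksum) per data block."""
    pos = 0
    while True:
        if pos + 4 > len(body):
            raise ValueError("truncated frame: missing Block Size or EndMark")
        raw = int.from_bytes(body[pos:pos + 4], "little")
        pos += 4
        if raw == 0:
            return  # EndMark: no more blocks
        size = raw & 0x7FFFFFFF
        if size > block_max_size:
            raise ValueError("Block Size exceeds Block Maximum Size")
        data = body[pos:pos + size]
        if len(data) != size:
            raise ValueError("truncated frame: incomplete data block")
        pos += size
        checksum = None
        if has_block_checksum:
            if pos + 4 > len(body):
                raise ValueError("truncated frame: missing block checksum")
            checksum = int.from_bytes(body[pos:pos + 4], "little")
            pos += 4
        yield bool(raw & 0x80000000), data, checksum
```

Each yielded `raw_data` is either a literal copy of the original bytes or an LZ4-compressed block, depending on the first tuple element; the stored checksum, when present, covers `raw_data` as it appears in the stream.
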
Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |

Skippable frames allow the integration of user-defined data
into a flow of concatenated frames.
The design is straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

For the purpose of facilitating identification,
it is discouraged to start a flow of concatenated frames with a skippable frame.
If there is a need to start such a flow with some user data
encapsulated into a skippable frame,
it’s recommended to start with a zero-byte LZ4 frame
followed by a skippable frame.
This will make it easier for file type identifiers.


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__

This is the size, in bytes, of the following User Data
(not including the magic number or the size field itself).
4 Bytes, Little endian format, unsigned 32-bit.
This means User Data can’t be bigger than (2^32-1) Bytes.

__User Data__

User Data can be anything. Data will just be skipped by the decoder.


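Skipping such a frame only requires reading its 8-byte header. A minimal sketch, with the hypothetical helper name `skip_skippable_frame`:

```python
import struct


def skip_skippable_frame(data: bytes, offset: int = 0) -> int:
    """Return the offset just past the skippable frame starting at `offset`."""
    if offset + 8 > len(data):
        raise ValueError("truncated skippable frame header")
    magic, size = struct.unpack_from("<II", data, offset)
    # Any of the 16 values 0x184D2A50..0x184D2A5F identifies a skippable frame.
    if (magic & 0xFFFFFFF0) != 0x184D2A50:
        raise ValueError("not a skippable frame")
    end = offset + 8 + size  # header (magic + size field) plus User Data
    if end > len(data):
        raise ValueError("truncated skippable frame")
    return end
```

The returned offset is where the next frame's magic number, if any, is expected.
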
Legacy frame
------------

The Legacy frame format was defined in the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, as it is too restrictive.

Main characteristics of the legacy format :

- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because
  it is followed by the “EOF” (End of File) mark,
  or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184C2102

__Block Compressed Size__

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

__Data__

Where the actual compressed data stands.
Data is always compressed, even when compression is detrimental.

__EndMark__

End of legacy frame is implicit only.
It must be followed by a standard EOF (End Of File) signal,
whether it is a file or a stream.

Alternatively, if the frame is followed by a valid Frame Magic Number,
it is considered completed.
This policy makes it possible to concatenate legacy frames.

Any other value will be interpreted as a block size,
and trigger an error if it does not fit within acceptable range.


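The implicit end-of-frame rule can be sketched as a loop that stops at end of input or at a known magic number. The names below are hypothetical. Because the text does not quantify the "acceptable range" for a block size, this sketch only checks that the block fits in the remaining input, which is an assumption.

```python
import struct
from typing import Iterator

LEGACY_MAGIC = 0x184C2102
FRAME_MAGIC = 0x184D2204


def _is_known_magic(value: int) -> bool:
    """True for the frame, legacy and skippable magic numbers."""
    return (value in (FRAME_MAGIC, LEGACY_MAGIC)
            or (value & 0xFFFFFFF0) == 0x184D2A50)


def iter_legacy_blocks(data: bytes, offset: int = 0) -> Iterator[bytes]:
    """Yield each compressed block of the legacy frame at `offset`."""
    if offset + 4 > len(data):
        raise ValueError("truncated legacy frame")
    (magic,) = struct.unpack_from("<I", data, offset)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy frame")
    pos = offset + 4
    while pos < len(data):  # reaching end of input ends the frame
        if pos + 4 > len(data):
            raise ValueError("truncated block size field")
        (value,) = struct.unpack_from("<I", data, pos)
        if _is_known_magic(value):
            return  # a following frame starts here
        pos += 4
        block = data[pos:pos + value]
        if len(block) != value:
            raise ValueError("block size does not fit the remaining input")
        yield block
        pos += value
```

Each yielded block would then be handed to an LZ4 block decoder with an output buffer of up to 8 MB.
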
Version changes
---------------

1.6.4 : minor clarifications for Dictionaries

1.6.3 : minor : clarify Data Block

1.6.2 : clarifies specification of _EndMark_

1.6.1 : introduced terms "LZ4 Frame Header" and "LZ4 Frame Footer"

1.6.0 : restored Dictionary ID field in Frame header

1.5.1 : changed document format to MarkDown

1.5 : removed Dictionary ID from specification

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5 : added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument