1*27162e4eSAndroid Build Coastguard WorkerLZ4 Block Format Description 2*27162e4eSAndroid Build Coastguard Worker============================ 3*27162e4eSAndroid Build Coastguard WorkerLast revised: 2022-07-31 . 4*27162e4eSAndroid Build Coastguard WorkerAuthor : Yann Collet 5*27162e4eSAndroid Build Coastguard Worker 6*27162e4eSAndroid Build Coastguard Worker 7*27162e4eSAndroid Build Coastguard WorkerThis specification is intended for developers willing to 8*27162e4eSAndroid Build Coastguard Workerproduce or read LZ4 compressed data blocks 9*27162e4eSAndroid Build Coastguard Workerusing any programming language of their choice. 10*27162e4eSAndroid Build Coastguard Worker 11*27162e4eSAndroid Build Coastguard WorkerLZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format. 12*27162e4eSAndroid Build Coastguard WorkerThere is no entropy encoder back-end nor framing layer. 13*27162e4eSAndroid Build Coastguard WorkerThe latter is assumed to be handled by other parts of the system 14*27162e4eSAndroid Build Coastguard Worker(see [LZ4 Frame format]). 15*27162e4eSAndroid Build Coastguard WorkerThis design is assumed to favor simplicity and speed. 16*27162e4eSAndroid Build Coastguard Worker 17*27162e4eSAndroid Build Coastguard WorkerThis document describes only the Block Format, 18*27162e4eSAndroid Build Coastguard Workernot how the compressor nor decompressor actually work. 19*27162e4eSAndroid Build Coastguard WorkerFor more details on such topics, see later section "Implementation Notes". 20*27162e4eSAndroid Build Coastguard Worker 21*27162e4eSAndroid Build Coastguard Worker[LZ4 Frame format]: lz4_Frame_format.md 22*27162e4eSAndroid Build Coastguard Worker 23*27162e4eSAndroid Build Coastguard Worker 24*27162e4eSAndroid Build Coastguard Worker 25*27162e4eSAndroid Build Coastguard WorkerCompressed block format 26*27162e4eSAndroid Build Coastguard Worker----------------------- 27*27162e4eSAndroid Build Coastguard WorkerAn LZ4 compressed block is composed of sequences. 28*27162e4eSAndroid Build Coastguard WorkerA sequence is a suite of literals (not-compressed bytes), 29*27162e4eSAndroid Build Coastguard Workerfollowed by a match copy operation. 30*27162e4eSAndroid Build Coastguard Worker 31*27162e4eSAndroid Build Coastguard WorkerEach sequence starts with a `token`. 32*27162e4eSAndroid Build Coastguard WorkerThe `token` is a one byte value, separated into two 4-bits fields. 33*27162e4eSAndroid Build Coastguard WorkerTherefore each field ranges from 0 to 15. 34*27162e4eSAndroid Build Coastguard Worker 35*27162e4eSAndroid Build Coastguard Worker 36*27162e4eSAndroid Build Coastguard WorkerThe first field uses the 4 high-bits of the token. 37*27162e4eSAndroid Build Coastguard WorkerIt provides the length of literals to follow. 38*27162e4eSAndroid Build Coastguard Worker 39*27162e4eSAndroid Build Coastguard WorkerIf the field value is smaller than 15, 40*27162e4eSAndroid Build Coastguard Workerthen it represents the total nb of literals present in the sequence, 41*27162e4eSAndroid Build Coastguard Workerincluding 0, in which case there is no literal. 42*27162e4eSAndroid Build Coastguard Worker 43*27162e4eSAndroid Build Coastguard WorkerThe value 15 is a special case: more bytes are required to indicate the full length. 44*27162e4eSAndroid Build Coastguard WorkerEach additional byte then represents a value from 0 to 255, 45*27162e4eSAndroid Build Coastguard Workerwhich is added to the previous value to produce a total length. 46*27162e4eSAndroid Build Coastguard WorkerWhen the byte value is 255, another byte must be read and added, and so on. 47*27162e4eSAndroid Build Coastguard WorkerThere can be any number of bytes of value `255` following `token`. 48*27162e4eSAndroid Build Coastguard WorkerThe Block Format does not define any "size limit", 49*27162e4eSAndroid Build Coastguard Workerthough real implementations may feature some practical limits 50*27162e4eSAndroid Build Coastguard Worker(see more details in later chapter "Implementation Notes"). 51*27162e4eSAndroid Build Coastguard Worker 52*27162e4eSAndroid Build Coastguard WorkerNote : this format explains why a non-compressible input block is expanded by 0.4%. 53*27162e4eSAndroid Build Coastguard Worker 54*27162e4eSAndroid Build Coastguard WorkerExample 1 : A literal length of 48 will be represented as : 55*27162e4eSAndroid Build Coastguard Worker 56*27162e4eSAndroid Build Coastguard Worker - 15 : value for the 4-bits High field 57*27162e4eSAndroid Build Coastguard Worker - 33 : (=48-15) remaining length to reach 48 58*27162e4eSAndroid Build Coastguard Worker 59*27162e4eSAndroid Build Coastguard WorkerExample 2 : A literal length of 280 will be represented as : 60*27162e4eSAndroid Build Coastguard Worker 61*27162e4eSAndroid Build Coastguard Worker - 15 : value for the 4-bits High field 62*27162e4eSAndroid Build Coastguard Worker - 255 : following byte is maxed, since 280-15 >= 255 63*27162e4eSAndroid Build Coastguard Worker - 10 : (=280 - 15 - 255) remaining length to reach 280 64*27162e4eSAndroid Build Coastguard Worker 65*27162e4eSAndroid Build Coastguard WorkerExample 3 : A literal length of 15 will be represented as : 66*27162e4eSAndroid Build Coastguard Worker 67*27162e4eSAndroid Build Coastguard Worker - 15 : value for the 4-bits High field 68*27162e4eSAndroid Build Coastguard Worker - 0 : (=15-15) yes, the zero must be output 69*27162e4eSAndroid Build Coastguard Worker 70*27162e4eSAndroid Build Coastguard WorkerFollowing `token` and optional length bytes, are the literals themselves. 71*27162e4eSAndroid Build Coastguard WorkerThey are exactly as numerous as just decoded (length of literals). 72*27162e4eSAndroid Build Coastguard WorkerReminder: it's possible that there are zero literals. 73*27162e4eSAndroid Build Coastguard Worker 74*27162e4eSAndroid Build Coastguard Worker 75*27162e4eSAndroid Build Coastguard WorkerFollowing the literals is the match copy operation. 76*27162e4eSAndroid Build Coastguard Worker 77*27162e4eSAndroid Build Coastguard WorkerIt starts by the `offset` value. 78*27162e4eSAndroid Build Coastguard WorkerThis is a 2 bytes value, in little endian format 79*27162e4eSAndroid Build Coastguard Worker(the 1st byte is the "low" byte, the 2nd one is the "high" byte). 80*27162e4eSAndroid Build Coastguard Worker 81*27162e4eSAndroid Build Coastguard WorkerThe `offset` represents the position of the match to be copied from the past. 82*27162e4eSAndroid Build Coastguard WorkerFor example, 1 means "current position - 1 byte". 83*27162e4eSAndroid Build Coastguard WorkerThe maximum `offset` value is 65535. 65536 and beyond cannot be coded. 84*27162e4eSAndroid Build Coastguard WorkerNote that 0 is an invalid `offset` value. 85*27162e4eSAndroid Build Coastguard WorkerThe presence of a 0 `offset` value denotes an invalid (corrupted) block. 86*27162e4eSAndroid Build Coastguard Worker 87*27162e4eSAndroid Build Coastguard WorkerThen the `matchlength` can be extracted. 88*27162e4eSAndroid Build Coastguard WorkerFor this, we use the second `token` field, the low 4-bits. 89*27162e4eSAndroid Build Coastguard WorkerSuch a value, obviously, ranges from 0 to 15. 90*27162e4eSAndroid Build Coastguard WorkerHowever here, 0 means that the copy operation is minimal. 91*27162e4eSAndroid Build Coastguard WorkerThe minimum length of a match, called `minmatch`, is 4. 92*27162e4eSAndroid Build Coastguard WorkerAs a consequence, a 0 value means 4 bytes. 93*27162e4eSAndroid Build Coastguard WorkerSimilarly to literal length, any value smaller than 15 represents a length, 94*27162e4eSAndroid Build Coastguard Workerto which 4 (`minmatch`) must be added, thus ranging from 4 to 18. 95*27162e4eSAndroid Build Coastguard WorkerA value of 15 is special, meaning 19+ bytes, 96*27162e4eSAndroid Build Coastguard Workerto which one must read additional bytes, one at a time, 97*27162e4eSAndroid Build Coastguard Workerwith each byte value ranging from 0 to 255. 98*27162e4eSAndroid Build Coastguard WorkerThey are added to total to provide the final match length. 99*27162e4eSAndroid Build Coastguard WorkerA 255 value means there is another byte to read and add. 100*27162e4eSAndroid Build Coastguard WorkerThere is no limit to the number of optional `255` bytes that can be present, 101*27162e4eSAndroid Build Coastguard Workerand therefore no limit to representable match length, 102*27162e4eSAndroid Build Coastguard Workerthough real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below). 103*27162e4eSAndroid Build Coastguard Worker 104*27162e4eSAndroid Build Coastguard WorkerNote: this format has a maximum achievable compression ratio of about ~250. 105*27162e4eSAndroid Build Coastguard Worker 106*27162e4eSAndroid Build Coastguard WorkerDecoding the `matchlength` reaches the end of current sequence. 107*27162e4eSAndroid Build Coastguard WorkerNext byte will be the start of another sequence, and therefore a new `token`. 108*27162e4eSAndroid Build Coastguard Worker 109*27162e4eSAndroid Build Coastguard Worker 110*27162e4eSAndroid Build Coastguard WorkerEnd of block conditions 111*27162e4eSAndroid Build Coastguard Worker------------------------- 112*27162e4eSAndroid Build Coastguard WorkerThere are specific restrictions required to terminate an LZ4 block. 113*27162e4eSAndroid Build Coastguard Worker 114*27162e4eSAndroid Build Coastguard Worker1. The last sequence contains only literals. 115*27162e4eSAndroid Build Coastguard Worker The block ends right after the literals (no `offset` field). 116*27162e4eSAndroid Build Coastguard Worker2. The last 5 bytes of input are always literals. 117*27162e4eSAndroid Build Coastguard Worker Therefore, the last sequence contains at least 5 bytes. 118*27162e4eSAndroid Build Coastguard Worker - Special : if input is smaller than 5 bytes, 119*27162e4eSAndroid Build Coastguard Worker there is only one sequence, it contains the whole input as literals. 120*27162e4eSAndroid Build Coastguard Worker Even empty input can be represented, using a zero byte, 121*27162e4eSAndroid Build Coastguard Worker interpreted as a final token without literal and without a match. 122*27162e4eSAndroid Build Coastguard Worker3. The last match must start at least 12 bytes before the end of block. 123*27162e4eSAndroid Build Coastguard Worker The last match is part of the _penultimate_ sequence. 124*27162e4eSAndroid Build Coastguard Worker It is followed by the last sequence, which contains _only_ literals. 125*27162e4eSAndroid Build Coastguard Worker - Note that, as a consequence, 126*27162e4eSAndroid Build Coastguard Worker blocks < 12 bytes cannot be compressed. 127*27162e4eSAndroid Build Coastguard Worker And as an extension, _independent_ blocks < 13 bytes cannot be compressed, 128*27162e4eSAndroid Build Coastguard Worker because they must start by at least one literal, 129*27162e4eSAndroid Build Coastguard Worker that the match can then copy afterwards. 130*27162e4eSAndroid Build Coastguard Worker 131*27162e4eSAndroid Build Coastguard WorkerWhen a block does not respect these end conditions, 132*27162e4eSAndroid Build Coastguard Workera conformant decoder is allowed to reject the block as incorrect. 133*27162e4eSAndroid Build Coastguard Worker 134*27162e4eSAndroid Build Coastguard WorkerThese rules are in place to ensure compatibility with 135*27162e4eSAndroid Build Coastguard Workera wide range of historical decoders 136*27162e4eSAndroid Build Coastguard Workerwhich rely on these conditions for their speed-oriented design. 137*27162e4eSAndroid Build Coastguard Worker 138*27162e4eSAndroid Build Coastguard WorkerImplementation notes 139*27162e4eSAndroid Build Coastguard Worker----------------------- 140*27162e4eSAndroid Build Coastguard WorkerThe LZ4 Block Format only defines the compressed format, 141*27162e4eSAndroid Build Coastguard Workerit does not tell how to create a decoder or an encoder, 142*27162e4eSAndroid Build Coastguard Workerwhich design is left free to the imagination of the implementer. 143*27162e4eSAndroid Build Coastguard Worker 144*27162e4eSAndroid Build Coastguard WorkerHowever, thanks to experience, there are a number of typical topics that 145*27162e4eSAndroid Build Coastguard Workermost implementations will have to consider. 146*27162e4eSAndroid Build Coastguard WorkerThis section tries to provide a few guidelines. 147*27162e4eSAndroid Build Coastguard Worker 148*27162e4eSAndroid Build Coastguard Worker#### Metadata 149*27162e4eSAndroid Build Coastguard Worker 150*27162e4eSAndroid Build Coastguard WorkerAn LZ4-compressed Block requires additional metadata for proper decoding. 151*27162e4eSAndroid Build Coastguard WorkerTypically, a decoder will require the compressed block's size, 152*27162e4eSAndroid Build Coastguard Workerand an upper bound of decompressed size. 153*27162e4eSAndroid Build Coastguard WorkerOther variants exist, such as knowing the decompressed size, 154*27162e4eSAndroid Build Coastguard Workerand having an upper bound of the input size. 155*27162e4eSAndroid Build Coastguard WorkerThe Block Format does not specify how to transmit such information, 156*27162e4eSAndroid Build Coastguard Workerwhich is considered an out-of-band information channel. 157*27162e4eSAndroid Build Coastguard WorkerThat's because in many cases, the information is present in the environment. 158*27162e4eSAndroid Build Coastguard WorkerFor example, databases must store the size of their compressed block for indexing, 159*27162e4eSAndroid Build Coastguard Workerand know that their decompressed block can't be larger than a certain threshold. 160*27162e4eSAndroid Build Coastguard Worker 161*27162e4eSAndroid Build Coastguard WorkerIf you need a format which is "self-contained", 162*27162e4eSAndroid Build Coastguard Workerand also transports the necessary metadata for proper decoding on any platform, 163*27162e4eSAndroid Build Coastguard Workerconsider employing the [LZ4 Frame format] instead. 164*27162e4eSAndroid Build Coastguard Worker 165*27162e4eSAndroid Build Coastguard Worker#### Large lengths 166*27162e4eSAndroid Build Coastguard Worker 167*27162e4eSAndroid Build Coastguard WorkerWhile the Block Format does not define any maximum value for length fields, 168*27162e4eSAndroid Build Coastguard Workerin practice, most implementations will feature some form of limit, 169*27162e4eSAndroid Build Coastguard Workersince it's expected for such values to be stored into registers of fixed bit width. 170*27162e4eSAndroid Build Coastguard Worker 171*27162e4eSAndroid Build Coastguard WorkerIf length fields use 64-bit registers, 172*27162e4eSAndroid Build Coastguard Workerthen it can be assumed that there is no practical limit, 173*27162e4eSAndroid Build Coastguard Workeras it would require a single continuous block of multiple petabytes to reach it, 174*27162e4eSAndroid Build Coastguard Workerwhich is unreasonable by today's standard. 175*27162e4eSAndroid Build Coastguard Worker 176*27162e4eSAndroid Build Coastguard WorkerIf length fields use 32-bit registers, then it can be overflowed, 177*27162e4eSAndroid Build Coastguard Workerbut requires a compressed block of size > 16 MB. 178*27162e4eSAndroid Build Coastguard WorkerTherefore, implementations that do not deal with compressed blocks > 16 MB are safe. 179*27162e4eSAndroid Build Coastguard WorkerHowever, if such a case is allowed, 180*27162e4eSAndroid Build Coastguard Workerthen it's recommended to check that no large length overflows the register. 181*27162e4eSAndroid Build Coastguard Worker 182*27162e4eSAndroid Build Coastguard WorkerIf length fields use 16-bit registers, 183*27162e4eSAndroid Build Coastguard Workerthen it's definitely possible to overflow such register, 184*27162e4eSAndroid Build Coastguard Workerwith less than < 300 bytes of compressed data. 185*27162e4eSAndroid Build Coastguard Worker 186*27162e4eSAndroid Build Coastguard WorkerA conformant decoder should be able to detect length overflows when it's possible, 187*27162e4eSAndroid Build Coastguard Workerand simply error out when that happens. 188*27162e4eSAndroid Build Coastguard WorkerThe input block might not be invalid, 189*27162e4eSAndroid Build Coastguard Workerit's just not decodable by the local decoder implementation. 190*27162e4eSAndroid Build Coastguard Worker 191*27162e4eSAndroid Build Coastguard WorkerNote that, in order to be compatible with the larger LZ4 ecosystem, 192*27162e4eSAndroid Build Coastguard Workerit's recommended to be able to read and represent lengths of up to 4 MB, 193*27162e4eSAndroid Build Coastguard Workerand to accept blocks of size up to 4 MB. 194*27162e4eSAndroid Build Coastguard WorkerSuch limits are compatible with 32-bit length registers, 195*27162e4eSAndroid Build Coastguard Workerand prevent overflow of 32-bit registers. 196*27162e4eSAndroid Build Coastguard Worker 197*27162e4eSAndroid Build Coastguard Worker#### Safe decoding 198*27162e4eSAndroid Build Coastguard Worker 199*27162e4eSAndroid Build Coastguard WorkerIf a decoder receives compressed data from any external source, 200*27162e4eSAndroid Build Coastguard Workerit is recommended to ensure that the decoder is resilient to corrupted input, 201*27162e4eSAndroid Build Coastguard Workerand made safe from buffer overflow manipulations. 202*27162e4eSAndroid Build Coastguard WorkerAlways ensure that read and write operations 203*27162e4eSAndroid Build Coastguard Workerremain within the limits of provided buffers. 204*27162e4eSAndroid Build Coastguard Worker 205*27162e4eSAndroid Build Coastguard WorkerOf particular importance, ensure that the nb of bytes instructed to copy 206*27162e4eSAndroid Build Coastguard Workerdoes not overflow neither the input nor the output buffers. 207*27162e4eSAndroid Build Coastguard WorkerEnsure also, when reading an offset value, that the resulting position to copy 208*27162e4eSAndroid Build Coastguard Workerdoes not reach beyond the beginning of the buffer. 209*27162e4eSAndroid Build Coastguard WorkerSuch a situation can happen during the first 64 KB of decoded data. 210*27162e4eSAndroid Build Coastguard Worker 211*27162e4eSAndroid Build Coastguard WorkerFor more safety, test the decoder with fuzzers 212*27162e4eSAndroid Build Coastguard Workerto ensure it's resilient to improbable sequences of conditions. 213*27162e4eSAndroid Build Coastguard WorkerCombine them with sanitizers, in order to catch overflows (asan) 214*27162e4eSAndroid Build Coastguard Workeror initialization issues (msan). 215*27162e4eSAndroid Build Coastguard Worker 216*27162e4eSAndroid Build Coastguard WorkerPay some attention to offset 0 scenario, which is invalid, 217*27162e4eSAndroid Build Coastguard Workerand therefore must not be blindly decoded: 218*27162e4eSAndroid Build Coastguard Workera naive implementation could preserve destination buffer content, 219*27162e4eSAndroid Build Coastguard Workerwhich could then result in information disclosure 220*27162e4eSAndroid Build Coastguard Workerif such buffer was uninitialized and still containing private data. 221*27162e4eSAndroid Build Coastguard WorkerFor reference, in such a scenario, the reference LZ4 decoder 222*27162e4eSAndroid Build Coastguard Workerclears the match segment with `0` bytes, 223*27162e4eSAndroid Build Coastguard Workerthough other solutions are certainly possible. 224*27162e4eSAndroid Build Coastguard Worker 225*27162e4eSAndroid Build Coastguard WorkerFinally, pay attention to the "overlap match" scenario, 226*27162e4eSAndroid Build Coastguard Workerwhen `matchlength` is larger than `offset`. 227*27162e4eSAndroid Build Coastguard WorkerIn which case, since `match_pos + matchlength > current_pos`, 228*27162e4eSAndroid Build Coastguard Workersome of the later bytes to copy do not exist yet, 229*27162e4eSAndroid Build Coastguard Workerand will be generated during the early stage of match copy operation. 230*27162e4eSAndroid Build Coastguard WorkerSuch scenario must be handled with special care. 231*27162e4eSAndroid Build Coastguard WorkerA common case is an offset of 1, 232*27162e4eSAndroid Build Coastguard Workermeaning the last byte is repeated `matchlength` times. 233*27162e4eSAndroid Build Coastguard Worker 234*27162e4eSAndroid Build Coastguard Worker#### Compression techniques 235*27162e4eSAndroid Build Coastguard Worker 236*27162e4eSAndroid Build Coastguard WorkerThe core of a LZ4 compressor is to detect duplicated data across past 64 KB. 237*27162e4eSAndroid Build Coastguard WorkerThe format makes no assumption nor limits to the way a compressor 238*27162e4eSAndroid Build Coastguard Workersearches and selects matches within the source data block. 239*27162e4eSAndroid Build Coastguard WorkerFor example, an upper compression limit can be reached, 240*27162e4eSAndroid Build Coastguard Workerusing a technique called "full optimal parsing", at high cpu and memory cost. 241*27162e4eSAndroid Build Coastguard WorkerBut multiple other techniques can be considered, 242*27162e4eSAndroid Build Coastguard Workerfeaturing distinct time / performance trade-offs. 243*27162e4eSAndroid Build Coastguard WorkerAs long as the specified format is respected, 244*27162e4eSAndroid Build Coastguard Workerthe result will be compatible with and decodable by any compliant decoder. 245