1<html> 2<head> 3<title>An Introduction to Data Compression</title> 4</head> 5<body> 6 7<h1>LZW and GIF explained<br> 8<font size="-1">by Steve Blackstock</font> 9</h1> 10 11<p>I hope this little document will help enlighten those of you out there 12who want to know more about the Lempel-Ziv Welch (LZW) compression algorithm, and, 13specifically, the implementation that GIF uses.</p> 14 15<p>Before we start, here's a little terminology, for the purposes of this 16document:</p> 17 18<ul> 19<li> 20 <strong>character</strong>: a fundamental data element. In normal text files, this is 21just a single byte. In raster images, which is what we're interested in, it's 22an index that specifies the color of a given pixel. I'll refer to an arbitray 23character as "K". 24</li><li> 25 <strong>charstream</strong>: a stream of characters, as in a data file. 26</li><li> 27 <strong>string</strong>: a number of continuous characters, anywhere from one to very 28many characters in length. I can specify an arbitrary string as "[...]K". 29</li><li> 30 <strong>prefix</strong>: almost the same as a string, but with the implication that a 31prefix immediately precedes a character, and a prefix can have a length of 32zero. So, a prefix and a character make up a string. I will refer to an 33arbitrary prefix as "[...]". 34</li><li> 35 <strong>root</strong>: a single-character string. For most purposes, this is a 36character, but we may occasionally make a distinction. It is [...]K, where 37[...] is empty. 38</li><li> 39 <strong>code</strong>: a number, specified by a known number of bits, which maps to a 40string. 41</li><li> 42 <strong>codestream</strong>: the output stream of codes, as in the "raster data" 43</li><li> 44 <strong>entry</strong>: a code and its string. 45</li><li> 46 <strong>string table</strong>: a list of entries; usually, but not necessarily, unique. 47</li> 48</ul> 49 50<p> 51 LZW is a way of compressing data that takes advantage of repetition of 52strings in the data. Since raster data usually contains a lot of this 53repetition, LZW is a good way of compressing and decompressing it. 54 For the moment, lets consider normal LZW encoding and decoding. GIF's 55variation on the concept is just an extension from there. 56 57</p><p> 58 LZW manipulates three objects in both compression and decompression: the 59charstream, the codestream, and the string table. In compression, the 60charstream is the input and the codestream is the output. In decompression, 61the codestream is the input and the charstream is the output. The string table 62is a product of both compression and decompression, but is never passed from 63one to the other. 64 65</p><h2>Compression</h2> 66 67<p> 68 The first thing we do in LZW compression is initialize our string table. 69To do this, we need to choose a code size (how many bits) and know how many 70values our characters can possibly take. Let's say our code size is 12 bits, 71meaning we can store 0->FFF, or 4096 entries in our string table. Lets also 72say that we have 32 possible different characters. (This corresponds to, say, 73a picture in which there are 32 different colors possible for each pixel.) To 74initialize the table, we set code#0 to character#0, code #1 to character#1, 75and so on, until code#31 to character#31. Actually, we are specifying that 76each code from 0 to 31 maps to a root. There will be no more entries in the 77table that have this property. 78 79</p><p> 80 81Now we start compressing data. Let's first define something 82called the "current prefix". It's just a prefix that we'll store 83things in and compare things to now and then. I will refer to it as 84"[.c.]". Initially, the current prefix has nothing in it. Let's also 85define a "current string", which will be the current prefix plus the 86next character in the charstream. I will refer to the current string 87as "[.c.]K", where K is some character. OK, look at the first 88character in the charstream. Call it P. Make [.c.]P the current 89string. (At this point, of course, it's just the root P.) Now search 90through the string table to see if [.c.]P appears in it. Of course, it 91does now, because our string table is initialized to have all roots. 92So we don't do anything. Now make [.c.]P the current prefix. Look at 93the next character in the charstream. Call it Q. Add it to the 94current prefix to form [.c.]Q, the current string. Now search through 95the string table to see if [.c.]Q appears in it. In this case, of 96course, it doesn't. Aha! Now we get to do something. Add [.c.]Q 97(which is PQ in this case) to the string table for code#32, and output 98the code for [.c.] to the codestream. Now start over again with the 99current prefix being just the root Q. Keep adding characters to [.c.] 100to form [.c.]K, until you can't find [.c.]K in the string table. Then 101output the code for [.c.] and add [.c.]K to the string table. In 102pseudo-code, the algorithm goes something like this: 103 104</p><pre> [1] Initialize string table; 105 [2] [.c.] <- empty; 106 [3] K <- next character in charstream; 107 [4] Is [.c.]K in string table? 108 (yes: [.c.] <- [.c.]K; 109 go to [3]; 110 ) 111 (no: add [.c.]K to the string table; 112 output the code for [.c.] to the codestream; 113 [.c.] <- K; 114 go to [3]; 115 ) 116</pre> 117 118<p> 119 It's as simple as that! Of course, when you get to step [3] and there 120aren't any more characters left, you just output the code for [.c.] and throw 121the table away. You're done. 122 123</p><p> 124 Wanna do an example? Let's pretend we have a four-character alphabet: 125A,B,C,D. The charstream looks like ABACABA. Let's compress it. First, we 126initialize our string table to: #0=A, #1=B, #2=C, #3=D. The first character is 127A, which is in the string table, so [.c.] becomes A. Next we get AB, which is 128not in the table, so we output code #0 (for [.c.]), 129 and add AB to the string table as code #4. [.c.] becomes B. Next we get 130[.c.]A = BA, which is not in the string table, so output code #1, and add BA 131to the string table as code #5. [.c.] becomes A. Next we get AC, which is not 132in the string table. Output code #0, and add AC to the string table as code 133#6. Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the table. 134Output #2 for C, and add CA to table as code#7. Now [.c.] becomes A. Next we 135get AB, which IS in the string table, so [.c.] gets AB, and we look at ABA, 136which is not in the string table, so output the code for AB, which is #4, and 137add ABA to the string table as code #8. [.c.] becomes A. We can't get any more 138characters, so we just output #0 for the code for A, and we're done. So, the 139codestream is #0#1#0#2#4#0. 140 141</p><p> 142 A few words (four) should be said here about efficiency: use a hashing 143strategy. The search through the string table can be computationally 144intensive, and some hashing is well worth the effort. Also, note that 145"straight LZW" compression runs the risk of overflowing the string table - 146getting to a code which can't be represented in the number of bits you've set 147aside for codes. There are several ways of dealing with this problem, and GIF 148implements a very clever one, but we'll get to that. 149 150</p><p> 151 An important thing to notice is that, at any point during the 152compression, if [...]K is in the string table, [...] is there also. This fact 153suggests an efficient method for storing strings in the table. Rather than 154store the entire string of K's in the table, realize that any string can be 155expressed as a prefix plus a character: [...]K. If we're about to store [...]K 156in the table, we know that [...] is already there, so we can just store the 157code for [...] plus the final character K. 158 159 160</p><h2>Decompression</h2> 161 162Decompression is perhaps more 163difficult conceptually, but it is really easier to program. 164We again have to start with an initialized string 165table. This table comes from what knowledge we have about the charstream that 166we will eventually get, like what possible values the characters can take. In 167GIF files, this information is in the header as the number of possible pixel 168values. The beauty of LZW, though, is that this is all we need to know. We 169will build the rest of the string table as we decompress the codestream. The 170compression is done in such a way that we will never encounter a code in the 171codestream that we can't translate into a string. 172 173<p> 174We need to define something called a "current code", which I 175will refer to as "<code>", and an "old-code", which I will refer 176to as "<old>". To start things off, look at the first code. This 177is now <code>. This code will be in the intialized string table as 178the code for a root. Output the root to the charstream. Make this code 179the old-code <old>. *Now look at the next code, and make it 180<code>. It is possible that this code will not be in the string 181table, but let's assume for now that it is. Output the string 182corresponding to <code> to the codestream. Now find the first 183character in the string you just translated. Call this K. Add this to 184the prefix [...] generated by <old> to form a new string 185[...]K. Add this string [...]K to the string table, and set the 186old-code <old> to the current code <code>. Repeat from where I 187typed the asterisk, and you're all set. 188This is the most common case so you should understand this before going 189on. 190 191</p><p> 192 193Now let's consider the possibility that <code> is not in the 194string table, which as we will see can only occur for strings of the 195form P[...]P (for any character P). Think back to compression, and 196try to understand what happens when you have a string like 197P[...]P[...]PQ appear in the charstream. Suppose P[...] is already in 198the string table, but P[...]P is not. The compressor will parse out 199P[...], and find that P[...]P is not in the string table. It will 200output the code for P[...], and add P[...]P to the string table. Then 201it will get up to P[...]P for the next string, and find that P[...]P 202is in the table, as the code just added. So it will output the code 203for P[...]P if it finds that P[...]PQ is not in the table. The 204decompressor is always "one step behind" the compressor. When the 205decompressor sees the code for P[...]P, it will not have added that 206code to it's string table yet because it needed the beginning 207character of P[...]P to add to the string for the last code, P[...], 208to form the code for P[...]P. However, when a decompressor finds a 209code that it doesn't know yet, it will always be the very next one to 210be added to the string table. So it can guess at what the string for 211the code should be, and, in fact, it will always be correct. If I am a 212decompressor, and I see code#124, and yet my string table has entries 213only up to code#123, I can figure out what code#124 must be, add it to 214my string table, and output the string. If code#123 generated the 215string [...], which I will refer to here as a prefix, then code#124, 216in this special case, will be [...] plus the first character of [...]. 217So just add the first character of [...] to the end of itself. Not 218too bad. 219 220 221</p><p> 222 223As an example (and a very common one) of this special case, let's 224assume we have a raster image in which the first three pixels have the 225same color value. That is, my charstream looks like: QQQ.... For the 226sake of argument, let's say we have 32 colors, and Q is the 227color#12. The compressor will generate the code sequence 22812,32,.... (if you don't know why, take a minute to understand it.) 229Remember that #32 is not in the initial table, which goes from #0 to 230#31. The decompressor will see #12 and translate it just fine as color 231Q. Then it will see #32 and not yet know what that means. But if it 232thinks about it long enough, it can figure out that QQ should be 233entry#32 in the table and QQ should be the next string output. So the 234decompression pseudo-code goes something like: 235 236</p><pre> [1] Initialize string table; 237 [2] get first code: <code> 238 [3] output the string for <code> to the charstream; 239 [4] <old> = <code> 240 [5] <code> <- next code in codestream; 241 [6] does <code> exist in the string table? 242 (yes: output the string for <code> to the charstream; 243 [...] <- translation for <old> 244 K <- first character of translation for <code> 245 add [...]K to the string table; 246 <old> <- <code> 247 ) 248 (no: [...] <- translation for <old> 249 K <- first character of [...]; 250 output [...]K to charstream and add it to string table; 251 <old> <- <code> 252 ) 253 [7] go to [5]; 254</pre> 255 256<p> 257 Again, when you get to step [5] and there are no more codes, you're 258finished. Outputting of strings, and finding of initial characters in strings 259are efficiency problems all to themselves, but I'm not going to suggest ways 260to do them here. Half the fun of programming is figuring these things out! 261 262</p><h2>GIF variation</h2> 263 264<p> 265 266Now for the GIF variations on the theme. In part of the header of a 267GIF file, there is a field, in the Raster Data stream, called "code 268size". This is a very misleading name for the field, but we have to 269live with it. What it is really is the "root size". The actual size, 270in bits, of the compression codes actually changes during 271compression/decompression, and I will refer to that size here as the 272"compression size". The initial table is just the codes for all the 273roots, as usual, but two special codes are added on top of those. The 274"code size" N is set to max(2,bits-per-pixel). In the table the roots 275take up slots #0 through #(2**N-1), and the special codes are (2**N) 276and (2**N + 1). The initial compression size will be N+1 bits per 277code. If you're encoding, you output the codes (N+1) bits at a time to 278start with, and if you're decoding, you grab (N+1) bits from the 279codestream at a time. As for the special codes: <CC> or the clear 280code, is (2**N), and <EOI>, or end-of-information, is (2**N + 2811). <CC> tells the compressor to re- initialize the string table, 282and to reset the compression size to (N+1). <EOI> means there's no 283more in the codestream. 284 285</p><p> 286 287If you're encoding or decoding, you should 288start adding things to the string table at <CC> + 2. If you're 289encoding, you should output <CC> as the very first code, and then 290whenever after that you reach code #4095 (hex FFF), because GIF does 291not allow compression sizes to be greater than 12 bits. If you're 292decoding, you should reinitialize your string table when you observe 293<CC>. The variable compression sizes are really no big deal. If 294you're encoding, you start with a compression size of (N+1) bits, and, 295whenever you output the code (2**(compression size)-1), you bump the 296compression size up one bit. So the next code you output will be one 297bit longer. Remember that the largest compression size is 12 bits, 298corresponding to a code of 4095. If you get that far, you must output 299<CC> as the next code, and start over. If you're decoding, you 300must increase your compression size AS SOON AS YOU write entry 301#(2**(compression size) - 1) to the string table. The next code you 302READ will be one bit longer. Don't make the mistake of waiting until 303you need to add the code (2**compression size) to the table. You'll 304have already missed a bit from the last code. The packaging of codes 305into a bitsream for the raster data is also a potential stumbling 306block for the novice encoder or decoder. The lowest order bit in the 307code should coincide with the lowest available bit in the first 308available byte in the codestream. For example, if you're starting with 3095-bit compression codes, and your first three codes are, say, 310<abcde>, <fghij>, <klmno>, where e, j, and o are bit#0, 311then your codestream will start off like: 312 313</p><pre> byte#0: hijabcde 314 byte#1: .klmnofg 315</pre> 316 317<p> 318 319 So the differences between straight LZW and GIF LZW are: two additional 320special codes and variable compression sizes. If you understand LZW, and you 321understand those variations, you understand it all! 322 323</p><p> 324 Just as sort of a P.S., you may have noticed that a compressor has a 325little bit of flexibility at compression time. I specified a "greedy" approach 326to the compression, grabbing as many characters as possible before outputting 327codes. This is, in fact, the standard LZW way of doing things, and it will 328yield the best compression ratio. But there's no rule saying you can't stop 329anywhere along the line and just output the code for the current prefix, 330whether it's already in the table or not, and add that string plus the next 331character to the string table. There are various reasons for wanting to do 332this, especially if the strings get extremely long and make hashing difficult. 333If you need to, do it. 334 335</p><p> 336 Hope this helps out.----steve blackstock 337 338</p><h3>Further information</h3> 339 340The original paper that describes the LZW algorithm is: 341 342<blockquote> 343Terry A. Welch. 344A Technique for High Performance Data Compression. 345IEEE Computer, Vol. 17, No. 6, 1984, pp. 8-19. 346</blockquote> 347 348The GIF format is described in more detail in the 349<a href="gif87.txt">GIF87(5) - GIF 87</a> and 350<a href="gif89.txt">GIF89a(5) - GIF 89a</a> standards. 351</body> 352</html> 353