1*324bb76bSAndroid Build Coastguard Worker<html> 2*324bb76bSAndroid Build Coastguard Worker<head> 3*324bb76bSAndroid Build Coastguard Worker<title>An Introduction to Data Compression</title> 4*324bb76bSAndroid Build Coastguard Worker</head> 5*324bb76bSAndroid Build Coastguard Worker<body> 6*324bb76bSAndroid Build Coastguard Worker 7*324bb76bSAndroid Build Coastguard Worker<h1>LZW and GIF explained<br> 8*324bb76bSAndroid Build Coastguard Worker<font size="-1">by Steve Blackstock</font> 9*324bb76bSAndroid Build Coastguard Worker</h1> 10*324bb76bSAndroid Build Coastguard Worker 11*324bb76bSAndroid Build Coastguard Worker<p>I hope this little document will help enlighten those of you out there 12*324bb76bSAndroid Build Coastguard Workerwho want to know more about the Lempel-Ziv Welch (LZW) compression algorithm, and, 13*324bb76bSAndroid Build Coastguard Workerspecifically, the implementation that GIF uses.</p> 14*324bb76bSAndroid Build Coastguard Worker 15*324bb76bSAndroid Build Coastguard Worker<p>Before we start, here's a little terminology, for the purposes of this 16*324bb76bSAndroid Build Coastguard Workerdocument:</p> 17*324bb76bSAndroid Build Coastguard Worker 18*324bb76bSAndroid Build Coastguard Worker<ul> 19*324bb76bSAndroid Build Coastguard Worker<li> 20*324bb76bSAndroid Build Coastguard Worker <strong>character</strong>: a fundamental data element. In normal text files, this is 21*324bb76bSAndroid Build Coastguard Workerjust a single byte. In raster images, which is what we're interested in, it's 22*324bb76bSAndroid Build Coastguard Workeran index that specifies the color of a given pixel. I'll refer to an arbitray 23*324bb76bSAndroid Build Coastguard Workercharacter as "K". 24*324bb76bSAndroid Build Coastguard Worker</li><li> 25*324bb76bSAndroid Build Coastguard Worker <strong>charstream</strong>: a stream of characters, as in a data file. 26*324bb76bSAndroid Build Coastguard Worker</li><li> 27*324bb76bSAndroid Build Coastguard Worker <strong>string</strong>: a number of continuous characters, anywhere from one to very 28*324bb76bSAndroid Build Coastguard Workermany characters in length. I can specify an arbitrary string as "[...]K". 29*324bb76bSAndroid Build Coastguard Worker</li><li> 30*324bb76bSAndroid Build Coastguard Worker <strong>prefix</strong>: almost the same as a string, but with the implication that a 31*324bb76bSAndroid Build Coastguard Workerprefix immediately precedes a character, and a prefix can have a length of 32*324bb76bSAndroid Build Coastguard Workerzero. So, a prefix and a character make up a string. I will refer to an 33*324bb76bSAndroid Build Coastguard Workerarbitrary prefix as "[...]". 34*324bb76bSAndroid Build Coastguard Worker</li><li> 35*324bb76bSAndroid Build Coastguard Worker <strong>root</strong>: a single-character string. For most purposes, this is a 36*324bb76bSAndroid Build Coastguard Workercharacter, but we may occasionally make a distinction. It is [...]K, where 37*324bb76bSAndroid Build Coastguard Worker[...] is empty. 38*324bb76bSAndroid Build Coastguard Worker</li><li> 39*324bb76bSAndroid Build Coastguard Worker <strong>code</strong>: a number, specified by a known number of bits, which maps to a 40*324bb76bSAndroid Build Coastguard Workerstring. 41*324bb76bSAndroid Build Coastguard Worker</li><li> 42*324bb76bSAndroid Build Coastguard Worker <strong>codestream</strong>: the output stream of codes, as in the "raster data" 43*324bb76bSAndroid Build Coastguard Worker</li><li> 44*324bb76bSAndroid Build Coastguard Worker <strong>entry</strong>: a code and its string. 45*324bb76bSAndroid Build Coastguard Worker</li><li> 46*324bb76bSAndroid Build Coastguard Worker <strong>string table</strong>: a list of entries; usually, but not necessarily, unique. 47*324bb76bSAndroid Build Coastguard Worker</li> 48*324bb76bSAndroid Build Coastguard Worker</ul> 49*324bb76bSAndroid Build Coastguard Worker 50*324bb76bSAndroid Build Coastguard Worker<p> 51*324bb76bSAndroid Build Coastguard Worker LZW is a way of compressing data that takes advantage of repetition of 52*324bb76bSAndroid Build Coastguard Workerstrings in the data. Since raster data usually contains a lot of this 53*324bb76bSAndroid Build Coastguard Workerrepetition, LZW is a good way of compressing and decompressing it. 54*324bb76bSAndroid Build Coastguard Worker For the moment, lets consider normal LZW encoding and decoding. GIF's 55*324bb76bSAndroid Build Coastguard Workervariation on the concept is just an extension from there. 56*324bb76bSAndroid Build Coastguard Worker 57*324bb76bSAndroid Build Coastguard Worker</p><p> 58*324bb76bSAndroid Build Coastguard Worker LZW manipulates three objects in both compression and decompression: the 59*324bb76bSAndroid Build Coastguard Workercharstream, the codestream, and the string table. In compression, the 60*324bb76bSAndroid Build Coastguard Workercharstream is the input and the codestream is the output. In decompression, 61*324bb76bSAndroid Build Coastguard Workerthe codestream is the input and the charstream is the output. The string table 62*324bb76bSAndroid Build Coastguard Workeris a product of both compression and decompression, but is never passed from 63*324bb76bSAndroid Build Coastguard Workerone to the other. 64*324bb76bSAndroid Build Coastguard Worker 65*324bb76bSAndroid Build Coastguard Worker</p><h2>Compression</h2> 66*324bb76bSAndroid Build Coastguard Worker 67*324bb76bSAndroid Build Coastguard Worker<p> 68*324bb76bSAndroid Build Coastguard Worker The first thing we do in LZW compression is initialize our string table. 69*324bb76bSAndroid Build Coastguard WorkerTo do this, we need to choose a code size (how many bits) and know how many 70*324bb76bSAndroid Build Coastguard Workervalues our characters can possibly take. Let's say our code size is 12 bits, 71*324bb76bSAndroid Build Coastguard Workermeaning we can store 0->FFF, or 4096 entries in our string table. Lets also 72*324bb76bSAndroid Build Coastguard Workersay that we have 32 possible different characters. (This corresponds to, say, 73*324bb76bSAndroid Build Coastguard Workera picture in which there are 32 different colors possible for each pixel.) To 74*324bb76bSAndroid Build Coastguard Workerinitialize the table, we set code#0 to character#0, code #1 to character#1, 75*324bb76bSAndroid Build Coastguard Workerand so on, until code#31 to character#31. Actually, we are specifying that 76*324bb76bSAndroid Build Coastguard Workereach code from 0 to 31 maps to a root. There will be no more entries in the 77*324bb76bSAndroid Build Coastguard Workertable that have this property. 78*324bb76bSAndroid Build Coastguard Worker 79*324bb76bSAndroid Build Coastguard Worker</p><p> 80*324bb76bSAndroid Build Coastguard Worker 81*324bb76bSAndroid Build Coastguard WorkerNow we start compressing data. Let's first define something 82*324bb76bSAndroid Build Coastguard Workercalled the "current prefix". It's just a prefix that we'll store 83*324bb76bSAndroid Build Coastguard Workerthings in and compare things to now and then. I will refer to it as 84*324bb76bSAndroid Build Coastguard Worker"[.c.]". Initially, the current prefix has nothing in it. Let's also 85*324bb76bSAndroid Build Coastguard Workerdefine a "current string", which will be the current prefix plus the 86*324bb76bSAndroid Build Coastguard Workernext character in the charstream. I will refer to the current string 87*324bb76bSAndroid Build Coastguard Workeras "[.c.]K", where K is some character. OK, look at the first 88*324bb76bSAndroid Build Coastguard Workercharacter in the charstream. Call it P. Make [.c.]P the current 89*324bb76bSAndroid Build Coastguard Workerstring. (At this point, of course, it's just the root P.) Now search 90*324bb76bSAndroid Build Coastguard Workerthrough the string table to see if [.c.]P appears in it. Of course, it 91*324bb76bSAndroid Build Coastguard Workerdoes now, because our string table is initialized to have all roots. 92*324bb76bSAndroid Build Coastguard WorkerSo we don't do anything. Now make [.c.]P the current prefix. Look at 93*324bb76bSAndroid Build Coastguard Workerthe next character in the charstream. Call it Q. Add it to the 94*324bb76bSAndroid Build Coastguard Workercurrent prefix to form [.c.]Q, the current string. Now search through 95*324bb76bSAndroid Build Coastguard Workerthe string table to see if [.c.]Q appears in it. In this case, of 96*324bb76bSAndroid Build Coastguard Workercourse, it doesn't. Aha! Now we get to do something. Add [.c.]Q 97*324bb76bSAndroid Build Coastguard Worker(which is PQ in this case) to the string table for code#32, and output 98*324bb76bSAndroid Build Coastguard Workerthe code for [.c.] to the codestream. Now start over again with the 99*324bb76bSAndroid Build Coastguard Workercurrent prefix being just the root Q. Keep adding characters to [.c.] 100*324bb76bSAndroid Build Coastguard Workerto form [.c.]K, until you can't find [.c.]K in the string table. Then 101*324bb76bSAndroid Build Coastguard Workeroutput the code for [.c.] and add [.c.]K to the string table. In 102*324bb76bSAndroid Build Coastguard Workerpseudo-code, the algorithm goes something like this: 103*324bb76bSAndroid Build Coastguard Worker 104*324bb76bSAndroid Build Coastguard Worker</p><pre> [1] Initialize string table; 105*324bb76bSAndroid Build Coastguard Worker [2] [.c.] <- empty; 106*324bb76bSAndroid Build Coastguard Worker [3] K <- next character in charstream; 107*324bb76bSAndroid Build Coastguard Worker [4] Is [.c.]K in string table? 108*324bb76bSAndroid Build Coastguard Worker (yes: [.c.] <- [.c.]K; 109*324bb76bSAndroid Build Coastguard Worker go to [3]; 110*324bb76bSAndroid Build Coastguard Worker ) 111*324bb76bSAndroid Build Coastguard Worker (no: add [.c.]K to the string table; 112*324bb76bSAndroid Build Coastguard Worker output the code for [.c.] to the codestream; 113*324bb76bSAndroid Build Coastguard Worker [.c.] <- K; 114*324bb76bSAndroid Build Coastguard Worker go to [3]; 115*324bb76bSAndroid Build Coastguard Worker ) 116*324bb76bSAndroid Build Coastguard Worker</pre> 117*324bb76bSAndroid Build Coastguard Worker 118*324bb76bSAndroid Build Coastguard Worker<p> 119*324bb76bSAndroid Build Coastguard Worker It's as simple as that! Of course, when you get to step [3] and there 120*324bb76bSAndroid Build Coastguard Workeraren't any more characters left, you just output the code for [.c.] and throw 121*324bb76bSAndroid Build Coastguard Workerthe table away. You're done. 122*324bb76bSAndroid Build Coastguard Worker 123*324bb76bSAndroid Build Coastguard Worker</p><p> 124*324bb76bSAndroid Build Coastguard Worker Wanna do an example? Let's pretend we have a four-character alphabet: 125*324bb76bSAndroid Build Coastguard WorkerA,B,C,D. The charstream looks like ABACABA. Let's compress it. First, we 126*324bb76bSAndroid Build Coastguard Workerinitialize our string table to: #0=A, #1=B, #2=C, #3=D. The first character is 127*324bb76bSAndroid Build Coastguard WorkerA, which is in the string table, so [.c.] becomes A. Next we get AB, which is 128*324bb76bSAndroid Build Coastguard Workernot in the table, so we output code #0 (for [.c.]), 129*324bb76bSAndroid Build Coastguard Worker and add AB to the string table as code #4. [.c.] becomes B. Next we get 130*324bb76bSAndroid Build Coastguard Worker[.c.]A = BA, which is not in the string table, so output code #1, and add BA 131*324bb76bSAndroid Build Coastguard Workerto the string table as code #5. [.c.] becomes A. Next we get AC, which is not 132*324bb76bSAndroid Build Coastguard Workerin the string table. Output code #0, and add AC to the string table as code 133*324bb76bSAndroid Build Coastguard Worker#6. Now [.c.] becomes C. Next we get [.c.]A = CA, which is not in the table. 134*324bb76bSAndroid Build Coastguard WorkerOutput #2 for C, and add CA to table as code#7. Now [.c.] becomes A. Next we 135*324bb76bSAndroid Build Coastguard Workerget AB, which IS in the string table, so [.c.] gets AB, and we look at ABA, 136*324bb76bSAndroid Build Coastguard Workerwhich is not in the string table, so output the code for AB, which is #4, and 137*324bb76bSAndroid Build Coastguard Workeradd ABA to the string table as code #8. [.c.] becomes A. We can't get any more 138*324bb76bSAndroid Build Coastguard Workercharacters, so we just output #0 for the code for A, and we're done. So, the 139*324bb76bSAndroid Build Coastguard Workercodestream is #0#1#0#2#4#0. 140*324bb76bSAndroid Build Coastguard Worker 141*324bb76bSAndroid Build Coastguard Worker</p><p> 142*324bb76bSAndroid Build Coastguard Worker A few words (four) should be said here about efficiency: use a hashing 143*324bb76bSAndroid Build Coastguard Workerstrategy. The search through the string table can be computationally 144*324bb76bSAndroid Build Coastguard Workerintensive, and some hashing is well worth the effort. Also, note that 145*324bb76bSAndroid Build Coastguard Worker"straight LZW" compression runs the risk of overflowing the string table - 146*324bb76bSAndroid Build Coastguard Workergetting to a code which can't be represented in the number of bits you've set 147*324bb76bSAndroid Build Coastguard Workeraside for codes. There are several ways of dealing with this problem, and GIF 148*324bb76bSAndroid Build Coastguard Workerimplements a very clever one, but we'll get to that. 149*324bb76bSAndroid Build Coastguard Worker 150*324bb76bSAndroid Build Coastguard Worker</p><p> 151*324bb76bSAndroid Build Coastguard Worker An important thing to notice is that, at any point during the 152*324bb76bSAndroid Build Coastguard Workercompression, if [...]K is in the string table, [...] is there also. This fact 153*324bb76bSAndroid Build Coastguard Workersuggests an efficient method for storing strings in the table. Rather than 154*324bb76bSAndroid Build Coastguard Workerstore the entire string of K's in the table, realize that any string can be 155*324bb76bSAndroid Build Coastguard Workerexpressed as a prefix plus a character: [...]K. If we're about to store [...]K 156*324bb76bSAndroid Build Coastguard Workerin the table, we know that [...] is already there, so we can just store the 157*324bb76bSAndroid Build Coastguard Workercode for [...] plus the final character K. 158*324bb76bSAndroid Build Coastguard Worker 159*324bb76bSAndroid Build Coastguard Worker 160*324bb76bSAndroid Build Coastguard Worker</p><h2>Decompression</h2> 161*324bb76bSAndroid Build Coastguard Worker 162*324bb76bSAndroid Build Coastguard WorkerDecompression is perhaps more 163*324bb76bSAndroid Build Coastguard Workerdifficult conceptually, but it is really easier to program. 164*324bb76bSAndroid Build Coastguard WorkerWe again have to start with an initialized string 165*324bb76bSAndroid Build Coastguard Workertable. This table comes from what knowledge we have about the charstream that 166*324bb76bSAndroid Build Coastguard Workerwe will eventually get, like what possible values the characters can take. In 167*324bb76bSAndroid Build Coastguard WorkerGIF files, this information is in the header as the number of possible pixel 168*324bb76bSAndroid Build Coastguard Workervalues. The beauty of LZW, though, is that this is all we need to know. We 169*324bb76bSAndroid Build Coastguard Workerwill build the rest of the string table as we decompress the codestream. The 170*324bb76bSAndroid Build Coastguard Workercompression is done in such a way that we will never encounter a code in the 171*324bb76bSAndroid Build Coastguard Workercodestream that we can't translate into a string. 172*324bb76bSAndroid Build Coastguard Worker 173*324bb76bSAndroid Build Coastguard Worker<p> 174*324bb76bSAndroid Build Coastguard WorkerWe need to define something called a "current code", which I 175*324bb76bSAndroid Build Coastguard Workerwill refer to as "<code>", and an "old-code", which I will refer 176*324bb76bSAndroid Build Coastguard Workerto as "<old>". To start things off, look at the first code. This 177*324bb76bSAndroid Build Coastguard Workeris now <code>. This code will be in the intialized string table as 178*324bb76bSAndroid Build Coastguard Workerthe code for a root. Output the root to the charstream. Make this code 179*324bb76bSAndroid Build Coastguard Workerthe old-code <old>. *Now look at the next code, and make it 180*324bb76bSAndroid Build Coastguard Worker<code>. It is possible that this code will not be in the string 181*324bb76bSAndroid Build Coastguard Workertable, but let's assume for now that it is. Output the string 182*324bb76bSAndroid Build Coastguard Workercorresponding to <code> to the codestream. Now find the first 183*324bb76bSAndroid Build Coastguard Workercharacter in the string you just translated. Call this K. Add this to 184*324bb76bSAndroid Build Coastguard Workerthe prefix [...] generated by <old> to form a new string 185*324bb76bSAndroid Build Coastguard Worker[...]K. Add this string [...]K to the string table, and set the 186*324bb76bSAndroid Build Coastguard Workerold-code <old> to the current code <code>. Repeat from where I 187*324bb76bSAndroid Build Coastguard Workertyped the asterisk, and you're all set. 188*324bb76bSAndroid Build Coastguard WorkerThis is the most common case so you should understand this before going 189*324bb76bSAndroid Build Coastguard Workeron. 190*324bb76bSAndroid Build Coastguard Worker 191*324bb76bSAndroid Build Coastguard Worker</p><p> 192*324bb76bSAndroid Build Coastguard Worker 193*324bb76bSAndroid Build Coastguard WorkerNow let's consider the possibility that <code> is not in the 194*324bb76bSAndroid Build Coastguard Workerstring table, which as we will see can only occur for strings of the 195*324bb76bSAndroid Build Coastguard Workerform P[...]P (for any character P). Think back to compression, and 196*324bb76bSAndroid Build Coastguard Workertry to understand what happens when you have a string like 197*324bb76bSAndroid Build Coastguard WorkerP[...]P[...]PQ appear in the charstream. Suppose P[...] is already in 198*324bb76bSAndroid Build Coastguard Workerthe string table, but P[...]P is not. The compressor will parse out 199*324bb76bSAndroid Build Coastguard WorkerP[...], and find that P[...]P is not in the string table. It will 200*324bb76bSAndroid Build Coastguard Workeroutput the code for P[...], and add P[...]P to the string table. Then 201*324bb76bSAndroid Build Coastguard Workerit will get up to P[...]P for the next string, and find that P[...]P 202*324bb76bSAndroid Build Coastguard Workeris in the table, as the code just added. So it will output the code 203*324bb76bSAndroid Build Coastguard Workerfor P[...]P if it finds that P[...]PQ is not in the table. The 204*324bb76bSAndroid Build Coastguard Workerdecompressor is always "one step behind" the compressor. When the 205*324bb76bSAndroid Build Coastguard Workerdecompressor sees the code for P[...]P, it will not have added that 206*324bb76bSAndroid Build Coastguard Workercode to it's string table yet because it needed the beginning 207*324bb76bSAndroid Build Coastguard Workercharacter of P[...]P to add to the string for the last code, P[...], 208*324bb76bSAndroid Build Coastguard Workerto form the code for P[...]P. However, when a decompressor finds a 209*324bb76bSAndroid Build Coastguard Workercode that it doesn't know yet, it will always be the very next one to 210*324bb76bSAndroid Build Coastguard Workerbe added to the string table. So it can guess at what the string for 211*324bb76bSAndroid Build Coastguard Workerthe code should be, and, in fact, it will always be correct. If I am a 212*324bb76bSAndroid Build Coastguard Workerdecompressor, and I see code#124, and yet my string table has entries 213*324bb76bSAndroid Build Coastguard Workeronly up to code#123, I can figure out what code#124 must be, add it to 214*324bb76bSAndroid Build Coastguard Workermy string table, and output the string. If code#123 generated the 215*324bb76bSAndroid Build Coastguard Workerstring [...], which I will refer to here as a prefix, then code#124, 216*324bb76bSAndroid Build Coastguard Workerin this special case, will be [...] plus the first character of [...]. 217*324bb76bSAndroid Build Coastguard WorkerSo just add the first character of [...] to the end of itself. Not 218*324bb76bSAndroid Build Coastguard Workertoo bad. 219*324bb76bSAndroid Build Coastguard Worker 220*324bb76bSAndroid Build Coastguard Worker 221*324bb76bSAndroid Build Coastguard Worker</p><p> 222*324bb76bSAndroid Build Coastguard Worker 223*324bb76bSAndroid Build Coastguard WorkerAs an example (and a very common one) of this special case, let's 224*324bb76bSAndroid Build Coastguard Workerassume we have a raster image in which the first three pixels have the 225*324bb76bSAndroid Build Coastguard Workersame color value. That is, my charstream looks like: QQQ.... For the 226*324bb76bSAndroid Build Coastguard Workersake of argument, let's say we have 32 colors, and Q is the 227*324bb76bSAndroid Build Coastguard Workercolor#12. The compressor will generate the code sequence 228*324bb76bSAndroid Build Coastguard Worker12,32,.... (if you don't know why, take a minute to understand it.) 229*324bb76bSAndroid Build Coastguard WorkerRemember that #32 is not in the initial table, which goes from #0 to 230*324bb76bSAndroid Build Coastguard Worker#31. The decompressor will see #12 and translate it just fine as color 231*324bb76bSAndroid Build Coastguard WorkerQ. Then it will see #32 and not yet know what that means. But if it 232*324bb76bSAndroid Build Coastguard Workerthinks about it long enough, it can figure out that QQ should be 233*324bb76bSAndroid Build Coastguard Workerentry#32 in the table and QQ should be the next string output. So the 234*324bb76bSAndroid Build Coastguard Workerdecompression pseudo-code goes something like: 235*324bb76bSAndroid Build Coastguard Worker 236*324bb76bSAndroid Build Coastguard Worker</p><pre> [1] Initialize string table; 237*324bb76bSAndroid Build Coastguard Worker [2] get first code: <code> 238*324bb76bSAndroid Build Coastguard Worker [3] output the string for <code> to the charstream; 239*324bb76bSAndroid Build Coastguard Worker [4] <old> = <code> 240*324bb76bSAndroid Build Coastguard Worker [5] <code> <- next code in codestream; 241*324bb76bSAndroid Build Coastguard Worker [6] does <code> exist in the string table? 242*324bb76bSAndroid Build Coastguard Worker (yes: output the string for <code> to the charstream; 243*324bb76bSAndroid Build Coastguard Worker [...] <- translation for <old> 244*324bb76bSAndroid Build Coastguard Worker K <- first character of translation for <code> 245*324bb76bSAndroid Build Coastguard Worker add [...]K to the string table; 246*324bb76bSAndroid Build Coastguard Worker <old> <- <code> 247*324bb76bSAndroid Build Coastguard Worker ) 248*324bb76bSAndroid Build Coastguard Worker (no: [...] <- translation for <old> 249*324bb76bSAndroid Build Coastguard Worker K <- first character of [...]; 250*324bb76bSAndroid Build Coastguard Worker output [...]K to charstream and add it to string table; 251*324bb76bSAndroid Build Coastguard Worker <old> <- <code> 252*324bb76bSAndroid Build Coastguard Worker ) 253*324bb76bSAndroid Build Coastguard Worker [7] go to [5]; 254*324bb76bSAndroid Build Coastguard Worker</pre> 255*324bb76bSAndroid Build Coastguard Worker 256*324bb76bSAndroid Build Coastguard Worker<p> 257*324bb76bSAndroid Build Coastguard Worker Again, when you get to step [5] and there are no more codes, you're 258*324bb76bSAndroid Build Coastguard Workerfinished. Outputting of strings, and finding of initial characters in strings 259*324bb76bSAndroid Build Coastguard Workerare efficiency problems all to themselves, but I'm not going to suggest ways 260*324bb76bSAndroid Build Coastguard Workerto do them here. Half the fun of programming is figuring these things out! 261*324bb76bSAndroid Build Coastguard Worker 262*324bb76bSAndroid Build Coastguard Worker</p><h2>GIF variation</h2> 263*324bb76bSAndroid Build Coastguard Worker 264*324bb76bSAndroid Build Coastguard Worker<p> 265*324bb76bSAndroid Build Coastguard Worker 266*324bb76bSAndroid Build Coastguard WorkerNow for the GIF variations on the theme. In part of the header of a 267*324bb76bSAndroid Build Coastguard WorkerGIF file, there is a field, in the Raster Data stream, called "code 268*324bb76bSAndroid Build Coastguard Workersize". This is a very misleading name for the field, but we have to 269*324bb76bSAndroid Build Coastguard Workerlive with it. What it is really is the "root size". The actual size, 270*324bb76bSAndroid Build Coastguard Workerin bits, of the compression codes actually changes during 271*324bb76bSAndroid Build Coastguard Workercompression/decompression, and I will refer to that size here as the 272*324bb76bSAndroid Build Coastguard Worker"compression size". The initial table is just the codes for all the 273*324bb76bSAndroid Build Coastguard Workerroots, as usual, but two special codes are added on top of those. The 274*324bb76bSAndroid Build Coastguard Worker"code size" N is set to max(2,bits-per-pixel). In the table the roots 275*324bb76bSAndroid Build Coastguard Workertake up slots #0 through #(2**N-1), and the special codes are (2**N) 276*324bb76bSAndroid Build Coastguard Workerand (2**N + 1). The initial compression size will be N+1 bits per 277*324bb76bSAndroid Build Coastguard Workercode. If you're encoding, you output the codes (N+1) bits at a time to 278*324bb76bSAndroid Build Coastguard Workerstart with, and if you're decoding, you grab (N+1) bits from the 279*324bb76bSAndroid Build Coastguard Workercodestream at a time. As for the special codes: <CC> or the clear 280*324bb76bSAndroid Build Coastguard Workercode, is (2**N), and <EOI>, or end-of-information, is (2**N + 281*324bb76bSAndroid Build Coastguard Worker1). <CC> tells the compressor to re- initialize the string table, 282*324bb76bSAndroid Build Coastguard Workerand to reset the compression size to (N+1). <EOI> means there's no 283*324bb76bSAndroid Build Coastguard Workermore in the codestream. 284*324bb76bSAndroid Build Coastguard Worker 285*324bb76bSAndroid Build Coastguard Worker</p><p> 286*324bb76bSAndroid Build Coastguard Worker 287*324bb76bSAndroid Build Coastguard WorkerIf you're encoding or decoding, you should 288*324bb76bSAndroid Build Coastguard Workerstart adding things to the string table at <CC> + 2. If you're 289*324bb76bSAndroid Build Coastguard Workerencoding, you should output <CC> as the very first code, and then 290*324bb76bSAndroid Build Coastguard Workerwhenever after that you reach code #4095 (hex FFF), because GIF does 291*324bb76bSAndroid Build Coastguard Workernot allow compression sizes to be greater than 12 bits. If you're 292*324bb76bSAndroid Build Coastguard Workerdecoding, you should reinitialize your string table when you observe 293*324bb76bSAndroid Build Coastguard Worker<CC>. The variable compression sizes are really no big deal. If 294*324bb76bSAndroid Build Coastguard Workeryou're encoding, you start with a compression size of (N+1) bits, and, 295*324bb76bSAndroid Build Coastguard Workerwhenever you output the code (2**(compression size)-1), you bump the 296*324bb76bSAndroid Build Coastguard Workercompression size up one bit. So the next code you output will be one 297*324bb76bSAndroid Build Coastguard Workerbit longer. Remember that the largest compression size is 12 bits, 298*324bb76bSAndroid Build Coastguard Workercorresponding to a code of 4095. If you get that far, you must output 299*324bb76bSAndroid Build Coastguard Worker<CC> as the next code, and start over. If you're decoding, you 300*324bb76bSAndroid Build Coastguard Workermust increase your compression size AS SOON AS YOU write entry 301*324bb76bSAndroid Build Coastguard Worker#(2**(compression size) - 1) to the string table. The next code you 302*324bb76bSAndroid Build Coastguard WorkerREAD will be one bit longer. Don't make the mistake of waiting until 303*324bb76bSAndroid Build Coastguard Workeryou need to add the code (2**compression size) to the table. You'll 304*324bb76bSAndroid Build Coastguard Workerhave already missed a bit from the last code. The packaging of codes 305*324bb76bSAndroid Build Coastguard Workerinto a bitsream for the raster data is also a potential stumbling 306*324bb76bSAndroid Build Coastguard Workerblock for the novice encoder or decoder. The lowest order bit in the 307*324bb76bSAndroid Build Coastguard Workercode should coincide with the lowest available bit in the first 308*324bb76bSAndroid Build Coastguard Workeravailable byte in the codestream. For example, if you're starting with 309*324bb76bSAndroid Build Coastguard Worker5-bit compression codes, and your first three codes are, say, 310*324bb76bSAndroid Build Coastguard Worker<abcde>, <fghij>, <klmno>, where e, j, and o are bit#0, 311*324bb76bSAndroid Build Coastguard Workerthen your codestream will start off like: 312*324bb76bSAndroid Build Coastguard Worker 313*324bb76bSAndroid Build Coastguard Worker</p><pre> byte#0: hijabcde 314*324bb76bSAndroid Build Coastguard Worker byte#1: .klmnofg 315*324bb76bSAndroid Build Coastguard Worker</pre> 316*324bb76bSAndroid Build Coastguard Worker 317*324bb76bSAndroid Build Coastguard Worker<p> 318*324bb76bSAndroid Build Coastguard Worker 319*324bb76bSAndroid Build Coastguard Worker So the differences between straight LZW and GIF LZW are: two additional 320*324bb76bSAndroid Build Coastguard Workerspecial codes and variable compression sizes. If you understand LZW, and you 321*324bb76bSAndroid Build Coastguard Workerunderstand those variations, you understand it all! 322*324bb76bSAndroid Build Coastguard Worker 323*324bb76bSAndroid Build Coastguard Worker</p><p> 324*324bb76bSAndroid Build Coastguard Worker Just as sort of a P.S., you may have noticed that a compressor has a 325*324bb76bSAndroid Build Coastguard Workerlittle bit of flexibility at compression time. I specified a "greedy" approach 326*324bb76bSAndroid Build Coastguard Workerto the compression, grabbing as many characters as possible before outputting 327*324bb76bSAndroid Build Coastguard Workercodes. This is, in fact, the standard LZW way of doing things, and it will 328*324bb76bSAndroid Build Coastguard Workeryield the best compression ratio. But there's no rule saying you can't stop 329*324bb76bSAndroid Build Coastguard Workeranywhere along the line and just output the code for the current prefix, 330*324bb76bSAndroid Build Coastguard Workerwhether it's already in the table or not, and add that string plus the next 331*324bb76bSAndroid Build Coastguard Workercharacter to the string table. There are various reasons for wanting to do 332*324bb76bSAndroid Build Coastguard Workerthis, especially if the strings get extremely long and make hashing difficult. 333*324bb76bSAndroid Build Coastguard WorkerIf you need to, do it. 334*324bb76bSAndroid Build Coastguard Worker 335*324bb76bSAndroid Build Coastguard Worker</p><p> 336*324bb76bSAndroid Build Coastguard Worker Hope this helps out.----steve blackstock 337*324bb76bSAndroid Build Coastguard Worker 338*324bb76bSAndroid Build Coastguard Worker</p><h3>Further information</h3> 339*324bb76bSAndroid Build Coastguard Worker 340*324bb76bSAndroid Build Coastguard WorkerThe original paper that describes the LZW algorithm is: 341*324bb76bSAndroid Build Coastguard Worker 342*324bb76bSAndroid Build Coastguard Worker<blockquote> 343*324bb76bSAndroid Build Coastguard WorkerTerry A. Welch. 344*324bb76bSAndroid Build Coastguard WorkerA Technique for High Performance Data Compression. 345*324bb76bSAndroid Build Coastguard WorkerIEEE Computer, Vol. 17, No. 6, 1984, pp. 8-19. 346*324bb76bSAndroid Build Coastguard Worker</blockquote> 347*324bb76bSAndroid Build Coastguard Worker 348*324bb76bSAndroid Build Coastguard WorkerThe GIF format is described in more detail in the 349*324bb76bSAndroid Build Coastguard Worker<a href="gif87.txt">GIF87(5) - GIF 87</a> and 350*324bb76bSAndroid Build Coastguard Worker<a href="gif89.txt">GIF89a(5) - GIF 89a</a> standards. 351*324bb76bSAndroid Build Coastguard Worker</body> 352*324bb76bSAndroid Build Coastguard Worker</html> 353