xref: /aosp_15_r20/external/brotli/research/README.md (revision f4ee7fba7774faf2a30f13154332c0a06550dbc4)
1*f4ee7fbaSAndroid Build Coastguard Worker## Introduction
2*f4ee7fbaSAndroid Build Coastguard Worker
3*f4ee7fbaSAndroid Build Coastguard WorkerIn this directory we publish simple tools to analyze backward reference distance distributions in LZ77 compression. We developed these tools to be able to make more efficient encoding of distances in large-window brotli. In large-window compression the average cost of a backward reference distance is higher, and this may allow for more advanced encoding strategies, such as delta coding or an increase in context size, to bring significant compression density improvements. Our tools visualize the backward references as histogram images, i.e., one pixel in the image shows how many distances of a certain range exist at a certain locality in the data. The human visual system is excellent at pattern detection, so we tried to roughly identify patterns visually before going into more quantitative analysis. These tools can turn out to be useful in development of  other LZ77-based compressors and we hope you try them out.
4*f4ee7fbaSAndroid Build Coastguard Worker
5*f4ee7fbaSAndroid Build Coastguard Worker
6*f4ee7fbaSAndroid Build Coastguard Worker## Tools
7*f4ee7fbaSAndroid Build Coastguard Worker### find\_opt\_references
8*f4ee7fbaSAndroid Build Coastguard Worker
9*f4ee7fbaSAndroid Build Coastguard WorkerThis tool generates optimal (match-length-wise) backward references for every position in the input files and stores them in `*.dist` file described below.
10*f4ee7fbaSAndroid Build Coastguard Worker
11*f4ee7fbaSAndroid Build Coastguard WorkerExample usage:
12*f4ee7fbaSAndroid Build Coastguard Worker
13*f4ee7fbaSAndroid Build Coastguard Worker    find_opt_references input.txt output.dist
14*f4ee7fbaSAndroid Build Coastguard Worker
15*f4ee7fbaSAndroid Build Coastguard Worker### draw\_histogram
16*f4ee7fbaSAndroid Build Coastguard Worker
17*f4ee7fbaSAndroid Build Coastguard WorkerThis tool generates a visualization of the distribution of backward references stored in `*.dist` file. The original file size has to be specified as a second parameter. The output is a grayscale PGM (binary) image.
18*f4ee7fbaSAndroid Build Coastguard Worker
19*f4ee7fbaSAndroid Build Coastguard WorkerExample usage:
20*f4ee7fbaSAndroid Build Coastguard Worker
21*f4ee7fbaSAndroid Build Coastguard Worker    draw_histogram input.dist 65536 output.pgm
22*f4ee7fbaSAndroid Build Coastguard Worker
23*f4ee7fbaSAndroid Build Coastguard WorkerHere's an example of resulting image:
24*f4ee7fbaSAndroid Build Coastguard Worker
25*f4ee7fbaSAndroid Build Coastguard Worker![](img/enwik9_brotli.png)
26*f4ee7fbaSAndroid Build Coastguard Worker
27*f4ee7fbaSAndroid Build Coastguard Worker### draw\_diff
28*f4ee7fbaSAndroid Build Coastguard Worker
29*f4ee7fbaSAndroid Build Coastguard WorkerThis tool generates a diff PPM (binary) image between two input 8-bit PGM (binary) images. Input images must be of same size. Useful for comparing different backward references distributions for same input file. Normally used for comparison of output images from `draw_histogram` tool.
30*f4ee7fbaSAndroid Build Coastguard Worker
31*f4ee7fbaSAndroid Build Coastguard WorkerExample usage:
32*f4ee7fbaSAndroid Build Coastguard Worker
33*f4ee7fbaSAndroid Build Coastguard Worker    draw_diff image1.pgm image2.pgm diff.ppm
34*f4ee7fbaSAndroid Build Coastguard Worker
35*f4ee7fbaSAndroid Build Coastguard WorkerFor example the diff of this image
36*f4ee7fbaSAndroid Build Coastguard Worker
37*f4ee7fbaSAndroid Build Coastguard Worker![](img/enwik9_brotli.png)
38*f4ee7fbaSAndroid Build Coastguard Worker
39*f4ee7fbaSAndroid Build Coastguard Workerand this image
40*f4ee7fbaSAndroid Build Coastguard Worker
41*f4ee7fbaSAndroid Build Coastguard Worker![](img/enwik9_opt.png)
42*f4ee7fbaSAndroid Build Coastguard Worker
43*f4ee7fbaSAndroid Build Coastguard Workerlooks like this:
44*f4ee7fbaSAndroid Build Coastguard Worker
45*f4ee7fbaSAndroid Build Coastguard Worker![](img/enwik9_diff.png)
46*f4ee7fbaSAndroid Build Coastguard Worker
47*f4ee7fbaSAndroid Build Coastguard Worker
48*f4ee7fbaSAndroid Build Coastguard Worker## Backward distance file format
49*f4ee7fbaSAndroid Build Coastguard Worker
50*f4ee7fbaSAndroid Build Coastguard WorkerThe format of `*.dist` files is as follows:
51*f4ee7fbaSAndroid Build Coastguard Worker
52*f4ee7fbaSAndroid Build Coastguard Worker    [[     0| match length][     1|position|distance]...]
53*f4ee7fbaSAndroid Build Coastguard Worker     [1 byte|      4 bytes][1 byte| 4 bytes| 4 bytes]
54*f4ee7fbaSAndroid Build Coastguard Worker
55*f4ee7fbaSAndroid Build Coastguard WorkerMore verbose explanation: for each backward reference there is a position-distance pair, also a copy length may be specified. Copy length is prefixed with flag byte 0, position-distance pair is prefixed with flag byte 1. Each number is a 32-bit integer. Copy length always comes before position-distance pair. Standalone copy length is allowed, in this case it is ignored.
56*f4ee7fbaSAndroid Build Coastguard Worker
57*f4ee7fbaSAndroid Build Coastguard WorkerHere's an example of how to read from `*.dist` file:
58*f4ee7fbaSAndroid Build Coastguard Worker
59*f4ee7fbaSAndroid Build Coastguard Worker```c++
60*f4ee7fbaSAndroid Build Coastguard Worker#include "read_dist.h"
61*f4ee7fbaSAndroid Build Coastguard Worker
62*f4ee7fbaSAndroid Build Coastguard WorkerFILE* f;
63*f4ee7fbaSAndroid Build Coastguard Workerint copy, pos, dist;
64*f4ee7fbaSAndroid Build Coastguard Workerwhile (ReadBackwardReference(fin, &copy, &pos, &dist)) {
65*f4ee7fbaSAndroid Build Coastguard Worker   ...
66*f4ee7fbaSAndroid Build Coastguard Worker}
67*f4ee7fbaSAndroid Build Coastguard Worker```
68