# Computation with less than 8 bits in gemmlowp

## Introduction

We assume familiarity with gemmlowp's low-precision uint8 computation paradigm,
which is described in [low-precision.md](low-precision.md).

This document is about the possibility of further reducing precision below 8
bits.

Doing so can yield higher arithmetic throughput on some architectures, at the
cost of decreased accuracy.

## The past, present, and future of less-than-8-bit computation in gemmlowp

A meta note is needed here as to how this fits with the general gemmlowp design.

### The past

Less-than-8-bit computation was initially designed and implemented in gemmlowp
as a drop-in replacement for regular 8bit computation, a plain optimization. The
idea was to automatically requantize 8bit operands to less-than-8bit during
the O(N^2) packing stage, then take advantage of the lower bit depth during the
O(N^3) compute stage. For large enough matrices, that should be worth it.

### The present

TODO(benoitjacob): update this documentation. This 'present' state just
became the past (February 2017).

At the moment, this less-than-8-bit mode of gemmlowp is not much used in
practice, because the implicit requantization of operands from 8bit to
less-than-8bit turned out to be more expensive than initially expected, both in
terms of speed and accuracy:

1.  Speed: the O(N^2) requantization is only negligible compared to the O(N^3)
    compute kernel when the matrix size N is large enough; in practice, smaller
    matrix sizes turned out to be very important, making the requantization
    approach slower than expected.

2.  Accuracy: as neural networks were optimized for size, their sensitivity to
    numerical accuracy increased. The approach of requantizing already-quantized
    data then turned out to be more wasteful of accuracy than we could afford.

### The future

Less-than-8bit still probably has good prospects; what should be dropped is the
requantization. In other words, in the future, we might have neural networks
trained right away for some bit depth lower than 8 bits. The resulting values
would probably still be stored as 8 bits (unless the bit depth eventually
becomes very low). Thus, no particular work would be needed in the packing
stage; no overhead or loss of accuracy would be incurred anymore.

In other words: the design of less-than-8-bit kernels is probably useful in the
long run; what is on the way out is requantization and packing/unpacking-stage
aspects.

With that said, the rest of this page retains its old content about the present
approach:

## Public interface

### The BitDepthSetting parameter in the EightBitIntGemm interface

Accessing less-than-8-bit computation via the EightBitIntGemm interface is very
simple: EightBitIntGemm takes a BitDepthSetting enum that selects among a fixed
set of supported bit-depth combinations.

### The BitDepthParams parameter in the public/gemmlowp.h interface

The public/gemmlowp.h interface exposes more extensive control over
quantization, by means of a BitDepthParams template parameter, which is a type
parameter carrying information about:

1.  The LHS and RHS bit depth, which can be set arbitrarily and independently;
2.  The 'RoundingStrategy', which is the heuristic used to choose a rounding
    mode, based on the accumulation size (a.k.a. the "depth" dimension of the
    Gemm).

Details can be seen in public/bit_depth.h.

### How does BitDepth{Setting,Params} affect input/output uint8 matrix data?

Input/output matrix data is all uint8's, ranging from 0 to 255, regardless of
the BitDepth{Setting,Params}.

So the BitDepth{Setting,Params} is only an internal detail. It merely allows
gemmlowp to use lower precision internally; the input/output data format is
unaffected.

As far as the API contract goes, the only thing that the
BitDepth{Setting,Params} does is to relax the accuracy requirement. With
standard 8bit/8bit computation, gemmlowp is required to return the exact result
as specified in [low-precision.md](low-precision.md). With lower bit depths,
gemmlowp is no longer required to return an exact result.

## Implementation

Here we refer to the 3 stages of computation as described in
[design.md](design.md), namely: packing, computation kernel, unpacking.

The general idea is that at the packing stage, we requantize input (Lhs/Rhs)
data to less-than-8-bit depths by scaling them, thus shrinking the range of the
packed matrix entries; for instance, if the Rhs bit depth is to be 5 bits then
packed Rhs matrix entries will be in the range [0 ... 31]. This then allows the
GEMM kernel to use narrower accumulators without risking overflow, thus
achieving higher arithmetic throughput. Finally, at the unpacking stage, it only
remains to scale the result values to compensate for the scalings applied
earlier.

Let us go into more detail for each of those stages:

### Packing stage

The packing stage is where most of the work specific to the BitDepthParams takes
place.

Here, we have to scale input matrix values from their original range of [0 ...
255] to the range specified by the BitDepthParams, which is [0 ... (2^N)-1]
where N is the number of bits for the matrix at hand (Lhs or Rhs). For example,
for a bit depth of 5 bits, we need to scale down to [0 ... 31].
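
For illustration, here is a minimal scalar sketch of this scaling, using naive
truncating division; the function name is made up for this example, and the
choice of rounding, which matters a lot, is the subject of the requantization
section below:

    #include <cstdint>

    // Illustrative only: scale an 8-bit value in [0..255] down to N bits,
    // i.e. into [0..(2^N)-1]. Rounding is deliberately left naive here; see
    // the "Requantization details" section below.
    std::uint8_t ScaleTo(std::uint8_t src, int bits) {
      const std::uint32_t maxval = (1u << bits) - 1;  // e.g. 31 for 5 bits
      return static_cast<std::uint8_t>((src * maxval) / 255);
    }
    // Example: ScaleTo(200, 5) == 24, i.e. 200 maps into [0..31].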

This scaling is what we call "requantization". The somewhat pedantic name
reflects the fact that this is actually quite nontrivial to do correctly, i.e.
in such a way that the result accuracy will be good enough for real-world
applications. See the section below on requantization details.

Concretely, this work happens in PackingRegisterBlock::Pack(), which calls
Requantize(). This is in internal/pack.h. This code can be overridden for
specific architectures, see internal/pack_neon.h.

This requantization work is costly and makes packing slower. This means that, at
least in our approach, less-than-8-bit computation is only interesting for
large-enough, square-enough GEMMs where packing is only a small fraction of the
overall cost. In cases where packing overhead is more prevalent (highly
rectangular cases), less-than-8-bit is probably a waste of time as long as we
treat it as an internal computation detail. What might help there would be to
shrink the input/output data format to lower memory bandwidth usage.

### Computation kernel stage

In principle, the computation kernel stage simply doesn't have to care about the
bit depth at all. In fact, on architectures where we do not have specific
optimized kernels for less-than-8-bit cases, we simply use our standard kernel
there, and that's correct!

However, while the kernel doesn't have to know that the operands use fewer than
8 bits, it can use that information to make special optimizations that would be
incorrect in the general 8-bit case and become correct here thanks to the more
restricted range of inputs. That's the whole point of this less-than-8-bit
computation idea.

With Lhs entries guaranteed to be smaller than 2^N, and Rhs entries guaranteed
to be smaller than 2^M, each product is thus guaranteed to be smaller than
2^(M+N). Thus, one may accumulate 2^(16-(M+N)) such products and still be
guaranteed that such an accumulator will be smaller than 2^16, and thus can be
stored as a uint16 without risking overflow.

For example, in the L7R5 case, the Lhs entries are on 7 bits (N=7) and the Rhs
entries are on 5 bits (M=5), so each product fits in 12 bits and one can thus
accumulate 16 ( = 2^(16-12)) such products into uint16 accumulators with no risk
of overflow.

This means that a computation kernel may use uint16 accumulators for several
loop iterations (16 in the above example), provided that it is allowed to assume
that inputs are in such a restricted range.

After this fixed number of loop iterations, the kernel must accumulate the local
uint16 accumulators back into global uint32 accumulators.
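
For illustration, here is a minimal scalar sketch of this two-level accumulation
scheme in the L7R5 case; it is not the actual kernel code (the real kernels are
SIMD and operate on blocks), and the function name is made up for this example:

    #include <cstdint>

    // Illustrative scalar sketch: accumulate one output entry over `depth`
    // products, using a local uint16 accumulator that is flushed into a global
    // uint32 accumulator every 16 iterations. This is safe for 7-bit x 5-bit
    // operands, since each product fits in 12 bits and 16 * 2^12 = 2^16.
    std::uint32_t AccumulateL7R5(const std::uint8_t* lhs, const std::uint8_t* rhs,
                                 int depth) {
      std::uint32_t acc32 = 0;
      std::uint16_t acc16 = 0;
      for (int d = 0; d < depth; ++d) {
        acc16 += static_cast<std::uint16_t>(lhs[d] * rhs[d]);  // fits in 12 bits
        if ((d + 1) % 16 == 0) {  // flush before the uint16 could overflow
          acc32 += acc16;
          acc16 = 0;
        }
      }
      return acc32 + acc16;  // flush the remainder
    }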

On SIMD architectures with suitable uint16 arithmetic, this in principle allows
multiplying arithmetic throughput by up to 2x, since twice as many accumulators
now fit in each SIMD vector register. This is partially offset by the cost of
accumulating back into global uint32 accumulators every several loop iterations,
but our experience on ARM NEON has been that we still get quite close to a 2x
speedup. See internal/kernel_neon.h, specifically
NEON32Kernel12x4Depth2Assuming12BitProducts.

### Unpacking stage

At the unpacking stage, it only remains to scale the result values to compensate
for the scaling of the inputs. This is easier because now we are expanding the
range instead of shrinking it, so we don't need to worry about ways to minimize
a loss of accuracy. We simply need to multiply result values by a constant
fraction, rounding to nearest.

Since the inputs were scaled by factors of (2^lhs_bits - 1)/255 and
(2^rhs_bits - 1)/255 respectively, the scaling of the outputs needs to be by the
following factor:

                 255 * 255
    -----------------------------------
    (2^lhs_bits - 1) * (2^rhs_bits - 1)

This is done by a MultiplyByConstantFraction function; see internal/unpack.h.
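
For illustration, here is a minimal sketch of that final scaling with
round-to-nearest; the function name is made up for this example, and the real
MultiplyByConstantFraction in internal/unpack.h is organized differently:

    #include <cstdint>

    // Illustrative: scale a raw accumulator value by
    // 255*255 / ((2^lhs_bits - 1) * (2^rhs_bits - 1)), rounding to nearest,
    // to compensate for the packing-stage scalings of the inputs.
    std::int32_t ScaleResult(std::int32_t acc, int lhs_bits, int rhs_bits) {
      const std::int64_t numerator = 255 * 255;
      const std::int64_t denominator =
          ((std::int64_t(1) << lhs_bits) - 1) * ((std::int64_t(1) << rhs_bits) - 1);
      // 64-bit intermediate to avoid overflow; +denominator/2 rounds to nearest.
      return static_cast<std::int32_t>(
          (std::int64_t(acc) * numerator + denominator / 2) / denominator);
    }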

## Requantization details

Here we go into more detail on the Requantize() function used at the packing
stage to requantize input matrix data. See this function in internal/pack.h.

It depends on the bit depth and on a rounding mode, and requantizes an input
value in [0 ... 255] to the range [0 ... (2^N)-1] specified by the bit depth N.

### Naive, bad rounding, that's plainly biased

Naive and inaccurate ways to achieve this requantization include:

1.  Shifting bits right by (8-N) bits;
2.  Multiplying by ((2^N) - 1) and dividing by 255.

Both of those are biased in some way: approach 1 has the wrong "derivative",
since it approximates (((2^N) - 1) / 255) by ((2^N) / 256); approach 2 is biased
because it effectively implements rounding towards 0.

In practice, both of the above requantization functions give results that are
too inaccurate for the neural network that we tried (GoogLeNet).

### Round-to-nearest rounding: unbiased in principle but not in practice

The simplest fix is to avoid the bias in approach 2 by rounding-to-nearest
instead of rounding towards 0. This can be achieved by doing

    dst = (src * maxval + rounding_offset) / 255;

where maxval = ((2^N) - 1) is the highest requantized value, and the
rounding_offset can be set to

    rounding_offset = 127

to achieve rounding-to-nearest (while the above rounding towards 0 corresponded
to rounding_offset = 0).
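
For illustration, a minimal scalar sketch of this round-to-nearest
requantization; the function name is made up for this example, and the real
Requantize() in internal/pack.h is organized differently:

    #include <cstdint>

    // Illustrative: requantize an 8-bit value to N bits, rounding to nearest.
    std::uint8_t RequantizeRoundNearest(std::uint8_t src, int bits) {
      const std::uint32_t maxval = (1u << bits) - 1;
      const std::uint32_t rounding_offset = 127;  // half of 255: round to nearest
      return static_cast<std::uint8_t>((src * maxval + rounding_offset) / 255);
    }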

In principle, rounding-to-nearest is unbiased and optimal in various ways.

In practice though, our input data is not random real numbers, but
already-quantized 8-bit values. That means that even in the best case, there
would be at most 256 different possible input values; in practice, we generally
see the input values distributed non-uniformly in that range, so that a majority
of input values tend to be in a much smaller range. See test/test_data.cc.

Having a large part of the input values in a very small finite set means that
the corresponding rounding errors are also in a very small finite set, which can
be small enough that the mean of these rounding errors is significantly
different from 0. That rounding-to-nearest is "unbiased" only means that over a
sufficiently large set of input values, the bias would become arbitrarily close
to 0; here, the set of input values is effectively small enough that the
resulting bias is significant.

This leads to biasing the matrix product entries, resulting in an error that
grows linearly with the depth dimension of the GEMM.

### Probabilistic rounding: unbiased even on small finite input distributions

To address that, we can instead use probabilistic rounding. The idea is that if,
for instance, we have to round the value 3.8 to the nearest integer, we can
round it to 3 with 20% probability and to 4 with 80% probability. If that value
3.8 occurs many times, the mean requantized value will thus tend to 3.8.

This amounts to keeping the above requantization formula,

    dst = (src * maxval + rounding_offset) / 255;

but now the rounding_offset is a random value in [0 .. 254].

This guarantees zero bias no matter how small the distribution of input values
is.
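
For illustration, a minimal sketch of the same formula with a random
rounding_offset; it uses &lt;random&gt; for clarity, whereas the real implementation
uses the fast generators described below, and the function name is made up for
this example:

    #include <cstdint>
    #include <random>

    // Illustrative: requantize an 8-bit value to N bits with probabilistic
    // rounding. The random offset in [0..254] makes the expected requantized
    // value unbiased even for small, non-uniform input distributions.
    std::uint8_t RequantizeProbabilistic(std::uint8_t src, int bits,
                                         std::mt19937& rng) {
      const std::uint32_t maxval = (1u << bits) - 1;
      std::uniform_int_distribution<int> offset_dist(0, 254);
      const std::uint32_t rounding_offset =
          static_cast<std::uint32_t>(offset_dist(rng));
      return static_cast<std::uint8_t>((src * maxval + rounding_offset) / 255);
    }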

On the other hand, the variance of the error term here is higher than with
rounding-to-nearest --- one can check that it is 2x higher.

So the error term coming from the Central Limit Theorem, which grows with the
square root of the accumulator depth, i.e. the GEMM depth, will be larger here,
by a factor of sqrt(2), since the variance is 2x higher.

Still, for large enough GEMM depth, that is better than rounding-to-nearest,
which has an error term growing linearly with the GEMM depth.

### Switching between rounding-to-nearest and probabilistic rounding

Thus, for fixed input values and bit depths, we expect that probabilistic
rounding will give more accurate results for large enough GEMM depths, while
rounding-to-nearest will be more accurate for smaller GEMM depths.

That is why we switch between these rounding modes based on GEMM depth; see
ChooseRoundingMode in internal/bit_depth_util.h.

It is based on a constant, kProbabilisticRoundingThreshold, defined in
internal/common.h and empirically determined. See the comment there. It would be
nice to better understand the statistics here and come up with better heuristics
for this switching.
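
In spirit, that switch is just a threshold comparison on the GEMM depth. Here is
a minimal sketch assuming a simple enum; the names in this snippet are made up
for this example, and the real logic lives in internal/bit_depth_util.h:

    // Illustrative sketch of the depth-based switch; the actual constant
    // kProbabilisticRoundingThreshold is defined in internal/common.h.
    enum class RoundingMode { kRoundToNearest, kProbabilistic };

    RoundingMode ChooseRoundingMode(int depth, int probabilistic_threshold) {
      return depth < probabilistic_threshold
                 ? RoundingMode::kRoundToNearest  // small depth: bias stays small
                 : RoundingMode::kProbabilistic;  // large depth: bias would dominate
    }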

### Choice of pseudorandom number generator

We provide two PRNGs. The first is an 8-bit Xorshift. It is fast, naturally
produces values ranging over an interval of width 255, which is what we need
here (as opposed to an interval of width 256), and turns out, from empirical
tests, to produce better results than a linear congruential generator (LCG).
That's unfortunate, as an 8-bit LCG performs better (we confirmed that on
various ARM devices), but we need the rounding offsets to be as unbiased as
possible.

The second is an "add-mod" sequence generator, which generates non-random values
in the sequence x = (x + 97) % 255. This generates a low-discrepancy sequence
that minimizes the "clumpiness" of the random offsets (thus, for example,
quantizing a 3x3 matrix will have a maximum additive error of about 200 from the
random offsets). While not random, this sequence performs well empirically for
many quantizations. (For information about why 97 is a good value, see
https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Additive_recurrence and
http://mollwollfumble.blogspot.com/2011/03/subrandom-numbers.html. 97/255 = 0.38,
and 0.382 is the best choice; for discrete numbers, the choice must be relatively
prime to the modulus. 97 is prime, so it is safely relatively prime to 255. 107
is another near-optimal choice.)

The low-discrepancy sequence generator is the default.

More details and results are given in a comment on the default PRNG in
internal/pack.h. Interested users can change the PRNG used by setting
DefaultRoundingGenerator in bit_depth_util.h.