# Overview of gemmlowp design

## Primer on GEMM, kernels, and cache friendliness

gemmlowp, like most GEMMs, implements the straightforward matrix multiplication
algorithm, which takes n^3 multiply-accumulate instructions for n*n sized
matrices. Because the arithmetic complexity grows quicker than the memory
complexity (n^3 vs. n^2), memory accesses are redundant (each matrix entry is
accessed n times). A large part of a GEMM's performance and design goes toward
minimizing the inefficiency resulting from these redundant memory accesses.
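
For concreteness, the "straightforward algorithm" is just the familiar triple
loop. Here is a plain C++ sketch of it (a generic illustration, not gemmlowp
code), showing where the n^3 multiply-accumulates come from:

```
// Naive GEMM for row-major matrices:
//   result(i, j) = sum over k of lhs(i, k) * rhs(k, j)
// This performs rows * cols * depth multiply-accumulates, while the matrices
// only contain on the order of rows*depth + depth*cols + rows*cols entries,
// so each entry gets re-read many times.
void NaiveGemm(const float* lhs, const float* rhs, float* result,
               int rows, int depth, int cols) {
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
      float accum = 0;
      for (int k = 0; k < depth; k++) {
        accum += lhs[i * depth + k] * rhs[k * cols + j];
      }
      result[i * cols + j] = accum;
    }
  }
}
```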

Ultimately, once values are loaded into CPU registers, they cost nothing to
access, so as long as we can work within registers, this problem doesn't exist.
Thus, in order to be efficient, a GEMM's inner loops must wisely use the
available registers to do as much arithmetic work as possible before loading
more data from memory into registers. This means that a GEMM implementation
needs architecture-specific inner loops, tailored to details such as the number
of available registers, and typically written in assembly. This
architecture-specific 'inner loops' component is referred to as the GEMM kernel.
(More details about kernels are in [kernel.md](kernel.md)).
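
To make the register-blocking idea concrete, here is a scalar C++ sketch of a
tiny 4x4 kernel (hypothetical, not gemmlowp's actual NEON code): it keeps a 4x4
block of accumulators in local variables, which the compiler can map to
registers, so each value loaded from memory is reused four times.

```
// Scalar sketch of a 4x4 micro-kernel. The lhs panel stores 4 rows and the rhs
// panel stores 4 columns, both laid out depth-major, so that each depth step
// reads 4 contiguous values from each side and performs 16 multiply-adds.
void MicroKernel4x4(const float* lhs_panel, const float* rhs_panel, int depth,
                    float* accum /* 4x4 block, row-major */) {
  float acc[4][4] = {};  // kept in registers in an optimized kernel
  for (int k = 0; k < depth; k++) {
    const float* lhs_k = lhs_panel + 4 * k;
    const float* rhs_k = rhs_panel + 4 * k;
    for (int i = 0; i < 4; i++) {
      for (int j = 0; j < 4; j++) {
        acc[i][j] += lhs_k[i] * rhs_k[j];
      }
    }
  }
  // Write the accumulators back out once per kernel invocation.
  for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
      accum[4 * i + j] += acc[i][j];
    }
  }
}
```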

However, only small blocks can fit in registers at a given time, so at larger
scales one needs to repeatedly load blocks of matrices from memory, and these
accesses are redundant for the reason outlined above. The way that one minimizes
the resulting inefficiency is by organizing for cache locality, so that most of
these accesses hit the L1 cache, and most of the remaining ones hit the L2
cache, etc.

This is achieved by subdividing the matrices into blocks sized to fit in L2
cache, and subdividing these blocks into sub-blocks sized to fit in L1 cache,
and performing the matrix multiplication one such block at a time.

In practice, it tends to pay off to "pack" input blocks for optimally efficient
traversal by the kernel, since they will be traversed multiple times. "Packing"
means at least reordering the data layout for 1) simple access patterns that fit
the CPU's cache behavior (in particular, the cache line size), and 2) simple
loading into SIMD vector registers by the kernel.
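
As a rough illustration of what packing can mean (the actual gemmlowp packed
formats are described in [packing.md](packing.md)), the following hypothetical
sketch rearranges a row-major lhs block into 4-row panels stored depth-major,
so that a kernel like the one sketched above reads contiguous memory at every
step:

```
// Sketch: pack a (rows x depth) row-major lhs block into panels of 4 rows laid
// out depth-major. For simplicity, rows is assumed to be a multiple of 4.
// After packing, the 4 values needed at each depth step are contiguous.
void PackLhsBlock(const float* lhs, int rows, int depth, float* packed) {
  for (int r = 0; r < rows; r += 4) {      // one 4-row panel at a time
    for (int k = 0; k < depth; k++) {      // depth-major within the panel
      for (int i = 0; i < 4; i++) {
        *packed++ = lhs[(r + i) * depth + k];
      }
    }
  }
}
```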

So a typical GEMM, in pseudo-code, tends to look like this:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        kernel(some_lhs_sub_block, some_rhs_sub_block);
      }
    }
  }
}
```

## Impact of low-precision computation on gemmlowp design

Refer to [low-precision.md](low-precision.md) for specifics of the
low-precision-computation paradigm and how it's implemented in gemmlowp.

Inputs and outputs are matrices of uint8 values, but internally we are
accumulating int32 values, only converting them back to uint8 at the end. This
means that we need to store a block of int32 accumulators at a time. We compute
a block of the result in int32 accumulators and then we "unpack" it into the
destination matrix all at once. In this way, we minimize the amount of memory
used to store int32 values at a given time.
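
The exact conversion back to uint8 is part of the output stage and is specified
in [low-precision.md](low-precision.md) and [output.md](output.md). As a
simplified sketch only (not gemmlowp's exact arithmetic), turning an int32
accumulator into a uint8 result value involves an offset, a multiplier, a
rounding right-shift, and a clamp to the [0, 255] range, along these lines:

```
#include <algorithm>
#include <cstdint>

// Simplified sketch of the "quantize down" step performed during unpack.
// The parameter names mirror those seen in the UnpackResult call quoted later
// in this document, but the precise formula is defined in low-precision.md.
std::uint8_t QuantizeDownToUint8(std::int32_t accum, std::int32_t result_offset,
                                 std::int32_t result_mult_int,
                                 int result_shift) {
  const std::int32_t rounding =
      (result_shift < 1) ? 0 : (1 << (result_shift - 1));
  const std::int32_t value =
      ((accum + result_offset) * result_mult_int + rounding) >> result_shift;
  return static_cast<std::uint8_t>(
      std::min<std::int32_t>(255, std::max<std::int32_t>(0, value)));
}
```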

Because of that, besides the "pack" and "kernel" stages outlined above, a third
stage is needed in gemmlowp, which we call "unpack". Thus we arrive at the
3-stage computation scheme that gemmlowp uses:

1.  Pack lhs/rhs blocks from the input matrices.
2.  Compute the product of the packed blocks, using the kernel.
3.  Unpack the result block into the output matrix.

The pseudo-code overview of gemmlowp now looks like:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
// new: temp storage for int32 accums
allocate(some_int32_accumulators_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        // new: pass int32 accums to kernel
        kernel(&some_int32_accumulators_block,
               some_lhs_sub_block,
               some_rhs_sub_block);
      }
    }
    // new: unpack int32 accums into destination matrix
    unpack(some_int32_accumulators_block);
  }
}
```

## Exploring gemmlowp code

The design outlined above can be readily matched to gemmlowp source code, in
particular in this file, which gives a simple GEMM implementation fitting in one
rather small function:

```
internal/single_thread_gemm.h
```

The reader can compare the above pseudo-code to the actual code in this file:

```
for (int r = 0; r < rows; r += block_params.l2_rows) {
  int rs = std::min(block_params.l2_rows, rows - r);

  PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));

  for (int c = 0; c < cols; c += block_params.l2_cols) {
    int cs = std::min(block_params.l2_cols, cols - c);

    if (!pack_rhs_once) {
      PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
    }

    Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);

    auto result_block = result->block(r, c, rs, cs);
    UnpackResult(&result_block, packed_result, packed_lhs, packed_rhs, depth,
                 result_offset, result_mult_int, result_shift);
  }
}
```

The files in `internal/` fall into a few categories:

There are two top-level GEMM implementations:

*   [internal/single_thread_gemm.h](../internal/single_thread_gemm.h)
*   [internal/multi_thread_gemm.h](../internal/multi_thread_gemm.h)

They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
[packing.md](packing.md)) implemented in the following files:

*   [internal/pack.h](../internal/pack.h)
*   [internal/compute.h](../internal/compute.h)
*   [internal/unpack.h](../internal/unpack.h)
    *   This in turn calls into [internal/output.h](../internal/output.h) for
        the output pipeline (see [output.md](output.md))

The pack.h and unpack.h files contain generic templated code that can be
overridden by optimized code in template specializations; for example, see the
NEON optimized code here (a schematic sketch of this pattern follows the list):

*   [internal/pack_neon.h](../internal/pack_neon.h)
*   [internal/unpack_neon.h](../internal/unpack_neon.h)
    *   This in turn calls into
        [internal/output_neon.h](../internal/output_neon.h)
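
The mechanism is ordinary C++ template specialization: a generic, portable
implementation is always available as a fallback, and a specialization for a
particular packed format overrides it on targets where NEON code exists. The
sketch below uses hypothetical names, not the actual gemmlowp types:

```
#include <cstdint>

// Hypothetical tag types standing in for packed-block formats.
struct GenericFormat {};
struct NeonFriendlyFormat {};

// Generic, portable implementation: always available as a fallback.
template <typename Format>
struct PackImpl {
  static void Pack(std::uint8_t* dst, const std::uint8_t* src, int n) {
    for (int i = 0; i < n; i++) dst[i] = src[i];  // portable reference loop
  }
};

// Full specialization overriding the generic code for one format; in gemmlowp
// the specialized body would use NEON intrinsics or inline assembly.
template <>
struct PackImpl<NeonFriendlyFormat> {
  static void Pack(std::uint8_t* dst, const std::uint8_t* src, int n) {
    for (int i = 0; i < n; i++) dst[i] = src[i];  // imagine NEON code here
  }
};
```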

The compute stage contains generic code in compute.h that only calls into
optimized code through the Kernel::Run() entry point. Each kernel is basically
just a struct offering a Run() implementation; see the NEON kernels in the file
below, and the schematic sketch that follows it:

*   [internal/kernel_neon.h](../internal/kernel_neon.h)
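
Schematically, and with hypothetical names rather than the real gemmlowp
interface (which is documented in [kernel.md](kernel.md)), a kernel looks like
a struct whose Run() method consumes packed panels and updates a block of int32
accumulators:

```
#include <cstdint>

// Hypothetical reference kernel, written as plain C++ for illustration. Real
// gemmlowp kernels also declare a Format describing the packed block layout
// they expect, and implement Run() with NEON intrinsics or assembly. For
// simplicity, this sketch assumes a row-major lhs panel and a column-major
// rhs panel rather than an actual gemmlowp packed layout.
struct ReferenceKernel {
  static void Run(std::int32_t* accum, const std::uint8_t* packed_lhs,
                  const std::uint8_t* packed_rhs, int rows, int cols,
                  int depth) {
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        std::int32_t sum = accum[i * cols + j];
        for (int k = 0; k < depth; k++) {
          sum += static_cast<std::int32_t>(packed_lhs[i * depth + k]) *
                 static_cast<std::int32_t>(packed_rhs[j * depth + k]);
        }
        accum[i * cols + j] = sum;
      }
    }
  }
};
```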