xref: /aosp_15_r20/external/gemmlowp/doc/public.md (revision 5f39d1b313f0528e11bae88b3029b54b9e1033e7)
# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline

The primary public entry point is `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

The high-level overview of how this specifies a low-precision matrix
multiplication is explained in [low-precision.md](low-precision.md). The
rationale for a specific quantization paradigm is given in
[quantization.md](quantization.md). That specific quantization paradigm is
implemented at two different stages of the computation: as pre-processing on
the operands and as post-processing on the result:

*   Pre-processing on the LHS and RHS operands, in the form of adding constant
    `lhs_offset` and `rhs_offset` to them, is explained in
    [low-precision.md](low-precision.md).

*   Post-processing on the result, in the form of a flexible "output pipeline",
    is explained in [output.md](output.md).

More details on this below as we discuss specific function parameters.

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```

### Template parameters

Typically, only the first three template parameters need to be specified; the
rest are automatically deduced from the function parameters:

*   `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
    this must be `std::uint8_t`.
*   `OutputScalar`: The scalar type of the result. At the moment, this must be
    `std::uint8_t`.
*   `BitDepthParams`: Defines the bit format of the input and output matrices
    and the required accuracy of the computation. At the moment, the only
    non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
    [less-than-8-bit.md](less-than-8-bit.md) for other values, the general idea
    behind this parameter, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are:

*   `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
    column-major) of the LHS, RHS and result matrices. See
    [public/map.h](../public/map.h). See the performance note below: for
    optimal performance, we recommend `RowMajor` for the LHS and `ColMajor`
    for the RHS and result.
*   `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
    See the explanation of the `output_pipeline` parameter below, and
    [output.md](output.md).
*   `GemmContextType`: the type of the `context` parameter. At the moment, this
    must be `gemmlowp::GemmContext`.

### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are:

*   `context`: The `gemmlowp::GemmContext` object holding state and resources to
    be used for this gemmlowp call.
*   `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
    `MatrixMap` objects, mapping external buffers as matrices, not owning data.
    See [public/map.h](../public/map.h).
*   `result`: Pointer to the destination `MatrixMap` object, which must be
    already constructed, wrapping the external destination buffer with the
    wanted destination matrix shape and storage layout. No memory allocation
    will be performed by gemmlowp for the destination buffer. See
    [public/map.h](../public/map.h).
*   `lhs_offset`, `rhs_offset`: Constants added to each entry of the LHS and
    RHS matrices respectively, as explained in
    [low-precision.md](low-precision.md). This is the only part of the
    quantization paradigm explained in [quantization.md](quantization.md) that
    needs to be implemented as operations on the operands; everything else is
    implemented as operations on the result, see `output_pipeline`.
*   `output_pipeline`: A `std::tuple` of output stages (see
    [public/output_stages.h](../public/output_stages.h)), specifying the output
    pipeline (see [output.md](output.md)). This is the part of the quantization
    paradigm explained in [quantization.md](quantization.md) that needs to be
    implemented as operations on the result matrix.

### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all combinations are equally optimized.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders:

*   `LhsOrder=RowMajor`
*   `RhsOrder=ColMajor`
*   `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.

## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS and RHS respectively.

This is useful for some flavors of neural network inference with "per-channel
quantization", hence the PC suffix. It has been useful in some settings where a
neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.
162