# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline

The primary public entry point is: `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

The high-level overview of how this specifies a low-precision matrix
multiplication is explained in [low-precision.md](low-precision.md). The
rationale for a specific quantization paradigm is given in
[quantization.md](quantization.md). That specific quantization paradigm is
implemented at two different stages of the computation: as pre-processing on
the operands and as post-processing on the result:

* Pre-processing on the LHS, RHS operands, in the form of adding constant
  `lhs_offset`, `rhs_offset` to them, is explained in
  [low-precision.md](low-precision.md).

* Post-processing on the result, in the form of a flexible "output pipeline",
  is explained in [output.md](output.md).

More details on this below as we discuss specific function parameters.
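The combined effect of the two stages can be summarized by the following reference sketch (plain C++, not gemmlowp code; the function name and flat-buffer layout are illustrative assumptions):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative reference computation (not gemmlowp code): the int32
// accumulator that the output pipeline receives is
//   acc(r, c) = sum_d (lhs(r, d) + lhs_offset) * (rhs(d, c) + rhs_offset)
// Here lhs is row-major and rhs is column-major, both as flat buffers.
std::vector<std::int32_t> ReferenceGemmCore(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    int rows, int depth, int cols, int lhs_offset, int rhs_offset) {
  std::vector<std::int32_t> acc(rows * cols);
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; ++d) {
        sum += (static_cast<std::int32_t>(lhs[r * depth + d]) + lhs_offset) *
               (static_cast<std::int32_t>(rhs[c * depth + d]) + rhs_offset);
      }
      // The output pipeline then maps this int32 accumulator down to the
      // final OutputScalar (e.g. std::uint8_t); see output.md.
      acc[r * cols + c] = sum;
    }
  }
  return acc;
}
```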

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```

### Template parameters

Typically only the first 3 template parameters need to be specified, the rest
being automatically deduced from function parameters:

* `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
  this must be `std::uint8_t`.
* `OutputScalar`: The scalar type of the result. At the moment,
  this must be `std::uint8_t`.
* `BitDepthParams`: Defines the bit format of the input and output matrices
  and the required accuracy of the computation. At the moment, the only
  non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
  [less-than-8-bit.md](less-than-8-bit.md) for other values, the general
  idea behind this parameter, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are:

* `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
  column-major) of the LHS, RHS, and result matrices. See
  [public/map.h](../public/map.h). See the performance note below: we
  recommend using RowMajor, ColMajor, and ColMajor respectively for optimal
  performance.
* `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
  See the explanation of the `output_pipeline` parameter below, and
  [output.md](output.md).
* `GemmContextType`: the type of the `context` parameter. At the moment, this
  must be `gemmlowp::GemmContext`.

### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are:

* `context`: The `gemmlowp::GemmContext` object holding state and resources to
  be used for this gemmlowp call.
* `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
  `MatrixMap` objects, mapping external buffers as matrices, not owning data.
  See [public/map.h](../public/map.h).
* `result`: pointer to the destination `MatrixMap` object, which must be
  already constructed, wrapping the external destination buffer with the
  wanted destination matrix shape and storage layout. No memory allocation
  will be performed by gemmlowp for the destination buffer. See
  [public/map.h](../public/map.h).
* `lhs_offset`, `rhs_offset`: constants added to each entry of the LHS and
  RHS matrices respectively, as explained in
  [low-precision.md](low-precision.md). This is the only part of the
  quantization paradigm explained in [quantization.md](quantization.md) that
  needs to be implemented as operations on the operands; everything else is
  implemented as operations on the result, see `output_pipeline`.
* `output_pipeline`: a `std::tuple` of output stages (see
  [public/output_stages.h](../public/output_stages.h)), specifying the output
  pipeline (see [output.md](output.md)). This is the part of the quantization
  paradigm explained in [quantization.md](quantization.md) that needs to be
  implemented as operations on the result matrix.

### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all combinations are equally optimized.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders:

* `LhsOrder=RowMajor`
* `RhsOrder=ColMajor`
* `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.
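The depth-contiguity point can be made concrete with a small sketch (plain C++, not gemmlowp code; the function name is illustrative): with `LhsOrder=RowMajor` and `RhsOrder=ColMajor`, each result entry is a dot product of two unit-stride spans.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch (not gemmlowp code): with a row-major LHS and a
// column-major RHS, row r of the LHS and column c of the RHS are both
// contiguous along the depth dimension, so computing one result entry
// walks both buffers with unit stride.
std::int32_t DotAlongDepth(const std::uint8_t* lhs_row,  // &lhs[r * depth]
                           const std::uint8_t* rhs_col,  // &rhs[c * depth]
                           int depth, int lhs_offset, int rhs_offset) {
  std::int32_t acc = 0;
  for (int d = 0; d < depth; ++d) {
    // Both pointers advance by one element per step: unit stride.
    acc += (static_cast<std::int32_t>(lhs_row[d]) + lhs_offset) *
           (static_cast<std::int32_t>(rhs_col[d]) + rhs_offset);
  }
  return acc;
}
```

With any other combination of storage orders, one of the two walks would have a stride of `rows` or `cols` elements instead, which is what the packing stage then has to compensate for.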

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.

## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS and RHS respectively.

This is useful for some flavors of neural network inference with "per-channel
quantization", whence the PC suffix. This has been useful in some settings where
a neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid it in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.