xref: /aosp_15_r20/external/gemmlowp/doc/public.md (revision 5f39d1b313f0528e11bae88b3029b54b9e1033e7)
# Gemmlowp's public entry points

gemmlowp's public interface is defined in
[public/gemmlowp.h](../public/gemmlowp.h).

## GemmWithOutputPipeline

The primary public entry point is `GemmWithOutputPipeline`.

A usage example is given in
[doc/quantization_example.cc](quantization_example.cc).

The high-level overview of how this specifies a low-precision matrix
multiplication is explained in [low-precision.md](low-precision.md). The
rationale for a specific quantization paradigm is given in
[quantization.md](quantization.md). That specific quantization paradigm is
implemented at two different stages of the computation: as pre-processing on
the operands and as post-processing on the result:

*   Pre-processing on the LHS and RHS operands, in the form of adding constant
    `lhs_offset` and `rhs_offset` to them, is explained in
    [low-precision.md](low-precision.md).

*   Post-processing on the result, in the form of a flexible "output pipeline",
    is explained in [output.md](output.md).

More details on this below as we discuss specific function parameters.

The prototype is:

```
template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
          MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
          typename OutputPipelineType, typename GemmContextType>
void GemmWithOutputPipeline(GemmContextType* context,
                            const MatrixMap<const InputScalar, LhsOrder>& lhs,
                            const MatrixMap<const InputScalar, RhsOrder>& rhs,
                            MatrixMap<OutputScalar, ResultOrder>* result,
                            int lhs_offset, int rhs_offset,
                            const OutputPipelineType& output_pipeline);
```

A typical call looks like (from the [usage example](quantization_example.cc)):

```
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
    &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
```

### Template parameters

Typically, only the first three template parameters need to be specified; the
rest are automatically deduced from the function parameters:

*   `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
    this must be `std::uint8_t`.
*   `OutputScalar`: The scalar type of the result. At the moment, this must be
    `std::uint8_t`.
*   `BitDepthParams`: Defines the bit format of the input and output matrices
    and the required accuracy of the computation. At the moment, the only
    non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
    [less-than-8-bit.md](less-than-8-bit.md) for other values, the general idea
    behind this parameter, and how it may become more useful in the future.

The other template parameters, which typically do not need to be specified, are:

*   `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
    column-major) of the LHS, RHS and result matrices. See
    [public/map.h](../public/map.h). See the performance note below: for
    optimal performance, we recommend `RowMajor` for the LHS and `ColMajor`
    for the RHS and result.
*   `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
    See the explanation of the `output_pipeline` parameter below, and
    [output.md](output.md).
*   `GemmContextType`: the type of the `context` parameter. At the moment, this
    must be `gemmlowp::GemmContext`.

### Function parameters

The function parameters taken by `GemmWithOutputPipeline` are:

*   `context`: The `gemmlowp::GemmContext` object holding state and resources to
    be used for this gemmlowp call.
*   `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
    `MatrixMap` objects, mapping external buffers as matrices, not owning data.
    See [public/map.h](../public/map.h).
*   `result`: Pointer to the destination `MatrixMap` object, which must be
    already constructed, wrapping the external destination buffer with the
    wanted destination matrix shape and storage layout. No memory allocation
    will be performed by gemmlowp for the destination buffer. See
    [public/map.h](../public/map.h).
*   `lhs_offset`, `rhs_offset`: Constants added to each entry of the LHS and
    RHS matrices respectively, as explained in
    [low-precision.md](low-precision.md). This is the only part of the
    quantization paradigm explained in [quantization.md](quantization.md) that
    needs to be implemented as operations on the operands; everything else is
    implemented as operations on the result, see `output_pipeline`.
*   `output_pipeline`: A `std::tuple` of output stages (see
    [public/output_stages.h](../public/output_stages.h)), specifying the output
    pipeline (see [output.md](output.md)). This is the part of the quantization
    paradigm explained in [quantization.md](quantization.md) that needs to be
    implemented as operations on the result matrix.

### Performance note on storage orders

gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
result matrices. However, not all combinations are equally optimized.

Because gemmlowp is primarily aimed at neural network inference workloads,
optimization focus is on this particular combination of storage orders:

*   `LhsOrder=RowMajor`
*   `RhsOrder=ColMajor`
*   `ResultOrder=ColMajor`

The rationale is that the LHS is typically the constant weights of a neural
network layer (e.g. the weights of a Convolutional layer implemented as a matrix
multiplication), while the RHS and result are neural network activations,
respectively the input and output activations of the layer.

Because the RHS and result are activations, we want them to share the same
storage order -- so that one layer's output activations can be readily used as
the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

We also know from general considerations on matrix multiplication that it is
slightly more efficient to have the direction of accumulation (the "depth"
dimension) be the direction of contiguous storage in memory. That means that it
is always going to be slightly easier and more efficient to have
`LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

Putting this together, we arrive at gemmlowp's focus on the above-described
combination of storage orders.

Using other storage orders will typically mean taking less efficient paths in
the packing and unpacking stages, see [packing.md](packing.md). The compute
kernel stage ([kernel.md](kernel.md)) is unaffected.

## GemmWithOutputPipelinePC

This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
scalars. They are then broadcast against the LHS and RHS respectively.

This is useful for some flavors of neural network inference with "per-channel
quantization", hence the PC suffix. It has been useful in some settings where a
neural network trained in float arithmetic was subsequently quantized. On the
other hand, retraining neural networks for quantized inference tends to remove
the need for per-channel quantization. For that reason, the long-term usefulness
of this entry point is in question.

## Gemm

This is gemmlowp's original, now legacy and deprecated, entry point. See the
section of [low-precision.md](low-precision.md) on the legacy quantization
paradigm. Avoid in new code.

## The eight_bit_int_gemm directory

As explained in the top-level [README.md](../README.md#public-interfaces), this
is entirely deprecated.
162