xref: /aosp_15_r20/external/gemmlowp/doc/output.md (revision 5f39d1b313f0528e11bae88b3029b54b9e1033e7)
# Output pipelines in gemmlowp

In gemmlowp, the "output pipeline" is the process that takes a final `int32`
accumulator value (the output of the compute/kernel stage), processes it to
obtain the final value (typically a `uint8` value), and writes it to the
destination matrix.

gemmlowp has some genericity in what arithmetic transformations take place in
the output pipeline, so as to allow different users to implement different
quantization paradigms. See [low-precision.md](low-precision.md) and
[quantization.md](quantization.md).

Besides implementing a quantization paradigm, output pipelines are also good
for implementing fused operations, where a matrix multiplication feeds into
other operations applied to its result, without additional array traversals.
For instance, when implementing neural network inference, one might have a
convolutional layer with a bias-addition and an activation. One then wants to
feed the result of the matrix multiplication implementing the convolutional
operator itself directly into the bias-addition and activation function.
gemmlowp's output pipelines allow implementing that: the bias-addition and
activation function are just additional stages in the output pipeline.

## Usage

The gemmlowp entry point that allows using an arbitrary output pipeline is
`GemmWithOutputPipeline` in [public/gemmlowp.h](../public/gemmlowp.h).

The output pipeline is specified as a `std::tuple` of "output stages", each of
which defines an elementary arithmetic transformation.

All available output stages are defined in
[public/output_stages.h](../public/output_stages.h).

## Example usage

The best place to see examples of using various output pipelines is the unit
test,

```
test/test.cc
```

specifically in this function:

```
TestOutputStages
```
Separately, a self-contained example showing how to use gemmlowp to compute a
quantized matrix multiplication with a sound quantization paradigm is here:

[doc/quantization_example.cc](quantization_example.cc)
54