xref: /aosp_15_r20/external/gemmlowp/meta/README (revision 5f39d1b313f0528e11bae88b3029b54b9e1033e7)
METAPROGRAMMED GEMM
===================

The two main goals of this library are:
- providing a new matrix multiplication kernel.
- providing optimized code paths for as many user scenarios as possible,
  without enforcing additional constraints on the input data (padding, sizes,
  strides, layout).

To enable this code, add -DGEMMLOWP_USE_META_FASTPATH to your build setup.

The new kernel
--------------

The multiplication kernel, the innermost loop of the matrix multiplication
responsible for the row/column products, was rewritten. The new code produces
a 3x3 result patch and processes the row/column arrays in 8-element packs
(the kernel 'shape' is 3x3x8, compared to the previous 12x4x2). By using
specialized 8-bit multiplication, aggregating into vector aggregators, and
then reducing with parallel horizontal addition, we devised code that achieves
higher arithmetic density (arithmetic operations per assembly instruction).
The arithmetic performance of the new kernel exceeds 18 GOps/s on a vanilla
Nexus 5 phone, which is practically the peak for this device.
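
As a rough illustration, one step of a 3x3x8 kernel can be sketched in plain
Python/NumPy. This is only an explanatory model of what the NEON assembly
computes, not the actual kernel; the function name and signature are
hypothetical:

```python
# Hedged sketch: what one 3x3x8 kernel step computes -- a 3x3 patch of
# dot products over an 8-element slice of the depth dimension.
import numpy as np

def kernel_3x3x8(lhs_rows, rhs_cols, acc):
    # lhs_rows: 3x8 slice of three packed rows (uint8).
    # rhs_cols: 3x8 slice of three packed columns (uint8).
    # acc: 3x3 int32 accumulator, updated in place and returned.
    for i in range(3):
        for j in range(3):
            acc[i, j] += int(np.dot(lhs_rows[i].astype(np.int32),
                                    rhs_cols[j].astype(np.int32)))
    return acc
```

The real kernel performs the same accumulation with vector instructions,
iterating this step over the full depth in 8-element increments.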

In order to feed the kernel with input data and to minimize the number of
instructions other than the arithmetic operations, a different packing scheme
was used. Three rows (columns) are interleaved every 8 elements so that they
can be read from contiguous memory in one operation inside the kernel.
Additionally, memory preload hint instructions are inserted into the kernel
to hide memory latency behind the arithmetic operations.
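
The interleaved packing can be sketched as follows (a simplified illustration
in Python/NumPy; `pack_3_rows` is a hypothetical name, not the library's
actual API, and leftover handling is omitted):

```python
# Hedged sketch of the 3-row, 8-element interleaved packing: 8 elements of
# row 0, then 8 of row 1, then 8 of row 2, then the next 8 of row 0, etc.,
# so the kernel reads each 3x8 cell from contiguous memory.
import numpy as np

def pack_3_rows(rows, depth):
    assert rows.shape == (3, depth) and depth % 8 == 0
    out = []
    for d in range(0, depth, 8):
        for r in range(3):
            out.extend(rows[r, d:d + 8])
    return np.array(out, dtype=rows.dtype)
```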

Generated code
--------------

The basic kernel used in this approach is of shape 3x3x8. Obviously this
kernel can be easily applied to multiplications where the matrix sizes are
M x K and K x N, with M and N being multiples of 3 and K a multiple of 8.

We rejected the two obvious solutions of padding the matrices to appropriate
sizes, or using the reference implementation for the leftovers. Nor did we
consider enforcing extra constraints on the caller.

In order to allow all matrix sizes, kernels processing all combinations of
1, 2 or 3 rows and 1, 2 or 3 columns are required. Similarly, to allow all
possible depths, the leftover values (up to 7 elements) need to be handled.

Instead of writing those kernels by hand, we decided to generate them with
Python scripts. Nine versions of the multiplication kernel were prepared.
Additionally, packing and unpacking code for the different row/column counts
and depth leftovers was generated. Finally, separate code was generated for
aligned and unaligned memory reads/writes.

Using those multiplication and packing/unpacking primitives, 144 gemm
function versions were prepared. On top of them sits one high-level gemm
function that switches to one of those pre-optimized versions based on the
runtime parameters.
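
The count 144 follows from the variant dimensions: 3 row leftovers x 3 column
leftovers x 8 depth leftovers x 2 alignment cases. The selection idea can be
sketched roughly as follows (the index computation is an assumption for
illustration; the generated header dispatches to named functions rather than
computing an index):

```python
# Hedged sketch: pick one of 144 pre-generated gemm variants from the
# runtime sizes and pointer alignment.
def select_gemm_variant(m, n, k, aligned):
    row_leftover = m % 3     # 0, 1 or 2 rows beyond full 3-row blocks
    col_leftover = n % 3     # 0, 1 or 2 columns beyond full 3-column blocks
    depth_leftover = k % 8   # 0..7 elements beyond full 8-element packs
    # 3 * 3 * 8 * 2 = 144 combinations, one per generated version.
    return (((row_leftover * 3 + col_leftover) * 8 + depth_leftover) * 2
            + int(aligned))
```

Because the variant is chosen once per call, none of these conditions is
re-evaluated inside the inner loops.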

This approach allowed moving all unnecessary branching and conditional
execution outside of the inner loops. It also allowed removing all the short
loops required for leftover handling. Finally, aligned memory reads/writes
are used wherever the provided input data allows.

Results
-------

The library shows up to 35% faster gemm execution in some cases (e.g. the
ImageNet benchmark).

Files
-----

single_thread_gemm.h
-- generated ARM/NEON 8-bit x 8-bit gemm implementation. Contains all the
   optimized, unrolled and curried pack/unpack and multiply procedures, and
   a single gemm function that switches between the optimized versions based
   on the runtime parameters.

multi_thread_gemm.h
-- a simple parallelization scheme for the gemm function.

generators/gemm_NxMxK_neon.py
-- script that generates the single_thread_gemm.h header library.
   Usage: python gemm_NxMxK_neon.py > single_thread_gemm.h