METAPROGRAMMED GEMM
===================

The two main goals of this library are:
- providing a new matrix multiplication kernel.
- providing optimized code paths for as many user scenarios as possible,
  without enforcing additional input data constraints (padding, sizes,
  strides, layout).

To enable this code, add -DGEMMLOWP_USE_META_FASTPATH to your build setup.

The new kernel
--------------

The multiplication kernel, the innermost loop of the matrix multiplication
responsible for the row/column products, was rewritten. The new code
produces a 3x3 result patch and processes the row/column arrays in 8-element
packs (the kernel 'shape' is 3x3x8, compared to the previous 12x4x2). By using
specialized 8-bit multiplication, aggregating into vector aggregators, and
then reducing with parallel horizontal addition, we devised code that achieves
higher arithmetic density (arithmetic operations per assembly instruction).
The arithmetic performance of the new kernel exceeds 18 GOps/s on a vanilla
Nexus 5 phone, which is practically the peak for this device.

To feed the kernel with input data and minimize the number of instructions
other than the arithmetic operations, a different packing scheme is used.
Three rows (columns) are interleaved every 8 elements so that they can be
read from contiguous memory in one operation inside the kernel.
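As a rough illustration of this interleaving (a sketch based on the description above, not the library's actual packing code), three rows split into 8-element packs end up laid out pack-by-pack in memory:

```python
# Hypothetical sketch of the interleaved packing scheme: three rows are
# split into 8-element packs, and the packs are interleaved so that the
# kernel can read one pack per row from contiguous memory.
PACK = 8  # elements consumed per row per kernel iteration


def pack_three_rows(rows):
    """Interleave three equal-length rows every PACK elements."""
    assert len(rows) == 3 and len(rows[0]) % PACK == 0
    packed = []
    for offset in range(0, len(rows[0]), PACK):
        for row in rows:
            packed.extend(row[offset:offset + PACK])
    return packed


# Example with depth 16: the layout becomes
# r0[0:8], r1[0:8], r2[0:8], r0[8:16], r1[8:16], r2[8:16]
rows = [[r * 100 + i for i in range(16)] for r in range(3)]
packed = pack_three_rows(rows)
```

With this layout, each kernel iteration loads the next three 8-element packs from one contiguous region instead of gathering from three strided row pointers.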
Additional memory preload hint operations are inserted into the kernel to
hide memory latency behind the arithmetic operations.

Generated code
--------------

The basic kernel used in this approach has shape 3x3x8. Obviously this
kernel can be easily applied to multiplications where the matrix sizes are
M x K and K x N, with M and N being multiples of 3 and K a multiple of 8.

We rejected the two obvious solutions of padding the matrix sizes, or using
the reference implementation for the leftovers. Neither did we consider
enforcing extra constraints on the caller.

To allow all matrix sizes, kernels processing all combinations of 1, 2 or 3
rows and 1, 2 or 3 columns are required. Similarly, to allow all possible
depths, the leftover values (up to 7 elements) need to be handled.

Instead of writing those kernels by hand, we decided to generate them with
Python scripts. Nine versions of the multiplication kernel were prepared.
Additionally, packing and unpacking code for the different row/column counts
and depth leftovers was generated. Finally, separate code was generated for
aligned and unaligned memory reads/writes.

Using those multiplication and packing/unpacking primitives, 144 gemm
function versions were prepared.
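One way the 144 variants could arise from these combinations is sketched below. The factorization into 3 row counts x 3 column counts x 8 depth leftovers x 2 alignment modes, and the `variant_key` helper, are assumptions for illustration, not the library's actual generator or dispatch logic:

```python
from itertools import product

# Hypothetical enumeration of the specialized gemm variants: every
# combination of row leftover (1-3), column leftover (1-3), depth
# leftover (0-7), and aligned/unaligned memory access gets its own
# pre-generated function: 3 * 3 * 8 * 2 = 144.
ROW_LEFTOVERS = (1, 2, 3)
COL_LEFTOVERS = (1, 2, 3)
DEPTH_LEFTOVERS = range(8)   # 0..7 elements beyond a multiple of 8
ALIGNMENT = (True, False)    # aligned vs. unaligned reads/writes

variants = list(product(ROW_LEFTOVERS, COL_LEFTOVERS,
                        DEPTH_LEFTOVERS, ALIGNMENT))


def variant_key(m, n, k, aligned):
    """Select the specialized variant for given sizes (hypothetical)."""
    return (m % 3 or 3, n % 3 or 3, k % 8, aligned)
```

A single top-level dispatch computed once per call, as in `variant_key`, is what lets the generated kernels themselves stay free of branches and leftover loops.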
On top of these sits one high-level gemm function that switches to the
appropriate preoptimized version at runtime.

This approach allowed moving all unnecessary branching and conditional
execution outside of the inner loops. It also allowed removing all the
short loops required for leftover handling. Finally, aligned memory
reads/writes are used wherever the provided input data allows.

Results
-------

The library shows up to 35% faster gemm execution in some cases (e.g. the
ImageNet benchmark).

Files
-----

single_thread_gemm.h
-- generated ARM/NEON 8bit x 8bit gemm implementation. Contains all the
   optimized, unrolled and curried pack/unpack and multiply procedures, and
   a single gemm function that switches between the optimized versions based
   on the runtime parameters.

multi_thread_gemm.h
-- a simple parallelization scheme for the gemm function.

generators/gemm_NxMxK_neon.py
-- script that generates the single_thread_gemm.h header library.
   Usage: python gemm_NxMxK_neon.py > single_thread_gemm.h