METAPROGRAMMED GEMM
===================

The two main goals of this library are:
- providing a new matrix multiplication kernel.
- providing optimized code paths for as many user scenarios as possible,
  without enforcing additional input data constraints (padding, sizes,
  strides, layout).

To enable this code, add -DGEMMLOWP_USE_META_FASTPATH to your build setup.

The new kernel
--------------

The multiplication kernel, the innermost loop of the matrix multiplication
responsible for the row/column products, was rewritten. The new code
produces a 3x3 result patch and processes the row/column arrays in 8-element
packs (the kernel 'shape' is 3x3x8, compared to the previous 12x4x2). By using
specialized 8-bit multiplication, aggregating into vector aggregators, and
then reducing with parallel horizontal addition, we devised code that achieves
higher arithmetic density (arithmetic operations per assembly instruction).
The arithmetic performance of the new kernel exceeds 18 GOps/s on a vanilla
Nexus 5 phone, which is practically the peak for this device.

To feed the kernel with input data and minimize the number of instructions
other than the arithmetic operations, a different packing scheme is used.
Three rows (columns) are interleaved every 8 elements so that they can be
read from contiguous memory in one operation inside the kernel.
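As a rough illustration of this interleaving (a sketch based on the description above, not the library's actual packing code), three rows split into 8-element packs end up laid out pack-by-pack in memory:

```python
# Hypothetical sketch of the interleaved packing scheme: three rows are
# split into 8-element packs, and the packs are interleaved so that the
# kernel can read one pack per row from contiguous memory.
PACK = 8  # elements consumed per row per kernel iteration


def pack_three_rows(rows):
    """Interleave three equal-length rows every PACK elements."""
    assert len(rows) == 3 and len(rows[0]) % PACK == 0
    packed = []
    for offset in range(0, len(rows[0]), PACK):
        for row in rows:
            packed.extend(row[offset:offset + PACK])
    return packed


# Example with depth 16: the layout becomes
# r0[0:8], r1[0:8], r2[0:8], r0[8:16], r1[8:16], r2[8:16]
rows = [[r * 100 + i for i in range(16)] for r in range(3)]
packed = pack_three_rows(rows)
```

With this layout, each kernel iteration loads the next three 8-element packs from one contiguous region instead of gathering from three strided row pointers.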
Additional memory preload hint operations are inserted into the kernel to
hide memory latency behind the arithmetic operations.

Generated code
--------------

The basic kernel used in this approach has shape 3x3x8. Obviously this
kernel can be easily applied to multiplications where the matrix sizes are
M x K and K x N, with M and N being multiples of 3 and K a multiple of 8.

We rejected the two obvious solutions of padding the matrix sizes, or using
the reference implementation for the leftovers. Neither did we consider
enforcing extra constraints on the caller.

To allow all matrix sizes, kernels processing all combinations of 1, 2 or 3
rows and 1, 2 or 3 columns are required. Similarly, to allow all possible
depths, the leftover values (up to 7 elements) need to be handled.

Instead of writing those kernels by hand, we decided to generate them with
Python scripts. Nine versions of the multiplication kernel were prepared.
Additionally, packing and unpacking code for the different row/column counts
and depth leftovers was generated. Finally, separate code was generated for
aligned and unaligned memory reads/writes.

Using those multiplication and packing/unpacking primitives, 144 gemm
function versions were prepared.
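One way the 144 variants could arise from these combinations is sketched below. The factorization into 3 row counts x 3 column counts x 8 depth leftovers x 2 alignment modes, and the `variant_key` helper, are assumptions for illustration, not the library's actual generator or dispatch logic:

```python
from itertools import product

# Hypothetical enumeration of the specialized gemm variants: every
# combination of row leftover (1-3), column leftover (1-3), depth
# leftover (0-7), and aligned/unaligned memory access gets its own
# pre-generated function: 3 * 3 * 8 * 2 = 144.
ROW_LEFTOVERS = (1, 2, 3)
COL_LEFTOVERS = (1, 2, 3)
DEPTH_LEFTOVERS = range(8)   # 0..7 elements beyond a multiple of 8
ALIGNMENT = (True, False)    # aligned vs. unaligned reads/writes

variants = list(product(ROW_LEFTOVERS, COL_LEFTOVERS,
                        DEPTH_LEFTOVERS, ALIGNMENT))


def variant_key(m, n, k, aligned):
    """Select the specialized variant for given sizes (hypothetical)."""
    return (m % 3 or 3, n % 3 or 3, k % 8, aligned)
```

A single top-level dispatch computed once per call, as in `variant_key`, is what lets the generated kernels themselves stay free of branches and leftover loops.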
On top of these sits one high-level gemm function that switches to the
appropriate preoptimized version at runtime.

This approach allowed moving all unnecessary branching and conditional
execution outside of the inner loops. It also allowed removing all the
short loops required for leftover handling. Finally, aligned memory
reads/writes are used wherever the provided input data allows.

Results
-------

The library shows up to 35% faster gemm execution in some cases (e.g. the
ImageNet benchmark).

Files
-----

single_thread_gemm.h
-- generated ARM/NEON 8bit x 8bit gemm implementation. Contains all the
   optimized, unrolled and curried pack/unpack and multiply procedures, and
   a single gemm function that switches between the optimized versions based
   on the runtime parameters.

multi_thread_gemm.h
-- a simple parallelization scheme for the gemm function.

generators/gemm_NxMxK_neon.py
-- script that generates the single_thread_gemm.h header library.
   Usage: python gemm_NxMxK_neon.py > single_thread_gemm.h