# Gemm Tuner

## Introduction

This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each of
which has a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
The details of these strategies can be found in the documentation of the corresponding kernels:
**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.

The Tuner consists of 2 scripts and 3 binaries:
* cl_gemm_benchmark and GemmTuner.py under examples/gemm_tuner, and
* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
  build/tests/gemm_tuner (you'll need to build the library first)

The input to the Tuner is a list of 4-valued tuples we call **GEMM shape** or **GEMMParam** (M, N, K, B, and possibly
data type). They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
```
LHS x RHS = DST
```
where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size.

The outputs of the tuning process are 4 json files:
1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam

These 4 files are the current representation of what we call the **heuristics** of a GEMM op: given a GEMMParam,
which kernel and, subsequently, which configurations for that kernel are the most performant.

## Step-by-step example

### Step1: Prepare the shape and configs files
1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
   some prior heuristics, but can be provided by the Compute Library developers upon request, based on your target device).

   Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.

   Please refer to the Prerequisite section for more details.

### Step2: Push relevant files to the target device
The following files need to be present on the target device:
* benchmark script: \<ComputeLibrary\>/examples/gemm_tuner/cl_gemm_benchmark
* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
* Example benchmark binaries: \<ComputeLibrary\>/build/tests/gemm_tuner/benchmark_cl_gemm*

### Step3: Collect benchmark data
With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed
to a folder called *gemm_tuner*. While logged onto our device:
```
# Native
./cl_gemm_benchmark -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
# Reshaped Only RHS
./cl_gemm_benchmark -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
# Reshaped
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
```
You can repeat the 3 commands above to build some redundancy into your benchmark data (measurements are noisy),
but you may need to change the output folder for each repeat.

It is also possible to split the benchmark phase among different platforms using the **-i** and **-n** options to
specify the starting experiment and the number of benchmarks to run.
```
# Reshaped benchmark on 3 different platforms
## Platform 1
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 0 -n 8
## Platform 2
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 8 -n 8
## Platform 3
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 16 -n 8
```

### Step4: Generate the heuristics
1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine
2. We use the GemmTuner.py script to give us the heuristics
    ```
    python3 <ComputeLibrary>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
    ```
    When it's finished, there should be 4 json files in the *heuristics* folder

One thing to notice is that the config heuristics might give more than 1 recommendation for each GEMMParam, because
we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by
passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script.
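
To illustrate why a lower tolerance yields fewer recommendations, here is a minimal Python sketch. It is not the code used by GemmTuner.py. It assumes the tolerance is an absolute difference in milliseconds from the fastest measured config, and the function name and the timings are hypothetical.

```python
def configs_within_tolerance(timings_ms, tolerance_ms):
    """Return every config whose time is within tolerance_ms of the fastest one.

    Assumption: "equivalent performance" means an absolute gap (in ms) to the
    best measurement. The real GemmTuner.py logic may differ.
    """
    best = min(timings_ms.values())
    candidates = (cfg for cfg, t in timings_ms.items() if t - best <= tolerance_ms)
    # Fastest config first
    return sorted(candidates, key=lambda cfg: timings_ms[cfg])


# Hypothetical timings (ms) for one GEMMParam; keys are native <m0, n0, k0> configs
timings = {(1, 4, 4): 1.30, (2, 3, 8): 1.25, (4, 4, 4): 1.90}

print(configs_within_tolerance(timings, 0.1))   # [(2, 3, 8), (1, 4, 4)]
print(configs_within_tolerance(timings, 0.01))  # [(2, 3, 8)]
```

With a tolerance of 0.1 ms two configs count as equally good; tightening it to 0.01 ms leaves only the fastest one.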

## Prerequisite
* A target device to be tuned, plus the following on the device:
    * Android or Linux OS
    * Bash shell
    * Built Compute Library with benchmark example binaries
    * cl_gemm_benchmark script
    * gemm shape file

        A csv file containing the **GEMMParam search list**. This is the list of GEMMParams/gemm shapes that we're
        interested in (for more details see the Approach section). The default list is prepared by Compute Library
        developers in advance and can be provided on request.

        The format is described as:

        A headerless csv file with fields separated by commas.

        A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
        RHS) with:

        * M - Number of lhs matrix rows
        * N - Number of rhs matrix columns
        * K - Number of lhs matrix columns/rhs matrix rows
        * B - Batch size

        An example gemm shape file looks like:
        ```
        100,100,30,1
        100,100,30,3
        ...
        ```
    * gemm config file

        A csv file containing the **GEMMConfig search list**. This is the list of candidate GEMMConfigs among which we
        search for the optimal one. **Note that we have a different list for each strategy.**
        The default lists are prepared by Compute Library developers in advance and can be provided on request.

        The format of the file for each strategy is the same:

        A headerless csv file with fields separated by commas.

        However, the fields of GEMMConfig differ for each strategy:

        * Strategy **native**:
          A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:

            * m0 - Number of rows processed by the matrix multiplication
            * n0 - Number of columns processed by the matrix multiplication
            * k0 - Number of partial accumulations performed by the matrix multiplication

            Only the following configurations of M0, N0 and K0 are currently supported:

            * M0 = 1, 2, 3, 4, 5, 6, 7, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16

            An example gemm config file looks like:
            ```
            1,4,4
            2,3,8
            ...
            ```
        * Strategy **reshaped_rhs_only**:
          A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values:

            * m0 - Number of rows processed by the matrix multiplication
            * n0 - Number of columns processed by the matrix multiplication
            * k0 - Number of partial accumulations performed by the matrix multiplication
            * h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
            * interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
            * transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
            * export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0).
              Can only be true with certain combinations of the GEMMParams and other configs. Please refer to
              CLGEMMReshapeRHSMatrixKernel for more details

            Only the following configurations of M0, N0, K0 and H0 are currently supported:

            * M0 = 1, 2, 3, 4, 5, 6, 7, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16
            * H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,1,1,0
            4,4,4,3,1,0,1
            ...
            ```
        * Strategy **reshaped**:
          A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values:

            * m0 - Number of rows processed by the matrix multiplication
            * n0 - Number of columns processed by the matrix multiplication
            * k0 - Number of partial accumulations performed by the matrix multiplication
            * v0 - Number of vertical blocks of size (m0xk0) stored on the same output row
            * h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
            * interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
            * interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
            * transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do
              transpose lhs matrix (0)
            * export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0).
              Can only be true with certain combinations of the GEMMParams and other configs. Please refer to
              CLGEMMReshapeRHSMatrixKernel for more details

            If the rhs matrix is transposed, only the following configurations are currently supported:

            * M0 = 2, 3, 4, 5, 6, 7, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16
            * V0 >= 1
            * H0 >= 1

            If the lhs matrix is transposed, only the following configurations are currently supported:

            * M0 = 2, 3, 4, 8
            * N0 = 2, 3, 4, 8, 16
            * K0 = 2, 3, 4, 8, 16
            * V0 >= 1
            * H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,3,1,1,1,0
            4,4,4,3,3,1,1,0,1
            ...
            ```
* A host machine, plus these on the machine:
    * python >= 3.6
    * GemmTuner.py script

## Usage
The usage of the 2 scripts:

1. cl_gemm_benchmark

    Run the shell script (**cl_gemm_benchmark**) on your **target device**. Note that all the built benchmark
    examples (build/tests/gemm_tuner/benchmark_cl_gemm*) have to be present on your target device prior to running.
    The benchmark results will be saved to json files in an output directory.
    ```
    Usage: cl_gemm_benchmark [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file>
    -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]

    Options:
        -h
        Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
        strategy. Otherwise if no strategy is specified, display messages for all available strategies.

        -s <strategy>
        Strategy option.
        Options: ${ALL_STRATEGY_OPTIONS[@]}.

        -e <example_binary_dir>
        Path to directory that holds all example binaries

        -g <gemm_shape_file>
        Path to gemm shape csv file

        -c <gemm_config_file>
        Path to gemm config csv file

        -d <data_type>
        Data type option with which to run benchmark examples
        Default: ${DEFAULT_DATA_TYPE}
        Supported options:
            Strategy           :    Data Types
            Native             :    F32
            Reshaped           :    F16, F32
            Reshaped RHS Only  :    F16, F32

        -o <out_dir>
        Path to output directory that holds output json files
        Default: ${DEFAULT_OUT_DIR}
    ```
2. GemmTuner.py

    Run the python script (**GemmTuner.py**) on your **host machine**.
    You'll need to transfer all the benchmark result json files generated in the previous step to your host machine
    beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files.
    ```
    Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]

    CL GEMM Tuner
    optional arguments:
    -h, --help            show this help message and exit
    -b PATH, --benchmark_results PATH
                            Path to benchmark result directory, where benchmark
                            result json files have a file extension of
                            'gemmtuner_benchmark'
    -o PATH, --output_dir PATH
                            Path to directory that holds output json files.
    -t TOLERANCE, --tolerance TOLERANCE
                            For testing if two GEMMConfigs are equivalent in terms
                            of performance. The tolerance is OpenCL timer in
                            milliseconds. Recommended value: <= 0.1 ms
    -D, --debug           Enable script debugging output
    ```
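
Before pushing a config file to the target device, it can help to check it on the host against the supported values listed in the Prerequisite section. The following is a minimal Python sketch for the **native** strategy only. The helper name `invalid_native_rows` is hypothetical and not part of the Tuner, and the sample rows are made up for illustration.

```python
import csv
import io

# Supported values for the native strategy, as listed in the Prerequisite section
NATIVE_SUPPORTED = {
    "m0": {1, 2, 3, 4, 5, 6, 7, 8},
    "n0": {2, 3, 4, 8, 16},
    "k0": {2, 3, 4, 8, 16},
}


def invalid_native_rows(csv_text):
    """Return (line_number, row) pairs that are not valid native <m0, n0, k0> configs."""
    bad = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        try:
            # Raises ValueError on non-integers and on a wrong number of fields
            m0, n0, k0 = (int(field) for field in row)
        except ValueError:
            bad.append((line_no, row))
            continue
        if (m0 not in NATIVE_SUPPORTED["m0"]
                or n0 not in NATIVE_SUPPORTED["n0"]
                or k0 not in NATIVE_SUPPORTED["k0"]):
            bad.append((line_no, row))
    return bad


# Hypothetical headerless config file: rows 3 and 4 use unsupported values
sample = "1,4,4\n2,3,8\n9,4,4\n2,5,8\n"
print(invalid_native_rows(sample))
# [(3, ['9', '4', '4']), (4, ['2', '5', '8'])]
```

The same pattern could be extended to the other two strategies by adding their extra fields and value sets.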