# Gemm Tuner

## Introduction

This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each
with a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
The details of these strategies can be found in the documentation of the corresponding kernels:
**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.

The Tuner consists of 2 scripts and 3 binaries:
* cl_gemm_benchmark and GemmTuner.py under examples/gemm_tuner, and
* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
  build/tests/gemm_tuner (you'll need to build the library first)

The inputs to the Tuner are a list of 4-valued tuples we call **GEMM shape** or **GEMMParam** (M, N, K, B, and possibly
data type). They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
```
LHS x RHS = DST
```
where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size.

The outputs of the tuning process are 4 json files:
1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam

These 4 files are the current representations we use for what we call the **heuristics** of a GEMM op: given a GEMMParam,
which kernel, and subsequently which configurations for that kernel, are the most performant.

## Step-by-step example

### Step1: Prepare the shape and configs files
1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
    some prior heuristics, but can be provided by the Compute Library developers upon request, based on your target device).

    Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.

    Please refer to the Prerequisite section for more details.

### Step2: Push relevant files to the target device
All the files that need to be present on the target device are:
* benchmark script: \<ComputeLibrary\>/examples/gemm_tuner/cl_gemm_benchmark
* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
* Example benchmark binaries: \<ComputeLibrary\>/build/tests/gemm_tuner/benchmark_cl_gemm*

### Step3: Collect benchmark data
With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed
to a folder called *gemm_tuner*.
While logged onto our device:
```
# Native
./cl_gemm_benchmark -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
# Reshaped Only RHS
./cl_gemm_benchmark -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
# Reshaped
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
```
You can repeat the 3 commands above to add some redundancy to your benchmark data (as you can imagine, measurement is noisy),
but you may need to change the output folder for each repeat.

It is also possible to split the benchmark phase among different platforms using the **-i** and **-n** options to specify the starting experiment and the number of benchmarks to run.
```
# Reshaped benchmark on 3 different platforms
## Platform 1
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 0 -n 8
## Platform 2
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 8 -n 8
## Platform 3
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 16 -n 8
```

### Step4: Generate the heuristics
1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine
2. We use the GemmTuner.py script to give us the heuristics
   ```
   python3 <ComputeLibrary>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
   ```
   When it's finished, there should be 4 json files in the *heuristics* folder

One thing to notice is that the config heuristics might give more than one recommendation for each GEMMParam, because
we accept all good GEMMConfigs with a tolerance.
If you want fewer recommendations, you can decrease the tolerance by
passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script.

## Prerequisite
* A target device to be tuned, plus the following on the device:
    * Android or Linux OS
    * Bash shell
    * Built Compute Library with benchmark examples binaries
    * cl_gemm_benchmark script
    * gemm shape file

        A csv file containing the **GEMMParam search list**. This is the list of GEMMParams/gemm shapes that we're
        interested in (For more details see Approach section). The default list is prepared by Compute Library developers in advance
        and can be provided on request.

        The format is described as:

        A headerless csv file with fields separated by commas.

        A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
        RHS) with:

            M - Number of lhs matrix rows
            N - Number of rhs matrix columns
            K - Number of lhs matrix columns/rhs matrix rows
            B - Batch size

        An example gemm shape file looks like:
        ```
        100,100,30,1
        100,100,30,3
        ...
        ```
    * gemm config file

        A csv file containing the **GEMMConfig search list**. This is the list of candidate GEMMConfigs among which we
        search for the optimal one. **Note that we have a different list for each strategy.**
        The default lists are prepared by Compute Library developers in advance and can be provided on request.

        The format of the file for each strategy is the same:

        A headerless csv file with fields separated by commas.

        However, the fields of GEMMConfig differ for each strategy:

        * Strategy **native**:
            A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:

                m0 - Number of rows processed by the matrix multiplication
                n0 - Number of columns processed by the matrix multiplication
                k0 - Number of partial accumulations performed by the matrix multiplication

            Only the following configurations of M0, N0 and K0 are currently supported:

                M0 = 1, 2, 3, 4, 5, 6, 7, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16

            An example gemm config file looks like:
            ```
            1,4,4
            2,3,8
            ...
            ```
        * Strategy **reshaped_rhs_only**:
            A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values:

                m0 - Number of rows processed by the matrix multiplication
                n0 - Number of columns processed by the matrix multiplication
                k0 - Number of partial accumulations performed by the matrix multiplication
                h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
                interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
                transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
                export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                         with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                         for more details

            Only the following configurations of M0, N0, K0 and H0 are currently supported:

                M0 = 1, 2, 3, 4, 5, 6, 7, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16
                H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,1,1,0
            4,4,4,3,1,0,1
            ...
            ```
        * Strategy **reshaped**:
            A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values:

                m0 - Number of rows processed by the matrix multiplication
                n0 - Number of columns processed by the matrix multiplication
                k0 - Number of partial accumulations performed by the matrix multiplication
                v0 - Number of vertical blocks of size (m0xk0) stored on the same output row
                h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
                interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
                interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
                transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
                export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                         with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                         for more details

            If the rhs matrix is transposed, only the following configurations are currently supported:

                M0 = 2, 3, 4, 5, 6, 7, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16
                V0 >= 1
                H0 >= 1

            If the lhs matrix is transposed, only the following configurations are currently supported:

                M0 = 2, 3, 4, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16
                V0 >= 1
                H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,3,1,1,1,0
            4,4,4,3,3,1,1,0,1
            ...
            ```
* A host machine, plus these on the machine:
    * python >= 3.6
    * GemmTuner.py script

## Usage
The usage of the 2 scripts:

1. cl_gemm_benchmark

    Run the shell script (**cl_gemm_benchmark**) on your **target device**. Note that all the built benchmark
    examples: build/tests/gemm_tuner/benchmark_cl_gemm*, have to be present on your target device prior to running.
    The benchmark results will be saved to json files in an output directory.
    ```
    Usage: cl_gemm_benchmark [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file>
    -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]

    Options:
            -h
            Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
            strategy. Otherwise if no strategy is specified, display messages for all available strategies.

            -s <strategy>
            Strategy option.
            Options: ${ALL_STRATEGY_OPTIONS[@]}.

            -e <example_binary_dir>
            Path to directory that holds all example binaries

            -g <gemm_shape_file>
            Path to gemm shape csv file

            -c <gemm_config_file>
            Path to gemm config csv file

            -d <data_type>
            Data type option with which to run benchmark examples
            Default: ${DEFAULT_DATA_TYPE}
            Supported options:
            Strategy            :    Data Types
            Native              :    F32
            Reshaped            :    F16, F32
            Reshaped RHS Only   :    F16, F32

            -o <out_dir>
            Path to output directory that holds output json files
            Default: ${DEFAULT_OUT_DIR}
    ```
2. GemmTuner.py:

    Run the python script (**GemmTuner.py**) on your **host machine**.
    You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
    beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files
    ```
    Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]

    CL GEMM Tuner
    optional arguments:
    -h, --help            show this help message and exit
    -b PATH, --benchmark_results PATH
                            Path to benchmark result directory, where benchmark
                            result json files have a file extension of
                            'gemmtuner_benchmark'
    -o PATH, --output_dir PATH
                            Path to directory that holds output json files.
    -t TOLERANCE, --tolerance TOLERANCE
                            For testing if two GEMMConfigs are equivalent in terms
                            of performance. The tolerance is OpenCL timer in
                            milliseconds. Recommended value: <= 0.1 ms
    -D, --debug           Enable script debugging output

    ```
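
The value constraints listed in the Prerequisite section for the **native** strategy can be checked on the host before a config file is pushed to the device. The following is only a minimal sketch under stated assumptions: `check_native_configs` and `NATIVE_SUPPORTED` are hypothetical names introduced for illustration and are not part of the Gemm Tuner scripts, and the allowed values are copied from the lists above rather than read from the library.

```python
import csv
import io

# Supported values for the native strategy, copied from the Prerequisite section.
# Assumption: these sets mirror the documented lists; they are not queried from the library.
NATIVE_SUPPORTED = {
    "m0": {1, 2, 3, 4, 5, 6, 7, 8},
    "n0": {2, 3, 4, 8, 16},
    "k0": {2, 3, 4, 8, 16},
}


def check_native_configs(csv_text):
    """Return (row_number, message) pairs for rows that break the documented rules.

    Hypothetical helper for illustration; it is not part of the Gemm Tuner.
    """
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(NATIVE_SUPPORTED):
            problems.append((row_number, "expected 3 fields, got %d" % len(row)))
            continue
        try:
            values = [int(field) for field in row]
        except ValueError:
            problems.append((row_number, "all fields must be integers"))
            continue
        # Compare each field against its documented set of supported values.
        for (name, allowed), value in zip(NATIVE_SUPPORTED.items(), values):
            if value not in allowed:
                problems.append((row_number, "%s=%d is not supported" % (name, value)))
    return problems


if __name__ == "__main__":
    # Headerless, comma-separated rows of <m0, n0, k0>; the last row is deliberately invalid.
    sample = "1,4,4\n2,3,8\n9,4,5\n"
    for row_number, message in check_native_configs(sample):
        print("row %d: %s" % (row_number, message))
```

The same pattern could be extended to the other two strategies by adding their extra fields (h0, v0 and the boolean flags), but the conditions under which export_to_cl_image_rhs may be set are not listed here, so this sketch does not attempt to check them.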