# Gemm Tuner

## Introduction

This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each
with a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
The details of these strategies can be found in the documentation of the corresponding kernels:
**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.

The Tuner consists of 2 scripts and 3 binaries:
* cl_gemm_benchmark and GemmTuner.py under examples/gemm_tuner, and
* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
  build/tests/gemm_tuner (you'll need to build the library first)

The inputs to the Tuner are a list of 4-valued tuples that we call **GEMM shape** or **GEMMParam** (M, N, K, B, and possibly
data type). They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
```
LHS x RHS = DST
```
Where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size.

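To make the shape convention concrete, here is a minimal Python sketch. The names `GemmShape`, `matrix_shapes` and `multiply_accumulates` are hypothetical helpers for illustration only and are not part of the Tuner.

```python
from typing import NamedTuple, Tuple


class GemmShape(NamedTuple):
    """Hypothetical container for one GEMM shape <M, N, K, B>."""
    m: int  # rows of LHS and DST
    n: int  # columns of RHS and DST
    k: int  # columns of LHS / rows of RHS
    b: int  # batch size


def matrix_shapes(shape: GemmShape) -> Tuple[Tuple[int, int], ...]:
    """Return the (rows, cols) of LHS, RHS and DST for a single batch."""
    lhs = (shape.m, shape.k)
    rhs = (shape.k, shape.n)
    dst = (shape.m, shape.n)
    return lhs, rhs, dst


def multiply_accumulates(shape: GemmShape) -> int:
    """Count the multiply-accumulate operations over all batches."""
    return shape.m * shape.n * shape.k * shape.b


shape = GemmShape(m=100, n=100, k=30, b=3)
print(matrix_shapes(shape))         # ((100, 30), (30, 100), (100, 100))
print(multiply_accumulates(shape))  # 900000
```
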
The outputs of the tuning process are 4 json files:
1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam

These 4 files are the current representations we use for what we call the **heuristics** of a GEMM op: given a GEMMParam,
which kernel, and subsequently which configurations for that kernel, are the most performant.

## Step-by-step example

### Step1: Prepare the shape and configs files
1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
    some prior heuristics, but can be provided by the Compute Library developers upon request, based on your target device).

   Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.

   Please refer to the Prerequisite section for more details.

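If you do not have prepared lists, the csv files can also be generated by a small script. The sketch below is only an illustration: it writes a shape file and enumerates every native `<m0, n0, k0>` combination listed as supported in the Prerequisite section. The helper names are hypothetical, and in practice you would probably prune the full grid to a smaller set of good candidates.

```python
import csv
import itertools
import tempfile
from pathlib import Path

# Supported values for the native strategy, as listed in the Prerequisite section.
M0_VALUES = [1, 2, 3, 4, 5, 6, 7, 8]
N0_VALUES = [2, 3, 4, 8, 16]
K0_VALUES = [2, 3, 4, 8, 16]


def native_config_candidates():
    """Enumerate every supported (m0, n0, k0) combination."""
    return list(itertools.product(M0_VALUES, N0_VALUES, K0_VALUES))


def write_headerless_csv(path, rows):
    """Write rows as a headerless, comma-separated file with Unix line endings."""
    with open(path, "w", newline="") as f:
        # The csv module defaults to "\r\n"; plain "\n" is safer for shell scripts.
        csv.writer(f, lineterminator="\n").writerows(rows)


# Demo: write into a temporary folder; replace it with your own working directory.
with tempfile.TemporaryDirectory() as tmp:
    shapes_path = Path(tmp) / "gemm_shapes.csv"
    configs_path = Path(tmp) / "gemm_configs_native.csv"
    write_headerless_csv(shapes_path, [(100, 100, 30, 1), (100, 100, 30, 3)])
    write_headerless_csv(configs_path, native_config_candidates())
    print(shapes_path.read_text())
    print(len(configs_path.read_text().splitlines()), "native candidates")
```
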
### Step2: Push relevant files to the target device
All the files that need to be present on the target device are:
* benchmark script: \<ComputeLibrary\>/examples/gemm_tuner/cl_gemm_benchmark
* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
* Example benchmark binaries: \<ComputeLibrary\>/build/tests/gemm_tuner/benchmark_cl_gemm*

### Step3: Collect benchmark data
With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed
to a folder called *gemm_tuner*. While logged onto our device:
```
# Native
./cl_gemm_benchmark -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
# Reshaped Only RHS
./cl_gemm_benchmark -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
# Reshaped
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
```
You can repeat the 3 commands above to add some redundancy to your benchmark data (measurements are noisy),
but you may need to change the output folder for each repeat.

It is also possible to split the benchmark phase among different platforms using the **-i** and **-n** options to specify the starting experiment and the number of benchmarks to run:

    # Reshaped benchmark on 3 different platforms
    ## Platform 1
    ./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 0 -n 8
    ## Platform 2
    ./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 8 -n 8
    ## Platform 3
    ./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 16 -n 8

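The **-i**/**-n** pairs can be computed rather than written by hand. The sketch below assumes you already know the total number of experiments; how the script counts experiments is not described here, so treat `total` as an input you supply. `split_experiments` is a hypothetical helper, not part of the Tuner.

```python
def split_experiments(total, platforms):
    """Return one (start_index, count) pair per platform, covering 0..total-1."""
    base, extra = divmod(total, platforms)
    chunks = []
    start = 0
    for p in range(platforms):
        # The first `extra` platforms take one additional experiment each.
        count = base + (1 if p < extra else 0)
        chunks.append((start, count))
        start += count
    return chunks


# 24 experiments on 3 platforms reproduces the -i/-n values used above.
for start, count in split_experiments(24, 3):
    print(f"-i {start} -n {count}")
```
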
### Step4: Generate the heuristics
1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine.
2. We use the GemmTuner.py script to give us the heuristics:
   ```
   python3 <ComputeLibrary>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
   ```
   When it's finished, there should be 4 json files in the *heuristics* folder.

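A quick sanity check on the output folder can be sketched as follows. It only verifies that the four files named in the Introduction exist and parse as JSON; it deliberately assumes nothing about their internal schema. `load_heuristics` is a hypothetical helper that you would call as `load_heuristics("heuristics")`.

```python
import json
from pathlib import Path

# File names as listed in the Introduction.
EXPECTED_FILES = [
    "gemm_type_selection.json",
    "gemm_config_native.json",
    "gemm_config_reshapedonlyrhs.json",
    "gemm_config_reshaped.json",
]


def load_heuristics(heuristics_dir):
    """Load the four heuristics files; raises FileNotFoundError if one is missing."""
    loaded = {}
    for name in EXPECTED_FILES:
        with open(Path(heuristics_dir) / name) as f:
            loaded[name] = json.load(f)
    return loaded
```
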
Note that the config heuristics might give more than one recommendation for each GEMMParam, because
we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by
passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script.

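The effect of the tolerance can be illustrated with a minimal sketch. This is not the actual GemmTuner.py logic, only one plausible reading of "accept all good GEMMConfigs within a tolerance": keep every config whose time is within the tolerance of the fastest one. The timing values are made up for illustration.

```python
def configs_within_tolerance(timings_ms, tolerance_ms):
    """Return the configs within tolerance_ms of the fastest time, fastest first."""
    best = min(timings_ms.values())
    good = [(t, cfg) for cfg, t in timings_ms.items() if t - best <= tolerance_ms]
    return [cfg for _, cfg in sorted(good)]


# Illustrative (made-up) timings in milliseconds, keyed by <m0, n0, k0>.
timings = {(4, 4, 4): 1.00, (2, 8, 4): 1.05, (1, 4, 4): 1.40}
print(configs_within_tolerance(timings, 0.1))   # two recommendations
print(configs_within_tolerance(timings, 0.01))  # only the fastest config
```
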
## Prerequisite
* A target device to be tuned, plus the following on the device:
    * Android or Linux OS
    * Bash shell
    * Built Compute Library with benchmark example binaries
    * cl_gemm_benchmark script
    * gemm shape file

       A csv file containing the **GEMMParam search list**. This is the list of GEMMParams/gemm shapes that we're
       interested in (for more details, see the Approach section). The default list is prepared by Compute Library developers in advance
       and can be provided on request.

       The format is described as:

       A headerless csv file with fields separated by commas.

       A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
       RHS) with:

       M - Number of lhs matrix rows
       N - Number of rhs matrix columns
       K - Number of lhs matrix columns/rhs matrix rows
       B - Batch size

       An example gemm shape file looks like:
  ```
  100,100,30,1
  100,100,30,3
  ...
  ```
    * gemm config file
      A csv file containing the **GEMMConfig search list**. This is the list of candidate GEMMConfigs among which we
      search for the optimal one. **Note that we have a different list for each strategy.**
      The default lists are prepared by Compute Library developers in advance and can be provided on request.

      The format of the file for each strategy is the same:

      A headerless csv file with fields separated by commas.

      However the fields of GEMMConfig differ for each strategy:

      * Strategy **native**:
        A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:

        m0 - Number of rows processed by the matrix multiplication
        n0 - Number of columns processed by the matrix multiplication
        k0 - Number of partial accumulations performed by the matrix multiplication

        Only the following configurations of M0, N0 and K0 are currently supported:

        M0 = 1, 2, 3, 4, 5, 6, 7, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16

        An example gemm config file looks like:
  ```
  1,4,4
  2,3,8
  ...
  ```
      * Strategy **reshaped_rhs_only**:
        A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values:

        m0 - Number of rows processed by the matrix multiplication
        n0 - Number of columns processed by the matrix multiplication
        k0 - Number of partial accumulations performed by the matrix multiplication
        h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
        interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
        transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
        export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                for more details.

        Only the following configurations of M0, N0, K0 and H0 are currently supported:

        M0 = 1, 2, 3, 4, 5, 6, 7, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16
        H0 >= 1

        An example gemm config file looks like:
  ```
  4,4,4,1,1,1,0
  4,4,4,3,1,0,1
  ...
  ```
      * Strategy **reshaped**:
        A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values:

        m0 - Number of rows processed by the matrix multiplication
        n0 - Number of columns processed by the matrix multiplication
        k0 - Number of partial accumulations performed by the matrix multiplication
        v0 - Number of vertical blocks of size (m0xk0) stored on the same output row
        h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
        interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
        interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
        transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
        export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                for more details.

        If the rhs matrix is transposed, only the following configurations are currently supported:

        M0 = 2, 3, 4, 5, 6, 7, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16
        V0 >= 1
        H0 >= 1

        If the lhs matrix is transposed, only the following configurations are currently supported:

        M0 = 2, 3, 4, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16
        V0 >= 1
        H0 >= 1

        An example gemm config file looks like:
  ```
  4,4,4,1,3,1,1,1,0
  4,4,4,3,3,1,1,0,1
  ...
  ```
* A host machine, plus these on the machine:
    * Python >= 3.6
    * GemmTuner.py script

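Before pushing the csv files to the device, it can help to check each row against the constraints above. The following is a minimal validation sketch with hypothetical helper names. It only encodes the value ranges listed in this section and does not check the extra conditions under which export_to_cl_image_rhs may be set.

```python
SUPPORTED_M0 = {1, 2, 3, 4, 5, 6, 7, 8}
SUPPORTED_N0 = {2, 3, 4, 8, 16}
SUPPORTED_K0 = {2, 3, 4, 8, 16}
# For the reshaped strategy, the allowed M0 depends on which matrix is transposed.
RESHAPED_M0_RHS_TRANSPOSED = {2, 3, 4, 5, 6, 7, 8}
RESHAPED_M0_LHS_TRANSPOSED = {2, 3, 4, 8}


def parse_row(line):
    """Split one headerless csv line into a list of integers."""
    return [int(field) for field in line.strip().split(",")]


def _are_flags(values):
    """True if every value is a boolean flag written as 0 or 1."""
    return all(v in (0, 1) for v in values)


def is_valid_shape(row):
    """<M, N, K, B>: 4 positive integers."""
    return len(row) == 4 and all(v > 0 for v in row)


def is_valid_native(row):
    """<m0, n0, k0>."""
    if len(row) != 3:
        return False
    m0, n0, k0 = row
    return m0 in SUPPORTED_M0 and n0 in SUPPORTED_N0 and k0 in SUPPORTED_K0


def is_valid_reshaped_rhs_only(row):
    """<m0, n0, k0, h0> followed by 3 boolean flags."""
    if len(row) != 7:
        return False
    m0, n0, k0, h0 = row[:4]
    return is_valid_native([m0, n0, k0]) and h0 >= 1 and _are_flags(row[4:])


def is_valid_reshaped(row):
    """<m0, n0, k0, v0, h0> followed by 4 boolean flags."""
    if len(row) != 9:
        return False
    m0, n0, k0, v0, h0 = row[:5]
    transpose_rhs = row[7]  # 1: rhs transposed, 0: lhs transposed
    allowed_m0 = RESHAPED_M0_RHS_TRANSPOSED if transpose_rhs == 1 else RESHAPED_M0_LHS_TRANSPOSED
    return (m0 in allowed_m0 and n0 in SUPPORTED_N0 and k0 in SUPPORTED_K0
            and v0 >= 1 and h0 >= 1 and _are_flags(row[5:]))


print(is_valid_native(parse_row("1,4,4")))
print(is_valid_reshaped(parse_row("4,4,4,3,3,1,1,0,1")))
print(is_valid_reshaped(parse_row("5,4,4,1,1,1,1,0,0")))  # M0 = 5 with lhs transposed
```
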
## Usage
The usage of the 2 scripts:

1. cl_gemm_benchmark

   Run the shell script (**cl_gemm_benchmark**) on your **target device**. Note that all the built benchmark
   examples (build/tests/gemm_tuner/benchmark_cl_gemm*) have to be present on your target device prior to running.
   The benchmark results will be saved to json files in an output directory.
   ```
   Usage: cl_gemm_benchmark [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file>
   -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]

   Options:
           -h
           Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
           strategy. Otherwise if no strategy is specified, display messages for all available strategies.

           -s <strategy>
           Strategy option.
           Options: ${ALL_STRATEGY_OPTIONS[@]}.

           -e <example_binary_dir>
           Path to directory that holds all example binaries

           -g <gemm_shape_file>
           Path to gemm shape csv file

           -c <gemm_config_file>
           Path to gemm config csv file

           -d <data_type>
           Data type option with which to run benchmark examples
           Default: ${DEFAULT_DATA_TYPE}
           Supported options:
           Strategy            :    Data Types
           Native              :    F32
           Reshaped            :    F16, F32
           Reshaped RHS Only   :    F16, F32

           -o <out_dir>
           Path to output directory that holds output json files
           Default: ${DEFAULT_OUT_DIR}
   ```
2. GemmTuner.py

   Run the python script (**GemmTuner.py**) on your **host machine**.
   You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
   beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files.
   ```
   Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]

   CL GEMM Tuner
   optional arguments:
     -h, --help            show this help message and exit
     -b PATH, --benchmark_results PATH
                           Path to benchmark result directory, where benchmark
                           result json files have a file extension of
                           'gemmtuner_benchmark'
     -o PATH, --output_dir PATH
                           Path to directory that holds output json files.
     -t TOLERANCE, --tolerance TOLERANCE
                           For testing if two GEMMConfigs are equivalent in terms
                           of performance. The tolerance is OpenCL timer in
                           milliseconds. Recommended value: <= 0.1 ms
     -D, --debug           Enable script debugging output

   ```

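When driving several benchmark runs from a script, the command line can be assembled programmatically. The sketch below is a hypothetical helper built only from the options documented above. The strategy names are those used in the step-by-step example, and passing the data type exactly as spelled in the table (e.g. `F16`) is an assumption.

```python
import shlex

# Data types per strategy, taken from the -d table above.
SUPPORTED_DATA_TYPES = {
    "native": {"F32"},
    "reshaped": {"F16", "F32"},
    "reshaped_rhs_only": {"F16", "F32"},
}


def benchmark_command(strategy, binary_dir, shape_file, config_file,
                      out_dir=None, data_type=None, script="./cl_gemm_benchmark"):
    """Build the argument list for one cl_gemm_benchmark run."""
    if strategy not in SUPPORTED_DATA_TYPES:
        raise ValueError(f"unknown strategy: {strategy}")
    cmd = [script, "-s", strategy, "-e", binary_dir, "-g", shape_file, "-c", config_file]
    if data_type is not None:
        if data_type not in SUPPORTED_DATA_TYPES[strategy]:
            raise ValueError(f"{data_type} is not listed for strategy {strategy}")
        cmd += ["-d", data_type]
    if out_dir is not None:
        cmd += ["-o", out_dir]
    return cmd


cmd = benchmark_command("native", "./gemm_tuner", "./gemm_shapes.csv",
                        "./gemm_configs_native.csv", out_dir="results/native")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to `subprocess.run` on the device, or printed and pasted into a shell as shown.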