# Gemm Tuner

## Introduction

This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each
of which implements a different **strategy** for the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
The details of these strategies can be found in the documentation of the corresponding kernels:
**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.

The Tuner consists of 2 scripts and 3 binaries:
* cl_gemm_benchmark and GemmTuner.py under examples/gemm_tuner, and
* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
  build/tests/gemm_tuner (you'll need to build the library first)

The input to the Tuner is a list of 4-valued tuples, each of which we call a **GEMM shape** or **GEMMParam** (M, N, K, B, and possibly
data type). They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
```
LHS x RHS = DST
```
Here LHS has shape MxK, RHS has shape KxN, DST has shape MxN, and B is the batch size.
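
As a rough illustration of how M, N, K and B map onto the three matrices, here is a minimal Python sketch. The `GEMMParam` tuple and the `matrix_shapes` helper are hypothetical names introduced for this example only; they are not taken from GemmTuner.py.

```python
from typing import Dict, NamedTuple, Tuple, Union


class GEMMParam(NamedTuple):
    """Hypothetical container for one GEMM shape (illustration only)."""
    M: int  # rows of LHS and DST
    N: int  # columns of RHS and DST
    K: int  # columns of LHS / rows of RHS
    B: int  # batch size


def matrix_shapes(p: GEMMParam) -> Dict[str, Union[Tuple[int, int], int]]:
    """Return the (rows, cols) shape of each matrix in LHS x RHS = DST."""
    return {
        "LHS": (p.M, p.K),
        "RHS": (p.K, p.N),
        "DST": (p.M, p.N),
        "batches": p.B,
    }


shapes = matrix_shapes(GEMMParam(M=100, N=100, K=30, B=3))
print(shapes)
```
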

The outputs of the tuning process are 4 json files:
1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam

These 4 files are the current representations we use for what we call the **heuristics** of a GEMM op: given a GEMMParam,
which kernel, and subsequently which configurations for that kernel, are the most performant.
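
A quick way to confirm that a tuning run produced all 4 files is sketched below. The file names come from the list above; the `missing_outputs` helper and the *heuristics* directory name are assumptions made for this example.

```python
from pathlib import Path

# File names as listed in this README.
EXPECTED_OUTPUTS = (
    "gemm_type_selection.json",
    "gemm_config_native.json",
    "gemm_config_reshapedonlyrhs.json",
    "gemm_config_reshaped.json",
)


def missing_outputs(out_dir):
    """Return the expected heuristics files that are absent from out_dir."""
    out = Path(out_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]


# "heuristics" is an assumed output directory name; an empty list means all files exist.
print(missing_outputs("heuristics"))
```
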

## Step-by-step example

### Step1: Prepare the shape and configs files
1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
    some prior heuristics, but can be provided by the Compute Library developers upon request, based on your target device).

   Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.

   Please refer to the Prerequisite section for more details.

### Step2: Push relevant files to the target device
All the files that need to be present on the target device are:
* benchmark script: \<ComputeLibrary\>/examples/gemm_tuner/cl_gemm_benchmark
* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
* Example benchmark binaries: \<ComputeLibrary\>/build/tests/gemm_tuner/benchmark_cl_gemm*

### Step3: Collect benchmark data
With these files on the device, we can collect benchmark data using the script. Assume all the example binaries are pushed
to a folder called *gemm_tuner*. While logged onto our device:
```
# Native
./cl_gemm_benchmark -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
# Reshaped Only RHS
./cl_gemm_benchmark -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
# Reshaped
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
```
You can repeat the 3 commands above to have a bit of redundancy in your benchmark data (as you can imagine, measurement is noisy),
but you may need to change the output folder for each repeat.

It is also possible to split the benchmark phase among different platforms using the **-i** and **-n** options to specify the starting experiment and the number of benchmarks to run.
```
# Reshaped benchmark on 3 different platforms
## Platform 1
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 0 -n 8
## Platform 2
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 8 -n 8
## Platform 3
./cl_gemm_benchmark -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped -i 16 -n 8
```
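
The **-i** / **-n** pairs for such a split can be computed rather than written by hand. The sketch below is one possible way to do it, assuming the total number of experiments is known in advance and that **-i** is a zero-based index, as the commands above suggest. `split_experiments` is a hypothetical helper, not part of the Tuner, and the total of 24 is chosen only to reproduce the three commands above.

```python
def split_experiments(total, platforms):
    """Return (start_index, count) pairs that together cover `total` experiments."""
    if total < 0 or platforms < 1:
        raise ValueError("total must be >= 0 and platforms must be >= 1")
    base, extra = divmod(total, platforms)
    ranges = []
    start = 0
    for p in range(platforms):
        # The first `extra` platforms take one additional experiment each.
        count = base + (1 if p < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges


# 24 experiments over 3 platforms (assumed total, for illustration).
for platform, (start, count) in enumerate(split_experiments(24, 3), 1):
    print(f"Platform {platform}: -i {start} -n {count}")
```

When the total does not divide evenly, the earlier platforms receive one extra experiment each, so every experiment is still covered exactly once.
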

### Step4: Generate the heuristics
1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine.
2. We use the GemmTuner.py script to give us the heuristics:
   ```
   python3 <ComputeLibrary>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
   ```
   When it's finished, there should be 4 json files in the *heuristics* folder.

One thing to note is that the config heuristics might give more than one recommendation for each GEMMParam, because
we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by
passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script.
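
To see why a lower tolerance yields fewer recommendations, consider the minimal sketch below. It illustrates the general idea only and is not the comparison logic used by GemmTuner.py; the `configs_within_tolerance` helper and the timings are made up for the example.

```python
def configs_within_tolerance(timings_ms, tolerance_ms):
    """Keep every config whose time is within tolerance_ms of the fastest one."""
    best = min(timings_ms.values())
    kept = [(cfg, t) for cfg, t in timings_ms.items() if t - best <= tolerance_ms]
    # Fastest first.
    return [cfg for cfg, _ in sorted(kept, key=lambda item: item[1])]


# Made-up timings in milliseconds, keyed by a native-style (m0, n0, k0) config.
timings = {(4, 4, 4): 1.00, (2, 8, 4): 1.04, (8, 4, 8): 1.30}

loose = configs_within_tolerance(timings, 0.1)    # two configs count as equally good
strict = configs_within_tolerance(timings, 0.01)  # only the fastest config remains
print(loose)
print(strict)
```
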

## Prerequisite
* A target device to be tuned, plus the following on the device:
    * Android or Linux OS
    * Bash shell
    * Built Compute Library with benchmark example binaries
    * cl_gemm_benchmark script
    * gemm shape file

       A csv file containing the **GEMMParam search list**. This is the list of GEMMParams/gemm shapes that we're
       interested in (for more details, see the Approach section). The default list is prepared by Compute Library developers in advance
       and can be provided on request.

       The format is described as:

       A headerless csv file with fields separated by commas.

       A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
       RHS) with:

       M - Number of lhs matrix rows
       N - Number of rhs matrix columns
       K - Number of lhs matrix columns/rhs matrix rows
       B - Batch size

       An example gemm shape file looks like:
       ```
       100,100,30,1
       100,100,30,3
       ...
       ```
    * gemm config file

      A csv file containing the **GEMMConfig search list**. This is the list of candidate GEMMConfigs among which we
      search for the optimal one. **Note that we have a different list for each strategy.**
      The default lists are prepared by Compute Library developers in advance and can be provided on request.

      The format of the file for each strategy is the same:

      A headerless csv file with fields separated by commas.

      However the fields of GEMMConfig differ for each strategy:

      * Strategy **native**:
        A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:

        m0 - Number of rows processed by the matrix multiplication
        n0 - Number of columns processed by the matrix multiplication
        k0 - Number of partial accumulations performed by the matrix multiplication

        Only the following configurations of M0, N0 and K0 are currently supported:

        M0 = 1, 2, 3, 4, 5, 6, 7, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16

        An example gemm config file looks like:
        ```
        1,4,4
        2,3,8
        ...
        ```
      * Strategy **reshaped_rhs_only**:
        A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values:

        m0 - Number of rows processed by the matrix multiplication
        n0 - Number of columns processed by the matrix multiplication
        k0 - Number of partial accumulations performed by the matrix multiplication
        h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
        interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
        transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
        export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                for more details

        Only the following configurations of M0, N0 and K0 are currently supported:

        M0 = 1, 2, 3, 4, 5, 6, 7, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16
        H0 >= 1

        An example gemm config file looks like:
        ```
        4,4,4,1,1,1,0
        4,4,4,3,1,0,1
        ...
        ```
      * Strategy **reshaped**:
        A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values:

        m0 - Number of rows processed by the matrix multiplication
        n0 - Number of columns processed by the matrix multiplication
        k0 - Number of partial accumulations performed by the matrix multiplication
        v0 - Number of vertical blocks of size (m0xk0) stored on the same output row
        h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
        interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
        interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
        transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
        export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                for more details

        If rhs matrix is transposed only the following configurations are currently supported:

        M0 = 2, 3, 4, 5, 6, 7, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16
        V0 >= 1
        H0 >= 1

        If lhs matrix is transposed only the following configurations are currently supported:

        M0 = 2, 3, 4, 8
        N0 = 2, 3, 4, 8, 16
        K0 = 2, 3, 4, 8, 16
        V0 >= 1
        H0 >= 1

        An example gemm config file looks like:
        ```
        4,4,4,1,3,1,1,1,0
        4,4,4,3,3,1,1,0,1
        ...
        ```
* A host machine, plus these on the machine:
    * python >= 3.6
    * GemmTuner.py script
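
The csv formats described in this section can be sanity-checked on the host before the files are pushed to the device. The sketch below parses a headerless gemm shape file and checks a **native** config against the supported M0, N0 and K0 values listed above. `parse_shapes` and `is_supported_native` are hypothetical helpers written for this example; the Tuner scripts may validate their inputs differently.

```python
import csv
import io

# Supported values for the native strategy, as listed in this README.
NATIVE_M0 = set(range(1, 9))  # 1..8
NATIVE_N0 = {2, 3, 4, 8, 16}
NATIVE_K0 = {2, 3, 4, 8, 16}


def parse_shapes(text):
    """Parse headerless csv rows of M,N,K,B into tuples of positive integers."""
    shapes = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        values = tuple(int(field) for field in row)
        if len(values) != 4 or any(v <= 0 for v in values):
            raise ValueError(f"bad gemm shape: {row}")
        shapes.append(values)
    return shapes


def is_supported_native(m0, n0, k0):
    """Check a native GEMMConfig against the supported M0/N0/K0 values."""
    return m0 in NATIVE_M0 and n0 in NATIVE_N0 and k0 in NATIVE_K0


print(parse_shapes("100,100,30,1\n100,100,30,3\n"))
print(is_supported_native(1, 4, 4), is_supported_native(9, 4, 4))
```
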

## Usage
The usage of the 2 scripts:

1. cl_gemm_benchmark

   Run the shell script (**cl_gemm_benchmark**) on your **target device**. Note that all the built benchmark
   examples: build/tests/gemm_tuner/benchmark_cl_gemm*, have to be present on your target device prior to running.
   The benchmark results will be saved to json files in an output directory.
   ```
   Usage: cl_gemm_benchmark [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file>
   -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]

   Options:
           -h
           Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
           strategy. Otherwise if no strategy is specified, display messages for all available strategies.

           -s <strategy>
           Strategy option.
           Options: ${ALL_STRATEGY_OPTIONS[@]}.

           -e <example_binary_dir>
           Path to directory that holds all example binaries

           -g <gemm_shape_file>
           Path to gemm shape csv file

           -c <gemm_config_file>
           Path to gemm config csv file

           -d <data_type>
           Data type option with which to run benchmark examples
           Default: ${DEFAULT_DATA_TYPE}
           Supported options:
           Strategy            :    Data Types
           Native              :    F32
           Reshaped            :    F16, F32
           Reshaped RHS Only   :    F16, F32

           -o <out_dir>
           Path to output directory that holds output json files
           Default: ${DEFAULT_OUT_DIR}
   ```
2. GemmTuner.py

   Run the python script (**GemmTuner.py**) on your **host machine**.
   You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
   beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files.
   ```
   Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]

   CL GEMM Tuner
   optional arguments:
     -h, --help            show this help message and exit
     -b PATH, --benchmark_results PATH
                           Path to benchmark result directory, where benchmark
                           result json files have a file extension of
                           'gemmtuner_benchmark'
     -o PATH, --output_dir PATH
                           Path to directory that holds output json files.
     -t TOLERANCE, --tolerance TOLERANCE
                           For testing if two GEMMConfigs are equivalent in terms
                           of performance. The tolerance is OpenCL timer in
                           milliseconds. Recommended value: <= 0.1 ms
     -D, --debug           Enable script debugging output

   ```