<!---- Name is a WIP - this reflects better what it can do today ----->
# Building and Running ExecuTorch with ARM Ethos-U Backend

<!----This will show a grid card on the page----->
::::{grid} 2

:::{grid-item-card}  Tutorials we recommend you complete before this:
:class-card: card-prerequisites
* [Introduction to ExecuTorch](./intro-how-it-works.md)
* [Setting up ExecuTorch](./getting-started-setup.md)
* [Building ExecuTorch with CMake](./runtime-build-and-cross-compilation.md)
:::

:::{grid-item-card}  What you will learn in this tutorial:
:class-card: card-prerequisites
In this tutorial you will learn how to export a simple PyTorch model for the ExecuTorch Arm Ethos-U backend delegate and run it on Corstone FVP simulators.
:::

::::

```{warning}
This ExecuTorch backend delegate is under active development. You may encounter some rough edges and features which may be documented or planned but not implemented.
```

```{tip}
If you are already familiar with this delegate, you may want to jump directly to the examples source dir - [https://github.com/pytorch/executorch/tree/main/examples/arm](https://github.com/pytorch/executorch/tree/main/examples/arm)
```

## Prerequisites

Let's make sure you have everything you need before we get started.

### Hardware

To successfully complete this tutorial, you will need a Linux-based host machine with an Arm aarch64 or x86_64 processor architecture.

The target device will be an embedded platform with an Arm Cortex-M CPU and an Ethos-U NPU (ML processor). This tutorial will show you how to run PyTorch models on both.

We will be using a [Fixed Virtual Platform (FVP)](https://www.arm.com/products/development-tools/simulation/fixed-virtual-platforms), simulating the [Corstone-300](https://developer.arm.com/Processors/Corstone-300) (cs300) and [Corstone-320](https://developer.arm.com/Processors/Corstone-320) (cs320) systems. Since we will be using the FVP (think of it as virtual hardware), we won't require any real embedded hardware for this tutorial.

### Software

First, you will need to install ExecuTorch. Please follow the recommended tutorials if you haven't already, to set up a working ExecuTorch development environment.

To generate software which can be run on an embedded platform (real or virtual), we will need a toolchain for cross-compilation and an Arm Ethos-U software development kit, including the Vela compiler for Ethos-U NPUs.

In the following sections we will walk through the steps to download each of the dependencies listed above.

## Set Up the Developer Environment

In this section, we will do a one-time setup, like downloading and installing necessary software, for the platform support files needed to run ExecuTorch programs in this tutorial.

For that we will use the `examples/arm/setup.sh` script to pull each item in an automated fashion. It is recommended to run the script in a conda environment. Upon successful execution, you can directly go to [the next step](#convert-the-pytorch-model-to-the-pte-file).
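
A typical invocation looks like the sketch below, run from the ExecuTorch repository root. The `--i-agree-to-the-contained-eula` flag and the optional scratch-dir argument reflect the script's interface at the time of writing, but check the script's usage message in your checkout, as flags may differ between versions:

```bash
# Accept the FVP EULA and install everything into a scratch directory
# (the scratch-dir argument and its name here are assumptions).
./examples/arm/setup.sh --i-agree-to-the-contained-eula arm-scratch
```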

As mentioned before, we currently support only Linux-based platforms with x86_64 or aarch64 processor architecture. Let’s make sure we are indeed on a supported platform.

```bash
uname -s
# Linux

uname -m
# x86_64 or aarch64
```

Next we will walk through the steps performed by the `setup.sh` script to better understand the development setup.

### Download and Set Up the Corstone-300 and Corstone-320 FVPs

Fixed Virtual Platforms (FVPs) are pre-configured, functionally accurate simulations of popular system configurations. In this tutorial, we are interested in the Corstone-300 and Corstone-320 systems. We can download these from the Arm website.

```{note}
 By downloading and running the FVP software, you will be agreeing to the FVP [End-user license agreement (EULA)](https://developer.arm.com/downloads/-/arm-ecosystem-fvps/eula).
```

You can either download the `Corstone-300 Ecosystem FVP` and `Corstone-320 Ecosystem FVP` from [here](https://developer.arm.com/downloads/-/arm-ecosystem-fvps), or let the `setup.sh` script do it for you via its `setup_fvp` function.
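
If you prefer the manual route, the steps look roughly like the sketch below. The archive name matches the directory listing shown later in this section, and the installer flags are assumptions based on the standard Arm FVP installers; adjust them to match what you actually downloaded:

```bash
# Unpack the downloaded Corstone-320 Ecosystem FVP archive
mkdir -p FVP-corstone320
tar -xzf FVP_cs320.tgz -C FVP-corstone320

# Run the bundled installer non-interactively, accepting the EULA
./FVP-corstone320/FVP_Corstone_SSE-320.sh \
    --i-agree-to-the-contained-eula \
    --no-interactive \
    -d FVP-corstone320
```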

### Download and Install the Arm GNU AArch32 Bare-Metal Toolchain

Similar to the FVP, we also need a toolchain to cross-compile the ExecuTorch runtime, the executor-runner bare-metal application, and the rest of the bare-metal stack for the Cortex-M55/M85 CPUs available on the Corstone-300/Corstone-320 platforms.

These toolchains are available [here](https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads). We will be using GCC 12.3 targeting `arm-none-eabi` for this tutorial. Just like with the FVP, the `setup.sh` script will download the toolchain for you. See the `setup_toolchain` function.

### Set Up the Arm Ethos-U Software Development

The `ethos-u` git repository is the root directory for all Arm Ethos-U software. It helps us download the required repositories and place them in a tree structure. See the `setup_ethos_u` function of the setup script for more details.
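
A condensed sketch of what this step does is shown below. The repository URL and the `fetch_externals.py` invocation are assumptions based on the Arm ML Platform hosting and the file layout shown later; the `setup_ethos_u` function remains the authoritative reference:

```bash
# Clone the root Ethos-U repository (URL assumed from mlplatform.org hosting)
git clone https://review.mlplatform.org/ml/ethos-u/ethos-u
cd ethos-u

# Fetch the dependent repositories (core_platform, core_software, ...)
# into the tree structure shown in the directory listing below.
python3 fetch_externals.py fetch
```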

Once this is done, you should have a working FVP simulator, a functioning toolchain for cross-compilation, and the Ethos-U software development setup ready for bare-metal development.

### Install the Vela Compiler
Once this is done, the script will finish the setup by installing the Vela compiler for you; details are in the `setup_vela` function.
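
For reference, Vela is also published on PyPI as `ethos-u-vela`, so a minimal manual install and a standalone compile of a quantized TFLite model might look like this (the model file name is a placeholder; `setup_vela` builds Vela from source instead):

```bash
# Install the Vela compiler from PyPI
pip install ethos-u-vela

# Compile a quantized TFLite model for a 128-MAC Ethos-U55 configuration
vela --accelerator-config ethos-u55-128 model.tflite
```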

### Install the TOSA reference model
This is the last step of the setup process: using the `setup_tosa_reference_model` function, the `setup.sh` script will install the TOSA reference model for you.
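
Under the hood this amounts to cloning and building the reference model from the TOSA project; a rough sketch, with the URL assumed from the mlplatform.org hosting, would be:

```bash
# Clone the TOSA reference model and its submodules (URL assumed)
git clone https://git.mlplatform.org/tosa/reference_model.git
cd reference_model
git submodule update --init --recursive
```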

At the end of the setup, if everything goes well, your top-level development directory might look something like this:

```bash
.
├── arm-gnu-toolchain-12.3.rel1-x86_64-arm-none-eabi # for x86-64 hosts
├── ethos-u
│   ├── core_platform
│   ├── core_software
│   ├── fetch_externals.py
│   └── [...]
├── ethos-u-vela
├── FVP-corstone300
│   ├── FVP_Corstone_SSE-300.sh
│   └── [...]
├── FVP-corstone320
│   ├── FVP_Corstone_SSE-320.sh
│   └── [...]
├── FVP_cs300.tgz
├── FVP_cs320.tgz
├── gcc.tar.xz
└── reference_model
```

## Convert the PyTorch Model to the `.pte` File

`.pte` is a binary file produced by the ExecuTorch Ahead-of-Time (AoT) pipeline: it takes in a PyTorch model (a `torch.nn.Module`), exports it, runs a variety of passes, and finally serializes it to the `.pte` file format. This binary file is typically consumed by the ExecuTorch Runtime. This [document](https://github.com/pytorch/executorch/blob/main/docs/source/getting-started-architecture.md) goes into much more depth about the ExecuTorch software stack for both AoT and Runtime.

In this section, we will primarily focus on the AoT flow, with the end goal of producing a `.pte` file. There is a set of export configurations to target different backends at runtime, and for each the AoT flow will produce a unique `.pte` file. We will explore a couple of different configurations producing different `.pte` files, particularly interesting for our Corstone-300 system and its available processing elements.
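
To make the AoT flow concrete, here is a minimal sketch of the non-delegated pipeline using the core export APIs (the output file name is a placeholder, and exact entry points can shift between ExecuTorch versions; the `aot_arm_compiler` script used later wraps these same steps with Arm-specific options):

```python
import torch
from executorch.exir import to_edge

# Any torch.nn.Module works here; SoftmaxModule is defined below.
model = SoftmaxModule().eval()
example_inputs = (torch.ones(2, 2),)

# 1. Export the eager module to a graph representation.
exported = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect (backend delegation would hook in here).
edge = to_edge(exported)

# 3. Serialize to the .pte format consumed by the ExecuTorch runtime.
executorch_program = edge.to_executorch()
with open("softmax.pte", "wb") as f:
    f.write(executorch_program.buffer)
```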

Before we get started, let's first talk about the PyTorch modules we will be using.

### PyTorch Example Modules
We will use a couple of simple PyTorch modules to explore the end-to-end flow. We will use them in various ways throughout the tutorial, referring to them by their `<class_name>`.

#### SoftmaxModule
This is a very simple PyTorch module with just one [Softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html#torch.nn.Softmax) operator.

```python
import torch

class SoftmaxModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        z = self.softmax(x)
        return z
```

Running it in the Python environment (on the same Linux development machine), we get the expected output.

```python
>>> m = SoftmaxModule()
>>> m(torch.ones(2,2))
tensor([[0.5000, 0.5000],
        [0.5000, 0.5000]])
```

#### AddModule
Let's write another simple PyTorch module with just one [Add](https://pytorch.org/docs/stable/generated/torch.add.html#torch.add) operator.

```python
class AddModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + x
```

Running it in the Python environment (on the same Linux development machine), as expected, 1 + 1 indeed produces 2.

```python
>>> m = AddModule()
>>> m(torch.ones(5, dtype=torch.int32)) # integer types for non-quantized Ethos-U delegation
tensor([2, 2, 2, 2, 2], dtype=torch.int32)
```
Keep the inputs and outputs of these modules in mind. When we lower and run them through alternate means, as opposed to running them on this Linux machine, we will use the same inputs and expect the outputs to match the ones shown here.

```{tip}
We need to be aware of data types when running networks on the Ethos-U55, as it is an integer-only processor. For this example we use integer types explicitly; in typical use of such a flow, networks are built and trained in floating point and then quantized from floating point to integer for efficient inference.
```

#### MobileNetV2 Module
[MobileNetV2](https://arxiv.org/abs/1801.04381) is a network commonly used in production on edge and mobile devices.
It's also available as a default model in [torchvision](https://github.com/pytorch/vision), so we can load it with the sample code below.
```python
from torchvision.models import mobilenet_v2  # @manual
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights

mv2 = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
```
For more details, you can refer to the code snippet [here](https://github.com/pytorch/executorch/blob/2354945d47f67f60d9a118ea1a08eef8ba2364b5/examples/models/mobilenet_v2/model.py#L18).

### Non-delegated Workflow

In the ExecuTorch AoT pipeline, one of the options is to select a backend. ExecuTorch offers a variety of different backends. Selecting a backend is optional; it is typically done to target a particular mode of acceleration or hardware matching a given model's compute requirements. Without any backend, the ExecuTorch runtime will fall back to a highly portable set of operators available by default.

On platforms with dedicated acceleration like the Ethos-U55, the non-delegated flow is expected to be used in two primary cases:
1. When the network is designed to be very small and best suited to run on the Cortex-M alone.
2. When the network has a mix of operations that can target the NPU and those that can't, e.g. the Ethos-U55 supports integer operations only, so floating-point softmax will fall back to execute on the CPU.

In this flow, to illustrate the portability of the ExecuTorch runtime as well as of the operator library, we will skip specifying the backend during `.pte` generation.

The following script will serve as a helper utility to generate the `.pte` file. It is available in the `examples/arm` directory.

```bash
python3 -m examples.arm.aot_arm_compiler --model_name="softmax"
# This should produce ./softmax.pte
```

### Delegated Workflow

Working with Arm, we introduced a new Arm backend delegate for ExecuTorch. This backend is under active development and has a limited set of features available as of this writing.

By including the following step in the ExecuTorch AoT export pipeline when generating the `.pte` file, we can enable this backend delegate.

```python
from executorch.backends.arm.arm_backend import generate_ethosu_compile_spec
from executorch.backends.arm.arm_partitioner import ArmPartitioner
from executorch.exir.backend.backend_api import to_backend

graph_module_edge.exported_program = to_backend(
    model.exported_program,
    ArmPartitioner(generate_ethosu_compile_spec("ethos-u55-128")))
```

Similar to the non-delegated flow, the same script will serve as a helper utility to generate the `.pte` file. Notice the `--delegate` option to enable the `to_backend` call.

```bash
python3 -m examples.arm.aot_arm_compiler --model_name="add" --delegate
# should produce ./add_arm_delegate.pte
```

### Delegated Quantized Workflow
Before generating the `.pte` file for delegated quantized networks like MobileNetV2, we need to build the `quantized_ops_aot_lib`:

```bash
SITE_PACKAGES="$(python3 -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
CMAKE_PREFIX_PATH="${SITE_PACKAGES}/torch"

cd <executorch_root_dir>
mkdir -p cmake-out-aot-lib
cmake -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_XNNPACK=OFF \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED_AOT=ON \
    -DCMAKE_PREFIX_PATH="$CMAKE_PREFIX_PATH" \
    -DPYTHON_EXECUTABLE=python3 \
    -Bcmake-out-aot-lib \
    .

cmake --build cmake-out-aot-lib --parallel -- quantized_ops_aot_lib
```

After the `quantized_ops_aot_lib` build, we can run the following script to generate the `.pte` file:
```bash
python3 -m examples.arm.aot_arm_compiler --model_name="mv2" --delegate --quantize --so_library="$(find cmake-out-aot-lib -name libquantized_ops_aot_lib.so)"
# should produce ./mv2_arm_delegate.pte
```
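
For context, the `--quantize` option drives a post-training quantization flow inside `aot_arm_compiler`. A condensed sketch of that flow is shown below; the exact entry points vary between ExecuTorch and PyTorch versions, so treat these names as assumptions and refer to the script itself for the authoritative sequence:

```python
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

from executorch.backends.arm.quantizer.arm_quantizer import (
    ArmQuantizer,
    get_symmetric_quantization_config,
)

# Configure the Arm quantizer for symmetric integer quantization.
quantizer = ArmQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

# Export, annotate, calibrate with representative inputs, then convert.
graph = capture_pre_autograd_graph(model, example_inputs)
prepared = prepare_pt2e(graph, quantizer)
prepared(*example_inputs)  # calibration pass
quantized_model = convert_pt2e(prepared)
```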

<br />

At the end of this, we should have three different `.pte` files.

- The first one contains the [SoftmaxModule](#softmaxmodule), without any backend delegates.
- The second one contains the [AddModule](#addmodule), with the Arm Ethos-U backend delegate enabled.
- The third one contains the [quantized MobileNetV2 model](#mobilenetv2-module), with the Arm Ethos-U backend delegate enabled as well.

Now let's try to run these `.pte` files on the Corstone-300 and Corstone-320 platforms in a bare-metal environment.

## Getting a Bare-Metal Executable

In this section, we will go over the steps you need to follow to build the runtime application, which will then run on the target device. In the ExecuTorch repository we have a functioning script which performs exactly these steps. It is located at `executorch/examples/arm/run.sh`. We will use it to build the necessary pieces and finally run the previously generated `.pte` file on an FVP.

Before we get started, also make sure that you have completed the ExecuTorch CMake build setup, and followed the instructions to set up the development environment described [earlier](#set-up-the-developer-environment).

The block diagram below demonstrates, at a high level, how the various build artifacts are generated and linked together to produce the final bare-metal executable.

![](./arm-delegate-runtime-build.svg)

```{tip}
The `generate_pte_file` function in the `run.sh` script produces the `.pte` files based on the models provided through the `--model_name` input argument.
```

### Generating ExecuTorch Libraries

ExecuTorch's CMake build system produces a set of build pieces which are critical for including and running the ExecuTorch runtime within the bare-metal environment we have for the Corstone FVPs from the Ethos-U SDK.

[This](./runtime-build-and-cross-compilation.md) document provides a detailed overview of each individual build piece. For running either variant of the `.pte` file, we will need a core set of libraries. Here is the list:

- `libexecutorch.a`
- `libportable_kernels.a`
- `libportable_ops_lib.a`

To run a `.pte` file with Arm backend delegate call instructions, we will also need the Arm backend delegate runtime library, that is,

- `libexecutorch_delegate_ethos_u.a`

These libraries are generated in the `build_executorch` and `build_quantization_aot_lib` functions of the `run.sh` script.

Here, `EXECUTORCH_SELECT_OPS_LIST` decides which portable operators are included in the build and thus available at runtime. It must match the `.pte` file's requirements; otherwise you will get a `Missing Operator` error at runtime.

For example, to run the [SoftmaxModule](#softmaxmodule), we only include the softmax CPU operator. Similarly, to run the AddModule in a non-delegated manner, you will need the add operator, and so on. As you might have already realized, for the delegated operators, which will be executed by the Arm backend delegate, we do not need to include them in this list. This applies only to *non-delegated* operators.

```{tip}
The `run.sh` script takes a `--portable_kernels` option, which provides a way to supply a comma-separated list of portable kernels to be included.
```
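
For instance, a hypothetical invocation selecting only the softmax kernel for the non-delegated SoftmaxModule build might look like this (the kernel name must match the operator registered in the portable library; verify it against your checkout):

```bash
# Build and run the softmax model with only the softmax portable kernel included
./examples/arm/run.sh --model_name=softmax --portable_kernels="aten::_softmax.out"
```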

### Building the executor_runner Bare-Metal Application

The SDK directory is the same one prepared [earlier](#set-up-the-arm-ethos-u-software-development). Here, we will be passing the `.pte` file (any one of them) generated above.

Note that you have to generate a new `executor-runner` binary if you want to change the model or the `.pte` file. This constraint comes from the constrained bare-metal runtime environment we have for the Corstone-300/Corstone-320 platforms.

This is performed by the `build_executorch_runner` function in `run.sh`.

```{tip}
The `run.sh` script takes a `--target` option, which provides a way to select a specific target: Corstone-300 (`ethos-u55-128`) or Corstone-320 (`ethos-u85-128`).
```
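
Putting the pieces together, an end-to-end build-and-run invocation for one model on a chosen target might look like the line below (option spellings are assumptions; see `run.sh` for the accepted values):

```bash
# Build the AddModule .pte, the libraries, and the runner, then run on Corstone-320
./examples/arm/run.sh --model_name=add --target=ethos-u85-128
```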

## Running on Corstone FVP Platforms

Once the ELF is prepared, the same invocation works regardless of which `.pte` file variant was used to generate the bare-metal ELF. The command below is used to run the [MobileNetV2 model](#mobilenetv2-module) on the Corstone-320 FVP:

```bash
ethos_u_build_dir=examples/arm/executor_runner/

elf=$(find ${ethos_u_build_dir} -name "arm_executor_runner")

num_macs=128  # number of NPU MACs; must match the target, e.g. 128 for ethos-u85-128

FVP_Corstone_SSE-320_Ethos-U85                          \
    -C mps4_board.subsystem.cpu0.CFGITCMSZ=11           \
    -C mps4_board.subsystem.ethosu.num_macs=${num_macs} \
    -C mps4_board.visualisation.disable-visualisation=1 \
    -C vis_hdlcd.disable_visualisation=1                \
    -C mps4_board.telnetterminal0.start_telnet=0        \
    -C mps4_board.uart0.out_file='-'                    \
    -C mps4_board.uart0.shutdown_on_eot=1               \
    -a "${elf}"                                         \
    --timelimit 120 || true # seconds, after which the simulation will kill itself
```
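
For the Corstone-300 FVP, the board parameters differ; a sketch of the equivalent invocation (parameter names are assumptions based on the Corstone-300 model's `mps3_board` naming) would be:

```bash
FVP_Corstone_SSE-300_Ethos-U55                          \
    -C ethosu.num_macs=${num_macs}                      \
    -C mps3_board.visualisation.disable-visualisation=1 \
    -C mps3_board.telnetterminal0.start_telnet=0        \
    -C mps3_board.uart0.out_file='-'                    \
    -C mps3_board.uart0.shutdown_on_eot=1               \
    -a "${elf}"                                         \
    --timelimit 120 || true
```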

If successful, the simulator should produce something like the following on the shell:

```console
I [executorch:arm_executor_runner.cpp:364] Model in 0x70000000 $
I [executorch:arm_executor_runner.cpp:366] Model PTE file loaded. Size: 4425968 bytes.
I [executorch:arm_executor_runner.cpp:376] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:384] Running method forward
I [executorch:arm_executor_runner.cpp:395] Setup Method allocator pool. Size: 62914560 bytes.
I [executorch:arm_executor_runner.cpp:412] Setting up planned buffer 0, size 752640.
I [executorch:ArmBackendEthosU.cpp:79] ArmBackend::init 0x70000070
I [executorch:arm_executor_runner.cpp:445] Method loaded.
I [executorch:arm_executor_runner.cpp:447] Preparing inputs...
I [executorch:arm_executor_runner.cpp:461] Input prepared.
I [executorch:arm_executor_runner.cpp:463] Starting the model execution...
I [executorch:ArmBackendEthosU.cpp:118] ArmBackend::execute 0x70000070
I [executorch:ArmBackendEthosU.cpp:298] Tensor input/output 0 will be permuted
I [executorch:arm_perf_monitor.cpp:120] NPU Inferences : 1
I [executorch:arm_perf_monitor.cpp:121] Profiler report, CPU cycles per operator:
I [executorch:arm_perf_monitor.cpp:125] ethos-u : cycle_cnt : 1498202 cycles
I [executorch:arm_perf_monitor.cpp:132] Operator(s) total: 1498202 CPU cycles
I [executorch:arm_perf_monitor.cpp:138] Inference runtime: 6925114 CPU cycles total
I [executorch:arm_perf_monitor.cpp:140] NOTE: CPU cycle values and ratio calculations require FPGA and identical CPU/NPU frequency
I [executorch:arm_perf_monitor.cpp:149] Inference CPU ratio: 99.99 %
I [executorch:arm_perf_monitor.cpp:153] Inference NPU ratio: 0.01 %
I [executorch:arm_perf_monitor.cpp:162] cpu_wait_for_npu_cntr : 729 CPU cycles
I [executorch:arm_perf_monitor.cpp:167] Ethos-U PMU report:
I [executorch:arm_perf_monitor.cpp:168] ethosu_pmu_cycle_cntr : 5920305
I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr0 : 359921
I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr1 : 0
I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr2 : 0
I [executorch:arm_perf_monitor.cpp:171] ethosu_pmu_cntr3 : 503
I [executorch:arm_perf_monitor.cpp:178] Ethos-U PMU Events:[ETHOSU_PMU_EXT0_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_EXT1_RD_DATA_BEAT_RECEIVED, ETHOSU_PMU_EXT0_WR_DATA_BEAT_WRITTEN, ETHOSU_PMU_NPU_IDLE]
I [executorch:arm_executor_runner.cpp:470] model_pte_loaded_size:     4425968 bytes.
I [executorch:arm_executor_runner.cpp:484] method_allocator_used:     1355722 / 62914560  free: 61558838 ( used: 2 % )
I [executorch:arm_executor_runner.cpp:491] method_allocator_planned:  752640 bytes
I [executorch:arm_executor_runner.cpp:493] method_allocator_loaded:   966 bytes
I [executorch:arm_executor_runner.cpp:494] method_allocator_input:    602116 bytes
I [executorch:arm_executor_runner.cpp:495] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:498] temp_allocator_used:       0 / 1048576 free: 1048576 ( used: 0 % )
I executorch:arm_executor_runner.cpp:152] Model executed successfully.
I executorch:arm_executor_runner.cpp:156] 1 outputs:
Output[0][0]: -0.749744
Output[0][1]: -0.019224
Output[0][2]: 0.134570
...(Skipped)
Output[0][996]: -0.230691
Output[0][997]: -0.634399
Output[0][998]: -0.115345
Output[0][999]: 1.576386
I executorch:arm_executor_runner.cpp:177] Program complete, exiting.
I executorch:arm_executor_runner.cpp:179]
```

```{note}
The `run.sh` script provides various options to select a particular FVP target, use desired models, and select portable kernels; these can be explored using the `--help` argument.
```

## Takeaways
Through this tutorial we've learned how to use the ExecuTorch software to both export a standard model from PyTorch and run it on the compact and fully functional ExecuTorch runtime, enabling a smooth path for offloading models from PyTorch to Arm-based platforms.

To recap, there are two major flows:
 * A direct flow, which offloads work onto the Cortex-M using libraries built into ExecuTorch.
 * A delegated flow, which partitions the graph into sections for the Cortex-M and sections which can be offloaded and accelerated on the Ethos-U hardware.

Both of these flows continue to evolve, enabling more use-cases and better performance.

## FAQs
<!----
Describe what common errors users may see and how to resolve them.

* TODO - Binary size and operator Selection
* TODO - Cross-compilation targeting baremetal
* TODO - Debugging on FVP
----->

If you encountered any bugs or issues following this tutorial please file a bug/issue here on [GitHub](https://github.com/pytorch/executorch/issues/new).