## Summary
LLaVA is the first multi-modal LLM ExecuTorch supports. In this directory, we
- Host a model definition for [LLaVA](https://github.com/haotian-liu/LLaVA).
- Demonstrate how to export the LLaVA multimodal model to an ExecuTorch `.pte` file.
- Provide a C++ runner and Android/iOS apps that load the `.pte` file, the tokenizer, and an image, then generate responses based on the user prompt.
- Discuss the optimizations that went into enabling LLaVA on a phone, along with early performance numbers.

The tokenizer, the image encoder, and the pretrained text model, which is based on Meta's
[Llama2-7b](https://llama.meta.com/llama2/), are loaded from the Llava
Hugging Face page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf).

<p align="center">
  <img src="./llava_via_xnnpack.gif" width=300>
  <br>
  <em>
  Running LLaVA 1.5 7B on an Android phone
  </em>
</p>

## What is LLaVA?

[LLaVA](https://llava-vl.github.io/) is a novel end-to-end trained large
multimodal model that combines a vision encoder and Vicuna (a Llama2-based text
model) for general-purpose visual and language understanding, achieving
impressive chat capabilities that mimic the spirit of cutting-edge multimodal
models while setting a high bar for accuracy on Science QA.

## Instructions

First you need to generate a `.pte` file for the model, along with an input
image and other artifacts. Then you need either the C++ runner or the Android
or iOS application to test things out on device.

### Generate ExecuTorch .PTE and other artifacts

Run the following command to generate `llava.pte`, `tokenizer.bin`, and an image
tensor (serialized in TorchScript) `image.pt`.

Prerequisite: run `install_requirements.sh` to install ExecuTorch, then run
`examples/models/llava/install_requirements.sh` to install the LLaVA-specific dependencies.

```bash
python -m executorch.examples.models.llava.export_llava --pte-name llava.pte --with-artifacts
```

Currently the whole export process takes about 6 minutes. We also provide a
small test utility to verify the correctness of the exported `.pte` file. Just run:

```bash
python -m executorch.examples.models.llava.test.test_pte llava.pte
```

### Build C++ Runner

See or run the `.ci/scripts/test_llava.sh` shell script to build the C++ runner. This
script also has preliminary support for building the C++ runner for Android.

There is also an image utility Python script that converts an image into a
PyTorch-loadable format (a rough sketch of the idea is shown below). Alternatively,
we are working on an image format that does not need PyTorch to load, the
motivation being to build the C++ runner on Android without a Torch dependency.

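As a very rough illustration of what "PyTorch-loadable" means here, the sketch
below converts a photo into a CHW `uint8` tensor and saves it with `torch.save`.
The exact resolution, layout, and serialization format expected by `llava_main`
are defined by the image utility script in this directory, so treat every detail
below (file names included) as an assumption:

```python
# prepare_image.py -- illustrative sketch only; see the image utility script in
# this directory for the format llava_main actually expects.
import numpy as np
import torch
from PIL import Image

img = Image.open("basketball.jpg").convert("RGB")        # any RGB photo (hypothetical file name)
arr = np.asarray(img).copy()                             # HWC, uint8
tensor = torch.from_numpy(arr).permute(2, 0, 1)          # CHW, uint8
torch.save(tensor, "image.pt")                           # assumed container format
```
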
Then you should be able to find the `llava_main` binary at:

```bash
cmake-out/examples/models/llava/llava_main
```

### Build Mobile Apps

#### Android

We can run LLaVA using the Llama Demo App. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo)
for full instructions on building the Android Llama Demo App.

#### iOS

We can run LLaVA using the Llama Demo App. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA)
for full instructions on building the iOS Llama Demo App.

### Running LLaVA

Run:
```bash
cmake-out/examples/models/llava/llava_main \
    --model_path=llava.pte \
    --tokenizer_path=tokenizer.bin \
    --image_path=image.pt \
    --prompt="ASSISTANT:" \
    --seq_len=768 \
    --temperature=0
```
(See `--help` for other options.)

For the example image `image.pt` generated above, you should get a response like
the following (tested on Arm CPUs with the ET XNNPACK delegate):

```
ASSISTANT: image captures a basketball game in progress, with several players on the court. ...
```

## Optimizations and Results

Since the LLaVA model needs at least 4-bit quantization to fit even on some
high-end phones, the results presented here correspond to a 4-bit groupwise
post-training quantized model.

In addition, the work is mainly focused on Arm CPUs and the ET XNNPACK delegate.

### Memory Footprint Reduction Techniques

With LLaVA, we needed to reduce the memory footprint to make it feasible to run
on edge devices. Out of the box, even with 4-bit quantized weights, the memory
footprint is around ~11 GiB, which is prohibitively large even for high-end
Android or iOS devices.

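For intuition on why the footprint is dominated by things other than weights, a
back-of-envelope estimate puts the 4-bit groupwise weights themselves at only
~3.5 GiB, so most of the remaining memory comes from intermediate tensors, the
KV cache, and other buffers targeted by the optimizations below. The numbers in
this estimate (roughly 7B parameters, group size 128, fp16 scale and zero-point
per group) are assumptions, not values taken from the export script:

```python
# Back-of-envelope weight memory for 4-bit groupwise quantization.
# Assumptions: ~7e9 parameters, group size 128, fp16 scale + zero-point per group.
params = 7e9
group_size = 128
weight_bytes = params * 4 / 8                     # 4 bits per weight
meta_bytes = params / group_size * (2 + 2)        # fp16 scale + fp16 zero-point
print(f"{(weight_bytes + meta_bytes) / 2**30:.2f} GiB")   # ~3.5 GiB
```
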
We applied several optimizations, which should already be enabled if you follow
this tutorial, to bring the memory footprint down to ~5 GiB, which unblocks us
to run on high-end devices.

#### Sharing intermediate memory across delegates

Sharing working memory across ET XNNPACK delegates helps reduce the peak memory
usage for LLMs with many DQLinears. For LLaVA, this reduced peak memory by 36.1%
(from 10.44 GiB to 6.67 GiB), a key step toward running it on phones.

#### Reducing maximum sequence length

To free up more memory, we examined non-constant memory usage, specifically
focusing on intermediate tensors used throughout the model during inference.
The majority of these were found in the KV-cache allocations. Based on a
“minimum we can get away with” heuristic, we reduced the maximum sequence length
to 768 from the previous default of 2048. This adjustment led to a further
memory reduction of approximately 1.23 GiB (from 6.67 GiB to 5.44 GiB).

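As a rough cross-check of that number, the KV cache grows linearly with the
maximum sequence length. Assuming the Llama2-7B text backbone (32 layers, hidden
size 4096) and an fp32 cache (an assumption about the cache dtype, not a
statement about this repo's exact configuration):

```python
# Rough KV-cache arithmetic: savings from shrinking max_seq_len 2048 -> 768.
layers, hidden, bytes_per_elem = 32, 4096, 4   # fp32 cache assumed

def kv_cache_bytes(seq_len: int) -> int:
    # The factor of 2 accounts for both keys and values.
    return 2 * layers * hidden * seq_len * bytes_per_elem

saving = kv_cache_bytes(2048) - kv_cache_bytes(768)
print(f"{saving / 2**30:.2f} GiB")   # ~1.25 GiB, close to the ~1.23 GiB above
```
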
#### Quantizing embedding weights to 8b

By quantizing the embedding layer to 8 bits, we were able to achieve an
additional memory footprint reduction of approximately 300 MiB, bringing the
total down to ~5 GiB.

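Conceptually, this is standard 8-bit weight-only quantization of the token
embedding table: store the weights as `int8` with one scale per row and
dequantize only the rows that are looked up. The sketch below is illustrative
only; the actual export uses ExecuTorch's quantization flow, and the vocab and
hidden sizes are assumed Llama2-style values:

```python
# Illustrative 8-bit (per-row) embedding quantization, not the repo's actual flow.
import torch

def quantize_embedding_8bit(weight: torch.Tensor):
    # weight: [vocab_size, hidden_dim], float32
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    qweight = torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)
    return qweight, scales

def embedding_lookup(qweight, scales, token_ids):
    # Dequantize only the rows being looked up.
    return qweight[token_ids].to(torch.float32) * scales[token_ids]

vocab, dim = 32000, 4096                      # assumed sizes
w = torch.randn(vocab, dim)
qw, s = quantize_embedding_8bit(w)
print(w.numel() * w.element_size() // 2**20, "MiB fp32")         # ~500 MiB
print((qw.numel() + s.numel() * 4) // 2**20, "MiB int8+scales")  # ~125 MiB
```
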
### Performance Optimizations

#### Decode performance

This was already heavily optimized through the KV-cache and GEMV kernel
optimization efforts for Llama2/3.

#### Encode performance

With large image-based prompts, this was the focus of performance optimizations
for LLaVA. We implemented two main optimizations, which together speed up image
prefill by more than 2x over the baseline (see the results table below).

* **Two XNNPACK Partitioners**

For text-only LLMs, our approach involved lowering only DQLinear ops
to XNNPACK and relying on ExecuTorch-optimized operators or custom ops
(utilizing Neon SIMD) for multiplication, addition, and other operations.
For LLaVA, lowering these remaining operations to XNNPACK as well, via a second
XNNPACK partition, significantly improves Time to First Token (TTFT); a rough
sketch of this flow is shown at the end of this section.

* **New Arm Neon i8mm GEMM kernels**

We introduced new kernels in XNNPACK for the quantization scheme used here,
upgrading our existing dot-product-based GEMM kernels to i8mm-based GEMM
kernels. The new kernels offer significantly improved performance by leveraging
the more efficient SMMLA instruction from Arm Neon. Note, however, that this
instruction is only available on newer Arm CPUs.

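For reference, here is a rough sketch of the two-partitioner lowering flow
mentioned above. Class names and the exact export API differ across ExecuTorch
versions, so treat this as an assumption-laden outline; `export_llava.py` in
this directory is the source of truth:

```python
# Illustrative two-partitioner lowering sketch; names/APIs may differ by version.
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackDynamicallyQuantizedPartitioner,  # assumed: targets DQLinear ops
    XnnpackPartitioner,                      # assumed: targets remaining fp32 ops
)

def lower_with_two_partitioners(model: torch.nn.Module, example_inputs):
    edge = to_edge(torch.export.export(model, example_inputs))
    # First partition: delegate the dynamically quantized linear layers.
    edge = edge.to_backend(XnnpackDynamicallyQuantizedPartitioner())
    # Second partition: delegate the remaining float ops (add, mul, ...),
    # which is what improves image prefill / TTFT.
    edge = edge.to_backend(XnnpackPartitioner())
    return edge.to_executorch()
```
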
### Results

Note that this is an active area of development in the ExecuTorch repository. You
will need PR [5380](https://github.com/pytorch/executorch/pull/5380) to supply
an image to the C++ runner on Android without a Torch dependency. It should be
merged soon.

With those caveats out of the way, here are some preliminary numbers (the average
of three runs) for LLaVA using the C++ runner on an Android OnePlus 12 device
with 12 GiB of memory.

| Experiment Setup | Prefill time in seconds | Decode tokens/second |
| :------------- | -------------: | -------------: |
| Baseline | 29.95 | 8.75 |
| + Two XNNPACK Partitioners | 17.82 | 8.93 |
| + New Arm Neon i8mm GEMM Kernels | 14.60 | 8.92 |

We appreciate your feedback. Please let us know if you run into any issues.