xref: /aosp_15_r20/external/executorch/examples/models/llava/README.md (revision 523fa7a60841cd1ecfb9cc4201f1ca8b03ed023a)
1*523fa7a6SAndroid Build Coastguard Worker## Summary
2*523fa7a6SAndroid Build Coastguard WorkerLLaVA is the first multi-modal LLM ExecuTorch supports. In this directory, we
3*523fa7a6SAndroid Build Coastguard Worker- Host a model definition for [LLavA](https://github.com/haotian-liu/LLaVA).
4*523fa7a6SAndroid Build Coastguard Worker- Demonstrate how to export LLavA multimodal model to generate ExecuTorch .PTE file.
5*523fa7a6SAndroid Build Coastguard Worker- Provide a C++ runner, Android/iOS Apps that loads the .pte file, the tokenizer and an image, then generate responses based on user prompt.
6*523fa7a6SAndroid Build Coastguard Worker- Discuss optimizations went into enabling LlaVA on a phone, and early performance numbers
7*523fa7a6SAndroid Build Coastguard Worker
8*523fa7a6SAndroid Build Coastguard WorkerTokenizer, image encoder, and the pretrained text model, which is based on Meta
9*523fa7a6SAndroid Build Coastguard Worker[Llama2-7b](https://llama.meta.com/llama2/), is loaded from Llava
10*523fa7a6SAndroid Build Coastguard Workerhuggingface page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) .
11*523fa7a6SAndroid Build Coastguard Worker
12*523fa7a6SAndroid Build Coastguard Worker
13*523fa7a6SAndroid Build Coastguard Worker<p align="center">
14*523fa7a6SAndroid Build Coastguard Worker      <img src="./llava_via_xnnpack.gif" width=300>
15*523fa7a6SAndroid Build Coastguard Worker      <br>
16*523fa7a6SAndroid Build Coastguard Worker      <em>
17*523fa7a6SAndroid Build Coastguard Worker      Running Llava1.5 7B on Android phone
18*523fa7a6SAndroid Build Coastguard Worker      </em>
19*523fa7a6SAndroid Build Coastguard Worker</p>
20*523fa7a6SAndroid Build Coastguard Worker
21*523fa7a6SAndroid Build Coastguard Worker## What is LLaVA?
22*523fa7a6SAndroid Build Coastguard Worker
23*523fa7a6SAndroid Build Coastguard Worker[LLaVA](https://llava-vl.github.io/) is a novel end-to-end trained large
24*523fa7a6SAndroid Build Coastguard Workermultimodal model that combines a vision encoder and Vicuna (a LLama2 based text
25*523fa7a6SAndroid Build Coastguard Workermodel) for general-purpose visual and language understanding, achieving
26*523fa7a6SAndroid Build Coastguard Workerimpressive chat capabilities mimicking spirits of the cutting edge multimodal
27*523fa7a6SAndroid Build Coastguard Workermodels and setting a high bar for accuracy on Science QA.
28*523fa7a6SAndroid Build Coastguard Worker
29*523fa7a6SAndroid Build Coastguard Worker## Instructions
30*523fa7a6SAndroid Build Coastguard Worker
31*523fa7a6SAndroid Build Coastguard WorkerFirst you need to generate a .PTE file for the model, along with input image,
32*523fa7a6SAndroid Build Coastguard Workerand other artifacts. Then you need either a C++ runner, or Android or iOS
33*523fa7a6SAndroid Build Coastguard Workerapplication to test things out on device.
34*523fa7a6SAndroid Build Coastguard Worker
35*523fa7a6SAndroid Build Coastguard Worker### Generate ExecuTorch .PTE and other artifacts
36*523fa7a6SAndroid Build Coastguard Worker
37*523fa7a6SAndroid Build Coastguard WorkerRun the following command to generate `llava.pte`, `tokenizer.bin` and an image
38*523fa7a6SAndroid Build Coastguard Workertensor (serialized in TorchScript) `image.pt`.
39*523fa7a6SAndroid Build Coastguard Worker
40*523fa7a6SAndroid Build Coastguard WorkerPrerequisite: run `install_requirements.sh` to install ExecuTorch and run
41*523fa7a6SAndroid Build Coastguard Worker`examples/models/llava/install_requirements.sh` to install dependencies.
42*523fa7a6SAndroid Build Coastguard Worker
43*523fa7a6SAndroid Build Coastguard Worker```bash
44*523fa7a6SAndroid Build Coastguard Workerpython -m executorch.examples.models.llava.export_llava --pte-name llava.pte --with-artifacts
45*523fa7a6SAndroid Build Coastguard Worker```
46*523fa7a6SAndroid Build Coastguard Worker
47*523fa7a6SAndroid Build Coastguard WorkerCurrently the whole export process takes about 6 minutes. We also provide a
48*523fa7a6SAndroid Build Coastguard Workersmall test utility to verify the correctness of the exported .pte file. Just run:
49*523fa7a6SAndroid Build Coastguard Worker
50*523fa7a6SAndroid Build Coastguard Worker```bash
51*523fa7a6SAndroid Build Coastguard Workerpython -m executorch.examples.models.llava.test.test_pte llava.pte
52*523fa7a6SAndroid Build Coastguard Worker```
53*523fa7a6SAndroid Build Coastguard Worker
54*523fa7a6SAndroid Build Coastguard Worker### Build C++ Runner
55*523fa7a6SAndroid Build Coastguard Worker
56*523fa7a6SAndroid Build Coastguard WorkerSee or run `.ci/scripts/test_llava.sh` shell script to build a C++ runner. This
57*523fa7a6SAndroid Build Coastguard Workerscript also has preliminary support to build the C++ runner for Android.
58*523fa7a6SAndroid Build Coastguard Worker
59*523fa7a6SAndroid Build Coastguard WorkerThis also has an image utility Python script to generate image in PyTorch
60*523fa7a6SAndroid Build Coastguard Workerloadable format. Alternatively, we are working on generating image format which
61*523fa7a6SAndroid Build Coastguard Workerdoesn't need PyTorch to load an image. Motivation for this is to build the C++
62*523fa7a6SAndroid Build Coastguard Workerrunner on Android.
63*523fa7a6SAndroid Build Coastguard Worker
64*523fa7a6SAndroid Build Coastguard WorkerThen you should be able to find `llava_main` binary:
65*523fa7a6SAndroid Build Coastguard Worker
66*523fa7a6SAndroid Build Coastguard Worker```bash
67*523fa7a6SAndroid Build Coastguard Workercmake-out/examples/models/llava/llava_main
68*523fa7a6SAndroid Build Coastguard Worker```
69*523fa7a6SAndroid Build Coastguard Worker
70*523fa7a6SAndroid Build Coastguard Worker### Build Mobile Apps
71*523fa7a6SAndroid Build Coastguard Worker
72*523fa7a6SAndroid Build Coastguard Worker#### Android
73*523fa7a6SAndroid Build Coastguard Worker
74*523fa7a6SAndroid Build Coastguard WorkerWe can run LLAVA using the LLAMA Demo Apps. Please refer to [this
75*523fa7a6SAndroid Build Coastguard Workertutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo)
76*523fa7a6SAndroid Build Coastguard Workerto for full instructions on building the Android LLAMA Demo App.
77*523fa7a6SAndroid Build Coastguard Worker
78*523fa7a6SAndroid Build Coastguard Worker#### iOS
79*523fa7a6SAndroid Build Coastguard Worker
80*523fa7a6SAndroid Build Coastguard WorkerWe can run LLAVA using the LLAMA Demo Apps. Please refer to [this
81*523fa7a6SAndroid Build Coastguard Workertutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA)
82*523fa7a6SAndroid Build Coastguard Workerto for full instructions on building the iOS LLAMA Demo App.
83*523fa7a6SAndroid Build Coastguard Worker
84*523fa7a6SAndroid Build Coastguard Worker### Running LLaVA
85*523fa7a6SAndroid Build Coastguard Worker
86*523fa7a6SAndroid Build Coastguard WorkerRun:
87*523fa7a6SAndroid Build Coastguard Worker```bash
88*523fa7a6SAndroid Build Coastguard Workercmake-out/examples/models/llava/llava_main \
89*523fa7a6SAndroid Build Coastguard Worker    --model_path=llava.pte                 \
90*523fa7a6SAndroid Build Coastguard Worker    --tokenizer_path=tokenizer.bin         \
91*523fa7a6SAndroid Build Coastguard Worker    --image_path=image.pt                  \
92*523fa7a6SAndroid Build Coastguard Worker    --prompt="ASSISTANT:" \
93*523fa7a6SAndroid Build Coastguard Worker    --seq_len=768                          \
94*523fa7a6SAndroid Build Coastguard Worker    --temperature=0
95*523fa7a6SAndroid Build Coastguard Worker```
96*523fa7a6SAndroid Build Coastguard Worker(see --help for other options).
97*523fa7a6SAndroid Build Coastguard Worker
98*523fa7a6SAndroid Build Coastguard WorkerFor this example input used in this example,
99*523fa7a6SAndroid Build Coastguard Worker
100*523fa7a6SAndroid Build Coastguard Worker![image](https://upload.wikimedia.org/wikipedia/commons/3/3e/Chicago_Bulls_-_New_Jersey_Nets_match_on_March_28%2C_1991.jpg)
101*523fa7a6SAndroid Build Coastguard Worker
102*523fa7a6SAndroid Build Coastguard WorkerYou should get a response like (tested on Arm CPUs with ET XNNPACK delegate):
103*523fa7a6SAndroid Build Coastguard Worker
104*523fa7a6SAndroid Build Coastguard Worker```
105*523fa7a6SAndroid Build Coastguard WorkerASSISTANT: image captures a basketball game in progress, with several players on the court. ...
106*523fa7a6SAndroid Build Coastguard Worker```
107*523fa7a6SAndroid Build Coastguard Worker
108*523fa7a6SAndroid Build Coastguard Worker## Optimizations and Results
109*523fa7a6SAndroid Build Coastguard Worker
110*523fa7a6SAndroid Build Coastguard WorkerSince LLaVA model needs at least 4-bit quantization to fit even within some of
111*523fa7a6SAndroid Build Coastguard Workerthe high-end phones, results presented here correspond to 4-bit groupwise
112*523fa7a6SAndroid Build Coastguard Workerpost-training quantized model.
113*523fa7a6SAndroid Build Coastguard Worker
114*523fa7a6SAndroid Build Coastguard WorkerIn addition to that, work is mainly focused on using Arm CPUs and ET XNNPACK delegate.
115*523fa7a6SAndroid Build Coastguard Worker
116*523fa7a6SAndroid Build Coastguard Worker### Memory Footprint Reduction Techniques
117*523fa7a6SAndroid Build Coastguard Worker
118*523fa7a6SAndroid Build Coastguard WorkerWith Llava, we needed to find a way to reduce the memory footprint in order to
119*523fa7a6SAndroid Build Coastguard Workermake it feasible to run on edge devices. Out of the box, even with 4-bit
120*523fa7a6SAndroid Build Coastguard Workerquantized weights, the memory footprint is around ~11 GiB, which is
121*523fa7a6SAndroid Build Coastguard Workerprohibitively large even for high-end Android or iOS devices.
122*523fa7a6SAndroid Build Coastguard Worker
123*523fa7a6SAndroid Build Coastguard WorkerWe did several optimizations, which should be already enabled if you follow this
124*523fa7a6SAndroid Build Coastguard Workertutorial, to get the memory footprint down to ~5 GiB, which unblocks us to run
125*523fa7a6SAndroid Build Coastguard Workeron high-end devices.
126*523fa7a6SAndroid Build Coastguard Worker
127*523fa7a6SAndroid Build Coastguard Worker#### Sharing intermediate memory across delegates
128*523fa7a6SAndroid Build Coastguard Worker
129*523fa7a6SAndroid Build Coastguard WorkerSharing working memory across ET XNNPACK delegates helps reduce the peak memory
130*523fa7a6SAndroid Build Coastguard Workerusage for LLMs with many DQLinears. We reduced it by 36.1% (from 10.44GiB to
131*523fa7a6SAndroid Build Coastguard Worker6.67GiB) for Llava towards unblocking it to run on Phones.
132*523fa7a6SAndroid Build Coastguard Worker
133*523fa7a6SAndroid Build Coastguard Worker#### Reducing maximum sequence length
134*523fa7a6SAndroid Build Coastguard Worker
135*523fa7a6SAndroid Build Coastguard WorkerTo free up more memory, we examined non-constant memory usage, specifically
136*523fa7a6SAndroid Build Coastguard Workerfocusing on intermediate tensors used throughout the model during inference.
137*523fa7a6SAndroid Build Coastguard WorkerThe majority of these were found in the KV-cache allocations. Based on “minimum
138*523fa7a6SAndroid Build Coastguard Workercan get away with” heuristic, we reduced max sequence length number to 768 from
139*523fa7a6SAndroid Build Coastguard Workerprevious default 2048. This adjustment led to a further memory reduction of
140*523fa7a6SAndroid Build Coastguard Workerapproximately 1.23 GiB (from 6.67 GiB to 5.44 GiB).
141*523fa7a6SAndroid Build Coastguard Worker
142*523fa7a6SAndroid Build Coastguard Worker#### Quantizing embedding weights to 8b
143*523fa7a6SAndroid Build Coastguard Worker
144*523fa7a6SAndroid Build Coastguard WorkerBy quantizing the embedding layer to 8 bit, we were able to achieve an
145*523fa7a6SAndroid Build Coastguard Workeradditional memory footprint reduction of approximately 300 MiB, bringing the
146*523fa7a6SAndroid Build Coastguard Workertotal down to ~5 GiB.
147*523fa7a6SAndroid Build Coastguard Worker
148*523fa7a6SAndroid Build Coastguard Worker### Performance Optimizations
149*523fa7a6SAndroid Build Coastguard Worker
150*523fa7a6SAndroid Build Coastguard Worker#### Decode performance
151*523fa7a6SAndroid Build Coastguard Worker
152*523fa7a6SAndroid Build Coastguard WorkerThis was already heavily optimized through KV-cache and GEMV kernel
153*523fa7a6SAndroid Build Coastguard Workeroptimization efforts for LLama2/3.
154*523fa7a6SAndroid Build Coastguard Worker
155*523fa7a6SAndroid Build Coastguard Worker#### Encode performance
156*523fa7a6SAndroid Build Coastguard Worker
157*523fa7a6SAndroid Build Coastguard WorkerWith image based large prompts, this was the focus of performance
158*523fa7a6SAndroid Build Coastguard Workeroptimizations for LLaVA. We implemented two main optimizations to bring the decode or
159*523fa7a6SAndroid Build Coastguard Workerprefill performance for the image down by more than 100% from the baseline.
160*523fa7a6SAndroid Build Coastguard Worker
161*523fa7a6SAndroid Build Coastguard Worker* **Two XNNPACK Partitioners**
162*523fa7a6SAndroid Build Coastguard Worker
163*523fa7a6SAndroid Build Coastguard WorkerFor text-only LLMs, our approach involved lowering only DQLinear ops
164*523fa7a6SAndroid Build Coastguard Workerto XNNPACK and relying on ExecuTorch-optimized operators or custom ops
165*523fa7a6SAndroid Build Coastguard Worker(utilizing Neon SIMD) to support multiplication, addition, and other
166*523fa7a6SAndroid Build Coastguard Workeroperations. Lowering these operations to XNNPACK significantly improves Time to
167*523fa7a6SAndroid Build Coastguard WorkerFirst Token (TTFT).
168*523fa7a6SAndroid Build Coastguard Worker
169*523fa7a6SAndroid Build Coastguard Worker
170*523fa7a6SAndroid Build Coastguard Worker* **New Arm Neon I8mm GEMM kernels**
171*523fa7a6SAndroid Build Coastguard Worker
172*523fa7a6SAndroid Build Coastguard WorkerWe introduced new kernels in XNNPACK for the quantization scheme used
173*523fa7a6SAndroid Build Coastguard Workerhere, which upgrades our existing dot-prod based GEMM kernels to i8mm based
174*523fa7a6SAndroid Build Coastguard WorkerGEMM kernels. The new kernel offers significantly improved performance by
175*523fa7a6SAndroid Build Coastguard Workerleveraging the more efficient SMMLA instruction from Arm Neon. However, it's
176*523fa7a6SAndroid Build Coastguard Workerworth noting that this instruction is only available on newer Arm CPUs.
177*523fa7a6SAndroid Build Coastguard Worker
178*523fa7a6SAndroid Build Coastguard Worker
179*523fa7a6SAndroid Build Coastguard Worker### Results
180*523fa7a6SAndroid Build Coastguard Worker
181*523fa7a6SAndroid Build Coastguard WorkerNote this is an active area of development in the ExecuTorch repository. You
182*523fa7a6SAndroid Build Coastguard Workerwill need this PR [5380](https://github.com/pytorch/executorch/pull/5380) to
183*523fa7a6SAndroid Build Coastguard Workersupply an image to the C++ runner on Android without Torch dependency. This
184*523fa7a6SAndroid Build Coastguard Workershould be merged soon.
185*523fa7a6SAndroid Build Coastguard Worker
186*523fa7a6SAndroid Build Coastguard WorkerWith those caveats out of the way, here are some preliminary numbers (as average of
187*523fa7a6SAndroid Build Coastguard Workerthree runs) for LLaVA using a C++ runner on Android OnePlus12 device with 12GiB
188*523fa7a6SAndroid Build Coastguard Workermemory.
189*523fa7a6SAndroid Build Coastguard Worker
190*523fa7a6SAndroid Build Coastguard Worker| Experiment Setup  | Prefill time in seconds | Decode tokens/second |
191*523fa7a6SAndroid Build Coastguard Worker| :------------- | -------------: | -------------: |
192*523fa7a6SAndroid Build Coastguard Worker| Baseline  | 29.95  | 8.75 |
193*523fa7a6SAndroid Build Coastguard Worker| + Two XNNPACK Partitioners  | 17.82  | 8.93 |
194*523fa7a6SAndroid Build Coastguard Worker| + New Arm Neon i8mm GEMM Kernels  | 14.60 | 8.92 |
195*523fa7a6SAndroid Build Coastguard Worker
196*523fa7a6SAndroid Build Coastguard WorkerWe appreciate your feedback. Please let us know if you run into any issues.
197