1*523fa7a6SAndroid Build Coastguard Worker## Summary 2*523fa7a6SAndroid Build Coastguard WorkerLLaVA is the first multi-modal LLM ExecuTorch supports. In this directory, we 3*523fa7a6SAndroid Build Coastguard Worker- Host a model definition for [LLavA](https://github.com/haotian-liu/LLaVA). 4*523fa7a6SAndroid Build Coastguard Worker- Demonstrate how to export LLavA multimodal model to generate ExecuTorch .PTE file. 5*523fa7a6SAndroid Build Coastguard Worker- Provide a C++ runner, Android/iOS Apps that loads the .pte file, the tokenizer and an image, then generate responses based on user prompt. 6*523fa7a6SAndroid Build Coastguard Worker- Discuss optimizations went into enabling LlaVA on a phone, and early performance numbers 7*523fa7a6SAndroid Build Coastguard Worker 8*523fa7a6SAndroid Build Coastguard WorkerTokenizer, image encoder, and the pretrained text model, which is based on Meta 9*523fa7a6SAndroid Build Coastguard Worker[Llama2-7b](https://llama.meta.com/llama2/), is loaded from Llava 10*523fa7a6SAndroid Build Coastguard Workerhuggingface page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) . 11*523fa7a6SAndroid Build Coastguard Worker 12*523fa7a6SAndroid Build Coastguard Worker 13*523fa7a6SAndroid Build Coastguard Worker<p align="center"> 14*523fa7a6SAndroid Build Coastguard Worker <img src="./llava_via_xnnpack.gif" width=300> 15*523fa7a6SAndroid Build Coastguard Worker <br> 16*523fa7a6SAndroid Build Coastguard Worker <em> 17*523fa7a6SAndroid Build Coastguard Worker Running Llava1.5 7B on Android phone 18*523fa7a6SAndroid Build Coastguard Worker </em> 19*523fa7a6SAndroid Build Coastguard Worker</p> 20*523fa7a6SAndroid Build Coastguard Worker 21*523fa7a6SAndroid Build Coastguard Worker## What is LLaVA? 22*523fa7a6SAndroid Build Coastguard Worker 23*523fa7a6SAndroid Build Coastguard Worker[LLaVA](https://llava-vl.github.io/) is a novel end-to-end trained large 24*523fa7a6SAndroid Build Coastguard Workermultimodal model that combines a vision encoder and Vicuna (a LLama2 based text 25*523fa7a6SAndroid Build Coastguard Workermodel) for general-purpose visual and language understanding, achieving 26*523fa7a6SAndroid Build Coastguard Workerimpressive chat capabilities mimicking spirits of the cutting edge multimodal 27*523fa7a6SAndroid Build Coastguard Workermodels and setting a high bar for accuracy on Science QA. 28*523fa7a6SAndroid Build Coastguard Worker 29*523fa7a6SAndroid Build Coastguard Worker## Instructions 30*523fa7a6SAndroid Build Coastguard Worker 31*523fa7a6SAndroid Build Coastguard WorkerFirst you need to generate a .PTE file for the model, along with input image, 32*523fa7a6SAndroid Build Coastguard Workerand other artifacts. Then you need either a C++ runner, or Android or iOS 33*523fa7a6SAndroid Build Coastguard Workerapplication to test things out on device. 34*523fa7a6SAndroid Build Coastguard Worker 35*523fa7a6SAndroid Build Coastguard Worker### Generate ExecuTorch .PTE and other artifacts 36*523fa7a6SAndroid Build Coastguard Worker 37*523fa7a6SAndroid Build Coastguard WorkerRun the following command to generate `llava.pte`, `tokenizer.bin` and an image 38*523fa7a6SAndroid Build Coastguard Workertensor (serialized in TorchScript) `image.pt`. 39*523fa7a6SAndroid Build Coastguard Worker 40*523fa7a6SAndroid Build Coastguard WorkerPrerequisite: run `install_requirements.sh` to install ExecuTorch and run 41*523fa7a6SAndroid Build Coastguard Worker`examples/models/llava/install_requirements.sh` to install dependencies. 42*523fa7a6SAndroid Build Coastguard Worker 43*523fa7a6SAndroid Build Coastguard Worker```bash 44*523fa7a6SAndroid Build Coastguard Workerpython -m executorch.examples.models.llava.export_llava --pte-name llava.pte --with-artifacts 45*523fa7a6SAndroid Build Coastguard Worker``` 46*523fa7a6SAndroid Build Coastguard Worker 47*523fa7a6SAndroid Build Coastguard WorkerCurrently the whole export process takes about 6 minutes. We also provide a 48*523fa7a6SAndroid Build Coastguard Workersmall test utility to verify the correctness of the exported .pte file. Just run: 49*523fa7a6SAndroid Build Coastguard Worker 50*523fa7a6SAndroid Build Coastguard Worker```bash 51*523fa7a6SAndroid Build Coastguard Workerpython -m executorch.examples.models.llava.test.test_pte llava.pte 52*523fa7a6SAndroid Build Coastguard Worker``` 53*523fa7a6SAndroid Build Coastguard Worker 54*523fa7a6SAndroid Build Coastguard Worker### Build C++ Runner 55*523fa7a6SAndroid Build Coastguard Worker 56*523fa7a6SAndroid Build Coastguard WorkerSee or run `.ci/scripts/test_llava.sh` shell script to build a C++ runner. This 57*523fa7a6SAndroid Build Coastguard Workerscript also has preliminary support to build the C++ runner for Android. 58*523fa7a6SAndroid Build Coastguard Worker 59*523fa7a6SAndroid Build Coastguard WorkerThis also has an image utility Python script to generate image in PyTorch 60*523fa7a6SAndroid Build Coastguard Workerloadable format. Alternatively, we are working on generating image format which 61*523fa7a6SAndroid Build Coastguard Workerdoesn't need PyTorch to load an image. Motivation for this is to build the C++ 62*523fa7a6SAndroid Build Coastguard Workerrunner on Android. 63*523fa7a6SAndroid Build Coastguard Worker 64*523fa7a6SAndroid Build Coastguard WorkerThen you should be able to find `llava_main` binary: 65*523fa7a6SAndroid Build Coastguard Worker 66*523fa7a6SAndroid Build Coastguard Worker```bash 67*523fa7a6SAndroid Build Coastguard Workercmake-out/examples/models/llava/llava_main 68*523fa7a6SAndroid Build Coastguard Worker``` 69*523fa7a6SAndroid Build Coastguard Worker 70*523fa7a6SAndroid Build Coastguard Worker### Build Mobile Apps 71*523fa7a6SAndroid Build Coastguard Worker 72*523fa7a6SAndroid Build Coastguard Worker#### Android 73*523fa7a6SAndroid Build Coastguard Worker 74*523fa7a6SAndroid Build Coastguard WorkerWe can run LLAVA using the LLAMA Demo Apps. Please refer to [this 75*523fa7a6SAndroid Build Coastguard Workertutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo) 76*523fa7a6SAndroid Build Coastguard Workerto for full instructions on building the Android LLAMA Demo App. 77*523fa7a6SAndroid Build Coastguard Worker 78*523fa7a6SAndroid Build Coastguard Worker#### iOS 79*523fa7a6SAndroid Build Coastguard Worker 80*523fa7a6SAndroid Build Coastguard WorkerWe can run LLAVA using the LLAMA Demo Apps. Please refer to [this 81*523fa7a6SAndroid Build Coastguard Workertutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA) 82*523fa7a6SAndroid Build Coastguard Workerto for full instructions on building the iOS LLAMA Demo App. 83*523fa7a6SAndroid Build Coastguard Worker 84*523fa7a6SAndroid Build Coastguard Worker### Running LLaVA 85*523fa7a6SAndroid Build Coastguard Worker 86*523fa7a6SAndroid Build Coastguard WorkerRun: 87*523fa7a6SAndroid Build Coastguard Worker```bash 88*523fa7a6SAndroid Build Coastguard Workercmake-out/examples/models/llava/llava_main \ 89*523fa7a6SAndroid Build Coastguard Worker --model_path=llava.pte \ 90*523fa7a6SAndroid Build Coastguard Worker --tokenizer_path=tokenizer.bin \ 91*523fa7a6SAndroid Build Coastguard Worker --image_path=image.pt \ 92*523fa7a6SAndroid Build Coastguard Worker --prompt="ASSISTANT:" \ 93*523fa7a6SAndroid Build Coastguard Worker --seq_len=768 \ 94*523fa7a6SAndroid Build Coastguard Worker --temperature=0 95*523fa7a6SAndroid Build Coastguard Worker``` 96*523fa7a6SAndroid Build Coastguard Worker(see --help for other options). 97*523fa7a6SAndroid Build Coastguard Worker 98*523fa7a6SAndroid Build Coastguard WorkerFor this example input used in this example, 99*523fa7a6SAndroid Build Coastguard Worker 100*523fa7a6SAndroid Build Coastguard Worker 101*523fa7a6SAndroid Build Coastguard Worker 102*523fa7a6SAndroid Build Coastguard WorkerYou should get a response like (tested on Arm CPUs with ET XNNPACK delegate): 103*523fa7a6SAndroid Build Coastguard Worker 104*523fa7a6SAndroid Build Coastguard Worker``` 105*523fa7a6SAndroid Build Coastguard WorkerASSISTANT: image captures a basketball game in progress, with several players on the court. ... 106*523fa7a6SAndroid Build Coastguard Worker``` 107*523fa7a6SAndroid Build Coastguard Worker 108*523fa7a6SAndroid Build Coastguard Worker## Optimizations and Results 109*523fa7a6SAndroid Build Coastguard Worker 110*523fa7a6SAndroid Build Coastguard WorkerSince LLaVA model needs at least 4-bit quantization to fit even within some of 111*523fa7a6SAndroid Build Coastguard Workerthe high-end phones, results presented here correspond to 4-bit groupwise 112*523fa7a6SAndroid Build Coastguard Workerpost-training quantized model. 113*523fa7a6SAndroid Build Coastguard Worker 114*523fa7a6SAndroid Build Coastguard WorkerIn addition to that, work is mainly focused on using Arm CPUs and ET XNNPACK delegate. 115*523fa7a6SAndroid Build Coastguard Worker 116*523fa7a6SAndroid Build Coastguard Worker### Memory Footprint Reduction Techniques 117*523fa7a6SAndroid Build Coastguard Worker 118*523fa7a6SAndroid Build Coastguard WorkerWith Llava, we needed to find a way to reduce the memory footprint in order to 119*523fa7a6SAndroid Build Coastguard Workermake it feasible to run on edge devices. Out of the box, even with 4-bit 120*523fa7a6SAndroid Build Coastguard Workerquantized weights, the memory footprint is around ~11 GiB, which is 121*523fa7a6SAndroid Build Coastguard Workerprohibitively large even for high-end Android or iOS devices. 122*523fa7a6SAndroid Build Coastguard Worker 123*523fa7a6SAndroid Build Coastguard WorkerWe did several optimizations, which should be already enabled if you follow this 124*523fa7a6SAndroid Build Coastguard Workertutorial, to get the memory footprint down to ~5 GiB, which unblocks us to run 125*523fa7a6SAndroid Build Coastguard Workeron high-end devices. 126*523fa7a6SAndroid Build Coastguard Worker 127*523fa7a6SAndroid Build Coastguard Worker#### Sharing intermediate memory across delegates 128*523fa7a6SAndroid Build Coastguard Worker 129*523fa7a6SAndroid Build Coastguard WorkerSharing working memory across ET XNNPACK delegates helps reduce the peak memory 130*523fa7a6SAndroid Build Coastguard Workerusage for LLMs with many DQLinears. We reduced it by 36.1% (from 10.44GiB to 131*523fa7a6SAndroid Build Coastguard Worker6.67GiB) for Llava towards unblocking it to run on Phones. 132*523fa7a6SAndroid Build Coastguard Worker 133*523fa7a6SAndroid Build Coastguard Worker#### Reducing maximum sequence length 134*523fa7a6SAndroid Build Coastguard Worker 135*523fa7a6SAndroid Build Coastguard WorkerTo free up more memory, we examined non-constant memory usage, specifically 136*523fa7a6SAndroid Build Coastguard Workerfocusing on intermediate tensors used throughout the model during inference. 137*523fa7a6SAndroid Build Coastguard WorkerThe majority of these were found in the KV-cache allocations. Based on “minimum 138*523fa7a6SAndroid Build Coastguard Workercan get away with” heuristic, we reduced max sequence length number to 768 from 139*523fa7a6SAndroid Build Coastguard Workerprevious default 2048. This adjustment led to a further memory reduction of 140*523fa7a6SAndroid Build Coastguard Workerapproximately 1.23 GiB (from 6.67 GiB to 5.44 GiB). 141*523fa7a6SAndroid Build Coastguard Worker 142*523fa7a6SAndroid Build Coastguard Worker#### Quantizing embedding weights to 8b 143*523fa7a6SAndroid Build Coastguard Worker 144*523fa7a6SAndroid Build Coastguard WorkerBy quantizing the embedding layer to 8 bit, we were able to achieve an 145*523fa7a6SAndroid Build Coastguard Workeradditional memory footprint reduction of approximately 300 MiB, bringing the 146*523fa7a6SAndroid Build Coastguard Workertotal down to ~5 GiB. 147*523fa7a6SAndroid Build Coastguard Worker 148*523fa7a6SAndroid Build Coastguard Worker### Performance Optimizations 149*523fa7a6SAndroid Build Coastguard Worker 150*523fa7a6SAndroid Build Coastguard Worker#### Decode performance 151*523fa7a6SAndroid Build Coastguard Worker 152*523fa7a6SAndroid Build Coastguard WorkerThis was already heavily optimized through KV-cache and GEMV kernel 153*523fa7a6SAndroid Build Coastguard Workeroptimization efforts for LLama2/3. 154*523fa7a6SAndroid Build Coastguard Worker 155*523fa7a6SAndroid Build Coastguard Worker#### Encode performance 156*523fa7a6SAndroid Build Coastguard Worker 157*523fa7a6SAndroid Build Coastguard WorkerWith image based large prompts, this was the focus of performance 158*523fa7a6SAndroid Build Coastguard Workeroptimizations for LLaVA. We implemented two main optimizations to bring the decode or 159*523fa7a6SAndroid Build Coastguard Workerprefill performance for the image down by more than 100% from the baseline. 160*523fa7a6SAndroid Build Coastguard Worker 161*523fa7a6SAndroid Build Coastguard Worker* **Two XNNPACK Partitioners** 162*523fa7a6SAndroid Build Coastguard Worker 163*523fa7a6SAndroid Build Coastguard WorkerFor text-only LLMs, our approach involved lowering only DQLinear ops 164*523fa7a6SAndroid Build Coastguard Workerto XNNPACK and relying on ExecuTorch-optimized operators or custom ops 165*523fa7a6SAndroid Build Coastguard Worker(utilizing Neon SIMD) to support multiplication, addition, and other 166*523fa7a6SAndroid Build Coastguard Workeroperations. Lowering these operations to XNNPACK significantly improves Time to 167*523fa7a6SAndroid Build Coastguard WorkerFirst Token (TTFT). 168*523fa7a6SAndroid Build Coastguard Worker 169*523fa7a6SAndroid Build Coastguard Worker 170*523fa7a6SAndroid Build Coastguard Worker* **New Arm Neon I8mm GEMM kernels** 171*523fa7a6SAndroid Build Coastguard Worker 172*523fa7a6SAndroid Build Coastguard WorkerWe introduced new kernels in XNNPACK for the quantization scheme used 173*523fa7a6SAndroid Build Coastguard Workerhere, which upgrades our existing dot-prod based GEMM kernels to i8mm based 174*523fa7a6SAndroid Build Coastguard WorkerGEMM kernels. The new kernel offers significantly improved performance by 175*523fa7a6SAndroid Build Coastguard Workerleveraging the more efficient SMMLA instruction from Arm Neon. However, it's 176*523fa7a6SAndroid Build Coastguard Workerworth noting that this instruction is only available on newer Arm CPUs. 177*523fa7a6SAndroid Build Coastguard Worker 178*523fa7a6SAndroid Build Coastguard Worker 179*523fa7a6SAndroid Build Coastguard Worker### Results 180*523fa7a6SAndroid Build Coastguard Worker 181*523fa7a6SAndroid Build Coastguard WorkerNote this is an active area of development in the ExecuTorch repository. You 182*523fa7a6SAndroid Build Coastguard Workerwill need this PR [5380](https://github.com/pytorch/executorch/pull/5380) to 183*523fa7a6SAndroid Build Coastguard Workersupply an image to the C++ runner on Android without Torch dependency. This 184*523fa7a6SAndroid Build Coastguard Workershould be merged soon. 185*523fa7a6SAndroid Build Coastguard Worker 186*523fa7a6SAndroid Build Coastguard WorkerWith those caveats out of the way, here are some preliminary numbers (as average of 187*523fa7a6SAndroid Build Coastguard Workerthree runs) for LLaVA using a C++ runner on Android OnePlus12 device with 12GiB 188*523fa7a6SAndroid Build Coastguard Workermemory. 189*523fa7a6SAndroid Build Coastguard Worker 190*523fa7a6SAndroid Build Coastguard Worker| Experiment Setup | Prefill time in seconds | Decode tokens/second | 191*523fa7a6SAndroid Build Coastguard Worker| :------------- | -------------: | -------------: | 192*523fa7a6SAndroid Build Coastguard Worker| Baseline | 29.95 | 8.75 | 193*523fa7a6SAndroid Build Coastguard Worker| + Two XNNPACK Partitioners | 17.82 | 8.93 | 194*523fa7a6SAndroid Build Coastguard Worker| + New Arm Neon i8mm GEMM Kernels | 14.60 | 8.92 | 195*523fa7a6SAndroid Build Coastguard Worker 196*523fa7a6SAndroid Build Coastguard WorkerWe appreciate your feedback. Please let us know if you run into any issues. 197