## Summary
LLaVA is the first multi-modal LLM that ExecuTorch supports. In this directory, we
- Host a model definition for [LLaVA](https://github.com/haotian-liu/LLaVA).
- Demonstrate how to export the LLaVA multimodal model to an ExecuTorch .PTE file.
- Provide a C++ runner and Android/iOS apps that load the .pte file, the tokenizer, and an image, then generate responses based on a user prompt.
- Discuss the optimizations that went into enabling LLaVA on a phone, and share early performance numbers.

The tokenizer, image encoder, and pretrained text model, which is based on Meta
[Llama2-7b](https://llama.meta.com/llama2/), are loaded from the LLaVA
Hugging Face page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf).


<p align="center">
      <img src="./llava_via_xnnpack.gif" width=300>
      <br>
      <em>
      Running LLaVA 1.5 7B on an Android phone
      </em>
</p>

## What is LLaVA?

[LLaVA](https://llava-vl.github.io/) is a novel end-to-end trained large
multimodal model that combines a vision encoder and Vicuna (a Llama2-based text
model) for general-purpose visual and language understanding, achieving
impressive chat capabilities that mimic the spirit of cutting-edge multimodal
models and setting a high bar for accuracy on Science QA.

## Instructions

First, you need to generate a .PTE file for the model, along with an input image
and other artifacts. Then you need either the C++ runner or the Android or iOS
application to test things out on device.

### Generate ExecuTorch .PTE and other artifacts

Run the following command to generate `llava.pte`, `tokenizer.bin`, and an image
tensor (serialized in TorchScript) `image.pt`.

Prerequisite: run `install_requirements.sh` to install ExecuTorch, and run
`examples/models/llava/install_requirements.sh` to install this example's dependencies.

```bash
python -m executorch.examples.models.llava.export_llava --pte-name llava.pte --with-artifacts
```

Currently the whole export process takes about 6 minutes. We also provide a
small test utility to verify the correctness of the exported .pte file. Just run:

```bash
python -m executorch.examples.models.llava.test.test_pte llava.pte
```

### Build C++ Runner

See or run the `.ci/scripts/test_llava.sh` shell script to build a C++ runner. The
script also has preliminary support for building the C++ runner for Android.

The script additionally uses an image utility (a Python script) to generate an image
in a PyTorch-loadable format. We are also working on an image format that does not
need PyTorch to load, motivated by building the C++ runner on Android without a
Torch dependency.

After the build, you should be able to find the `llava_main` binary at:

```bash
cmake-out/examples/models/llava/llava_main
```

### Build Mobile Apps

#### Android

We can run LLaVA using the LLaMA Demo App. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo)
for full instructions on building the Android LLaMA Demo App.

#### iOS

We can run LLaVA using the LLaMA Demo App. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA)
for full instructions on building the iOS LLaMA Demo App.
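If you want to try a different image with the runner below, it needs to be in a
serialized-tensor format like the generated `image.pt`. The snippet below is a
minimal, hypothetical sketch of one way to produce such a file; the image utility
script mentioned above is the source of truth for the exact preprocessing and
serialization that `llava_main` expects, so treat the file names and tensor layout
here as assumptions.

```python
# Hypothetical sketch only: serialize an RGB image as a uint8 CHW tensor.
# The exact preprocessing/serialization expected by llava_main is defined by
# the image utility script in this example; file names here are placeholders.
import torch
import torchvision.transforms.functional as F
from PIL import Image

img = Image.open("my_image.jpg").convert("RGB")   # placeholder input path
tensor = F.pil_to_tensor(img)                     # shape [3, H, W], dtype uint8
torch.save(tensor, "image.pt")                    # pass via --image_path
```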
### Running LLaVA

Run:
```bash
cmake-out/examples/models/llava/llava_main \
    --model_path=llava.pte \
    --tokenizer_path=tokenizer.bin \
    --image_path=image.pt \
    --prompt="ASSISTANT:" \
    --seq_len=768 \
    --temperature=0
```
(See `--help` for other options.)

For the example image used in this example, you should get a response like the
following (tested on Arm CPUs with the ET XNNPACK delegate):

```
ASSISTANT: image captures a basketball game in progress, with several players on the court. ...
```

## Optimizations and Results

Since the LLaVA model needs at least 4-bit quantization to fit even on some
high-end phones, the results presented here correspond to a 4-bit groupwise
post-training quantized model.

In addition, the work so far has focused mainly on Arm CPUs with the ET XNNPACK
delegate.

### Memory Footprint Reduction Techniques

With LLaVA, we needed to find a way to reduce the memory footprint in order to
make it feasible to run on edge devices. Out of the box, even with 4-bit
quantized weights, the memory footprint is around 11 GiB, which is
prohibitively large even for high-end Android or iOS devices.

We applied several optimizations, which are already enabled if you follow this
tutorial, to get the memory footprint down to ~5 GiB, which makes it feasible
to run on high-end devices.

#### Sharing intermediate memory across delegates

Sharing working memory across ET XNNPACK delegates helps reduce the peak memory
usage for LLMs with many DQLinear ops. For LLaVA, this reduced peak memory by
36.1% (from 10.44 GiB to 6.67 GiB), a key step towards running it on phones.

#### Reducing maximum sequence length

To free up more memory, we examined non-constant memory usage, specifically
focusing on intermediate tensors used throughout the model during inference.
The majority of these were found in the KV-cache allocations. Based on a
"minimum we can get away with" heuristic, we reduced the maximum sequence
length from the previous default of 2048 to 768. This adjustment led to a
further memory reduction of approximately 1.23 GiB (from 6.67 GiB to 5.44 GiB).

#### Quantizing embedding weights to 8 bits

By quantizing the embedding layer to 8 bits, we were able to achieve an
additional memory footprint reduction of approximately 300 MiB, bringing the
total down to ~5 GiB.

### Performance Optimizations

#### Decode performance

Decode was already heavily optimized through the KV-cache and GEMV kernel
optimization efforts for Llama2/3.

#### Encode performance

With large, image-based prompts, encode (prefill) was the focus of performance
optimization for LLaVA. We implemented two main optimizations, described below,
which more than halve the prefill time for the image compared to the baseline.

* **Two XNNPACK Partitioners**

For text-only LLMs, our approach involved lowering only DQLinear ops
to XNNPACK and relying on ExecuTorch-optimized operators or custom ops
(utilizing Neon SIMD) for multiplication, addition, and other
operations. For LLaVA, also lowering these operations to XNNPACK, via a second
partitioner instance, significantly improves Time to First Token (TTFT).
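To make the two-partitioner idea concrete, here is a conceptual sketch of lowering
an exported program with two `XnnpackPartitioner` instances passed to
`to_edge_transform_and_lower`. This is not the code from `export_llava.py`: the
real export script configures each instance differently (one targeting dynamically
quantized linears, one targeting the remaining fp32 ops) via constructor arguments
omitted here, and a toy module stands in for the LLaVA text model.

```python
# Conceptual sketch, not the export_llava.py implementation. In the real script,
# each XnnpackPartitioner instance is configured (via constructor arguments not
# shown here) to claim a different set of ops: dynamically quantized linears
# vs. the remaining fp32 operators.
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


class TinyTextModel(torch.nn.Module):
    """Toy stand-in for the LLaVA text model."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))


exported = torch.export.export(TinyTextModel().eval(), (torch.randn(1, 16),))

# Partitioners are applied in order: the second instance only sees ops that the
# first one did not claim.
et_program = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner(), XnnpackPartitioner()],
).to_executorch()

with open("tiny_xnnpack.pte", "wb") as f:
    f.write(et_program.buffer)
```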
* **New Arm Neon I8mm GEMM kernels**

We introduced new kernels in XNNPACK for the quantization scheme used here,
upgrading our existing dot-product-based GEMM kernels to i8mm-based GEMM
kernels. The new kernels offer significantly better performance by leveraging
the more efficient SMMLA instruction from Arm Neon. Note, however, that this
instruction is only available on newer Arm CPUs (a quick way to check a device
is shown at the bottom of this page).


### Results

Note that this is an active area of development in the ExecuTorch repository.
You will need PR [5380](https://github.com/pytorch/executorch/pull/5380) to
supply an image to the C++ runner on Android without a Torch dependency. This
PR should be merged soon.

With those caveats out of the way, here are some preliminary numbers (averaged
over three runs) for LLaVA using the C++ runner on an Android OnePlus 12 device
with 12 GiB of memory.

| Experiment Setup | Prefill time (seconds) | Decode speed (tokens/second) |
| :------------- | -------------: | -------------: |
| Baseline | 29.95 | 8.75 |
| + Two XNNPACK Partitioners | 17.82 | 8.93 |
| + New Arm Neon i8mm GEMM Kernels | 14.60 | 8.92 |

We appreciate your feedback. Please let us know if you run into any issues.
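As noted above, the i8mm GEMM kernels require CPU support for the i8mm extension.
The snippet below is one way to check whether a connected Android device
advertises that feature from a host machine with `adb` installed; it simply
inspects the standard Linux CPU feature flags and is purely a diagnostic, not
something the runner requires.

```python
# Diagnostic only: check whether the connected Android device's CPU reports the
# "i8mm" feature flag (required for the SMMLA-based GEMM kernels). Assumes adb
# is installed on the host and a device is connected.
import subprocess

cpuinfo = subprocess.run(
    ["adb", "shell", "cat", "/proc/cpuinfo"],
    capture_output=True, text=True, check=True,
).stdout
print("i8mm supported:", "i8mm" in cpuinfo)
```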