models/llava/README.md

*523fa7a6SAndroid Build Coastguard Worker## Summary
*523fa7a6SAndroid Build Coastguard WorkerLLaVA is the first multi-modal LLM ExecuTorch supports. In this directory, we
*523fa7a6SAndroid Build Coastguard Worker- Host a model definition for [LLavA](https://github.com/haotian-liu/LLaVA).
*523fa7a6SAndroid Build Coastguard Worker- Demonstrate how to export LLavA multimodal model to generate ExecuTorch .PTE file.
*523fa7a6SAndroid Build Coastguard Worker- Provide a C++ runner, Android/iOS Apps that loads the .pte file, the tokenizer and an image, then generate responses based on user prompt.
*523fa7a6SAndroid Build Coastguard Worker- Discuss optimizations went into enabling LlaVA on a phone, and early performance numbers
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerTokenizer, image encoder, and the pretrained text model, which is based on Meta
*523fa7a6SAndroid Build Coastguard Worker[Llama2-7b](https://llama.meta.com/llama2/), is loaded from Llava
*523fa7a6SAndroid Build Coastguard Workerhuggingface page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) .
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker<p align="center">
*523fa7a6SAndroid Build Coastguard Worker      <img src="./llava_via_xnnpack.gif" width=300>
*523fa7a6SAndroid Build Coastguard Worker      <br>
*523fa7a6SAndroid Build Coastguard Worker      <em>
*523fa7a6SAndroid Build Coastguard Worker      Running Llava1.5 7B on Android phone
*523fa7a6SAndroid Build Coastguard Worker      </em>
*523fa7a6SAndroid Build Coastguard Worker</p>
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker## What is LLaVA?
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker[LLaVA](https://llava-vl.github.io/) is a novel end-to-end trained large
*523fa7a6SAndroid Build Coastguard Workermultimodal model that combines a vision encoder and Vicuna (a LLama2 based text
*523fa7a6SAndroid Build Coastguard Workermodel) for general-purpose visual and language understanding, achieving
*523fa7a6SAndroid Build Coastguard Workerimpressive chat capabilities mimicking spirits of the cutting edge multimodal
*523fa7a6SAndroid Build Coastguard Workermodels and setting a high bar for accuracy on Science QA.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker## Instructions
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerFirst you need to generate a .PTE file for the model, along with input image,
*523fa7a6SAndroid Build Coastguard Workerand other artifacts. Then you need either a C++ runner, or Android or iOS
*523fa7a6SAndroid Build Coastguard Workerapplication to test things out on device.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker### Generate ExecuTorch .PTE and other artifacts
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerRun the following command to generate `llava.pte`, `tokenizer.bin` and an image
*523fa7a6SAndroid Build Coastguard Workertensor (serialized in TorchScript) `image.pt`.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerPrerequisite: run `install_requirements.sh` to install ExecuTorch and run
*523fa7a6SAndroid Build Coastguard Worker`examples/models/llava/install_requirements.sh` to install dependencies.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker```bash
*523fa7a6SAndroid Build Coastguard Workerpython -m executorch.examples.models.llava.export_llava --pte-name llava.pte --with-artifacts
*523fa7a6SAndroid Build Coastguard Worker```
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerCurrently the whole export process takes about 6 minutes. We also provide a
*523fa7a6SAndroid Build Coastguard Workersmall test utility to verify the correctness of the exported .pte file. Just run:
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker```bash
*523fa7a6SAndroid Build Coastguard Workerpython -m executorch.examples.models.llava.test.test_pte llava.pte
*523fa7a6SAndroid Build Coastguard Worker```
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker### Build C++ Runner
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerSee or run `.ci/scripts/test_llava.sh` shell script to build a C++ runner. This
*523fa7a6SAndroid Build Coastguard Workerscript also has preliminary support to build the C++ runner for Android.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerThis also has an image utility Python script to generate image in PyTorch
*523fa7a6SAndroid Build Coastguard Workerloadable format. Alternatively, we are working on generating image format which
*523fa7a6SAndroid Build Coastguard Workerdoesn't need PyTorch to load an image. Motivation for this is to build the C++
*523fa7a6SAndroid Build Coastguard Workerrunner on Android.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerThen you should be able to find `llava_main` binary:
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker```bash
*523fa7a6SAndroid Build Coastguard Workercmake-out/examples/models/llava/llava_main
*523fa7a6SAndroid Build Coastguard Worker```
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker### Build Mobile Apps
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker#### Android
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWe can run LLAVA using the LLAMA Demo Apps. Please refer to [this
*523fa7a6SAndroid Build Coastguard Workertutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo)
*523fa7a6SAndroid Build Coastguard Workerto for full instructions on building the Android LLAMA Demo App.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker#### iOS
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWe can run LLAVA using the LLAMA Demo Apps. Please refer to [this
*523fa7a6SAndroid Build Coastguard Workertutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA)
*523fa7a6SAndroid Build Coastguard Workerto for full instructions on building the iOS LLAMA Demo App.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker### Running LLaVA
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerRun:
*523fa7a6SAndroid Build Coastguard Worker```bash
*523fa7a6SAndroid Build Coastguard Workercmake-out/examples/models/llava/llava_main \
*523fa7a6SAndroid Build Coastguard Worker    --model_path=llava.pte                 \
*523fa7a6SAndroid Build Coastguard Worker    --tokenizer_path=tokenizer.bin         \
*523fa7a6SAndroid Build Coastguard Worker    --image_path=image.pt                  \
*523fa7a6SAndroid Build Coastguard Worker    --prompt="ASSISTANT:" \
*523fa7a6SAndroid Build Coastguard Worker    --seq_len=768                          \
*523fa7a6SAndroid Build Coastguard Worker    --temperature=0
*523fa7a6SAndroid Build Coastguard Worker```
*523fa7a6SAndroid Build Coastguard Worker(see --help for other options).
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerFor this example input used in this example,
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker![image](https://upload.wikimedia.org/wikipedia/commons/3/3e/Chicago_Bulls_-_New_Jersey_Nets_match_on_March_28%2C_1991.jpg)
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerYou should get a response like (tested on Arm CPUs with ET XNNPACK delegate):
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker```
*523fa7a6SAndroid Build Coastguard WorkerASSISTANT: image captures a basketball game in progress, with several players on the court. ...
*523fa7a6SAndroid Build Coastguard Worker```
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker## Optimizations and Results
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerSince LLaVA model needs at least 4-bit quantization to fit even within some of
*523fa7a6SAndroid Build Coastguard Workerthe high-end phones, results presented here correspond to 4-bit groupwise
*523fa7a6SAndroid Build Coastguard Workerpost-training quantized model.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerIn addition to that, work is mainly focused on using Arm CPUs and ET XNNPACK delegate.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker### Memory Footprint Reduction Techniques
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWith Llava, we needed to find a way to reduce the memory footprint in order to
*523fa7a6SAndroid Build Coastguard Workermake it feasible to run on edge devices. Out of the box, even with 4-bit
*523fa7a6SAndroid Build Coastguard Workerquantized weights, the memory footprint is around ~11 GiB, which is
*523fa7a6SAndroid Build Coastguard Workerprohibitively large even for high-end Android or iOS devices.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWe did several optimizations, which should be already enabled if you follow this
*523fa7a6SAndroid Build Coastguard Workertutorial, to get the memory footprint down to ~5 GiB, which unblocks us to run
*523fa7a6SAndroid Build Coastguard Workeron high-end devices.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker#### Sharing intermediate memory across delegates
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerSharing working memory across ET XNNPACK delegates helps reduce the peak memory
*523fa7a6SAndroid Build Coastguard Workerusage for LLMs with many DQLinears. We reduced it by 36.1% (from 10.44GiB to
*523fa7a6SAndroid Build Coastguard Worker6.67GiB) for Llava towards unblocking it to run on Phones.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker#### Reducing maximum sequence length
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerTo free up more memory, we examined non-constant memory usage, specifically
*523fa7a6SAndroid Build Coastguard Workerfocusing on intermediate tensors used throughout the model during inference.
*523fa7a6SAndroid Build Coastguard WorkerThe majority of these were found in the KV-cache allocations. Based on “minimum
*523fa7a6SAndroid Build Coastguard Workercan get away with” heuristic, we reduced max sequence length number to 768 from
*523fa7a6SAndroid Build Coastguard Workerprevious default 2048. This adjustment led to a further memory reduction of
*523fa7a6SAndroid Build Coastguard Workerapproximately 1.23 GiB (from 6.67 GiB to 5.44 GiB).
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker#### Quantizing embedding weights to 8b
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerBy quantizing the embedding layer to 8 bit, we were able to achieve an
*523fa7a6SAndroid Build Coastguard Workeradditional memory footprint reduction of approximately 300 MiB, bringing the
*523fa7a6SAndroid Build Coastguard Workertotal down to ~5 GiB.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker### Performance Optimizations
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker#### Decode performance
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerThis was already heavily optimized through KV-cache and GEMV kernel
*523fa7a6SAndroid Build Coastguard Workeroptimization efforts for LLama2/3.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker#### Encode performance
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWith image based large prompts, this was the focus of performance
*523fa7a6SAndroid Build Coastguard Workeroptimizations for LLaVA. We implemented two main optimizations to bring the decode or
*523fa7a6SAndroid Build Coastguard Workerprefill performance for the image down by more than 100% from the baseline.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker* **Two XNNPACK Partitioners**
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerFor text-only LLMs, our approach involved lowering only DQLinear ops
*523fa7a6SAndroid Build Coastguard Workerto XNNPACK and relying on ExecuTorch-optimized operators or custom ops
*523fa7a6SAndroid Build Coastguard Worker(utilizing Neon SIMD) to support multiplication, addition, and other
*523fa7a6SAndroid Build Coastguard Workeroperations. Lowering these operations to XNNPACK significantly improves Time to
*523fa7a6SAndroid Build Coastguard WorkerFirst Token (TTFT).
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker* **New Arm Neon I8mm GEMM kernels**
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWe introduced new kernels in XNNPACK for the quantization scheme used
*523fa7a6SAndroid Build Coastguard Workerhere, which upgrades our existing dot-prod based GEMM kernels to i8mm based
*523fa7a6SAndroid Build Coastguard WorkerGEMM kernels. The new kernel offers significantly improved performance by
*523fa7a6SAndroid Build Coastguard Workerleveraging the more efficient SMMLA instruction from Arm Neon. However, it's
*523fa7a6SAndroid Build Coastguard Workerworth noting that this instruction is only available on newer Arm CPUs.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker### Results
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerNote this is an active area of development in the ExecuTorch repository. You
*523fa7a6SAndroid Build Coastguard Workerwill need this PR [5380](https://github.com/pytorch/executorch/pull/5380) to
*523fa7a6SAndroid Build Coastguard Workersupply an image to the C++ runner on Android without Torch dependency. This
*523fa7a6SAndroid Build Coastguard Workershould be merged soon.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWith those caveats out of the way, here are some preliminary numbers (as average of
*523fa7a6SAndroid Build Coastguard Workerthree runs) for LLaVA using a C++ runner on Android OnePlus12 device with 12GiB
*523fa7a6SAndroid Build Coastguard Workermemory.
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard Worker| Experiment Setup  | Prefill time in seconds | Decode tokens/second |
*523fa7a6SAndroid Build Coastguard Worker| :------------- | -------------: | -------------: |
*523fa7a6SAndroid Build Coastguard Worker| Baseline  | 29.95  | 8.75 |
*523fa7a6SAndroid Build Coastguard Worker| + Two XNNPACK Partitioners  | 17.82  | 8.93 |
*523fa7a6SAndroid Build Coastguard Worker| + New Arm Neon i8mm GEMM Kernels  | 14.60 | 8.92 |
*523fa7a6SAndroid Build Coastguard Worker
*523fa7a6SAndroid Build Coastguard WorkerWe appreciate your feedback. Please let us know if you run into any issues.