# Summary

## Overview
This file provides instructions for running LLAMA2 and LLAMA3 with different parameters via the Qualcomm HTP backend. The following settings support Llama-2-7b-chat-hf and Llama-3-8b-chat-hf.

Please check the corresponding section for more information.

## Llama-2-7b-chat-hf
This example demonstrates how to run Llama-2-7b-chat-hf on mobile via the Qualcomm HTP backend. The model was precompiled into context binaries by [Qualcomm AI HUB](https://aihub.qualcomm.com/).
Note that the pre-compiled context binaries cannot be further fine-tuned for other downstream tasks.

### Instructions
#### Step 1: Setup
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
2. Follow the [tutorial](https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html) to build the Qualcomm AI Engine Direct Backend.

#### Step 2: Prepare Model
1. Create an account at https://aihub.qualcomm.com/
2. Follow the instructions at https://huggingface.co/qualcomm/Llama-v2-7B-Chat to export context binaries (this will take some time to finish).

```bash
# tokenizer.model: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/tokenizer.model
# tokenizer.bin:
python -m examples.models.llama.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```

#### Step 3: Run default examples
```bash
# AIHUB_CONTEXT_BINARIES: ${PATH_TO_AIHUB_WORKSPACE}/build/llama_v2_7b_chat_quantized
python examples/qualcomm/qaihub_scripts/llama/llama2/qaihub_llama2_7b.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --context_binaries ${AIHUB_CONTEXT_BINARIES} --tokenizer_bin tokenizer.bin --prompt "What is Python?"
```

## Llama-3-8b-chat-hf
This example demonstrates how to run Llama-3-8b-chat-hf on mobile via the Qualcomm HTP backend. The model was precompiled into context binaries by [Qualcomm AI HUB](https://aihub.qualcomm.com/).
Note that the pre-compiled context binaries cannot be further fine-tuned for other downstream tasks. This example script has been tested on a 16GB RAM device and verified to work.

### Instructions
#### Step 1: Setup
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
2. Follow the [tutorial](https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html) to build the Qualcomm AI Engine Direct Backend.

#### Step 2: Prepare Model
1. Create an account at https://aihub.qualcomm.com/
2. Follow the instructions at https://huggingface.co/qualcomm/Llama-v3-8B-Chat to export context binaries (this will take some time to finish).
3. For the Llama 3 tokenizer, please refer to https://github.com/meta-llama/llama-models/blob/main/README.md for further instructions on how to download tokenizer.model.

#### Step 3: Run default examples
```bash
# AIHUB_CONTEXT_BINARIES: ${PATH_TO_AIHUB_WORKSPACE}/build/llama_v3_8b_chat_quantized
python examples/qualcomm/qaihub_scripts/llama/llama3/qaihub_llama3_8b.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --context_binaries ${AIHUB_CONTEXT_BINARIES} --tokenizer_model tokenizer.model --prompt "What is baseball?"
```
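Exporting context binaries on AI Hub can take a long time, so it is worth confirming that the artifacts the run commands expect are actually present on the host before launching. The sketch below is a hypothetical pre-flight helper (not part of the repository's scripts); it checks the tokenizer files and the context-binary directory named in the examples above, and assumes `PATH_TO_AIHUB_WORKSPACE` is set in your environment.

```shell
# Hypothetical pre-flight helper (not part of the original scripts): report
# whether each artifact the example commands expect exists on the host.
check() {
  if [ -e "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Artifacts referenced by the example commands above.
check tokenizer.bin                                                  # Llama 2 tokenizer
check tokenizer.model                                                # Llama 3 tokenizer
check "${PATH_TO_AIHUB_WORKSPACE}/build/llama_v2_7b_chat_quantized"  # context binaries
```

Running the helper before either `qaihub_llama2_7b.py` or `qaihub_llama3_8b.py` turns a missing-file failure partway through a run into an immediate, readable report.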