# Utility tools for Llama enablement

## Stories110M model

If you want to deploy and run a smaller model for educational purposes, you can try the stories110M model. It has the same architecture as Llama, only smaller. It can also be used for fast iteration and verification during development.

### Export:

From the `executorch` root:

1. Download `stories110M.pt` from Hugging Face and `tokenizer.model` from GitHub.
    ```
    wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
    wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
    ```
2. Create the params file.
    ```
    echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
    ```
3. Export the model and generate the `.pte` file (`-X` delegates to XNNPACK, `-kv` enables the KV cache).
    ```
    python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
    ```

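If you prefer to generate the params file from a script rather than with `echo`, a minimal sketch could look like the following. The helper name `write_params` is hypothetical and not part of ExecuTorch. The divisibility check assumes the usual multi-head attention layout, where `dim` is split evenly across `n_heads`.

```python
import json
from pathlib import Path

# Values copied from the params.json step above.
STORIES110M_PARAMS = {
    "dim": 768,
    "multiple_of": 32,
    "n_heads": 12,
    "n_layers": 12,
    "norm_eps": 1e-05,
    "vocab_size": 32000,
}


def write_params(path, params=STORIES110M_PARAMS):
    """Write params as JSON and return the per-head dimension.

    Hypothetical helper for illustration; not an ExecuTorch API.
    """
    # Assumption: each attention head gets dim // n_heads channels.
    if params["dim"] % params["n_heads"] != 0:
        raise ValueError("dim must be divisible by n_heads")
    Path(path).write_text(json.dumps(params) + "\n")
    return params["dim"] // params["n_heads"]
```

Calling `write_params("params.json")` from the `executorch` root produces the same file content as the `echo` command, and the return value is a quick sanity check on the head size.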
## Smaller model delegated to other backends

We currently support lowering the stories model to other backends, including CoreML, MPS, and QNN. Please refer to the instructions for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower the model. After the backend library is installed, export a lowered model with the matching command:

- CoreML: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json`
- MPS: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json`
- QNN: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json`

The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it.

For CoreML, there are 2 additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on the optimizations available for it). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run)
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization

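When scripting exports for several backends, the commands and CoreML options above can be assembled programmatically. The sketch below uses only the flags listed in this section; the helper `build_export_command` is a hypothetical name, and rejecting the CoreML-only options for other backends is an assumption based on how the options are described here.

```python
import shlex

# Backend flags taken from the commands listed above.
BACKEND_FLAGS = {"coreml": "--coreml", "mps": "--mps", "qnn": "--qnn"}


def build_export_command(backend, checkpoint="stories110M.pt", params="params.json",
                         coreml_ios=None, coreml_quantize=None):
    """Return the export_llama argument list for one backend.

    Hypothetical helper for illustration; not an ExecuTorch API.
    """
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unsupported backend: {backend}")
    # Assumption: the two CoreML options are meaningless for MPS and QNN.
    if backend != "coreml" and (coreml_ios is not None or coreml_quantize is not None):
        raise ValueError("--coreml-ios and --coreml-quantize only apply to CoreML")
    cmd = [
        "python", "-m", "examples.models.llama.export_llama",
        "-kv", "--disable_dynamic_shape", BACKEND_FLAGS[backend],
        "-c", checkpoint, "-p", params,
    ]
    if coreml_ios is not None:
        cmd += ["--coreml-ios", str(coreml_ios)]
    if coreml_quantize is not None:
        cmd += ["--coreml-quantize", coreml_quantize]
    return cmd


# Print a copy-pasteable command line for a CoreML export targeting iOS 18.
print(shlex.join(build_export_command("coreml", coreml_ios=18, coreml_quantize="b4w")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the `executorch` root once the backend library is installed.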
To deploy the larger 8B model on the above backends, see [this section](non_cpu_backends.md).

## Download models from Hugging Face and convert from safetensors format to state dict

You can also download the above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like the one below can be used to convert the Hugging Face safetensors format to a PyTorch state dict. It relies on the utilities provided by [TorchTune](https://github.com/pytorch/torchtune).

```Python
# Note: newer TorchTune releases expose this class from torchtune.training instead.
from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights
import torch

# Convert from safetensors to TorchTune. This assumes the model has already
# been downloaded from Hugging Face.
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir='/home/.cache/huggingface/hub/models/snapshots/hash-number',
    checkpoint_files=['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors'],
    output_dir='/the/destination/dir',
    model_type='LLAMA3',  # or other types that TorchTune supports
)

print("loading checkpoint")
sd = checkpointer.load_checkpoint()

# Convert from TorchTune to Meta (PyTorch native)
sd = convert_weights.tune_to_meta(sd['model'])

print("saving checkpoint")
torch.save(sd, "/the/destination/dir/checkpoint.pth")
```

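The `checkpoint_files` list in the script above has to match the shards actually present in the snapshot directory. Rather than typing the names by hand, one option is to discover them, as in this sketch. The helper `find_safetensors_shards` is hypothetical, and it assumes every `.safetensors` file in the directory belongs to the model and that the shard names are zero-padded so that lexicographic order matches shard order.

```python
from pathlib import Path


def find_safetensors_shards(snapshot_dir):
    """Return the sorted file names of all .safetensors files in snapshot_dir.

    Hypothetical helper for illustration; not a TorchTune API.
    """
    names = sorted(p.name for p in Path(snapshot_dir).glob("*.safetensors"))
    if not names:
        raise FileNotFoundError(f"no .safetensors files in {snapshot_dir}")
    return names
```

The result can be passed as `checkpoint_files=find_safetensors_shards(checkpoint_dir)` when constructing the checkpointer.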
## Finetuning

If you want to finetune your model on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune), a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs.

Once you have [TorchTune installed](https://github.com/pytorch/torchtune?tab=readme-ov-file#get-started), you can finetune the Llama2 7B model using LoRA on a single GPU with the following command. This produces a checkpoint in which the LoRA weights are merged with the base model, so the output checkpoint is in the same format as the original Llama2 model.

```
tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```

To run full finetuning with Llama2 7B on a single device, you can use the following command.

```
tune run full_finetune_single_device \
--config llama2/7B_full_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```
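
The two commands differ only in the recipe and config names, so a script that launches either one can share the rest. The following is a minimal sketch; `build_tune_command` is a hypothetical helper, and it assumes `tokenizer.model` sits directly inside the checkpoint folder, as in the commands above.

```python
import shlex

# Recipe and config names taken from the two commands above.
RECIPES = {
    "lora": ("lora_finetune_single_device", "llama2/7B_lora_single_device"),
    "full": ("full_finetune_single_device", "llama2/7B_full_single_device"),
}


def build_tune_command(mode, checkpoint_dir):
    """Return the `tune run` argument list for LoRA or full finetuning.

    Hypothetical helper for illustration; not a TorchTune API.
    """
    if mode not in RECIPES:
        raise ValueError(f"mode must be one of {sorted(RECIPES)}")
    recipe, config = RECIPES[mode]
    return [
        "tune", "run", recipe,
        "--config", config,
        f"checkpointer.checkpoint_dir={checkpoint_dir}",
        # Assumption: the tokenizer lives next to the checkpoint files.
        f"tokenizer.path={checkpoint_dir}/tokenizer.model",
    ]


# Print the LoRA command for a placeholder checkpoint folder.
print(shlex.join(build_tune_command("lora", "/path/to/checkpoint_folder")))
```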