# Utility tools for Llama enablement

## Stories110M model

If you want to deploy and run a smaller model for educational purposes, you can try the stories110M model. It has the same architecture as Llama but is smaller. It can also be used for fast iteration and verification during development.

### Export:

From `executorch` root:

1. Download `stories110M.pt` from Hugging Face and `tokenizer.model` from GitHub.
    ```
    wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
    wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
    ```
2. Create params file.
    ```
    echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
    ```
3. Export model and generate `.pte` file.
    ```
    python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
    ```
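
If you prefer to generate the params file from Python instead of `echo`, the following is a minimal sketch. The values are copied from step 2; the helper name `write_params` and the divisibility check are assumptions added for illustration (they are not part of ExecuTorch), based on the fact that multi-head attention splits `dim` evenly across heads.

```python
import json
from pathlib import Path

# Hyperparameters copied from the `echo` command in step 2.
STORIES110M_PARAMS = {
    "dim": 768,
    "multiple_of": 32,
    "n_heads": 12,
    "n_layers": 12,
    "norm_eps": 1e-05,
    "vocab_size": 32000,
}


def write_params(path="params.json", params=STORIES110M_PARAMS):
    """Hypothetical helper: sanity-check the params, then write them as JSON.

    Returns the per-head dimension (dim // n_heads).
    """
    # Assumption: each attention head gets an equal slice of `dim`.
    if params["dim"] % params["n_heads"] != 0:
        raise ValueError("dim must be divisible by n_heads")
    Path(path).write_text(json.dumps(params))
    return params["dim"] // params["n_heads"]


if __name__ == "__main__":
    head_dim = write_params()
    print(f"wrote params.json (head_dim={head_dim})")
```

The resulting `params.json` can be passed to `export_llama` with `-p params.json`, exactly as in step 3.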

## Smaller model delegated to other backends

We currently support lowering the stories model to other backends, including CoreML, MPS, and QNN. Please refer to the instructions for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower the model. After the backend library is installed, export a lowered model with one of the following commands:

- CoreML: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json`
- MPS: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json`
- QNN: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json`

The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it there.

For CoreML, there are 2 additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention). Note that the resulting model will then need at least iOS 18 to run.
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML.
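
When scripting exports for several backends, it can help to assemble the command line programmatically. The sketch below only rearranges the flags shown above; the helper `build_export_args` and its keyword names are hypothetical, and restricting the two CoreML options to the CoreML backend is an assumption drawn from the description above.

```python
import shlex

# Backend selector flags, as listed in the commands above.
BACKEND_FLAGS = {"coreml": "--coreml", "mps": "--mps", "qnn": "--qnn"}


def build_export_args(backend, checkpoint="stories110M.pt", params="params.json",
                      coreml_ios=None, coreml_quantize=None):
    """Hypothetical helper: return the export_llama argv for one backend."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    if backend != "coreml" and (coreml_ios is not None or coreml_quantize is not None):
        # Assumption: the --coreml-* options only make sense with --coreml.
        raise ValueError("CoreML options require backend='coreml'")

    args = [
        "python", "-m", "examples.models.llama.export_llama",
        "-kv", "--disable_dynamic_shape", BACKEND_FLAGS[backend],
        "-c", checkpoint, "-p", params,
    ]
    if coreml_ios is not None:
        args += ["--coreml-ios", str(coreml_ios)]
    if coreml_quantize is not None:
        args += ["--coreml-quantize", coreml_quantize]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command for each supported backend.
    for name in BACKEND_FLAGS:
        print(shlex.join(build_export_args(name)))
    print(shlex.join(build_export_args("coreml", coreml_ios=18, coreml_quantize="b4w")))
```

The returned list can be passed to `subprocess.run(args, check=True)` from the `executorch` root once the backend library is installed.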

To deploy the large 8B model on the above backends, [please visit this section](non_cpu_backends.md).

## Download models from Hugging Face and convert from safetensor format to state dict

You can also download the above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like the one below can be used to convert the Hugging Face safetensors format to a PyTorch state dict. It leverages the utilities provided by [TorchTune](https://github.com/pytorch/torchtune).

```python
# Note: newer TorchTune releases expose this class as
# torchtune.training.FullModelHFCheckpointer; adjust the import if needed.
from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights
import torch

# Convert from safetensors to TorchTune. Suppose the model has been downloaded from Hugging Face
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir='/home/.cache/huggingface/hub/models/snapshots/hash-number',
    checkpoint_files=['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors'],
    output_dir='/the/destination/dir',
    model_type='LLAMA3'  # or other types that TorchTune supports
)

print("loading checkpoint")
sd = checkpointer.load_checkpoint()

# Convert from TorchTune to Meta (PyTorch native)
sd = convert_weights.tune_to_meta(sd['model'])

print("saving checkpoint")
torch.save(sd, "/the/destination/dir/checkpoint.pth")
```
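
The script above hard-codes two shard names in `checkpoint_files`. For models with a different number of shards, one option is to collect the names from the snapshot directory. This is a minimal sketch using only the standard library; `list_safetensors_shards` is a hypothetical helper, and it assumes every `*.safetensors` file in the directory belongs to the model and that shard names are zero-padded so that lexicographic order matches shard order.

```python
from pathlib import Path


def list_safetensors_shards(checkpoint_dir):
    """Hypothetical helper: sorted file names of all *.safetensors shards."""
    names = sorted(p.name for p in Path(checkpoint_dir).glob("*.safetensors"))
    if not names:
        raise FileNotFoundError(f"no .safetensors files found in {checkpoint_dir}")
    return names
```

The result can then be passed as `checkpoint_files=list_safetensors_shards(checkpoint_dir)` when constructing the checkpointer.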

## Finetuning

If you want to finetune your model on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune), a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs.

Once you have [TorchTune installed](https://github.com/pytorch/torchtune?tab=readme-ov-file#get-started), you can finetune the Llama2 7B model using LoRA on a single GPU with the following command. This produces a checkpoint in which the LoRA weights are merged with the base model, so the output checkpoint is in the same format as the original Llama2 model.

```
tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```

To run full finetuning with Llama2 7B on a single device, you can use the following command.

```
tune run full_finetune_single_device \
--config llama2/7B_full_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```
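
The two commands differ only in the recipe and config names, so they can be generated from one place. The sketch below reuses exactly the recipe names, config names, and overrides shown above; `build_tune_args` and the `"lora"`/`"full"` mode labels are hypothetical names chosen for illustration, not part of TorchTune.

```python
# (recipe, config) pairs taken from the two commands above.
RECIPES = {
    "lora": ("lora_finetune_single_device", "llama2/7B_lora_single_device"),
    "full": ("full_finetune_single_device", "llama2/7B_full_single_device"),
}


def build_tune_args(mode, checkpoint_dir):
    """Hypothetical helper: return the `tune run` argv for the given mode."""
    if mode not in RECIPES:
        raise ValueError(f"unknown mode: {mode!r}")
    recipe, config = RECIPES[mode]
    ckpt = checkpoint_dir.rstrip("/")  # avoid a double slash in tokenizer.path
    return [
        "tune", "run", recipe,
        "--config", config,
        f"checkpointer.checkpoint_dir={ckpt}",
        f"tokenizer.path={ckpt}/tokenizer.model",
    ]
```

For example, `build_tune_args("lora", "<path_to_checkpoint_folder>")` yields the arguments of the first command as a list.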