# Utility tools for Llama enablement

## Stories110M model

If you want to deploy and run a smaller model for educational purposes, you can try the stories110M model. It has the same architecture as Llama, only smaller. It can also be used for fast iteration and verification during development.

### Export:

From `executorch` root:

1. Download `stories110M.pt` from Hugging Face and `tokenizer.model` from GitHub.
   ```
   wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
   wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
   ```
2. Create params file.
   ```
   echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
   ```
3. Export model and generate `.pte` file.
   ```
   python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
   ```

## Smaller model delegated to other backends

Currently we support lowering the stories model to other backends, including CoreML, MPS and QNN. Please refer to the instructions for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower the model. After the backend library is installed, use one of the following commands to export a lowered model:

- Lower to CoreML: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json`
- MPS: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json`
- QNN: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json`

The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it there.
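
The three export commands above differ only in the backend flag, so they can be generated from one template. The following is a minimal sketch, not part of ExecuTorch: `build_export_command` and `BACKEND_FLAGS` are hypothetical names introduced for illustration, and the flags are copied verbatim from the commands listed above.

```python
import shlex

# Backend flags exactly as they appear in the export commands above.
BACKEND_FLAGS = {"coreml": "--coreml", "mps": "--mps", "qnn": "--qnn"}


def build_export_command(backend, checkpoint="stories110M.pt", params="params.json"):
    """Return the export_llama argument list for one delegate backend.

    Hypothetical helper for illustration; it only assembles the command,
    it does not run the export.
    """
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unsupported backend: {backend!r}")
    return [
        "python", "-m", "examples.models.llama.export_llama",
        "-kv", "--disable_dynamic_shape", BACKEND_FLAGS[backend],
        "-c", checkpoint,
        "-p", params,
    ]


if __name__ == "__main__":
    # Print one ready-to-paste command per backend.
    for name in BACKEND_FLAGS:
        print(shlex.join(build_export_command(name)))
```

To actually run an export, the returned list could be passed to `subprocess.run(cmd, check=True)` from the `executorch` root, assuming the corresponding backend library is already installed.
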

For CoreML, there are 2 additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though).
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML.

To deploy the large 8B model on the above backends, [please visit this section](non_cpu_backends.md).

## Download models from Hugging Face and convert from safetensor format to state dict

You can also download the above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like the one below can be used to convert the Hugging Face safetensors format to a PyTorch state dict. It leverages the utilities provided by [TorchTune](https://github.com/pytorch/torchtune).


```python
# Note: recent TorchTune releases expose the checkpointer from `torchtune.training`;
# older releases expose it from `torchtune.utils`, as used here.
from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights
import torch

# Convert from safetensors to TorchTune. Suppose the model has been downloaded from Hugging Face
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir='/home/.cache/huggingface/hub/models/snapshots/hash-number',
    checkpoint_files=['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors'],
    output_dir='/the/destination/dir',
    model_type='LLAMA3'  # or other types that TorchTune supports
)

print("loading checkpoint")
sd = checkpointer.load_checkpoint()

# Convert from TorchTune to Meta (PyTorch native)
sd = convert_weights.tune_to_meta(sd['model'])

print("saving checkpoint")
torch.save(sd, "/the/destination/dir/checkpoint.pth")
```

## Finetuning

If you want to finetune your model based on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune), a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs.

Once you have [TorchTune installed](https://github.com/pytorch/torchtune?tab=readme-ov-file#get-started), you can finetune the Llama2 7B model using LoRA on a single GPU with the following command. This produces a checkpoint in which the LoRA weights are merged with the base model, so the output checkpoint is in the same format as the original Llama2 model.

```
tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```

To run full finetuning with Llama2 7B on a single device, you can use the following command.

```
tune run full_finetune_single_device \
--config llama2/7B_full_single_device \
checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```
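
Both finetuning commands share the same shape: a recipe name, a config name, and two `key=value` overrides that point at the checkpoint folder. As a rough sketch of that structure, the snippet below assembles either command from a single folder path. `build_tune_command` and `RECIPES` are hypothetical names introduced here, not part of TorchTune; the recipe and config names are copied from the commands above.

```python
from pathlib import Path

# Recipe and config names copied from the two commands above.
RECIPES = {
    "lora": ("lora_finetune_single_device", "llama2/7B_lora_single_device"),
    "full": ("full_finetune_single_device", "llama2/7B_full_single_device"),
}


def build_tune_command(mode, checkpoint_dir):
    """Return the `tune run` argument list for LoRA ("lora") or full ("full") finetuning.

    Hypothetical helper for illustration; it assumes tokenizer.model sits
    inside the checkpoint folder, as in the commands above.
    """
    recipe, config = RECIPES[mode]
    ckpt = Path(checkpoint_dir)
    return [
        "tune", "run", recipe,
        "--config", config,
        f"checkpointer.checkpoint_dir={ckpt}",
        f"tokenizer.path={ckpt / 'tokenizer.model'}",
    ]


if __name__ == "__main__":
    # Placeholder path; replace with your own checkpoint folder.
    print(" ".join(build_tune_command("lora", "/path/to/checkpoint_folder")))
```
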