xref: /aosp_15_r20/external/executorch/extension/llm/README.md (revision 523fa7a60841cd1ecfb9cc4201f1ca8b03ed023a)
This subtree contains libraries and utilities for running generative AI, including large language models (LLMs), using ExecuTorch.
Below is a list of subfolders.
## export
Model preparation code is in the _export_ folder. The main entry point is the _LLMEdgeManager_ class. It hosts a _torch.nn.Module_ and provides a set of methods that can be used to prepare the LLM model for the ExecuTorch runtime.
Note that ExecuTorch supports two [quantization APIs](https://pytorch.org/docs/stable/quantization.html#quantization-api-summary): eager mode quantization (aka source transform based quantization) and PyTorch 2 Export based quantization (aka pt2e quantization).

Commonly used methods in this class include:
- _set_output_dir_: set the directory where the exported .pte file will be saved.
- _to_dtype_: override the data type of the module.
- _source_transform_: execute a series of source transform passes. Some transform passes include:
  - weight-only quantization, which can be done at the source (eager mode) level.
  - replacing some torch operators with custom operators, for example _replace_sdpa_with_custom_op_.
- _torch.export_for_training_: get a graph that is ready for pt2e graph-based quantization.
- _pt2e_quantize_: quantize with the passed-in quantizers.
  - Utility functions in _quantizer_lib.py_ can help get different quantizers based on the needs.
- _export_to_edge_: export to the edge dialect.
- _to_backend_: lower the graph to an acceleration backend.
- _to_executorch_: get the ExecuTorch graph, with optional optimization passes.
- _save_to_pte_: finally, save the lowered and optimized graph into a .pte file for the runtime.
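The methods above chain into a staged pipeline that ends in a .pte file. Below is a minimal, self-contained sketch of that flow: the method names follow the list above, but the class body is a stand-in that only records stage order, not the real _LLMEdgeManager_ implementation.

```python
# Stand-in sketch of the staged export flow listed above. Method names mirror
# LLMEdgeManager; the bodies only record stage order and are NOT the real
# ExecuTorch implementation.
class EdgeManagerSketch:
    def __init__(self, module):
        self.module = module
        self.applied = []          # stages run so far, in order

    def _stage(self, name):
        self.applied.append(name)
        return self                # each stage returns self, so calls chain

    def set_output_dir(self, path):         return self._stage("set_output_dir")
    def to_dtype(self, dtype):              return self._stage("to_dtype")
    def source_transform(self, transforms): return self._stage("source_transform")
    def export_for_training(self):          return self._stage("export_for_training")
    def pt2e_quantize(self, quantizers):    return self._stage("pt2e_quantize")
    def export_to_edge(self):               return self._stage("export_to_edge")
    def to_backend(self, partitioners):     return self._stage("to_backend")
    def to_executorch(self):                return self._stage("to_executorch")

    def save_to_pte(self, name):
        self._stage("save_to_pte")
        return f"{name}.pte"       # the artifact the runtime loads

# Typical call chain, mirroring the order of the list above.
manager = EdgeManagerSketch(module=None)
pte_path = (manager.set_output_dir(".")
                   .to_dtype("fp32")
                   .source_transform([])
                   .export_for_training()
                   .pt2e_quantize([])
                   .export_to_edge()
                   .to_backend([])
                   .to_executorch()
                   .save_to_pte("llama"))
```

The chained, each-stage-returns-self shape is the point here: export, quantization, lowering, and serialization are separable steps applied in a fixed order.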

Example usage of _LLMEdgeManager_ can be found in executorch/examples/models/llama and executorch/examples/models/llava.

Once the .pte file is exported and saved, it can be loaded and run in a runner (see below).

## tokenizer
Currently, two types of tokenizers are supported: sentencepiece and Tiktoken.
- In Python:
  - _utils.py_: get the tokenizer from a model file path, based on the file format.
  - _tokenizer.py_: rewrite a sentencepiece tokenizer model into a serialization format that the runtime can load.
- In C++:
  - _tokenizer.h_: a simple tokenizer interface. Actual tokenizer classes can be implemented based on it. This folder provides two tokenizer implementations:
    - _bpe_tokenizer_. Note: the rewritten tokenizer artifact (see _tokenizer.py_ above) is required for the bpe tokenizer to work.
    - _tiktoken_. For llama3 and llama3.1.
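To illustrate the kind of format-based dispatch _utils.py_ performs, here is a hypothetical heuristic (not the actual _utils.py_ logic): Tiktoken .model files are plain-text lines of `<base64 token> <rank>`, while sentencepiece models are binary protobuf, so a loader can probe the file and pick a tokenizer accordingly.

```python
import base64

def detect_tokenizer_kind(path):
    # Hypothetical heuristic, not the actual utils.py logic: Tiktoken
    # .model files are text lines of "<base64 token> <rank>", while
    # sentencepiece models are binary protobuf blobs.
    try:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                token_b64, rank = line.split()
                base64.b64decode(token_b64)   # raises if not valid base64
                int(rank)                     # raises if not an integer rank
        return "tiktoken"
    except (UnicodeDecodeError, ValueError):
        # Binary content or malformed lines: assume sentencepiece.
        return "sentencepiece"
```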

## sampler
A C++ sampler class that samples from the logits given some hyperparameters.
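To make the sampler's job concrete, here is a pure-Python sketch of temperature plus top-p (nucleus) sampling over a logits vector. The function name and hyperparameters are illustrative, not the C++ class's API.

```python
import math
import random

def sample(logits, temperature=0.8, top_p=0.9, rng=None):
    """Illustrative temperature + top-p sampling over a list of logits."""
    # temperature == 0 degenerates to greedy argmax.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits (max-subtracted for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest set of highest-probability
    # tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Draw from the renormalized kept set.
    rng = rng or random.Random(0)
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```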

## custom_ops
Contains custom operators, such as:
- custom sdpa: implements CPU flash attention and avoids copies by taking the kv cache as one of its arguments.
  - _sdpa_with_kv_cache.py_, _op_sdpa_aot.cpp_: custom op definition in PyTorch with C++ registration.
  - _op_sdpa.cpp_: the optimized operator implementation and registration of _sdpa_with_kv_cache.out_.
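The key idea of the custom sdpa op, writing new key/value rows into a preallocated cache in place and attending over the cached prefix, can be sketched in pure Python (single head, no batching; this illustrates the concept only and is not the _op_sdpa.cpp_ implementation):

```python
import math

def sdpa_with_kv_cache_sketch(q, k_new, v_new, k_cache, v_cache, pos):
    # Write the new key/value rows into the preallocated caches at `pos`,
    # in place -- passing the cache into the op is what lets the real
    # custom op avoid a functional-update copy.
    k_cache[pos] = k_new
    v_cache[pos] = v_new
    d = len(q)
    # Scaled dot-product scores against every cached position up to `pos`.
    scores = [sum(qi * ki for qi, ki in zip(q, k_cache[t])) / math.sqrt(d)
              for t in range(pos + 1)]
    # Softmax the scores (max-subtracted for numerical stability).
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    # Weighted sum of cached values is the attention output.
    return [sum(w[t] * v_cache[t][j] for t in range(pos + 1))
            for j in range(len(v_new))]
```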

## runner
Hosts the library components used in a C++ LLM runner. Currently it hosts _stats.h_, which tracks runtime statistics such as token counts and latency.

With the components above, an actual runner can be built for a model or a series of models. An example is in //executorch/examples/models/llama/runner, where a C++ runner is built to run Llama 2, 3, 3.1, and other models that share the same architecture.
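A typical runner follows a prefill-then-decode loop: run the whole prompt once, then generate one token at a time, feeding back only the new token while the model's kv cache carries the context. The sketch below shows that loop shape with greedy decoding and stub model/tokenizer callables; all names are illustrative (the real runner is C++).

```python
def generate(model, encode, decode, prompt, max_new_tokens, eos_id):
    # Prefill: run the whole prompt once (populating the model's kv cache).
    tokens = encode(prompt)
    logits = model(tokens)                 # prefill step
    new_tokens = []
    for _ in range(max_new_tokens):
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy
        if next_id == eos_id:
            break
        new_tokens.append(next_id)
        logits = model([next_id])          # decode step: one token at a time
    return decode(new_tokens)

# Stub model: deterministically predicts (last token + 1) mod vocab size,
# so generation from "0" walks 1, 2, 3 and stops at the eos id 4.
def stub_model(tokens, vocab=5):
    nxt = (tokens[-1] + 1) % vocab
    return [1.0 if i == nxt else 0.0 for i in range(vocab)]

text = generate(stub_model,
                lambda s: [int(c) for c in s],
                lambda ids: "".join(str(i) for i in ids),
                "0", max_new_tokens=10, eos_id=4)
```

In a real runner the greedy `max` is replaced by the sampler, and the per-token latency of the decode loop is what _stats.h_ measures.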

Usage can also be found in the [torchchat repo](https://github.com/pytorch/torchchat/tree/main/runner).