# TorchScript serialization

This document explains the TorchScript serialization format, and the anatomy
of a call to `torch::jit::save()` or `torch::jit::load()`.

<!-- toc -->

- [Overview](#overview)
  - [Design Notes](#design-notes)
- [`code/`: How code is serialized](#code-how-code-is-serialized)
  - [Printing code objects as Python source](#printing-code-objects-as-python-source)
  - [Placing the source code in the archive](#placing-the-source-code-in-the-archive)
- [How data is serialized](#how-data-is-serialized)
  - [`data.pkl`: How module object state is serialized](#datapkl-how-module-object-state-is-serialized)
  - [`data/`: How tensors are serialized](#data-how-tensors-are-serialized)
- [`constants.pkl`: Constants in code](#constantspkl-constants-in-code)
- [`torch::jit::load()`](#torchjitload)
- [`__getstate__` and `__setstate__`](#__getstate__-and-__setstate__)
- [Appendix: `CompilationUnit` and code object ownership](#appendix-compilationunit-and-code-object-ownership)
  - [`CompilationUnit` ownership semantics](#compilationunit-ownership-semantics)
  - [Code object naming](#code-object-naming)

<!-- tocstop -->

## Overview

A serialized model (call it `model.pt`) is a ZIP archive containing many
files. If you want to manually crack it open, you can call `unzip` on it to
inspect the file structure directly:

```
$ unzip model.pt
Archive:  model.pt
  extracting ...

$ tree model/
├── code/
│   ├── __torch__.py
│   ├── __torch__.py.debug_pkl
│   ├── foo/
│   │   ├── bar.py
│   │   ├── bar.py.debug_pkl
├── data.pkl
├── constants.pkl
└── data/
    ├── 0
    └── 1
```

You'll notice that there are `.py` and `.pkl` files in this archive. That's
because our serialization format tries to mimic Python's. All "code-like"
information (methods, modules, classes, functions) is stored as
human-readable `.py` files containing valid Python syntax, and all "data-like"
information (attributes, objects, etc.) is pickled using a subset of
Python's pickle protocol.

A model is really a top-level module with some submodules, parameters, and so
on, depending on what the author needs. So `data.pkl` contains the pickled
top-level module. Deserializing the model is as simple as calling
`unpickle()` on `data.pkl`, which will restore the module object state and
load its associated code on demand.
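
The archive layout above can be poked at with nothing but the Python standard library. As a sketch, the following builds a toy archive with the same shape (the entry contents are placeholders; a real `model.pt` is produced only by `torch.jit.save()`):

```python
import io
import zipfile

# Build a toy ZIP with the same layout as a serialized model. The entry
# contents here are placeholders, not real serialized data.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("model/code/__torch__.py", "class MyModule(Module): ...\n")
    zf.writestr("model/data.pkl", b"")         # pickled top-level module state
    zf.writestr("model/constants.pkl", b"")    # pickled tensor constants
    zf.writestr("model/data/0", b"\x00" * 16)  # raw tensor storage

# Inspecting an archive is just listing the ZIP entries.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
print("\n".join(names))
```

The same listing works on a real model file: open it with `zipfile.ZipFile("model.pt")` and call `namelist()`.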

### Design Notes

Some things to keep in mind while working on the serialization code. These
may help you decide which approach to take when making a change.

**Do what Python does**. When it comes to the serialized format, it's much
simpler in the long run to be consistent with whatever Python does. A good
rule of thumb is: if I tried to interact with serialized artifacts using
Python, would it work? That is, all serialized code should be valid Python, and
all pickled objects should be unpicklable by Python.

Being consistent with Python means our format is more debuggable (you can
always crack it open and poke at it from Python) and leads to fewer surprises
for developers familiar with Python but not familiar with TorchScript.

**Human readable**. In addition to being valid Python, serialized code should
attempt to be readable Python. We should try to preserve the variable names
that authors wrote, appropriately inline short expressions, and so on. This
helps with debugging the serialized code.

**No jitter**. If we do:

```
m = MyModule()
m.save("foo.pt")
m_loaded = torch.jit.load("foo.pt")
m_loaded.save("foo2.pt")
m_loaded2 = torch.jit.load("foo2.pt")
```

We want the property that `m_loaded` and `m_loaded2` are identical. This
"no-jitter" property is useful for catching bugs in the serialization process,
and is generally desirable for debugging (models won't drift depending on how
many times you save/load them).

**Initial load should be fast**. Calling `load()` should be effectively
instantaneous to a human. Anything that takes a long time (reading in tensor
data, for example) should be done lazily.

## `code/`: How code is serialized

At a high level, code serialization means:

1. Transforming `ClassType`s and `Function`s (called "code objects") into Python source code.
2. Placing the source code in the model ZIP archive.

### Printing code objects as Python source
`PythonPrint` is the function that takes as input a `ClassType` or `Function`
("code object") and outputs Python source code. `ScriptModule`s are
implemented as class types, so their methods and attributes will get
serialized as well.

`PythonPrint` works by walking a `Graph` (the IR representation of either a
`ClassType`'s method or a raw `Function`) and emitting Python code that
corresponds to it. The rules for emitting Python code are mostly
straightforward and uninteresting. There are some extra pieces of information
that `PythonPrint` tracks, however:

**Class dependencies**. While walking the graph, `PythonPrint` keeps track of
what classes are used in the graph and adds them to a list of classes that
the current code object depends on. For example, if we are printing a
`Module`, it will depend on its submodules, as well as any classes used in
its methods or attributes.

**Uses of tensor constants**. Most constants are inlined as literals, like
strings or ints. But since tensors are potentially very large, when
`PythonPrint` encounters a constant tensor it will emit a reference to a
global `CONSTANTS` table (like `foo = CONSTANTS.c0`).

When importing, the importer will know how to resolve this reference into an
actual tensor by looking it up in the tensor table. So `CONSTANTS.c0` means
"this is the `0th` tensor in the tensor tuple in `constants.pkl`." See
[the constants section](#constantspkl-constants-in-code) for more info.
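
In Python terms, that resolution step amounts to a tuple index. A minimal sketch (the `constants` tuple and `resolve_constant` helper are made up for illustration; real entries are tensors, not strings):

```python
# Stand-in for the tuple unpickled from constants.pkl. Real entries are
# tensors; strings are used here only so the sketch is self-contained.
constants = ("<tensor 0>", "<tensor 1>")

def resolve_constant(ref: str) -> str:
    # A reference like "CONSTANTS.c0" names the 0th entry of the tuple.
    assert ref.startswith("CONSTANTS.c")
    return constants[int(ref[len("CONSTANTS.c"):])]

print(resolve_constant("CONSTANTS.c0"))  # → <tensor 0>
```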

**Original source range records**. To aid debugging, `PythonPrint` remembers
the "original" (user-written) location of the source code it's emitting. That
way, when the user is debugging a model they loaded, they will see
diagnostics that point to the code that they actually wrote, rather than the
code that `PythonPrint` emitted.

The original source range records are pickled and saved in a corresponding
`.debug_pkl` file with the same name as the code. You can think of this
`.debug_pkl` file as a map between source ranges in the serialized code and
the original user-written code.

**Module information**. Modules are special in a few ways. First are
`Parameter`s: some module attributes are actually `Parameter`s, which have
special properties (see [the `torch.nn`
documentation](https://pytorch.org/docs/stable/nn.html#parameters) for exact
details). We track which attributes are parameters by emitting a special
assignment in the class body, like:

```
class MyModule(Module):
    __parameters__ = ["foo", "bar", ]
    foo : Tensor
    bar : Tensor
    attribute_but_not_param : Tensor
```

Another special thing about modules is that they are typically constructed in
Python, and we do not compile the `__init__()` method. So in order to ensure
they are statically typed, `PythonPrint` must enumerate a module's attributes
(as you can see above), because it can't rely on compiling `__init__()` to
infer the attributes.

A final special thing is that some modules (like `nn.Sequential`) have
attributes that are not valid Python identifiers. We can't write

```
# wrong!
class MyModule(Module):
    0 : ASubmodule
    1 : BSubmodule
```

because this is not valid Python syntax (even though it is legal in Python to
have attributes with those names!). So we use a trick where we write directly
to the `__annotations__` dict:

```
class MyModule(Module):
    __annotations__ = {}
    __annotations__["0"] = ASubmodule
    __annotations__["1"] = BSubmodule
```

### Placing the source code in the archive

Once all code objects have been `PythonPrint`ed into source strings, we have
to figure out where to actually put this source. Explaining this necessitates
an introduction to `CompilationUnit` and `QualifiedName`. See the appendix on
`CompilationUnit` for more info.

**`CompilationUnit`**: this is the owning container for all code objects
associated with a given model. When we load, we load all the code objects into
a single `CompilationUnit`.

**`QualifiedName`**: this is the fully qualified name for a code object. It is
similar to qualified names in Python, and looks like `"foo.bar.baz"`. Each
code object has a *unique* `QualifiedName` within a `CompilationUnit`.

The exporter uses the `QualifiedName` of a code object to determine its
location in the `code/` folder. The way it does so is similar to how Python
does it; for example, the class `Baz` with a `QualifiedName` `"foo.bar.Baz"`
will be placed in `code/foo/bar.py` under the name `Baz`.

Classes at the root of the hierarchy are given the qualified name `__torch__`
as a prefix, just so that they can go in `__torch__.py`. (Why not `__main__`?
Because pickle has weird special rules about things that live in `__main__`.)
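
The name-to-path rule above can be sketched as a small function (a hypothetical helper for illustration, not the exporter's actual code):

```python
def source_path_for(qual_name: str) -> tuple:
    """Map a qualified name to (path under the archive, unqualified name).

    Hypothetical helper illustrating the layout rule described above:
    the namespace becomes the directory/file path, the last component
    is the name within that file.
    """
    namespace, _, unqual = qual_name.rpartition(".")
    return "code/" + namespace.replace(".", "/") + ".py", unqual

print(source_path_for("foo.bar.Baz"))         # → ('code/foo/bar.py', 'Baz')
print(source_path_for("__torch__.MyModule"))  # → ('code/__torch__.py', 'MyModule')
```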

That's about it; there's some additional logic to make sure that within a
file, we place the classes in reverse-dependency order so that we compile the
"leaf" dependencies before things that depend on them.
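
That ordering amounts to a depth-first post-order traversal of the class dependency graph. A minimal sketch under that assumption (the `deps` mapping and function name are illustrative, not the exporter's actual implementation):

```python
def reverse_dependency_order(deps: dict) -> list:
    """Emit each class after everything it depends on (post-order DFS).

    `deps` maps a class name to the classes it uses; illustrative only.
    Assumes the dependency graph is acyclic, as class dependencies are.
    """
    out, seen = [], set()

    def visit(cls):
        if cls in seen:
            return
        seen.add(cls)
        for dep in deps.get(cls, []):
            visit(dep)
        out.append(cls)  # all dependencies have already been emitted

    for cls in deps:
        visit(cls)
    return out

# Leaf dependencies come first, so they compile before their users.
print(reverse_dependency_order({"MyModule": ["Inner"], "Inner": []}))
# → ['Inner', 'MyModule']
```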

## How data is serialized

A model is really a top-level `ScriptModule` with any number of submodules,
parameters, attributes, and so on. We implement a subset of the Pickle format
necessary for pickling a module object.

`pickle`'s format was chosen due to:

* **user friendliness** - the attributes file can be loaded in Python with `pickle`
* **size limits** - formats such as Protobuf impose size limits on total
  message size, whereas pickle's limits apply to individual values (e.g. strings
  cannot be longer than 4 GB)
* **standard format** - `pickle` is a standard Python module with a reasonably
  simple format. The format is a program to be consumed by a stack machine that
  is detailed in Python's
  [`pickletools.py`](https://svn.python.org/projects/python/trunk/Lib/pickletools.py)
* **built-in memoization** - for shared reference types (e.g. Tensor, string,
  lists, dicts)
* **self describing** - a separate definition file is not needed to understand
  the pickled data
* **eager mode save** - `torch.save()` already produces a `pickle` archive, so
  doing the same with attributes avoids introducing yet another format
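
The "program for a stack machine" framing is easy to see with the standard library (the dict below is an arbitrary example object, not real module state):

```python
import pickle
import pickletools

# Pickle an arbitrary example object and disassemble the resulting "program".
payload = pickle.dumps({"name": "fc.weight", "shape": [10, 10]}, protocol=2)
pickletools.dis(payload)  # prints one stack-machine opcode per line

# Round-trip: executing the program rebuilds the original object.
print(pickle.loads(payload))
```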

### `data.pkl`: How module object state is serialized

All data is written into the `data.pkl` file, with the exception of tensors
(see [the tensor section](#data-how-tensors-are-serialized) below).
"Data" means all parts of the module object state, like attributes,
submodules, etc.

PyTorch functions defined in [torch/jit/_pickle.py](../../../jit/_pickle.py)
are used to mark special data types, such as the tensor table index or
specialized lists.

### `data/`: How tensors are serialized

During export, a list of all the tensors in a model is created. Tensors can
come from either module parameters or attributes of Tensor type.

Tensors are treated differently from other data (which is pickled using the
standard pickling process) for a few reasons:

- Tensors regularly exceed the `pickle` file size limit.
- We'd like to be able to `mmap` Tensors directly.
- We'd like to maintain compatibility with regular PyTorch's serialization
  format.

## `constants.pkl`: Constants in code

The `pickle` format enforces a separation between data and code, which the
TorchScript serialization process represents by having `code/` and
`data.pkl + tensors/`.

However, TorchScript inlines constants (i.e. `prim::Constant` nodes) directly
into `code/`. This poses a problem for tensor constants, which are not easily
representable in string form.

We can't put tensor constants in `data.pkl`, because the source code must be
loaded *before* `data.pkl`, and so putting the tensor constants there would
create a cyclic loading dependency.

We solve this problem by creating a separate `pickle` file called
`constants.pkl`, which holds all tensor constants referenced in code. The
load order will be explained in the next section.

## `torch::jit::load()`

The load process has the following steps:

1. Unpickle `constants.pkl`, which produces a tuple of all tensor constants
   referenced in code.
2. Unpickle `data.pkl` into the top-level `Module` and return it.

The unpickling process consists of a single call to unpickle the module
object contained in `data.pkl`. The `Unpickler` is given a callback that lets it
resolve any qualified names it encounters into `ClassType`s. This is done by
resolving the qualified name to the appropriate file in `code/`, then
compiling that file and returning the appropriate `ClassType`.
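
The callback idea can be mimicked with Python's own `pickle` by overriding `Unpickler.find_class`. This is a sketch only: the real `torch::jit::Unpickler` compiles the right file under `code/` and returns a `ClassType`, whereas here qualified names are simply looked up in a registry.

```python
import io
import pickle

class Point:
    """Stand-in for a serialized class; real archives store ClassTypes."""
    def __init__(self, x, y):
        self.x, self.y = x, y

class ResolvingUnpickler(pickle.Unpickler):
    """Resolves qualified names through a registry instead of importing."""

    # Illustrative stand-in for the CompilationUnit's namespace.
    registry = {f"{Point.__module__}.{Point.__qualname__}": Point}

    def find_class(self, module, name):
        # The lookup must be deterministic -- this is why qualified names
        # have to be unique within the namespace.
        return self.registry[f"{module}.{name}"]

payload = pickle.dumps(Point(1, 2))
restored = ResolvingUnpickler(io.BytesIO(payload)).load()
print(restored.x, restored.y)  # → 1 2
```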

This is why it's important to give code objects unique qualified names in the
`CompilationUnit`. That way, every class that `Unpickler` encounters has a
deterministic location in `code/` where it is stored.

`Unpickler` is also responsible for resolving references to tensors into
actual `at::Tensor`s. This is done by looking up offsets in the tensor table
during the unpickling process (soon to be replaced with the same pickling
strategy as all other data).

## `__getstate__` and `__setstate__`

Like in Python's `pickle`, users can customize the pickling behavior of their
class or module by implementing `__getstate__()` and `__setstate__()`
methods. For basic usage, refer to the relevant [Python
docs](https://docs.python.org/3.7/library/pickle.html#pickle-state).

Calls to `__getstate__` and `__setstate__` are handled transparently by
`Pickler` and `Unpickler`, so the rest of the serialization process doesn't
need to worry about them.

One thing worth calling out is that the compiler implements a few special
type inference behaviors to work around the fact that users currently cannot
type annotate `Module`s.

`__getstate__` and `__setstate__` do not require type annotations. For
`__getstate__`, the compiler can fully infer the return type based on what
attributes the user is returning. Then, `__setstate__` simply looks up the
return type of `__getstate__` and uses that as its input type.

For example:

```
class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.a = torch.rand(2, 3)
        self.b = torch.nn.Linear(10, 10)

    def __getstate__(self):
        # Compiler infers that this is a tuple of (Tensor, Linear)
        return (self.a, self.b)

    def __setstate__(self, state):
        # We don't need to annotate this; we know what type `state` is!
        self.a = state[0]
        self.b = state[1]
```

## Appendix: `CompilationUnit` and code object ownership
`CompilationUnit` performs two functions:

1. It is the owner (in a C++ sense) of all code objects.
2. It forms a namespace in which code objects must have unique names.

A `CompilationUnit` is created whenever `torch::jit::load()` is invoked, to
place the newly deserialized code objects in. In Python, there is a single
global `CompilationUnit` that holds all code objects defined in Python.

### `CompilationUnit` ownership semantics
There are a few different entities that participate in the ownership model:

**`CompilationUnit`**: A container that owns code objects and gives them names.
Every code object has a unique qualified name within the `CompilationUnit`.

There are two kinds of code objects: `Function`s and `ClassType`s.

**`Function`**: A `Graph` with an associated executor. The `Graph` may own
`ClassType`s, since some `Value`s hold a `shared_ptr` to their type (for
now). The `Graph` may also weakly reference other `Function`s through
function calls.

**`ClassType`**: A definition of a type. This could refer to a user-defined
TorchScript class or a `ScriptModule`. Owns its attribute types
(including other `ClassType`s). Weakly references the class's methods
(`Function`s).

**`Object`**: An instance of a particular class. Owns the `CompilationUnit`
that owns its `ClassType`. This is to ensure that if the user passes the
object around in C++, all its code will stay around and its methods will be
invokable.

**`Module`**: A view over a `ClassType` and the `Object` that holds its state.
Also responsible for turning unqualified names (e.g. `forward()`) into
qualified ones for lookup in the owning `CompilationUnit` (e.g.
`__torch__.MyModule.forward`). Owns the `Object`, which transitively owns the
`CompilationUnit`.

**`Method`**: A tuple of `(Module, Function)`.

### Code object naming

`CompilationUnit` maintains a namespace in which all code objects
(`ClassType`s and `Function`s) are uniquely named. These names don't have any
particular meaning, except that they uniquely identify a code object during
serialization and deserialization. The basic naming scheme is:

* Everything starts in the `__torch__` namespace.
* Classes are named parallel to Python's module namespacing: so class `Bar` in
  `foo.py` would become `__torch__.foo.Bar`.
* Methods are attached to the module's namespace. So `Bar.forward()` would be
  `__torch__.foo.Bar.forward`.

There are some caveats:

**Some `CompilationUnit`s have no prefix**: For testing and other internal
purposes, occasionally it's useful to have no prefixes on names. In this
case, everything is just a bare name inside the `CompilationUnit`. Users
cannot construct `CompilationUnit`s that look like this.

**Name mangling**: In Python, we can construct code objects that have the same
qualified name. There are two cases where this happens:

1. For `ScriptModule`s, since every `ScriptModule` is a singleton class in
the JIT, a user that is constructing multiple `ScriptModule`s will create
multiple corresponding `ClassType`s with identical names.
2. Nesting functions will also cause qualified name clashes, due to
limitations in Python. In these cases, we mangle the names of the code
objects before they are placed in the global Python `CompilationUnit`.

The rules for mangling are simple. Say we have a qualified name `__torch__.foo.Bar`:

```
__torch__.foo.Bar                    # first time, unchanged
__torch__.foo.__torch_mangle_0.Bar   # second time, when we request a mangle
__torch__.foo.__torch_mangle_1.Bar   # and so on
```

Notice that we mangle the namespace before `Bar`. This is so that when we
pretty-print code, the unqualified name (`Bar`) is unchanged. This is a
useful property, so that things like trace-checking are oblivious to the
mangling.
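
The mangling rule above can be sketched in a few lines (a hypothetical helper, not the JIT's actual implementation, which also tracks the counter per namespace):

```python
def mangle(qual_name: str, counter: int) -> str:
    """Insert a __torch_mangle_N namespace just before the unqualified name."""
    namespace, _, unqual = qual_name.rpartition(".")
    return f"{namespace}.__torch_mangle_{counter}.{unqual}"

print(mangle("__torch__.foo.Bar", 0))  # → __torch__.foo.__torch_mangle_0.Bar

# The unqualified name is untouched, so pretty-printed code still says `Bar`.
assert mangle("__torch__.foo.Bar", 1).endswith(".Bar")
```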
426