# Instruction count microbenchmarks
## Quick start

### To run the benchmark:

```
# From pytorch root
cd benchmarks/instruction_counts
python main.py
```

Currently `main.py` contains a very simple threadpool (so that run time isn't
unbearably onerous) and simply prints the results. These components will be
upgraded in subsequent PRs.

### To define a new benchmark:
* `TimerArgs`: Low level definition which maps directly to
`torch.utils.benchmark.Timer`. (See the sketch below.)
* `GroupedStmts`: Benchmark a snippet (Python, C++, or both). Can automatically
generate TorchScript and autograd variants.
* `GroupedModules`: Like `GroupedStmts`, but takes `nn.Module`s.
* `GroupedVariants`: Benchmark-per-line to define many related benchmarks in a
single code block.
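
For reference, the lowest level API might be used roughly as follows. This is
only a sketch: the exact fields of `TimerArgs` live in `core/api.py` and may
differ from what is shown here; they are assumed to mirror the
`torch.utils.benchmark.Timer` constructor.

```
# Hypothetical `TimerArgs` usage; field names are assumed, see `core/api.py`.
timer_args = TimerArgs(
    stmt="y = x + x",
    setup="x = torch.ones((4, 4))",
    num_threads=1,
)
```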

## Architecture
### Benchmark definition.

One primary goal of this suite is to make it easy to define semantically
related clusters of benchmarks. The crux of this effort is the
`GroupedBenchmark` class, which is defined in `core/api.py`. It takes a
definition for a set of related benchmarks, and produces one or more concrete
cases. It's helpful to see an example to understand how the machinery works.
Consider the following benchmark:

```
# `GroupedStmts` is an alias of `GroupedBenchmark.init_from_stmts`
benchmark = GroupedStmts(
    py_stmt=r"y = x * w",
    cpp_stmt=r"auto y = x * w;",

    setup=GroupedSetup(
        py_setup="""
            x = torch.ones((4, 4))
            w = torch.ones((4, 4), requires_grad=True)
        """,
        cpp_setup="""
            auto x = torch::ones({4, 4});
            auto w = torch::ones({4, 4});
            w.set_requires_grad(true);
        """,
    ),

    signature="f(x, w) -> y",
    torchscript=True,
    autograd=True,
)
```

It is trivial to generate Timers for the eager forward mode case (ignoring
`num_threads` for now):

```
Timer(
    stmt=benchmark.py_fwd_stmt,
    setup=benchmark.setup.py_setup,
)

Timer(
    stmt=benchmark.cpp_fwd_stmt,
    setup=benchmark.setup.cpp_setup,
    language="cpp",
)
```

Moreover, because `signature` is provided we know that creation of `x` and `w`
is part of setup, and the overall computation uses `x` and `w` to produce `y`.
As a result, we can derive TorchScript'd and AutoGrad variants as well. We can
deduce that a TorchScript model will take the form:

```
@torch.jit.script
def f(x, w):
    # Paste `benchmark.py_fwd_stmt` into the function body.
    y = x * w
    return y  # Set by `-> y` in signature.
```

And because we will want to use this model in both Python and C++, we save it to
disk and load it as needed. At this point Timers for TorchScript become:

```
Timer(
    stmt="""
        y = jit_model(x, w)
    """,
    setup="""
        # benchmark.setup.py_setup
        # jit_model = torch.jit.load(...)
        # Warm up jit_model
    """,
)

Timer(
    stmt="""
        std::vector<torch::jit::IValue> ivalue_inputs({
            torch::jit::IValue(x),
            torch::jit::IValue(w)
        });
        auto y = jit_model.forward(ivalue_inputs);
    """,
    setup="""
        // benchmark.setup.cpp_setup
        // jit_model = torch::jit::load(...)
        // Warm up jit_model
    """,
    language="cpp",
)
```

While nothing above is particularly complex, there is non-trivial bookkeeping
(managing the model artifact, setting up IValues) which, if done manually, would
be rather bug-prone and hard to read.

The story is similar for autograd: because we know the output variable (`y`)
and we make sure to assign it when calling TorchScript models, testing AutoGrad
is as simple as appending `y.backward()` (or `y.backward();` in C++) to the
stmt of the forward-only variant. Of course this requires that `signature` be
provided, as there is nothing special about the name `y`.
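
For instance, the Python eager autograd variant of the example above reduces to
(a sketch):

```
Timer(
    stmt="""
        y = x * w
        y.backward()
    """,
    setup=benchmark.setup.py_setup,
)
```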

The logic for the manipulations above is split between `core/api.py` (for
generating `stmt` based on language, Eager/TorchScript, with or without AutoGrad)
and `core/expand.py` (for larger, more expansive generation). The benchmarks
themselves are defined in `definitions/standard.py`. The current set is chosen
to demonstrate the various model definition APIs, and will be expanded when the
benchmark runner infrastructure is better equipped to deal with a larger run.

### Benchmark execution.

Once `expand.materialize` has flattened the abstract benchmark definitions into
`TimerArgs`, they can be sent to a worker (`worker/main.py`) subprocess for
execution. This worker has no concept of the larger benchmark suite; each
`TimerArgs` maps one-to-one and directly onto the `torch.utils.benchmark.Timer`
instance that the worker instantiates.
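
Conceptually, the measurement performed by the worker for a Python-language
`TimerArgs` boils down to something like the following sketch (reusing the eager
case from the example above; instruction counts are collected with Callgrind,
so `valgrind` must be installed):

```
from torch.utils.benchmark import Timer

# Sketch of a single worker measurement for the eager Python case above.
timer = Timer(
    stmt="y = x * w",
    setup="""
        x = torch.ones((4, 4))
        w = torch.ones((4, 4), requires_grad=True)
    """,
    num_threads=1,
)

# Collect instruction counts (not wall times) via Callgrind.
stats = timer.collect_callgrind(number=100)
print(stats)
```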
143