# Instruction count microbenchmarks
## Quick start

### To run the benchmark:

```
# From pytorch root
cd benchmarks/instruction_counts
python main.py
```

Currently `main.py` contains a very simple threadpool (so that run time isn't
unbearably onerous) and simply prints the results. These components will be
upgraded in subsequent PRs.

### To define a new benchmark:
* `TimerArgs`: Low level definition which maps directly to
  `torch.utils.benchmark.Timer`
* `GroupedStmts`: Benchmark a snippet (Python, C++, or both). Can automatically
  generate TorchScript and autograd variants.
* `GroupedModules`: Like `GroupedStmts`, but takes `nn.Module`s
* `GroupedVariants`: Benchmark-per-line to define many related benchmarks in a
  single code block.

## Architecture
### Benchmark definition.

One primary goal of this suite is to make it easy to define semantically
related clusters of benchmarks. The crux of this effort is the
`GroupedBenchmark` class, which is defined in `core/api.py`. It takes a
definition for a set of related benchmarks, and produces one or more concrete
cases. It's helpful to see an example to understand how the machinery works.
Consider the following benchmark:

```
# `GroupedStmts` is an alias of `GroupedBenchmark.init_from_stmts`
benchmark = GroupedStmts(
    py_stmt=r"y = x * w",
    cpp_stmt=r"auto y = x * w;",

    setup=GroupedSetup(
        py_setup="""
            x = torch.ones((4, 4))
            w = torch.ones((4, 4), requires_grad=True)
        """,
        cpp_setup="""
            auto x = torch::ones({4, 4});
            auto w = torch::ones({4, 4});
            w.set_requires_grad(true);
        """,
    ),

    signature="f(x, w) -> y",
    torchscript=True,
    autograd=True,
)
```

It is trivial to generate Timers for the eager forward mode case (ignoring
`num_threads` for now):

```
Timer(
    stmt=benchmark.py_fwd_stmt,
    setup=benchmark.setup.py_setup,
)

Timer(
    stmt=benchmark.cpp_fwd_stmt,
    setup=benchmark.setup.cpp_setup,
    language="cpp",
)
```

Moreover, because `signature` is provided we know that creation of `x` and `w`
is part of setup, and that the overall computation uses `x` and `w` to produce
`y`. As a result, we can derive TorchScript and autograd variants as well. We
can deduce that a TorchScript model will take the form:

```
@torch.jit.script
def f(x, w):
    # Paste `benchmark.py_fwd_stmt` into the function body.
    y = x * w
    return y  # Set by `-> y` in the signature.
```

And because we will want to use this model in both Python and C++, we save it
to disk and load it as needed. At this point the Timers for TorchScript become:

```
Timer(
    stmt="""
        y = jit_model(x, w)
    """,
    setup="""
        # benchmark.setup.py_setup
        # jit_model = torch.jit.load(...)
        # Warm up jit_model
    """,
)

Timer(
    stmt="""
        std::vector<torch::jit::IValue> ivalue_inputs{
            torch::jit::IValue(x),
            torch::jit::IValue(w)
        };
        auto y = jit_model.forward(ivalue_inputs);
    """,
    setup="""
        // benchmark.setup.cpp_setup
        // jit_model = torch::jit::load(...)
        // Warm up jit_model
    """,
    language="cpp",
)
```

While nothing above is particularly complex, there is non-trivial bookkeeping
(managing the model artifact, setting up IValues) which, if done manually,
would be rather bug-prone and hard to read.
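
For reference, any of the Timers above can also be used on its own to collect
instruction counts. Below is a minimal, self-contained sketch of the eager
Python variant from the example (this is not the suite's runner; it requires
Valgrind, since `collect_callgrind` replays the snippet under Callgrind):

```
from torch.utils.benchmark import Timer

timer = Timer(
    stmt="y = x * w",
    setup="""
        x = torch.ones((4, 4))
        w = torch.ones((4, 4), requires_grad=True)
    """,
    # `torch` is available in the Timer globals by default, so `setup` does
    # not need to import it.
)

# Instruction counts are largely deterministic, so a modest `number` suffices.
stats = timer.collect_callgrind(number=100)
print(stats.counts(denoise=True))  # Total instructions executed.
```

In the suite itself this construction and collection is handled by the runner
and worker described below.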

The story is similar for autograd: because we know the output variable (`y`),
and because we make sure to assign it when calling TorchScript models, testing
autograd is as simple as appending `y.backward()` (or `y.backward();` in C++)
to the `stmt` of the forward-only variant. Of course this requires that
`signature` be provided, as there is nothing special about the name `y`.

The logic for the manipulations above is split between `core/api.py` (which
generates `stmt`s based on language, Eager vs. TorchScript, and with or without
autograd) and `core/expand.py` (which handles the larger, more expansive
generation). The benchmarks themselves are defined in
`definitions/standard.py`. The current set is chosen to demonstrate the various
model definition APIs, and will be expanded once the benchmark runner
infrastructure is better equipped to handle a larger run.

### Benchmark execution.

Once `expand.materialize` has flattened the abstract benchmark definitions into
`TimerArgs`, they can be sent to a worker subprocess (`worker/main.py`) for
execution. The worker has no concept of the larger benchmark suite; `TimerArgs`
maps one-to-one onto the `torch.utils.benchmark.Timer` instance that the worker
instantiates.
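
As a rough sketch of that mapping (the real logic lives in `worker/main.py`;
the attribute names on `timer_args` below are illustrative assumptions, with
the actual `TimerArgs` dataclass defined in `core/api.py`):

```
from torch.utils.benchmark import Timer

def run_one(timer_args) -> int:
    # Conceptual worker step: one TimerArgs -> one Timer -> instruction counts.
    # Attribute names on `timer_args` are assumptions for illustration only.
    timer = Timer(
        stmt=timer_args.stmt,
        setup=timer_args.setup,
        num_threads=timer_args.num_threads,
        language=timer_args.language,  # "python" or "cpp"
    )

    # Counts are collected under Callgrind (requires Valgrind).
    return timer.collect_callgrind(number=100).counts(denoise=True)
```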