# Modular Benchmarking Components:

NOTE: These components are currently work in progress.

## Timer
This class is modeled on the `timeit.Timer` API, but with PyTorch specific
facilities which make it more suitable for benchmarking kernels. These fall
into two broad categories:

### Managing 'gotchas':

`Timer` will invoke `torch.cuda.synchronize()` if applicable, control the
number of torch threads, add a warmup, and warn if a measurement appears
suspect or downright unreliable.

### Integration and better measurement:

`Timer`, while modeled after the `timeit` analog, uses a slightly different
API from `timeit.Timer`.

* The constructor accepts additional metadata, and timing methods return
a `Measurement` class rather than a float. This `Measurement` class is
serializable and allows many examples to be grouped and interpreted.
(See `Compare` for more details.)

* `Timer` implements the `blocked_autorange` function, which is a
mixture of `timeit.Timer.repeat` and `timeit.Timer.autorange`. This function
selects an appropriate number of iterations and runs for a roughly fixed
amount of time (like `autorange`), but is less wasteful than `autorange`,
which discards ~75% of measurements. It runs many times, similar to `repeat`,
and returns a `Measurement` containing all of the run results.

## Compare

`Compare` takes a list of `Measurement`s in its constructor, and displays them
as a formatted table for easier analysis. Identical measurements will be
merged, which allows `Compare` to process replicate measurements. Several
convenience methods are also provided to truncate displayed values based on
the number of significant figures and to color code measurements to highlight
performance differences. Grouping and layout are based on metadata passed to
`Timer` (a short sketch combining `Timer` and `Compare` follows this list):

* `label`: This is a top level description (e.g. `add` or `multiply`). One
table will be generated per unique label.

* `sub_label`: This is the label for a given configuration. Multiple statements
may be logically equivalent but differ in implementation. Assigning separate
sub_labels will result in a row per sub_label. If a sub_label is not provided,
`stmt` is used instead. Statistics (such as computing the fastest
implementation) use all sub_labels.

* `description`: This describes the inputs. For instance, `stmt=torch.add(x, y)`
can be run over several values of `x` and `y`. Each pair should be given its
own `description`, which allows them to appear in separate columns.
Statistics do not mix values of different descriptions, since comparing the
run time of drastically different inputs is generally not meaningful.

* `env`: An optional description of the torch environment (e.g. `master` or
`my_branch`). Like sub_labels, statistics are calculated across envs, since
comparing a branch to master or a stable release is a common use case.
However `Compare` will visually group rows which are run with the same `env`.

* `num_threads`: By default, `Timer` will run in single-threaded mode. If
`Measurement`s with different numbers of threads are given to `Compare`, they
will be grouped into separate blocks of rows.
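The sketch below is illustrative rather than authoritative: it assumes `Timer`
and `Compare` are importable from `torch.utils.benchmark` (adjust the import to
wherever this package lives in your checkout), and the shapes, labels, and
`min_run_time` are arbitrary choices. It collects a small grid of
`Measurement`s and renders them using the grouping described above.

```
import torch

# Assumed import path; adjust to wherever this package is importable from.
from torch.utils.benchmark import Timer, Compare

results = []
for n in (1024, 4096):
    x, y = torch.ones((n, n)), torch.ones((n, 1))
    for num_threads in (1, 4):
        stmts = (("torch.add(x, y)", "add"), ("x + y", "operator +"))
        for stmt, sub_label in stmts:
            m = Timer(
                stmt=stmt,
                globals={"torch": torch, "x": x, "y": y},
                label="add",                           # one table per label
                sub_label=sub_label,                   # one row per sub_label
                description=f"({n}, {n}) + ({n}, 1)",  # one column per description
                env="master",                          # visually groups rows
                num_threads=num_threads,               # separate blocks of rows
            ).blocked_autorange(min_run_time=1)
            results.append(m)

compare = Compare(results)
compare.trim_significant_figures()
compare.colorize()
compare.print()
```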
## Fuzzing

The `Fuzzer` class is designed to allow very flexible and repeatable
construction of a wide variety of Tensors while automating away some
of the tedium that comes with creating good benchmark inputs. The two
APIs of interest are the constructor and `Fuzzer.take(self, n: int)`.

At construction, a `Fuzzer` is a spec for the kind of Tensors that
should be created. It takes a list of `FuzzedParameter`s, a list of
`FuzzedTensor`s, and an integer with which to seed the Fuzzer.

The reason for distinguishing between parameters and Tensors is that the shapes
and data of Tensors are often linked (e.g. shapes must be identical or
broadcastable, indices must fall within a given range, etc.). As a result we
must first materialize values for each parameter, and then use them to
construct Tensors in a second pass. As a concrete reference, the following
will create Tensors `x` and `y`, where `x` is a 2D Tensor and `y` is
broadcastable to the shape of `x`:

```
fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter("k0", 16, 16 * 1024, "loguniform"),
        FuzzedParameter("k1", 16, 16 * 1024, "loguniform"),
    ],
    tensors=[
        FuzzedTensor(
            name="x", size=("k0", "k1"), probability_contiguous=0.75
        ),
        FuzzedTensor(
            name="y", size=("k0", 1), probability_contiguous=0.75
        ),
    ],
    seed=0,
)
```

Calling `fuzzer.take(n)` will create a generator with `n` elements which
yields randomly generated Tensors satisfying the above definition, as well
as some metadata about the parameters and Tensors. Critically, calling
`.take(...)` multiple times will produce generators which select the same
parameters, allowing repeat measurements and different environments to
conduct the same trial. `FuzzedParameter` and `FuzzedTensor` support a
fairly involved set of behaviors to reflect the rich character of Tensor
operations and representations. (For instance, note the
`probability_contiguous` argument, which signals that some fraction of the
time non-contiguous Tensors should be created.) The best way to understand
`Fuzzer`, however, is probably to experiment with `examples.fuzzer`.

# Examples:
`python -m examples.simple_timeit`

`python -m examples.compare`

`python -m examples.fuzzer`

`python -m examples.end_to_end`
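For reference, a minimal sketch of consuming the generator produced by the
`fuzzer` defined in the Fuzzing section above. The three-element structure of
each yielded item (Tensors, per-Tensor metadata, parameter values) is an
assumption here; defer to `examples.fuzzer` if it disagrees.

```
# Sketch only: assumes `fuzzer` from the Fuzzing section is in scope and that
# `take` yields (tensors, tensor_metadata, params) triples.
for tensors, tensor_metadata, params in fuzzer.take(10):
    x, y = tensors["x"], tensors["y"]
    # `params` holds the values materialized for "k0" and "k1" in this draw.
    print(x.shape, y.shape, x.is_contiguous(), params["k0"], params["k1"])
```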