## Inference benchmarks

This folder contains a work-in-progress simulation of a Python inference server.

The v0 version of this has a backend worker that is a single process. It loads a
ResNet-18 checkpoint to 'cuda:0' and compiles the model. It accepts requests in
the form of (tensor, request_time) from a `multiprocessing.Queue`, runs
inference on the request and returns (output, request_time) on a separate
response `multiprocessing.Queue`.

The frontend worker is a process with three threads:
1. A thread that generates fake data of a given batch size in the form of CPU
   tensors and puts the data into the request queue.
2. A thread that reads responses from the response queue and collects metrics on
   the latency of the first response (which corresponds to the cold start time),
   the average, minimum and maximum response latency, and the throughput.
3. A thread that polls nvidia-smi for GPU utilization metrics.

For now we omit data preprocessing as well as result post-processing.

### Running a single benchmark

The toggleable command line arguments to the script are as follows:
  - `num_iters` (default: 100): how many requests to send to the backend,
    excluding the first warmup request
  - `batch_size` (default: 32): the batch size of the requests
  - `model_dir` (default: '.'): the directory to load the checkpoint from
  - `compile` or `--no-compile` (default: compile): whether to `torch.compile()`
    the model
  - `output_file` (default: output.csv): the name of the csv file to write the outputs to in the `results/` directory
  - `num_workers` (default: 2): the `max_workers` passed to the `ThreadPoolExecutor` in charge of model prediction

e.g. A sample command to run the benchmark

```
python -W ignore server.py --num_iters 1000 --batch_size 32
```

The results will be found in `results/output.csv`, which will be appended to if the file already exists.
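
To make the flag list above concrete, here is a minimal sketch of how such arguments could be declared with `argparse`. It is not taken from `server.py`. The function name `build_parser` is hypothetical, the `--compile` / `--no-compile` pair is assumed to behave like `argparse.BooleanOptionalAction` (Python 3.9+), and the defaults are copied from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real server.py)."""
    parser = argparse.ArgumentParser(description="Inference benchmark (sketch)")
    # Number of requests sent to the backend, excluding the warmup request.
    parser.add_argument("--num_iters", type=int, default=100)
    # Batch size of each fake request.
    parser.add_argument("--batch_size", type=int, default=32)
    # Directory the checkpoint is loaded from.
    parser.add_argument("--model_dir", type=str, default=".")
    # Assumption: BooleanOptionalAction generates both --compile and --no-compile.
    parser.add_argument(
        "--compile", default=True, action=argparse.BooleanOptionalAction
    )
    # CSV file name, written under results/.
    parser.add_argument("--output_file", type=str, default="output.csv")
    # Worker count for the thread pool that runs model prediction.
    parser.add_argument("--num_workers", type=int, default=2)
    return parser


if __name__ == "__main__":
    # Same flags as the sample command above.
    args = build_parser().parse_args(["--num_iters", "1000", "--batch_size", "32"])
    print(args)
```

With this sketch, omitting both `--compile` and `--no-compile` leaves compilation enabled, which matches the documented default.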

Note that the `m.compile()` time in the csv file is not the time for the model to be compiled,
which happens during the first iteration, but rather the time for PT2 components
to be lazily imported (e.g. triton).

### Running a sweep

The script `runner.sh` will run a sweep of the benchmark over different batch
sizes with compile on and off and collect the mean and standard deviation of warmup latency,
average latency, throughput and GPU utilization for each. The `results/` directory will contain the metrics
from running a sweep as we develop this benchmark, where `results/output_{batch_size}_{compile}.md`
will contain the mean and standard deviation of results for a given batch size and compile setting.
If the file already exists, the metrics from the run will be appended as a new row in the markdown table.
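
The aggregation step described above can be sketched as follows. This is an illustration under stated assumptions, not the logic of `runner.sh`: the metric names, the helpers `summarize` and `append_row`, the `mean +/- std` cell format, and the file name `output_example.md` are all hypothetical, and the numbers in the usage block are placeholders rather than measured results.

```python
import os
import statistics
import tempfile

# Assumed column names; the real output files may label these differently.
METRICS = ["warmup_latency", "average_latency", "throughput", "gpu_util"]


def summarize(runs):
    """Return {metric: (mean, sample stdev)} over a list of per-run metric dicts."""
    summary = {}
    for name in METRICS:
        values = [run[name] for run in runs]
        # statistics.stdev needs at least two points; fall back to 0.0 otherwise.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = (statistics.mean(values), std)
    return summary


def append_row(path, summary):
    """Append one markdown table row, writing the header first if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a") as f:
        if is_new:
            f.write("| " + " | ".join(METRICS) + " |\n")
            f.write("|" + " --- |" * len(METRICS) + "\n")
        cells = [
            f"{mean:.3f} +/- {std:.3f}"
            for mean, std in (summary[m] for m in METRICS)
        ]
        f.write("| " + " | ".join(cells) + " |\n")


if __name__ == "__main__":
    # Placeholder values for illustration only, not benchmark results.
    runs = [
        {"warmup_latency": 10.0, "average_latency": 1.0, "throughput": 100.0, "gpu_util": 50.0},
        {"warmup_latency": 12.0, "average_latency": 3.0, "throughput": 300.0, "gpu_util": 70.0},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "output_example.md")
        append_row(path, summarize(runs))  # creates header plus first row
        append_row(path, summarize(runs))  # appends a second row
        with open(path) as f:
            print(f.read())
```

Checking for the file before writing is what lets repeated sweeps accumulate rows in one table instead of repeating the header.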