### [#115286](https://github.com/pytorch/pytorch/pull/115286)
* Prior to this PR, the backend worker was a process that read from the request queue, ran the model's forward pass, and put the output in the response queue. This PR creates a `ThreadPoolExecutor` with 1 worker and runs the model forward and response step asynchronously in the executor, so that they do not block polling the queue for more requests (a sketch of this pattern appears below).

##### Results
* Warmup latency improved (likely because the backend is no longer a new process), but all other metrics were worse.


### [#116188](https://github.com/pytorch/pytorch/pull/116188)
* Fixed two bugs in the metrics calculation:
    * Before this PR, consecutive `request_time`s were separated by the time it took `torch.randn(...)` to create the fake `data` tensor on CPU, so the gap between requests incorrectly scaled with the batch size. Since latency was calculated as `response_time - request_time`, latencies were not comparable across batch sizes.
    * Corrected the throughput calculation: previously `(num_batches * batch_size) / sum(response_times)`, now `(num_batches * batch_size) / (last_response_time - first_request_time)`.
* Fixed a bug where responses sent to the frontend were still on GPU.
* Used a semaphore so that `metrics_thread` and `gpu_utilization_thread` write to `metrics_dict` in a thread-safe manner.

##### Results
* Baseline metrics were reset because of the bugs listed above.
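A minimal sketch of the single-worker executor pattern from #115286, assuming a blocking `request_queue`/`response_queue` interface; `handle`, `backend_loop`, and the `None` shutdown sentinel are illustrative names rather than the benchmark's actual identifiers:

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def handle(model, data, response_queue):
    # Run the forward pass and enqueue the response off the polling thread.
    with torch.no_grad():
        output = model(data)
    response_queue.put(output)


def backend_loop(model, request_queue, response_queue):
    # A single worker keeps requests ordered, but submitting to the executor
    # means polling the request queue is not blocked by the previous forward.
    executor = ThreadPoolExecutor(max_workers=1)
    while True:
        data = request_queue.get()   # keep polling for new requests
        if data is None:             # sentinel used to shut down the loop
            break
        executor.submit(handle, model, data, response_queue)
    executor.shutdown(wait=True)
```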
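And a sketch of the corrected metrics handling from #116188; `record_metric` and `throughput` are hypothetical helpers, while `metrics_dict` and the use of a semaphore follow the PR description above:

```python
import threading

metrics_dict = {}
# Binary semaphore guarding metrics_dict, shared by the metrics thread and the
# GPU-utilization thread (a threading.Lock would behave the same way here).
metrics_semaphore = threading.Semaphore(1)


def record_metric(name, value):
    # Acquire the semaphore so concurrent writers cannot interleave updates.
    with metrics_semaphore:
        metrics_dict.setdefault(name, []).append(value)


def throughput(num_batches, batch_size, first_request_time, last_response_time):
    # Corrected formula: total samples over the wall-clock span of the run,
    # rather than over the sum of per-batch response times.
    return (num_batches * batch_size) / (last_response_time - first_request_time)
```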
### [#116189](https://github.com/pytorch/pytorch/pull/116189)
* Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D copies with compute and to let the worker handling prediction launch compute kernels without being blocked by D2H/H2D. A rough sketch of the two copy threads is included at the end of these notes.
    * One thread pins the memory of the CPU request and copies it into a CUDA tensor.
    * One thread moves the response to CPU and places it into the response queue.
* Semaphores are used in conjunction with `cuda.Event`s to ensure proper synchronization among the threads.

##### Results
* Warmup latency decreased compared to the baseline for all batch sizes.
* For batch sizes 1, 32 and 64 we observed that metrics were worse:
    * Average latency increased
    * Throughput decreased
    * GPU utilization decreased
* For batch sizes 128 and 256 we observed that metrics improved:
    * Average latency decreased
    * Throughput increased
    * GPU utilization increased


### [#116190](https://github.com/pytorch/pytorch/pull/116190)
* Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolExecutor` used for model predictions. Each worker uses its own `cuda.Stream()`, created when the worker thread is initialized (see the sketch at the end of these notes).

##### Results
Benchmarks were only run for `compile=False`, since `torch.compile()` is not thread-safe. Benchmarks were run with `num_workers={2, 3, 4}`.

For the 2-worker case:
* All metrics improved over the single-worker case across all batch sizes.
* For batch sizes 1, 32 and 64 we observed that the metrics were still slightly worse than the baseline.
* For batch sizes 128 and 256 we observed that all metrics beat the baseline (e.g. ~300 samples/sec increase in throughput, ~5s decrease in average latency, and ~2s decrease in warmup latency for bs=256).
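A rough sketch of the copy-thread scheme from #116189, assuming requests arrive as CPU tensors; the function and variable names are illustrative, and the semaphore-based coordination between the threads described above is simplified here to event waits and an explicit stream synchronize:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

# One single-worker executor and one dedicated stream per copy direction, so
# H2D and D2H copies can overlap with compute on the prediction stream.
h2d_executor = ThreadPoolExecutor(max_workers=1)
d2h_executor = ThreadPoolExecutor(max_workers=1)
h2d_stream = torch.cuda.Stream()
d2h_stream = torch.cuda.Stream()


def copy_request_to_gpu(cpu_request):
    # Pin the CPU tensor so the H2D copy can be asynchronous, then record an
    # event that the prediction stream can wait on before launching compute.
    with torch.cuda.stream(h2d_stream):
        gpu_request = cpu_request.pin_memory().to("cuda", non_blocking=True)
        ready = torch.cuda.Event()
        ready.record(h2d_stream)
    return gpu_request, ready


def copy_response_to_cpu(gpu_response, compute_done, response_queue):
    # Wait for the prediction to finish, copy back on the D2H stream, and make
    # sure the copy is complete before the CPU tensor reaches the frontend.
    d2h_stream.wait_event(compute_done)
    with torch.cuda.stream(d2h_stream):
        cpu_response = gpu_response.to("cpu", non_blocking=True)
    d2h_stream.synchronize()
    response_queue.put(cpu_response)
```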
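And a sketch of the per-worker stream setup from #116190; `_init_prediction_worker`, `predict`, and `_local` are illustrative names, while `--num_workers` and the per-thread `cuda.Stream()` follow the PR description:

```python
import argparse
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

_local = threading.local()


def _init_prediction_worker():
    # Runs once per worker thread: give each worker its own CUDA stream.
    _local.stream = torch.cuda.Stream()


def predict(model, gpu_data):
    # Launch this worker's kernels on its private stream, then wait for them.
    with torch.cuda.stream(_local.stream):
        with torch.no_grad():
            output = model(gpu_data)
    _local.stream.synchronize()
    return output


parser = argparse.ArgumentParser()
parser.add_argument("--num_workers", type=int, default=1)
args = parser.parse_args()

prediction_executor = ThreadPoolExecutor(
    max_workers=args.num_workers, initializer=_init_prediction_worker
)
```

Creating the stream in the executor's `initializer` means each worker thread allocates one stream up front and reuses it for every prediction it handles, rather than creating a new stream per request.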