# Distributed Data Parallel Benchmark

This tool is used to measure distributed training iteration time. This
is helpful for evaluating the performance impact of code changes to
`torch.nn.parallel.DistributedDataParallel`, `torch.distributed`, or
anything in between.

It optionally produces a JSON file with all measurements, allowing for
an easy A/B comparison of code, configuration, or environment. This
comparison can be produced by `diff.py`.

## Requirements

This benchmark depends on PyTorch and torchvision.

## How to run

Run as many copies of this script as you have model replicas.

If you launch a single task per machine with multiple GPUs, consider
using [`torch.distributed.launch`][launch] to spawn multiple processes
per machine.
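
As a rough illustration of "one copy per model replica", the sketch below
enumerates the processes for a given cluster shape. It assumes the
conventional rank layout used by `torch.distributed.launch` (global rank =
machine index times processes per machine, plus local index). The helper
name `replica_layout` is hypothetical and not part of the benchmark.

```python
# Sketch only: list the processes to launch, one per model replica.
# Assumption: global rank = machine index * GPUs per machine + local GPU index.
# `replica_layout` is a hypothetical helper, not part of the benchmark.


def replica_layout(num_machines, gpus_per_machine):
    """Return (global_rank, machine_index, local_gpu) for every replica."""
    layout = []
    for machine in range(num_machines):
        for local_gpu in range(gpus_per_machine):
            rank = machine * gpus_per_machine + local_gpu
            layout.append((rank, machine, local_gpu))
    return layout


# Two machines with eight GPUs each, i.e. a "2M/8G" configuration.
layout = replica_layout(num_machines=2, gpus_per_machine=8)
world_size = len(layout)
print(world_size)             # 16
print(layout[0], layout[-1])  # (0, 0, 0) (15, 1, 7)
```

With this shape, 16 copies of the script run in total, 8 on each machine.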

[launch]: https://pytorch.org/docs/stable/distributed.html#launch-utility

Example output (only on rank 0):

```
-----------------------------------
PyTorch distributed benchmark suite
-----------------------------------

* PyTorch version: 1.4.0a0+05140f0
* CUDA version: 10.0
* Distributed backend: nccl

--- nvidia-smi topo -m ---

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_2  mlx5_0  mlx5_3  mlx5_1  CPU Affinity
GPU0    X       NV1     NV1     NV2     NV2     SYS     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU1    NV1     X       NV2     NV1     SYS     NV2     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU2    NV1     NV2     X       NV2     SYS     SYS     NV1     SYS     SYS     PHB     SYS     PIX     0-19,40-59
GPU3    NV2     NV1     NV2     X       SYS     SYS     SYS     NV1     SYS     PHB     SYS     PIX     0-19,40-59
GPU4    NV2     SYS     SYS     SYS     X       NV1     NV1     NV2     PIX     SYS     PHB     SYS     0-19,40-59
GPU5    SYS     NV2     SYS     SYS     NV1     X       NV2     NV1     PIX     SYS     PHB     SYS     0-19,40-59
GPU6    SYS     SYS     NV1     SYS     NV1     NV2     X       NV2     PHB     SYS     PIX     SYS     0-19,40-59
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2     X       PHB     SYS     PIX     SYS     0-19,40-59
mlx5_2  SYS     SYS     SYS     SYS     PIX     PIX     PHB     PHB     X       SYS     PHB     SYS
mlx5_0  PIX     PIX     PHB     PHB     SYS     SYS     SYS     SYS     SYS     X       SYS     PHB
mlx5_3  SYS     SYS     SYS     SYS     PHB     PHB     PIX     PIX     PHB     SYS     X       SYS
mlx5_1  PHB     PHB     PIX     PIX     SYS     SYS     SYS     SYS     SYS     PHB     SYS     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

--------------------------


Benchmark: resnet50 with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.097s     329/s  p75:  0.097s     329/s  p90:  0.097s     329/s  p95:  0.097s     329/s
   1 GPUs --    1M/1G:  p50:  0.100s     319/s  p75:  0.100s     318/s  p90:  0.100s     318/s  p95:  0.100s     318/s
   2 GPUs --    1M/2G:  p50:  0.103s     310/s  p75:  0.103s     310/s  p90:  0.103s     310/s  p95:  0.103s     309/s
   4 GPUs --    1M/4G:  p50:  0.103s     310/s  p75:  0.103s     310/s  p90:  0.103s     310/s  p95:  0.103s     310/s
   8 GPUs --    1M/8G:  p50:  0.104s     307/s  p75:  0.104s     307/s  p90:  0.104s     306/s  p95:  0.104s     306/s
  16 GPUs --    2M/8G:  p50:  0.104s     306/s  p75:  0.104s     306/s  p90:  0.104s     306/s  p95:  0.104s     306/s

Benchmark: resnet101 with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.162s     197/s  p75:  0.162s     197/s  p90:  0.162s     197/s  p95:  0.162s     197/s
   1 GPUs --    1M/1G:  p50:  0.171s     187/s  p75:  0.171s     186/s  p90:  0.171s     186/s  p95:  0.172s     185/s
   2 GPUs --    1M/2G:  p50:  0.176s     182/s  p75:  0.176s     181/s  p90:  0.176s     181/s  p95:  0.176s     181/s
   4 GPUs --    1M/4G:  p50:  0.176s     182/s  p75:  0.176s     181/s  p90:  0.176s     181/s  p95:  0.176s     181/s
   8 GPUs --    1M/8G:  p50:  0.179s     179/s  p75:  0.179s     178/s  p90:  0.180s     178/s  p95:  0.180s     177/s
  16 GPUs --    2M/8G:  p50:  0.179s     178/s  p75:  0.180s     177/s  p90:  0.183s     174/s  p95:  0.188s     170/s

Benchmark: resnext50_32x4d with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.145s     220/s  p75:  0.145s     220/s  p90:  0.145s     220/s  p95:  0.145s     220/s
   1 GPUs --    1M/1G:  p50:  0.147s     217/s  p75:  0.147s     217/s  p90:  0.148s     216/s  p95:  0.148s     216/s
   2 GPUs --    1M/2G:  p50:  0.153s     209/s  p75:  0.153s     209/s  p90:  0.153s     209/s  p95:  0.153s     209/s
   4 GPUs --    1M/4G:  p50:  0.153s     208/s  p75:  0.153s     208/s  p90:  0.154s     208/s  p95:  0.154s     208/s
   8 GPUs --    1M/8G:  p50:  0.157s     204/s  p75:  0.157s     204/s  p90:  0.157s     203/s  p95:  0.157s     203/s
  16 GPUs --    2M/8G:  p50:  0.157s     203/s  p75:  0.157s     203/s  p90:  0.158s     203/s  p95:  0.158s     202/s

Benchmark: resnext101_32x8d with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.415s      77/s  p75:  0.415s      77/s  p90:  0.416s      76/s  p95:  0.417s      76/s
   1 GPUs --    1M/1G:  p50:  0.425s      75/s  p75:  0.426s      75/s  p90:  0.426s      75/s  p95:  0.426s      75/s
   2 GPUs --    1M/2G:  p50:  0.438s      73/s  p75:  0.439s      72/s  p90:  0.439s      72/s  p95:  0.439s      72/s
   4 GPUs --    1M/4G:  p50:  0.439s      72/s  p75:  0.439s      72/s  p90:  0.440s      72/s  p95:  0.440s      72/s
   8 GPUs --    1M/8G:  p50:  0.447s      71/s  p75:  0.447s      71/s  p90:  0.448s      71/s  p95:  0.448s      71/s
  16 GPUs --    2M/8G:  p50:  0.450s      71/s  p75:  0.451s      70/s  p90:  0.451s      70/s  p95:  0.451s      70/s
```

## How to diff

Run the benchmark with the `--json PATH_TO_REPORT_FILE` argument to
produce the JSON file that the diff script can consume.

Then, run the diff script as follows:

```
$ python3 diff.py PATH_TO_BASELINE_FILE PATH_TO_TEST_FILE
                                 baseline                      test
                     --------------------      --------------------
bucket_size:                           25  vs                     1
cuda_version:                        10.0  vs                  10.0
distributed_backend:                 nccl  vs                  nccl
pytorch_version:          1.4.0a0+05140f0  vs       1.4.0a0+05140f0

Benchmark: resnet50 with batch size 32

                 sec/iter    ex/sec      diff       sec/iter    ex/sec      diff
   1 GPUs:  p75:   0.101s     317/s     -0.3%  p95:   0.101s     317/s     -0.4%
   2 GPUs:  p75:   0.104s     306/s     -1.0%  p95:   0.104s     306/s     -1.0%
   4 GPUs:  p75:   0.105s     305/s     -1.6%  p95:   0.105s     304/s     -1.8%
   8 GPUs:  p75:   0.107s     299/s     -2.6%  p95:   0.107s     298/s     -2.7%
  16 GPUs:  p75:   0.108s     294/s     -3.8%  p95:   0.122s     262/s    -16.4%

Benchmark: resnet101 with batch size 32

                 sec/iter    ex/sec      diff       sec/iter    ex/sec      diff
   1 GPUs:  p75:   0.172s     185/s     -1.2%  p95:   0.172s     185/s     -1.3%
   2 GPUs:  p75:   0.179s     178/s     -2.1%  p95:   0.179s     178/s     -2.0%
   4 GPUs:  p75:   0.180s     177/s     -2.6%  p95:   0.180s     177/s     -2.6%
   8 GPUs:  p75:   0.184s     173/s     -3.5%  p95:   0.184s     173/s     -3.5%
  16 GPUs:  p75:   0.187s     170/s     -0.1%  p95:   0.204s     157/s     -7.9%

Benchmark: resnext50_32x4d with batch size 32

                 sec/iter    ex/sec      diff       sec/iter    ex/sec      diff
   1 GPUs:  p75:   0.149s     214/s     -1.0%  p95:   0.149s     214/s     -0.9%
   2 GPUs:  p75:   0.156s     205/s     -1.5%  p95:   0.156s     205/s     -1.6%
   4 GPUs:  p75:   0.156s     204/s     -1.6%  p95:   0.157s     204/s     -1.8%
   8 GPUs:  p75:   0.159s     200/s     -1.5%  p95:   0.159s     200/s     -1.5%
  16 GPUs:  p75:   0.161s     198/s     -1.9%  p95:   0.162s     197/s     -2.3%

Benchmark: resnext101_32x8d with batch size 32

                 sec/iter    ex/sec      diff       sec/iter    ex/sec      diff
   1 GPUs:  p75:   0.427s      74/s     -0.8%  p95:   0.428s      74/s     -0.7%
   2 GPUs:  p75:   0.444s      72/s     -1.3%  p95:   0.445s      71/s     -0.7%
   4 GPUs:  p75:   0.444s      72/s     -1.1%  p95:   0.445s      71/s     -0.8%
   8 GPUs:  p75:   0.452s      70/s     -1.3%  p95:   0.452s      70/s     -1.3%
  16 GPUs:  p75:   0.455s      70/s     -0.7%  p95:   0.456s      70/s     -0.6%
```

This compares throughput between `bucket_cap_mb=25` (the default) and
`bucket_cap_mb=1` on 8 DGX machines with V100 GPUs. It confirms that
even for a relatively small model on machines with a very fast
interconnect (4x 100Gb InfiniBand per machine), it still pays off to
batch allreduce calls.
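
To make the columns easier to read, here is a minimal sketch of how the
`ex/sec` and `diff` values could be derived from raw iteration times. It
assumes that `ex/sec` is the batch size divided by the iteration time
(32 / 0.097 is roughly 330, in line with the 329/s reported above) and that
`diff` is the relative change in `ex/sec` from baseline to test. The helper
names, the nearest-rank percentile, and the sample timings are illustrative
assumptions, not taken from `diff.py` or a real report.

```python
import math

# Sketch only: the helpers and sample timings below are hypothetical.


def percentile(values, p):
    """Nearest-rank percentile (an illustrative choice; the tool may interpolate)."""
    ordered = sorted(values)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]


def throughput(batch_size, sec_per_iter):
    """Examples per second for one process."""
    return batch_size / sec_per_iter


def relative_diff(baseline, test):
    """Percentage change of test relative to baseline; negative means slower."""
    return 100.0 * (test - baseline) / baseline


# Made-up iteration times in seconds for a baseline and a test run.
baseline_times = [0.100, 0.101, 0.100, 0.102, 0.101]
test_times = [0.104, 0.105, 0.104, 0.110, 0.105]

for p in (75, 95):
    base = throughput(32, percentile(baseline_times, p))
    test = throughput(32, percentile(test_times, p))
    print(f"p{p}: {base:.0f}/s vs {test:.0f}/s ({relative_diff(base, test):+.1f}%)")
```

Under these assumptions, a negative `diff` means the test configuration
processes fewer examples per second than the baseline at that percentile.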