# Distributed Data Parallel Benchmark

This tool is used to measure distributed training iteration time. This
is helpful for evaluating the performance impact of code changes to
`torch.nn.parallel.DistributedDataParallel`, `torch.distributed`, or
anything in between.

It optionally produces a JSON file with all measurements, allowing for
an easy A/B comparison of code, configuration, or environment. This
comparison can be produced by `diff.py`.

## Requirements

This benchmark depends on PyTorch and torchvision.
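
If they are not already installed, both can typically be obtained with
pip (pick the build that matches your CUDA version):

```
pip install torch torchvision
```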

## How to run

Run as many copies of this script as you have model replicas.

If your machines have multiple GPUs, consider using
[`torch.distributed.launch`][launch] to spawn one process per GPU
rather than launching a single task per machine.

[launch]: https://pytorch.org/docs/stable/distributed.html#launch-utility
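
For example, on the first of two 8-GPU machines you could spawn one
process per GPU as sketched below. This is only a sketch: it assumes
the benchmark entry point in this directory is `benchmark.py`, uses a
placeholder master address, and omits the benchmark script's own
arguments (see its `--help`):

```
# Use --node_rank=1 when launching on the second machine.
python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=192.168.0.1 --master_port=29500 \
    benchmark.py
```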

Example output (only on rank 0). In the rows below, `no ddp` is a
single-GPU run without `DistributedDataParallel`, and a label such as
`2M/8G` describes the configuration as machines/GPUs per machine:

```
-----------------------------------
PyTorch distributed benchmark suite
-----------------------------------

* PyTorch version: 1.4.0a0+05140f0
* CUDA version: 10.0
* Distributed backend: nccl

--- nvidia-smi topo -m ---

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_2  mlx5_0  mlx5_3  mlx5_1  CPU Affinity
GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     SYS     PHB     SYS     PIX     0-19,40-59
GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     SYS     PHB     SYS     PIX     0-19,40-59
GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     PIX     SYS     PHB     SYS     0-19,40-59
GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     PIX     SYS     PHB     SYS     0-19,40-59
GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     PHB     SYS     PIX     SYS     0-19,40-59
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      PHB     SYS     PIX     SYS     0-19,40-59
mlx5_2  SYS     SYS     SYS     SYS     PIX     PIX     PHB     PHB      X      SYS     PHB     SYS
mlx5_0  PIX     PIX     PHB     PHB     SYS     SYS     SYS     SYS     SYS      X      SYS     PHB
mlx5_3  SYS     SYS     SYS     SYS     PHB     PHB     PIX     PIX     PHB     SYS      X      SYS
mlx5_1  PHB     PHB     PIX     PIX     SYS     SYS     SYS     SYS     SYS     PHB     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

--------------------------


Benchmark: resnet50 with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.097s     329/s  p75:  0.097s     329/s  p90:  0.097s     329/s  p95:  0.097s     329/s
   1 GPUs --    1M/1G:  p50:  0.100s     319/s  p75:  0.100s     318/s  p90:  0.100s     318/s  p95:  0.100s     318/s
   2 GPUs --    1M/2G:  p50:  0.103s     310/s  p75:  0.103s     310/s  p90:  0.103s     310/s  p95:  0.103s     309/s
   4 GPUs --    1M/4G:  p50:  0.103s     310/s  p75:  0.103s     310/s  p90:  0.103s     310/s  p95:  0.103s     310/s
   8 GPUs --    1M/8G:  p50:  0.104s     307/s  p75:  0.104s     307/s  p90:  0.104s     306/s  p95:  0.104s     306/s
  16 GPUs --    2M/8G:  p50:  0.104s     306/s  p75:  0.104s     306/s  p90:  0.104s     306/s  p95:  0.104s     306/s

Benchmark: resnet101 with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.162s     197/s  p75:  0.162s     197/s  p90:  0.162s     197/s  p95:  0.162s     197/s
   1 GPUs --    1M/1G:  p50:  0.171s     187/s  p75:  0.171s     186/s  p90:  0.171s     186/s  p95:  0.172s     185/s
   2 GPUs --    1M/2G:  p50:  0.176s     182/s  p75:  0.176s     181/s  p90:  0.176s     181/s  p95:  0.176s     181/s
   4 GPUs --    1M/4G:  p50:  0.176s     182/s  p75:  0.176s     181/s  p90:  0.176s     181/s  p95:  0.176s     181/s
   8 GPUs --    1M/8G:  p50:  0.179s     179/s  p75:  0.179s     178/s  p90:  0.180s     178/s  p95:  0.180s     177/s
  16 GPUs --    2M/8G:  p50:  0.179s     178/s  p75:  0.180s     177/s  p90:  0.183s     174/s  p95:  0.188s     170/s

Benchmark: resnext50_32x4d with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.145s     220/s  p75:  0.145s     220/s  p90:  0.145s     220/s  p95:  0.145s     220/s
   1 GPUs --    1M/1G:  p50:  0.147s     217/s  p75:  0.147s     217/s  p90:  0.148s     216/s  p95:  0.148s     216/s
   2 GPUs --    1M/2G:  p50:  0.153s     209/s  p75:  0.153s     209/s  p90:  0.153s     209/s  p95:  0.153s     209/s
   4 GPUs --    1M/4G:  p50:  0.153s     208/s  p75:  0.153s     208/s  p90:  0.154s     208/s  p95:  0.154s     208/s
   8 GPUs --    1M/8G:  p50:  0.157s     204/s  p75:  0.157s     204/s  p90:  0.157s     203/s  p95:  0.157s     203/s
  16 GPUs --    2M/8G:  p50:  0.157s     203/s  p75:  0.157s     203/s  p90:  0.158s     203/s  p95:  0.158s     202/s

Benchmark: resnext101_32x8d with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.415s      77/s  p75:  0.415s      77/s  p90:  0.416s      76/s  p95:  0.417s      76/s
   1 GPUs --    1M/1G:  p50:  0.425s      75/s  p75:  0.426s      75/s  p90:  0.426s      75/s  p95:  0.426s      75/s
   2 GPUs --    1M/2G:  p50:  0.438s      73/s  p75:  0.439s      72/s  p90:  0.439s      72/s  p95:  0.439s      72/s
   4 GPUs --    1M/4G:  p50:  0.439s      72/s  p75:  0.439s      72/s  p90:  0.440s      72/s  p95:  0.440s      72/s
   8 GPUs --    1M/8G:  p50:  0.447s      71/s  p75:  0.447s      71/s  p90:  0.448s      71/s  p95:  0.448s      71/s
  16 GPUs --    2M/8G:  p50:  0.450s      71/s  p75:  0.451s      70/s  p90:  0.451s      70/s  p95:  0.451s      70/s
```

## How to diff

Run the benchmark with the `--json PATH_TO_REPORT_FILE` argument to
produce the JSON file that the diff script can consume.
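
For example, run the benchmark twice, once on the baseline code and
once with the change under test, writing each report to its own file.
This is only a sketch: the distributed launch arguments from the
previous section are elided, `benchmark.py` is the assumed entry
point, and the file names are arbitrary:

```
python3 benchmark.py --json baseline.json   # baseline checkout
python3 benchmark.py --json test.json       # checkout with the change under test
```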

Then, run the diff script as follows:

```
$ python3 diff.py PATH_TO_BASELINE_FILE PATH_TO_TEST_FILE
                                 baseline                      test
                     --------------------      --------------------
bucket_size:                           25  vs                     1
cuda_version:                        10.0  vs                  10.0
distributed_backend:                 nccl  vs                  nccl
pytorch_version:          1.4.0a0+05140f0  vs       1.4.0a0+05140f0

Benchmark: resnet50 with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.101s     317/s     -0.3%  p95:    0.101s     317/s     -0.4%
   2 GPUs:  p75:    0.104s     306/s     -1.0%  p95:    0.104s     306/s     -1.0%
   4 GPUs:  p75:    0.105s     305/s     -1.6%  p95:    0.105s     304/s     -1.8%
   8 GPUs:  p75:    0.107s     299/s     -2.6%  p95:    0.107s     298/s     -2.7%
  16 GPUs:  p75:    0.108s     294/s     -3.8%  p95:    0.122s     262/s    -16.4%

Benchmark: resnet101 with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.172s     185/s     -1.2%  p95:    0.172s     185/s     -1.3%
   2 GPUs:  p75:    0.179s     178/s     -2.1%  p95:    0.179s     178/s     -2.0%
   4 GPUs:  p75:    0.180s     177/s     -2.6%  p95:    0.180s     177/s     -2.6%
   8 GPUs:  p75:    0.184s     173/s     -3.5%  p95:    0.184s     173/s     -3.5%
  16 GPUs:  p75:    0.187s     170/s     -0.1%  p95:    0.204s     157/s     -7.9%

Benchmark: resnext50_32x4d with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.149s     214/s     -1.0%  p95:    0.149s     214/s     -0.9%
   2 GPUs:  p75:    0.156s     205/s     -1.5%  p95:    0.156s     205/s     -1.6%
   4 GPUs:  p75:    0.156s     204/s     -1.6%  p95:    0.157s     204/s     -1.8%
   8 GPUs:  p75:    0.159s     200/s     -1.5%  p95:    0.159s     200/s     -1.5%
  16 GPUs:  p75:    0.161s     198/s     -1.9%  p95:    0.162s     197/s     -2.3%

Benchmark: resnext101_32x8d with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.427s      74/s     -0.8%  p95:    0.428s      74/s     -0.7%
   2 GPUs:  p75:    0.444s      72/s     -1.3%  p95:    0.445s      71/s     -0.7%
   4 GPUs:  p75:    0.444s      72/s     -1.1%  p95:    0.445s      71/s     -0.8%
   8 GPUs:  p75:    0.452s      70/s     -1.3%  p95:    0.452s      70/s     -1.3%
  16 GPUs:  p75:    0.455s      70/s     -0.7%  p95:    0.456s      70/s     -0.6%
```

This compares throughput between `bucket_cap_mb=25` (the default) and
`bucket_cap_mb=1` on 8 DGX machines with V100 GPUs. It confirms that
even for a relatively small model on machines with a very fast
interconnect (4x 100Gb InfiniBand per machine), it still pays off to
batch allreduce calls.