# Distributed Data Parallel Benchmark

This tool is used to measure distributed training iteration time. This
is helpful for evaluating the performance impact of code changes to
`torch.nn.parallel.DistributedDataParallel`, `torch.distributed`, or
anything in between.

It optionally produces a JSON file with all measurements, allowing for
an easy A/B comparison of code, configuration, or environment. This
comparison can be produced by `diff.py`.

## Requirements

This benchmark depends on PyTorch and torchvision.

## How to run

Run as many copies of this script as you have model replicas.

If you launch a single task per machine with multiple GPUs, consider
using [`torch.distributed.launch`][launch] to spawn multiple processes
per machine.

[launch]: https://pytorch.org/docs/stable/distributed.html#launch-utility
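
To make "one copy per model replica" concrete, here is a minimal sketch
that enumerates the processes for a given cluster shape. It is not part
of the benchmark: `replica_layout` is a hypothetical helper, and the
contiguous per-machine rank numbering is an assumption chosen to mirror
how the launch utility derives a global rank from a node rank and a
local rank.

```python
from typing import List, Tuple


def replica_layout(num_machines: int, gpus_per_machine: int) -> List[Tuple[int, int, int]]:
    """Return (global_rank, machine_index, local_gpu_index) for every replica.

    Hypothetical helper for illustration only; the benchmark does not ship it.
    """
    layout = []
    for machine in range(num_machines):
        for local_gpu in range(gpus_per_machine):
            # Assumption: ranks are assigned contiguously per machine.
            global_rank = machine * gpus_per_machine + local_gpu
            layout.append((global_rank, machine, local_gpu))
    return layout


if __name__ == "__main__":
    # 2 machines with 8 GPUs each means 16 copies of the script in total.
    layout = replica_layout(num_machines=2, gpus_per_machine=8)
    print(f"world size: {len(layout)}")
    for global_rank, machine, local_gpu in layout:
        print(f"rank {global_rank:2d} -> machine {machine}, GPU {local_gpu}")
```

Each tuple corresponds to one process that has to be started, either by
hand or by a launcher running on each machine.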

Example output (only on rank 0):

```
-----------------------------------
PyTorch distributed benchmark suite
-----------------------------------

* PyTorch version: 1.4.0a0+05140f0
* CUDA version: 10.0
* Distributed backend: nccl

--- nvidia-smi topo -m ---

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_2  mlx5_0  mlx5_3  mlx5_1  CPU Affinity
GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     SYS     PHB     SYS     PIX     0-19,40-59
GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     SYS     PHB     SYS     PIX     0-19,40-59
GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     PIX     SYS     PHB     SYS     0-19,40-59
GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     PIX     SYS     PHB     SYS     0-19,40-59
GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     PHB     SYS     PIX     SYS     0-19,40-59
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      PHB     SYS     PIX     SYS     0-19,40-59
mlx5_2  SYS     SYS     SYS     SYS     PIX     PIX     PHB     PHB      X      SYS     PHB     SYS
mlx5_0  PIX     PIX     PHB     PHB     SYS     SYS     SYS     SYS     SYS      X      SYS     PHB
mlx5_3  SYS     SYS     SYS     SYS     PHB     PHB     PIX     PIX     PHB     SYS      X      SYS
mlx5_1  PHB     PHB     PIX     PIX     SYS     SYS     SYS     SYS     SYS     PHB     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

--------------------------


Benchmark: resnet50 with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.097s     329/s  p75:  0.097s     329/s  p90:  0.097s     329/s  p95:  0.097s     329/s
   1 GPUs --    1M/1G:  p50:  0.100s     319/s  p75:  0.100s     318/s  p90:  0.100s     318/s  p95:  0.100s     318/s
   2 GPUs --    1M/2G:  p50:  0.103s     310/s  p75:  0.103s     310/s  p90:  0.103s     310/s  p95:  0.103s     309/s
   4 GPUs --    1M/4G:  p50:  0.103s     310/s  p75:  0.103s     310/s  p90:  0.103s     310/s  p95:  0.103s     310/s
   8 GPUs --    1M/8G:  p50:  0.104s     307/s  p75:  0.104s     307/s  p90:  0.104s     306/s  p95:  0.104s     306/s
  16 GPUs --    2M/8G:  p50:  0.104s     306/s  p75:  0.104s     306/s  p90:  0.104s     306/s  p95:  0.104s     306/s

Benchmark: resnet101 with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.162s     197/s  p75:  0.162s     197/s  p90:  0.162s     197/s  p95:  0.162s     197/s
   1 GPUs --    1M/1G:  p50:  0.171s     187/s  p75:  0.171s     186/s  p90:  0.171s     186/s  p95:  0.172s     185/s
   2 GPUs --    1M/2G:  p50:  0.176s     182/s  p75:  0.176s     181/s  p90:  0.176s     181/s  p95:  0.176s     181/s
   4 GPUs --    1M/4G:  p50:  0.176s     182/s  p75:  0.176s     181/s  p90:  0.176s     181/s  p95:  0.176s     181/s
   8 GPUs --    1M/8G:  p50:  0.179s     179/s  p75:  0.179s     178/s  p90:  0.180s     178/s  p95:  0.180s     177/s
  16 GPUs --    2M/8G:  p50:  0.179s     178/s  p75:  0.180s     177/s  p90:  0.183s     174/s  p95:  0.188s     170/s

Benchmark: resnext50_32x4d with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.145s     220/s  p75:  0.145s     220/s  p90:  0.145s     220/s  p95:  0.145s     220/s
   1 GPUs --    1M/1G:  p50:  0.147s     217/s  p75:  0.147s     217/s  p90:  0.148s     216/s  p95:  0.148s     216/s
   2 GPUs --    1M/2G:  p50:  0.153s     209/s  p75:  0.153s     209/s  p90:  0.153s     209/s  p95:  0.153s     209/s
   4 GPUs --    1M/4G:  p50:  0.153s     208/s  p75:  0.153s     208/s  p90:  0.154s     208/s  p95:  0.154s     208/s
   8 GPUs --    1M/8G:  p50:  0.157s     204/s  p75:  0.157s     204/s  p90:  0.157s     203/s  p95:  0.157s     203/s
  16 GPUs --    2M/8G:  p50:  0.157s     203/s  p75:  0.157s     203/s  p90:  0.158s     203/s  p95:  0.158s     202/s

Benchmark: resnext101_32x8d with batch size 32

                            sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec      sec/iter    ex/sec
   1 GPUs --   no ddp:  p50:  0.415s      77/s  p75:  0.415s      77/s  p90:  0.416s      76/s  p95:  0.417s      76/s
   1 GPUs --    1M/1G:  p50:  0.425s      75/s  p75:  0.426s      75/s  p90:  0.426s      75/s  p95:  0.426s      75/s
   2 GPUs --    1M/2G:  p50:  0.438s      73/s  p75:  0.439s      72/s  p90:  0.439s      72/s  p95:  0.439s      72/s
   4 GPUs --    1M/4G:  p50:  0.439s      72/s  p75:  0.439s      72/s  p90:  0.440s      72/s  p95:  0.440s      72/s
   8 GPUs --    1M/8G:  p50:  0.447s      71/s  p75:  0.447s      71/s  p90:  0.448s      71/s  p95:  0.448s      71/s
  16 GPUs --    2M/8G:  p50:  0.450s      71/s  p75:  0.451s      70/s  p90:  0.451s      70/s  p95:  0.451s      70/s
```
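
The `sec/iter` and `ex/sec` columns above appear to be related by
`ex/sec = batch size / sec/iter` per replica (for example,
32 / 0.097 ≈ 329.9, reported as 329/s). The sketch below shows one way
such a summary could be derived from raw iteration timings. It is an
illustration under assumptions, not the benchmark's actual code:
`percentile` and `summarize` are hypothetical names, and the
nearest-rank percentile method is an assumption, since this README does
not say which method the tool uses.

```python
from typing import Dict, List


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    n = len(sorted_values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * n + 99) // 100)
    return sorted_values[rank - 1]


def summarize(iteration_times: List[float], batch_size: int) -> Dict[str, Dict[str, float]]:
    """Map p50/p75/p90/p95 to seconds per iteration and examples per second."""
    ordered = sorted(iteration_times)
    summary = {}
    for pct in (50, 75, 90, 95):
        sec_per_iter = percentile(ordered, pct)
        summary[f"p{pct}"] = {
            "sec_per_iter": sec_per_iter,
            # Throughput of a single replica processing one batch per iteration.
            "ex_per_sec": batch_size / sec_per_iter,
        }
    return summary


if __name__ == "__main__":
    # Made-up timings in seconds, purely for demonstration.
    times = [0.101, 0.099, 0.100, 0.104, 0.100, 0.102, 0.100, 0.103]
    for name, row in summarize(times, batch_size=32).items():
        print(f"{name}: {row['sec_per_iter']:.3f}s  {row['ex_per_sec']:.0f}/s")
```

Because higher percentiles pick slower iterations, `ex_per_sec` can only
stay equal or decrease from p50 to p95, which is the pattern visible in
the tables above.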

## How to diff

Run the benchmark with the `--json PATH_TO_REPORT_FILE` argument to
produce the JSON file that the diff script can consume.

Then, run the diff script as follows:

```
$ python3 diff.py PATH_TO_BASELINE_FILE PATH_TO_TEST_FILE
                                 baseline                      test
                     --------------------      --------------------
bucket_size:                           25  vs                     1
cuda_version:                        10.0  vs                  10.0
distributed_backend:                 nccl  vs                  nccl
pytorch_version:          1.4.0a0+05140f0  vs       1.4.0a0+05140f0

Benchmark: resnet50 with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.101s     317/s     -0.3%  p95:    0.101s     317/s     -0.4%
   2 GPUs:  p75:    0.104s     306/s     -1.0%  p95:    0.104s     306/s     -1.0%
   4 GPUs:  p75:    0.105s     305/s     -1.6%  p95:    0.105s     304/s     -1.8%
   8 GPUs:  p75:    0.107s     299/s     -2.6%  p95:    0.107s     298/s     -2.7%
  16 GPUs:  p75:    0.108s     294/s     -3.8%  p95:    0.122s     262/s    -16.4%

Benchmark: resnet101 with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.172s     185/s     -1.2%  p95:    0.172s     185/s     -1.3%
   2 GPUs:  p75:    0.179s     178/s     -2.1%  p95:    0.179s     178/s     -2.0%
   4 GPUs:  p75:    0.180s     177/s     -2.6%  p95:    0.180s     177/s     -2.6%
   8 GPUs:  p75:    0.184s     173/s     -3.5%  p95:    0.184s     173/s     -3.5%
  16 GPUs:  p75:    0.187s     170/s     -0.1%  p95:    0.204s     157/s     -7.9%

Benchmark: resnext50_32x4d with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.149s     214/s     -1.0%  p95:    0.149s     214/s     -0.9%
   2 GPUs:  p75:    0.156s     205/s     -1.5%  p95:    0.156s     205/s     -1.6%
   4 GPUs:  p75:    0.156s     204/s     -1.6%  p95:    0.157s     204/s     -1.8%
   8 GPUs:  p75:    0.159s     200/s     -1.5%  p95:    0.159s     200/s     -1.5%
  16 GPUs:  p75:    0.161s     198/s     -1.9%  p95:    0.162s     197/s     -2.3%

Benchmark: resnext101_32x8d with batch size 32

                  sec/iter    ex/sec      diff        sec/iter    ex/sec      diff
   1 GPUs:  p75:    0.427s      74/s     -0.8%  p95:    0.428s      74/s     -0.7%
   2 GPUs:  p75:    0.444s      72/s     -1.3%  p95:    0.445s      71/s     -0.7%
   4 GPUs:  p75:    0.444s      72/s     -1.1%  p95:    0.445s      71/s     -0.8%
   8 GPUs:  p75:    0.452s      70/s     -1.3%  p95:    0.452s      70/s     -1.3%
  16 GPUs:  p75:    0.455s      70/s     -0.7%  p95:    0.456s      70/s     -0.6%
```

This compares throughput between `bucket_cap_mb=25` (the default) and
`bucket_cap_mb=1` on 8 DGX machines with V100 GPUs. It confirms that
even for a relatively small model on machines with a very fast
interconnect (4x 100Gb InfiniBand per machine), it still pays off to
batch allreduce calls.
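
When reading the `diff` column, it can help to think of it as a relative
change against the baseline. The sketch below shows one plausible
definition based on throughput. This is an assumption for illustration:
this README does not state the exact formula `diff.py` uses, and
`relative_diff_percent` is a hypothetical helper, not part of the
script.

```python
def relative_diff_percent(baseline_ex_per_sec: float, test_ex_per_sec: float) -> float:
    """Relative throughput change of the test run versus the baseline, in percent.

    Negative values mean the test configuration is slower than the baseline.
    Assumed definition; the actual diff script may compute this differently.
    """
    if baseline_ex_per_sec <= 0:
        raise ValueError("baseline throughput must be positive")
    return (test_ex_per_sec - baseline_ex_per_sec) / baseline_ex_per_sec * 100.0


if __name__ == "__main__":
    # Made-up throughput numbers (examples per second), for demonstration only.
    change = relative_diff_percent(baseline_ex_per_sec=200.0, test_ex_per_sec=190.0)
    print(f"{change:+.1f}%")
```

Under this definition, the consistently negative values in the output
above would indicate that the smaller bucket size lowers throughput
relative to the default.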