# Contributing to PyTorch Distributed

Please go through PyTorch's top level [Contributing Guide](../../CONTRIBUTING.md) before proceeding with this guide.

[PyTorch Distributed Overview](https://pytorch.org/tutorials/beginner/dist_overview.html) is a great starting point, with many tutorials, documentation pages, and design docs covering PyTorch Distributed. We highly recommend going through some of that material before you start working on PyTorch Distributed.

In this document, we focus on the code structure of PyTorch Distributed and some of its implementation details.

### Onboarding Tasks

A list of onboarding tasks can be found [here](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3A%22topic%3A+bootcamp%22) and [here](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3Apt_distributed_rampup).

## Code Pointers

The relevant code for the different modules lives either in the C++ C10D library or in the torch Python library.

#### Collectives and Communication Library (C10D)

This is the place to look if you are trying to find low-level communication APIs, process group creation, etc. A minimal usage sketch follows the code pointers below.

- API layer: [torch/distributed/distributed_c10d.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py)
- Python Bindings: [torch/csrc/distributed/c10d/init.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/init.cpp)
- Implementations: [torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp)
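
For orientation, here is a minimal sketch of the Python API layer: it initializes a process group and runs an allreduce. The Gloo backend and the `torchrun` launch command are assumptions for illustration; the same calls work with NCCL on GPUs.

```
# Minimal c10d sketch. Launch (assumption): torchrun --nproc-per-node=2 c10d_example.py
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for us.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Each rank contributes its own rank id; the allreduce sums them everywhere.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```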

#### DTensor

- API layer: [torch/distributed/_tensor/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/api.py)
- Implementation: see other files in the same folder
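
As a rough sketch of the API (note that the `_tensor` module path is private and may move between releases), `distribute_tensor` shards a regular tensor across a device mesh:

```
# Hedged DTensor sketch. Launch (assumption): torchrun --nproc-per-node=2 dtensor_example.py
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

dist.init_process_group(backend="gloo")
mesh = DeviceMesh("cpu", list(range(dist.get_world_size())))

# Shard an 8x4 tensor along dim 0; with 2 ranks, each holds a 4x4 local slice.
big = torch.randn(8, 4)
dt = distribute_tensor(big, mesh, placements=[Shard(0)])
print(f"rank {dist.get_rank()}: local shard shape = {tuple(dt.to_local().shape)}")

dist.destroy_process_group()
```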

#### Distributed Data Parallel (DDP)

- API layer: [torch/nn/parallel/distributed.py](https://github.com/pytorch/pytorch/blob/main/torch/nn/parallel/distributed.py)
- Reducer (backend that schedules allreduces): [torch/csrc/distributed/c10d/reducer.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp)
- Mixed Precision Hooks: [torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py)
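
To see how these pieces fit together, here is a hedged sketch that wraps a model in DDP and registers the fp16 compression hook from the `default_hooks` module above, so gradients are compressed before the reducer's bucketed allreduce:

```
# Hedged DDP sketch. Launch (assumption): torchrun --nproc-per-node=2 ddp_example.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")
model = DDP(nn.Linear(10, 10))

# Gradients are cast to fp16 before the reducer's bucketed allreduce.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

loss = model(torch.randn(4, 10)).sum()
loss.backward()  # the reducer schedules the (hooked) allreduce during backward
dist.destroy_process_group()
```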

#### Fully Sharded Data Parallel (FSDP)

- FSDP: [torch/distributed/fsdp/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/api.py)
- FSDP2: [torch/distributed/_composable/fsdp/fully_shard.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/fully_shard.py)
- Implementations are contained in other files in the same folder as the API for each variant.
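
A hedged FSDP2 sketch, assuming two GPUs are available and that the private `_composable.fsdp` import path above is current (it may move between releases):

```
# Hedged FSDP2 sketch. Launch (assumption): torchrun --nproc-per-node=2 fsdp2_example.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16)).cuda()
# Shard each submodule's parameters across ranks, then wrap the root.
for layer in model:
    fully_shard(layer)
fully_shard(model)

loss = model(torch.randn(8, 16, device="cuda")).sum()
loss.backward()  # gradients are reduce-scattered across ranks
dist.destroy_process_group()
```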

#### Tensor Parallel (TP)

- API layer: [torch/distributed/tensor/parallel/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/api.py)
- Implementation: see other files in the same folder
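
A hedged sketch of the API layer, assuming two ranks: the classic Megatron-style pairing splits the first linear by columns and the second by rows, so only one collective is needed at the output boundary.

```
# Hedged TP sketch. Launch (assumption): torchrun --nproc-per-node=2 tp_example.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

dist.init_process_group(backend="gloo")
mesh = init_device_mesh("cpu", (dist.get_world_size(),))

model = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 8))
# Colwise then rowwise keeps the intermediate activation sharded; the plan
# keys "0" and "1" are the names of the Sequential's children.
model = parallelize_module(model, mesh, {"0": ColwiseParallel(), "1": RowwiseParallel()})

print(f"rank {dist.get_rank()}: output shape = {model(torch.randn(4, 8)).shape}")
dist.destroy_process_group()
```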

#### Pipeline Parallel (PP)

- Pipeline Schedules: [torch/distributed/pipelining/schedules.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/schedules.py)
- Pipeline Stage: [torch/distributed/pipelining/stage.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/stage.py)
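
A hedged sketch of manual pipeline construction, assuming two ranks with one stage each; the `PipelineStage` constructor has shifted across releases, so treat this as illustrative rather than definitive:

```
# Hedged pipelining sketch. Launch (assumption): torchrun --nproc-per-node=2 pp_example.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

dist.init_process_group(backend="gloo")
rank, world = dist.get_rank(), dist.get_world_size()

# Each rank owns one stage of a two-stage model.
stage_mod = nn.Linear(8, 8)
stage = PipelineStage(stage_mod, stage_index=rank, num_stages=world, device=torch.device("cpu"))

# GPipe schedule: split the batch of 4 into 2 microbatches.
schedule = ScheduleGPipe(stage, n_microbatches=2)
if rank == 0:
    schedule.step(torch.randn(4, 8))  # first stage feeds inputs
else:
    out = schedule.step()             # last stage returns outputs
dist.destroy_process_group()
```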

## Adding Tests

You should write tests for your changes just like in other parts of PyTorch, but you may need to use the distributed test infrastructure to run multi-process tests on multiple GPUs, or a FakeProcessGroup to mock out communication.

Most testing can be done from Python, and you can find the existing Python tests [here](https://github.com/pytorch/pytorch/tree/main/test/distributed).

For an example of using `MultiProcessTestCase` to run a test on multiple GPUs, see the tests in [test_c10d_nccl.py](https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py).
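
The skeleton below is a hedged sketch of that pattern, modeled on the existing tests; the class and test names are made up for illustration, while `MultiProcessTestCase` and `run_tests` come from `torch.testing._internal`.

```
# Hedged MultiProcessTestCase sketch; class and test names are illustrative.
import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase
from torch.testing._internal.common_utils import run_tests

class MyAllreduceTest(MultiProcessTestCase):
    @property
    def world_size(self) -> int:
        return 2

    def setUp(self):
        super().setUp()
        self._spawn_processes()  # each test method runs in world_size processes

    def test_allreduce_sum(self):
        # Each spawned process initializes its own process group via a FileStore.
        store = dist.FileStore(self.file_name, self.world_size)
        dist.init_process_group("gloo", store=store, rank=self.rank, world_size=self.world_size)
        t = torch.ones(2) * self.rank
        dist.all_reduce(t)
        self.assertEqual(t, torch.ones(2) * sum(range(self.world_size)))
        dist.destroy_process_group()

if __name__ == "__main__":
    run_tests()
```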

## Testing Your Changes

All the unit tests live under the [test/distributed](../../test/distributed) directory, and the RPC tests in particular are under [test/distributed/rpc](../../test/distributed/rpc). A few examples of how to run them:

```
# Run the c10d unit tests.
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_gloo.py
python test/distributed/test_c10d_nccl.py

# Run the Store tests.
python test/distributed/test_store.py

# Run Process Group Wrapper tests.
python test/distributed/test_pg_wrapper.py

# Run distributed tests, including tests for Distributed Data Parallel.
python test/run_test.py --verbose -i distributed/test_distributed_spawn

# Run a single test in the test_distributed_spawn test suite.
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler

# Run a specific test method. Uses pytest (pip install pytest).
# ProcessGroup gloo/nccl test
pytest -vs test/distributed/test_c10d_common.py -k test_multi_limit_single_dtype
```