# Contributing to PyTorch Distributed

Please go through PyTorch's top-level [Contributing Guide](../../CONTRIBUTING.md) before proceeding with this guide.

[PyTorch Distributed Overview](https://pytorch.org/tutorials/beginner/dist_overview.html) is a great starting point, with many tutorials, documentation, and design docs covering PyTorch Distributed. We highly recommend going through some of that material before you start working on PyTorch Distributed.

This document focuses on the code structure of PyTorch Distributed and on some implementation details.

### Onboarding Tasks

Onboarding tasks are tracked under two GitHub issue queries: [issues labeled `topic: bootcamp`](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3A%22topic%3A+bootcamp%22) and [issues labeled `pt_distributed_rampup`](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3Apt_distributed_rampup).

## Code Pointers

The relevant code for each module lives either in the C++ C10D library or in the `torch` Python library.

#### Collectives and Communication Library (C10D)

This is the place to look for low-level communication APIs, process group creation, and similar functionality.

- API layer: [torch/distributed/distributed_c10d.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py)
- Python Bindings: [torch/csrc/distributed/c10d/init.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/init.cpp)
- Implementations: [torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp)

#### DTensor

- API layer: [torch/distributed/_tensor/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/api.py)
- Implementation: see other files in the same folder

#### Distributed Data Parallel (DDP)

- API layer: [torch/nn/parallel/distributed.py](https://github.com/pytorch/pytorch/blob/main/torch/nn/parallel/distributed.py)
- Reducer (backend that schedules allreduces): [torch/csrc/distributed/c10d/reducer.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp)
- Mixed Precision Hooks: [torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py)

#### Fully Sharded Data Parallel (FSDP)

- FSDP: [torch/distributed/fsdp/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/api.py)
- FSDP2: [torch/distributed/_composable/fsdp/fully_shard.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/fully_shard.py)
- Implementations: for each variant, see the other files in the same folder as its API

#### Tensor Parallel (TP)

- API layer: [torch/distributed/tensor/parallel/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/api.py)
- Implementation: see other files in the same folder

#### Pipeline Parallel (PP)

- Pipeline Schedules: [torch/distributed/pipelining/schedules.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/schedules.py)
- Pipeline Stage: [torch/distributed/pipelining/stage.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/stage.py)

## Adding Tests

Write tests for your changes just as you would in other parts of PyTorch. You may also need the distributed test infrastructure, either to run multi-process tests on multiple GPUs or to mock out communication with a `FakeProcessGroup`.

Most testing can be done from Python; the existing Python tests are in [test/distributed](https://github.com/pytorch/pytorch/tree/main/test/distributed).

For an example of using `MultiProcessTestCase` to run a test on multiple GPUs, see the tests in [test_c10d_nccl.py](https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py).

## Testing Your Changes

All the unit tests are under the [test/distributed](../../test/distributed) directory; RPC tests in particular are under [test/distributed/rpc](../../test/distributed/rpc). A few examples of how to run unit tests:

```
# Run the c10d unit tests.
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_gloo.py
python test/distributed/test_c10d_nccl.py

# Run the Store tests.
python test/distributed/test_store.py

# Run Process Group Wrapper tests.
python test/distributed/test_pg_wrapper.py

# Run distributed tests, including tests for Distributed Data Parallel.
python test/run_test.py --verbose -i distributed/test_distributed_spawn

# Run a single test in the test_distributed_spawn test suite.
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler

# Run a specific test method. Uses pytest (pip install pytest).
# ProcessGroup gloo/nccl test
pytest -vs test/distributed/test_c10d_common.py -k test_multi_limit_single_dtype
```
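
The single-test command for `test_distributed_spawn` combines a barrier file with three environment variables, which is easy to get wrong when scripting. The sketch below mirrors that one-liner in Python. The helpers `build_spawn_test_invocation` and `prepare_barrier_file` are hypothetical and not part of PyTorch; the paths and variable names are taken from the shell command.

```python
import os
import sys
from pathlib import Path


def build_spawn_test_invocation(test_name, backend="nccl", world_size=2, temp_dir="/tmp"):
    """Return (argv, env) for one test in test_distributed_spawn.py.

    Hypothetical helper: it only reproduces the shell one-liner in Python.
    """
    env = dict(os.environ)
    # The same variables the shell command sets inline; values must be strings.
    env.update(
        {"TEMP_DIR": temp_dir, "BACKEND": backend, "WORLD_SIZE": str(world_size)}
    )
    argv = [
        sys.executable,
        "test/distributed/test_distributed_spawn.py",
        "-v",
        test_name,
    ]
    return argv, env


def prepare_barrier_file(temp_dir):
    """Equivalent of `touch <temp_dir>/barrier` in the shell command."""
    barrier = Path(temp_dir) / "barrier"
    barrier.touch()
    return barrier


argv, env = build_spawn_test_invocation(
    "TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler"
)
print(" ".join(argv[1:]), env["BACKEND"], env["WORLD_SIZE"])
```

To actually launch the test, call `prepare_barrier_file(env["TEMP_DIR"])` and then `subprocess.run(argv, env=env, check=True)` from the repository root, since the script path is relative.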