# Contributing to PyTorch Distributed

Please go through PyTorch's top-level [Contributing Guide](../../CONTRIBUTING.md) before proceeding with this guide.

[PyTorch Distributed Overview](https://pytorch.org/tutorials//beginner/dist_overview.html) is a great starting point, with many tutorials, documentation and design docs covering PyTorch Distributed. We highly recommend going through some of that material before you start working on PyTorch Distributed.

This document mostly focuses on the code structure and implementation details of PyTorch Distributed.

### Onboarding Tasks

A list of onboarding tasks can be found [here](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3A%22topic%3A+bootcamp%22) and [here](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3Apt_distributed_rampup).


## Code Pointers

The relevant code for different modules is either inside the C++ C10D library or the torch Python library.

#### Collectives and Communication Library (C10D)

This is the place to look if you are trying to find low-level communication APIs, process group creation, etc.

- API layer: [torch/distributed/distributed_c10d.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py)
- Python Bindings: [torch/csrc/distributed/c10d/init.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/init.cpp)
- Implementations: [torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp)

#### DTensor

- API layer: [torch/distributed/_tensor/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/api.py)
- Implementation: see other files in the same folder

#### Distributed Data Parallel (DDP)

- API layer: [torch/nn/parallel/distributed.py](https://github.com/pytorch/pytorch/blob/main/torch/nn/parallel/distributed.py)
- Reducer (backend that schedules allreduces): [torch/csrc/distributed/c10d/reducer.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp)
- Mixed Precision Hooks: [torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py)

#### Fully Sharded Data Parallel (FSDP)

- FSDP: [torch/distributed/fsdp/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/api.py)
- FSDP2: [torch/distributed/_composable/fsdp/fully_shard.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/fully_shard.py)
- Implementations are contained in other files in the same folder as the API for each variant

#### Tensor Parallel (TP)

- API layer: [torch/distributed/tensor/parallel/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/api.py)
- Implementation: see other files in the same folder

#### Pipeline Parallel (PP)

- Pipeline Schedules: [torch/distributed/pipelining/schedules.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/schedules.py)
- Pipeline Stage: [torch/distributed/pipelining/stage.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/stage.py)


## Adding Tests

You should write tests for your changes just like in other parts of PyTorch, but you may need some extra test infrastructure: either run multi-process tests on multiple GPUs, or use a FakeProcessGroup to mock out communications.

Most testing can be done from Python, and you can find existing Python tests [here](https://github.com/pytorch/pytorch/tree/main/test/distributed).

For an example of using the MultiProcessTestCase to run a test on multiple GPUs, see the tests in [test_c10d_nccl.py](https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py).

## Testing Your Changes

All the unit tests can be found under the [test/distributed](../../test/distributed) directory, and RPC tests in particular are under [test/distributed/rpc](../../test/distributed/rpc). A few examples of how to run unit tests:

```
# Run the c10d unit tests.
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_gloo.py
python test/distributed/test_c10d_nccl.py

# Run the Store tests.
python test/distributed/test_store.py

# Run Process Group Wrapper tests.
python test/distributed/test_pg_wrapper.py

# Run distributed tests, including tests for Distributed Data Parallel.
python test/run_test.py --verbose -i distributed/test_distributed_spawn

# Run a single test in the test_distributed_spawn test suite.
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler

# Run a specific test method. Uses pytest (pip install pytest).
# ProcessGroup gloo/nccl test
pytest -vs test/distributed/test_c10d_common.py -k test_multi_limit_single_dtype
```
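
The single-test command for `test_distributed_spawn` combines three steps: it creates a barrier file, sets the `TEMP_DIR`, `BACKEND` and `WORLD_SIZE` environment variables, and passes the test ID to the test script. If you want to script repeated runs (for example, looping over backends), the same setup could be sketched in Python as follows. The helper names `spawn_test_invocation` and `touch_barrier` are hypothetical and not part of PyTorch; the sketch only mirrors the shell command and assumes it is used from the root of a PyTorch checkout.

```python
import os
import sys
from pathlib import Path


def spawn_test_invocation(test_id, backend="nccl", world_size=2, temp_dir="/tmp"):
    """Build (argv, env) for running one test from test_distributed_spawn.

    Hypothetical helper (not a PyTorch API): it reproduces the variables and
    arguments of the shell command and does not launch anything itself.
    """
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    env.update(
        TEMP_DIR=str(temp_dir),
        BACKEND=backend,
        WORLD_SIZE=str(world_size),  # environment values must be strings
    )
    argv = [
        sys.executable,
        "test/distributed/test_distributed_spawn.py",
        "-v",
        test_id,
    ]
    return argv, env


def touch_barrier(temp_dir="/tmp"):
    """Python equivalent of `touch <temp_dir>/barrier`; returns the file path."""
    barrier = Path(temp_dir) / "barrier"
    barrier.touch(exist_ok=True)
    return barrier
```

To actually run the test, call `touch_barrier()` and then pass the returned pair to `subprocess.run(argv, env=env)` from the repository root. Keeping command construction separate from execution makes it easy to inspect or log the exact invocation before anything is launched.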