# Contributing to PyTorch Distributed

Please go through PyTorch's top-level [Contributing Guide](../../CONTRIBUTING.md) before proceeding with this guide.

[PyTorch Distributed Overview](https://pytorch.org/tutorials//beginner/dist_overview.html) is a great starting point, with many tutorials, documentation pages, and design docs covering PyTorch Distributed. We highly recommend going through some of that material before you start working on PyTorch Distributed.

This document focuses on the code structure of PyTorch Distributed and its implementation details.

### Onboarding Tasks

A list of onboarding tasks can be found [here](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3A%22topic%3A+bootcamp%22) and [here](https://github.com/pytorch/pytorch/issues?q=is%3Aopen+is%3Aissue+label%3A%22module%3A+distributed%22+label%3Apt_distributed_rampup).

## Code Pointers

The relevant code for the different modules lives either in the C++ C10D library or in the `torch` Python library.

#### Collectives and Communication Library (C10D)

This is the place to look if you are trying to find low-level communication APIs, process group creation, etc.
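To get oriented before diving into the files listed here, the following is a minimal sketch of the collective API that this layer exposes. It uses a single process (`world_size=1`) and the gloo backend so it runs on CPU without a launcher; real jobs are started across multiple processes, e.g. via `torchrun`.

```python
# Minimal sketch of the c10d Python API surface: process group creation
# plus one collective. Runs standalone in a single process.
import os
import torch
import torch.distributed as dist

# Rendezvous info normally supplied by the launcher; set it here so the
# default env:// init method works in a single process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # in-place sum across all ranks
print(t.tolist())  # -> [1.0, 1.0, 1.0, 1.0] (a no-op at world_size=1)

dist.destroy_process_group()
```

With more than one rank, each rank would contribute its own tensor and all ranks would end up with the elementwise sum. The files below define this API, its Python bindings, and the backend implementations.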
- API layer: [torch/distributed/distributed_c10d.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py)
- Python Bindings: [torch/csrc/distributed/c10d/init.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/init.cpp)
- Implementations: [torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp)

#### DTensor

- API layer: [torch/distributed/_tensor/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/api.py)
- Implementation: see other files in the same folder

#### Distributed Data Parallel (DDP)

- API layer: [torch/nn/parallel/distributed.py](https://github.com/pytorch/pytorch/blob/main/torch/nn/parallel/distributed.py)
- Reducer (the backend that buckets gradients and schedules allreduces): [torch/csrc/distributed/c10d/reducer.cpp](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp)
- Mixed Precision Hooks: [torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py)

#### Fully Sharded Data Parallel (FSDP)

- FSDP: [torch/distributed/fsdp/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/api.py)
- FSDP2: [torch/distributed/_composable/fsdp/fully_shard.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/fully_shard.py)
- Implementations are contained in other files in the same folder as the API for each variant

#### Tensor Parallel (TP)

- API layer: [torch/distributed/tensor/parallel/api.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/api.py)
- Implementation: see other files in the same folder

#### Pipeline Parallel (PP)

- Pipeline Schedules: [torch/distributed/pipelining/schedules.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/schedules.py)
- Pipeline Stage: [torch/distributed/pipelining/stage.py](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/stage.py)

## Adding Tests

You should write tests for your changes just like in other parts of PyTorch, but you may need test infrastructure to run multi-process tests on multiple GPUs, or a `FakeProcessGroup` to mock out communication.

Most testing can be done from Python; you can find the existing Python tests [here](https://github.com/pytorch/pytorch/tree/main/test/distributed).

For an example of using `MultiProcessTestCase` to run a test on multiple GPUs, see the tests in [test_c10d_nccl.py](https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py).

## Testing Your Changes

All the unit tests can be found under the [test/distributed](../../test/distributed) directory, and RPC tests in particular are under [test/distributed/rpc](../../test/distributed/rpc). A few examples of how to run the unit tests:

```
# Run the c10d unit tests.
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_gloo.py
python test/distributed/test_c10d_nccl.py

# Run the Store tests.
python test/distributed/test_store.py

# Run Process Group Wrapper tests.
python test/distributed/test_pg_wrapper.py

# Run distributed tests, including tests for Distributed Data Parallel.
python test/run_test.py --verbose -i distributed/test_distributed_spawn

# Run a single test in the test_distributed_spawn test suite.
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler

# Run a specific test method. Uses pytest (pip install pytest).
# ProcessGroup gloo/nccl test
pytest -vs test/distributed/test_c10d_common.py -k test_multi_limit_single_dtype
```