Using MPI and NCCL at the same time in distributed NVIDIA CUDA PyTorch can cause deadlocks: distributed PyTorch should not drive the same GPUs with both the MPI and NCCL backends concurrently

Posted: 2023-08-06 07:28:00  Author: Death_Knight

Reference:

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi


===========================================


Honestly, I did not expect anyone to open two backends at once when running distributed PyTorch, yet someone has in fact raised exactly this question:

https://github.com/mpi4py/mpi4py/discussions/25


Since the question has been asked, someone has evidently tried this in practice, so it is worth noting here.
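
To make the scenario concrete, here is a minimal, hypothetical sketch of what "two backends at once" can look like when mpi4py is mixed with PyTorch's NCCL backend. The rendezvous settings and tensor shapes are made up, and whether it actually hangs depends on the MPI implementation and on timing:

```python
# A hypothetical sketch of the risky pattern, not code from the linked
# discussion: the same GPUs are driven both by a PyTorch NCCL process group
# and by CUDA-aware MPI (mpi4py >= 3.1 built against a CUDA-aware MPI can
# consume CUDA tensors through __cuda_array_interface__).
import os
import torch
import torch.distributed as dist
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# NCCL backend for PyTorch collectives; rendezvous settings are illustrative.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=rank, world_size=world_size)

x = torch.ones(1 << 20, device="cuda")
y = torch.ones(1 << 20, device="cuda")

# The NCCL all-reduce is enqueued on the device, but it only completes once
# every rank in the communicator has launched its NCCL kernel ...
work = dist.all_reduce(x, async_op=True)

# ... while a CUDA-aware MPI all-reduce is issued on the same GPU at the same
# time. Depending on the MPI implementation this can deadlock: NCCL kernels
# wait for peers that are blocked inside MPI, and vice versa.
comm.Allreduce(MPI.IN_PLACE, y)

work.wait()
dist.destroy_process_group()
```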

NVIDIA's official answer in the NCCL documentation:

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi


Inter-GPU Communication with CUDA-aware MPI

Using NCCL to perform inter-GPU communication concurrently with CUDA-aware MPI may create deadlocks.

NCCL creates inter-device dependencies, meaning that after it has been launched, a NCCL kernel will wait (and potentially block the CUDA device) until all ranks in the communicator launch their NCCL kernel. CUDA-aware MPI may also create such dependencies between devices depending on the MPI implementation.

Using both MPI and NCCL to perform transfers between the same sets of CUDA devices concurrently is therefore not guaranteed to be safe.
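
The quoted paragraphs explain why the combination is unsafe rather than how to work around it. One common mitigation (my own sketch, not an API from NCCL, MPI, or PyTorch) is to serialize the two libraries so they never operate on the same GPUs concurrently, for example by synchronizing all outstanding NCCL work and staging the MPI transfer through host memory:

```python
# Hypothetical helper: run an MPI all-reduce without ever letting MPI and
# NCCL touch the GPU at the same time.  The function name and the
# host-staging strategy are this post's illustration, not a library API.
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD

def mpi_allreduce_serialized(gpu_tensor: torch.Tensor) -> torch.Tensor:
    # 1. Wait for every outstanding CUDA/NCCL kernel on this device, so no
    #    NCCL collective is still pending when MPI starts blocking.
    torch.cuda.synchronize()
    # 2. Stage through host memory: plain (non-CUDA-aware) MPI is enough,
    #    and MPI itself creates no inter-device dependency this way.
    host = gpu_tensor.detach().to("cpu")
    comm.Allreduce(MPI.IN_PLACE, host.numpy())
    # 3. Copy the result back; NCCL collectives can safely be launched again
    #    once this call returns.
    return host.to(gpu_tensor.device)
```

The price is an extra device-to-host round trip, which is exactly the overhead NCCL (or CUDA-aware MPI on its own) is meant to avoid, so in practice it is usually simpler to pick a single library for all GPU traffic.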