PyTorch distributed training error: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 35000

Published 2023-09-05 22:43:19 · Author: 脂环

I had been using the fairly old torch 1.8.1; after switching to torch 2.0, training failed with "Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 35000".

Replace the initialization at the top of the main function

distributed.init_process_group(backend='nccl', init_method='env://')
device_id, device = opts.local_rank, torch.device(opts.local_rank)  # opts.local_rank presumably parsed from a --local_rank CLI argument
rank, world_size = distributed.get_rank(), distributed.get_world_size()
torch.cuda.set_device(device_id)

with:

torch.distributed.init_process_group("nccl")
rank, world_size = distributed.get_rank(), distributed.get_world_size()
device_id = rank % torch.cuda.device_count()  # map each rank to its own GPU
device = torch.device(device_id)

and the error goes away. The old code took the device index from opts.local_rank; torchrun in torch 2.0 no longer appends --local_rank to the script's arguments (it exports the LOCAL_RANK environment variable instead), so the argparse default of 0 was likely used by every process, putting all ranks on the same GPU — which is exactly what NCCL rejects as "Duplicate GPU detected". Deriving the device index from the rank gives each process its own GPU.
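To make the mechanism concrete, here is a hypothetical, CPU-only sketch (no torch or GPUs needed). The helper names `buggy_device`, `fixed_device`, and `env_device` are illustrative, not part of the original code; the sketch assumes a single node where the number of processes equals the number of GPUs, and that torchrun exports LOCAL_RANK instead of passing --local_rank:

```python
import os

def buggy_device(cli_local_rank: int = 0) -> int:
    # Old code: device comes from opts.local_rank, which stays at its
    # argparse default of 0 in every process when torchrun does not pass it.
    return cli_local_rank

def fixed_device(rank: int, num_gpus: int) -> int:
    # New code: rank % torch.cuda.device_count() spreads ranks across GPUs.
    return rank % num_gpus

def env_device(default: int = 0) -> int:
    # Alternative: read the LOCAL_RANK variable that torchrun exports.
    return int(os.environ.get("LOCAL_RANK", default))

world_size, num_gpus = 4, 4
print([buggy_device() for _ in range(world_size)])             # [0, 0, 0, 0] -> "Duplicate GPU detected"
print([fixed_device(r, num_gpus) for r in range(world_size)])  # [0, 1, 2, 3] -> one GPU per rank
```

With the buggy path every rank selects device 0, which is the duplicate NCCL complains about; the rank-based mapping assigns each rank a distinct device index.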