PyTorch single-machine multi-GPU acceleration

Overview

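It took two months of nonstop work to get from receiving the raw data to loading the model in the engineering project and passing its tests. I don't think I have had a day off since going back to work after New Year's Day. Yesterday the project passed its phase-three review, so I can finally catch my breath and catch up on some write-ups. (I wrote this before the Lunar New Year and am reposting it today.)
Multi-GPU parallelism
Always use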

torch.nn.parallel.DistributedDataParallel() 
torch.nn.parallel.DistributedDataParallel() 
torch.nn.parallel.DistributedDataParallel()

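Important things are worth saying three times!
Do not use torch.nn.DataParallel(model). In my case it gave almost no speedup. The project pressure left me no time to dig through the API and find out why, so I simply avoided it. After switching to DDP, each epoch dropped from 900 s to 250 s (two 3090s, batch size 6), and that result speaks for itself.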

Import the packages and add the initialization code

import torch
from torch import distributed as dist

torch.distributed.init_process_group(backend="nccl")

Parallelizing the model

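model = your_model()

# On a single machine the global rank doubles as the local GPU index.
local_rank = torch.distributed.get_rank()
# Select this process's GPU before moving the model onto it.
torch.cuda.set_device(local_rank)
model = model.cuda(local_rank)
# Each process drives exactly one GPU, so device_ids holds only its own device.
model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[local_rank],
                                                  output_device=local_rank,
                                                  find_unused_parameters=False,
                                                  broadcast_buffers=False)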

Changing how the data is loaded

from torch.utils.data import DataLoader

train_dataset = Your_Dataset()
sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
loader_train = DataLoader(train_dataset,
                          batch_size=cfg.batch_size,
                          shuffle=False,  # shuffling is delegated to the sampler
                          num_workers=16,
                          pin_memory=True,
                          drop_last=True,
                          sampler=sampler)

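Note: the only difference from ordinary loading is the extra sampler, with shuffle set to False in DataLoader(). This does not mean the data is no longer randomized; the shuffling is simply handed over to the sampler.
That is all it takes, and training will now run in parallel. However, you will notice that the data does not seem to be reshuffled: the loading order is the same in every epoch! The official API documentation shows that one more line is needed, which reseeds the sampler at the start of each epoch: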

sampler.set_epoch(epoch)
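
To see why this one line matters, here is a minimal pure-Python sketch of the idea behind DistributedSampler. It is a simplified model and not PyTorch's implementation: `shard_indices` and its parameters are hypothetical names, and Python's `random` module stands in for PyTorch's generator. The point it illustrates is that every rank shuffles with the same seed and the epoch is folded into that seed, so if the epoch never changes, neither does the order.

```python
import random


def shard_indices(dataset_len, num_replicas, rank, epoch, seed=0, shuffle=True):
    """Return the sample indices one rank would see in one epoch (simplified model).

    Assumes dataset_len >= num_replicas.
    """
    indices = list(range(dataset_len))
    if shuffle:
        # Every rank uses the same seed, so all ranks agree on the permutation.
        # Folding the epoch into the seed is what changes the order per epoch.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating the head so the list splits evenly across ranks.
    total = -(-dataset_len // num_replicas) * num_replicas  # round up to a multiple
    indices += indices[: total - len(indices)]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total:num_replicas]


if __name__ == "__main__":
    for epoch in range(2):
        for rank in range(2):
            print(epoch, rank, shard_indices(8, 2, rank, epoch))
```

Calling it twice with the same epoch gives the same indices, which is the behavior observed when set_epoch is never called.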

So the final code looks roughly like this:

import torch
from torch import distributed as dist
from torch.utils.data import DataLoader

torch.distributed.init_process_group(backend="nccl")

model = your_model()

local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
model = model.cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[local_rank],
                                                  output_device=local_rank,
                                                  find_unused_parameters=False,
                                                  broadcast_buffers=False)
# ...
train_dataset = Your_Dataset()
sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
loader_train = DataLoader(train_dataset,
                          batch_size=cfg.batch_size,
                          shuffle=False,
                          num_workers=16,
                          pin_memory=True,
                          drop_last=True,
                          sampler=sampler)
# ...
for epoch in range(cfg.epoch):
    sampler.set_epoch(epoch)  # reshuffle with a new seed each epoch
    # ...
    if local_rank == 0:  # save from one process only
        torch.save(model.state_dict(), save_path)

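Finally, remember to do this on Linux; otherwise you will keep running into inexplicable errors. Also, running the code as written produces a warning that I had no time to deal with while rushing the project. If anyone knows the cause, please let me know. Thanks!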

Running

python -m torch.distributed.launch --nproc_per_node=2 train.py
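
The launcher starts one process per GPU and tells each process which GPU it owns. Depending on the PyTorch version, that index arrives either as a `--local_rank` (or `--local-rank`) command-line argument or as the `LOCAL_RANK` environment variable. The following is a minimal sketch of how train.py could accept both; `get_local_rank` is a hypothetical helper for illustration and not part of PyTorch.

```python
import argparse
import os


def get_local_rank(argv=None, environ=None):
    """Best-effort lookup of this process's local GPU index (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # Some launcher versions export LOCAL_RANK for every worker process.
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    # Others append --local_rank=N to the script's arguments instead.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    # parse_known_args ignores any other arguments the script defines.
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    print("local rank:", get_local_rank())
```

On a single machine this value matches what torch.distributed.get_rank() returns in the code above, so either can be used to pick the GPU.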

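Loading the model
A model trained on multiple GPUs and then saved has an extra "module." prefix in front of every parameter name. Strip that prefix when loading: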

from collections import OrderedDict

import torch


def load_parallal_model(model, pretrain_dir):
    state_dict_ = torch.load(pretrain_dir, map_location='cuda:0')
    print('loaded pretrained weights from %s !' % pretrain_dir)
    state_dict = OrderedDict()

    # strip the 'module.' prefix added by the parallel wrapper
    for key in state_dict_:
        if key.startswith('module') and not key.startswith('module_list'):
            state_dict[key[7:]] = state_dict_[key]
        else:
            state_dict[key] = state_dict_[key]

    # check loaded parameters and created model parameters
    model_state_dict = model.state_dict()
    for key in state_dict:
        if key in model_state_dict:
            if state_dict[key].shape != model_state_dict[key].shape:
                print('Skip loading parameter {}, required shape{}, loaded shape{}.'.format(
                    key, model_state_dict[key].shape, state_dict[key].shape))
                state_dict[key] = model_state_dict[key]
        else:
            print('Drop parameter {}.'.format(key))
    for key in model_state_dict:
        if key not in state_dict:
            print('No param {}.'.format(key))
            state_dict[key] = model_state_dict[key]
    model.load_state_dict(state_dict, strict=False)

    return model
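
The prefix-stripping step on its own can be sketched with plain dictionaries, which makes it easy to check without a checkpoint file. `strip_module_prefix` is a hypothetical helper name; unlike the function above, it matches the full "module." prefix including the dot, so a key such as "module_list.0.bias" is left alone without a special case.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only touch keys that really start with the wrapper's prefix.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


if __name__ == "__main__":
    saved = {"module.conv.weight": 1, "module_list.0.bias": 2, "fc.weight": 3}
    print(list(strip_module_prefix(saved)))
```

Another option is to avoid the prefix at save time by saving model.module.state_dict(), since the DDP wrapper exposes the original model as its module attribute.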

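For efficiency, everything has to change
To make the code run faster, remember: don't write loops! Don't write loops! Don't write loops! Especially in core code, avoid them whenever you can!!! The colleague who left this code behind wrote a four-level nested loop to generate heatmaps for 3D data, and I nearly lost the will to live.
The previous code: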

for id in class_ids:
    heatmap = np.zeros_like(img_np, dtype=np.float64)
    pos = sample[id]

    range_heat = [[0, 0], [0, 0], [0, 0]]
    range_heat[0][0] = pos[0] - r[0] if pos[0] > r[0] else 0
    range_heat[1][0] = pos[1] - r[1] if pos[1] > r[1] else 0
    range_heat[2][0] = pos[2] - r[2] if pos[2] > r[2] else 0

    range_heat[0][1] = pos[0] + r[0] if pos[0] + r[0] < X else X
    range_heat[1][1] = pos[1] + r[1] if pos[1] + r[1] < Y else Y
    range_heat[2][1] = pos[2] + r[2] if pos[2] + r[2] < Z else Z

    for z in range(range_heat[2][0], range_heat[2][1]):
        for y in range(range_heat[1][0], range_heat[1][1]):
            for x in range(range_heat[0][0], range_heat[0][1]):
                d = np.sqrt(np.power(z - pos[2], 2) + np.power(y - pos[1], 2) + np.power(x - pos[0], 2))
                heat_value = value_transform(d)
                heatmap[z][y][x] = heat_value

So one epoch took 250 s!!!
Here is the revised version:

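import numpy as np


def gaussian3D(shape, sigma=1):
    """Return a 3D Gaussian kernel of the given shape, peaking at 1 in the center."""
    m, n, p = [(ss - 1.) / 2. for ss in shape]
    y, x, q = np.ogrid[-m:m + 1, -n:n + 1, -p:p + 1]

    h = np.exp(-(x * x + y * y + q * q) / (2 * sigma * sigma))
    # zero out values that are negligible relative to the peak
    h[h < np.finfo(h.dtype).eps * h.max()] = 0
    return h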


def draw_gaussian(heatmap, center, radius):
    """Stamp a Gaussian blob onto heatmap (indexed [z, y, x]) at center (x, y, z)."""
    diameter = 2 * radius + 1
    gaussian = gaussian3D((diameter, diameter, diameter), sigma=diameter / 6)
    c_x, c_y, c_z = int(center[0]), int(center[1]), int(center[2])
    Z, Y, X = heatmap.shape

    # clip the blob's extent to the volume boundaries
    left, right = min(c_x, radius), min(X - c_x, radius + 1)
    top, bottom = min(c_y, radius), min(Y - c_y, radius + 1)
    h_top, h_bottom = min(c_z, radius), min(Z - c_z, radius + 1)

    # views into the heatmap and the matching part of the kernel
    masked_heatmap = heatmap[c_z - h_top:c_z + h_bottom,
                             c_y - top:c_y + bottom,
                             c_x - left:c_x + right]
    masked_gaussian = gaussian[radius - h_top:radius + h_bottom,
                               radius - top:radius + bottom,
                               radius - left:radius + right]
    # write in place, keeping the larger value where blobs overlap
    np.maximum(masked_heatmap, masked_gaussian, out=masked_heatmap)
    return heatmap

for id in class_ids:
    heatmap = np.zeros_like(img_np, dtype=np.float64)
    pos = sample[id]
    draw_gaussian(heatmap, pos, 9)
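
The revised version swaps the distance-based value for a Gaussian kernel. If the original distance-based values need to be kept, the triple loop can also be vectorized directly. The sketch below shows one way to do that with np.ogrid and checks it against a loop reference. The post does not show `value_transform`, so the linear falloff used here is a hypothetical stand-in, and `heatmap_loop` and `heatmap_vectorized` are illustrative names.

```python
import numpy as np


def value_transform(d):
    # Hypothetical stand-in: the real function is not shown in the post.
    return np.maximum(0.0, 1.0 - d / 10.0)


def clipped_box(shape, pos, r):
    """Clip the box pos +/- r to the volume; pos and r are (x, y, z), shape is (Z, Y, X)."""
    Z, Y, X = shape
    x0, x1 = max(pos[0] - r[0], 0), min(pos[0] + r[0], X)
    y0, y1 = max(pos[1] - r[1], 0), min(pos[1] + r[1], Y)
    z0, z1 = max(pos[2] - r[2], 0), min(pos[2] + r[2], Z)
    return x0, x1, y0, y1, z0, z1


def heatmap_loop(shape, pos, r):
    """Reference version: one Python-level iteration per voxel."""
    heatmap = np.zeros(shape, dtype=np.float64)
    x0, x1, y0, y1, z0, z1 = clipped_box(shape, pos, r)
    for z in range(z0, z1):
        for y in range(y0, y1):
            for x in range(x0, x1):
                d = np.sqrt((z - pos[2]) ** 2 + (y - pos[1]) ** 2 + (x - pos[0]) ** 2)
                heatmap[z, y, x] = value_transform(d)
    return heatmap


def heatmap_vectorized(shape, pos, r):
    """Same result, computed for the whole box in one broadcasted expression."""
    heatmap = np.zeros(shape, dtype=np.float64)
    x0, x1, y0, y1, z0, z1 = clipped_box(shape, pos, r)
    # Open grids of shape (nz, 1, 1), (1, ny, 1), (1, 1, nx) broadcast to the box.
    zz, yy, xx = np.ogrid[z0:z1, y0:y1, x0:x1]
    d = np.sqrt((zz - pos[2]) ** 2 + (yy - pos[1]) ** 2 + (xx - pos[0]) ** 2)
    heatmap[z0:z1, y0:y1, x0:x1] = value_transform(d)
    return heatmap


if __name__ == "__main__":
    a = heatmap_loop((12, 10, 8), (3, 4, 5), (2, 3, 4))
    b = heatmap_vectorized((12, 10, 8), (3, 4, 5), (2, 3, 4))
    print(np.allclose(a, b))
```

The vectorized version only works if value_transform accepts arrays; a function written for scalars would have to be rewritten with NumPy operations first.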