A simple walkthrough of the train function in yolov5-5


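I am 强健老鼠, a blogger on 靠谱客. I put this article together during recent development work. It walks through the train function of yolov5-5 step by step and is shared here as a reference.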


1. Start the run

2. Create the opt object

opt = parse_opt()

def parse_opt(known=False):
    print('Creating the opt object')
    parser = argparse.ArgumentParser()
    # ---------- Common arguments --------------------
    parser.add_argument('--weights', type=str, default='yolov5s.pt', help='initial weights file to train from')
    parser.add_argument('--cfg', type=str, default='', help='model config file, e.g. yolov5s.yaml')
    parser.add_argument('--data', type=str, default='data/coco128.yaml', help='dataset config file, e.g. the path to fruit.yaml')
    parser.add_argument('--hyp', type=str, default='data/hyps/hyp.scratch.yaml', help='initial hyperparameter file')
    parser.add_argument('--epochs', type=int, default=300, help='number of training epochs')
    parser.add_argument('--batch-size', type=int, default=16, help='training batch size')
    parser.add_argument('--imgsz', '--img', '--img-size', type=int, default=640, help='image size for training and validation')
    parser.add_argument('--project', default='runs/train', help='root directory for training results, default runs/train')
    parser.add_argument('--name', default='exp', help='run name; results are saved to runs/train/exp by default')
    parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume the most recent interrupted run, default False')
    # ---------- Data augmentation and training-strategy arguments -----------------
    parser.add_argument('--rect', action='store_true', help='use rectangular training, default False')
    parser.add_argument('--noautoanchor', action='store_true', help='disable the automatic anchor check, default False (anchors are adjusted automatically)')
    parser.add_argument('--multi-scale', action='store_true', help='use multi-scale training, default False')
    parser.add_argument('--label-smoothing', type=float, default=0.0, help='label smoothing epsilon, default 0.0 (off); 0.1 is a common value when enabled')
    parser.add_argument('--linear-lr', action='store_true', help='use a linear learning-rate schedule, default False (cosine lr)')
    parser.add_argument('--evolve', type=int, nargs='?', const=300, help='evolve hyperparameters for x generations (300 if no value is given), off by default')
    parser.add_argument('--cache-images', action='store_true', help='cache images in memory ahead of time for faster training, default False')
    parser.add_argument('--image-weights', action='store_true', help='use weighted image selection for training')
    parser.add_argument('--single-cls', action='store_true', help='treat the dataset as single-class, default False')
    parser.add_argument('--adam', action='store_true', help='use the Adam optimizer, default False (SGD)')
    parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm across GPUs, only in DDP mode, default False')
    # ------------ Other arguments ------------------------------------
    parser.add_argument('--nosave', action='store_true', help='only save the final checkpoint')
    parser.add_argument('--noval', action='store_true', help='only validate on the final epoch, default False (validate mAP after every epoch)')
    parser.add_argument('--bucket', type=str, default='', help='gsutil (Google Cloud Storage) bucket, rarely needed')
    parser.add_argument('--device', default='', help='training device (GPU or CPU)')
    parser.add_argument('--workers', type=int, default=8, help='maximum number of dataloader workers')
    parser.add_argument('--entity', default=None, help='W&B entity')
    parser.add_argument('--exist-ok', action='store_true', help='existing project/name is ok, do not increment the run name, default False')
    parser.add_argument('--quad', action='store_true', help='use collate_fn4 instead of collate_fn in the dataloader, default False')
    parser.add_argument('--upload_dataset', action='store_true', help='Upload dataset as W&B artifact table')
    parser.add_argument('--bbox_interval', type=int, default=-1, help='Set bounding-box image logging interval for W&B')
    parser.add_argument('--save_period', type=int, default=-1, help='Log model after every "save_period" epoch')
    parser.add_argument('--artifact_alias', type=str, default="latest", help='version of the dataset artifact to use')
    parser.add_argument('--local_rank', type=int, default=-1, help='process rank for DDP; with -1 and a single GPU, training is not distributed')

    # With known=True, parse_known_args() is used: arguments the parser does not define do not raise an error,
    # they are collected separately so they can be handled later.
    # For example, an option that train.py does not define can still be passed on the command line without an error.
    opt = parser.parse_known_args()[0] if known else parser.parse_args()
    return opt
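
The difference between the two parsing calls at the end of parse_opt can be seen in a small standalone sketch. The parser below is a cut-down stand-in with only two of the real options, and `--extra-flag` is a made-up argument used purely for illustration.

```python
import argparse


def build_parser():
    # Cut-down stand-in for the parser in parse_opt (only two options)
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--batch-size', type=int, default=16)
    return parser


# '--extra-flag' is hypothetical and NOT defined by the parser
argv = ['--epochs', '5', '--extra-flag', '1']

# parse_known_args() returns (namespace, leftovers) instead of exiting
opt, unknown = build_parser().parse_known_args(argv)
print(opt.epochs, opt.batch_size, unknown)

# parse_args() rejects the same command line by raising SystemExit
try:
    build_parser().parse_args(argv)
    strict_failed = False
except SystemExit:
    strict_failed = True
print('parse_args rejected the unknown flag:', strict_failed)
```

This is why `known=True` is handy when train.py is driven from another script that passes extra options.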

3. The main function

main(opt)

3.1 Initialize logging

    # Initialize logging
    set_logging(RANK)

3.2 Print all training opt arguments

    # Print all training opt arguments  train: ...
    if RANK in [-1, 0]:
        # vars() returns the attributes of an object and their values as a dict
        # print(colorstr('Arguments for this run: ') + ', '.join(f'{k}={v}' for k, v in vars(opt).items()))
        print('3.2.1: arguments for this run: ')
        for i, j in vars(opt).items():
            print(i, '=', j)
        print('3.2.1: done printing--------------------')

        print('3.2.2: check whether the code is up to date--------------------')
        # Check whether the code is up to date  github: ...
        check_git_status()

        print('3.2.3: check that the packages in requirements.txt are satisfied---------------')
        # Check that the packages in requirements.txt are satisfied  requirements: ...
        check_requirements(exclude=['thop'])
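
As a quick illustration of how vars() is used above, here is a minimal sketch with a hand-built Namespace. The three values are just sample settings, not the full opt object.

```python
import argparse

# Hand-built stand-in for the opt object (sample values only)
opt = argparse.Namespace(epochs=300, batch_size=16, weights='yolov5s.pt')

# vars(opt) exposes the attributes as a dict, which makes one-line logging easy
summary = ', '.join(f'{k}={v}' for k, v in vars(opt).items())
print(summary)
```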

3.3 Initialize wandb logging

    # Initialize wandb logging
    wandb_run = check_wandb_resume(opt)

3.4 Decide whether to resume an interrupted run or start a new one

    # Decide whether to resume an interrupted run or start a new one
    if opt.resume and not wandb_run:  # resume an interrupted run
        print('3.4.1: resuming an interrupted run')
        # When resuming, the relevant settings are read from the run that produced last.pt
        # If resume is a str, it is the path to the checkpoint
        # If resume is True, get_latest_run() finds the most recent checkpoint under the runs folder
        ckpt = opt.resume if isinstance(opt.resume, str) else get_latest_run()

        assert os.path.isfile(ckpt), 'ERROR: --resume checkpoint does not exist'

        # Replace the opt arguments with those saved in opt.yaml of the resumed run
        with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
            opt = argparse.Namespace(**yaml.safe_load(f))  # replace
        opt.cfg, opt.weights, opt.resume = '', ckpt, True  # reinstate
        LOGGER.info(f'Resuming training from {ckpt}')
    else:
        print('3.4.2: not resuming, starting a new run-----------------')
        # Not resuming: read the settings from the given files
        # opt.hyp = opt.hyp or ('hyp.finetune.yaml' if opt.weights else 'hyp.scratch.yaml')
        opt.data, opt.cfg, opt.hyp = check_file(opt.data), check_file(opt.cfg), check_file(opt.hyp)  # check files

        assert len(opt.cfg) or len(opt.weights), 'either --cfg or --weights must be specified'
        opt.name = 'evolve' if opt.evolve else opt.name

        # Build the save directory from opt.project, e.g. runs/train/exp18
        opt.save_dir = str(increment_path(Path(opt.project) / opt.name, exist_ok=opt.exist_ok or opt.evolve))
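
The resume branch relies on the run directory layout: the checkpoint sits in a weights folder, and opt.yaml sits one level above it. A minimal sketch of that path arithmetic, using a made-up checkpoint path:

```python
from pathlib import Path

# Hypothetical checkpoint path, following the runs/train/expN/weights layout
ckpt = 'runs/train/exp18/weights/last.pt'

# parent -> .../weights, parent.parent -> the run directory that holds opt.yaml
opt_yaml = Path(ckpt).parent.parent / 'opt.yaml'
print(opt_yaml.as_posix())
```

If a checkpoint is copied somewhere else without its opt.yaml, this lookup will fail, so it is worth keeping the whole run directory together.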

3.5 DDP mode

    # DDP mode (multi-GPU training with DistributedDataParallel)
    # Select the device  cpu/cuda:0
    device = select_device(opt.device, batch_size=opt.batch_size)
    print('3.5.1: selected device: ', device)

    if LOCAL_RANK != -1:
        print('3.5.2.1: multi-GPU training------------------------------------------------')
        # LOCAL_RANK != -1 means multi-GPU training
        from datetime import timedelta
        assert torch.cuda.device_count() > LOCAL_RANK, 'insufficient CUDA devices for DDP command'

        torch.cuda.set_device(LOCAL_RANK)
        # Select the device by GPU index
        device = torch.device('cuda', LOCAL_RANK)

        # Initialize the process group  distributed backend
        dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo",
                                timeout=timedelta(seconds=60))

        assert opt.batch_size % WORLD_SIZE == 0, '--batch-size must be multiple of CUDA device count'
        assert not opt.image_weights, '--image-weights argument is not compatible with DDP training'
    else:
        print('3.5.2.2: no multi-GPU training------------------------------------------------')
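
The batch-size assertion above exists because the total batch is later split evenly across processes (batch_size // WORLD_SIZE). A minimal sketch of that rule, where `per_gpu_batch_size` is a hypothetical helper and not part of train.py:

```python
def per_gpu_batch_size(total_batch_size, world_size):
    """Split the total batch evenly across processes, as DDP training expects."""
    if total_batch_size % world_size != 0:
        raise ValueError('--batch-size must be a multiple of the number of CUDA devices')
    return total_batch_size // world_size


print(per_gpu_batch_size(64, 4))  # each of the 4 processes gets an equal share

try:
    per_gpu_batch_size(16, 3)
    uneven_rejected = False
except ValueError:
    uneven_rejected = True
print('uneven split rejected:', uneven_rejected)
```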

3.6 Hyperparameter evolution

    # No hyperparameter evolution: train normally (the default)
    if not opt.evolve:
        print('main3.6.1: no hyperparameter evolution, calling train() directly------------------------------')
        # Without hyperparameter evolution, call train() directly
        train(opt.hyp, opt, device)

        # For multi-GPU training, destroy the process group afterwards
        if WORLD_SIZE > 1 and RANK == 0:
            _ = [print('Destroying process group... ', end=''), dist.destroy_process_group(), print('Done.')]

    # Evolve hyperparameters (optional)
    # Otherwise run hyperparameter evolution (a genetic algorithm), training in every generation,
    # to search for the best hyperparameters
    else:
        print('main3.6.2: using hyperparameter evolution----------------------------------------')
        # Hyperparameter evolution metadata (mutation scale, lower limit, upper limit)
        meta = {'lr0': (1, 1e-5, 1e-1),  # initial learning rate (SGD=1E-2, Adam=1E-3)
                'lrf': (1, 0.01, 1.0),  # final OneCycleLR learning rate (lr0 * lrf)
                'momentum': (0.3, 0.6, 0.98),  # SGD momentum/Adam beta1
                'weight_decay': (1, 0.0, 0.001),  # optimizer weight decay
                'warmup_epochs': (1, 0.0, 5.0),  # warmup epochs (fractions ok)
                'warmup_momentum': (1, 0.0, 0.95),  # warmup initial momentum
                'warmup_bias_lr': (1, 0.0, 0.2),  # warmup initial bias lr
                'box': (1, 0.02, 0.2),  # box loss gain
                'cls': (1, 0.2, 4.0),  # cls loss gain
                'cls_pw': (1, 0.5, 2.0),  # cls BCELoss positive_weight
                'obj': (1, 0.2, 4.0),  # obj loss gain (scale with pixels)
                'obj_pw': (1, 0.5, 2.0),  # obj BCELoss positive_weight
                'iou_t': (0, 0.1, 0.7),  # IoU training threshold (between labels and anchors)
                'anchor_t': (1, 2.0, 8.0),  # anchor-multiple threshold: the ratios label h / anchor h_a and
                                            # label w / anchor w_a must both lie in (1/anchor_t, anchor_t)
                'anchors': (2, 2.0, 10.0),  # anchors per output grid (0 to ignore)
                'fl_gamma': (0, 0.0, 2.0),  # focal loss gamma; 0 disables focal loss
                                            # (efficientDet default gamma=1.5)
                # The entries below are data augmentation coefficients (color space and image geometry)
                'hsv_h': (1, 0.0, 0.1),  # image HSV-Hue augmentation (fraction)
                'hsv_s': (1, 0.0, 0.9),  # image HSV-Saturation augmentation (fraction)
                'hsv_v': (1, 0.0, 0.9),  # image HSV-Value (brightness) augmentation (fraction)
                'degrees': (1, 0.0, 45.0),  # image rotation (+/- deg)
                'translate': (1, 0.0, 0.9),  # image translation (+/- fraction)
                'scale': (1, 0.0, 0.9),  # image scale (+/- gain)
                'shear': (1, 0.0, 10.0),  # image shear (+/- deg)
                'perspective': (0, 0.0, 0.001),  # image perspective (+/- fraction), range 0-0.001
                'flipud': (1, 0.0, 1.0),  # image flip up-down (probability)
                'fliplr': (0, 0.0, 1.0),  # image flip left-right (probability)
                'mosaic': (1, 0.0, 1.0),  # image mosaic (probability)
                'mixup': (1, 0.0, 1.0),  # image mixup (probability)
                'copy_paste': (1, 0.0, 1.0)}  # segment copy-paste (probability)

        with open(opt.hyp) as f:
            hyp = yaml.safe_load(f)  # load the initial hyperparameters

            if 'anchors' not in hyp:  # anchors commented in hyp.yaml
                hyp['anchors'] = 3

        assert LOCAL_RANK == -1, 'DDP mode not implemented for --evolve'
        opt.noval, opt.nosave = True, True  # only val/save final epoch
        # ei = [isinstance(x, (int, float)) for x in hyp.values()]  # evolvable indices

        yaml_file = Path(opt.save_dir) / 'hyp_evolved.yaml'  # where the evolved hyperparameters are saved
        if opt.bucket:
            os.system(f'gsutil cp gs://{opt.bucket}/evolve.txt .')  # download evolve.txt if exists
        """
        How the evolution works: a base hyp is chosen from the hyps of earlier generations and then mutated.
        How is the base hyp chosen?
        The results of each earlier generation determine a weight for each earlier hyp.
        Given the hyps and their weights, there are two ways to build the base hyp:
        1. Pick one earlier hyp at random, with probability proportional to its weight: random.choices(range(n), weights=w)
        2. Merge all earlier hyps into one weighted average: (x * w.reshape(n, 1)).sum(0) / w.sum()
        evolve.txt records the results + hyp of every generation.
        In each generation, the earlier hyps are sorted by the fitness of their results, from high to low,
        the weight of each hyp is computed with the fitness function,
        and the base hyp is built with the selected method before it is mutated.
        """

        for _ in range(opt.evolve):  # generations to evolve
            if Path('evolve.txt').exists():  # if evolve.txt exists: select best hyps and mutate
                # Select parent(s)
                # Parent selection method: either 'single' or 'weighted'
                parent = 'single'

                # Load evolve.txt
                x = np.loadtxt('evolve.txt', ndmin=2)

                # Keep at most the 5 best earlier results, ranked by fitness
                n = min(5, len(x))  # number of previous results to consider
                x = x[np.argsort(-fitness(x))][:n]  # top n mutations

                # Compute the hyp weights from the results
                w = fitness(x) - fitness(x).min() + 1E-6  # weights (sum > 0)

                # Build the base hyp with the selected method
                if parent == 'single' or len(x) == 1:
                    # x = x[random.randint(0, n - 1)]  # random selection
                    x = x[random.choices(range(n), weights=w)[0]]  # weighted selection
                elif parent == 'weighted':
                    x = (x * w.reshape(n, 1)).sum(0) / w.sum()  # weighted combination

                # Mutate
                mp, s = 0.8, 0.2  # mutation probability, sigma
                npr = np.random
                npr.seed(int(time.time()))

                # Initial values for the mutation
                g = np.array([x[0] for x in meta.values()])  # gains 0-1
                ng = len(meta)
                v = np.ones(ng)

                # Draw the mutation factors
                while all(v == 1):  # mutate until a change occurs (prevent duplicates)
                    v = (g * (npr.random(ng) < mp) * npr.randn(ng) * npr.random() * s + 1).clip(0.3, 3.0)

                # Apply the mutation to the base hyp
                # The offset i + 7 skips the first 7 numbers of x, which are the results
                # (P, R, mAP@.5, mAP@.5-.95, val losses (box, obj, cls)); the hyperparameters follow
                for i, k in enumerate(hyp.keys()):  # plt.hist(v.ravel(), 300)
                    hyp[k] = float(x[i + 7] * v[i])  # mutate

            # Constrain to limits
            for k, v in meta.items():
                hyp[k] = max(hyp[k], v[1])  # lower limit
                hyp[k] = min(hyp[k], v[2])  # upper limit
                hyp[k] = round(hyp[k], 5)  # significant digits

            # Train with the mutated hyperparameters to evaluate them
            results = train(hyp.copy(), opt, device)

            # Write the results and the corresponding hyp to evolve.txt, one line per generation
            # The first seven numbers of each line are (P, R, mAP@.5, mAP@.5-.95, val losses (box, obj, cls)), followed by the hyp
            # The hyp is also saved to the yaml file
            print_mutation(hyp.copy(), results, yaml_file, opt.bucket)

        # Plot results
        plot_evolution(yaml_file)
        print(f'Hyperparameter evolution complete. Best results saved as: {yaml_file}\n'
              f'Command to train a new model with these hyperparameters: $ python train.py --hyp {yaml_file}')
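
To make the mutate-then-clamp step easier to follow, here is a simplified, standard-library-only sketch. It is not the NumPy code from train.py: `mutate` is a hypothetical helper, the meta dict holds only two entries, and the random draws happen per key rather than as vectors. The idea is the same, though: each factor is 1 plus a scaled Gaussian perturbation, clipped to [0.3, 3.0], and the result is clamped to the (low, high) limits.

```python
import random


def mutate(hyp, meta, mp=0.8, s=0.2, rng=random):
    """Return a mutated copy of hyp, clamped to the (low, high) limits in meta."""
    keys = list(hyp)
    factors = [1.0] * len(keys)
    while all(f == 1.0 for f in factors):  # retry until at least one factor changes
        scale = rng.random()  # one shared random scale per attempt
        factors = []
        for k in keys:
            gain = meta[k][0]
            hit = rng.random() < mp  # mutate this key with probability mp
            f = gain * hit * rng.gauss(0, 1) * scale * s + 1
            factors.append(min(max(f, 0.3), 3.0))  # clip the factor to [0.3, 3.0]
    mutated = {}
    for k, f in zip(keys, factors):
        low, high = meta[k][1], meta[k][2]
        mutated[k] = round(min(max(hyp[k] * f, low), high), 5)  # constrain to limits
    return mutated


meta = {'lr0': (1, 1e-5, 1e-1), 'momentum': (0.3, 0.6, 0.98)}  # (gain, low, high)
base = {'lr0': 0.01, 'momentum': 0.937}
child = mutate(base, meta, rng=random.Random(0))  # seeded for reproducibility
print(child)
```

Note that a gain of 0 (as for iou_t or fliplr in the real meta dict) freezes that hyperparameter, since its factor always stays at 1.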

4. The train() function

def train(hyp, opt, device):
    print('Entering train()--------------------------------------------------')
    """
     :params hyp: path to data/hyps/hyp.scratch.yaml, or a hyp dictionary
     :params opt: the opt arguments from main
     :params device: the current device
     """

4.1 Initialize parameters and configuration

# ---------- Initialize parameters and configuration ---------------------------------
    print('train4.1.1: unpack the opt arguments-----------------------------------')
    save_dir, epochs, batch_size, weights, single_cls, evolve, data, cfg, resume, noval, nosave, workers = \
        opt.save_dir, opt.epochs, opt.batch_size, opt.weights, opt.single_cls, opt.evolve, opt.data, opt.cfg, \
        opt.resume, opt.noval, opt.nosave, opt.workers

    print('train4.1.2: set the paths for saving the .pt results------------------------------')
    # -------- Set the paths for saving the .pt results ------------
    save_dir = Path(save_dir)  # directory for the training results, e.g. runs/train/exp18
    wdir = save_dir / 'weights'  # weights directory, e.g. runs/train/exp18/weights
    wdir.mkdir(parents=True, exist_ok=True)  # make dir
    last = wdir / 'last.pt'  # e.g. runs/train/exp18/weights/last.pt
    best = wdir / 'best.pt'  # e.g. runs/train/exp18/weights/best.pt
    results_file = save_dir / 'results.txt'  # e.g. runs/train/exp18/results.txt

    # Hyperparameters
    print('train4.1.3: load the hyperparameters---------------------------------------------')
    if isinstance(hyp, str):  # if hyp is a str, it is the path to the hyperparameter file
        with open(hyp) as f:
            hyp = yaml.safe_load(f)  # load the hyperparameters

    print('train4.1.4: log the hyperparameters------------------------------------------')
    # Log the hyperparameters  hyperparameters: ...
    # LOGGER.info(colorstr('hyperparameters: ') + ', '.join(f'{k}={v}' for k, v in hyp.items()))
    print('Hyperparameters:')
    for i, j in hyp.items():
        print(i, '==', j)
    print('train4.1.4: done printing------------------------------------------')

    print('train4.1.5: save the run settings---------------------------------------------------------------')
    # Save the run settings
    with open(save_dir / 'hyp.yaml', 'w') as f:
        yaml.safe_dump(hyp, f, sort_keys=False)

    print('train4.1.6: save opt--------------------------------------------------------------')
    # Save opt
    with open(save_dir / 'opt.yaml', 'w') as f:
        yaml.safe_dump(vars(opt), f, sort_keys=False)

    print('train4.1.7: decide whether to create plots')
    # Configure
    # Whether to create plots: label statistics, the first three training batches, training results, etc.
    plots = not evolve  # create plots
    cuda = device.type != 'cpu'

    print('train4.1.8: set the random seeds')
    init_seeds(1 + RANK)  # seed the random number generators

    print('train4.1.9: load the dataset config from the data yaml')
    # data_dict: the dataset config loaded from the data yaml (e.g. fruit.yaml), as a dict
    with open(data, 'rb') as f:
        data_dict = yaml.safe_load(f)  # data dict

    # Loggers
    loggers = {'wandb': None, 'tb': None}  # loggers dict
    if RANK in [-1, 0]:
        # TensorBoard
        if not evolve:
            prefix = colorstr('tensorboard: ')  # colored log prefix
            LOGGER.info(f"{prefix}Start with 'tensorboard --logdir {opt.project}', view at http://localhost:6006/")
            loggers['tb'] = SummaryWriter(str(save_dir))

        # W&B  wandb logging
        opt.hyp = hyp  # add hyperparameters
        run_id = torch.load(weights).get('wandb_id') if weights.endswith('.pt') and os.path.isfile(weights) else None
        run_id = run_id if opt.resume else None  # start a fresh run for transfer learning
        wandb_logger = WandbLogger(opt, save_dir.stem, run_id, data_dict)
        loggers['wandb'] = wandb_logger.wandb
        if loggers['wandb']:
            data_dict = wandb_logger.data_dict
            weights, epochs, hyp = opt.weights, opt.epochs, opt.hyp  # may update weights, epochs if resuming

    # e.g. data == paper_data/fruit.yaml
    # Number of classes in the dataset
    nc = 1 if single_cls else int(data_dict['nc'])

    # names: the names of all classes in the dataset
    names = ['item'] if single_cls and len(data_dict['names']) != 1 else data_dict['names']

    assert len(names) == nc, '%g names found for nc=%g dataset in %s' % (len(names), nc, data)  # check
    print('number of classes (user-defined) == ', nc)
    print('class names (user-defined) == ', names)

    print('train4.1.10: check whether this is the COCO dataset (80 classes)')
    # Whether this is the COCO dataset (80 classes), which enables save_json and COCO evaluation
    is_coco = data.endswith('coco.yaml') and nc == 80  # COCO dataset

4.2 Model

# --------------------------- Model --------------------------------------
    print('train4.2.1: load the model-------------------------------------------------')
    # Load the model
    pretrained = weights.endswith('.pt')
    if pretrained:  # use pretrained weights
        print('train4.2.1.1: using pretrained weights (the usual case)------------------------------------------------')
        # Context manager that makes the other processes wait while the main process handles the download
        print('train4.2.1.1.1: synchronize the processes while fetching the weights-------------------------------')
        with torch_distributed_zero_first(RANK):
            # attempt_download tries to fetch the weights if they are missing locally; this often fails,
            # so it is safer to download them from GitHub yourself and put them under weights
            weights = attempt_download(weights)

        # Load the model and its parameters
        ckpt = torch.load(weights, map_location=device)  # load checkpoint
        print('train4.2.1.1.2: load the model and its parameters-------------------------------')
        # ckpt: the checkpoint dict (model, optimizer state, epoch, ...)

        # The model can be built in two ways: from opt.cfg, or from ckpt['model'].yaml
        # Which one is used depends on resume: resuming sets opt.cfg to '' so the model is built from ckpt['model'].yaml
        # This also decides whether the anchor keys are excluded (i.e. not loaded from the checkpoint):
        # when resuming, nothing is excluded and the checkpoint anchors are loaded;
        # otherwise, if cfg or hyp['anchors'] is given, the anchor keys are excluded
        # Reason: checkpoints store their anchors, so without the exclusion the anchors saved in the
        # checkpoint (e.g. COCO-based ones) would overwrite the anchors the user has defined
        # Details: https://github.com/ultralytics/yolov5/issues/459
        # So intersect_dicts() below skips every key listed in exclude

        model = Model(cfg or ckpt['model'].yaml, ch=3, nc=nc, anchors=hyp.get('anchors')).to(device)  # create
        exclude = ['anchor'] if (cfg or hyp.get('anchors')) and not resume else []  # exclude keys
        state_dict = ckpt['model'].float().state_dict()  # to FP32

        # Keep only the matching key/value pairs and drop those listed in exclude
        state_dict = intersect_dicts(state_dict, model.state_dict(), exclude=exclude)  # intersect
        model.load_state_dict(state_dict, strict=False)  # load the weights

        LOGGER.info('Transferred %g/%g items from %s' % (len(state_dict), len(model.state_dict()), weights))  # report
    else:  # no pretrained weights
        print('train4.2.1.2: not using pretrained weights-------------------------------------------------')
        model = Model(cfg, ch=3, nc=nc, anchors=hyp.get('anchors')).to(device)  # create

    print('train4.2.2: check the dataset, downloading it if it is not found locally---------------------')
    # Check the dataset; if it is not found locally, check_dataset tries to download and unpack it
    # (using the download entry of the data yaml, if there is one)
    with torch_distributed_zero_first(RANK):
        check_dataset(data_dict)  # check

    # Dataset parameters
    print('train4.2.3: dataset parameters-----------------------------------------')
    train_path = data_dict['train']
    val_path = data_dict['val']
    """ train_path == D:/yolo5-5/yolov5/paper_data/train.txt
        val_path == D:/yolo5-5/yolov5/paper_data/test.txt
        data_dict == {'train': 'D:/yolo5-5/yolov5/paper_data/train.txt',
                      'val': 'D:/yolo5-5/yolov5/paper_data/test.txt',
                      'nc': 10,
                      'names': ['AlligatorCrack', 'TransverseCrack', 'LongitudinalCrack',
                                'Sealling', 'SeallingCrack', 'Patch', 'Loose', 'LaneMarking',
                                'Joint', 'IndicatingArrow']}"""

    print('train4.2.4: freeze layers------------------------------------')
    # list(model.named_parameters()) lists all parameters of the model
    # Freeze layers
    # This is only an example of how to freeze layers. Freezing is not recommended here:
    # training all layers usually gives better results, although it is slower
    freeze = []  # parameter names to freeze (full or partial)
    for k, v in model.named_parameters():
        v.requires_grad = True  # train all layers
        if any(x in k for x in freeze):
            print('freezing %s' % k)
            v.requires_grad = False
4.3 Optimizer setup

# ------------------ Optimizer ----------------------------------------
    # nbs is the nominal batch size, i.e. the batch size being simulated. With the default opt.batch_size = 16, nbs = 64,
    # so gradients are accumulated 64/16=4 (accumulate) times before each optimizer step, which effectively enlarges the batch size
    nbs = 64
    accumulate = max(round(nbs / batch_size), 1)  # accumulate loss before optimizing
    print('gradient accumulation steps (accumulate) == ', accumulate)
    print('train4.3.1: scale the weight decay according to accumulate------------------------------------')
    # Scale the weight decay according to accumulate
    hyp['weight_decay'] *= batch_size * accumulate / nbs  # scale weight_decay
    LOGGER.info(f"Scaled weight_decay = {hyp['weight_decay']}")  # write to the log

    print('train4.3.2: split the model parameters into three groups (weights, biases, bn)------------------------------------')
    # Split the model parameters into three groups (weights, biases, bn) so each can be optimized differently
    pg0, pg1, pg2 = [], [], []  # optimizer parameter groups
    for k, v in model.named_modules():
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):
            pg2.append(v.bias)  # biases
        if isinstance(v, nn.BatchNorm2d):
            pg0.append(v.weight)  # no decay
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):
            pg1.append(v.weight)  # apply decay

    print('train4.3.3: choose the optimizer and register pg0 (bn weights)------------')
    # Choose the optimizer and register pg0 (bn weights, no weight decay)
    if opt.adam:
        print('train4.3.3.1:opt.adam--------------------------')
        optimizer = optim.Adam(pg0, lr=hyp['lr0'], betas=(hyp['momentum'], 0.999))  # adjust beta1 to momentum
    else:
        print('train4.3.3.2:no opt.adam--------------------------')
        optimizer = optim.SGD(pg0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)

    print('train4.3.4: register pg1 (weights)-------------------------------')
    # Register pg1 (weights)
    optimizer.add_param_group({'params': pg1, 'weight_decay': hyp['weight_decay']})  # add pg1 with weight_decay

    print('train4.3.5: register pg2 (biases)-------------------------------')
    # Register pg2 (biases)
    optimizer.add_param_group({'params': pg2})  # add pg2 (biases)

    print('train4.3.6: log the optimizer groups-------------------------------')
    # Log the optimizer groups
    LOGGER.info('Optimizer groups: %g .bias, %g conv.weight, %g other' % (len(pg2), len(pg1), len(pg0)))
    print('train4.3.7: delete the three lists-------------------------------')
    # Delete the three lists, which are no longer needed
    del pg0, pg1, pg2
4.4 Learning rate

# ---------------------------- Learning rate -----------------------
    print('train4.4.1: choose the learning-rate schedule----------------------')
    if opt.linear_lr:  # use a linear learning rate
        print('train4.4.1.1: using a linear learning rate----------------------------------')
        lf = lambda x: (1 - x / (epochs - 1)) * (1.0 - hyp['lrf']) + hyp['lrf']  # linear
    else:
        # Use the one-cycle (cosine) learning rate
        print('train4.4.1.2: using the one_cycle learning rate (the usual case)-------------------------')
        lf = one_cycle(1, hyp['lrf'], epochs)  # cosine 1->hyp['lrf']

    print('train4.4.2: create the scheduler-------------------------')
    # Create the scheduler
    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
    # plot_lr_scheduler(optimizer, scheduler, epochs)
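
Both schedules are just functions from the epoch index to a multiplier on lr0. The sketch below compares them. `linear_lf` copies the lambda above, while `cosine_lf` is my own stand-in that is assumed to have the same shape as one_cycle (a cosine ramp from y1 to y2). The values lrf=0.2 and epochs=100 are sample settings.

```python
import math


def linear_lf(lrf, epochs):
    # Same formula as the linear lambda: 1.0 at epoch 0, lrf at the last epoch
    return lambda x: (1 - x / (epochs - 1)) * (1.0 - lrf) + lrf


def cosine_lf(y1, y2, steps):
    # Assumed shape of one_cycle: cosine ramp from y1 (x=0) to y2 (x=steps)
    return lambda x: ((1 - math.cos(x * math.pi / steps)) / 2) * (y2 - y1) + y1


lrf, epochs = 0.2, 100  # sample values
lin, cos = linear_lf(lrf, epochs), cosine_lf(1, lrf, epochs)
for e in (0, 25, 50, 75, 99):
    print(f'epoch {e:3d}: linear x{lin(e):.3f}  cosine x{cos(e):.3f}')
```

The cosine curve stays near 1.0 longer at the start and flattens out again near the end, whereas the linear one drops at a constant rate.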

4.5 Final preparation before training

    # EMA
    # EMA (exponential moving average) of the model parameters, kept on the main process only (RANK -1 or 0).
    # It is an average that gives more weight to recent updates, intended to improve validation metrics and robustness.
    print('train4.5.1: keep an EMA (exponential moving average) of the model parameters on the main process--------------')
    ema = ModelEMA(model) if RANK in [-1, 0] else None

    # Using pretrained weights
    start_epoch, best_fitness = 0, 0.0
    if pretrained:
        print('train4.5.2: using pretrained weights--------------------')
        # Optimizer
        if ckpt['optimizer'] is not None:
            optimizer.load_state_dict(ckpt['optimizer'])
            best_fitness = ckpt['best_fitness']

        # EMA
        if ema and ckpt.get('ema'):
            ema.ema.load_state_dict(ckpt['ema'].float().state_dict())
            ema.updates = ckpt['updates']

        # Results
        if ckpt.get('training_results') is not None:
            results_file.write_text(ckpt['training_results'])  # write results.txt

        # Epochs
        start_epoch = ckpt['epoch'] + 1
        if resume:
            assert start_epoch > 0, '%s training to %g epochs is finished, nothing to resume.' % (weights, epochs)
        if epochs < start_epoch:
            LOGGER.info('%s has been trained for %g epochs. Fine-tuning for %g additional epochs.' %
                        (weights, ckpt['epoch'], epochs))
            epochs += ckpt['epoch']  # finetune additional epochs

        del ckpt, state_dict

    print('train4.5.3: get the maximum model stride, e.g. from [32 16 8]-----------')
    # Image sizes
    # gs: grid size, the maximum model stride (at least 32), e.g. 32 for strides [32 16 8]
    gs = max(int(model.stride.max()), 32)
    print('gs == ', gs)

    print('train4.5.4: number of detection layers--------------------')
    # nl: the number of detection layers
    nl = model.model[-1].nl  # number of detection layers (used for scaling hyp['obj'])
    print('nl == ', nl)

    print('train4.5.5: verify the image size--------------------------')
    # Verify the image size used for training and validation, e.g. imgsz=640
    imgsz = check_img_size(opt.imgsz, gs, floor=gs * 2)  # verify imgsz is gs-multiple
    print('imgsz == ', imgsz)

    print('train4.5.6: whether to use DP mode (single machine, multiple GPUs)-------------------------')
    # Whether to use DP mode
    # If rank == -1 and there is more than one GPU, DataParallel is used; it does not work well (the load is unbalanced)
    if cuda and RANK == -1 and torch.cuda.device_count() > 1:
        print('train4.5.6.1: using DP mode----------------')
        logging.warning('DP not recommended, instead use torch.distributed.run for best DDP Multi-GPU results.\n'
                        'See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.')
        model = torch.nn.DataParallel(model)
    else:
        print('train4.5.6.2: not using DP mode-------------------')

    print('train4.5.7: whether to use SyncBatchNorm----------------------')
    # SyncBatchNorm: batch norm synchronized across GPUs
    if opt.sync_bn and cuda and RANK != -1:
        print('train4.5.7.1: using SyncBatchNorm-----------------------------')
        raise Exception('can not train with --sync-bn, known issue https://github.com/ultralytics/yolov5/issues/3998')
        # The two lines below are unreachable while the raise above is in place
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
        LOGGER.info('Using SyncBatchNorm()')
    else:
        print('train4.5.7.2: not using SyncBatchNorm-----------------------------')
4.6 Data loading

    print('train4.6.1: create the training dataloader------------------------')
    # Trainloader
    train_loader, dataset = create_dataloader(train_path, imgsz, batch_size // WORLD_SIZE, gs, single_cls,
                                              hyp=hyp, augment=True, cache=opt.cache_images, rect=opt.rect, rank=RANK,
                                              workers=workers, image_weights=opt.image_weights, quad=opt.quad,
                                              prefix=colorstr('train: '))

    print('train4.6.2: compare the largest class index in the labels with nc; it must be smaller than nc--')
    # Get the largest class index in the labels and compare it with the number of classes;
    # if it is greater than or equal to nc, the labels or the data yaml are wrong
    mlc = np.concatenate(dataset.labels, 0)[:, 0].max()  # max label class
    nb = len(train_loader)  # number of batches
    print('max label class == ', mlc)
    print('number of batches == ', nb)
    assert mlc < nc, 'Label class %g exceeds nc=%g in %s. Possible class labels are 0-%g' % (mlc, nc, data, nc - 1)

    # Process 0
    if RANK in [-1, 0]:
        val_loader = create_dataloader(val_path, imgsz, batch_size // WORLD_SIZE * 2, gs, single_cls,
                                       hyp=hyp, cache=opt.cache_images and not noval, rect=True, rank=-1,
                                       workers=workers, pad=0.5,
                                       prefix=colorstr('val: '))[0]

        if not resume:  # only when not resuming
            print('train4.6.3.1: not resuming----------------')
            # Collect the label statistics of the dataset
            print('train4.6.3.1.1: collect the label statistics of the dataset--------------')
            labels = np.concatenate(dataset.labels, 0)
            # print('label,', labels)
            # c = torch.tensor(labels[:, 0])  # classes
            # cf = torch.bincount(c.long(), minlength=nc) + 1.  # frequency
            # model._initialize_biases(cf.to(device))
            if plots:
                print('train4.6.3.1.2: plot the label statistics------')
                # Plot the label statistics of the dataset
                plot_labels(labels, names, save_dir, loggers)

            # Check how well the default anchors fit the label boxes of the dataset
            # The ratios label h / anchor h_a and label w / anchor w_a are acceptable if both lie in (1/hyp['anchor_t'], hyp['anchor_t'])
            # If the best possible recall (bpr) is below 98%, new anchors are computed with k-means clustering
            # Anchors
            if not opt.noautoanchor:
                print('train4.6.3.1.3: check the default anchors against the label boxes------')
                check_anchors(dataset, model=model, thr=hyp['anchor_t'], imgsz=imgsz)
            print('train4.6.3.1.4: pre-reduce anchor precision---------------------')
            model.half().float()  # pre-reduce anchor precision

    # DDP mode
    if cuda and RANK != -1:
        model = DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
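
The label sanity check in step 4.6.2 catches a common mistake: a class index in the label files that does not exist in the data yaml. Here is a plain-Python sketch of the same check. `check_label_classes` is a hypothetical helper, and the label rows are made-up samples in the [cls, x, y, w, h] layout.

```python
def check_label_classes(labels_per_image, nc):
    """labels_per_image: one list of [cls, x, y, w, h] rows per image."""
    max_cls = max(int(row[0]) for rows in labels_per_image for row in rows)
    if max_cls >= nc:  # valid class indices are 0 .. nc - 1
        raise ValueError(f'Label class {max_cls} exceeds nc={nc}. Possible class labels are 0-{nc - 1}')
    return max_cls


# Made-up labels for two images
labels = [
    [[0, 0.5, 0.5, 0.2, 0.2], [2, 0.3, 0.4, 0.1, 0.1]],
    [[1, 0.6, 0.6, 0.3, 0.3]],
]
print(check_label_classes(labels, nc=3))  # highest class index found in the labels

try:
    check_label_classes(labels, nc=2)  # class 2 does not exist when nc=2
    bad_labels_caught = False
except ValueError:
    bad_labels_caught = True
print('out-of-range class caught:', bad_labels_caught)
```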

4.7 训练开始模块

# -------------------开始训练-------------------------------------------
    # 设置/初始化一些训练要用的参数
    hyp['box'] *= 3. / nl  # scale to layers
    hyp['cls'] *= nc / 80. * 3. / nl  # 分类损失系数
    hyp['obj'] *= (imgsz / 640) ** 2 * 3. / nl  # scale to image size and layers
    hyp['label_smoothing'] = opt.label_smoothing
    model.nc = nc  # attach number of classes to model
    model.hyp = hyp  # attach hyperparameters to model
    model.gr = 1.0  # iou loss ratio (obj_loss = 1.0 or iou)

    # 从训练样本标签得到类别权重（和类别中的目标数即类别频率成反比）
    model.class_weights = labels_to_class_weights(dataset.labels, nc).to(device) * nc  # attach class weights
    model.names = names  # 获取类别名

    # 开始训练
    t0 = time.time()

    # 获取热身迭代的次数iterations, max(3 epochs, 1k iterations)
    nw = max(round(hyp['warmup_epochs'] * nb), 1000)
    # nw = min(nw, (epochs - start_epoch) / 2 * nb)  # limit warmup to < 1/2 of training

    last_opt_step = -1
    # 初始化maps(每个类别的map)和results
    maps = np.zeros(nc)  # 每个类别的mAp
    results = (0, 0, 0, 0, 0, 0, 0)  # P, R, mAP@.5, mAP@.5-.95, val_loss(box, obj, cls)

    # 设置学习率衰减所进行到的轮次，即使打断训练，使用resume接着训练也能正常衔接之前的训练进行学习率衰减
    scheduler.last_epoch = start_epoch - 1  # do not move

    # 设置amp混合精度训练    GradScaler + autocast
    scaler = amp.GradScaler(enabled=cuda)

    # 初始化损失函数
    compute_loss = ComputeLoss(model)  # init loss class

    # 打印日志信息
    LOGGER.info(f'Image sizes {imgsz} train, {imgsz} valn'
                f'Using {train_loader.num_workers} dataloader workersn'
                f'Logging results to {save_dir}n'
                f'Starting training for {epochs} epochs...')

    # 开始训练
    print('开始啦！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！！')
    for epoch in range(start_epoch, epochs):
        print('这是第', epoch, '轮')
        model.train()

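        # Update image weights (optional); not necessarily beneficial, False by default
        if opt.image_weights:  # if True, sample images according to the dataset's class weights
            # Combine model.class_weights initialized above (per-class weights; frequent classes get small weights)
            # with maps and the classes present in each image, then use random.choices to generate the image
            # indices used for sampling (the author's own sampling strategy; it may not always help)
            # Generate indices
            if RANK in [-1, 0]:
                # Per-class weights from the training (gt) labels: frequent classes get low weights
                cw = model.class_weights.cpu().numpy() * (1 - maps) ** 2 / nc  # class weights

                # Sampling weight of every image, shape [dataset.n] (128 here)
                iw = labels_to_image_weights(dataset.labels, nc=nc, class_weights=cw)  # image weights

                # random.choices draws k values from range(dataset.n) with replacement, each draw weighted by iw
                # The result is the sampling order for the whole epoch: a list of dataset.n indices (128 here)
                dataset.indices = random.choices(range(dataset.n), weights=iw, k=dataset.n)  # rand weighted idx

            # Broadcast if DDP: send the indices sampled on rank 0 to all other processes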
            if RANK != -1:
                indices = (torch.tensor(dataset.indices) if RANK == 0 else torch.zeros(dataset.n)).int()
                dist.broadcast(indices, 0)
                if RANK != 0:
                    dataset.indices = indices.cpu().numpy()

        # Update mosaic border
        # b = int(random.uniform(0.25 * imgsz, 0.75 * imgsz + gs) // gs * gs)
        # dataset.mosaic_border = [b - imgsz, -b]  # height, width borders

        # Initialize the mean losses printed during training
        mloss = torch.zeros(4, device=device)  # mean losses

        if RANK != -1:
            # In DDP mode the sampler shuffles with epoch + seed as the random seed,
            # so every epoch gets a different shuffle
            train_loader.sampler.set_epoch(epoch)

        # Batch iterator (wrapped in a progress bar below)
        pbar = enumerate(train_loader)

        # Progress bar header
        LOGGER.info(('\n' + '%10s' * 8) % ('Epoch', 'gpu_mem', 'box', 'obj', 'cls', 'total', 'labels', 'img_size'))
        if RANK in [-1, 0]:
            pbar = tqdm(pbar, total=nb)  # create the progress bar

        optimizer.zero_grad()  # zero the gradients

        for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
            ni = i + nb * epoch  # ni: global iteration index (batches counted from epoch 0)
            imgs = imgs.to(device, non_blocking=True).float() / 255.0  # uint8 to float32, 0-255 to 0.0-1.0

            # Warmup
            # For the first nw iterations, accumulate, the learning rates and the momentum are interpolated
            # from gentle starting values to their regular values, so that training ramps up slowly
            if ni <= nw:
                xi = [0, nw]  # x interp
                # model.gr = np.interp(ni, xi, [0.0, 1.0])  # iou loss ratio (obj_loss = 1.0 or iou)
                accumulate = max(1, np.interp(ni, xi, [1, nbs / batch_size]).round())
                for j, x in enumerate(optimizer.param_groups):
                    # The bias lr falls from warmup_bias_lr (0.1) to the base rate initial_lr * lf(epoch);
                    # all other lrs rise from 0.0 to initial_lr * lf(epoch)
                    # lf is the lr decay function defined earlier (cosine by default)
                    x['lr'] = np.interp(ni, xi, [hyp['warmup_bias_lr'] if j == 2 else 0.0, x['initial_lr'] * lf(epoch)])
                    if 'momentum' in x:
                        x['momentum'] = np.interp(ni, xi, [hyp['warmup_momentum'], hyp['momentum']])

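            # Multi-scale training: pick a random size in [imgsz * 0.5, imgsz * 1.5 + gs), rounded down to a
            # multiple of gs, and feed the current batch to the model at that size
            # imgsz: default training size   gs: the model's maximum stride (32; the strides are [8, 16, 32])
            if opt.multi_scale:
                # randrange needs integer bounds (float bounds are rejected by recent Python versions)
                sz = random.randrange(int(imgsz * 0.5), int(imgsz * 1.5) + gs) // gs * gs  # size
                sf = sz / max(imgs.shape[2:])  # scale factor
                if sf != 1:
                    ns = [math.ceil(x * sf / gs) * gs for x in imgs.shape[2:]]  # new shape (stretched to gs-multiple)

                    # Resize the batch (this can be up- or down-sampling)
                    imgs = F.interpolate(imgs, size=ns, mode='bilinear', align_corners=False)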

            # Forward pass inside the autocast context (mixed precision)
            with amp.autocast(enabled=cuda):
                # pred, e.g.: [8, 3, 68, 68, 25] [8, 3, 34, 34, 25] [8, 3, 17, 17, 25]
                # [bs, anchor_num, grid_h, grid_w, xywh + obj + 20 classes]
                pred = model(imgs)  # forward

                # Compute the loss: box regression, objectness (confidence) and classification
                # loss is the total loss; loss_items holds the box, obj, cls and total losses
                loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size

                if RANK != -1:
                    loss *= WORLD_SIZE  # gradient averaged between devices in DDP mode
                if opt.quad:
                    # When collate_fn4 is used to load quad (mosaic4) batches, the loss is multiplied by 4 as well
                    loss *= 4.

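            # Backward: the loss is scaled up before backpropagation to prevent gradient underflow (AMP)
            scaler.scale(loss).backward()

            # Optimize
            # Update the parameters once from the accumulated gradients after `accumulate` backward passes
            if ni - last_opt_step >= accumulate:
                # scaler.step() first unscales the gradients
                # If they contain no infs or NaNs, it calls optimizer.step() to update the weights;
                # otherwise the step is skipped so that the weights are not corrupted
                scaler.step(optimizer)  # optimizer.step

                scaler.update()  # update the scale factor for the next iteration

                optimizer.zero_grad()  # zero the gradients

                if ema:
                    ema.update(model)  # update the EMA after every optimizer step
                last_opt_step = ni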

            # Print: current epoch, GPU memory, losses (box, obj, cls, total),
            # number of targets in the current batch and image size
            if RANK in [-1, 0]:
                mloss = (mloss * i + loss_items) / (i + 1)  # update mean losses
                mem = '%.3gG' % (torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0)  # (GB)
                s = ('%10s' * 2 + '%10.4g' * 6) % (
                    f'{epoch}/{epochs - 1}', mem, *mloss, targets.shape[0], imgs.shape[-1])

                pbar.set_description(s)  # show the information above in the progress bar

                # Plot
                # Draw the labels of the first three batches on their images and save them as train_batch0/1/2.jpg
                if plots and ni < 3:
                    f = save_dir / f'train_batch{ni}.jpg'  # filename
                    Thread(target=plot_images, args=(imgs, targets, paths, f), daemon=True).start()
                    if loggers['tb'] and ni == 0:  # TensorBoard
                        with warnings.catch_warnings():
                            warnings.simplefilter('ignore')  # suppress jit trace warning
                            loggers['tb'].add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])

                # Log the saved training mosaics to wandb
                elif plots and ni == 10 and loggers['wandb']:
                    wandb_logger.log({'Mosaics': [loggers['wandb'].Image(str(x), caption=x.name) for x in
                                                  save_dir.glob('train*.jpg') if x.exists()]})

            # end batch ------------------------------------------------------------------------------------------------

        # Scheduler: adjust (decay) the learning rate at the end of every epoch
        # All three param-group learning rates (pg0, pg1, pg2) are adjusted
        lr = [x['lr'] for x in optimizer.param_groups]  # for loggers
        scheduler.step()

        # DDP process 0 or single-GPU
        if RANK in [-1, 0]:
            # mAP
            # Copy the listed attributes from model to the EMA
            ema.update_attr(model, include=['yaml', 'nc', 'hyp', 'gr', 'names', 'stride', 'class_weights'])

            # Whether this is the final epoch
            final_epoch = epoch + 1 == epochs

            # noval: True validates only after the final epoch; False computes mAP after every epoch
            if not noval or final_epoch:  # Calculate mAP
                wandb_logger.current_epoch = epoch + 1
                # Validation uses the EMA model (exponential moving average of the model parameters)
                # results: [1] Precision: mean precision over all classes (at maximum F1)
                #          [1] Recall: mean recall over all classes
                #          [1] mAP@0.5: mean mAP@0.5 over all classes
                #          [1] mAP@0.5:0.95: mean mAP@0.5:0.95 over all classes
                #          [3] box_loss, obj_loss, cls_loss: validation box, objectness and classification losses
                # maps: [nc] mAP@0.5:0.95 of every class (80 entries for COCO)

                results, maps, _ = val.run(data_dict,  # dataset config: paths, number of classes, class names, download URL
                                           batch_size=batch_size // WORLD_SIZE * 2,  # batch_size
                                           imgsz=imgsz,  # test img size
                                           model=ema.ema,  # ema model
                                           single_cls=single_cls,  # whether this is a single-class dataset
                                           dataloader=val_loader,  # test dataloader
                                           save_dir=save_dir,  # output directory, runs/train/expN
                                           save_json=is_coco and final_epoch,  # save predictions in COCO JSON format
                                           verbose=nc < 50 and final_epoch,  # print the mAP of every class
                                           plots=plots and final_epoch,  # whether to plot
                                           wandb_logger=wandb_logger,  # web-based visualization, similar to TensorBoard
                                           compute_loss=compute_loss)  # loss function (train)

            # Write: append the validation results to results.txt
            with open(results_file, 'a') as f:
                f.write(s + '%10.4g' * 7 % results + '\n')  # append metrics, val_loss

            # Log
            # wandb_logger is a web-based tool for displaying training information, similar to TensorBoard
            tags = ['train/box_loss', 'train/obj_loss', 'train/cls_loss',  # train loss
                    'metrics/precision', 'metrics/recall', 'metrics/mAP_0.5', 'metrics/mAP_0.5:0.95',
                    'val/box_loss', 'val/obj_loss', 'val/cls_loss',  # val loss
                    'x/lr0', 'x/lr1', 'x/lr2']  # params
            for x, tag in zip(list(mloss[:-1]) + list(results) + lr, tags):
                if loggers['tb']:
                    loggers['tb'].add_scalar(tag, x, epoch)  # TensorBoard
                if loggers['wandb']:
                    wandb_logger.log({tag: x})  # W&B

            # Update best mAP; the "best mAP" is really a weighted combination of [P, R, mAP@.5, mAP@.5-.95]
            # fi = 0.1 * mAP@.5 + 0.9 * mAP@.5-.95 (P and R do not contribute)
            fi = fitness(np.array(results).reshape(1, -1))  # weighted combination of [P, R, mAP@.5, mAP@.5-.95]
            if fi > best_fitness:
                best_fitness = fi
            wandb_logger.end_epoch(best_result=best_fitness == fi)

            # Save model
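            # Save a checkpoint that can be used for inference or for resuming training
            # Besides the model it stores epoch, best_fitness, training results, optimizer state, etc.
            # The optimizer state is kept here and removed by strip_optimizer once training has finished
            # Both the model and its EMA copy are saved, in half precision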
            if (not nosave) or (final_epoch and not evolve):  # if save
                ckpt = {'epoch': epoch,
                        'best_fitness': best_fitness,
                        'training_results': results_file.read_text(),
                        'model': deepcopy(de_parallel(model)).half(),
                        'ema': deepcopy(ema.ema).half(),
                        'updates': ema.updates,
                        'optimizer': optimizer.state_dict(),
                        'wandb_id': wandb_logger.wandb_run.id if loggers['wandb'] else None}

                # Save last, best and delete
                torch.save(ckpt, last)
                if best_fitness == fi:
                    torch.save(ckpt, best)
                if loggers['wandb']:
                    if ((epoch + 1) % opt.save_period == 0 and not final_epoch) and opt.save_period != -1:
                        wandb_logger.log_model(last.parent, opt, epoch, fi, best_model=best_fitness == fi)
                del ckpt

        # end epoch -----------------------
    # end training ----
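
The warmup step in the loop above is a two-point linear interpolation between iteration 0 and iteration nw. The following is a minimal pure-Python sketch of that schedule, not the code from train.py: `lerp` is a hypothetical stand-in for the two-point `np.interp` calls, `warmup_state` is a helper invented for illustration, and every default value is an assumption rather than something read from a hyp file.

```python
def lerp(x, x0, x1, y0, y1):
    """Two-point linear interpolation, clamped outside [x0, x1] like np.interp."""
    if x <= x0:
        return y0
    if x >= x1:
        return y1
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def warmup_state(ni, nw, base_lr, is_bias, nbs=64, batch_size=16,
                 warmup_bias_lr=0.1, warmup_momentum=0.8, momentum=0.937):
    """Return (accumulate, lr, momentum) for global iteration ni.

    base_lr plays the role of x['initial_lr'] * lf(epoch) in the training loop.
    All defaults are illustrative assumptions.
    """
    # accumulate grows from 1 to nbs / batch_size
    accumulate = max(1, round(lerp(ni, 0, nw, 1, nbs / batch_size)))
    # bias lr starts high and falls; every other lr starts at 0.0 and rises
    start_lr = warmup_bias_lr if is_bias else 0.0
    lr = lerp(ni, 0, nw, start_lr, base_lr)
    # momentum rises from its warmup value to its regular value
    mom = lerp(ni, 0, nw, warmup_momentum, momentum)
    return accumulate, lr, mom


if __name__ == "__main__":
    for ni in (0, 500, 1000):
        weights_group = warmup_state(ni, nw=1000, base_lr=0.01, is_bias=False)
        bias_group = warmup_state(ni, nw=1000, base_lr=0.01, is_bias=True)
        print(ni, weights_group, bias_group)
```

Under these assumed values the bias group moves from 0.1 down to 0.01 while the other groups move from 0.0 up to 0.01, and the number of accumulated batches grows from 1 to 4 over the same interval.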

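The optional image-weights branch at the top of each epoch turns per-class weights into per-image weights and then draws the epoch's image indices with `random.choices`. A minimal sketch of that idea using only the standard library follows. The helper names, the toy labels and the class weights are made up for illustration, and the `(1 - maps) ** 2` factor that the training loop also applies is left out for brevity.

```python
import random
from collections import Counter


def image_weights(labels_per_image, class_weights):
    """Weight of an image = sum of the class weights of all objects it contains."""
    return [sum(class_weights[c] for c in classes) for classes in labels_per_image]


def sample_indices(labels_per_image, class_weights, seed=0):
    """Draw len(dataset) image indices with replacement, proportional to image weight."""
    rng = random.Random(seed)  # seeded only so that the sketch is reproducible
    iw = image_weights(labels_per_image, class_weights)
    n = len(labels_per_image)
    return rng.choices(range(n), weights=iw, k=n)


if __name__ == "__main__":
    # Toy dataset (hypothetical): class ids of the objects in each image
    labels = [[0, 0, 0], [0, 1], [1], [2]]
    # Rare classes get larger weights; these numbers are illustrative only
    cw = [0.1, 0.3, 0.6]
    print(image_weights(labels, cw))
    print(Counter(sample_indices(labels, cw)))
```

Because sampling is done with replacement, some images can appear several times in an epoch and others not at all, and an image without any labels gets weight 0 and is never drawn.
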
4.8 Wrap-up: print some information

    if RANK in [-1, 0]:
        # Log: print the training time
        LOGGER.info(f'{epoch - start_epoch + 1} epochs completed in {(time.time() - t0) / 3600:.3f} hours.\n')

        # Plot the training results: results.png, confusion_matrix.png and the F1, PR, P and R curves
        if plots:
            plot_results(save_dir=save_dir)  # save as results.png
            if loggers['wandb']:
                files = ['results.png', 'confusion_matrix.png', *[f'{x}_curve.png' for x in ('F1', 'PR', 'P', 'R')]]
                wandb_logger.log({"Results": [loggers['wandb'].Image(str(save_dir / f), caption=f) for f in files
                                              if (save_dir / f).exists()]})
        # COCO evaluation: only runs on the COCO dataset, so it is rarely needed
        if not evolve:
            if is_coco:  # COCO dataset
                for m in [last, best] if best.exists() else [last]:  # speed, mAP tests
                    results, _, _ = val.run(data_dict,
                                            batch_size=batch_size // WORLD_SIZE * 2,
                                            imgsz=imgsz,
                                            model=attempt_load(m, device).half(),
                                            single_cls=single_cls,
                                            dataloader=val_loader,
                                            save_dir=save_dir,
                                            save_json=True,
                                            plots=False)

            # Strip optimizers
            # After training, strip_optimizer removes the optimizer from the checkpoint
            # and converts the model with model.half() (FP32 -> FP16), which shrinks the file and speeds up inference
            for f in last, best:
                if f.exists():
                    strip_optimizer(f)  # strip optimizers
            if loggers['wandb']:  # Log the stripped model
                loggers['wandb'].log_artifact(str(best if best.exists() else last), type='model',
                                              name='run_' + wandb_logger.wandb_run.id + '_model',
                                              aliases=['latest', 'best', 'stripped'])

        wandb_logger.finish_run()  # close wandb_logger

    torch.cuda.empty_cache()  # free cached GPU memory

4.9 Return results

    return results
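
The returned results tuple has the layout (P, R, mAP@.5, mAP@.5-.95, val box loss, val obj loss, val cls loss), and the training loop ranks epochs by a weighted fitness of its first four entries. The sketch below reproduces that ranking in plain Python using the weights stated in the loop's comment (0.1 for mAP@.5, 0.9 for mAP@.5-.95). `track_best` is a hypothetical helper and the per-epoch numbers are invented for illustration.

```python
def fitness(results):
    """Weighted score of (P, R, mAP@.5, mAP@.5-.95); trailing val losses are ignored."""
    weights = (0.0, 0.0, 0.1, 0.9)  # 0.1 * mAP@.5 + 0.9 * mAP@.5-.95
    return sum(w * r for w, r in zip(weights, results[:4]))


def track_best(history):
    """Return (best_fitness, best_epoch) using the same `fi > best_fitness` rule as the loop."""
    best_fitness, best_epoch = 0.0, None
    for epoch, results in enumerate(history):
        fi = fitness(results)
        if fi > best_fitness:
            best_fitness, best_epoch = fi, epoch
    return best_fitness, best_epoch


if __name__ == "__main__":
    # Made-up per-epoch results: P, R, mAP@.5, mAP@.5-.95, val box/obj/cls loss
    history = [
        (0.50, 0.40, 0.45, 0.25, 0.08, 0.05, 0.03),
        (0.60, 0.55, 0.60, 0.40, 0.06, 0.04, 0.02),
        (0.70, 0.50, 0.58, 0.39, 0.06, 0.04, 0.02),
    ]
    print(track_best(history))
```

In this toy history the third epoch has the highest precision but is not selected, because precision and recall carry zero weight in the fitness score.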

Finally

That is the whole walkthrough of the yolov5-5 train function compiled by 强健老鼠. Hopefully it helps with the problems you run into when working with the yolov5-5 train function.


Category: # YOLOv5
Views: 81
Published: 2023-07-27 09:10:03
Link: https://www.kaopuke.com/article/k-p-k_14_uzo_22_f1_13_j_26_5.html
