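Introduction:
While training on multiple GPUs on a server, I ran into this error:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group. After some searching, it turned out to be a distributed-training issue.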
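Distributed training in PyTorch is built mainly on the torch.distributed package, which provides a set of primitives for synchronizing computation and data across multiple processes. These primitives can communicate between processes running on multiple machines, and they support several backends (such as NCCL, Gloo, and MPI).
PyTorch distributed training mainly comes in two modes:
Data Parallel: this is the most common mode. Each worker holds a replica of the model and handles a subset of the dataset. All workers run the forward and backward passes in parallel, then synchronize so that every replica applies the same parameter update. PyTorch offers two ways to do this: torch.nn.DataParallel (a single process driving several GPUs) and torch.nn.parallel.DistributedDataParallel, DDP for short (one process per GPU).
Model Parallel: this mode is for models that are too large to fit on a single GPU. Different parts of the model run on different GPUs. It takes more complex code, but it lets you train larger models.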
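To make the data-parallel idea concrete, here is a minimal pure-Python sketch of how a dataset can be split so that each process works on its own subset. The function name `shard_indices` is hypothetical. It is similar in spirit to the strided split done by `torch.utils.data.distributed.DistributedSampler` (without shuffling), but it is only an illustration, not PyTorch's implementation.

```python
import math


def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Return the sample indices that process `rank` would handle."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Round the total up to a multiple of world_size and wrap around,
    # so every rank receives the same number of samples.
    total = math.ceil(num_samples / world_size) * world_size
    padded = [i % num_samples for i in range(total)]
    # Strided split: rank r takes items r, r + world_size, r + 2 * world_size, ...
    return padded[rank::world_size]


if __name__ == "__main__":
    for rank in range(2):
        print(rank, shard_indices(10, world_size=2, rank=rank))
```

With two processes, rank 0 gets the even indices and rank 1 the odd ones, so each GPU sees half of the data per epoch.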
Local M6000: 20-21 min per epoch.
Server with two A800s: 3-4 min per epoch.
Server with four A800s: 38-40 s per epoch.
As of July 2024, a single A800 80GB costs roughly 100,000 to 130,000 RMB :(
1. Training on a local machine
Preparing the training set
Every training set uses the following layout:
dataset
|_____images
|_____labels
The images folder holds the images and the labels folder holds the txt labels, in YOLO format: class_id center_x center_y width height.
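As a quick illustration of the label format, the sketch below converts one label line into a pixel-space box. It assumes the usual YOLO convention that the four coordinates are normalized to [0, 1] by the image width and height. The function name `yolo_to_pixel_box` is hypothetical and not part of YOLOv5.

```python
def yolo_to_pixel_box(line: str, img_w: int, img_h: int) -> tuple:
    """Convert 'class_id cx cy w h' (normalized) to (class_id, x1, y1, x2, y2) in pixels."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:])
    # Negative values are invalid; YOLOv5 skips such labels with a warning.
    if min(cx, cy, w, h) < 0:
        raise ValueError(f"negative label values in {line!r}")
    # Scale the normalized center/size back to pixel corner coordinates.
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return class_id, x1, y1, x2, y2


if __name__ == "__main__":
    print(yolo_to_pixel_box("0 0.5 0.5 0.5 0.5", img_w=640, img_h=480))
```

A check like this is handy for spotting bad annotations before training starts.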
You can prepare several training sets like this and combine them with a single yaml file.
Dataset configuration file:
Dataset/person.yaml
train:
  - Dataset/CoCoPerson_Mini/train
  - Dataset/ped/train
  - Dataset/labeled_dataset_20240724/door_dataset_4_grid # door
  - Dataset/labeled_dataset_20240724/mask_dataset_4_grid # mask
  - Dataset/labeled_dataset_20240724/video_day_with_person # day
  - Dataset/labeled_dataset_20240724/video_night_with_person # night
  - Dataset/dark_augment_dataset # dark
val: # val images (relative to 'path')
  - Dataset/CoCoPerson_Mini/val
  - Dataset/val_dataset_20240724
nc: 1
# Classes
names:
  0: person
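Each directory listed in the yaml is expected to follow the images/labels layout described earlier, with one txt file per image. The helper below is a hypothetical sketch of that pairing convention (swap the `images` directory for `labels` and the extension for `.txt`). It is not copied from YOLOv5, whose own loader may differ in details.

```python
from pathlib import PurePosixPath


def image_to_label_path(image_path: str) -> str:
    """Map '<root>/images/<name>.<ext>' to '<root>/labels/<name>.txt'."""
    parts = list(PurePosixPath(image_path).parts)
    # Replace the last directory component named "images" with "labels".
    for i in range(len(parts) - 2, -1, -1):
        if parts[i] == "images":
            parts[i] = "labels"
            break
    else:
        raise ValueError(f"no 'images' directory in {image_path!r}")
    return str(PurePosixPath(*parts).with_suffix(".txt"))


if __name__ == "__main__":
    print(image_to_label_path("Dataset/ped/train/images/0001.jpg"))
```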
models/person.yaml
Model configuration file:
nc: 1 # number of classes
depth_multiple: 0.33 # model depth multiple
width_multiple: 0.25 # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23] # P3/8
  - [30,61, 62,45, 59,119] # P4/16
  - [116,90, 156,198, 373,326] # P5/32
# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]], # 0-P1/2 args are, in order: [ch_out, kernel, stride, padding, groups]
   [-1, 1, Conv, [128, 3, 2]], # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]], # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]], # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]], # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]], # 9
  ]
# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]], # cat backbone P4
   [-1, 3, C3, [512, False]], # 13
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]], # cat backbone P3
   [-1, 3, C3, [256, False]], # 17 (P3/8-small)
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]], # cat head P4
   [-1, 3, C3, [512, False]], # 20 (P4/16-medium)
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]], # cat head P5
   [-1, 3, C3, [1024, False]], # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]], # Detect(P3, P4, P5)
  ]
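The channel counts and repeat counts in this file are not used directly: depth_multiple and width_multiple scale them down. The sketch below shows the arithmetic as I understand it (round the repeat count, and round channels up to a multiple of 8). The function names are hypothetical, and the rounding rules are an assumption about YOLOv5's model parser, though the results agree with the module table printed in the training log later in this post (for example, the first Conv ends up with 16 output channels and the 3/6/9 C3 repeats become 1/2/3).

```python
import math


def scale_depth(number: int, depth_multiple: float) -> int:
    """Scale a module's repeat count; modules with number == 1 are left alone."""
    return max(round(number * depth_multiple), 1) if number > 1 else number


def scale_width(channels: int, width_multiple: float, divisor: int = 8) -> int:
    """Scale output channels, rounding up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor


if __name__ == "__main__":
    for n in (3, 6, 9):
        print("repeats", n, "->", scale_depth(n, 0.33))
    for c in (64, 128, 256, 512, 1024):
        print("channels", c, "->", scale_width(c, 0.25))
```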
Training
python train.py --epochs 150 --data Dataset/person.yaml --batch-size 32 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 4 --name yolov5n_person --device 0
train.py
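What each argument of the training script does:
--weights: path to the initial weights.
--cfg: path to the model configuration file (model.yaml).
--data: path to the dataset configuration file (dataset.yaml).
--hyp: path to the hyperparameter configuration file.
--epochs: total number of training epochs.
--batch-size: total batch size across all GPUs; -1 enables autobatch.
--imgsz (aliases --img, --img-size): training and validation image size in pixels.
--rect: use rectangular training.
--resume: resume the most recent training run.
--nosave: only save the final checkpoint.
--noval: only validate on the final epoch.
--noautoanchor: disable AutoAnchor.
--noplots: do not save plot files.
--evolve: evolve the hyperparameters.
--bucket: gsutil bucket.
--cache: cache images (in RAM or on disk).
--image-weights: use weighted image selection during training.
--device: CUDA device(s), e.g. 0 or 0,1,2,3, or cpu.
--multi-scale: vary the image size during training.
--single-cls: train multi-class data as a single class.
--optimizer: the optimizer, one of 'SGD', 'Adam', 'AdamW'.
--sync-bn: use SyncBatchNorm; only available in DDP mode.
--workers: maximum number of dataloader workers (per RANK in DDP mode).
--project: the project part of the project/name save path.
--name: the name part of the project/name save path.
--exist-ok: accept an existing project/name instead of incrementing the run name.
--quad: use the quad dataloader.
--cos-lr: use a cosine learning-rate scheduler.
--label-smoothing: label smoothing epsilon.
--patience: early-stopping patience (number of epochs without improvement).
--freeze: layers to freeze, e.g. backbone=10, first3=0 1 2.
--save-period: save a checkpoint every x epochs (disabled if < 1).
--seed: global training seed.
--local_rank: automatic DDP multi-GPU argument; do not modify.
Logger arguments:
--entity: entity.
--upload_dataset: upload the dataset; accepts the "val" option.
--bbox_interval: bounding-box image logging interval.
--artifact_alias: version of the dataset artifact to use.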
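The --freeze argument is the least obvious one in the list. In the training log later in this post, `--freeze 4` shows up as `freeze=[4]` and the frozen parameters are exactly those of `model.0` through `model.3`. The sketch below reproduces that behavior under the assumption that a single value N means "the first N layers" while several values name the layers explicitly. The function names are hypothetical.

```python
def freeze_prefixes(freeze: list) -> list:
    """Turn the --freeze values into parameter-name prefixes to freeze."""
    # One value N -> layers 0..N-1; several values -> exactly those layers.
    layer_ids = freeze if len(freeze) > 1 else range(freeze[0])
    return [f"model.{i}." for i in layer_ids]


def should_freeze(param_name: str, prefixes: list) -> bool:
    """True if the parameter belongs to one of the frozen layers."""
    return any(param_name.startswith(p) for p in prefixes)


if __name__ == "__main__":
    prefixes = freeze_prefixes([4])
    print(prefixes)
    print(should_freeze("model.3.bn.bias", prefixes))
    print(should_freeze("model.4.cv1.conv.weight", prefixes))
```

In a real training loop, the matching parameters would have `requires_grad` set to False so the optimizer leaves them untouched.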
2. Distributed training
Training with two GPUs on the server produced the following error:
AutoAnchor: 4.52 anchors/target, 0.998 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/yolov5m_person3/labels.jpg...
Traceback (most recent call last):
File "train.py", line 646, in <module>
main(opt)
File "train.py", line 540, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 235, in train
model = smart_DDP(model)
File "/code/utils/torch_utils.py", line 63, in smart_DDP
return DDP(model, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 520, in __init__
self.process_group = _get_default_group()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 394, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
- Command for distributed training
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --epochs 100 --data Dataset/person.yaml --batch-size 256 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 4 --name yolov5n_person --device 0,1
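--nproc_per_node sets the number of processes on each node (a node here usually means one machine) and is normally set to your number of GPUs. --use_env tells the launcher to pass the local rank through the LOCAL_RANK environment variable instead of a --local_rank command-line argument, so the script reads LOCAL_RANK, RANK, and WORLD_SIZE from the environment. As the FutureWarning in the output notes, torch.distributed.launch is deprecated in favor of torch.distributed.run, where --use_env is the default.
train.py is your training script, and everything after it is passed to the script as arguments.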
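To show what the launcher changes from the script's point of view, here is a minimal sketch, not the YOLOv5 code. `read_dist_env`, `per_gpu_batch_size`, and `setup_ddp` are hypothetical names. The sketch assumes the NCCL backend and one GPU per process, and that the total batch size is split evenly across processes. The key point is that `init_process_group` has to run before the model is wrapped in DDP; otherwise the wrapper raises the "Default process group has not been initialized" error.

```python
import os


def read_dist_env(environ=os.environ) -> dict:
    """Read the rank variables that the launcher exports for each process."""
    return {
        "local_rank": int(environ.get("LOCAL_RANK", -1)),
        "rank": int(environ.get("RANK", -1)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


def per_gpu_batch_size(total_batch_size: int, world_size: int) -> int:
    """Split the total batch size evenly across processes."""
    if total_batch_size % world_size != 0:
        raise ValueError("batch size must be a multiple of WORLD_SIZE")
    return total_batch_size // world_size


def setup_ddp(model):
    """Initialize the default process group, then wrap the model in DDP."""
    # Imported lazily so the helpers above work without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    env = read_dist_env()
    if env["local_rank"] == -1:
        return model  # plain single-process run, nothing to initialize
    torch.cuda.set_device(env["local_rank"])
    # Reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl")
    model = model.to(torch.device("cuda", env["local_rank"]))
    return DDP(model, device_ids=[env["local_rank"]], output_device=env["local_rank"])
```

With the two-GPU command above, `per_gpu_batch_size(256, 2)` gives 128 images per GPU per step.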
Here is the output:
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --epochs 100 --data Dataset/person.yaml --batch-size 256 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 4 --name yolov5n_person --device 0,1
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:177: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=weights/yolov5n.pt, cfg=models/person.yaml, data=Dataset/person.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=100, batch_size=256, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=yolov5n_person, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[4], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2024-7-24 Python-3.8.10 torch-1.10.0a0+3fd9dcf CUDA:0 (NVIDIA A800-SXM4-80GB, 81251MiB)
CUDA:1 (NVIDIA A800-SXM4-80GB, 81251MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.0, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Bootstrap : Using bond0:10.0.0.156<0>
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Plugin name set by env to libnccl-net-none.so
cnwla-a800-p01107:46170:46170 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net-none.so), using internal implementation
cnwla-a800-p01107:46170:46170 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [1]mlx5_bond_1:1/RoCE [2]mlx5_bond_2:1/RoCE [3]mlx5_bond_3:1/RoCE [4]mlx5_bond_4:1/RoCE ; OOB bond0:10.0.0.156<0>
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Using network IB
NCCL version 2.11.4+cuda11.4
cnwla-a800-p01107:46171:46171 [1] NCCL INFO Bootstrap : Using bond0:10.0.0.156<0>
cnwla-a800-p01107:46171:46171 [1] NCCL INFO Plugin name set by env to libnccl-net-none.so
cnwla-a800-p01107:46171:46171 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net-none.so), using internal implementation
cnwla-a800-p01107:46171:46171 [1] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [1]mlx5_bond_1:1/RoCE [2]mlx5_bond_2:1/RoCE [3]mlx5_bond_3:1/RoCE [4]mlx5_bond_4:1/RoCE ; OOB bond0:10.0.0.156<0>
cnwla-a800-p01107:46171:46171 [1] NCCL INFO Using network IB
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_TC set by environment to 136.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_SL set by environment to 5.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_TC set by environment to 136.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_SL set by environment to 5.
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 00/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 01/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 02/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 03/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 04/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 05/16 : 0 1
cnwla-a800-p01107:46171:48359 [1] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4.
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 06/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 07/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 08/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 09/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 10/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 11/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 12/16 : 0 1
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 13/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 14/16 : 0 1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 15/16 : 0 1
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff,00000000
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff,00000000
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 00 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 00 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 01 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 01 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 02 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 02 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 03 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 03 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 04 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 04 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 05 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 05 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 06 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 06 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 07 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 07 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 08 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 08 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 09 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 09 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 10 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 10 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 11 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 11 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 12 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 12 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 13 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 13 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 14 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 14 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Channel 15 : 1[92000] -> 0[8d000] via P2P/IPC/read
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Channel 15 : 0[8d000] -> 1[92000] via P2P/IPC/read
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Connected all rings
cnwla-a800-p01107:46171:48359 [1] NCCL INFO Connected all trees
cnwla-a800-p01107:46171:48359 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cnwla-a800-p01107:46171:48359 [1] NCCL INFO 16 coll channels, 16 p2p channels, 16 p2p channels per peer
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Connected all rings
cnwla-a800-p01107:46170:48292 [0] NCCL INFO Connected all trees
cnwla-a800-p01107:46170:48292 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
cnwla-a800-p01107:46170:48292 [0] NCCL INFO 16 coll channels, 16 p2p channels, 16 p2p channels per peer
cnwla-a800-p01107:46170:48292 [0] NCCL INFO comm 0x7fc600008fb0 rank 0 nranks 2 cudaDev 0 busId 8d000 - Init COMPLETE
cnwla-a800-p01107:46171:48359 [1] NCCL INFO comm 0x7fa7b8008fb0 rank 1 nranks 2 cudaDev 1 busId 92000 - Init COMPLETE
cnwla-a800-p01107:46170:46170 [0] NCCL INFO Launch mode Parallel
from n params module arguments
0 -1 1 1760 models.common.Conv [3, 16, 6, 2, 2]
1 -1 1 4672 models.common.Conv [16, 32, 3, 2]
2 -1 1 4800 models.common.C3 [32, 32, 1]
3 -1 1 18560 models.common.Conv [32, 64, 3, 2]
4 -1 2 29184 models.common.C3 [64, 64, 2]
5 -1 1 73984 models.common.Conv [64, 128, 3, 2]
6 -1 3 156928 models.common.C3 [128, 128, 3]
7 -1 1 295424 models.common.Conv [128, 256, 3, 2]
8 -1 1 296448 models.common.C3 [256, 256, 1]
9 -1 1 164608 models.common.SPPF [256, 256, 5]
10 -1 1 33024 models.common.Conv [256, 128, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 90880 models.common.C3 [256, 128, 1, False]
14 -1 1 8320 models.common.Conv [128, 64, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 22912 models.common.C3 [128, 64, 1, False]
18 -1 1 36992 models.common.Conv [64, 64, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 74496 models.common.C3 [128, 128, 1, False]
21 -1 1 147712 models.common.Conv [128, 128, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 296448 models.common.C3 [256, 256, 1, False]
24 [17, 20, 23] 1 8118 models.yolo.Detect [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
person summary: 214 layers, 1765270 parameters, 1765270 gradients, 4.2 GFLOPs
Transferred 342/349 items from weights/yolov5n.pt
freezing model.0.conv.weight
freezing model.0.bn.weight
freezing model.0.bn.bias
freezing model.1.conv.weight
freezing model.1.bn.weight
freezing model.1.bn.bias
freezing model.2.cv1.conv.weight
freezing model.2.cv1.bn.weight
freezing model.2.cv1.bn.bias
freezing model.2.cv2.conv.weight
freezing model.2.cv2.bn.weight
freezing model.2.cv2.bn.bias
freezing model.2.cv3.conv.weight
freezing model.2.cv3.bn.weight
freezing model.2.cv3.bn.bias
freezing model.2.m.0.cv1.conv.weight
freezing model.2.m.0.cv1.bn.weight
freezing model.2.m.0.cv1.bn.bias
freezing model.2.m.0.cv2.conv.weight
freezing model.2.m.0.cv2.bn.weight
freezing model.2.m.0.cv2.bn.bias
freezing model.3.conv.weight
freezing model.3.bn.weight
freezing model.3.bn.bias
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.002), 60 bias
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning /workspace/Dataset/CoCoPerson_Mini/train/labels.cache... 24223 images, 2501 backgrounds, 1 corrupt: 100%|██████████| 24
train: WARNING ⚠️ /workspace/Dataset/CoCoPerson_Mini/train/images/000000458309.jpg: ignoring corrupt image/label: negative label values [-0.00081699]
val: Scanning /workspace/Dataset/CoCoPerson_Mini/val/labels.cache... 10767 images, 536 backgrounds, 0 corrupt: 100%|██████████| 10767/1
AutoAnchor: 4.52 anchors/target, 0.998 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/yolov5n_person3/labels.jpg...
Image sizes 640 train, 640 val
Using 16 dataloader workers
Logging results to runs/train/yolov5n_person3
Starting training for 100 epochs...
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
0/99 15.9G 0.07459 0.03704 0 467 640: 100%|██████████| 95/95 [01:46<00:00, 1.12s/it]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:35<00:00, 1.21it/s]
all 10767 32315 0.762 0.775 0.817 0.368
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
1/99 41.3G 0.05806 0.03021 0 410 640: 100%|██████████| 95/95 [01:00<00:00, 1.56it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:33<00:00, 1.28it/s]
all 10767 32315 0.743 0.794 0.818 0.436
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
2/99 41.3G 0.05144 0.02921 0 459 640: 100%|██████████| 95/95 [01:01<00:00, 1.53it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:32<00:00, 1.32it/s]
all 10767 32315 0.86 0.838 0.91 0.488
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
3/99 41.3G 0.04577 0.02913 0 419 640: 100%|██████████| 95/95 [01:01<00:00, 1.55it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:32<00:00, 1.31it/s]
all 10767 32315 0.861 0.851 0.92 0.614
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
4/99 41.3G 0.04273 0.02861 0 417 640: 100%|██████████| 95/95 [01:01<00:00, 1.55it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:33<00:00, 1.28it/s]
all 10767 32315 0.877 0.848 0.927 0.669
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
5/99 41.3G 0.04143 0.02842 0 420 640: 100%|██████████| 95/95 [01:01<00:00, 1.54it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:33<00:00, 1.28it/s]
all 10767 32315 0.864 0.837 0.913 0.618
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
6/99 41.3G 0.04074 0.02866 0 463 640: 100%|██████████| 95/95 [01:01<00:00, 1.55it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:34<00:00, 1.26it/s]
all 10767 32315 0.863 0.838 0.915 0.68
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
7/99 41.3G 0.0401 0.02827 0 418 640: 100%|██████████| 95/95 [01:00<00:00, 1.57it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:33<00:00, 1.28it/s]
all 10767 32315 0.857 0.814 0.899 0.655
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
8/99 41.3G 0.03952 0.02857 0 412 640: 100%|██████████| 95/95 [00:59<00:00, 1.59it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 43/43 [00:33<00:00, 1.28it/s]
all 10767 32315 0.863 0.834 0.916 0.681
3. Checking GPU usage
/code# nvidia-smi
Wed Jul 24 20:40:14 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800-SXM... On | 00000000:8D:00.0 Off | 0 |
| N/A 36C P0 172W / 400W | 41510MiB / 81251MiB | 88% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800-SXM... On | 00000000:92:00.0 Off | 0 |
| N/A 39C P0 166W / 400W | 17158MiB / 81251MiB | 98% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1640244 C /opt/conda/bin/python 41479MiB |
| 1 N/A N/A 1640245 C /opt/conda/bin/python 17151MiB |
+-----------------------------------------------------------------------------+
You can see that both GPUs are running!
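If you want the same numbers in a script instead of reading the table by eye, nvidia-smi can print them as CSV. The sketch below is one way to do it. `parse_gpu_csv` and `query_gpus` are hypothetical helper names, and the parser assumes every queried field is a plain integer (on some systems a field may be reported as N/A, which this sketch does not handle).

```python
import subprocess

QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text: str) -> list:
    """Parse 'index, used, total, util' lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, used, total, util = (int(v.strip()) for v in line.split(","))
        gpus.append(
            {"index": index, "mem_used_mib": used, "mem_total_mib": total, "util_pct": util}
        )
    return gpus


def query_gpus() -> list:
    """Run nvidia-smi and return one dict per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    # Sample text using the memory and utilization values from the table above.
    sample = "0, 41510, 81251, 88\n1, 17158, 81251, 98\n"
    for gpu in parse_gpu_csv(sample):
        print(gpu)
```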
- Training with 4 GPUs
Thu Jul 25 20:14:08 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800-SXM... On | 00000000:21:00.0 Off | 0 |
| N/A 39C P0 195W / 400W | 33384MiB / 81251MiB | 95% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800-SXM... On | 00000000:27:00.0 Off | 0 |
| N/A 43C P0 183W / 400W | 33382MiB / 81251MiB | 98% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800-SXM... On | 00000000:51:00.0 Off | 0 |
| N/A 40C P0 125W / 400W | 33382MiB / 81251MiB | 98% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800-SXM... On | 00000000:56:00.0 Off | 0 |
| N/A 38C P0 166W / 400W | 33286MiB / 81251MiB | 98% E. Process |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3135579 C /opt/conda/bin/python 33343MiB |
| 1 N/A N/A 3135580 C /opt/conda/bin/python 33373MiB |
| 2 N/A N/A 3135581 C /opt/conda/bin/python 33373MiB |
| 3 N/A N/A 3135582 C /opt/conda/bin/python 33277MiB |
+-----------------------------------------------------------------------------+
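With four GPUs, training kept failing on an assertion:
assert torch.cuda.device_count() > LOCAL_RANK, 'insufficient CUDA devices for DDP command'
Running torch.cuda.device_count() in a separate session on the server printed 4, so four GPUs were mounted. During training, however, printing torch.cuda.device_count() just before the assertion gave 2, while LOCAL_RANK took the values 0, 1, 2, 3, so the assertion fired for LOCAL_RANK 2 and 3.
Printing torch.cuda.device_count() at the top of train.py, right below GIT_INFO, made the later calls return 4 as well. Calling it early made the problem go away, which looked like magic. A likely explanation is that the command below still passes --device 0,1: YOLOv5 sets CUDA_VISIBLE_DEVICES from --device, which would limit each process to two GPUs, whereas an earlier torch.cuda.device_count() call initializes CUDA while all four are still visible. If that is the cause, passing --device 0,1,2,3 should be the cleaner fix.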
LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1)) # https://pytorch.org/docs/stable/elastic/run.html
RANK = int(os.getenv('RANK', -1))
WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))
GIT_INFO = check_git_info()
# After printing the values below, everything worked
print("local rank:", LOCAL_RANK)
print("RANK: ",RANK)
print("WORLD_SIZE: ",WORLD_SIZE)
print("torch.cuda.device_count(): ",torch.cuda.device_count())
python -m torch.distributed.run --nproc_per_node=4 train.py --epochs 250 --data Dataset/person.yaml --batch-size 1024 --weights weights/yolov5n.pt --img 640 --cfg models/person.yaml --freeze 3 --name yolov5n_person --device 0,1 --project /workspace/runs/model --save-period 10
Partial output
local rank: 2
RANK: 2
WORLD_SIZE: 4
local rank: 1
RANK: 1
WORLD_SIZE: 4
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) local rank: 3
RANK: 3
WORLD_SIZE: 4
torch.cuda.device_count(): 4
torch.cuda.device_count(): 4
torch.cuda.device_count(): 4
4 LOCAL_RANK: 2
4 LOCAL_RANK: 3
4 LOCAL_RANK: 1
wandb: W&B disabled due to login timeout.
local rank: 0
RANK: 0
WORLD_SIZE: 4
torch.cuda.device_count(): 4
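One thing worth checking in the four-GPU command above is that it launches 4 processes but still passes `--device 0,1`. Assuming the training script limits the visible GPUs to the ones listed in `--device`, a mismatch like this would be enough to trip the device-count assertion. The helper below is a hypothetical pre-flight check, not part of YOLOv5 or PyTorch, that fails early when the two values disagree.

```python
def check_device_arg(device: str, nproc_per_node: int) -> None:
    """Raise ValueError if --device lists fewer GPUs than processes to launch."""
    if device.strip().lower() == "cpu":
        raise ValueError("multi-process GPU training needs CUDA devices, got --device cpu")
    # Count the comma-separated GPU ids, ignoring stray spaces and empty items.
    ids = [d.strip() for d in device.split(",") if d.strip()]
    if len(ids) < nproc_per_node:
        raise ValueError(
            f"--device {device!r} lists {len(ids)} GPU id(s) "
            f"but --nproc_per_node={nproc_per_node}"
        )


if __name__ == "__main__":
    check_device_arg("0,1,2,3", 4)  # consistent, no error
    try:
        check_device_arg("0,1", 4)
    except ValueError as err:
        print("mismatch:", err)
```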
Explanation of the local_rank argument in PyTorch distributed training: https://pytorch.org/docs/stable/elastic/run.html