Shape Changes
1. Initial Input
# Image data
imgs = torch.randn(4, 3, 8, 224, 224)
# Heatmap data
heatmap_imgs = torch.randn(4, 17, 32, 56, 56)
# Call the forward function
a = RGBPoseConv3D()
output = a.forward(imgs, heatmap_imgs)
1.1 Explanation
N: the batch size is 4
Cin: the number of input channels is 3
Din: 8 consecutive image frames
Hin: each frame is 224 pixels tall
Win: each frame is 224 pixels wide
2. begin_rgb_path_conv1
print("begin_rgb_path_conv1")
x_rgb = self.rgb_path.conv1(imgs)
print(x_rgb.shape)
2.1 Output
begin_rgb_path_conv1
torch.Size([4, 64, 8, 112, 112])
2.2 Network Structure
ConvModule(
(conv): Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3), bias=False)
(bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
2.3 Corresponding Model-Construction Code
def _make_stem_layer(self):
"""Construct the stem layers, consisting of a conv+norm+act module
and a pooling layer."""
self.conv1 = ConvModule(
self.in_channels,
self.base_channels,
kernel_size=self.conv1_kernel,
stride=(self.conv1_stride[0], self.conv1_stride[1], self.conv1_stride[1]),
padding=tuple([(k - 1) // 2 for k in _triple(self.conv1_kernel)]),
bias=False,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg)
self.maxpool = nn.MaxPool3d(
kernel_size=(1, 3, 3),
stride=(self.pool1_stride[0], self.pool1_stride[1], self.pool1_stride[1]),
padding=(0, 1, 1))
2.4 Explanation of base_channels
base_channels (int): Channel num of stem output features. Default: 64.
2.5 Computation
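This section can be filled in with the standard convolution output-size formula, floor((in + 2·padding − kernel) / stride) + 1, applied per dimension. A minimal sketch (`conv_out` is a hypothetical helper, not part of mmaction2):

```python
def conv_out(size, kernel, stride, padding):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# conv1: kernel (1, 7, 7), stride (1, 2, 2), padding (0, 3, 3)
d = conv_out(8, 1, 1, 0)     # temporal: 8 -> 8
h = conv_out(224, 7, 2, 3)   # height:   224 -> 112
w = conv_out(224, 7, 2, 3)   # width:    224 -> 112
print((4, 64, d, h, w))      # (4, 64, 8, 112, 112), matching section 2.1
```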
3. begin_rgb_path_maxpool
print("begin_rgb_path_maxpool")
x_rgb = self.rgb_path.maxpool(x_rgb)
print(x_rgb.shape)
3.1 Output
begin_rgb_path_maxpool
torch.Size([4, 64, 8, 56, 56])
3.2 Network Structure
MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), dilation=1, ceil_mode=False)
3.3 Corresponding Model-Construction Code
def _make_stem_layer(self):
"""Construct the stem layers, consisting of a conv+norm+act module
and a pooling layer."""
self.conv1 = ConvModule(
self.in_channels,
self.base_channels,
kernel_size=self.conv1_kernel,
stride=(self.conv1_stride[0], self.conv1_stride[1], self.conv1_stride[1]),
padding=tuple([(k - 1) // 2 for k in _triple(self.conv1_kernel)]),
bias=False,
conv_cfg=self.conv_cfg,
norm_cfg=self.norm_cfg,
act_cfg=self.act_cfg)
self.maxpool = nn.MaxPool3d(
kernel_size=(1, 3, 3),
stride=(self.pool1_stride[0], self.pool1_stride[1], self.pool1_stride[1]),
padding=(0, 1, 1))
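With ceil_mode=False, the maxpool's output size follows the same floor formula as convolution. A minimal sketch (`pool_out` is a hypothetical helper):

```python
def pool_out(size, kernel, stride, padding):
    # floor((size + 2*padding - kernel) / stride) + 1, ceil_mode=False
    return (size + 2 * padding - kernel) // stride + 1

# maxpool: kernel (1, 3, 3), stride (1, 2, 2), padding (0, 1, 1)
d = pool_out(8, 1, 1, 0)     # temporal: 8 -> 8
h = pool_out(112, 3, 2, 1)   # height:   112 -> 56
print((4, 64, d, h, h))      # (4, 64, 8, 56, 56), matching section 3.1
```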
4. begin_rgb_path_layer1
4.0 Input
torch.Size([4, 64, 8, 56, 56])
4.1 Output
print("begin_rgb_path_layer1")
x_rgb = self.rgb_path.layer1(x_rgb)
print(x_rgb.shape)
# Output:
begin_rgb_path_layer1
torch.Size([4, 256, 8, 56, 56])
4.2 Network Structure
Sequential(
(0): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(64, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(downsample): ConvModule(
(conv): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(1): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(256, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(2): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(256, 64, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(64, 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(64, 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
)
4.3 Computation
Input tensor shape: torch.Size([4, 64, 8, 56, 56])
First Bottleneck3d block:
conv1:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 64, 8, 56, 56) (stride is 1, so the shape is unchanged)
conv2:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 3, 3)
Output shape: (4, 64, 8, 56, 56) (stride is 1 and padding is (0, 1, 1), so the shape is unchanged)
conv3:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 256, 8, 56, 56) (channel count goes from 64 to 256)
downsample:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 256, 8, 56, 56) (channel count goes from 64 to 256)
Final output shape: (4, 256, 8, 56, 56)
Second Bottleneck3d block:
conv1:
Input shape: (4, 256, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 64, 8, 56, 56) (channel count goes from 256 to 64)
conv2:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 3, 3)
Output shape: (4, 64, 8, 56, 56) (stride is 1, so the shape is unchanged)
conv3:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 256, 8, 56, 56) (channel count goes from 64 to 256)
Final output shape: (4, 256, 8, 56, 56)
Third Bottleneck3d block:
conv1:
Input shape: (4, 256, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 64, 8, 56, 56) (channel count goes from 256 to 64)
conv2:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 3, 3)
Output shape: (4, 64, 8, 56, 56) (stride is 1, so the shape is unchanged)
conv3:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 256, 8, 56, 56) (channel count goes from 64 to 256)
Final output shape: (4, 256, 8, 56, 56)
So after the 3 Bottleneck3d blocks, the final output tensor shape is (4, 256, 8, 56, 56).
4.4 A Question
conv3:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 256, 8, 56, 56) (channel count goes from 64 to 256)
downsample:
Input shape: (4, 64, 8, 56, 56)
Kernel size: (1, 1, 1)
Output shape: (4, 256, 8, 56, 56) (channel count goes from 64 to 256)
Final output shape: (4, 256, 8, 56, 56). I don't quite understand this part: conv3 changes the channel count from 64 to 256, and then the tensor enters the downsample layer. Shouldn't the downsample layer's input channel count be 256?
The change from 64 to 256 channels does deserve a closer look.
In the structure above:
conv3:
- input channels C_in = 64
- output channels C_out = 256
- this layer raises the channel count from 64 to 256.
downsample:
- input channels C_in = 64
- output channels C_out = 256
- this layer also raises the channel count from 64 to 256.
The question is why the downsample layer's input channel count is still 64 rather than 256, given that conv3 has already raised the channel count to 256.
The reason is that the Bottleneck3d module has another connection path (simplified forward):
out = self.conv1(x)
out = self.conv2(out)
out = self.conv3(out)
identity = self.downsample(x) if self.downsample is not None else x
out += identity
As you can see, the downsample layer acts on the original input x, not on conv3's output. This implements the "skip connection", also known as the residual connection.
In other words, the downsample layer raises the channel count of x from 64 to 256 so that it matches conv3's output channels and the residual addition is well-defined. This design preserves low-level features in the network and improves overall performance.
To sum up: the 64-to-256 channel change happens in conv3, while the downsample layer exists only to match conv3's output channels so the residual connection can be applied. This design is a hallmark of the ResNet family.
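To make this concrete, the shapes through the first Bottleneck3d block can be traced with a shape-only sketch (`bottleneck_shapes` is a hypothetical helper, not mmaction2 code):

```python
def bottleneck_shapes(x_shape, planes, expansion=4, has_downsample=True):
    """Trace tensor shapes through one bottleneck block (shapes only, no tensors)."""
    n, c, d, h, w = x_shape
    out = (n, planes, d, h, w)               # conv1: 1x1x1, c -> planes
    out = (n, planes, d, h, w)               # conv2: 1x3x3 with padding, shape kept
    out = (n, planes * expansion, d, h, w)   # conv3: 1x1x1, planes -> planes*expansion
    # downsample acts on the ORIGINAL input x, so its input channels equal c (64 here)
    identity = (n, planes * expansion, d, h, w) if has_downsample else x_shape
    assert out == identity                   # out += identity requires matching shapes
    return out

print(bottleneck_shapes((4, 64, 8, 56, 56), planes=64))  # (4, 256, 8, 56, 56)
```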
4.5 Takeaway
# Input
torch.Size([4, 64, 8, 56, 56])
# After the downsample layer:
torch.Size([4, 256, 8, 56, 56])
5. begin_rgb_path_layer2
5.0 Input
torch.Size([4, 256, 8, 56, 56])
5.1 Output
print("begin_rgb_path_layer2")
x_rgb = self.rgb_path.layer2(x_rgb)
print(x_rgb.shape)
# Output:
begin_rgb_path_layer2
torch.Size([4, 512, 8, 28, 28])
5.2 Network Structure
Sequential(
(0): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(256, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(128, 128, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(128, 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(downsample): ConvModule(
(conv): Conv3d(256, 512, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
(bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(1): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(512, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(128, 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(128, 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(2): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(512, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(128, 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(128, 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(3): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(512, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(128, 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(128, 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
)
5.3 Computation
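This section can be filled in with the same output-size formula: in layer2 only conv2 of the first block has spatial stride 2, which halves H and W, and conv3 then raises the channels to 512. A minimal sketch (`conv_out` is a hypothetical helper):

```python
def conv_out(size, kernel, stride, padding):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# First block of layer2: conv2 has kernel (1, 3, 3), stride (1, 2, 2), padding (0, 1, 1)
h = conv_out(56, 3, 2, 1)    # spatial: 56 -> 28
d = conv_out(8, 1, 1, 0)     # temporal dim unchanged: 8
# conv3 maps 128 -> 512 channels; the remaining three blocks keep the shape
print((4, 512, d, h, h))     # (4, 512, 8, 28, 28), matching section 5.1
```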
6. Initial Input (Pose Pathway)
# Image data
imgs = torch.randn(4, 3, 8, 224, 224)
# Heatmap data
heatmap_imgs = torch.randn(4, 17, 32, 56, 56)
# Call the forward function
a = RGBPoseConv3D()
output = a.forward(imgs, heatmap_imgs)
6.1 Explanation
N: the batch size is 4
Cin: the number of input channels is 17
Din: 32 consecutive heatmap frames
Hin: each heatmap frame is 56 pixels tall
Win: each heatmap frame is 56 pixels wide
The shape of heatmap_imgs is (4, 17, 32, 56, 56), where 17 is the number of heatmap channels.
In human pose estimation, heatmaps represent the probability distribution of each keypoint's location. Each keypoint (head, left wrist, etc.) gets its own heatmap channel, so the channel count usually equals the number of keypoints.
Here the heatmap has 17 channels, meaning the model works with 17 keypoints, typically the major joints such as the head, shoulders, elbows, and knees.
In short, the heatmap channel count is the number of keypoints, which is standard in pose-estimation and related computer-vision tasks.
7. begin_pose_path_conv1
7.0 Input
torch.Size([4, 17, 32, 56, 56])
7.1 Output
print("begin_pose_path_conv1")
x_pose = self.pose_path.conv1(heatmap_imgs)
print(x_pose.shape)
begin_pose_path_conv1
torch.Size([4, 32, 32, 56, 56])
7.2 Network Structure
ConvModule(
(conv): Conv3d(17, 32, kernel_size=(1, 7, 7), stride=(1, 1, 1), padding=(0, 3, 3), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
7.3 Computation
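The pose pathway's conv1 keeps the spatial size because its stride is 1 and its padding (0, 3, 3) exactly offsets the (1, 7, 7) kernel; only the channel count changes, 17 -> 32. A minimal sketch (`conv_out` is a hypothetical helper):

```python
def conv_out(size, kernel, stride, padding):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# pose conv1: kernel (1, 7, 7), stride (1, 1, 1), padding (0, 3, 3)
h = conv_out(56, 7, 1, 3)    # spatial: 56 -> 56
d = conv_out(32, 1, 1, 0)    # temporal: 32 -> 32
print((4, 32, d, h, h))      # (4, 32, 32, 56, 56), matching section 7.1
```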
8. begin_pose_path_maxpool
8.0 Input
torch.Size([4, 32, 32, 56, 56])
8.1 Output
print("begin_pose_path_maxpool")
x_pose = self.pose_path.maxpool(x_pose)
print(x_pose.shape)
begin_pose_path_maxpool
torch.Size([4, 32, 32, 56, 56])
8.2 Network Structure
MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), dilation=1, ceil_mode=False)
9. begin_pose_path_layer1
9.0 Input
torch.Size([4, 32, 32, 56, 56])
9.1 Output
print("begin_pose_path_layer1")
x_pose = self.pose_path.layer1(x_pose)
print(x_pose.shape)
begin_pose_path_layer1
torch.Size([4, 128, 32, 28, 28])
9.2 Network Structure
Sequential(
(0): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(32, 32, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(32, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(downsample): ConvModule(
(conv): Conv3d(32, 128, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(1): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(128, 32, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(32, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(2): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(128, 32, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(32, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
(3): Bottleneck3d(
(conv1): ConvModule(
(conv): Conv3d(128, 32, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv2): ConvModule(
(conv): Conv3d(32, 32, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
(bn): BatchNorm3d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
(conv3): ConvModule(
(conv): Conv3d(32, 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
(bn): BatchNorm3d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(relu): ReLU(inplace=True)
)
)
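Unlike the RGB pathway's layer1, the first block here downsamples spatially: conv2 and downsample both use stride (1, 2, 2), which explains the 56 -> 28 change in section 9.1. A minimal sketch (`conv_out` is a hypothetical helper):

```python
def conv_out(size, kernel, stride, padding):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# First block of pose layer1: conv2 kernel (1, 3, 3), stride (1, 2, 2), padding (0, 1, 1)
h = conv_out(56, 3, 2, 1)    # spatial: 56 -> 28
d = conv_out(32, 1, 1, 0)    # temporal dim unchanged: 32
# conv3 maps 32 -> 128 channels; the remaining three blocks keep the shape
print((4, 128, d, h, h))     # (4, 128, 32, 28, 28), matching section 9.1
```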