The DeepLab Series



The DeepLab series comprises four papers: DeepLab v1, DeepLab v2, DeepLab v3, and DeepLab v3+.

Because the spatial detail in a convolutional neural network becomes highly abstracted through successive layers, the network acquires good translation invariance, which makes it well suited to image classification; however, the output of its final layer is too coarse to localize objects accurately for pixel-level classification.

1. Fundamentals

1.1 Dilated (Atrous) Convolution

As the figure above showed (not reproduced here), both convolutions produce a 3×3 output feature map. The standard 3×3 convolution (stride 2 here) corresponds to a 5×5 receptive field, while the dilated convolution also has only 3×3 parameters but an effective kernel size of 5×5, built automatically by a dilation rate of 2; its receptive field is accordingly 7×7.
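These numbers can be checked directly in PyTorch; below is a minimal sketch (input sizes chosen so that both layers produce a 3×3 output):

import torch
import torch.nn as nn

# Standard 3x3 conv, stride 2: a 7x7 input yields a 3x3 output.
conv = nn.Conv2d(1, 1, kernel_size=3, stride=2)
print(conv(torch.randn(1, 1, 7, 7)).shape)    # torch.Size([1, 1, 3, 3])

# Dilated 3x3 conv, dilation 2: still only 9 weights, but spread over an
# effective window of k + (k - 1) * (d - 1) = 5; with stride 2 a 9x9 input
# yields the same 3x3 output.
atrous = nn.Conv2d(1, 1, kernel_size=3, stride=2, dilation=2)
print(atrous(torch.randn(1, 1, 9, 9)).shape)  # torch.Size([1, 1, 3, 3])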


The two figures above (also omitted) illustrate why networks commonly stack 3×3 convolutions: for the same receptive field, stacking requires less computation while adding extra non-linear transformations.
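To make the saving concrete, here is a quick parameter count for C = 64 channels, comparing two stacked 3×3 convolutions against a single 5×5 convolution covering the same 5×5 receptive field:

import torch.nn as nn

C = 64
stacked = nn.Sequential(nn.Conv2d(C, C, 3, padding=1),
                        nn.ReLU(inplace=True),
                        nn.Conv2d(C, C, 3, padding=1))
single = nn.Conv2d(C, C, 5, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), count(single))  # 73856 102464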

1.2 SPP

SPP (Spatial Pyramid Pooling) has also been used in YOLO: max-pooling is applied so that the output has a fixed size. The principle: whatever the input feature map's size, it is divided into 4×4, 2×2, and 1×1 grids, so the output is always (16+4+1)×256. (Because the input sizes differ, the pool size and stride used in each max-pooling differ accordingly.)
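A minimal sketch of this fixed-size pooling (AdaptiveMaxPool2d picks the pool size and stride for each input automatically, which is exactly the adaptation described above):

import torch
import torch.nn as nn

class SPP(nn.Module):
    """Pool an input of any size into fixed 4x4, 2x2 and 1x1 grids."""

    def __init__(self, grids=(4, 2, 1)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(g) for g in grids)

    def forward(self, x):
        n = x.shape[0]
        # Each branch yields channels * g * g values; flatten and concatenate.
        return torch.cat([p(x).view(n, -1) for p in self.pools], dim=1)

spp = SPP()
for size in (32, 57):                    # different input sizes...
    out = spp(torch.randn(1, 256, size, size))
    print(out.shape)                     # ...same (16+4+1)*256 = 5376 outputs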

The first figure (omitted) uses an image pyramid to handle multiple scales, which is time-consuming. The second is a UNet-style structure with multi-scale inputs. The third introduces dilated convolution into UNet, and the fourth replaces the max-pooling in SPP with dilated convolutions, specifically:

Because the dilated convolutions use different rates, they have different receptive fields and attend to different features, which addresses the multi-scale problem while keeping every output the same size, laying the groundwork for concatenation. This is ASPP (Atrous Spatial Pyramid Pooling).
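A minimal sketch of the idea (the real mmsegmentation module appears in section 2.5; the rates here follow the common (1, 6, 12, 18) setting):

import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel 3x3 convs with different dilation rates, concatenated."""

    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding = dilation keeps the spatial size unchanged for 3x3 kernels,
        # so all branches can be concatenated along the channel dimension.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.project = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 2048, 32, 32)
print(SimpleASPP(2048, 256)(x).shape)  # torch.Size([1, 256, 32, 32])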

1.3 From DeepLab v1 to v3+

DeepLab v1: dilated convolution + CRF

  • Reduce the number of downsampling steps to preserve as much spatial position information as possible;
  • Use dilated convolution to enlarge the receptive field and capture more contextual information;
  • Apply a fully connected Conditional Random Field (CRF) as post-processing to improve the model's ability to capture fine detail.

DeepLab v2: ASPP

  • Building on DeepLab v1, raises the multi-scale problem and proposes the ASPP module to capture contextual information at multiple scales.
  • Still uses CRF post-processing to handle edge details.

DeepLab v3: improved ASPP

  • Improves the ASPP module by adding BN layers;
  • Explores how to assemble the ASPP module: the parallel arrangement yields better accuracy.
  • With a large dilation rate, most kernel weights stop contributing and only the center weight remains effective, so the convolution degenerates into a 1×1 convolution; image-level features are therefore fused with the ASPP features.

DeepLab v3+: DeepLab v3 + encoder-decoder

  • Uses an encoder-decoder structure (high-level features supply semantics while the decoder progressively recovers boundary information), improving segmentation quality while attending to boundaries.
  • In the encoder, Xception serves as the backbone, and depthwise separable convolutions (see the sketch below) are applied in both the ASPP and decoder modules, making the network faster.
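
Depthwise separable convolution factors a standard convolution into a per-channel (depthwise) convolution plus a 1×1 (pointwise) convolution, cutting computation roughly by a factor of the kernel area. A minimal sketch (mmcv's DepthwiseSeparableConvModule additionally inserts norm and activation layers):

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        # groups=in_ch: every input channel is filtered independently.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=dilation * (kernel_size // 2),
                                   dilation=dilation, groups=in_ch)
        # The 1x1 conv then mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 256, 32, 32)
print(DepthwiseSeparableConv(256, 256)(x).shape)  # torch.Size([1, 256, 32, 32])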

DeepLab's design philosophy differs from UNet's: it leans toward classification networks, the main difference being how the head is handled. Viewed from segmentation theory, the encoding and decoding sides still differ considerably.

2. Source Code Walkthrough

2.1 Configuration Files of the Different Networks

The source code in this article comes from mmsegmentation; the figures above (omitted) showed the configuration files of the three networks, so we will walk through the key modules.
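Since those figures are not reproduced here, the sketch below shows roughly what a DeepLab v3 config looks like in mmsegmentation (a simplified illustration, not a verbatim config; see the configs/deeplabv3 directory of the mmsegmentation repository for the real files):

# Simplified sketch of an mmsegmentation DeepLab v3 config (values illustrative).
model = dict(
    type='EncoderDecoder',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        dilations=(1, 1, 2, 4),   # dilated convs in the last two stages
        strides=(1, 2, 1, 1),     # output stride 8 instead of 32
        norm_cfg=dict(type='SyncBN', requires_grad=True)),
    decode_head=dict(
        type='ASPPHead',
        in_channels=2048,
        channels=512,
        dilations=(1, 12, 24, 36),
        num_classes=19,
        loss_decode=dict(type='CrossEntropyLoss', loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        channels=256,
        num_classes=19,
        loss_decode=dict(type='CrossEntropyLoss', loss_weight=0.4)))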

2.2 Overall Logic

The networks are all built on the EncoderDecoder framework; the training loss method is:

    def loss(self, inputs: Tensor, data_samples: SampleList) -> dict:
        """Calculate losses from a batch of inputs and data samples.

        Args:
            inputs (Tensor): Input images.
            data_samples (list[:obj:`SegDataSample`]): The seg data samples.
                It usually includes information such as `metainfo` and
                `gt_sem_seg`.

        Returns:
            dict[str, Tensor]: a dictionary of loss components
        """

        x = self.extract_feat(inputs)

        losses = dict()

        loss_decode = self._decode_head_forward_train(x, data_samples)
        losses.update(loss_decode)

        if self.with_auxiliary_head:
            loss_aux = self._auxiliary_head_forward_train(x, data_samples)
            losses.update(loss_aux)

        return losses

The overall logic:

1. The backbone extracts features.

2. The decode head computes its loss.

3. Usually there is also an auxiliary head, whose loss is merged in as well.

The feature-extraction code:

    def extract_feat(self, inputs: Tensor) -> List[Tensor]:
        """Extract features from images."""
        x = self.backbone(inputs)
        if self.with_neck:
            x = self.neck(x)
        return x

The loss computation:

    def _decode_head_forward_train(self, inputs: List[Tensor],
                                   data_samples: SampleList) -> dict:
        """Run forward function and calculate loss for decode head in
        training."""
        losses = dict()
        loss_decode = self.decode_head.loss(inputs, data_samples,
                                            self.train_cfg)

        losses.update(add_prefix(loss_decode, 'decode'))
        return losses

    def _auxiliary_head_forward_train(self, inputs: List[Tensor],
                                      data_samples: SampleList) -> dict:
        """Run forward function and calculate loss for auxiliary head in
        training."""
        losses = dict()
        if isinstance(self.auxiliary_head, nn.ModuleList):
            for idx, aux_head in enumerate(self.auxiliary_head):
                loss_aux = aux_head.loss(inputs, data_samples, self.train_cfg)
                losses.update(add_prefix(loss_aux, f'aux_{idx}'))
        else:
            loss_aux = self.auxiliary_head.loss(inputs, data_samples,
                                                self.train_cfg)
            losses.update(add_prefix(loss_aux, 'aux'))

        return losses

The loss is computed between the predicted feature maps and the ground-truth annotation maps; the concrete implementation is covered in the next subsection.

In the code, the loss is implemented inside the head's loss method, which leads us to the ASPP head:

class ASPPHead(BaseDecodeHead):
    """Rethinking Atrous Convolution for Semantic Image Segmentation.

    This head is the implementation of `DeepLabV3
    <https://arxiv.org/abs/1706.05587>`_.

    Args:
        dilations (tuple[int]): Dilation rates for ASPP module.
            Default: (1, 6, 12, 18).
    """

It inherits from the base class, which provides the loss implementation:

    def loss(self, inputs: Tuple[Tensor], batch_data_samples: SampleList,
             train_cfg: ConfigType) -> dict:
        """Forward function for training.

        Args:
            inputs (Tuple[Tensor]): List of multi-level img features.
            batch_data_samples (list[:obj:`SegDataSample`]): The seg
                data samples. It usually includes information such
                as `img_metas` or `gt_semantic_seg`.
            train_cfg (dict): The training config.

        Returns:
            dict[str, Tensor]: a dictionary of loss components
        """
        seg_logits = self.forward(inputs)
        losses = self.loss_by_feat(seg_logits, batch_data_samples)
        return losses

This in turn calls the ASPP head's forward function:

    def forward(self, inputs):
        """Forward function."""
        output = self._forward_feature(inputs)
        output = self.cls_seg(output)
        return output

The overall code structure is fairly simple.

2.3 ResNetV1c

@MODELS.register_module()
class ResNetV1c(ResNet):
    """ResNetV1c variant described in [1]_.

    Compared with default ResNet(ResNetV1b), ResNetV1c replaces the 7x7 conv in
    the input stem with three 3x3 convs. For more details please refer to `Bag
    of Tricks for Image Classification with Convolutional Neural Networks
    <https://arxiv.org/abs/1812.01187>`_.
    """

    def __init__(self, **kwargs):
        super().__init__(deep_stem=True, avg_down=False, **kwargs)

As the docstring says, this is an otherwise standard residual network, except that three 3×3 convolutions replace the 7×7 convolution in the input stem; we won't expand on it further here.
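For intuition, here is a sketch of the two stem variants (omitting the BN and ReLU layers that the real implementation inserts after each convolution):

import torch.nn as nn

# ResNetV1b-style stem: a single 7x7 stride-2 convolution.
stem_v1b = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)

# ResNetV1c-style deep stem: three 3x3 convolutions with the same overall
# stride-2 downsampling, but with extra non-linearity between them.
stem_v1c = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1))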

2.4 Transposed Convolution (Deconvolution)

For an image A, let its height and width be Height and Width, and its channel count be Channels. We convolve it with a kernel of size kernel × kernel (assumed square here; the rectangular case generalizes directly), stride stride (again not distinguishing the height and width directions), and some padding, obtaining B. In one sentence: a convolution operation turns A into B.

A schematic of the transposed convolution procedure (figure omitted):

Step 1: transform the input feature map a into a new feature map a′ (inserting stride − 1 zeros between neighboring elements).

Step 2: work out the settings of the new kernel (the kernel is flipped 180°, the new stride is 1, and the new padding becomes kernel_size − padding − 1).

Step 3: run an ordinary convolution with the new kernel over the new feature map; the result is the output of the transposed convolution.
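As a sanity check, the output size of a transposed convolution (with no output_padding) is H_out = (H_in − 1) × stride − 2 × padding + kernel_size; a quick verification:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)
deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
# H_out = (16 - 1) * 2 - 2 * 1 + 4 = 32, i.e. an exact 2x upsample.
print(deconv(x).shape)  # torch.Size([1, 32, 32, 32])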

class DeconvModule(nn.Module):
    """Deconvolution upsample module in decoder for UNet (2X upsample).

    This module uses deconvolution to upsample feature map in the decoder
    of UNet.

    Args:
        in_channels (int): Number of input channels.
        out_channels (int): Number of output channels.
        with_cp (bool): Use checkpoint or not. Using checkpoint will save some
            memory while slowing down the training speed. Default: False.
        norm_cfg (dict | None): Config dict for normalization layer.
            Default: dict(type='BN').
        act_cfg (dict | None): Config dict for activation layer in ConvModule.
            Default: dict(type='ReLU').
        kernel_size (int): Kernel size of the convolutional layer. Default: 4.
    """

    def __init__(self,
                 in_channels,
                 out_channels,
                 with_cp=False,
                 norm_cfg=dict(type='BN'),
                 act_cfg=dict(type='ReLU'),
                 *,
                 kernel_size=4,
                 scale_factor=2):
        super().__init__()

        assert (kernel_size - scale_factor >= 0) and\
               (kernel_size - scale_factor) % 2 == 0,\
               f'kernel_size should be greater than or equal to scale_factor '\
               f'and (kernel_size - scale_factor) should be even numbers, '\
               f'while the kernel size is {kernel_size} and scale_factor is '\
               f'{scale_factor}.'

        # The key point: the upsampling factor equals the stride. Following
        # the transposed-convolution recipe, the input is first zero-interleaved
        # according to the stride, which already matches the output size;
        # what remains is an ordinary convolution.
        stride = scale_factor
        padding = (kernel_size - scale_factor) // 2
        self.with_cp = with_cp
        deconv = nn.ConvTranspose2d(
            in_channels,
            out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding)

        norm_name, norm = build_norm_layer(norm_cfg, out_channels)
        activate = build_activation_layer(act_cfg)
        self.deconv_upsamping = nn.Sequential(deconv, norm, activate)
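With the defaults kernel_size=4 and scale_factor=2, this gives stride=2 and padding=1, so by the size formula above H_out = (H − 1)·2 − 2·1 + 4 = 2H: an exact 2× upsample. That is also why the assertion requires kernel_size − scale_factor to be non-negative and even.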

2.5 ASPP

First, the overall structure of the ASPP head:

    def _forward_feature(self, inputs):
        """Forward function for feature maps before classifying each pixel with
        ``self.cls_seg`` fc.

        Args:
            inputs (list[Tensor]): List of multi-level img features.

        Returns:
            feats (Tensor): A tensor of shape (batch_size, self.channels,
                H, W) which is feature map for last layer of decoder head.
        """
        x = self._transform_inputs(inputs)
        # Image-level branch: image_pool first shrinks the map (global pooling),
        # then resize blows it back up, so the spatial size ends up unchanged.
        aspp_outs = [
            resize(
                self.image_pool(x),
                size=x.size()[2:],
                mode='bilinear',
                align_corners=self.align_corners)
        ]
        # Feed the map through the parallel dilated-conv branches,
        # adding four more feature maps.
        aspp_outs.extend(self.aspp_modules(x))
        # Concatenate all branches along the channel dim, then fuse with a conv.
        aspp_outs = torch.cat(aspp_outs, dim=1)
        feats = self.bottleneck(aspp_outs)
        return feats

    def forward(self, inputs):
        """Forward function."""
        output = self._forward_feature(inputs)
        output = self.cls_seg(output)
        return output
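Note the channel arithmetic: with the default dilations (1, 6, 12, 18) plus the image-pool branch, five feature maps of self.channels each are concatenated into 5 × self.channels channels, which self.bottleneck projects back down to self.channels before cls_seg produces the per-pixel class logits.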

Recommended reading

语义分割系列-4 DeepLabV1-V3+(pytorch实现) (Semantic Segmentation Series 4: DeepLab v1–v3+, a PyTorch implementation)