Abstract
Long-term visual localization is the problem of estimating the camera pose of a given query image in a scene whose appearance changes over time. It is an important problem in practice, encountered, for example, in autonomous driving. In order to gain robustness to such changes, long-term localization approaches often use semantic segmentations as an invariant scene representation, as the semantic meaning of each scene part should not be affected by seasonal and other changes. However, these representations are typically not very discriminative due to the limited number of available classes.
In this paper, we propose a new neural network, the Fine-Grained Segmentation Network (FGSN), that can be used to provide image segmentations with a larger number of labels and can be trained in a self-supervised fashion. In addition, we show how FGSNs can be trained to output consistent labels across seasonal changes. We demonstrate through extensive experiments that integrating the fine-grained segmentations produced by our FGSNs into existing localization algorithms leads to substantial improvements in localization performance.
1. Introduction
Visual localization is the problem of estimating the camera pose of a given image relative to a visual representation of a known scene. It is a classical problem in computer vision and solving the visual localization problem is one key to advanced computer vision applications such as self-driving cars and other autonomous robots, as well as Augmented / Mixed / Virtual Reality.
The scene representation used by localization algorithms is typically recovered from images depicting a given scene. The type of representation can vary from a set of images with associated camera poses [8,75,98], over 3D models constructed from Structure-from-Motion [77,81], to weights encoded in convolutional neural networks (CNNs) [8,10,12,13,35,36,52] or random forests [11,16,79]. In practice, capturing a scene from all possible viewpoints and under all potential conditions, e.g., different illumination conditions, is prohibitively expensive [74]. Localization algorithms thus need to be robust to such changes.
In the context of long-term operation, e.g., under seasonal changes, the scene appearance can vary drastically over time. However, the semantic meaning of scene parts remains the same, e.g., a tree is a tree whether it carries leaves or not. Based on this insight, approaches for semantic long-term visual localization use semantic segmentations of images or object detections to obtain an invariant scene representation [4,21,64,78,80,83,85,86,93,94]. However, this invariance comes at the price of a lower discriminative power, as often only a few classes are available. For example, the Cityscapes dataset [22] uses 19 classes for evaluation, 8 of which cover dynamic objects such as cars or pedestrians that are not useful for localization. The Mapillary Vistas dataset [55] contains 66 classes, with 15 classes for dynamic objects. At the same time, annotating more classes comes at significant human labor cost and annotation time.
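To make concrete how dynamic classes shrink the usable scene representation, the following is a minimal sketch of how a localization pipeline might discard pixels labeled as dynamic objects before matching. It assumes the Cityscapes train-ID convention, where IDs 11-18 are the 8 dynamic classes (person, rider, car, truck, bus, train, motorcycle, bicycle); the segmentation array here is a random placeholder rather than actual network output.

```python
import numpy as np

# Cityscapes train IDs 11-18 are the 8 dynamic classes (person, rider,
# car, truck, bus, train, motorcycle, bicycle); IDs 0-10 are static.
DYNAMIC_TRAIN_IDS = np.arange(11, 19)

def static_mask(segmentation: np.ndarray) -> np.ndarray:
    """Boolean mask that is True for pixels with a static class label,
    i.e., the pixels a localization pipeline can safely match against."""
    return ~np.isin(segmentation, DYNAMIC_TRAIN_IDS)

# Placeholder segmentation for a 1024x2048 Cityscapes-sized image.
labels = np.random.randint(0, 19, size=(1024, 2048))
usable = static_mask(labels)
print(f"{usable.mean():.0%} of pixels lie on static structure")
```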
In this paper, we show that using significantly more class labels leads to better performance of semantic visual localization algorithms. In order to avoid heavy human annotation time, we use the following central insight: the image segmentations used by such methods need to be stable under viewpoint, illumination, seasonal, etc. changes. However, the classes of the segmentations do not need to map to human-understandable concepts to be useful, i.e., they might not necessarily need to be semantic. Inspired by recent work on using k-means clustering to pre-train CNNs from unlabelled data [15], we thus propose a self-supervised, data-driven approach to define fine-grained classes for image segmentation. More precisely, we use k-means clustering on pixel-level CNN features to define k classes. As shown in Fig. 1, this allows our approach, termed Fine-Grained Segmentation Networks (FGSNs), to create more fine-grained segmentations.
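The following is a minimal sketch of this label-definition step, assuming scikit-learn's MiniBatchKMeans and a hypothetical iterable of per-image backbone feature maps; the actual feature extractor, pixel sampling strategy, and value of k used in the paper may differ.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def define_fine_grained_classes(feature_maps, k=100,
                                samples_per_image=1000, seed=0):
    """Cluster pixel-level CNN features into k fine-grained classes.

    feature_maps: iterable of (H, W, C) arrays, one per training image,
    e.g., outputs of a CNN backbone. The indices of the returned k-means
    clusters serve as the per-pixel training labels.
    """
    rng = np.random.default_rng(seed)
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=seed)
    for fmap in feature_maps:
        pixels = fmap.reshape(-1, fmap.shape[-1])
        # Subsample pixels per image to keep clustering tractable.
        idx = rng.choice(len(pixels),
                        size=min(samples_per_image, len(pixels)),
                        replace=False)
        kmeans.partial_fit(pixels[idx])
    return kmeans

# Each pixel's label is then the index of its nearest cluster center,
# and a segmentation network is trained to predict these indices.
```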
In detail, this paper makes the following contributions: 1) We present a novel type of segmentation network, the Fine-Grained Segmentation Network (FGSN), that outputs dense segmentation maps based on cluster indices. This removes the need for human-defined classes and allows us to define classes in a data-driven way through self-supervised learning. Using a 2D-2D correspondence dataset [42] for training, we ensure that our classes are stable under seasonal and viewpoint changes. The source code of our approach is publicly available. 2) FGSNs allow us to create finer segmentations with more classes. We show that this has a positive impact on semantic visual localization algorithms and can lead to substantial improvements when used by existing localization approaches. 3) We perform detailed experiments to investigate the impact the number of clusters has on multiple visual localization algorithms. In addition, we compare two types of weight initializations, using networks pre-trained for semantic segmentation and image classification, respectively.
2. Related Work
The following reviews work related to our approach, most notably semantic segmentation and visual localization.
Semantic Segmentation.
Semantic segmentation is the task of assigning a class label to each pixel in an input image. Modern approaches use fully convolutional networks [47], potentially pre-trained for classification [47], while incorporating higher-level context [99], enlarging the receptive field [17,19,92], or fusing multi-scale features [18,66]. Another line of work combines FCNs with probabilistic graphical models, e.g., in the form of a post-processing step [17] or as a differentiable component in an end-to-end trainable network [41,46,100].
CNNs for semantic segmentation are usually trained in a fully supervised fashion. However, obtaining a large amount of densely labeled images is very time-consuming and expensive [22,55]. As a result, approaches based on weaker forms of annotations have been developed. Some examples of weak labels used to train FCNs are bounding boxes [23,37,57], image-level tags [57,59,62,82], points [9], or 2D-2D point matches [42]. In this paper, we show that the classes used for "semantic" visual localization do not need to carry semantic meaning. This allows us to directly learn a large set of classes for image segmentation from data in a self-supervised fashion. During training, we use 2D-2D point matches [42] to encourage consistency of the segmentations across seasonal changes and across different weather conditions.
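As an illustration of how such point matches can be turned into a training signal, the sketch below implements one plausible symmetric consistency term in PyTorch: predictions at matched pixel locations in two images of the same place are pushed toward the same cluster label. This is a hedged reconstruction, not necessarily the exact loss used in the paper; the correspondence coordinates and network outputs are assumed inputs.

```python
import torch
import torch.nn.functional as F

def correspondence_consistency_loss(logits_a, logits_b, pts_a, pts_b):
    """Encourage matched pixels across seasons to receive the same label.

    logits_a, logits_b: (C, H, W) segmentation logits for two images of
    the same place under different conditions.
    pts_a, pts_b: (N, 2) long tensors of matched (row, col) coordinates
    from a 2D-2D correspondence dataset.
    """
    la = logits_a[:, pts_a[:, 0], pts_a[:, 1]].T  # (N, C) logits, view A
    lb = logits_b[:, pts_b[:, 0], pts_b[:, 1]].T  # (N, C) logits, view B
    # Symmetric cross-entropy against the other view's hard prediction.
    loss_ab = F.cross_entropy(lb, la.argmax(dim=1).detach())
    loss_ba = F.cross_entropy(la, lb.argmax(dim=1).detach())
    return 0.5 * (loss_ab + loss_ba)
```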