Paper: http://static.tongtianta.site/paper_pdf/27fd95d8-eb4a-11e9-9236-00163e08bb86.pdf
Abstract
Data labeling for learning 3D hand pose estimation models is a huge effort. Readily available, accurately labeled synthetic data has the potential to reduce the effort. However, to successfully exploit synthetic data, current state-of-the-art methods still require a large amount of labeled real data. In this work, we remove this requirement by learning to map from the features of real data to the features of synthetic data, mainly using a large amount of synthetic and unlabeled real data. We exploit unlabeled data using two auxiliary objectives, which enforce that (i) the mapped representation is pose specific and (ii) at the same time, the distributions of real and synthetic data are aligned. While pose specificity is enforced by a self-supervisory signal requiring that the representation is predictive of the appearance from different views, distributions are aligned by an adversarial term. In this way, we can significantly improve the results of the baseline system, which does not use unlabeled data, and outperform many recent approaches already with about 1% of the labeled real data. This presents a step towards faster deployment of learning-based hand pose estimation, making it accessible for a larger range of applications.
1. Introduction
Providing labeled data in the quantity, accuracy and realism needed for learning pose estimation models currently requires significant manual effort. This is especially the case if the goal is to estimate the pose of articulated objects like the human hand. For this task, significant effort has been made to provide semi-automatic and automatic labeling procedures and corresponding datasets [24, 37, 47]. However, providing labeled data for a novel application, viewpoint, or sensor still requires significant effort, specific hardware and/or great care to not affect the captured data. Recent methods aiming to reduce the effort often employ synthetic data or semi-supervised learning [22, 39], both of which have their specific drawbacks. Approaches employing synthetic data have to deal with the domain gap, which has recently been addressed for hand pose estimation by learning a mapping between the feature spaces of real and synthetic data [29]. Unfortunately, learning this mapping requires a large amount of labeled real data and corresponding synthetic data. Semi-supervised approaches, on the other hand, can better exploit a small amount of labeled data; however, the results are often still not competitive.
© 2019 IEEE. Project webpage providing code and additional material can be found at https://poier.github.io/murauer
Figure 1: Comparison of results. We introduce a method to exploit unlabeled data, which improves results especially for highly distorted images and difficult poses. Left: ground truth. Middle: baseline trained with labeled data (synthetic and 100 real). Right: our result from training with the same labeled data and additional unlabeled real data. Best viewed in color.
We aim to overcome these issues by exploiting accurately labeled synthetic data together with unlabeled real data in a specifically devised semi-supervised approach. We employ a large amount of synthetic data to learn an accurate pose predictor, and, inspired by recent work [19, 29], learn to map the features of real data to those of synthetic data to overcome the domain gap. However, in contrast to previous work, we learn this mapping mainly from unlabeled data.
We train the mapping from the features of real data to those of synthetic data using two auxiliary objectives based on unlabeled data. One objective forces the mapped features to be pose specific [28, 30], and the other forces the feature distributions of real and synthetic data to be aligned.
For the first of the two auxiliary objectives, which is responsible for enforcing a pose specific representation, we build upon our recent work [28]. In [28] we showed that by learning to predict a different view from the latent representation, the latent representations of similar poses are pushed close together. That is, the only supervision necessary to learn such a pose specific representation can be obtained by simply capturing the scene simultaneously from different viewpoints. In this work we make the joint latent representation of real and synthetic data (i.e., after mapping) pose specific by requiring it to be predictive of the appearance in another view.
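The view-prediction objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear decoder, the function names `decode_view` and `view_prediction_loss`, and the choice of mean absolute error are assumptions made for brevity, and plain Python lists stand in for depth images and learned networks.

```python
from typing import List, Sequence


def decode_view(latent: Sequence[float],
                weights: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical linear decoder (an assumption for illustration).

    Maps a latent vector to a flattened image of the second view:
    one output pixel per row of `weights`.
    """
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def view_prediction_loss(predicted: Sequence[float],
                         target: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between the predicted second view
    and the second view that was actually captured. No pose label is used."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target views must have the same size")
    if len(target) == 0:
        raise ValueError("views must contain at least one pixel")
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    # Toy example: 2-D latent code, 3-pixel "image" of the second view.
    latent = [1.0, 2.0]
    decoder_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    captured_second_view = [1.0, 2.0, 5.0]
    predicted_second_view = decode_view(latent, decoder_weights)
    print(view_prediction_loss(predicted_second_view, captured_second_view))
```

In an actual training setup, the gradient of such a loss would presumably flow back through the decoder into the encoder and the feature mapping, so that the latent code has to carry the pose information needed to render the other view.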
The second objective is to align the feature distributions of real and synthetic data. The underlying idea of learning a mapping from the features of real samples to the features of synthetic samples is that the labeled synthetic data can be better exploited if real and synthetic samples with similar poses are close together in the latent space. Simply ensuring that the latent representation is pose specific does not, however, guarantee that the features of real and synthetic data are close together in the latent space: similar poses could form separate clusters for real and for synthetic data. To avoid that, we employ an adversarial loss, which acts on the latent space and penalizes a mismatch of the feature distributions.
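One common way to realize an adversarial alignment term of this kind is a standard GAN-style loss on latent features; the sketch below assumes that formulation, which is not necessarily the one used by the authors. A discriminator outputs the probability that a feature vector comes from synthetic data, and the feature mapping is trained to make mapped real features look synthetic. All function names, the linear discriminator and the small constant `eps` are assumptions for illustration.

```python
import math
from typing import Sequence


def discriminator_score(feature: Sequence[float],
                        weights: Sequence[float],
                        bias: float) -> float:
    """Hypothetical linear discriminator: probability that `feature`
    was computed from a synthetic sample."""
    logit = sum(w * f for w, f in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def discriminator_loss(synth_scores: Sequence[float],
                       real_scores: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Binary cross-entropy: synthetic features are labeled 1,
    mapped real features are labeled 0."""
    loss_synth = -sum(math.log(s + eps) for s in synth_scores) / len(synth_scores)
    loss_real = -sum(math.log(1.0 - s + eps) for s in real_scores) / len(real_scores)
    return loss_synth + loss_real


def mapping_adversarial_loss(real_scores: Sequence[float],
                             eps: float = 1e-12) -> float:
    """Loss for the feature mapping: it is small when the discriminator
    takes mapped real features for synthetic ones."""
    return -sum(math.log(s + eps) for s in real_scores) / len(real_scores)


if __name__ == "__main__":
    # Toy 2-D features and a fixed discriminator, for illustration only.
    disc_weights, disc_bias = [1.0, -1.0], 0.0
    synth_features = [[2.0, 0.0], [1.5, 0.5]]
    mapped_real_features = [[0.0, 2.0], [0.5, 1.5]]
    synth = [discriminator_score(f, disc_weights, disc_bias) for f in synth_features]
    real = [discriminator_score(f, disc_weights, disc_bias) for f in mapped_real_features]
    print(discriminator_loss(synth, real), mapping_adversarial_loss(real))
```

When the two feature distributions are clearly separable, the discriminator loss is small and the mapping loss is large; as the mapped real features move towards the synthetic ones, the discriminator can no longer tell them apart and the mapping loss decreases.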
By simultaneously ensuring that similar poses are close together and feature distributions are aligned, we show that we are able to train state-of-the-art pose predictors, even with small amounts of labeled real data. More specifically, employing about 1% of the labeled real samples from the NYU dataset [37], our method outperforms many recent state-of-the-art approaches, which use all labeled real samples. Furthermore, besides quantitative experiments, we perform a qualitative analysis showing that the latent representations of real and synthetic samples are well aligned when using mainly unlabeled real data. Moreover, in our extensive ablation study we find that both enforcing pose specificity and aligning the distributions of real and synthetic samples benefit performance (see Sec. 4.3).