
WACV 2023 Paper Digest: 3D-Related Papers


Paper1 3D Change Localization and Captioning From Dynamic Scans of Indoor Scenes

摘要原文: Daily indoor scenes often involve constant changes due to human activities. To recognize scene changes, existing change captioning methods focus on describing changes from two images of a scene. However, to accurately perceive and appropriately evaluate physical changes and then identify the geometry of changed objects, recognizing and localizing changes in 3D space is crucial. Therefore, we propose a task to explicitly localize changes in 3D bounding boxes from two point clouds and describe detailed scene changes, including change types, object attributes, and spatial locations. Moreover, we create a simulated dataset with various scenes, allowing generating data without labor costs. We further propose a framework that allows different 3D object detectors to be incorporated in the change detection process, after which captions are generated based on the correlations of different change regions. The proposed framework achieves promising results in both change detection and captioning. Furthermore, we also evaluated on data collected from real scenes. The experiments show that pretraining on the proposed dataset increases the change detection accuracy by +12.8% (mAP0.25) when applied to real-world data. We believe that our proposed dataset and discussion could provide both a new benchmark and insights for future studies in scene change understanding.

中文总结 (Summary): Existing change-captioning methods describe scene changes from two images, but accurately perceiving physical changes and the geometry of changed objects requires recognizing and localizing them in 3D. The paper proposes a task that localizes changes as 3D bounding boxes from two point clouds and describes them in detail (change type, object attributes, spatial location), builds a simulated dataset that can be generated without labeling cost, and presents a framework in which different 3D object detectors can be plugged into the change-detection stage before captions are generated from the correlations among changed regions. The framework performs well on both change detection and captioning, and pretraining on the proposed dataset improves change-detection accuracy on real-world data by +12.8% (mAP0.25), offering a new benchmark and insights for scene change understanding.

Paper2 Text and Image Guided 3D Avatar Generation and Manipulation

摘要原文: The manipulation of latent space has recently become an interesting topic in the field of generative models. Recent research shows that latent directions can be used to manipulate images towards certain attributes. However, controlling the generation process of 3D generative models remains a challenge. In this work, we propose a novel 3D manipulation method that can manipulate both the shape and texture of the model using text or image-based prompts such as ‘a young face’ or ‘a surprised face’. We leverage the power of Contrastive Language-Image Pre-training (CLIP) model and a pre-trained 3D GAN model designed to generate face avatars, and create a fully differentiable rendering pipeline to manipulate meshes. More specifically, our method takes an input latent code and modifies it such that the target attribute specified by a text or image prompt is present or enhanced, while leaving other attributes largely unaffected. Our method requires only 5 minutes per manipulation, and we demonstrate the effectiveness of our approach with extensive results and comparisons.

中文总结 (Summary): The paper proposes a 3D manipulation method that edits both the shape and texture of a generated face avatar using text or image prompts such as "a young face" or "a surprised face". It combines CLIP with a pre-trained 3D GAN for face avatars and a fully differentiable rendering pipeline, modifying an input latent code so that the target attribute appears or is enhanced while other attributes remain largely unchanged. Each manipulation takes only about 5 minutes, and extensive results and comparisons demonstrate its effectiveness.

Paper3 Controllable 3D Generative Adversarial Face Model via Disentangling Shape and Appearance

摘要原文: 3D face modeling has been an active area of research in computer vision and computer graphics, fueling applications ranging from facial expression transfer in virtual avatars to synthetic data generation. Existing 3D deep learning generative models (e.g., VAE, GANs) allow generating compact face representations (both shape and texture) that can model non-linearities in the shape and appearance space (e.g., scatter effects, specularities,…). However, they lack the capability to control the generation of subtle expressions. This paper proposes a new 3D face generative model that can decouple identity and expression and provides granular control over expressions. In particular, we propose using a pair of supervised auto-encoder and generative adversarial networks to produce high-quality 3D faces, both in terms of appearance and shape. Experimental results in the generation of 3D faces learned with holistic expression labels, or Action Unit (AU) labels, show how we can decouple identity and expression; gaining fine-control over expressions while preserving identity.

中文总结 (Summary): Existing 3D deep generative face models (e.g., VAEs, GANs) produce compact shape and texture representations that capture non-linearities such as scattering and specularities, but cannot control subtle expressions. This paper presents a 3D face generative model that decouples identity from expression and offers fine-grained expression control, using a pair of supervised autoencoder and generative adversarial networks to produce high-quality 3D faces in both shape and appearance. Experiments with holistic expression labels or Action Unit (AU) labels show that identity and expression can be decoupled, giving fine control over expressions while preserving identity.

Paper4 Far3Det: Towards Far-Field 3D Detection

摘要原文: We focus on the task of far-field 3D detection (Far3Det) of objects beyond a certain distance from an observer, e.g., >50m. Far3Det is particularly important for autonomous vehicles (AVs) operating at highway speeds, which require detections of far-field obstacles to ensure sufficient braking distances. However, contemporary AV benchmarks such as nuScenes underemphasize this problem because they evaluate performance only up to a certain distance (50m). One reason is that obtaining far-field 3D annotations is difficult, particularly for lidar sensors that produce very few point returns for far-away objects. Indeed, we find that almost 50% of far-field objects (beyond 50m) contain zero lidar points. Secondly, current metrics for 3D detection employ a “one-size-fits-all” philosophy, using the same tolerance thresholds for near and far objects, inconsistent with tolerances for both human vision and stereo disparities. Both factors lead to an incomplete analysis of the Far3Det task. For example, while conventional wisdom tells us that high-resolution RGB sensors should be vital for the 3D detection of far-away objects, lidar-based methods still rank higher compared to RGB counterparts on the current benchmark leaderboards. As a first step towards a Far3Det benchmark, we develop a method to find well-annotated scenes from the nuScenes dataset and derive a well-annotated far-field validation set. We also propose a Far3Det evaluation protocol and explore various 3D detection methods for Far3Det. Our result convincingly justifies the long held conventional wisdom that high-resolution RGB improves 3D detection in the far-field. We further propose a simple yet effective method that fuses detections from RGB and lidar detectors based on non-maximum suppression, which remarkably outperforms state-of-the-art 3D detectors in the far-field.

中文总结 (Summary): Far-field 3D detection (Far3Det) targets objects beyond a certain range (e.g., >50 m), which matters for highway-speed autonomous vehicles that need long braking distances, yet benchmarks such as nuScenes evaluate only up to 50 m. Far-field annotation is hard (almost 50% of objects beyond 50 m contain zero lidar points), and current metrics apply the same tolerance thresholds to near and far objects, which together leave Far3Det under-analyzed; lidar-based methods even outrank RGB-based ones on current leaderboards despite the intuition that high-resolution RGB should matter far away. The authors derive a well-annotated far-field validation set from nuScenes, propose a Far3Det evaluation protocol, show that high-resolution RGB does improve far-field 3D detection, and present a simple NMS-based fusion of RGB and lidar detections that clearly outperforms state-of-the-art 3D detectors in the far field.
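
The abstract only names the fusion step; below is a minimal sketch, under simple assumptions, of what NMS-style late fusion of RGB and lidar detections could look like when each detection is reduced to a bird's-eye-view center plus a confidence score. The 2 m suppression radius and the greedy scheme are illustrative choices, not the paper's exact recipe.

```python
import numpy as np

def fuse_detections_nms(lidar_dets, rgb_dets, dist_thresh=2.0):
    """Greedy NMS-style fusion of two detection sets.

    Each detection is (x, y, score) in bird's-eye-view coordinates.
    Detections from both sensors are pooled, sorted by confidence, and any
    lower-scoring detection whose BEV center lies within `dist_thresh`
    meters of an already-kept detection is suppressed.
    The threshold is an illustrative value, not the paper's.
    """
    pool = np.vstack([lidar_dets, rgb_dets])           # (N, 3)
    order = np.argsort(-pool[:, 2])                    # highest score first
    kept = []
    for idx in order:
        center = pool[idx, :2]
        if all(np.linalg.norm(center - pool[k, :2]) > dist_thresh for k in kept):
            kept.append(idx)
    return pool[kept]

# Toy usage: one far-field object seen by both sensors, one RGB-only object.
lidar = np.array([[60.0, 1.0, 0.70]])
rgb = np.array([[60.5, 1.2, 0.85], [80.0, -2.0, 0.60]])
print(fuse_detections_nms(lidar, rgb))
```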

Paper5 IDD-3D: Indian Driving Dataset for 3D Unstructured Road Scenes

摘要原文: Autonomous driving and assistance systems rely on annotated data from traffic and road scenarios to model and learn the various object relations in complex real-world scenarios. Preparation and training of deploy-able deep learning architectures require the models to be suited to different traffic scenarios and adapt to different situations. Currently, existing datasets, while large-scale, lack such diversities and are geographically biased towards mainly developed cities. An unstructured and complex driving layout found in several developing countries such as India poses a challenge to these models due to the sheer degree of variations in the object types, densities, and locations. To facilitate better research toward accommodating such scenarios, we build a new dataset, IDD-3D , which consists of multi-modal data from multiple cameras and LiDAR sensors with 12k annotated driving LiDAR frames across various traffic scenarios. We discuss the need for this dataset through statistical comparisons with existing datasets and highlight benchmarks on standard 3D object detection and tracking tasks in complex layouts.

中文总结 (Summary): Existing autonomous-driving datasets, while large, lack diversity and are geographically biased toward developed cities, whereas the unstructured, complex road layouts of developing countries such as India vary greatly in object types, densities, and locations. IDD-3D provides multi-modal data from multiple cameras and LiDAR sensors with 12k annotated driving LiDAR frames across diverse traffic scenarios; the authors motivate the dataset with statistical comparisons to existing datasets and report benchmarks for standard 3D object detection and tracking in these complex layouts.

Paper6 Real-Time Concealed Weapon Detection on 3D Radar Images for Walk-Through Screening System

摘要原文: This paper presents a framework for real-time concealed weapon detection (CWD) on 3D radar images for walk-through screening systems. The walk-through screening system aims to ensure security in crowded areas by performing CWD on walking persons, hence it requires an accurate and real-time detection approach. To ensure accuracy, a weapon needs to be detected irrespective of its 3D orientation, thus we use the 3D radar images as detection input. For achieving real-time, we reformulate classic U-Net based segmentation networks to perform 3D detection tasks. Our 3D segmentation network predicts peak-shaped probability map, instead of voxel-wise masks, to enable position inference by elementary peak detection operation on the predicted map. In the peak-shaped probability map, the peak marks the weapon’s position. So, weapon detection task translates to peak detection on the probability map. A Gaussian function is used to model weapons in the probability map. We experimentally validate our approach on realistic 3D radar images obtained from a walk-through weapon screening system prototype. Extensive ablation studies verify the effectiveness of our proposed approach over existing conventional approaches. The experimental results demonstrate that our proposed approach can perform accurate and real-time CWD, thus making it suitable for practical applications of walk-through screening.

中文总结 (Summary): The paper presents a framework for real-time concealed weapon detection (CWD) on 3D radar images for walk-through screening, where weapons must be detected regardless of their 3D orientation. Classic U-Net-style segmentation networks are reformulated for 3D detection: instead of voxel-wise masks, the network predicts a peak-shaped probability map in which a Gaussian models each weapon, so detection reduces to elementary peak detection on the predicted map. Experiments on realistic 3D radar images from a prototype walk-through screening system, together with extensive ablations, show accurate, real-time CWD suitable for practical deployment.
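
To make the peak-map idea concrete, here is a small sketch (assuming NumPy and SciPy are available) of encoding a target position as a Gaussian peak in a 3D probability volume and recovering it with elementary peak detection; the grid size, sigma, and threshold are assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def gaussian_peak_volume(shape, center, sigma=2.0):
    """Build a peak-shaped probability volume with a Gaussian at `center`."""
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    d2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def detect_peaks(volume, threshold=0.5):
    """Elementary peak detection: local maxima above a score threshold."""
    local_max = volume == maximum_filter(volume, size=3)
    return np.argwhere(local_max & (volume > threshold))

vol = gaussian_peak_volume((32, 32, 32), center=(10, 20, 16))
print(detect_peaks(vol))   # -> [[10 20 16]]
```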

Paper7 Automated Line Labelling: Dataset for Contour Detection and 3D Reconstruction

摘要原文: Understanding the finer details of a 3D object, its contours, is the first step toward a physical understanding of an object. Many real-world application domains require adaptable 3D object shape recognition models, usually with little training data. For this purpose, we develop the first automatically generated contour labeled dataset, bypassing manual human labeling. Using this dataset, we study the performance of current state-of-the-art instance segmentation algorithms on detecting and labeling the contours. We produce promising visual results with accurate contour prediction and labeling. We demonstrate that our finely labeled contours can help downstream tasks in computer vision, such as 3D reconstruction from a 2D image.

中文总结 (Summary): Understanding an object's contours is a first step toward physically understanding it, and many applications need adaptable 3D shape recognition with little training data. The authors build the first automatically generated contour-labeled dataset, bypassing manual annotation, study how current state-of-the-art instance segmentation algorithms detect and label contours, obtain promising visual results with accurate contour prediction and labeling, and show that the finely labeled contours help downstream tasks such as 3D reconstruction from a single 2D image.

Paper8 TransPillars: Coarse-To-Fine Aggregation for Multi-Frame 3D Object Detection

摘要原文: 3D object detection using point clouds has attracted increasing attention due to its wide applications in autonomous driving and robotics. However, most existing studies focus on single point cloud frames without harnessing the temporal information in point cloud sequences. In this paper, we design TransPillars, a novel transformer-based feature aggregation technique that exploits temporal features of consecutive point cloud frames for multi-frame 3D object detection. TransPillars aggregates spatial-temporal point cloud features from two perspectives. First, it fuses voxel-level features directly from multi-frame feature maps instead of pooled instance features to preserve instance details with contextual information that are essential to accurate object localization. Second, it introduces a hierarchical coarse-to-fine strategy to fuse multi-scale features progressively to effectively capture the motion of moving objects and guide the aggregation of fine features. Besides, a variant of deformable transformer is introduced to improve the effectiveness of cross-frame feature matching. Extensive experiments show that our proposed TransPillars achieves state-of-art performance as compared to existing multi-frame detection approaches.

中文总结 (Summary): Most 3D detection work operates on single point-cloud frames and ignores the temporal information in point-cloud sequences. TransPillars is a transformer-based feature aggregation method for multi-frame 3D detection that (1) fuses voxel-level features directly from multi-frame feature maps, rather than pooled instance features, to preserve instance detail and context needed for accurate localization, and (2) uses a hierarchical coarse-to-fine strategy to progressively fuse multi-scale features, capturing object motion and guiding fine-feature aggregation; a deformable-transformer variant further improves cross-frame feature matching. Extensive experiments show state-of-the-art performance among multi-frame detection approaches.

Paper9 Generative Range Imaging for Learning Scene Priors of 3D LiDAR Data

摘要原文: 3D LiDAR sensors are indispensable for the robust vision of autonomous mobile robots. However, deploying LiDAR-based perception algorithms often fails due to a domain gap from the training environment, such as inconsistent angular resolution and missing properties. Existing studies have tackled the issue by learning inter-domain mapping, while the transferability is constrained by the training configuration and the training is susceptible to peculiar lossy noises called ray-drop. To address the issue, this paper proposes a generative model of LiDAR range images applicable to the data-level domain transfer. Motivated by the fact that LiDAR measurement is based on point-by-point range imaging, we train an implicit image representation-based generative adversarial networks along with a differentiable ray-drop effect. We demonstrate the fidelity and diversity of our model in comparison with the point-based and image-based state-of-the-art generative models. We also showcase upsampling and restoration applications. Furthermore, we introduce a Sim2Real application for LiDAR semantic segmentation. We demonstrate that our method is effective as a realistic ray-drop simulator and outperforms state-of-the-art methods.

中文总结 (Summary): LiDAR perception models often fail to transfer across domains because of inconsistent angular resolution, missing properties, and a peculiar lossy noise called ray-drop. The paper proposes a generative model of LiDAR range images for data-level domain transfer: an implicit image representation-based GAN trained together with a differentiable ray-drop effect, motivated by the fact that LiDAR measures ranges point by point. The model shows strong fidelity and diversity compared with point-based and image-based generative baselines, supports upsampling and restoration, and serves as a realistic ray-drop simulator in a Sim2Real LiDAR semantic segmentation application, outperforming state-of-the-art methods.
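
The model operates on LiDAR range images; as background, this is a minimal sketch of the usual spherical projection that turns a point cloud into a range image. The 64x1024 resolution and vertical field of view are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def points_to_range_image(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud to an h x w range image.

    Rows index elevation, columns index azimuth; each cell stores the range
    of a point that falls into it (0 where there is no return). The 64x1024
    resolution and vertical FOV (degrees) are illustrative values.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                        # [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w
    v = np.clip(((fov_up_r - pitch) / (fov_up_r - fov_down_r) * h).astype(int), 0, h - 1)
    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = r            # one return kept per cell (collisions overwrite)
    return img

cloud = np.random.uniform(-40, 40, size=(10000, 3))
print(points_to_range_image(cloud).shape)                          # (64, 1024)
```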

Paper10 Placing Human Animations Into 3D Scenes by Learning Interaction- and Geometry-Driven Keyframes

摘要原文: We present a novel method for placing a 3D human animation into a 3D scene while maintaining any human-scene interactions in the animation. We use the notion of computing the most important meshes in the animation for the interaction with the scene, which we call “keyframes.” These keyframes allow us to better optimize the placement of the animation into the scene such that interactions in the animations (standing, laying, sitting, etc.) match the affordances of the scene (e.g., standing on the floor or laying in a bed). We compare our method, which we call PAAK, with prior approaches, including POSA, PROX ground truth, and a motion synthesis method, and highlight the benefits of our method with a perceptual study. Human raters preferred our PAAK method over the PROX ground truth data 64.6% of the time. Additionally, in direct comparisons, the raters preferred PAAK over competing methods including 61.5% compared to POSA. Our project website is available at https://gamma.umd.edu/paak/.

中文总结 (Summary): PAAK places a 3D human animation into a 3D scene while preserving the human-scene interactions in the animation. It computes the animation meshes most important for interaction ("keyframes") and uses them to optimize placement so that actions such as standing, lying, or sitting match the scene's affordances (e.g., standing on the floor or lying in a bed). In a perceptual study, human raters preferred PAAK over the PROX ground truth 64.6% of the time and over POSA 61.5% of the time. Project website: https://gamma.umd.edu/paak/.

Paper11 Dense Voxel Fusion for 3D Object Detection

摘要原文: Camera and LiDAR sensor modalities provide complementary appearance and geometric information useful for detecting 3D objects for autonomous vehicle applications. However, current end-to-end fusion methods are challenging to train and underperform state-of-the-art LiDAR-only detectors. Sequential fusion methods suffer from a limited number of pixel and point correspondences due to point cloud sparsity, or their performance is strictly capped by the detections of one of the modalities. Our proposed solution, Dense Voxel Fusion (DVF) is a sequential fusion method that generates multi-scale dense voxel feature representations, improving expressiveness in low point density regions. To enhance multi-modal learning, we train directly with projected ground truth 3D bounding box labels, avoiding noisy, detector-specific 2D predictions. Both DVF and the multi-modal training approach can be applied to any voxel-based LiDAR backbone. DVF ranks 3rd among published fusion methods on KITTI’s 3D car detection benchmark without introducing additional trainable parameters, nor requiring stereo images or dense depth labels. In addition, DVF significantly improves 3D vehicle detection performance of voxel-based methods on the Waymo Open Dataset.

中文总结 (Summary): Camera and LiDAR provide complementary appearance and geometric cues, but end-to-end fusion methods are hard to train and underperform LiDAR-only detectors, while sequential fusion is limited by sparse pixel-point correspondences or capped by one modality's detections. Dense Voxel Fusion (DVF) is a sequential fusion method that builds multi-scale dense voxel feature representations to improve expressiveness in low point-density regions, and training uses projected ground-truth 3D bounding boxes rather than noisy, detector-specific 2D predictions; both ideas apply to any voxel-based LiDAR backbone. DVF ranks 3rd among published fusion methods on KITTI's 3D car detection benchmark without extra trainable parameters, stereo images, or dense depth labels, and significantly improves voxel-based 3D vehicle detection on the Waymo Open Dataset.
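
DVF trains with projected ground-truth 3D boxes rather than 2D detections; the sketch below shows the generic geometry of projecting a 3D box's corners into the image with camera intrinsics. The KITTI-like intrinsics and the axis convention (y down, yaw about the camera y-axis) are assumptions for illustration, not the paper's code.

```python
import numpy as np

def project_box_to_image(center, dims, yaw, K):
    """Project the 8 corners of a 3D box (camera coordinates) into the image.

    center: (x, y, z), dims: (l, w, h), yaw about the camera y-axis, K: 3x3
    intrinsics. Returns pixel coordinates (8, 2) and the enclosing 2D box.
    A sketch of the geometry only; axis conventions vary across datasets.
    """
    l, w, h = dims
    corners = np.array([[dx, dy, dz]                     # object-frame offsets
                        for dx in (-l / 2, l / 2)
                        for dy in (-h / 2, h / 2)
                        for dz in (-w / 2, w / 2)], dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])      # rotation about y
    cam_pts = corners @ R.T + np.asarray(center)          # (8, 3) camera frame
    uv = (K @ cam_pts.T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # perspective divide
    box2d = np.concatenate([uv.min(axis=0), uv.max(axis=0)])
    return uv, box2d

K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.9], [0, 0, 1.0]])  # example intrinsics
uv, box2d = project_box_to_image((2.0, 1.0, 20.0), (4.0, 1.8, 1.5), 0.3, K)
print(box2d)
```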

Paper12 Class-Level Confidence Based 3D Semi-Supervised Learning

摘要原文: Current pseudo-labeling strategies in 3D semi-supervised learning (SSL) fail to dynamically incorporate the variance of learning status which is affected by each class’s learning difficulty and data imbalance. To address this problem, we practically demonstrate that 3D unlabeled data class-level confidence can represent the learning status. Based on this finding, we present a novel class-level confidence based 3D SSL method. Firstly, a dynamic thresholding strategy is proposed to utilize more unlabeled data, especially for low learning status classes. Then, a re-sampling strategy is designed to avoid biasing toward high learning status classes, which dynamically changes the sampling probability of each class. Unlike the latest state-of-the-art SSL method FlexMatch which also utilizes dynamic threshold, our method can be applied to the inherently imbalanced dataset and thus is more general. To show the effectiveness of our method in 3D SSL tasks, we conduct extensive experiments on 3D SSL classification and detection tasks. Our method significantly outperforms state-of-the-art counterparts for both 3D SSL classification and detection tasks in all datasets.

中文总结 (Summary): Pseudo-labeling strategies in 3D semi-supervised learning (SSL) do not adapt to each class's learning status, which depends on learning difficulty and data imbalance. Observing that class-level confidence on unlabeled data reflects learning status, the authors propose a dynamic thresholding strategy that uses more unlabeled data for poorly learned classes and a re-sampling strategy that dynamically adjusts per-class sampling probabilities to avoid bias toward well-learned classes; unlike FlexMatch, which also uses dynamic thresholds, the method handles inherently imbalanced datasets and is therefore more general. It significantly outperforms state-of-the-art methods on both 3D SSL classification and detection across all datasets tested.
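
A rough sketch of the two ingredients as described, class-level confidence driving per-class thresholds and sampling probabilities, is given below; the exact formulas used here are invented for illustration and are not the paper's.

```python
import numpy as np

def class_learning_status(probs, pred_labels, num_classes):
    """Mean confidence of unlabeled predictions per class (learning status)."""
    status = np.zeros(num_classes)
    for c in range(num_classes):
        mask = pred_labels == c
        status[c] = probs[mask].mean() if mask.any() else 0.0
    return status

def dynamic_thresholds(status, base=0.9):
    """Lower the pseudo-label threshold for classes with low learning status."""
    return base * status / max(status.max(), 1e-8)

def resampling_probs(status):
    """Sample low-status classes more often to counter bias."""
    inv = 1.0 / np.clip(status, 1e-3, None)
    return inv / inv.sum()

# Toy example with 3 classes: class 2 is learning poorly.
probs = np.array([0.95, 0.90, 0.92, 0.40, 0.50])   # max softmax confidence
preds = np.array([0, 0, 1, 2, 2])                  # predicted class labels
status = class_learning_status(probs, preds, num_classes=3)
print(dynamic_thresholds(status), resampling_probs(status))
```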

Paper13 PIDS: Joint Point Interaction-Dimension Search for 3D Point Cloud

摘要原文: The interaction and dimension of points are two important axes in designing point operators to serve hierarchical 3D models. Yet, these two axes are heterogeneous and challenging to fully explore. Existing works craft point operator under a single axis and reuse the crafted operator in all parts of 3D models. This overlooks the opportunity to better combine point interactions and dimensions by exploiting varying geometry/density of 3D point clouds. In this work, we establish PIDS, a novel paradigm to jointly explore point interactions and point dimensions to serve semantic segmentation on point cloud data. We establish a large search space to jointly consider versatile point interactions and point dimensions. This supports point operators with various geometry/density considerations. The enlarged search space with heterogeneous search components calls for a better ranking of candidate models. To achieve this, we improve the search space exploration by leveraging predictor-based Neural Architecture Search (NAS), and enhance the quality of prediction by assigning unique encoding to heterogeneous search components based on their priors. We thoroughly evaluate the networks crafted by PIDS on two semantic segmentation benchmarks, showing 1% mIOU improvement on SemanticKITTI and S3DIS over state-of-the-art 3D models.

中文总结 (Summary): Point interaction and point dimension are two heterogeneous axes in designing point operators for hierarchical 3D models, yet existing work crafts an operator along a single axis and reuses it throughout the model, ignoring the varying geometry and density of 3D point clouds. PIDS jointly searches over point interactions and point dimensions for point-cloud semantic segmentation, using a large search space explored with predictor-based Neural Architecture Search (NAS) and prior-aware encodings of the heterogeneous search components to rank candidates better. Networks found by PIDS improve mIoU by about 1% over state-of-the-art 3D models on SemanticKITTI and S3DIS.

Paper14 RSF: Optimizing Rigid Scene Flow From 3D Point Clouds Without Labels

摘要原文: We present a method for optimizing object-level rigid 3D scene flow over two successive point clouds without any annotated labels in autonomous driving settings. Rather than using pointwise flow vectors, our approach represents scene flow as the composition a global ego-motion and a set of bounding boxes with their own rigid motions, exploiting the multi-body rigidity commonly present in dynamic scenes. We jointly optimize these parameters over a novel loss function based on the nearest neighbor distance using a differentiable bounding box formulation. Our approach achieves state-of-the-art accuracy on KITTI Scene Flow and nuScenes without requiring any annotations, outperforming even supervised methods. Additionally, we demonstrate the effectiveness of our approach on motion segmentation and ego-motion estimation. Lastly, we visualize our predictions and validate our loss function design with an ablation study.

中文总结 (Summary): The method optimizes object-level rigid 3D scene flow between two successive point clouds without any labels in autonomous-driving settings: instead of pointwise flow vectors, scene flow is represented as a global ego-motion composed with a set of bounding boxes carrying their own rigid motions, exploiting the multi-body rigidity of dynamic scenes. All parameters are jointly optimized with a nearest-neighbor-distance loss and a differentiable bounding-box formulation. The approach achieves state-of-the-art accuracy on KITTI Scene Flow and nuScenes without annotations, outperforming even supervised methods, and is also effective for motion segmentation and ego-motion estimation; an ablation study validates the loss design.
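
The core mechanism, optimizing a rigid transform against a nearest-neighbor distance between two point clouds, can be sketched in a few lines of PyTorch (assumed available). This toy version fits only a global yaw and translation and omits the per-box decomposition and the paper's full loss.

```python
import torch

def rigid_nn_loss(src, tgt, yaw, trans):
    """Apply a z-axis rotation + translation to `src` and return the mean
    nearest-neighbor distance to `tgt` (a simple Chamfer-style term)."""
    c, s = torch.cos(yaw), torch.sin(yaw)
    zero, one = torch.zeros_like(c), torch.ones_like(c)
    R = torch.stack([torch.stack([c, -s, zero]),
                     torch.stack([s, c, zero]),
                     torch.stack([zero, zero, one])])
    moved = src @ R.T + trans
    d = torch.cdist(moved, tgt)              # (N, M) pairwise distances
    return d.min(dim=1).values.mean()

src = torch.randn(256, 3)
tgt = src + torch.tensor([0.5, -0.2, 0.0])   # target = rigidly shifted source
yaw = torch.zeros((), requires_grad=True)
trans = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([yaw, trans], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = rigid_nn_loss(src, tgt, yaw, trans)
    loss.backward()
    opt.step()
print(trans.detach(), yaw.detach())          # should move toward (0.5, -0.2, 0.0) and 0
```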

Paper15 ImpDet: Exploring Implicit Fields for 3D Object Detection

摘要原文: Conventional 3D object detection approaches concentrate on bounding boxes representation learning with several parameters, i.e., localization, dimension, and orientation. Despite its popularity and universality, such a straightforward paradigm is sensitive to slight numerical deviations, especially in localization. By exploiting the property that point clouds are naturally captured on the surface of objects along with accurate location and intensity information, we introduce a new perspective that views bounding box regression as an implicit function. This leads to our proposed framework, termed Implicit Detection or ImpDet, which leverages implicit field learning for 3D object detection. Our ImpDet assigns specific values to points in different local 3D spaces, thereby high-quality boundaries can be generated by classifying points inside or outside the boundary. To solve the problem of sparsity on the object surface, we further present a simple yet efficient virtual sampling strategy to not only fill the empty region, but also learn rich semantic features to help refine the boundaries. Extensive experimental results on KITTI and Waymo benchmarks demonstrate the effectiveness and robustness of unifying implicit fields into object detection.

中文总结 (Summary): Conventional 3D detectors regress box parameters (location, dimension, orientation) directly, which is sensitive to small numerical deviations, especially in localization. Exploiting the fact that point clouds naturally lie on object surfaces with accurate position and intensity information, ImpDet treats bounding-box regression as an implicit function: points in different local 3D spaces are assigned specific values, and high-quality boundaries are obtained by classifying points as inside or outside the boundary. A simple yet efficient virtual sampling strategy fills empty regions on sparse surfaces and learns richer semantic features to refine the boundaries. Experiments on KITTI and Waymo show the approach is effective and robust.

Paper16 Multivariate Probabilistic Monocular 3D Object Detection

摘要原文: In autonomous driving, monocular 3D object detection is an important but challenging task. Towards accurate monocular 3D object detection, some recent methods recover the distance of objects from the physical height and visual height of objects. Such decomposition framework can introduce explicit constraints on the distance prediction, thus improving its accuracy and robustness. However, the inaccurate physical height and visual height prediction still may exacerbate the inaccuracy of the distance prediction. In this paper, we improve the framework by multivariate probabilistic modeling. We explicitly model the joint probability distribution of the physical height and visual height. This is achieved by learning a full covariance matrix of the physical height and visual height during training, with the guide of a multivariate likelihood. Such explicit joint probability distribution modeling not only leads to robust distance prediction when both the predicted physical height and visual height are inaccurate, but also brings learned covariance matrices with expected behaviors. The experimental results on the challenging Waymo Open and KITTI datasets show the effectiveness of our framework.

中文总结 (Summary): Some recent monocular 3D detectors recover object distance from the physical and visual heights of objects, which adds explicit constraints to the distance prediction but still suffers when either height is predicted inaccurately. This paper models the joint probability distribution of physical and visual height explicitly, learning a full covariance matrix during training under a multivariate likelihood; this yields robust distance predictions even when both height estimates are inaccurate and produces covariance matrices with the expected behavior. Experiments on the challenging Waymo Open and KITTI datasets show the framework's effectiveness.
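
To illustrate the "full covariance under a multivariate likelihood" idea, here is a generic PyTorch sketch of a bivariate Gaussian negative log-likelihood with the covariance parameterized through a Cholesky factor; the parameterization and the toy numbers are assumptions, not the authors' exact design.

```python
import torch

def bivariate_nll(pred_mean, chol_params, target):
    """Negative log-likelihood of a 2D Gaussian over (physical height H,
    visual height h), with the full covariance parameterized by a
    lower-triangular Cholesky factor L (so Sigma = L L^T is valid by
    construction). `chol_params` = (log_l11, l21, log_l22) per sample."""
    l11 = torch.exp(chol_params[:, 0])
    l21 = chol_params[:, 1]
    l22 = torch.exp(chol_params[:, 2])
    L = torch.stack([torch.stack([l11, torch.zeros_like(l11)], dim=-1),
                     torch.stack([l21, l22], dim=-1)], dim=-2)   # (B, 2, 2)
    dist = torch.distributions.MultivariateNormal(pred_mean, scale_tril=L)
    return -dist.log_prob(target).mean()

# Toy batch: a network would output the means and Cholesky parameters.
pred_mean = torch.tensor([[1.6, 80.0]], requires_grad=True)   # (H in m, h in px)
chol_params = torch.zeros(1, 3, requires_grad=True)
target = torch.tensor([[1.7, 85.0]])
loss = bivariate_nll(pred_mean, chol_params, target)
loss.backward()
print(loss.item())
```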

Paper17 PointNeuron: 3D Neuron Reconstruction via Geometry and Topology Learning of Point Clouds

摘要原文: Digital neuron reconstruction from 3D microscopy images is an essential technique for investigating brain connectomics and neuron morphology. Existing reconstruction frameworks use convolution-based segmentation networks to partition the neuron from noisy backgrounds before applying the tracing algorithm. The tracing results are sensitive to the raw image quality and segmentation accuracy. In this paper, we propose a novel framework for 3D neuron reconstruction. Our key idea is to use the geometric representation power of the point cloud to better explore the intrinsic structural information of neurons. Our proposed framework adopts one graph convolutional network to predict the neural skeleton points and another one to produce the connectivity of these points. We finally generate the target SWC file through the interpretation of the predicted point coordinates, radius, and connections. Evaluated on the Janelia-Fly dataset from the BigNeuron project, we show that our framework achieves competitive neuron reconstruction performance. Our geometry and topology learning of point clouds could further benefit 3D medical image analysis, such as cardiac surface reconstruction. Our code is available at https://github.com/RunkaiZhao/PointNeuron.

中文总结 (Summary): Digital neuron reconstruction from 3D microscopy images underpins connectomics and morphology studies, but existing pipelines segment neurons with convolutional networks before tracing, so results are sensitive to raw image quality and segmentation accuracy. PointNeuron instead exploits the geometric representation power of point clouds: one graph convolutional network predicts neural skeleton points and another predicts their connectivity, and the final SWC file is generated from the predicted coordinates, radii, and connections. On the Janelia-Fly data from the BigNeuron project the framework achieves competitive reconstruction performance, and its geometry and topology learning could also benefit other 3D medical analysis such as cardiac surface reconstruction. Code: https://github.com/RunkaiZhao/PointNeuron.
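
The final step of the pipeline is writing an SWC file from predicted coordinates, radii, and connectivity; a minimal sketch of that serialization, using the standard SWC column order (id, type, x, y, z, radius, parent), is shown below with toy data. The helper and its values are illustrative, not the project's code.

```python
def write_swc(path, points, radii, parents, node_type=3):
    """Write a neuron skeleton to SWC: one line per node with
    'id type x y z radius parent_id' (parent -1 marks the root).
    node_type=3 labels nodes as dendrite; adjust as needed."""
    with open(path, "w") as f:
        for i, ((x, y, z), r, p) in enumerate(zip(points, radii, parents), start=1):
            f.write(f"{i} {node_type} {x:.3f} {y:.3f} {z:.3f} {r:.3f} {p}\n")

# Toy skeleton: a root and two children forming a short branch.
points = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 1.0, 0.2)]
radii = [1.0, 0.8, 0.6]
parents = [-1, 1, 2]        # 1-based ids; -1 = root
write_swc("toy_neuron.swc", points, radii, parents)
```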

Paper18 3D Neural Sculpting (3DNS): Editing Neural Signed Distance Functions

摘要原文: In recent years, implicit surface representations through neural networks that encode the signed distance have gained popularity and have achieved state-of-the-art results in various tasks (e.g. shape representation, shape reconstruction and learning shape priors). However, in contrast to conventional shape representations such as polygon meshes, the implicit representations cannot be easily edited and existing works that attempt to address this problem are extremely limited. In this work, we propose the first method for efficient interactive editing of signed distance functions expressed through neural networks, allowing free-form editing. Inspired by 3D sculpting software for meshes, we use a brush-based framework that is intuitive and can in the future be used by sculptors and digital artists. In order to localize the desired surface deformations, we regulate the network by using a copy of it to sample the previously expressed surface. We introduce a novel framework for simulating sculpting-style surface edits, in conjunction with interactive surface sampling and efficient adaptation of network weights. We qualitatively and quantitatively evaluate our method in various different 3D objects and under many different edits. The reported results clearly show that our method yields high accuracy, in terms of achieving the desired edits, while in the same time preserving the geometry outside the interaction areas.

中文总结 (Summary): Neural signed distance representations achieve state-of-the-art results in shape representation, reconstruction, and learned shape priors, but, unlike meshes, they are hard to edit and prior attempts are very limited. 3DNS is the first method for efficient, interactive, free-form editing of SDFs expressed by neural networks: inspired by mesh sculpting software, it uses an intuitive brush-based framework, regularizes the network with a copy of itself that samples the previously expressed surface to localize deformations, and combines sculpting-style edit simulation with interactive surface sampling and efficient adaptation of network weights. Qualitative and quantitative evaluations across diverse objects and edits show high accuracy on the desired edits while preserving geometry outside the interaction regions.

Paper19 Learning To Detect 3D Lanes by Shape Matching and Embedding

摘要原文: 3D lane detection based on LiDAR point clouds is a challenging task that requires precise locations, accurate topologies, and distinguishable instances. In this paper, we propose a dual-level shape attention network (DSANet) with two branches for high-precision 3D lane predictions. Specifically, one branch predicts the refined lane segment shapes and the shape embeddings that encode the approximate lane instance shapes, the other branch detects the coarse-grained structures of the lane instances. In the training stage, two-level shape matching loss functions are introduced to jointly optimize the shape parameters of the two-branch outputs, which are simple yet effective for precision enhancement. Furthermore, a shape-guided segments aggregator is proposed to help local lane segments aggregate into complete lane instances, according to the differences of instance shapes predicted at different levels. Experiments conducted on our BEV-3DLanes dataset demonstrate that our method outperforms previous methods.

中文总结 (Summary): 3D lane detection from LiDAR point clouds requires precise locations, accurate topology, and distinguishable instances. DSANet is a dual-level shape attention network with two branches: one predicts refined lane-segment shapes plus shape embeddings encoding approximate instance shapes, while the other detects the coarse-grained structure of lane instances. Two-level shape matching losses jointly optimize the shape parameters of both branches, and a shape-guided segment aggregator merges local segments into complete lane instances based on differences between the instance shapes predicted at the two levels. Experiments on the authors' BEV-3DLanes dataset show the method outperforms previous approaches.

Paper20 3D GAN Inversion With Pose Optimization

摘要原文: With the recent advances in NeRF-based 3D aware GANs quality, projecting an image into the latent space of these 3D-aware GANs has a natural advantage over 2D GAN inversion: not only does it allow multi-view consistent editing of the projected image, but it also enables 3D reconstruction and novel view synthesis when given only a single image. However, the explicit viewpoint control acts as a main hindrance in the 3D GAN inversion process, as both camera pose and latent code have to be optimized simultaneously to reconstruct the given image. Most works that explore the latent space of the 3D-aware GANs rely on ground-truth camera viewpoint or deformable 3D model, thus limiting their applicability. In this work, we introduce a generalizable 3D GAN inversion method that infers camera viewpoint and latent code simultaneously to enable multi-view consistent semantic image editing. The key to our approach is to leverage pre-trained estimators for better initialization and utilize the pixel-wise depth calculated from NeRF parameters to better reconstruct the given image. We conduct extensive experiments on image reconstruction and editing both quantitatively and qualitatively, and further compare our results with 2D GAN-based editing to demonstrate the advantages of utilizing the latent space of 3D GANs. Additional results and visualizations are available at https://hypernerf.github.io/.

中文总结 (Summary): Projecting an image into the latent space of a NeRF-based 3D-aware GAN enables multi-view-consistent editing as well as single-image 3D reconstruction and novel-view synthesis, but explicit viewpoint control makes 3D GAN inversion hard because camera pose and latent code must be optimized simultaneously; most prior work relies on ground-truth viewpoints or deformable 3D models, limiting applicability. This work introduces a generalizable 3D GAN inversion method that infers camera viewpoint and latent code together, using pre-trained estimators for better initialization and pixel-wise depth computed from NeRF parameters for better reconstruction. Extensive quantitative and qualitative experiments on reconstruction and editing, including comparisons with 2D GAN-based editing, demonstrate the advantages of the 3D GAN latent space.

Paper21 Li3DeTr: A LiDAR Based 3D Detection Transformer

摘要原文: Inspired by recent advances in vision transformers for object detection, we propose Li3DeTr, an end-to-end LiDAR based 3D Detection Transformer for autonomous driving, that inputs LiDAR point clouds and regresses 3D bounding boxes. The LiDAR local and global features are encoded using sparse convolution and multi-scale deformable attention respectively. In the decoder head, firstly, in the novel Li3DeTr cross-attention block, we link the LiDAR global features to 3D predictions leveraging the sparse set of object queries learnt from the data. Secondly, the object query interactions are formulated using multi-head self-attention. Finally, the decoder layer is repeated Ldec number of times to refine the object queries. Inspired by DETR, we employ set-to-set loss to train the Li3DeTr network. Without bells and whistles, the Li3DeTr network achieves 61.3% mAP and 67.6% NDS surpassing the state-of-the-art methods with non-maximum suppression (NMS) on the nuScenes dataset and it also achieves competitive performance on the KITTI dataset. We also employ knowledge distillation (KD) using a teacher and student model that slightly improves the performance of our network.

中文总结 (Summary): Li3DeTr is an end-to-end LiDAR-based 3D detection transformer for autonomous driving that takes LiDAR point clouds and regresses 3D bounding boxes. Local and global LiDAR features are encoded with sparse convolution and multi-scale deformable attention; in the decoder, a novel Li3DeTr cross-attention block links global features to 3D predictions through a sparse set of learned object queries, query interactions use multi-head self-attention, and the decoder layer is repeated Ldec times to refine the queries. Trained with a DETR-style set-to-set loss, Li3DeTr reaches 61.3% mAP and 67.6% NDS on nuScenes, surpassing state-of-the-art methods that use non-maximum suppression (NMS), performs competitively on KITTI, and gains slightly more from teacher-student knowledge distillation.
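
The set-to-set loss relies on a bipartite (Hungarian) matching between predictions and ground truth; the sketch below shows that matching step with SciPy, using an invented cost that mixes an L1 box term with a classification term, purely for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_to_set_match(pred_boxes, pred_scores, gt_boxes, gt_classes,
                     l1_weight=1.0, cls_weight=1.0):
    """Hungarian matching between predictions and ground truth, as used by
    DETR-style set-to-set losses. The cost mixes an L1 box distance with a
    classification term (-score of the GT class); both the cost design and
    the weights are illustrative, not the paper's.
    pred_boxes: (P, D), pred_scores: (P, C), gt_boxes: (G, D), gt_classes: (G,).
    """
    l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (P, G)
    cls = -pred_scores[:, gt_classes]                                   # (P, G)
    cost = l1_weight * l1 + cls_weight * cls
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

preds = np.array([[0.0, 0.0, 0.0, 4.0], [10.0, 2.0, 0.0, 4.2]])
scores = np.array([[0.9, 0.1], [0.3, 0.7]])
gts = np.array([[9.8, 2.1, 0.0, 4.0]])
print(set_to_set_match(preds, scores, gts, gt_classes=np.array([1])))   # [(1, 0)]
```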

Paper22 CountNet3D: A 3D Computer Vision Approach To Infer Counts of Occluded Objects

摘要原文: 3D scene understanding is an important problem that has experienced great progress in recent years, in large part due to the development of state-of-the-art methods for 3D object detection. However, the performance of 3D object detectors can suffer in scenarios where extreme occlusion of objects is present, or the number of object classes is large. In this paper, we study the problem of inferring 3D counts from densely packed scenes with heterogeneous objects. This problem has applications to important tasks such as inventory management or automatic crop yield estimation. We propose a novel regression-based method, CountNet3D, that uses mature 2D object detectors for finegrained classification and localization, and a PointNet backbone for geometric embedding. The network processes fused data from images and point clouds for end-to-end learning of counts. We perform experiments on a novel synthetic dataset for inventory management in retail, which we construct and make publicly available to the community. Our results show that regression-based 3D counting methods systematically outperform detection-based methods, and reveal that directly learning from raw point clouds greatly assists count estimation under extreme occlusion. Finally, we study the effectiveness of CountNet3D on a large dataset of real-world scenes where extreme occlusion is present and achieve an error rate of 11.01%.

中文总结 (Summary): 3D object detectors struggle in scenes with extreme occlusion or many object classes. CountNet3D addresses inferring 3D object counts in densely packed scenes with heterogeneous objects (relevant to inventory management and automatic crop-yield estimation) using a regression-based design: mature 2D detectors provide fine-grained classification and localization, a PointNet backbone provides geometric embeddings, and fused image and point-cloud data are processed end to end to learn counts. On a new, publicly released synthetic retail-inventory dataset, regression-based 3D counting systematically beats detection-based counting, and learning directly from raw point clouds helps most under extreme occlusion; on a large real-world dataset with heavy occlusion the method achieves an error rate of 11.01%.

Paper23 3D-SpLineNet: 3D Traffic Line Detection Using Parametric Spline Representations

摘要原文: Monocular 3D traffic line detection jointly tackles the detection of lane markings and regression of their 3D location. The greatest challenge is the exact estimation of various line shapes in the world, which highly depends on the chosen representation. While anchor-based and grid-based line representations have been proposed, all suffer from the same limitation, the necessity of discretizing the 3D space. To address this limitation, we present an anchor-free parametric lane representation, which defines traffic lines as continuous curves in 3D space. Choosing splines as our representation, we show their superiority over polynomials of different degrees that were proposed in previous 2D lane detection approaches. Our continuous representation allows us to model even complex lane shapes at any position in the 3D space, while implicitly enforcing smoothness constraints. Our model is validated on a synthetic 3D lane dataset including a variety of scenes in terms of complexity of road shape and illumination. We outperform the state-of-the-art in nearly all geometric performance metrics and achieve a great leap in the detection rate. In contrast to discrete representations, our parametric model requires no post-processing achieving highest processing speed. Additionally, we provide a thorough analysis over different parametric representations for 3D lane detection. The code and trained models are available on our project website https://3d-splinenet.github.io/.

中文总结 (Summary): Monocular 3D traffic line detection must both detect lane markings and regress their 3D locations, and accuracy hinges on the chosen line representation; anchor-based and grid-based representations all require discretizing the 3D space. 3D-SpLineNet instead uses an anchor-free parametric representation that defines traffic lines as continuous curves in 3D, choosing splines over the polynomials used in earlier 2D lane detection; the continuous form can model complex lane shapes anywhere in 3D space while implicitly enforcing smoothness, and it needs no post-processing, giving the highest processing speed. On a synthetic 3D lane dataset with varied road shapes and illumination it outperforms the state of the art on nearly all geometric metrics and greatly improves the detection rate; the paper also provides a thorough analysis of different parametric representations. Code and trained models: https://3d-splinenet.github.io/.
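
A parametric spline lane can be made concrete with SciPy's B-spline routines: the sketch below fits a cubic spline through a few toy 3D waypoints and samples the resulting continuous curve, which is the kind of representation the paper argues avoids discretizing 3D space. The waypoints and spline settings here are assumptions, not the paper's parameterization.

```python
import numpy as np
from scipy.interpolate import splev, splprep

# Control points of one lane in 3D (x forward, y lateral, z height) -- toy values.
waypoints = np.array([[0, 0, 0], [10, 0.5, 0.1], [20, 1.8, 0.2],
                      [30, 4.0, 0.3], [40, 7.0, 0.35]], dtype=float)

# Fit a cubic B-spline through the waypoints; s=0 interpolates them exactly.
tck, _ = splprep(waypoints.T, s=0, k=3)

# The lane is now a continuous curve: evaluate it at any parameter in [0, 1],
# so no discretization of the 3D space is needed.
u = np.linspace(0.0, 1.0, 200)
x, y, z = splev(u, tck)
curve = np.stack([x, y, z], axis=1)       # (200, 3) densely sampled lane
print(curve[:3])
```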

Paper24 Masked Image Modeling Advances 3D Medical Image Analysis

摘要原文: Recently, masked image modeling (MIM) has gained considerable attention due to its capacity to learn from vast amounts of unlabeled data and has been demonstrated to be effective on a wide variety of vision tasks involving natural images. Meanwhile, the potential of self-supervised learning in modeling 3D medical images is anticipated to be immense due to the high quantities of unlabeled images, and the expense and difficulty of quality labels. However, MIM’s applicability to medical images remains uncertain. In this paper, we demonstrate that masked image modeling approaches can also advance 3D medical images analysis in addition to natural images. We study how masked image modeling strategies leverage performance from the viewpoints of 3D medical image segmentation as a representative downstream task: i) when compared to naive contrastive learning, masked image modeling approaches accelerate the convergence of supervised training even faster (1.40x) and ultimately produce a higher dice score; ii) predicting raw voxel values with a high masking ratio and a relatively smaller patch size is non-trivial self-supervised pretext-task for medical images modeling; iii) a lightweight decoder or projection head design for reconstruction is powerful for masked image modeling on 3D medical images which speeds up training and reduce cost; iv) finally, we also investigate the effectiveness of MIM methods under different practical scenarios where different image resolutions and labeled data ratios are applied. Anonymized codes are available at https://anonymous.4open.science/r/MIM-Med3D.

中文总结 (Summary): Masked image modeling (MIM) learns from large amounts of unlabeled data and works well on natural images, and 3D medical imaging, with abundant unlabeled scans and expensive labels, should benefit as well, but MIM's applicability there had been unclear. Using 3D medical image segmentation as the downstream task, the authors show that (i) MIM accelerates the convergence of supervised training (by 1.40x) and yields a higher Dice score than naive contrastive learning; (ii) predicting raw voxel values with a high masking ratio and a relatively small patch size is a non-trivial self-supervised pretext task for medical image modeling; (iii) a lightweight decoder or projection head for reconstruction speeds up training and reduces cost; and (iv) MIM remains effective across different image resolutions and labeled-data ratios. Anonymized code: https://anonymous.4open.science/r/MIM-Med3D.
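
Finding (ii) concerns masking raw voxels at a high ratio; this PyTorch sketch shows the generic MIM recipe of masking non-overlapping 3D patches and computing a reconstruction loss only on masked voxels. The 8-voxel patch, 75% ratio, and stand-in tensors are illustrative, not the paper's settings.

```python
import torch

def random_patch_mask(volume, patch=8, mask_ratio=0.75):
    """Split a (D, H, W) volume into non-overlapping patches and mask a high
    ratio of them; returns the masked volume and the boolean voxel mask."""
    D, H, W = volume.shape
    gd, gh, gw = D // patch, H // patch, W // patch
    n_patches = gd * gh * gw
    n_mask = int(mask_ratio * n_patches)
    flat = torch.zeros(n_patches, dtype=torch.bool)
    flat[torch.randperm(n_patches)[:n_mask]] = True
    grid = flat.view(gd, gh, gw)
    # Upsample the patch-level mask to voxel resolution.
    mask = (grid.repeat_interleave(patch, 0)
                .repeat_interleave(patch, 1)
                .repeat_interleave(patch, 2))
    return volume * (~mask).float(), mask

vol = torch.rand(64, 64, 64)                  # stand-in for a CT/MR volume
masked, mask = random_patch_mask(vol)
recon = torch.rand_like(vol)                  # stand-in for the decoder output
loss = ((recon - vol)[mask] ** 2).mean()      # reconstruct raw voxels, masked only
print(mask.float().mean().item(), loss.item())
```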

Paper25 Visually Explaining 3D-CNN Predictions for Video Classification With an Adaptive Occlusion Sensitivity Analysis

摘要原文: This paper proposes a method for visually explaining the decision-making process of 3D convolutional neural networks (CNN) with a temporal extension of occlusion sensitivity analysis. The key idea here is to occlude a specific volume of data by a 3D mask in an input 3D temporal-spatial data space and then measure the change degree in the output score. The occluded volume data that produces a larger change degree is regarded as a more critical element for classification. However, while the occlusion sensitivity analysis is commonly used to analyze single image classification, it is not so straightforward to apply this idea to video classification as a simple fixed cuboid cannot deal with the motions. To this end, we adapt the shape of a 3D occlusion mask to complicated motions of target objects. Our flexible mask adaptation is performed by considering the temporal continuity and spatial co-occurrence of the optical flows extracted from the input video data. We further propose to approximate our method by using the first-order partial derivative of the score with respect to an input image to reduce its computational cost. We demonstrate the effectiveness of our method through various and extensive comparisons with the conventional methods in terms of the deletion/insertion metric and the pointing metric on the UCF-101. The code is available at: https://github.com/uchiyama33/AOSA.

中文总结 (Summary): The paper extends occlusion sensitivity analysis temporally to visually explain the decisions of 3D CNNs for video classification: a 3D mask occludes a specific volume of the spatio-temporal input, and the resulting change in the output score measures how critical that volume is. Because a simple fixed cuboid cannot handle motion, the mask's shape is adapted to the target object's motion using the temporal continuity and spatial co-occurrence of optical flow, and a first-order approximation based on the partial derivative of the score with respect to the input reduces the computational cost. Extensive comparisons on UCF-101 with the deletion/insertion and pointing metrics show the method's effectiveness. Code: https://github.com/uchiyama33/AOSA.
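
The baseline that the paper extends, occlusion sensitivity, is easy to sketch: occlude a spatio-temporal cuboid, re-run the model, and record the score drop. The version below uses a fixed cuboid and a toy stand-in model; the paper's flow-adaptive mask shapes and gradient-based first-order approximation are omitted.

```python
import torch

def occlusion_sensitivity_3d(model, clip, target_class,
                             cube=(4, 16, 16), stride=(4, 16, 16)):
    """Slide a 3D occlusion cuboid over a video clip (C, T, H, W) and record
    how much the target-class score drops at each occluded position.
    A fixed cuboid is the classic baseline; the paper adapts its shape to
    optical flow, which is omitted here."""
    model.eval()
    with torch.no_grad():
        base = model(clip.unsqueeze(0))[0, target_class].item()
        T, H, W = clip.shape[1:]
        heat = {}
        for t in range(0, T - cube[0] + 1, stride[0]):
            for y in range(0, H - cube[1] + 1, stride[1]):
                for x in range(0, W - cube[2] + 1, stride[2]):
                    occluded = clip.clone()
                    occluded[:, t:t + cube[0], y:y + cube[1], x:x + cube[2]] = 0.0
                    score = model(occluded.unsqueeze(0))[0, target_class].item()
                    heat[(t, y, x)] = base - score   # larger drop = more critical
    return heat

# Toy model standing in for a 3D-CNN: global average pooling + linear head.
model = torch.nn.Sequential(torch.nn.AdaptiveAvgPool3d(1),
                            torch.nn.Flatten(), torch.nn.Linear(3, 10))
clip = torch.rand(3, 16, 64, 64)
heat = occlusion_sensitivity_3d(model, clip, target_class=0)
print(max(heat.items(), key=lambda kv: kv[1]))       # most critical cuboid
```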

Paper26 Learning Graph Variational Autoencoders With Constraints and Structured Priors for Conditional Indoor 3D Scene Generation

摘要原文: We present a graph variational autoencoder with a structured prior for generating the layout of indoor 3D scenes. Given the room type (e.g., living room or library) and the room layout (e.g., room elements such as floor and walls), our architecture generates a collection of objects (e.g., furniture items such as sofa, table and chairs) that is consistent with the room type and layout. This is a challenging problem because the generated scene needs to satisfy multiple constrains, e.g., each object should lie inside the room and two objects should not occupy the same volume. To address these challenges, we propose a deep generative model that encodes these relationships as soft constraints on an attributed graph (e.g., the nodes capture attributes of room and furniture elements, such as shape, class, pose and size, and the edges capture geometric relationships such as relative orientation). The architecture consists of a graph encoder that maps the input graph to a structured latent space, and a graph decoder that generates a furniture graph, given a latent code and the room graph. The latent space is modeled with autoregressive priors, which facilitates the generation of highly structured scenes. We also propose an efficient training procedure that combines matching and constrained learning. Experiments on the 3D-FRONT dataset show that our method produces scenes that are diverse and are adapted to the room layout.

中文总结 (Summary): The paper presents a graph variational autoencoder with a structured prior that generates indoor 3D scene layouts: given the room type (e.g., living room or library) and the room layout (floor, walls), it generates a set of furniture objects consistent with both, while satisfying constraints such as every object lying inside the room and no two objects occupying the same volume. These relationships are encoded as soft constraints on an attributed graph (nodes carry shape, class, pose, and size of room and furniture elements; edges carry geometric relations such as relative orientation); a graph encoder maps the input to a structured latent space modeled with autoregressive priors, and a graph decoder generates the furniture graph from a latent code and the room graph. An efficient training procedure combines matching with constrained learning, and experiments on 3D-FRONT show diverse scenes adapted to the room layout.

Paper27 CG-NeRF: Conditional Generative Neural Radiance Fields for 3D-Aware Image Synthesis

摘要原文: Recent generative models based on neural radiance fields (NeRF) achieve the generation of diverse 3D-aware images. Despite the success, their applicability can be further expanded by incorporating with various types of user-specified conditions such as text and images. In this paper, we propose a novel approach called the conditional generative neural radiance fields (CG-NeRF), which generates multi-view images that reflect multimodal input conditions such as images or text. However, generating 3D-aware images from multimodal conditions bears several challenges. First, each condition type has different amount of information - e.g., the amount of information in text and color images are significantly different. Furthermore, the pose-consistency is often violated when diversifying the generated images from input conditions. Addressing such challenges, we propose 1) a unified architecture that effectively handles multiple types of conditions, and 2) the pose-consistent diversity loss for generating various images while maintaining the view consistency. Experimental results show that the proposed method maintains consistent image quality on various multimodal condition types and achieves superior fidelity and diversity compared to the existing NeRF-based generative models.

中文总结 (Summary): CG-NeRF extends NeRF-based generative models so that the generated multi-view images reflect multimodal user conditions such as images or text. The challenges are that different condition types carry very different amounts of information and that diversifying outputs from a condition tends to break pose consistency; the method addresses them with a unified architecture that handles multiple condition types and a pose-consistent diversity loss that generates varied images while keeping views consistent. Experiments show consistent image quality across condition types and better fidelity and diversity than existing NeRF-based generative models.

Paper28 3DMM-RF: Convolutional Radiance Fields for 3D Face Modeling

摘要原文: Facial 3D Morphable Models are a main computer vision subject with countless applications and have been highly optimized in the last two decades. The tremendous improvements of deep generative networks have created various possibilities for improving such models and have attracted wide interest. Moreover, the recent advances in neural radiance fields, are revolutionising novel-view synthesis of known scenes. In this work, we present a facial 3D Morphable Model, which exploits both of the above, and can accurately model a subject’s identity, pose and expression and render it in arbitrary illumination. This is achieved by utilizing a powerful deep style-based generator to overcome two main weaknesses of neural radiance fields, their rigidity and rendering speed. We introduce a style-based generative network that synthesizes in one pass all and only the required rendering samples of a neural radiance field. We create a vast labelled synthetic dataset of facial renders, and train the network, so that it can accurately model and generalize on facial identity, pose and appearance. Finally, we show that this model can accurately be fit to “in-the-wild” facial images of arbitrary pose and illumination, extract the facial characteristics, and be used to re-render the face in controllable conditions.

中文总结 (Summary): The paper presents a facial 3D Morphable Model that combines deep style-based generation with neural radiance fields to accurately model a subject's identity, pose, and expression and render it under arbitrary illumination. A style-based generative network synthesizes, in a single pass, all and only the radiance-field samples needed for rendering, overcoming NeRF's rigidity and slow rendering; the network is trained on a large labeled synthetic dataset of facial renders so that it models and generalizes over facial identity, pose, and appearance. The model can be fitted to in-the-wild face images of arbitrary pose and illumination, extract the facial characteristics, and re-render the face under controllable conditions.

Paper29 Seg&Struct: The Interplay Between Part Segmentation and Structure Inference for 3D Shape Parsing

摘要原文: We propose Seg&Struct, a supervised learning framework leveraging the interplay between part segmentation and structure inference and demonstrating their synergy in an integrated framework. Both part segmentation and structure inference have been extensively studied in the recent deep learning literature, while the supervisions used for each task have not been fully exploited to assist the other task. Namely, structure inference has been typically conducted with an autoencoder that does not leverage the point-to-part associations. Also, segmentation has been mostly performed without structural priors that tell the plausibility of the output segments. We present how these two tasks can be best combined while fully utilizing supervision to improve performance. Our framework first decomposes a raw input shape into part segments using an off-the-shelf algorithm, whose outputs are then mapped to nodes in a part hierarchy, establishing point-to-part associations. Following this, ours predicts the structural information, e.g., part bounding boxes and part relationships. Lastly, the segmentation is rectified by examining the confusion of part boundaries using the structure-based part features. Our experimental results based on the StructureNet and PartNet demonstrate that the interplay between the two tasks results in remarkable improvements in both tasks: 27.91% in structure inference and 0.5% in segmentation.

中文总结 (Summary): Seg&Struct is a supervised framework that exploits the interplay between part segmentation and structure inference for 3D shape parsing. Both tasks have been studied extensively, but their supervision has not been used to help each other: structure inference is typically done with autoencoders that ignore point-to-part associations, and segmentation usually lacks structural priors about the plausibility of its outputs. The framework first decomposes a raw shape into part segments with an off-the-shelf algorithm, maps them to nodes in a part hierarchy to establish point-to-part associations, predicts structural information such as part bounding boxes and relationships, and finally rectifies the segmentation by examining part-boundary confusion with structure-based part features. On StructureNet and PartNet, the interplay improves structure inference by 27.91% and segmentation by 0.5%.

Paper30 MonoEdge: Monocular 3D Object Detection Using Local Perspectives

摘要原文: We propose a novel approach for monocular 3D object detection by leveraging local perspective effects of each object. While the global perspective effect shown as size and position variations has been exploited for monocular 3D detection extensively, the local perspectives has long been overlooked. We propose a new regression target named keyedge-ratios as the parameterization of the local shape distortion to account for the local perspective, and derive the object depth and yaw angle from it. Theoretically, this approach does not rely on the absolute size or position of the objects in the image, therefore independent of the camera intrinsic parameters. This approach provides a new perspective for monocular 3D reasoning and can be plugged in flexibly to existing monocular 3D object detection frameworks. We demonstrate effectiveness and superior performance over strong baseline methods in multiple datasets.

中文总结 (Summary): MonoEdge leverages the local perspective effect of each object for monocular 3D detection, whereas prior work exploits only the global perspective effect (size and position variation). A new regression target, keyedge-ratios, parameterizes the local shape distortion, from which object depth and yaw angle are derived; the formulation does not depend on the absolute size or position of objects in the image and is therefore independent of camera intrinsics, and it can be plugged flexibly into existing monocular 3D detection frameworks. The method shows effectiveness and superior performance over strong baselines on multiple datasets.
