Abstract: Reconstructing and interpreting real-world scenes in 3D is a major challenge, especially when this needs to be done from scarce data, such as a single or a few input images or scans. For this to be possible at all, we require rich 3D priors that can constrain reconstruction. In this talk, I will discuss progress at Facebook AI Research in developing such priors. I will introduce Common Objects in 3D, a new dataset of videos of real objects from which 3D priors can be acquired. I will then discuss algorithms to reconstruct new objects from a small number of images, including warp-conditioned ray embedding and NeRFormer. Finally, I will discuss the general problem of establishing 3D correspondences in complex object categories, which is a key step towards building better 3D priors, and introduce NeuroMorph, a method that establishes correspondences in an unsupervised manner.
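As a rough illustration of the view-warping idea underlying warp-conditioned ray embedding, the sketch below projects 3D samples along a target ray into a source view and gathers 2D CNN features there. The camera model, feature maps, and function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (PyTorch assumed): project ray samples into one source view
# and bilinearly sample its 2D feature map. `proj` and `feature_map` are
# hypothetical inputs, not the authors' API.
import torch
import torch.nn.functional as F

def warp_ray_points(points, proj, feature_map):
    """points: (N, 3) world-space samples along a target ray.
    proj: (3, 4) projection matrix of a source view.
    feature_map: (C, H, W) CNN features of the source image.
    Returns (N, C) features warped from the source view."""
    homog = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (N, 4)
    uvw = homog @ proj.T                                                  # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)                          # pixel coords
    h, w = feature_map.shape[1:]
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1).view(1, -1, 1, 2)
    sampled = F.grid_sample(feature_map[None], grid, align_corners=True)
    return sampled[0, :, :, 0].T                                          # (N, C)

# Features gathered from several source views would then be aggregated
# (e.g. by a transformer, as in NeRFormer) and decoded into density and
# colour for volume rendering.
points = torch.rand(64, 3)         # samples along one ray
proj = torch.eye(3, 4)             # toy source camera
feats = torch.rand(32, 128, 128)   # toy feature map
print(warp_ray_points(points, proj, feats).shape)  # torch.Size([64, 32])
```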
Abstract: Over the past decade, we have witnessed significant progress in scene understanding methods. However, the scene understanding pipelines currently used in computer vision are not well suited to modern Interactive AI applications, as these applications frequently lack clear supervision and require constant adaptation to new data and changing environments. This talk focuses on embodied scene understanding as a form of Interactive AI and discusses two challenging embodied tasks that require visual reasoning, scene-state understanding, object manipulation, and navigation, all while maintaining a persistent memory. The talk shows that state-of-the-art techniques struggle to perform these tasks, leaving ample room for future progress in embodied scene understanding.
Abstract: This talk will view scene understanding from the perspective of agents actively interacting with the world around them. We will focus on navigation agents and discuss: a) how we can efficiently learn policies for navigation, and b) how navigating agents can improve their understanding of the world through self-supervised interaction.
Abstract: Estimating the relative rigid pose between two RGB-D scans of the same underlying environment is a fundamental problem in computer vision, robotics, and computer graphics. Most existing approaches allow only limited relative pose changes since they require considerable overlap between the input scans. This talk discusses techniques extending the scope to extreme relative poses, with little or no overlap between the input scans. The key idea is to infer complete scene information about the underlying environment and match the completed scans. We discuss suitable representations for scene completion and how to integrate relative pose estimation and scene completion. Experimental results on benchmark datasets show that our approach leads to considerable improvements over state-of-the-art methods for relative pose estimation. In particular, our approach provides encouraging relative pose estimates even between non-overlapping scans.
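For the matching stage, once both scans have been completed, a least-squares rigid alignment of corresponding points is the standard registration primitive. The sketch below shows the closed-form Kabsch/SVD solution, assuming correspondences on the completed scans are already given; how those correspondences are obtained is not shown here.

```python
# Minimal sketch: closed-form rigid alignment (Kabsch/SVD) of two corresponding
# point sets. Correspondence estimation on the completed scans is assumed.
import numpy as np

def rigid_transform(src, dst):
    """Return R (3x3) and t (3,) minimizing ||dst - (src @ R.T + t)||."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy check: recover a known rotation and translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
a = np.pi / 6
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
R, t = rigid_transform(src, src @ R_true.T + t_true)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```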
Abstract: We live in a 3D world that is dynamic and full of life, with inhabitants such as people and animals who interact with their environment by moving their bodies. While there has been rapid progress in perceiving 3D humans from images and videos, much of this work focuses on 3D human perception alone, independently of the environment and the objects that people interact with. This is particularly true for images captured in uncontrolled, "in-the-wild" settings where there are no ground-truth 3D labels. In this talk, I will discuss our recent direction of recovering humans as they interact, in two settings: 1) perceiving hands and objects, and 2) perceiving humans in movies, a rich data source that depicts human interaction over a large temporal context.
Abstract: The creation of technologies for telepresence, a new means of communication that brings people closer together, has become a necessity, especially as stay-at-home orders and remote work have become part of daily life. The development of virtual humans with AR/VR technologies, enabling authentic and believable interactions between distant people, is therefore of particular importance. In this talk, I will give an overview of our mission and review some of our recent work on creating next-generation virtual humans based on a full understanding of the human body and its environment.
Abstract: Recovering the 3D geometry of scenes from 2D images is one of the most fundamental and challenging problems in computer vision. On the one hand, traditional geometry-based algorithms such as SfM and SLAM are fragile in certain environments, and the resulting noisy point clouds are hard to process and interpret. On the other hand, recent learning-based 3D understanding networks parse scenes by extrapolating patterns seen in the training data, and often have limited generalizability and accuracy. In this talk, I will address these shortcomings and combine the advantages of geometry-based and data-driven approaches into an integrated framework. More specifically, we apply learning-based methods to extract high-level geometric structures from images and use them for 3D parsing. To this end, we design specialized neural networks that understand geometric structures such as lines, junctions, planes, vanishing points, and symmetry, and detect them accurately in images; we also create large-scale 3D datasets with structural annotations to support data-driven approaches; and we demonstrate how to use these high-level abstractions to parse and reconstruct scenes. By combining the power of data-driven approaches with geometric principles, 3D systems are becoming more accurate, reliable, and easier to implement, producing clean, compact, and interpretable scene representations.
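As one concrete example of the kind of geometric structure involved, the sketch below fits a single vanishing point to detected 2D line segments with a simple RANSAC loop. The segment detector itself (and the handling of junctions, planes, and symmetry) is assumed, and all names are illustrative rather than the talk's actual method.

```python
# Minimal sketch: RANSAC for a single vanishing point from 2D line segments.
# Segment detection is assumed; names here are illustrative, not the paper's API.
import numpy as np

def hom_line(p0, p1):
    """Homogeneous line through two image points (cross product of the points)."""
    return np.cross([p0[0], p0[1], 1.0], [p1[0], p1[1], 1.0])

def fit_vanishing_point(segments, iters=500, thresh_deg=2.0, seed=0):
    """segments: (N, 2, 2) endpoints in pixels. Returns (vp_homogeneous, inliers)."""
    rng = np.random.default_rng(seed)
    segments = np.asarray(segments, dtype=float)
    lines = np.array([hom_line(s[0], s[1]) for s in segments])
    mids = segments.mean(axis=1)                  # segment midpoints
    dirs = segments[:, 1] - segments[:, 0]        # segment directions
    cos_thresh = np.cos(np.deg2rad(thresh_deg))
    best_vp, best_count = None, -1
    for _ in range(iters):
        i, j = rng.choice(len(lines), size=2, replace=False)
        vp = np.cross(lines[i], lines[j])         # candidate VP (homogeneous)
        if np.linalg.norm(vp) < 1e-9:
            continue
        to_vp = vp[:2][None, :] - vp[2] * mids    # direction midpoint -> VP
        cos = np.abs(np.sum(to_vp * dirs, axis=1)) / (
            np.linalg.norm(to_vp, axis=1) * np.linalg.norm(dirs, axis=1) + 1e-12)
        count = int(np.sum(cos > cos_thresh))
        if count > best_count:
            best_vp, best_count = vp, count
    return best_vp, best_count

# Toy example: three segments whose lines all meet at the point (200, 100).
segs = [((0, 0), (100, 50)), ((0, 50), (100, 75)), ((200, 0), (200, 50))]
vp, inliers = fit_vanishing_point(segs)
print(vp / vp[2], inliers)                        # ~[200. 100. 1.], 3
```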
Abstract: In this talk, we discuss recent progress on implicitly defined, neural-network-parameterized techniques for signal representation, processing, and rendering. Such representation networks model 3D scenes in a continuous and differentiable manner, for example using signed distance functions or neural volumes. Differentiable renderers, including sphere tracing and neural volume rendering (i.e., NeRF), allow these representations to be learned from partial 2D observations. In this context, we will discuss recent advances in efficient representation network architectures, rendering algorithms, generalization strategies, and generative adversarial approaches for applications ranging from novel view synthesis to unconditional 3D scene generation.
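To make the rendering side concrete, the sketch below sphere-traces a signed distance function. An analytic sphere SDF stands in for the representation network; in the neural setting, the SDF would be an MLP evaluated at 3D points and the tracer would be differentiated through so the representation can be learned from 2D observations.

```python
# Minimal sketch of sphere tracing a signed distance function (SDF). An
# analytic sphere stands in for the representation network.
import numpy as np

def sdf(p):
    """Signed distance to a unit sphere at the origin (placeholder for an MLP)."""
    return np.linalg.norm(p) - 1.0

def sphere_trace(origin, direction, max_steps=64, eps=1e-4, far=10.0):
    """March along origin + t * direction, stepping by the SDF value.
    Outside the shape, the SDF lower-bounds the distance to the surface,
    so stepping by it never overshoots the first intersection."""
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:
            return t, True            # hit the surface
        t += d
        if t > far:
            break
    return t, False                   # no hit within the far bound

t_hit, hit = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(hit, round(t_hit, 4))           # True 2.0 (sphere surface at z = -1)
```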
Abstract: Imagine walking up to a home robot and asking “Hey robot – can you go check if my laptop is on my desk? And if so, bring it to me”. Developing such intelligent systems is a goal of deep scientific and societal value. Training and testing such agents directly in physical environments is slow, expensive, and difficult to reproduce. I will present the next generation of our simulation platform (Habitat) for training virtual robots in interactive environments and complex physics-enabled scenarios.