“Self-supervised Scene Representation Learning”

Vincent Sitzmann



Date: January 20, 2021


Unsupervised learning with generative models has the potential to discover rich representations of 3D scenes. Such Neural Scene Representations may subsequently support a wide variety of downstream tasks, ranging from robotics to computer graphics to medical imaging. However, existing methods ignore one of the most fundamental properties of scenes: their three-dimensional structure. In this talk, I will make the case for equipping Neural Scene Representations with an inductive bias for 3D structure, enabling self-supervised discovery of shape and appearance from only a few observations. By embedding an implicit scene representation in a neural rendering framework and learning a prior over these representations, I will show how we can enable 3D reconstruction from only a single posed 2D image. I will show how the features we learn in this process are already useful for the downstream task of semantic segmentation. I will then show how gradient-based meta-learning can enable fast inference of implicit representations.

Further Information:

Vincent Sitzmann is a postdoc in Joshua Tenenbaum’s group at MIT CSAIL. He previously finished his PhD at Stanford University with a thesis on “Self-Supervised Scene Representation Learning”. His research interest lies in neural scene representations: the way neural networks learn to represent information about our world. His goal is to allow independent agents to reason about our world given visual observations, such as inferring a complete model of a scene, with information on geometry, material, lighting, etc., from only a few observations, a task that is simple for humans but currently impossible for AI.

Created: Friday, January 22nd, 2021