SAOR: Single-View Articulated Object Reconstruction


Mehmet Aygun
Oisin Mac Aodha

University of Edinburgh
CVPR, 2024
 [Paper]

Abstract

We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation.



Single Frame Reconstruction Results

SAOR learns to estimate the 3D shape of highly articulated objects from single images without using any 3D object prior (e.g. a 3D template or skeleton) during training. It also estimates texture, viewpoint, and a 3D articulated part segmentation, all learned in a self-supervised manner. Because it does not rely on such priors, a single model can learn to predict hundreds of object classes. All of the examples below were generated using a single model trained on 101 animal classes, including zebras, horses, cows, elephants, giraffes, birds, and penguins.
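The skeleton-free part-based articulation described above can be viewed as soft linear blend skinning without a fixed kinematic tree: each mesh vertex carries soft part-assignment weights, and each part applies its own rigid transform. Below is a minimal sketch of that idea; the function name, argument shapes, and weights are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def articulate(verts, part_weights, rotations, translations):
    """Skeleton-free part-based deformation (illustrative sketch).

    verts:        (V, 3) rest-pose mesh vertices
    part_weights: (V, P) soft part assignments per vertex (rows sum to 1)
    rotations:    (P, 3, 3) per-part rotation matrices
    translations: (P, 3) per-part translations
    """
    # Apply every part's rigid transform to every vertex: (V, P, 3)
    per_part = np.einsum('pij,vj->vpi', rotations, verts) + translations[None]
    # Blend the per-part results by the soft assignment weights: (V, 3)
    return np.einsum('vp,vpi->vi', part_weights, per_part)

# Toy example: one vertex shared equally between two parts; part 0
# shifts by +1 along x, part 1 stays put, so the vertex moves +0.5 in x.
verts = np.array([[1.0, 0.0, 0.0]])
weights = np.array([[0.5, 0.5]])
rots = np.stack([np.eye(3), np.eye(3)])
trans = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
deformed = articulate(verts, weights, rots, trans)
```

Because the part weights are soft and learned, no pre-defined skeleton or part hierarchy is needed; the blending alone produces smooth articulation.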

Per-Frame Predictions

SAOR robustly estimates the 3D shape, viewpoint, and texture of objects from per-frame input without using any temporal information.







Interactive Single View Reconstructions





Articulation Transfer

SAOR learns to disentangle shape deformation and articulation during training. Given a source image (left), we can transfer articulation from another image (right) by interpolating articulation features, producing a plausible articulation transfer (middle).
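Because articulation is captured in a separate feature vector, transfer reduces to interpolating between the two images' articulation features before decoding. A minimal sketch of that interpolation, with hypothetical feature arrays and function name (the real features come from SAOR's encoder):

```python
import numpy as np

def interpolate_articulation(src_feat, tgt_feat, alpha=1.0):
    """Blend articulation features from a source and a target image.

    alpha=0 keeps the source pose; alpha=1 fully adopts the target pose;
    intermediate values give a smooth transition between the two.
    """
    return (1.0 - alpha) * src_feat + alpha * tgt_feat

# Toy stand-ins for encoded articulation features (one row per part).
src = np.zeros((4, 8))   # source image's articulation features
tgt = np.ones((4, 8))    # target image's articulation features
halfway = interpolate_articulation(src, tgt, alpha=0.5)
```

The interpolated features would then be decoded (together with the source image's shape and texture features) to render the transferred pose.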