Distilling Neural Fields for Real-Time Articulated Shape Reconstruction

Abstract

We present a method for reconstructing articulated 3D models from videos in real time, without test-time optimization or manual 3D supervision at training time. Prior work often relies on pre-built deformable models (e.g., SMAL/SMPL) or on slow per-scene optimization through differentiable rendering (e.g., dynamic NeRFs). Such methods either fail to support arbitrary object categories or are unsuitable for real-time applications. To address the challenge of collecting large-scale 3D training data for arbitrary deformable object categories, our key insight is to use off-the-shelf video-based dynamic NeRFs as 3D supervision to train a fast feed-forward network, turning 3D shape and motion prediction into a supervised distillation task. Our temporally aware network uses articulated bones and blend skinning to represent arbitrary deformations, and is self-supervised on video datasets without requiring 3D shapes or viewpoints as input. Through distillation, our network learns to reconstruct unseen articulated objects in 3D at interactive frame rates. Our method yields higher-fidelity 3D reconstructions than prior real-time methods for animals, with the ability to render realistic images at novel viewpoints and poses.
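
For readers unfamiliar with blend skinning, the following is a minimal, generic sketch of linear blend skinning in plain Python. It is not the implementation used in this work: the function names (`rotation_z`, `apply_rigid`, `blend_skinning`), the two-bone setup, and the hand-picked skinning weights are all assumptions chosen for illustration. It only shows the general idea that each surface point is moved by a weighted combination of per-bone rigid transforms.

```python
import math


def rotation_z(theta):
    """Return a 3x3 rotation matrix about the z-axis (theta in radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_rigid(rotation, translation, point):
    """Apply the rigid transform x -> R x + t to a single 3D point."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def blend_skinning(points, weights, bone_transforms):
    """Linear blend skinning (hypothetical helper, for illustration only).

    points:          list of N rest-pose points [x, y, z]
    weights:         list of N lists, each with B nonnegative weights summing to 1
    bone_transforms: list of B (rotation, translation) pairs
    """
    deformed = []
    for point, point_weights in zip(points, weights):
        blended = [0.0, 0.0, 0.0]
        for w, (rotation, translation) in zip(point_weights, bone_transforms):
            moved = apply_rigid(rotation, translation, point)
            # Accumulate the weighted contribution of this bone.
            blended = [b + w * m for b, m in zip(blended, moved)]
        deformed.append(blended)
    return deformed


# Two assumed bones: one stays fixed, one rotates 90 degrees about z.
identity = (rotation_z(0.0), [0.0, 0.0, 0.0])
quarter_turn = (rotation_z(math.pi / 2), [0.0, 0.0, 0.0])

# Three copies of the same rest-pose point with different assumed weights:
# fully bound to bone 0, fully bound to bone 1, and split evenly.
rest_points = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
skin_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

posed_points = blend_skinning(rest_points, skin_weights, [identity, quarter_turn])
```

In this toy setup, the point bound to the fixed bone stays in place, the point bound to the rotating bone follows it, and the evenly weighted point lands halfway between the two. In a learned system the weights and bone transforms would be predicted rather than set by hand.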

Video Results on Dogs

From left to right: (1) The input image, (2) Comparison to BANMo, (3) Comparison to BARC, (4) Our articulated shape and texture predictions, and (5-7) Our predicted geometry from three views

Video Results on Cats

From left to right: (1) The input image, (2) Comparison to BANMo, (3) Our articulated shape and texture predictions, and (4-6) Our predicted geometry from three views

Video Results on Humans

From left to right: (1) The input image, (2) Comparison to BANMo, (3) Our articulated shape and texture predictions, and (4-6) Our predicted geometry from three views

BibTeX

@inproceedings{tan2023distilling,
    title={Distilling Neural Fields for Real-Time Articulated Shape Reconstruction},
    author={Tan, Jeff and Yang, Gengshan and Ramanan, Deva},
    booktitle={CVPR},
    year={2023}
}

Acknowledgments

Gengshan Yang is supported by the Qualcomm Innovation Fellowship. Thanks to Chonghyuk Song for providing data; Chittesh Thavamani for help with appearance prediction; and Kangle Deng, Zhiqiu Lin, and Erica Weng for reviewing early drafts. The website template was borrowed from Jon Barron.