DressRecon: Freeform 4D Human Reconstruction from Monocular Video

Jeff Tan, Donglai Xiang, Shubham Tulsiani, Deva Ramanan, Gengshan Yang
Carnegie Mellon University

Abstract

We present a method to reconstruct time-consistent human body models from monocular videos, focusing on extremely loose clothing or handheld object interactions. Prior work in human reconstruction is either limited to tight clothing with no object interactions, or requires calibrated multi-view captures or personalized template scans, which are costly to collect at scale. Our key insight for high-quality yet flexible reconstruction is the careful combination of generic human priors about articulated body shape (learned from large-scale training data) with video-specific articulated "bag-of-bones" deformation (fit to a single video via test-time optimization). We accomplish this by learning a neural implicit model that disentangles body versus clothing deformations as separate motion model layers. To capture subtle geometry of clothing, we leverage image-based priors such as human body pose, surface normals, and optical flow during optimization. The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians for high-fidelity interactive rendering. On datasets with highly challenging clothing deformations and object interactions, DressRecon yields higher-fidelity 3D reconstructions than prior art.

Method

The body shape is a neural signed distance field in canonical space. During volume rendering, points sampled along rays at time t are warped back to canonical space via a deformation field.
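
A minimal sketch of this query pattern, with placeholder MLPs standing in for the actual canonical SDF and deformation field (not the released implementation):

```python
import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    """Canonical-space signed distance field (small MLP stand-in)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_canonical):          # (N, 3) -> (N, 1) signed distance
        return self.mlp(x_canonical)

class BackwardWarp(nn.Module):
    """Placeholder deformation field mapping time-t points to canonical space."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 3),
        )

    def forward(self, x_t, t):               # (N, 3), (N, 1) -> (N, 3)
        return x_t + self.mlp(torch.cat([x_t, t], dim=-1))

def query_sdf_at_time(sdf_net, warp_net, x_t, t):
    """Warp samples along rays at time t back to canonical space, then query the SDF."""
    x_canonical = warp_net(x_t, t)
    return sdf_net(x_canonical)

# Usage: signed distances for 1024 ray samples at normalized time t = 0.25.
sdf, warp = CanonicalSDF(), BackwardWarp()
x = torch.randn(1024, 3)
t = torch.full((1024, 1), 0.25)
d = query_sdf_at_time(sdf, warp, x, t)       # (1024, 1)
```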

Hierarchical Deformation

Hierarchical motion fields, represented by body and clothing Gaussians, warp between the canonical shape and time t. The motion fields capture limb motions as well as fine-grained clothing deformations. Using a two-layer model allows us to initialize body pose from off-the-shelf estimates. Below, the woman's arms stop moving when body deformation is removed (middle).
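
A minimal sketch of a two-layer "bag-of-bones" warp, assuming each layer is a set of Gaussians carrying per-Gaussian rigid transforms blended by soft skinning weights; the parameterization and composition order shown here are simplifications for illustration, not the exact implementation:

```python
import torch

def skinning_weights(x, centers, scales):
    """Soft-assign points (N, 3) to K Gaussians via scaled squared distance."""
    d2 = ((x[:, None, :] - centers[None]) / scales[None]) ** 2    # (N, K, 3)
    return torch.softmax(-d2.sum(-1), dim=-1)                     # (N, K)

def warp_layer(x, centers, scales, R, trans):
    """Blend per-Gaussian rigid transforms R (K, 3, 3), trans (K, 3)."""
    w = skinning_weights(x, centers, scales)                      # (N, K)
    x_per_bone = torch.einsum('kij,nj->nki', R, x) + trans[None]  # (N, K, 3)
    return (w[..., None] * x_per_bone).sum(dim=1)                 # (N, 3)

def hierarchical_warp(x_t, body, clothing):
    """Compose the fine clothing layer and the coarse body layer (time t -> canonical)."""
    x = warp_layer(x_t, *clothing)   # fine-grained clothing deformation
    return warp_layer(x, *body)      # articulated limb / body motion

# Usage with identity transforms (points are returned unchanged).
K = 25
layer = (torch.randn(K, 3), 0.1 * torch.ones(K, 3),
         torch.eye(3).repeat(K, 1, 1), torch.zeros(K, 3))
x_canonical = hierarchical_warp(torch.randn(4096, 3), body=layer, clothing=layer)
```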

Image-Based Priors

To capture subtle geometry and make optimization tractable, we use image-based priors from foundation models as supervision, including surface normals, optical flow, universal features, segmentation masks, and 3D human body pose. Each observation below contributes an additional loss term.
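
As an illustration, the per-prior supervision can be combined into a single objective roughly as follows; the loss weights and modality names here are placeholders, not the actual values:

```python
import torch

def prior_supervision_loss(pred, obs, weights=None):
    """Sum squared-error terms over each supervised modality."""
    weights = weights or {'rgb': 1.0, 'mask': 1.0, 'normal': 0.5,
                          'flow': 0.5, 'feature': 0.1, 'pose': 1.0}
    losses = {k: ((pred[k] - obs[k]) ** 2).mean() for k in weights}
    return sum(weights[k] * losses[k] for k in weights), losses

# Usage with random stand-in predictions and observations.
keys = ['rgb', 'mask', 'normal', 'flow', 'feature', 'pose']
pred = {k: torch.randn(8, 16) for k in keys}
obs = {k: torch.randn(8, 16) for k in keys}
total, per_term = prior_supervision_loss(pred, obs)
```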

Refinement with 3D Gaussians

The resulting neural fields can be extracted into time-consistent meshes, or further optimized as explicit 3D Gaussians to improve the rendering quality and enable interactive visualization. Below, refinement from an implicit SDF to 3D Gaussians improves texture quality.
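
A minimal sketch of this refinement stage, assuming Gaussians are seeded from the extracted mesh and optimized against the input frames with a photometric loss; `render_fn` stands in for a differentiable 3D Gaussian rasterizer (applied together with the learned deformation at time t) and is not defined here:

```python
import torch

def init_gaussians_from_mesh(vertices, colors):
    """Seed one 3D Gaussian per extracted mesh vertex
    (centers, colors, log-scales, and pre-sigmoid opacities)."""
    n = vertices.shape[0]
    return torch.nn.ParameterDict({
        'xyz': torch.nn.Parameter(vertices.clone()),
        'rgb': torch.nn.Parameter(colors.clone()),
        'log_scale': torch.nn.Parameter(torch.full((n, 3), -4.0)),
        'opacity': torch.nn.Parameter(torch.zeros(n, 1)),
    })

def refine(gaussians, render_fn, frames, iters=2000, lr=1e-3):
    """Photometric refinement over the input video frames."""
    opt = torch.optim.Adam(gaussians.parameters(), lr=lr)
    for _ in range(iters):
        for image, camera, t in frames:          # monocular video frames
            rendered = render_fn(gaussians, camera, t)
            loss = ((rendered - image) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gaussians
```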

Results

DressRecon reconstructs high-fidelity shapes and motions even in challenging scenarios. Below, we show the reconstructed shape, rendered 3D point tracks, 3D Gaussian locations after refinement, input-view RGB renderings, and input monocular videos on DNA-Rendering sequences.


Extreme View Synthesis

The reconstructed avatars can be rendered from any view. Given the input monocular video on the left, we show four novel-view renderings at extreme views.

Motion Decomposition

The body and clothing deformation layers are evenly distributed in space, and are often responsible for separate types of motion. Below, we remove each motion type from the reconstructed avatar. Clothing Gaussians are yellow and body Gaussians are blue.

Baseline Comparisons

We compare shapes reconstructed by DressRecon against several baselines on DNA-Rendering sequences that contain challenging clothing deformation and handheld objects. DressRecon reconstructs these deformable structures with higher fidelity than prior art.

Acknowledgments

The website template was borrowed from Jon Barron.