Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic Clothing Driven by Sparse RGB-D Input

SIGGRAPH Asia 2023
Donglai Xiang1, Fabian Prada2, Zhe Cao2, Kaiwen Guo2, Chenglei Wu2, Jessica Hodgins1, Timur Bagautdinov2
1Carnegie Mellon University 2Meta Reality Labs Research

We build photorealistic full-body avatars that can be driven by sparse RGB-D views and body pose, and are able to faithfully reproduce the appearance and dynamics of challenging loose clothing.

Abstract

Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm that can efficiently track the coarse garment shape given sparse depth input. Given the coarse tracking results, the input RGB-D images are then remapped to texel-aligned features, which are fed into the drivable avatar models to faithfully reconstruct appearance details. We evaluate our method against recent image-driven synthesis baselines, and conduct a comprehensive analysis of the N-ICP algorithm. We demonstrate that our method can generalize to a novel testing environment, while preserving the ability to produce high-fidelity and faithful clothing dynamics and appearance.

Pose-Driven Animation vs. Faithful Driving

Previous pose-driven avatars like Clothing Codec Avatars can be animated by body pose and produce photorealistic clothing appearance. Dressing Avatars further improves the clothing dynamics with the help of physically-based cloth simulation. These pose-driven avatars can produce output that looks realistic, but cannot guarantee faithfulness due to the inherent ambiguity of clothing states given only body motion as input.
To address this problem, we introduce additional (still sparse) driving signals as input to the system, and aim to drive the avatars faithfully from up to three RGB-D views in addition to body pose.

Extending Texel-Aligned Avatars

We build on top of the formulation of texel-aligned avatars, whose central idea is to aggregate driving images into UV-aligned features by performing texture unwrapping with a posed body template, followed by a convolutional network that predicts the appearance and geometry of the avatars. However, it is hard to apply this method directly to loose clothing: the body template has a different topology from the clothing and is highly misaligned with the true surface, so meaningful texel-aligned features cannot be extracted.
Our central idea is to introduce an initial tracking stage, in which we coarsely track the deformation of the clothing in an online manner from the point cloud fused from the sparse depth input; the driving views are then unwrapped onto this tracked surface to form texel-aligned features, as sketched below.
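As a concrete illustration, the following is a minimal PyTorch sketch of how sparse RGB-D driving views can be unwrapped onto a coarsely tracked clothing mesh to form texel-aligned features. It is not the paper's implementation: the function names (texel_positions, unwrap_to_uv), the precomputed UV rasterization inputs, and the depth-consistency visibility test are all assumptions made for illustration.

# Minimal sketch (not the authors' code) of building texel-aligned features:
# RGB-D driving views are sampled onto the UV atlas of a coarsely tracked mesh.
# Tensor names, shapes, and the depth-consistency visibility test below are
# illustrative assumptions.
import torch
import torch.nn.functional as F


def texel_positions(verts, faces, texel_face_idx, texel_bary):
    """3D position of every texel from a precomputed UV rasterization.

    verts:          (V, 3) tracked mesh vertices
    faces:          (F, 3) triangle vertex indices
    texel_face_idx: (H, W) face index covering each texel (-1 if empty)
    texel_bary:     (H, W, 3) barycentric coordinates of each texel
    """
    tri = verts[faces[texel_face_idx.clamp(min=0)]]          # (H, W, 3, 3)
    pos = (texel_bary.unsqueeze(-1) * tri).sum(dim=-2)       # (H, W, 3)
    valid = texel_face_idx >= 0
    return pos, valid


def unwrap_to_uv(rgbd, K, R, t, pos, valid, depth_tol=0.02):
    """Average the driving views into a UV-aligned feature map.

    rgbd: (N, 4, Hc, Wc) per-camera RGB-D images (depth in meters)
    K:    (N, 3, 3) intrinsics;  R: (N, 3, 3), t: (N, 3) world-to-camera
    pos:  (H, W, 3) texel positions;  valid: (H, W) texel mask
    """
    H, W, _ = pos.shape
    Hc, Wc = rgbd.shape[-2:]
    feat_sum = rgbd.new_zeros(3, H, W)
    weight = rgbd.new_zeros(1, H, W)
    for i in range(rgbd.shape[0]):
        # transform texel positions into the camera frame and project
        cam = (R[i] @ pos.reshape(-1, 3).T + t[i, :, None]).T   # (H*W, 3)
        z = cam[:, 2].clamp(min=1e-6)
        uv = (K[i] @ cam.T).T[:, :2] / z[:, None]               # pixel coords
        # normalize to [-1, 1] for grid_sample
        grid = torch.stack([uv[:, 0] / (Wc - 1) * 2 - 1,
                            uv[:, 1] / (Hc - 1) * 2 - 1], dim=-1)
        grid = grid.reshape(1, H, W, 2)
        sampled = F.grid_sample(rgbd[i:i + 1], grid, align_corners=True)[0]
        # visibility: texel depth must agree with the sensed depth
        vis = (sampled[3] - z.reshape(H, W)).abs() < depth_tol
        in_bounds = grid.reshape(H, W, 2).abs().amax(-1).le(1)
        mask = (vis & valid & in_bounds).float()
        feat_sum += sampled[:3] * mask
        weight += mask
    return feat_sum / weight.clamp(min=1.0)   # (3, H, W) texel-aligned features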

Neural (Non-Rigid) Iterative Closest Point (N-ICP)

We take inspiration from classical (non-rigid) Iterative Closest Point (ICP) and introduce a data-driven method that is trained to take the most efficient tracking steps iteratively, with the help of a low-dimensional deformation graph model.
At each iteration, we feed the residuals and gradients from the closest-point query into a PointNet, trained in a self-supervised manner, to update the deformation parameters. We show that the trained PointNet converges more efficiently than classical first-order nonlinear optimization.
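Below is a minimal PyTorch sketch of one way such a Neural ICP loop could be organized: a PointNet-style network pools per-point closest-point features and predicts an update to the low-dimensional deformation-graph parameters. This is not the paper's implementation; the deform_graph callable, the choice of per-point features (here simply vertex position and residual, standing in for the residual/gradient features described above), and all layer sizes are assumptions.

# Minimal sketch (not the authors' implementation) of a Neural ICP loop:
# a small PointNet-style network consumes per-point closest-point features
# and predicts an update to a low-dimensional deformation-graph state.
import torch
import torch.nn as nn


class PointNetUpdater(nn.Module):
    """Predicts a deformation-parameter update from per-point ICP features."""

    def __init__(self, in_dim, param_dim, hidden=256):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, param_dim))

    def forward(self, point_feats):            # (B, N, in_dim)
        h = self.point_mlp(point_feats)        # per-point encoding
        g = h.max(dim=1).values                # order-invariant pooling
        return self.head(g)                    # (B, param_dim) update


def n_icp(updater, deform_graph, template_verts, fused_points, params, n_iters=4):
    """Iteratively refine deformation-graph parameters against a fused point cloud.

    deform_graph(template_verts, params) -> deformed vertices (N, 3);
    params is the low-dimensional deformation-graph state.
    """
    for _ in range(n_iters):
        deformed = deform_graph(template_verts, params)            # (N, 3)
        # closest-point query: nearest fused point for every template vertex
        d = torch.cdist(deformed, fused_points)                    # (N, M)
        nn_idx = d.argmin(dim=1)
        residual = fused_points[nn_idx] - deformed                 # (N, 3)
        # per-point features: current position + residual (an illustrative
        # stand-in for the residual/gradient features described above)
        feats = torch.cat([deformed, residual], dim=-1)[None]      # (1, N, 6)
        params = params + updater(feats).squeeze(0)                # neural step
    return params

With features of this form, the updater would be built as PointNetUpdater(in_dim=6, param_dim=params.numel()); the intent, as described above, is that a few learned steps replace the many iterations a classical first-order optimizer would need.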

Results

We demonstrate that the introduction of additional sensing signals allows our method to drive the clothed avatar more faithfully than previous pose-driven avatars. In the paper, we also show comparisons with generalizable NeRFs and sensing-based baselines, as well as ablation studies.
To demonstrate the applicability of our method to telepresence, we test the avatars in a novel capture environment using three Kinects as the RGB-D input. Without fine-tuning, our avatars capture the body and clothing dynamics while retaining the appearance from the training data. After fine-tuning, the output is adapted to the illumination of the new environment.
Please see our supplementary video for the full results and our paper for more detail.