AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

ICCV 2021

Yudong Guo¹,²      Keyu Chen¹,²      Sen Liang³      Yong-Jin Liu⁴      Hujun Bao³      Juyong Zhang¹
¹University of Science and Technology of China     ²Beijing Dilusense Technology Corporation
³Zhejiang University     ⁴Tsinghua University

Introduction

Generating high-fidelity talking-head video that matches an input audio sequence is a challenging problem that has received considerable attention recently. In this paper, we address this problem with the aid of neural scene representation networks. Our method differs fundamentally from existing methods, which rely on intermediate representations such as 2D landmarks or 3D face models to bridge the gap between audio input and video output. Specifically, features of the input audio signal are fed directly into a conditional implicit function to generate a dynamic neural radiance field, from which a high-fidelity talking-head video corresponding to the audio signal is synthesized using volume rendering. Another advantage of our framework is that it synthesizes not only the head (with hair) region, as previous methods do, but also the upper body, using two individual neural radiance fields. Experimental results demonstrate that our novel framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.

Pipeline

Our talking-head synthesis framework is trained on a short video sequence of a target person together with its audio track. Following the neural rendering idea, we implicitly model the deforming human head and upper body with a neural scene representation, namely neural radiance fields. To bridge the domain gap between audio signals and visual faces, we extract semantic audio features and learn a conditional implicit function that maps these features to neural radiance fields. Finally, the face images are rendered from the neural radiance fields using volumetric rendering.
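
The data flow described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names (`make_mlp`, `conditional_field`, `render_ray`), the tiny untrained two-layer MLP, the choice to condition by simple concatenation, and the way leftover transmittance is filled with a background color are all assumptions made for illustration. The sketch only shows how an audio feature, a 3D sample point, and a viewing direction could be mapped to color and density, and how samples along one ray are accumulated by the standard volume rendering quadrature.

```python
import numpy as np

def make_mlp(rng, in_dim, hidden, out_dim):
    """Random weights for a two-layer MLP (untrained, illustration only)."""
    return {
        "w1": rng.normal(0.0, 0.5, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.5, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def conditional_field(params, audio_feat, points, view_dir):
    """Map (audio feature, 3D points, view direction) to (rgb, density).

    Conditioning by concatenation is an assumption of this sketch.
    """
    n = points.shape[0]
    inp = np.concatenate(
        [np.tile(audio_feat, (n, 1)), points, np.tile(view_dir, (n, 1))],
        axis=1,
    )
    h = np.maximum(inp @ params["w1"] + params["b1"], 0.0)  # ReLU
    out = h @ params["w2"] + params["b2"]
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))  # sigmoid keeps colors in [0, 1]
    sigma = np.maximum(out[:, 3], 0.0)       # density must be non-negative
    return rgb, sigma

def render_ray(rgb, sigma, t_vals, background):
    """Accumulate samples along one ray, then fill the rest with background."""
    deltas = np.diff(t_vals)
    deltas = np.append(deltas, deltas[-1])   # reuse last spacing for final sample
    alpha = 1.0 - np.exp(-sigma * deltas)    # opacity of each segment
    # Transmittance: fraction of light that reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    color = (weights[:, None] * rgb).sum(axis=0)
    return color + (1.0 - weights.sum()) * background

# Usage: one ray through an untrained field, with a hypothetical 4-D audio feature.
rng = np.random.default_rng(0)
params = make_mlp(rng, in_dim=4 + 3 + 3, hidden=16, out_dim=4)
t_vals = np.linspace(0.5, 1.5, 8)
origin, direction = np.zeros(3), np.array([0.0, 0.0, 1.0])
points = origin + t_vals[:, None] * direction
rgb, sigma = conditional_field(params, np.ones(4), points, direction)
pixel = render_ray(rgb, sigma, t_vals, background=np.array([1.0, 1.0, 1.0]))
```

Because the audio feature is an input to the field, changing it changes the predicted color and density at every sample point, which is what makes the radiance field dynamic; the viewing direction and the background are likewise free inputs at render time.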

Individual NeRFs Representation

We decompose the neural radiance fields of human portrait scenes into two branches that model head and torso deformation separately, which helps to generate more natural talking-head results.
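
As a rough illustration of how the outputs of two separate branches might be merged into one frame, the sketch below alpha-composites two rendered layers over a background image. The layering order (head over background first, then torso over that result), the helper names `composite_over` and `render_portrait`, and the use of opacity-weighted (premultiplied) colors are assumptions of this sketch and are not stated on this page.

```python
import numpy as np

def composite_over(layer_rgb, layer_alpha, behind_rgb):
    """Composite a rendered layer over an image.

    layer_rgb is assumed to be opacity-weighted (premultiplied) color,
    as produced by accumulating samples along each ray.
    """
    return layer_rgb + (1.0 - layer_alpha[..., None]) * behind_rgb

def render_portrait(head_rgb, head_alpha, torso_rgb, torso_alpha, background):
    """Merge head and torso layers; the order used here is an assumption."""
    head_over_bg = composite_over(head_rgb, head_alpha, background)
    return composite_over(torso_rgb, torso_alpha, head_over_bg)

# Usage on a 2x2 toy image: empty head layer, opaque blue torso in the bottom row.
background = np.full((2, 2, 3), 0.5)
head_rgb, head_alpha = np.zeros((2, 2, 3)), np.zeros((2, 2))
torso_rgb, torso_alpha = np.zeros((2, 2, 3)), np.zeros((2, 2))
torso_alpha[1, :] = 1.0
torso_rgb[1, :] = [0.0, 0.0, 1.0]
frame = render_portrait(head_rgb, head_alpha, torso_rgb, torso_alpha, background)
```

Keeping the two layers separate until this final step means each branch can be driven and rendered on its own, and the background can be swapped without re-rendering either layer.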

Comparisons

Image-based talking-head methods are restricted by the input image size and thus cannot produce high-resolution imagery as we do. Model-based methods generally require large quantities of training data. Moreover, our method has the advantage of freely manipulating the viewing direction on the target person, which means that we can naturally rotate the “virtual camera” to observe the speaker from arbitrary novel angles.