AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

ICCV 2021

Yudong Guo1,2      Keyu Chen1,2      Sen Liang3      Yong-Jin Liu4      Hujun Bao3      Juyong Zhang1

1University of Science and Technology of China     2Beijing Dilusense Technology Corporation

3Zhejiang University     4Tsinghua University

Introduction

Generating a high-fidelity talking-head video that matches an input audio sequence is a challenging problem that has received considerable attention recently. In this paper, we address this problem with the aid of neural scene representation networks. Our method differs fundamentally from existing methods that rely on intermediate representations such as 2D landmarks or 3D face models to bridge the gap between audio input and video output. Specifically, features of the input audio signal are fed directly into a conditional implicit function to generate a dynamic neural radiance field, from which a high-fidelity talking-head video corresponding to the audio signal is synthesized via volume rendering. Another advantage of our framework is that, beyond the head (with hair) region synthesized by previous methods, the upper body is also generated, via two individual neural radiance fields. Experimental results demonstrate that our novel framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
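
To make the core idea concrete, below is a minimal PyTorch sketch of an audio-conditioned implicit function: an MLP that maps an encoded 3D point, a view direction, and an audio feature vector to volume density and color. The layer widths and feature dimensions are illustrative assumptions for this sketch, not the exact architecture from the paper.

import torch
import torch.nn as nn

class AudioConditionedNeRF(nn.Module):
    """Audio-conditioned implicit function (illustrative sketch):
    (encoded 3D point, encoded view direction, audio feature) -> (density, RGB).
    Layer sizes are assumptions, not the paper's exact architecture."""
    def __init__(self, pos_dim=63, dir_dim=27, audio_dim=64, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)          # volume density
        self.rgb_head = nn.Sequential(                  # view-dependent color
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc, audio_feat):
        # Condition the radiance field on the audio feature by concatenation.
        h = self.backbone(torch.cat([pos_enc, audio_feat], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.rgb_head(torch.cat([h, dir_enc], dim=-1))
        return sigma, rgb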

Pipeline

Our talking-head synthesis framework is trained on a short video sequence of a target person along with its audio track. Building on the idea of neural rendering, we implicitly model the deforming head and upper body with a neural scene representation, i.e., neural radiance fields. To bridge the domain gap between audio signals and visual faces, we extract semantic audio features and learn a conditional implicit function that maps the audio features to neural radiance fields. Finally, facial images are rendered from the neural radiance fields via volume rendering, as sketched below.
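
For reference, the rendering step can be written as the standard NeRF-style volume integration along each camera ray; the tensor shapes and function name below are assumptions for this sketch.

import torch

def volume_render(sigma, rgb, z_vals):
    """Standard NeRF-style volume rendering for a batch of rays.
    sigma: (R, S, 1) densities, rgb: (R, S, 3) colors, z_vals: (R, S) sample depths.
    Returns per-ray color (R, 3) and accumulated alpha (R, 1)."""
    deltas = z_vals[..., 1:] - z_vals[..., :-1]
    # Pad the last interval with a large value, as in the original NeRF.
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)   # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]                                  # transmittance T_i
    weights = alpha * trans
    color = (weights.unsqueeze(-1) * rgb).sum(dim=-2)
    acc = weights.sum(dim=-1, keepdim=True)                # accumulated alpha
    return color, acc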

Audio Driven Results

Our method allows arbitrary audio input, from speakers of different identities, genders, and languages, to drive the target person.

Background & Pose Editing

Our method can generate talking head frames with freely adjusted viewing directions and various background images.
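
Because volume rendering accumulates an alpha value along each ray, the rendered portrait can be composited onto any user-supplied background. A minimal sketch of this blending step, with variable names assumed to match the renderer above:

def composite_background(color, acc, background):
    """Blend a rendered foreground onto an arbitrary background image.
    color: (R, 3) per-ray rendered colors, acc: (R, 1) accumulated alpha,
    background: (R, 3) background pixels sampled at the same ray locations."""
    return color + (1.0 - acc) * background

Rays with low accumulated alpha (i.e., little density along the ray) simply show the chosen background, which is what makes background swapping a free byproduct of the representation.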

Individual NeRFs Representation

We decompose the neural radiance fields of human portrait scenes into two branches to model the head and torso deformations, respectively, which helps generate more natural talking-head results.
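
One plausible way to combine the two branches is to render each field separately and layer the torso rendering over the head rendering, which is itself composited onto the scene background. This compositing order is an assumption made for illustration, not necessarily the released implementation:

def composite_branches(head_rgb, head_acc, torso_rgb, torso_acc, background):
    """Layer the torso branch over the head branch, which sits on the
    background (compositing order assumed for illustration).
    Inputs are (R, 3) colors and (R, 1) alphas from a volume renderer
    such as the sketch in the Pipeline section."""
    head_layer = head_rgb + (1.0 - head_acc) * background
    return torso_rgb + (1.0 - torso_acc) * head_layer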

Comparisons

Image-based talking-head methods are restricted by the input image size and thus cannot produce high-resolution imagery as our method does. Model-based methods generally require large quantities of training data. Moreover, our method has the advantage of freely manipulating the viewing direction of the target person, meaning we can naturally rotate the “virtual camera” to observe the speaking actor from arbitrary novel angles.

Citation

@inproceedings{guo2021adnerf,
    title={AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis},
    author={Yudong Guo and Keyu Chen and Sen Liang and Yong-Jin Liu and Hujun Bao and Juyong Zhang},
    booktitle={{IEEE/CVF} International Conference on Computer Vision (ICCV)},
    year={2021}
}