3Zhejiang University 4Tsinghua University
Generating high-fidelity talking-head video that fits an input audio sequence is a challenging problem that has received considerable attention recently. In this paper, we address this problem with the aid of neural scene representation networks. Our method differs fundamentally from existing methods, which rely on intermediate representations such as 2D landmarks or 3D face models to bridge the gap between audio input and video output. Specifically, the features of the input audio signal are fed directly into a conditional implicit function to generate a dynamic neural radiance field, from which a high-fidelity talking-head video corresponding to the audio signal is synthesized using volume rendering. Another advantage of our framework is that it synthesizes not only the head (with hair) region, as previous methods did, but also the upper body, using two individual neural radiance fields. Experimental results demonstrate that our novel framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
Our talking-head synthesis framework is trained on a short video sequence of a target person together with its audio track. Following the neural rendering idea, we implicitly model the deformed human head and upper body with a neural scene representation, i.e., neural radiance fields. To bridge the domain gap between audio signals and visual faces, we extract semantic audio features and learn a conditional implicit function that maps the audio features to neural radiance fields. Finally, visual faces are rendered from the neural radiance fields using volumetric rendering.
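The last two steps of this pipeline, querying an audio-conditioned field and integrating it along a camera ray, can be sketched as follows. This is a minimal illustration, not the released implementation: `toy_conditional_field` is a hypothetical single linear layer standing in for the learned implicit function, its `weights` argument and the feature sizes are assumptions, and `render_ray` uses the standard NeRF quadrature for volume rendering.

```python
import numpy as np

def toy_conditional_field(points, view_dir, audio_feat, weights):
    """Hypothetical stand-in for the learned conditional implicit function.

    Maps (3D position, viewing direction, audio feature) to (rgb, sigma).
    A real model would be an MLP with positional encoding; here a single
    linear layer is used purely for illustration.
    """
    n = points.shape[0]
    inputs = np.concatenate(
        [points, np.tile(view_dir, (n, 1)), np.tile(audio_feat, (n, 1))], axis=1
    )
    out = inputs @ weights                      # shape (n, 4)
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))     # sigmoid keeps colour in [0, 1]
    sigma = np.maximum(out[:, 3], 0.0)          # ReLU keeps density non-negative
    return rgb, sigma

def render_ray(rgb, sigma, t_vals):
    """Standard NeRF-style volume rendering along one ray.

    rgb: (n, 3) colours, sigma: (n,) densities, t_vals: (n,) sample depths.
    Returns the accumulated colour and the per-sample weights.
    """
    # Distance between adjacent samples; the last interval is treated as very long.
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alpha = 1.0 - np.exp(-sigma * deltas)       # opacity of each segment
    # Transmittance: probability that the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    w = trans * alpha
    color = (w[:, None] * rgb).sum(axis=0)
    return color, w

# Usage: one ray, 16 samples, a 64-dimensional audio feature (size is an assumption).
rng = np.random.default_rng(0)
t_vals = np.linspace(2.0, 6.0, 16)
origin, direction = np.zeros(3), np.array([0.0, 0.0, -1.0])
points = origin + t_vals[:, None] * direction
audio_feat = rng.normal(size=64)
weights = rng.normal(scale=0.1, size=(3 + 3 + 64, 4))
rgb, sigma = toy_conditional_field(points, direction, audio_feat, weights)
pixel, ray_weights = render_ray(rgb, sigma, t_vals)
```

Changing `audio_feat` while keeping the ray fixed changes the rendered pixel, which is the sense in which the radiance field is driven by audio.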
Our method allows arbitrary audio input, from speakers of a different identity, gender, or language, to drive the target person.
Our method can generate talking head frames with freely adjusted viewing directions and various background images.
We decompose the neural radiance fields of human portrait scenes into two branches that model the head and torso deformations, respectively, which helps to generate more natural talking-head results.
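One way to picture how two separately rendered branches and a replaceable background could be combined into a final frame is simple "over" compositing. The sketch below is an assumption made for illustration: the text above does not specify the compositing order, and the function names `composite_over` and `compose_portrait` are hypothetical. Each branch is assumed to output an accumulated colour image and an accumulated opacity map from volume rendering.

```python
import numpy as np

def composite_over(fg_rgb, fg_acc, bg_rgb):
    """Place a rendered layer over a background image.

    fg_rgb: (H, W, 3) accumulated colour from volume rendering
            (already weighted by opacity).
    fg_acc: (H, W) accumulated opacity in [0, 1].
    bg_rgb: (H, W, 3) image shown where the layer is transparent.
    """
    return fg_rgb + (1.0 - fg_acc[..., None]) * bg_rgb

def compose_portrait(head_rgb, head_acc, torso_rgb, torso_acc, background):
    """Assumed ordering: head over background, then torso over that result."""
    head_over_bg = composite_over(head_rgb, head_acc, background)
    return composite_over(torso_rgb, torso_acc, head_over_bg)

# Usage with tiny 2x2 layers: swapping `background` changes only the
# pixels that neither branch covers.
h, w = 2, 2
background = np.full((h, w, 3), 0.5)
head_rgb = np.zeros((h, w, 3))
head_acc = np.zeros((h, w))
head_rgb[0, 0] = [1.0, 0.0, 0.0]
head_acc[0, 0] = 1.0
torso_rgb = np.zeros((h, w, 3))
torso_acc = np.zeros((h, w))
torso_rgb[1, 1] = [0.0, 0.0, 1.0]
torso_acc[1, 1] = 1.0
frame = compose_portrait(head_rgb, head_acc, torso_rgb, torso_acc, background)
```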
Image-based talking-head methods are restricted by the input image size and thus cannot produce high-resolution imagery as our method does. Model-based methods generally require large quantities of training data. Moreover, our method has the advantage of freely manipulating the viewing direction on the target person, which means that we can naturally rotate the “virtual camera” to observe the speaking actor from arbitrary novel angles.
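Rotating the "virtual camera" amounts to changing the camera pose from which rays are cast into the radiance field. The following sketch generates per-pixel rays for a pinhole camera and rotates the pose about the vertical axis. It assumes the common NeRF convention in which the camera looks along its negative z-axis; the helper names `rotation_y` and `camera_rays`, the focal length, and the image size are illustrative choices rather than values from the method.

```python
import numpy as np

def rotation_y(angle_rad):
    """3x3 rotation about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def camera_rays(height, width, focal, cam_to_world):
    """Ray origins and directions for every pixel of a pinhole camera.

    cam_to_world: (4, 4) pose matrix. The camera is assumed to look
    along its local -z axis, with +x right and +y up.
    """
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    dirs = np.stack(
        [
            (i - width * 0.5) / focal,
            -(j - height * 0.5) / focal,
            -np.ones_like(i, dtype=float),
        ],
        axis=-1,
    )
    # Rotate camera-space directions into world space.
    rays_d = dirs @ cam_to_world[:3, :3].T
    # Every ray starts at the camera centre.
    rays_o = np.broadcast_to(cam_to_world[:3, 3], rays_d.shape)
    return rays_o, rays_d

# Usage: the same camera before and after a 90-degree turn about the y-axis.
pose = np.eye(4)
rays_o, rays_d = camera_rays(4, 4, focal=2.0, cam_to_world=pose)
turned = np.eye(4)
turned[:3, :3] = rotation_y(np.pi / 2)
rays_o_turned, rays_d_turned = camera_rays(4, 4, focal=2.0, cam_to_world=turned)
```

Feeding the rotated rays to the same trained field would render the speaker from the new angle without retraining.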
@inproceedings{guo2021adnerf,
    title={AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis},
    author={Yudong Guo and Keyu Chen and Sen Liang and Yongjin Liu and Hujun Bao and Juyong Zhang},
    booktitle = {{IEEE/CVF} International Conference on Computer Vision (ICCV)},
    year={2021}
}