ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis

Author(s)
Li, J
Zhang, J
Bai, X
Zheng, J
Zhou, J
Gu, L
Date
2024
Abstract

Although conditional Neural Radiance Fields (NeRF) have achieved great success in modeling audio-driven talking portraits, generation quality is still limited by inefficient use of spatial information. This paper presents ER-NeRF, a novel conditional NeRF-based architecture for talking portrait synthesis, and its variant ER-NeRF++, which together achieve fast convergence, real-time rendering, and state-of-the-art performance with a small model size. Motivated by the unequal contribution of different spatial regions, we propose two modules in ER-NeRF to guide talking portrait modeling: (1) a compact and expressive Tri-Plane Hash Representation that improves the accuracy of dynamic head reconstruction by pruning empty spatial regions with three planar hash encoders; and (2) a Region Attention Module for audio–visual feature fusion, which includes a novel cross-modal attention mechanism that explicitly connects audio features with different spatial regions to provide local motion priors. Additionally, to tackle the difficulty of learning large facial motions, we propose a deformable variant, ER-NeRF++, which adds a Deformation Grid Transformer that enables the reuse of cross-regional spatial features for large motion representation. Compared to ER-NeRF, ER-NeRF++ achieves a significant improvement in facial motion quality while retaining fast training and real-time rendering. For the torso, a direct Adaptive Pose Encoding is introduced to simplify the pose information for a better head-torso connection. Extensive experiments demonstrate that both proposed frameworks efficiently render lifelike talking portrait videos with rich, realistic details, and outperform previous methods in image quality and audio-lip synchronization.
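
As a rough illustration of the "three planar hash encoders" idea mentioned in the abstract, the sketch below projects 3D points onto three 2D planes, looks each projection up in its own hashed feature table with bilinear interpolation, and concatenates the results. This is not the authors' implementation. The class and function names, the choice of axis-aligned XY, YZ and XZ planes, the single grid resolution, the hash constants, and the concatenation step are all assumptions made for the example; the abstract does not specify them.

```python
import numpy as np

# Hypothetical hash multipliers; kept as uint64 so the arithmetic stays in integers.
PRIME_U = np.uint64(1)
PRIME_V = np.uint64(2654435761)


class PlaneHashEncoder:
    """Single-resolution 2D hash-grid encoder (illustrative simplification)."""

    def __init__(self, resolution=64, table_size=2**14, feat_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        self.resolution = resolution
        self.table_size = table_size
        # Feature table; in a real model these entries would be learned.
        self.table = rng.uniform(-1e-4, 1e-4, size=(table_size, feat_dim))

    def _hash(self, ij):
        # Map integer grid coordinates (N, 2) to table indices (N,).
        ij = ij.astype(np.uint64)
        h = (ij[:, 0] * PRIME_U) ^ (ij[:, 1] * PRIME_V)
        return (h % np.uint64(self.table_size)).astype(np.int64)

    def __call__(self, uv):
        # uv: (N, 2) plane coordinates, expected in [0, 1].
        pos = np.clip(uv, 0.0, 1.0) * (self.resolution - 1)
        lo = np.minimum(np.floor(pos).astype(np.int64), self.resolution - 2)
        w = pos - lo  # fractional offsets used as bilinear weights
        out = np.zeros((uv.shape[0], self.table.shape[1]))
        for di in (0, 1):
            for dj in (0, 1):
                corner = lo + np.array([di, dj])
                wu = w[:, 0] if di else 1.0 - w[:, 0]
                wv = w[:, 1] if dj else 1.0 - w[:, 1]
                out += (wu * wv)[:, None] * self.table[self._hash(corner)]
        return out


def tri_plane_encode(xyz, encoders):
    """Encode (N, 3) points with one planar encoder per assumed plane."""
    xy, yz, xz = xyz[:, [0, 1]], xyz[:, [1, 2]], xyz[:, [0, 2]]
    feats = [encoders[0](xy), encoders[1](yz), encoders[2](xz)]
    return np.concatenate(feats, axis=1)


encoders = [PlaneHashEncoder(seed=s) for s in range(3)]
points = np.random.default_rng(42).uniform(0.0, 1.0, size=(5, 3))
features = tri_plane_encode(points, encoders)
```

Here `features` has one row per point and one block of columns per plane. Because each lookup involves only two coordinates, every table indexes a 2D grid rather than a 3D one, which is one plausible reading of why a planar layout can stay compact.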

Journal Title

Information Fusion

Volume

110

Subject

Artificial intelligence

Computer vision and multimedia computation

Data management and data science

Citation

Li, J; Zhang, J; Bai, X; Zheng, J; Zhou, J; Gu, L, ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis, Information Fusion, 2024, 110, pp. 102456
