ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis
Author(s)
Li, J
Zhang, J
Bai, X
Zheng, J
Zhou, J
Gu, L
Abstract
Although conditional Neural Radiance Fields (NeRF) have achieved great success in modeling audio-driven talking portraits, generation quality is increasingly hampered by inefficient use of spatial information. This paper presents ER-NeRF, a novel conditional NeRF-based architecture for talking portrait synthesis, and its variant ER-NeRF++, which concurrently achieve fast convergence, real-time rendering, and state-of-the-art performance with a small model size. Inspired by the unequal contribution of spatial regions, we propose two modules in ER-NeRF to guide talking portrait modeling: (1) a compact and expressive Tri-Plane Hash Representation that improves the accuracy of dynamic head reconstruction by pruning empty spatial regions with three planar hash encoders; and (2) a Region Attention Module for audio–visual feature fusion, including a novel cross-modal attention mechanism that explicitly connects audio features with different spatial regions to provide local motion priors. Additionally, to tackle the difficulty of learning large facial motions, we propose a deformable variant, ER-NeRF++, which adds a Deformation Grid Transformer to enable the reuse of cross-regional spatial features for large motion representation. Compared to ER-NeRF, the ER-NeRF++ framework achieves a significant improvement in facial motion quality while retaining fast training and real-time rendering. For the torso part, a direct Adaptive Pose Encoding is introduced to simplify the pose information for a better head-torso connection. Extensive experiments demonstrate that both proposed frameworks can efficiently render lifelike talking portrait videos with rich, realistic details, outperforming previous methods in image quality and audio-lip synchronization.
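To give a rough idea of what "three planar hash encoders" could look like in practice, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a single-resolution 2D hash grid per plane with bilinear interpolation, an Instant-NGP-style XOR hash, the axis pairings XY, YZ and XZ, and concatenation of the three plane features; none of these details is stated in the abstract. The class names `PlaneHashEncoder` and `TriPlaneHashEncoder` and all parameter values are hypothetical.

```python
import numpy as np

# Prime commonly used for the second dimension in Instant-NGP-style spatial hashing.
PRIME_Y = np.uint64(2654435761)


class PlaneHashEncoder:
    """Illustrative single-resolution 2D hash grid with bilinear interpolation."""

    def __init__(self, table_size=2**14, feat_dim=2, resolution=64, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model this table would be learned; here it is random.
        self.table = rng.uniform(-1e-4, 1e-4, size=(table_size, feat_dim))
        self.table_size = table_size
        self.resolution = resolution

    def _hash(self, ix, iy):
        # XOR the integer grid coordinates (the second scaled by a prime),
        # then wrap the result into the table.
        ix = ix.astype(np.uint64)
        iy = iy.astype(np.uint64)
        hashed = (ix ^ (iy * PRIME_Y)) % np.uint64(self.table_size)
        return hashed.astype(np.int64)

    def __call__(self, uv):
        # uv: (N, 2) plane coordinates, expected in [0, 1].
        pos = np.clip(uv, 0.0, 1.0) * (self.resolution - 1)
        i0 = np.minimum(np.floor(pos).astype(np.int64), self.resolution - 2)
        w = pos - i0  # fractional offsets inside the cell, in [0, 1]
        out = np.zeros((uv.shape[0], self.table.shape[1]))
        for dx in (0, 1):
            for dy in (0, 1):
                idx = self._hash(i0[:, 0] + dx, i0[:, 1] + dy)
                wx = w[:, 0] if dx else 1.0 - w[:, 0]
                wy = w[:, 1] if dy else 1.0 - w[:, 1]
                out += (wx * wy)[:, None] * self.table[idx]
        return out


class TriPlaneHashEncoder:
    """Projects 3D points onto three axis-aligned planes and concatenates features."""

    def __init__(self, table_size=2**14, feat_dim=2, resolution=64):
        self.axes = [(0, 1), (1, 2), (0, 2)]  # XY, YZ, XZ (assumed pairing)
        self.planes = [
            PlaneHashEncoder(table_size, feat_dim, resolution, seed=s)
            for s in range(3)
        ]

    def __call__(self, xyz):
        # xyz: (N, 3) points in the unit cube.
        feats = [enc(xyz[:, list(ax)]) for enc, ax in zip(self.planes, self.axes)]
        return np.concatenate(feats, axis=1)


encoder = TriPlaneHashEncoder()
points = np.random.default_rng(42).random((5, 3))
features = encoder(points)  # shape (5, 6): three planes times feat_dim=2
```

Because each encoder only sees a 2D projection, two points that differ only in z receive identical XY-plane features. This illustrates the general idea of replacing one 3D hash grid with three 2D ones, under the assumptions listed above.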
Journal Title
Information Fusion
Volume
110
Subject
Artificial intelligence
Computer vision and multimedia computation
Data management and data science
Citation
Li, J; Zhang, J; Bai, X; Zheng, J; Zhou, J; Gu, L, ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis, Information Fusion, 2024, 110, pp. 102456