FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

Tianyun Zhong1*‡, Chao Liang2*, Jianwen Jiang2*†, Gaojie Lin2, Jiaqi Yang2, Zhou Zhao1
1Zhejiang University, 2Bytedance
*Equal contribution, †Project lead, ‡Internship at Bytedance

FADA-Balanced (4.17× speedup)

FADA-Fast (12.5× speedup)

Abstract

Diffusion-based audio-driven talking-avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we find that naive diffusion distillation does not yield satisfactory results: distilled models are less robust to open-set input images and show a weaker audio-video correlation than their teachers, undermining the advantages of diffusion models. To address this, we propose FADA (FAst Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first design a mixed-supervised loss that leverages data of varying quality to enhance the model's overall capability and robustness. We further propose multi-CFG distillation with learnable tokens, which exploits the correlation between the audio and reference-image conditions to reduce the threefold inference runs required by multi-CFG at an acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion-based methods while achieving an NFE speedup of 4.17-12.5×.
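To illustrate why multi-CFG triples the per-step cost, the sketch below shows a common two-condition classifier-free-guidance formulation (audio and reference image): each denoising step needs an unconditional pass, a reference-only pass, and a fully conditioned pass. The abstract does not specify FADA's exact combination rule, so the `denoise` interface, the `w_audio`/`w_ref` scales, and the nested-guidance formula here are illustrative assumptions; the distilled student is sketched as a single forward pass that receives the CFG scales (via learnable tokens in FADA) as an extra input.

```python
# Illustrative sketch (not FADA's exact formulation): two-condition
# classifier-free guidance for an audio-driven avatar diffusion model.
# `denoise(x_t, t, audio, ref)` stands for one network evaluation;
# passing None drops that condition (the "null" condition).

def multi_cfg_step(denoise, x_t, t, audio, ref, w_audio=4.0, w_ref=2.0):
    """One denoising step with multi-CFG: three network evaluations."""
    eps_uncond = denoise(x_t, t, audio=None, ref=None)  # 1st pass: unconditional
    eps_ref    = denoise(x_t, t, audio=None, ref=ref)   # 2nd pass: reference image only
    eps_full   = denoise(x_t, t, audio=audio, ref=ref)  # 3rd pass: audio + reference
    # Nested guidance: strengthen the reference cue, then the audio cue on top of it.
    return (eps_uncond
            + w_ref * (eps_ref - eps_uncond)
            + w_audio * (eps_full - eps_ref))

def distilled_step(student, x_t, t, audio, ref, w_audio=4.0, w_ref=2.0):
    """After multi-CFG distillation: a single evaluation, with the CFG
    scales supplied as an extra conditioning input instead of being
    applied through multiple forward passes."""
    return student(x_t, t, audio=audio, ref=ref, cfg_scales=(w_audio, w_ref))
```

Because the student consumes the guidance scales directly, the per-step network cost drops from three evaluations to one, which combines with the reduced number of denoising steps to give the overall NFE speedup reported above.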

Comparison with Recent Methods

Visualizations of CFG Control Abilities with Multi-CFG Distillation

Ethics Concerns

This work is intended for research purposes only. The images and audio clips used in these demos are from public sources. If you have any concerns, please contact us (zhongtianyun@zju.edu.cn) and we will remove the relevant content promptly. The template of this webpage is from VASA-1.