Researchers from Alibaba and Chinese universities unveiled Live Avatar, a 14-billion-parameter diffusion system built to generate audio-driven avatar video in real time at 20 FPS—and sustain it for over 10,000 seconds without identity drift or visual collapse.

  • Real-time streaming performance - The model hits 20 FPS on five NVIDIA H800 GPUs using 4-step sampling and a novel Timestep-forcing Pipeline Parallelism technique that distributes the denoising stages across multiple devices for linear speedup (see the first sketch after this list). The team reports an 84× FPS improvement over the baseline without quantization.

  • Infinite-length stability - Live Avatar uses three mechanisms to prevent the usual long-form breakdown: Rolling RoPE to preserve identity cues, Adaptive Attention Sink to eliminate distribution drift, and History Corrupt to inject noise into the KV-cache so the model extracts motion from history while keeping details stable (see the second sketch after this list). The result is video generation that can run for 10,000+ seconds without quality degradation.

  • Agentic integration - The project page shows demos of Live Avatar combined with Qwen3-Omni, creating fully interactive dialogue agents that can hold real-time conversations. One clip features two autonomous agents talking to each other for over 18 minutes.
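
For intuition, here is a minimal, hypothetical sketch of the pipelining idea from the streaming bullet above: each worker stands in for one GPU and owns a single denoising timestep, so a stream of latent chunks keeps all stages busy at once and steady-state throughput is roughly one chunk per stage latency rather than one chunk per full 4-step pass. The `denoise_step` placeholder, queue sizes, and thread workers are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of pipelining a 4-step diffusion sampler across devices.
# Each worker stands in for one GPU and owns exactly one denoising timestep;
# audio-conditioned latent chunks stream through the stages, so after warm-up
# all stages run concurrently.
import queue
import threading

NUM_STEPS = 4  # 4-step sampling, one step per pipeline stage


def denoise_step(latent, step, audio_feat):
    """Placeholder for one diffusion denoising step (assumption, not the real model)."""
    return f"{latent}|step{step}({audio_feat})"


def stage_worker(step, in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:            # shutdown signal, forward it downstream
            out_q.put(None)
            return
        chunk_id, latent, audio = item
        out_q.put((chunk_id, denoise_step(latent, step, audio), audio))


# Build the pipeline: queue_0 -> stage 0 -> queue_1 -> ... -> stage 3 -> queue_4
queues = [queue.Queue(maxsize=2) for _ in range(NUM_STEPS + 1)]
workers = [
    threading.Thread(target=stage_worker, args=(s, queues[s], queues[s + 1]), daemon=True)
    for s in range(NUM_STEPS)
]
for w in workers:
    w.start()

# Feed a stream of noisy latent chunks (stand-ins for audio-driven video chunks).
for chunk_id in range(6):
    queues[0].put((chunk_id, f"noise{chunk_id}", f"audio{chunk_id}"))
queues[0].put(None)

# Drain fully denoised chunks as they leave the last stage.
while (result := queues[-1].get()) is not None:
    print("decoded chunk", result[0], "->", result[1])
```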

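Similarly, a toy version of the History Corrupt idea from the stability bullet: blend Gaussian noise into the cached key/value states from past chunks before the current chunk attends to them, so attention can still read coarse motion cues from history without copying fine appearance details. The cache layout, `noise_level`, and linear blend here are assumptions for illustration; the paper's actual corruption schedule may differ.

```python
# Hypothetical sketch of a "History Corrupt"-style perturbation on a KV-cache.
import torch


def corrupt_history(kv_cache, noise_level=0.3):
    """Blend Gaussian noise into cached key/value states from past chunks.

    The intent: the model can still extract motion from history while being
    discouraged from relying on it for fine detail, limiting error
    accumulation over very long rollouts.
    """
    corrupted = {}
    for name, tensor in kv_cache.items():
        noise = torch.randn_like(tensor)
        corrupted[name] = (1.0 - noise_level) * tensor + noise_level * noise
    return corrupted


# Toy usage: a cache of keys/values for 16 past frames, 8 heads, 64-dim.
kv_cache = {
    "keys": torch.randn(16, 8, 64),
    "values": torch.randn(16, 8, 64),
}
noisy_cache = corrupt_history(kv_cache, noise_level=0.3)
print({k: v.shape for k, v in noisy_cache.items()})
```
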
The hardware requirement—five H800 GPUs—places this in studio or cloud territory for now, but the system-level innovations preview how future avatar tools could support live hosts, virtual presenters, or AI-driven NPCs in broadcast and virtual production workflows.
