Alibaba has introduced Wan2.2-S2V, an open-source speech-to-video model that creates digital human videos from photos and audio.
The model converts portraits into film-quality avatars that can speak, sing, and perform. It supports multiple framing options, including portrait, bust, and full-body shots, and can animate characters based on text prompts.
Wan2.2-S2V offers 480p and 720p output, making it useful for both social media content and professional presentations. It handles single and multiple characters and supports a range of avatar styles, including cartoon and stylized characters.
The technology combines text-guided global motion control with audio-driven fine-grained movements, enabling lifelike and expressive performances in complex scenarios. A new frame-compression technique also enables stable long-video generation at lower computational cost.
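The announcement does not detail how the frame compression works. As a rough illustration of the general idea, the sketch below pools an arbitrarily long history of per-frame latents down to a fixed number of conditioning tokens, so the cost of attending to past frames stays constant as the video grows. The function name, tensor shapes, and pooling choice are assumptions for illustration only, not the model's actual implementation.

```python
import torch
import torch.nn.functional as F

def compress_motion_frames(frame_latents: torch.Tensor, target_len: int) -> torch.Tensor:
    """Pool a long history of per-frame latents into a fixed number of
    'motion' latents, keeping conditioning cost constant for long videos.

    frame_latents: (num_frames, latent_dim) latents of previously generated frames.
    target_len:    fixed number of compressed latents to keep.
    """
    # (num_frames, dim) -> (1, dim, num_frames) for 1-D pooling over time
    x = frame_latents.T.unsqueeze(0)
    # Adaptive average pooling collapses an arbitrary-length history
    # to exactly target_len time steps
    compressed = F.adaptive_avg_pool1d(x, target_len)
    return compressed.squeeze(0).T  # (target_len, latent_dim)

# Example: 300 frames of history compressed to 16 conditioning latents
history = torch.randn(300, 1024)
motion_latents = compress_motion_frames(history, target_len=16)
print(motion_latents.shape)  # torch.Size([16, 1024])
```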
Trained on a large audio-visual dataset tailored to film and TV production, the model supports both vertical and horizontal formats. It is available for download on Hugging Face, GitHub, and ModelScope. The Wan series has already been downloaded more than 6.9 million times worldwide.
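For readers who want to try the model, the weights can be fetched from Hugging Face with the standard huggingface_hub client. The repository identifier below is an assumption based on the Wan naming convention, so check the official model card for the exact name.

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Repo id assumed from the Wan naming convention; verify it on the
# official Hugging Face model card before running.
local_dir = snapshot_download(repo_id="Wan-AI/Wan2.2-S2V-14B")
print(f"Model files downloaded to: {local_dir}")
```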
