Turn any portrait into a talking video with audio-synchronized lip-sync at 1080p 48fps
Kling Avatar V2 turns a static portrait photo into a talking video driven by an audio file. Upload a face image and an audio clip, and the model generates natural lip movements, facial expressions, and subtle head motion synchronized to the speech. Output is 1080p at 48fps.
Lip movements match the audio precisely, including pauses, emphasis, and natural speech rhythm.
1080p output at 48 frames per second for smooth, natural-looking motion.
Works with realistic photos, cartoon characters, anime faces, and even animal portraits.
Video length automatically matches the audio file duration, so no manual trimming is needed.
Provide a clear, front-facing portrait image. JPG or PNG, max 10MB, minimum 300px. Well-lit photos with a clearly visible face work best.
Add your audio file. MP3, WAV, M4A, or AAC format, maximum 5MB. Clear speech with minimal background noise gives the best lip-sync.
Optionally describe desired head movements, emotions, or camera motion to guide the animation beyond lip-sync.
Choose Standard or Pro mode. The video duration matches your audio length automatically.
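The upload limits in the steps above can be checked locally before submitting anything. The following is a minimal sketch, not part of any official Kling tooling: the function names are hypothetical, "MB" is assumed to mean 1024 × 1024 bytes, and accepting the `.jpeg` extension alongside `.jpg` is an assumption. It does not check the 300px minimum, since that would require decoding the image.

```python
from pathlib import Path

# Limits copied from the steps above. Assumption: 1 MB = 1024 * 1024 bytes.
MB = 1024 * 1024
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # ".jpeg" is assumed to be accepted like ".jpg"
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".aac"}
IMAGE_MAX_BYTES = 10 * MB
AUDIO_MAX_BYTES = 5 * MB


def check_upload(name: str, size_bytes: int, kind: str) -> list[str]:
    """Return a list of problems for one file; an empty list means it passes.

    `kind` is either "image" or "audio" (hypothetical labels for this sketch).
    """
    if kind == "image":
        exts, limit = IMAGE_EXTS, IMAGE_MAX_BYTES
    elif kind == "audio":
        exts, limit = AUDIO_EXTS, AUDIO_MAX_BYTES
    else:
        raise ValueError(f"unknown kind: {kind}")

    problems = []
    suffix = Path(name).suffix.lower()
    if suffix not in exts:
        problems.append(f"{kind}: unsupported extension {suffix or '(none)'}")
    if size_bytes > limit:
        problems.append(f"{kind}: {size_bytes} bytes exceeds the {limit}-byte limit")
    return problems


def check_path(path: str, kind: str) -> list[str]:
    """Convenience wrapper that reads the name and size from a file on disk."""
    p = Path(path)
    return check_upload(p.name, p.stat().st_size, kind)
```

For example, `check_path("portrait.png", "image")` returns an empty list when the file is a PNG of 10MB or less, and a list of human-readable problems otherwise.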
Frame-accurate lip movements that follow speech patterns, including consonants, vowels, and pauses.
Subtle head tilts, nods, and movements that match conversational patterns for realistic output.
The model generates appropriate facial expressions based on speech tone and optional prompt guidance.
Supports lip synchronization across multiple languages. Best results with English and Chinese audio.
Animate realistic portraits, illustrated characters, anime faces, 3D renders, and stylized artwork.
Use text prompts to add specific gestures, emotions, or camera movements beyond the audio-driven animation.
Per-second pricing based on audio duration.
Lower-cost option for quick previews and drafts.
Higher-quality output with better facial detail and smoother motion.
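Because pricing is per second of audio, a rough cost estimate is duration multiplied by the rate for the chosen mode. The sketch below illustrates that arithmetic only. The page does not state the actual rates or whether partial seconds are rounded up, so the rates here are placeholders and the function name is hypothetical.

```python
def estimate_cost(audio_seconds: float, rate_per_second: float) -> float:
    """Estimate the charge for one video as duration times a per-second rate.

    Assumption: billing is linear in duration with no rounding or minimum charge.
    """
    if audio_seconds < 0 or rate_per_second < 0:
        raise ValueError("duration and rate must be non-negative")
    return audio_seconds * rate_per_second


# Placeholder rates for illustration only; these are NOT real prices.
PLACEHOLDER_RATES = {"standard": 1.0, "pro": 2.0}

for mode, rate in PLACEHOLDER_RATES.items():
    print(mode, estimate_cost(45.0, rate))
```

Substitute the rates shown on the pricing page for the placeholders to compare Standard and Pro for a given audio length.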
Avatar V2 is designed for turning static portraits into talking videos driven by audio.
Create talking instructor videos from a single photo and voiceover recording for online courses and tutorials.
Produce spokesperson videos for product demos, FAQ responses, and brand messaging without filming.
Turn podcast audio into talking-head video clips for social media promotion and YouTube uploads.
Generate the same spokesperson speaking different languages from translated audio tracks.