
Welcome to VP Land! The week started off with a deluge of new AI model drops...which got dwarfed by the Netflix/WB bombshell yesterday.
Better commentary on that elsewhere - here's what you need to know on AI and creative tech this week:
Kling O1 levels up AI video modification
Alibaba does real-time AI - plus drops Z-Image model
LTX Retake modifies video without regeneration
Runway Gen-4.5 brings better physics

Kling Debuts O1, the "Nano Banana of Video"

Kling officially launched Kling O1, calling it "the world's first unified multimodal video model" — a single AI engine that combines reference-based generation, keyframe-to-video, text-to-video, video editing, transformations, restyling, and camera extension in one interface.
Instead of bouncing between separate tools for generation and post, Kling O1 lets you throw text, images, reference elements, and video clips into a multimodal prompt area and get edited, stylistically consistent output with 3-10 second duration control. Some users are already dubbing it the "Nano Banana of video."
Unified multimodal prompting — Upload images, video, reference subjects, or text into one input box; the model treats everything as a prompt. This covers reference-to-video, image-to-video, start/end frames, transformations, video reference for previous/next shots, and text-to-video.
Text-driven editing without masks — Type prompts like "remove bystanders," "change daylight to dusk," or "swap the main character's outfit" and O1 performs "pixel-level semantic reconstruction" without manual masking or keyframing. Kling positions this as turning post-production editing into a conversation.
Character and product consistency — Kling O1 emphasizes "all-in-one reference" to maintain character, prop, and scene identity across shots.
Combined skills in one generation — The model supports stacking tasks in a single prompt, like adding a subject while modifying the background or changing style while using reference elements, instead of serial tool-switching (see the sketch below for what a stacked request might look like).
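To make the "everything in one prompt" idea concrete, here's a purely hypothetical sketch of a stacked request. This is not Kling's published API — the field names (references, edits, duration_seconds) and the submit_job helper are invented for illustration; the point is that one multimodal payload carries references, a source clip, edit instructions, and duration control instead of chaining separate tools.

# Hypothetical illustration only, NOT Kling's API.
from pprint import pprint

job = {
    "prompt": "The hero walks through a rain-soaked neon street at dusk",
    "references": [
        {"type": "image", "uri": "hero_front.png", "role": "character"},
        {"type": "image", "uri": "jacket_ref.png", "role": "prop"},
    ],
    "source_video": {"uri": "previous_shot.mp4", "role": "previous_shot"},
    "edits": [
        "remove bystanders",
        "change daylight to dusk",
    ],
    "duration_seconds": 8,  # within the reported 3-10 second range
}

def submit_job(payload: dict) -> None:
    """Stand-in for whatever client call a real integration would make."""
    pprint(payload)

submit_job(job)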
Also this week: Kling rolled out Kling 2.6 with native audio generation — the model now produces synchronized voice, action sound effects, and environmental ambience matched to visual motion in a single pass, covering dialogue, narration, singing, multi-character conversations, and scene audio.
SPONSOR MESSAGE
Find customers on Roku this holiday season
Now through the end of the year is prime streaming time on Roku, with viewers spending 3.5 hours each day streaming content and shopping online. Roku Ads Manager simplifies campaign setup, lets you segment audiences, and provides real-time reporting. And, you can test creative variants and run shoppable ads to drive purchases directly on-screen.
Bonus: we’re gifting you $5K in ad credits when you spend your first $5K on Roku Ads Manager. Just sign up and use code GET5K. Terms apply.

Alibaba Demos 20 FPS Real-Time AI Avatars

Researchers from Alibaba and Chinese universities unveiled Live Avatar, a 14-billion-parameter diffusion system built to generate audio-driven avatar video in real time at 20 FPS—and sustain it for over 10,000 seconds without identity drift or visual collapse.
Real-time streaming performance - The model hits 20 FPS on five NVIDIA H800 GPUs using 4-step sampling and a novel Timestep-forcing Pipeline Parallelism technique that distributes denoising stages across multiple devices for linear speedup (a conceptual sketch follows this section). The team reports an 84× FPS improvement over baseline without quantization.
Infinite-length stability - Live Avatar uses three mechanisms to prevent the usual long-form breakdown: Rolling RoPE to preserve identity cues, Adaptive Attention Sink to eliminate distribution drift, and History Corrupt to inject noise into the KV-cache so the model extracts motion from history while keeping details stable. The result is video generation that can run 10,000+ seconds without quality degradation.
Agentic integration - The project page shows demos of Live Avatar combined with Qwen3-Omni, creating fully interactive dialogue agents that can hold real-time conversations. One clip features two autonomous agents talking to each other for over 18 minutes.
The hardware requirement—five H800 GPUs—places this in studio or cloud territory for now, but the system-level innovations preview how future avatar tools could support live hosts, virtual presenters, or AI-driven NPCs in broadcast and virtual production workflows.
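For readers who want intuition for what "timestep-forcing pipeline parallelism" buys, below is a minimal, purely illustrative Python/PyTorch sketch of the scheduling idea: each of the four sampling steps is pinned to its own device so successive frames stream through them. Every name here (denoise_step, pipeline_generate, the toy denoiser) is invented for illustration — it is not Alibaba's implementation and it omits the Rolling RoPE, attention-sink, and KV-cache mechanisms.

import torch

# Conceptual sketch only: the step-to-device assignment below is an assumption
# based on the reported "4-step sampling" spread across devices.
N_STEPS = 4
if torch.cuda.is_available() and torch.cuda.device_count() >= N_STEPS:
    DEVICES = [torch.device(f"cuda:{i}") for i in range(N_STEPS)]
else:
    DEVICES = [torch.device("cpu")] * N_STEPS  # fallback so the sketch runs anywhere

def denoise_step(latent: torch.Tensor, step: int) -> torch.Tensor:
    # Stand-in for one diffusion sampling step of the real avatar model.
    return latent * 0.9 + 0.1 * torch.randn_like(latent)

def pipeline_generate(frame_latents):
    # Each loop iteration is one pipeline "tick": every in-flight frame advances
    # by one denoising step on its assigned device. The steps here run one after
    # another, so this only shows the scheduling; a real pipeline overlaps them
    # on separate GPUs/streams, which is where the near-linear speedup comes from.
    in_flight = []  # list of (latent, next_step_index)
    stream = iter(frame_latents)
    while True:
        nxt = next(stream, None)
        if nxt is not None:
            in_flight.append((nxt, 0))
        if not in_flight:
            break
        advanced = []
        for latent, step in in_flight:
            latent = denoise_step(latent.to(DEVICES[step]), step)
            if step + 1 == N_STEPS:
                yield latent.cpu()  # fully denoised frame latent, ready to decode
            else:
                advanced.append((latent, step + 1))
        in_flight = advanced

# Example: push 8 toy frame latents through the pipeline.
frames = (torch.randn(4, 32, 32) for _ in range(8))
for out in pipeline_generate(frames):
    pass  # a real system would decode `out` to pixels with the model's VAE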

Runway Gen-4.5, ByteDance Seedream 4.5, LTX Retake, Alibaba Z-Image

Runway released Gen-4.5, which claims the #1 spot on the Artificial Analysis Text to Video benchmark with 1,247 Elo points. The model improves physical accuracy with realistic object weight and momentum, maintains temporal consistency for details like hair strands across frames, and produces expressive characters with nuanced facial expressions. It matches Gen-4's speed and pricing.
ByteDance launched Seedream 4.5 with up to 4K output and multi-reference fusion supporting 2-10 images for consistent multi-panel generation. The update improves text rendering for posters and logos, and enhances editing consistency so changes to lighting, objects, or characters retain the original's fine detail and color tone.
LTX Studio debuted Retake, a tool that regenerates specific 2-16 second segments within a video without recreating the entire shot. Users can rephrase dialogue, redirect emotional beats, or explore alternate camera movements while the AI maintains seamless blending with surrounding frames. It outputs at 1080p and costs $0.10 per second.
Alibaba open-sourced Z-Image, a 6-billion parameter image generator released under Apache 2.0 with 16GB consumer GPU compatibility. The Z-Image-Turbo variant achieves sub-second inference with only 8 function evaluations per image. It includes bilingual text rendering (English and Chinese) and a Z-Image-Edit variant for instruction-following edits.

Addy walks through his step-by-step workflow for crafting a cinematic Western sequence with AI tools—capturing human performance, animating with Nano Banana Pro and Wan 2.2 Animate, and polishing the final output in DaVinci Resolve.

Stories, projects, and links that caught our attention from around the web:
🎨 fal drops LoRA Gallery with 9 specialized Qwen models for precise image editing like outfit changes and portrait retouching
📉 UK's high-end TV tax credit stays flat at 25% after producers pushed for 40% upgrade
✨ OTOY launched OctaneRender 2026, the first commercial path tracer to natively relight Gaussian splats with full spectral lighting
🎯 TwelveLabs releases Marengo 3.0 with 2x faster indexing and 50% smaller embeddings for production-scale video search
⚪ Pantone picks white "Cloud Dancer" as 2026 Color of the Year

Addy and Joey break down Kling O1’s edge in video modification, Z-Image’s rise as an open-source challenger, Seedream 4.5’s batch magic, and how the latest drops from Runway Gen-4.5, LTX Retake, FLUX.2, and TwelveLabs’ Marengo 3.0 stack up.
Read the show notes or watch the full episode.
Watch/Listen & Subscribe

👔 Open Job Posts
Virtual Production (VP) Supervisor/Specialist - FT
Public Strategies
Oklahoma City, OK

📆 Upcoming Events
April 18-22, 2026
NAB Show
Las Vegas, NV
July 15-18, 2026
AWE USA 2026
Long Beach, CA
View the full event calendar and submit your own events here.


Thanks for reading VP Land!
Have a link to share or a story idea? Send it here.
Interested in reaching media industry professionals? Advertise with us.



