AI video company Pika Labs has launched the beta of PikaStream1.0, a real-time generative video model that gives AI agents a face and a voice during live video calls. The company is calling it "the first video chat skill for any agent." It is a streaming video model that renders an agent's visual presence, expressions, and speech in real time, maintaining memory and personality across the duration of a call.
The immediate implication is straightforward: any AI agent can now conduct a live video conversation with a human user. The agent sees, listens, responds, and adapts visually at conversational speed. Combined with Pika AI Self, agents can also execute tasks during the call — pulling data, triggering workflows, running tools — all while maintaining the video interaction. Users can try it now by inviting their Pika AI Self to a Google Meet.
What PikaStream1.0 Actually Does
PikaStream1.0 is optimized for low-latency, continuous output. Unlike Pika's existing tools, which generate short clips from text prompts, this model runs as a persistent stream. According to Pika's technical breakdown, the 9-billion-parameter Diffusion Transformer (DiT) generates personalized video at 30 frames per second and 480p resolution on a single NVIDIA H100 GPU.
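To put the 30 fps figure in context, sustaining real-time output leaves an average budget of roughly 33 milliseconds per frame for generation and decoding combined. Pika has not published per-frame numbers; the sketch below is just the arithmetic implied by the frame rate and the reported latency.

```python
fps = 30
per_frame_budget_ms = 1000 / fps    # ~33.3 ms on average to generate and decode each frame
latency_s = 1.5                     # reported end-to-end speech-to-video delay
frames_elapsed = latency_s * fps    # ~45 frame intervals pass before the first response frame appears
print(per_frame_budget_ms, frames_elapsed)
```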
The end-to-end speech-to-video latency is approximately 1.5 seconds. Pika achieves this by running speech recognition, LLM reasoning, and text-to-speech concurrently, with video generation beginning as soon as the first audio chunk is ready. A custom Transformer-based component called FlashVAE reconstructs the video in real time.
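Pika has not published a client API for this pipeline, but the concurrency model it describes can be sketched with chained async generators: each stage starts consuming output from the previous one as soon as the first chunk arrives, and video rendering begins on the first audio chunk rather than after the full reply. The stage names, timings, and chunking below are hypothetical placeholders, not Pika's actual interfaces.

```python
import asyncio

# Hypothetical stubs for the four stages; names and sleep times are illustrative only.
async def asr_stream(mic_audio):
    for word in mic_audio.split():            # emit partial transcripts as speech arrives
        await asyncio.sleep(0.05)
        yield word

async def llm_stream(transcripts):
    async for fragment in transcripts:        # begin reasoning on partial input
        await asyncio.sleep(0.05)
        yield f"reply-token({fragment})"

async def tts_stream(tokens):
    async for token in tokens:                # synthesize an audio chunk per token
        await asyncio.sleep(0.05)
        yield f"audio-chunk[{token}]"

async def video_stream(audio_chunks):
    # Rendering starts on the first audio chunk, not after the whole reply is synthesized.
    async for chunk in audio_chunks:
        await asyncio.sleep(0.03)             # stand-in for DiT generation + decoding per chunk
        print(f"rendered frames for {chunk}")

async def handle_turn(mic_audio):
    await video_stream(tts_stream(llm_stream(asr_stream(mic_audio))))

asyncio.run(handle_turn("how does the new model work"))
```

The point of the chaining is that no stage waits for its predecessor to finish the whole turn, which is how the reported 1.5-second figure becomes plausible despite four sequential-sounding steps.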
Pika is distributing this as a modular skill on GitHub, meaning developers can attach it to existing agent architectures. The agent handles the reasoning, tool use, and memory; PikaStream1.0 handles the face and the gestures.
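Pika has not documented the skill interface, but the division of labor it describes could look roughly like the sketch below: the agent owns reasoning, tools, and memory, and the video skill only receives text plus an expression hint to render. Class and method names here are hypothetical, chosen only to illustrate the decoupling.

```python
from dataclasses import dataclass, field

@dataclass
class VideoPresenceSkill:
    """Hypothetical face-and-voice layer: renders whatever the agent decides to say."""
    persona: str

    def speak(self, text: str, expression: str = "neutral") -> None:
        # A real skill would stream frames into the call; this stub just logs the intent.
        print(f"[{self.persona} | {expression}] {text}")

@dataclass
class Agent:
    """The 'brain': reasoning, tool use, and memory live here, not in the video skill."""
    face: VideoPresenceSkill
    memory: list = field(default_factory=list)

    def handle(self, user_utterance: str) -> None:
        self.memory.append(user_utterance)
        reply = f"Noted. You have mentioned {len(self.memory)} thing(s) so far."
        self.face.speak(reply, expression="attentive")

agent = Agent(face=VideoPresenceSkill(persona="Pika AI Self"))
agent.handle("Can you pull last quarter's numbers during our call?")
```

Because the skill only consumes the agent's output, swapping in a different reasoning engine would not require changes to the presence layer, which is the architectural point Pika is making.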
The Interaction Paradigm Shift
For the past several years, AI tools have lived behind text boxes, command lines, and chat interfaces. Voice assistants added a layer but remained disembodied. PikaStream1.0 introduces a visual presence layer that operates at conversational speed.
This matters because human communication is heavily visual. Facial expressions, eye contact, and gestural cues carry information that text and voice alone cannot. An AI agent that can leverage these channels, even synthetically, changes the bandwidth of human-AI interaction.
There are obvious questions about uncanny valley effects, latency under load, and how well personality coherence holds over extended sessions. The 1.5-second latency is still noticeable compared to human conversation, but it is a dramatic improvement over Pika's previous-generation model, which required eight GPUs and took 4.5 seconds to respond.
What to Watch
The competitive landscape will shift quickly if PikaStream1.0 delivers on its beta promises. Google, OpenAI, and a dozen startups are all working on real-time multimodal interaction, but Pika's framing as a pluggable skill for any agent is a distinctive architectural choice. It decouples the face from the brain, letting developers choose their reasoning engine and bolt on a visual presence layer.
The near-term question is integration. The technology to put an AI collaborator on a video call now exists as an open-source component. The question is who builds the most effective workflows around it.


