# VideoAgent: Self-Improving Video World Models for Embodied Planning

**Published:** Nov 2025 | **Author:** Abhranil Chandra

Generative video models are getting remarkably good at simulating physics and object interactions. But can we use them as **world models** that robots plan with? In our recent paper *"VideoAgent: Self-Improving Video World Models for Embodied Planning"* (TMLR 2025 / RLBrew @ RLC), we explore this question. We found that while models like Sora or smaller video diffusion transformers are impressive, they suffer from **compounding hallucinations** when used for long-horizon planning.

## The Problem: Drift in Imagination

When a robot tries to "imagine" a sequence of 20 actions to pick up a cup, small errors in frame 1 accumulate. By frame 10, the cup might have teleported or the gripper might have melted. This makes open-loop planning with video generation unreliable.

## Our Solution: Self-Conditioning & VLM Feedback

We introduce **VideoAgent**, a framework that improves the consistency of these video world models without requiring millions of new expert demonstrations.

1. **Self-Conditioning:** We iteratively refine the generated video plan. We treat the generated video as a "draft" and condition the model on its own consistent frames to denoise the inconsistent ones.
2. **VLM Feedback:** We use a vision-language model (such as GPT-4V or Gemini) as a "verifier". The VLM watches the generated video plan and scores it for physics violations and task completion. This score guides the search for the best plan.

(Minimal pseudocode sketches of both steps are included at the end of this post.)

> "By closing the loop with VLM feedback, we turn a generative video model into a robust planner, achieving a 4x improvement in success rates on simulated manipulation tasks."

## Results

We tested this on both simulated environments (Meta-World) and real-world robotic manipulation videos. The agent was able to plan complex multi-stage tasks (like "open drawer" -> "place apple") by visually imagining the outcome and refining it until it looked physically plausible.

This work suggests that the path to general-purpose robots may lie not just in better motor-control policies, but in better *visual imagination* that lets them "think" before they act.

[Read the full paper here](https://openreview.net/pdf?id=GDd5H92egZ)
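
## Appendix: What the Loop Looks Like in Code

To make the self-conditioning step concrete, here is a minimal sketch in Python. The `video_model.generate` interface, its `condition_frames` / `frame_mask` arguments, and the pixel-difference consistency check are illustrative assumptions for this post, not the released VideoAgent code.

```python
import torch


def consistency_mask(plan: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Flag frames whose change from the previous frame is implausibly large.

    A crude stand-in for a consistency criterion: True = keep the frame
    as conditioning, False = regenerate it.
    """
    diffs = (plan[1:] - plan[:-1]).abs().mean(dim=(1, 2, 3))  # per-frame change
    keep = torch.ones(plan.shape[0], dtype=torch.bool)
    keep[1:] = diffs < threshold
    return keep


def self_conditioned_refine(video_model, first_frame, goal_text, num_iters: int = 3):
    """Iteratively re-denoise a generated video plan, conditioning the model
    on its own consistent frames so only the drifting frames are regenerated."""
    # Draft plan: a (T, C, H, W) tensor of imagined frames.
    plan = video_model.generate(first_frame, goal_text)

    for _ in range(num_iters):
        keep = consistency_mask(plan)
        # Re-run sampling, treating the kept frames as conditioning.
        plan = video_model.generate(
            first_frame, goal_text, condition_frames=plan, frame_mask=keep
        )
    return plan
```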
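
The VLM-feedback step can then be read as a best-of-N search over refined plans. This sketch builds on the one above and assumes a hypothetical `vlm.score_video` call and verifier prompt; the actual prompt and scoring used in the paper differ.

```python
VERIFIER_PROMPT = (
    "You are judging a robot's imagined video plan for the task: '{task}'. "
    "Rate from 0 to 10 how physically plausible the video is and whether the "
    "task is completed. Reply with a single number."
)


def plan_with_vlm_feedback(video_model, vlm, first_frame, task, num_candidates: int = 8):
    """Generate several refined video plans and keep the one the VLM verifier
    scores highest for physical plausibility and task completion."""
    best_plan, best_score = None, float("-inf")
    for _ in range(num_candidates):
        plan = self_conditioned_refine(video_model, first_frame, task)
        score = vlm.score_video(plan, VERIFIER_PROMPT.format(task=task))
        if score > best_score:
            best_plan, best_score = plan, score
    # The winning plan is then turned into robot actions, e.g. via a learned
    # inverse-dynamics model (not shown here).
    return best_plan
```

The appeal of this design is that the verifier only has to judge finished rollouts for physics violations and task completion; its score steers the search for a good plan without requiring any new expert demonstrations.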