How We Are Building Humanoid AI Skills with DreamControl

September 18, 2025

Jonathan Huang

You've probably been amazed by recent humanoid robot demos online – robots dancing, doing kung-fu, or playing soccer. It's easy to think these robots can now do almost anything. After all, humans master basic tasks before moving on to complex skills like kung-fu. But robots don't learn like us. In some ways, performing a pre-programmed dance is actually much simpler for a robot than something seemingly basic, like opening a door or picking up a heavy box! If we want to move beyond impressive but narrow demonstrations and truly build universal assistants for demanding and dangerous tasks, robots first need to master the basics.

Today, we're excited to introduce DreamControl, a workflow for building core skills for humanoid robots. 

Building a Library of Humanoid AI Skills at General Robotics

At General Robotics, our mission is to create general-purpose intelligence for every robot. For humanoids, this means developing a comprehensive library of fundamental AI skills that enable them to be useful, safe, and interact naturally with their surroundings.

The emphasis here is on interaction with the environment. It's one thing for a robot to flawlessly execute a dance routine in open space, but quite another to bend down and open a lower drawer in your kitchen.

This seemingly simple act requires complex coordination: using perception to understand the pose of the drawer handle, bending or squatting down, precisely moving the arm towards the handle, and closing the grippers at just the right moment, all while maintaining balance.

Such tasks, which involve coordinating the lower and upper body simultaneously – what we sometimes call "whole-body skills" – are mastered by human toddlers, yet they remain particularly challenging for robots. However, mastery of these skills is crucial if we want to fully leverage the humanoid form's mobility and extensive range of motion.

Why are Whole-Body Skills so Difficult for Robots?

The core difficulty lies in solving two distinct problems at once: short-term control (like shifting your weight to stay balanced while moving your arm) and long-term planning for movement (coordinating your arms to reach a target and closing your hands at the perfect time). While Reinforcement Learning (RL) is often used for the first problem, it's notoriously hard to apply effectively to the second. That's typically where large datasets and methods like diffusion or flow matching policies, learned through behavior cloning, come in.
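
To make the behavior-cloning side concrete, here is a minimal sketch of one training step for a diffusion policy in PyTorch: the network learns to predict the noise that was added to demonstrated action sequences, conditioned on the robot's observation. The class name `DenoiseNet`, the dimensions, and the noise schedule are illustrative assumptions, not details of our system.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: observation size, action size, action horizon,
# and number of diffusion steps. None of these come from the paper.
OBS_DIM, ACT_DIM, HORIZON, T = 64, 12, 16, 100

class DenoiseNet(nn.Module):
    """Predicts the noise added to an action chunk, given obs and timestep."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM * HORIZON + 1, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM * HORIZON))

    def forward(self, obs, noisy_actions, t):
        x = torch.cat([obs, noisy_actions.flatten(1),
                       t.float().unsqueeze(1) / T], dim=1)
        return self.net(x).view(-1, HORIZON, ACT_DIM)

model = DenoiseNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def bc_step(obs, actions):
    """One behavior-cloning step: noise the demo actions, predict the noise."""
    t = torch.randint(0, T, (obs.shape[0],))
    eps = torch.randn_like(actions)
    ab = alpha_bar[t].view(-1, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps
    loss = ((model(obs, noisy, t) - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One gradient step on random stand-in data (batch of 8 demonstrations).
print(bc_step(torch.randn(8, OBS_DIM), torch.randn(8, HORIZON, ACT_DIM)))
```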

While we could theoretically collect data for these whole-body tasks by manually controlling humanoids, this approach is incredibly expensive. Unlike the vast amounts of text available for training large language models (LLMs), there's nothing comparable in robotics – a challenge Ken Goldberg calls the "100,000-year data gap."

Introducing DreamControl: A Novel Workflow for Training Humanoid Skills

Today, we're excited to introduce DreamControl, our novel workflow for building whole-body skills for humanoids. DreamControl combines the strengths of diffusion models and reinforcement learning through a three-stage process:

Figure: DreamControl's three-stage process, combining the strengths of diffusion models and reinforcement learning

  1. Guided Motion Generation: We begin by sampling realistic movements from a diffusion model trained on human motion. As inputs to this diffusion model, we provide a task description in text, the robot's initial position, and the object it needs to interact with. The output, a human-like movement consistent with both the task at hand and the provided spatial guidance, is then adapted to the robot's physical form (a step called retargeting); a sampling sketch follows this list.
  2. Simulation-Based Learning: Next, we train the humanoid to imitate these sampled movements while obeying physics, using GRID's integration of NVIDIA Isaac Sim, a high-fidelity physics simulator. The control policy is a lightweight neural network with fewer than one million parameters. Simulation enables us to model environments at scale, capturing realistic physics and interactions across thousands of scenarios and compressing decades of practice into accelerated GPU training. The robot is rewarded both for following the Stage 1 reference motions and for completing task goals, such as lifting an object to a specified height; a reward sketch also follows this list.
  3. Real-World Transfer (Sim2Real): Finally, we bridge the gap from simulation to the real world. We train a version of our policy in which the movements from Stage 1 are experienced only implicitly, through the reward function (this is where the "Dream" in DreamControl comes from!); the reward sketch below illustrates this coupling. We also integrate readily available vision AI models hosted on our robot intelligence platform, GRID, for flexible perception capabilities. Before deploying, we test extensively in simulation (both in Isaac Sim and in MuJoCo).
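
To give a flavor of Stage 1, here is a hedged sketch of conditional reverse-diffusion sampling: starting from pure noise, a trained denoiser (such as the one sketched earlier) is applied step by step, conditioned on the task context (text embedding, initial robot pose, object pose). The signature, shapes, and noise schedule are assumptions for illustration, not our production sampler.

```python
import torch

@torch.no_grad()
def sample_motion(denoiser, cond, horizon=16, act_dim=12, T=100):
    """Reverse-diffusion sampling of a motion clip, conditioned on task context.

    `cond` stands in for the conditioning inputs (text embedding, initial
    robot pose, object pose); `denoiser` is a trained noise-prediction
    network such as the DenoiseNet sketched earlier. All shapes and the
    schedule here are illustrative assumptions.
    """
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, horizon, act_dim)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(cond, x, torch.tensor([t]))
        # Standard DDPM posterior mean; noise is re-added except at step 0.
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x  # human-like trajectory, retargeted to the robot's body afterwards
```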
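
And here is a toy illustration of the reward idea behind Stages 2 and 3: a tracking term keeps the policy near the Stage 1 reference motion while a task term rewards goal progress, and in Stage 3 the reference is felt only through this reward. The function name, weights, and exponential kernel are assumptions for this sketch.

```python
import numpy as np

def tracking_plus_task_reward(qpos, ref_qpos, obj_height, target_height,
                              w_track=0.7, w_task=0.3):
    """Toy reward: stay close to the Stage 1 reference, make task progress.

    In the Sim2Real stage the policy never observes `ref_qpos` directly; the
    reference motion shapes behavior only through this reward signal. The
    weights and the exponential tracking kernel are assumptions.
    """
    track = np.exp(-np.sum((qpos - ref_qpos) ** 2))        # imitation term
    task = np.clip(obj_height / target_height, 0.0, 1.0)   # goal progress
    return w_track * track + w_task * task

# Example with stand-in values: 23 joints, object lifted to 0.4 of 0.8 m.
print(tracking_plus_task_reward(np.zeros(23), np.full(23, 0.01), 0.4, 0.8))
```

In practice, a reward like this is evaluated inside the simulator's step loop across the thousands of parallel scenarios mentioned above.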

These promising results are only the first step. We plan to expand this work significantly: scaling to multiple humanoid form factors and to thousands of skills, and ultimately combining those skills to create even more complex, high-level AI capabilities. We invite you to join us on this exciting journey as General Robotics continues to push the boundaries of humanoid robotics.

For those interested in the deeper technical details, please refer to our full paper, available via the arXiv and Website links.

We extend our gratitude to our dedicated core contributors (including collaborators from Berkeley and Brown): Dvij Kalaria, Sudarshan Harithas, Pushkal Katara, Sangkyung Kwak, Sarthak Bhagat, Shankar Sastry, Srinath Sridhar, Sai Vemprala, Ashish Kapoor, Jonathan Huang. We are also grateful to Yasvi Patel, Srujan Deolasee, Brandon Rishi, Dinesh Narayanan, Viswesh Nagaswamy Rajesh, and Geordie Moffatt for their support.