Beyond the Black Box: How Agentic Architectures Unlock Truly General Robotics

September 30, 2025

Sai Vemprala

For decades, the pursuit of general-purpose robotics has swung between brittle pipelines and large monolithic foundation models. Hand-engineered pipelines deliver reliable performance in narrow domains but are difficult to adapt or reuse. End-to-end models promise breadth, but require vast data, offer little transparency, and struggle to transfer. Both approaches treat intelligence as either fixed logic or an opaque block, which makes extension challenging.

Rather than relying solely on larger models or broader demonstrations, progress in robotics may benefit from agentic systems: architectures that frame intelligence as the ability to reason about goals, compose modular tools, and adapt through experience. Intelligence should not be a static script or an opaque black box; rather, it should exhibit agency: the capacity to reason, plan, compose tools, and adapt through experience.

This shift has already transformed software, where agents generate code, manage toolchains, and coordinate workflows by invoking the right tools at the right time. Robotics raises the stakes: agents must integrate perception, action, and memory in the open world—demanding vast skill libraries, scalable infrastructure, and standardized deployment. Without these foundations, agentic workflows stop at software and cannot cross into embodied intelligence.

Today, we are unveiling our blueprint for agentic architectures for robotics embodied in GRID.

Our system generalizes across a broad range of robot form factors, creates skills for various use cases, leverages memory to provide contextual grounding, and achieves scalability through a cloud-native design. Collectively, our results demonstrate that agentic workflows offer a practical path to truly general robotics.

<LINK TO HERO VIDEO>

Skills as the building blocks

GRID builds on a large repertoire of robot AI skills—modular units of perception, planning, or control. These skills span a wide spectrum, allowing robots to detect objects, figure out grasp points, identify obstacles, understand scenes, and much more.

In addition to AI skills, we also treat low-level robot capabilities as part of the skill library.

A central design principle of GRID is the unification of such robot skills into consistent primitives per form factor, e.g., for accessing images, sending velocity commands, or executing trajectories. This eliminates SDK fragmentation across robots and provides a clean abstraction layer for reasoning, especially from an agent's point of view. With a unified interface represented in pure Python, agents can rapidly move from simulation to reality, or from one OEM's robot to another, without significant rewrites. This allows GRID to work with any robot form factor with ease.
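
As a rough illustration, a per-form-factor interface of this kind might look like the following sketch in Python; the class and method names here are assumptions made for the example, not GRID's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import numpy as np


@dataclass
class Pose:
    position: np.ndarray      # (x, y, z) in meters
    orientation: np.ndarray   # quaternion (x, y, z, w)


class RobotInterface(ABC):
    """Hypothetical unified primitives shared by all robots of one form factor."""

    @abstractmethod
    def get_image(self, camera: str = "front") -> np.ndarray:
        """Return the latest RGB frame from the named camera."""

    @abstractmethod
    def send_velocity(self, linear: np.ndarray, angular: np.ndarray) -> None:
        """Command body-frame linear and angular velocity."""

    @abstractmethod
    def execute_trajectory(self, waypoints: list[Pose]) -> bool:
        """Follow a sequence of poses; return True on success."""


class SimQuadruped(RobotInterface):
    """A simulated robot and a real OEM robot would both expose these same
    primitives, so agent code written against RobotInterface transfers unchanged."""
    ...
```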

Agents for Orchestration

GRID’s agentic layer turns these skills into scalable intelligence. GRID’s agents operate in two complementary modes:

  • In tool-calling mode, GRID directly invokes skills step by step, grounding future decisions on previous outputs and its own commonsense reasoning. This lets the agent adapt dynamically mid-task, monitoring outcomes and deciding on the fly which tool to call next (a minimal sketch follows this list).
  • When well-encapsulated skills are not readily available, GRID operates in code generation mode, where it writes larger programs that create new skills or orchestrate existing ones with glue logic into complex routines.
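
A minimal sketch of what a tool-calling loop could look like, assuming a hypothetical `call_llm` chat wrapper and a registry of skill callables; GRID's internal loop may differ.

```python
import json


def run_tool_calling_agent(task: str, skills: dict, call_llm) -> list[dict]:
    """Invoke skills step by step, feeding each result back into the next decision.

    `skills` maps tool names to callables; `call_llm` is an assumed wrapper that
    returns either {"done": True} or {"tool": name, "arguments": {...}}.
    """
    history = [{"role": "user", "content": task}]
    trace = []
    while True:
        decision = call_llm(history, tools=list(skills))
        if decision.get("done"):
            return trace
        name, args = decision["tool"], decision["arguments"]
        result = skills[name](**args)  # ground the next decision on real output
        trace.append({"tool": name, "args": args, "result": result})
        history.append({"role": "tool", "content": json.dumps({name: str(result)})})
```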

We standardize the interfaces of such skills and make them composition-ready through the Model Context Protocol (MCP).

Our MCP servers expose each skill in a structured, LLM-friendly format. They publish the API signature, usage examples, and operational constraints, so skills can be discovered, validated, and composed with others. This design ensures that agents see skills and robot interfaces as structured, documented tools. Every skill is explicit code, with observable inputs and outputs, reusable across contexts.
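
As an illustration, exposing a single perception skill through the MCP Python SDK might look like the sketch below; the skill name, arguments, and return value are placeholders for the example, not GRID's actual tools.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("grasp-skills")


@mcp.tool()
def detect_grasp_points(image_path: str, object_name: str) -> list[dict]:
    """Return candidate grasp poses for the named object in the given image.

    The signature and docstring become the tool's published schema, so an agent
    can discover the skill, validate its arguments, and compose it with others.
    """
    # Placeholder: a real skill would run a grasp-detection model here.
    return [{"x": 0.42, "y": 0.17, "z": 0.05, "score": 0.91}]


if __name__ == "__main__":
    mcp.run()
```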

The outputs in either case are auditable, debuggable Python programs — far more interpretable than a black-box policy. GRID’s agents also provide step-by-step plans before moving on to code generation, allowing human feedback to guide the process.

For example, when a UR5 arm is asked to pick and place an object, GRID composes a script invoking segmentation, grasping, and motion planning tools in sequence. Once validated, that composition can be stored as a new skill, which can be reused zero-shot in progressively harder contexts, like cleaning up a table, or sorting objects by category.
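
A sketch of the kind of program such a composition could produce; the robot methods and skill names (`segment_object`, `plan_grasp`, `plan_motion`) are illustrative stand-ins rather than GRID's generated code.

```python
def pick_and_place(robot, skills: dict, object_name: str, target_pose) -> None:
    """Generated composition: segment -> grasp -> plan -> place.

    `skills` maps names to MCP-exposed tools (names here are hypothetical), so
    the routine stays plain, inspectable Python rather than an opaque policy.
    """
    image = robot.get_image(camera="wrist")
    mask = skills["segment_object"](image, object_name)
    grasp_pose = skills["plan_grasp"](image, mask)
    robot.execute_trajectory(skills["plan_motion"](robot.current_pose(), grasp_pose))
    robot.close_gripper()
    robot.execute_trajectory(skills["plan_motion"](grasp_pose, target_pose))
    robot.open_gripper()
```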

The skill library is therefore self-expanding: every new composition can be another building block. 

We apply these principles of modularity and extensibility to have GRID play robot chess: combining board state recognition, grasping, motion planning, and reasoning into a closed loop. Unlike foundation models such as vision-language-action models, which take significant effort to extend and fine-tune for a new task, an agentic system can simply encode new tools and compose them into behavior.

One architecture spans every robot

Our agentic framework applies seamlessly across manipulators, quadrupeds, humanoids, wheeled robots, and drones. 

  • On a UR5 arm, a pick-and-place skill was extended into routines like cleaning up an entire table. 
  • A Unitree humanoid followed multi-step instructions such as “Hand a bag of chips to the nearest person”. 
  • A quadruped was directed to search a space until it found a yellow caution sign.
  • A drone was tasked with tracking and following a moving target. 

Across these embodiments, most behaviors emerged zero-shot: the system recombined existing skills in new contexts without any retraining. This illustrates how agentic architectures can generalize across platforms and tasks, providing a pathway toward more unified and extensible robotic solutions. 

Modularity is what makes this generality real.

For cases where a skill might not yet exist, GRID also allows agents to invoke simulations as callable tools. GRID agents demonstrated the creation of a safe navigation routine for a wheeled robot by launching a simulation, proposing code, understanding failures, and refining until collisions stopped. Such a behavior can then be saved and reused across tasks, or refined further for additional operational domains.
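
One way such a propose-test-refine loop could be structured is sketched below; `generate_code`, `revise_code`, and `simulate` are assumed callables standing in for the LLM and the simulation tool, not GRID's actual interfaces.

```python
def refine_navigation_skill(generate_code, revise_code, simulate, max_rounds: int = 10) -> str:
    """Propose code, test it in simulation, and revise until collisions stop.

    `simulate` launches the simulator as a tool and is assumed to return a
    report such as {"collisions": 2, "log": "..."}.
    """
    code = generate_code("Navigate the wheeled robot to the goal without collisions.")
    for _ in range(max_rounds):
        report = simulate(code)
        if report["collisions"] == 0:
            return code  # validated routine; can be stored as a reusable skill
        code = revise_code(code, failure_log=report["log"])
    raise RuntimeError("Could not produce a collision-free routine within budget.")
```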

Robots do not have to be static executors; they should be systems that compose and extend their own intelligence.

In robotics, agents gain a unique advantage through simulation: a sandbox where they can invent, test, and scale new capabilities. Just as swarms of agents are being used to tackle open-ended problems like mathematics by exploring thousands of solution paths in parallel, agents in robotics can leverage simulation to explore and validate new capabilities, making adaptation both scalable and safe.

Elasticity and unification for scale via accessible AI skills

For agents to be effective, they require access to a broad library of perception, planning, and control skills that can be combined, swapped, or updated as tasks evolve. Sequential orchestration and parallel composition are particularly demanding, often requiring multiple models to execute concurrently. To support this, skills are best hosted on remote accelerators—whether in the cloud or in dedicated on-premise clusters—where diversity and scale can be sustained independently of robot hardware. This architecture enables agents to compose models in parallel, swap capabilities seamlessly, and propagate updates instantly across a fleet.

With sufficiently optimized communication protocols, almost all skills can be served remotely at practical latency, leaving only reflexive low-level and safety behaviors on-device. This balance allows intelligence to scale across robots, embodiments, and tasks without being bottlenecked by local hardware.
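
For illustration, a remote skill call with parallel composition and an on-device fallback might look like the following sketch; the endpoint, routes, and payload schema are assumptions, not GRID's protocol.

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def call_skill(endpoint: str, skill: str, payload: dict, timeout_s: float = 0.2) -> dict:
    """Invoke one remotely hosted skill; degrade to a safe local default if it is slow."""
    try:
        resp = requests.post(f"{endpoint}/skills/{skill}", json=payload, timeout=timeout_s)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return {"status": "unavailable", "action": "stop"}  # reflexive on-device behavior


def perceive(endpoint: str, frame_id: str) -> dict:
    """Compose two remotely hosted perception skills in parallel on the same frame."""
    with ThreadPoolExecutor() as pool:
        det = pool.submit(call_skill, endpoint, "object_detection", {"frame": frame_id})
        seg = pool.submit(call_skill, endpoint, "segmentation", {"frame": frame_id})
        return {"detections": det.result(), "masks": seg.result()}
```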

Memory and Human-Like Cognition

Human cognition relies on memory not just to store information, but to encode experiences, compress them into useful abstractions, recall them in context, and reflect on them to improve future behavior. This is what gives our actions continuity and adaptability. For robots to operate in the open world, they need the same: a way to ground reasoning in what they have seen and done, and to refine their competence over time.

GRID provides this continuity through layered memory. 

  • Observational memory records the world in an agent-friendly way as semantic snapshots through textual captions, object-centric image embeddings, and more. Our robots can answer questions with grounded evidence, receive images as input and identify similar locations or objects, or just converse casually about a space. 
  • Operational memory tracks what the agent attempted, which skills were invoked, previous examples, and how execution unfolded, with the potential to convert each run into reusable competence. 
  • Domain memory encodes external knowledge such as environment- or task-specific guidelines. We show an example where our agent was able to parse the FAA Part 107 guidelines provided to it, and then classify drone missions as valid or invalid with cited justification.

The experience this enables is strikingly human-like.

You can ask a robot what it is doing, why it made a decision, or request alternatives, and it can explain transparently. That seamless back-and-forth transforms the robot into a collaborator rather than an inscrutable machine.
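
To make the three layers concrete, a minimal sketch of how such memory might be organized is shown below; the class and field names are illustrative assumptions rather than GRID's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ObservationalMemory:
    """Semantic snapshots of what the robot has seen: captions plus embeddings."""
    captions: list[str] = field(default_factory=list)
    embeddings: list[list[float]] = field(default_factory=list)  # object-centric image embeddings


@dataclass
class OperationalMemory:
    """What the agent attempted: skills invoked, arguments, and outcomes."""
    runs: list[dict] = field(default_factory=list)  # e.g. {"task": ..., "trace": ..., "success": ...}


@dataclass
class DomainMemory:
    """External knowledge such as regulations or task guidelines, stored as retrievable text."""
    documents: dict[str, str] = field(default_factory=dict)  # e.g. {"FAA Part 107": "..."}


@dataclass
class LayeredMemory:
    observational: ObservationalMemory = field(default_factory=ObservationalMemory)
    operational: OperationalMemory = field(default_factory=OperationalMemory)
    domain: DomainMemory = field(default_factory=DomainMemory)
```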

Toward General Robotics

Taken together, these principles—accessible AI skills, agentic composition, and memory-driven cognition—change how we think about building general-purpose robots. We treat intelligence as something that emerges from reasoning, composition, and experience. When robots can draw on a variety of skills, refine or adapt them, and use memory to adjust their behavior, they become far more adaptable and transparent.

The age of the general-purpose robot will be defined by agents that can reason, converse, compose, remember, and learn continuously—and by architectures designed to make those abilities concrete and measurable. 

Read our detailed study in our paper: <LINK>