Robots have come a long way over the past few years—they can perform impressive acrobatic feats, dance on stage, follow language commands and, in some of our own results, perform complex tasks like folding laundry or cleaning off a table. But the biggest challenge in robotics is not in performing feats of agility or dexterity, but generalization: the ability to figure out how to correctly perform even a simple task in a new setting or with new objects. Imagine a robot that needs to clean your home: every home is different, with different objects in different places. Generalization must occur at many levels. At the low level, the robot must understand how to pick up a spoon (by the handle) or plate (by the edge), even if it has not seen these specific spoons or plates before, and even if they are placed in a pile of dirty dishes. At a higher level, the robot must understand the semantics of each task—where to put clothes and shoes (ideally in the laundry hamper or closet, not on the bed), and what kind of tool is appropriate for wiping down a spill. This generalization requires both robust physical skills and a common-sense understanding of the environment, so that the robot can generalize at many levels at the same time, from physical, to visual, to semantic. This is made even harder by the limited availability of diverse data for such robotic systems.
This is why most commercial robots operate in tightly controlled environments like factories or warehouses: in a world where the robot never needs to venture outside of a single building and where the objects and their locations are predetermined, current robotic methods that provide for only weak generalization can be very successful. Even the impressive demonstrations of robotic agility and dexterity that have been shown in recent years are typically designed to work in a specific environment, often with data collected in the test scene or very similar settings. But if we want robots to be part of our everyday lives, working in our homes, grocery stores, offices, hospitals, and other "messy" environments, we need strong generalization.
We have been developing robotic foundation models that can generalize to such messy environments, building on our vision-language-action (VLA) model π0. While π0 and other recent VLAs are typically evaluated in environments that closely match their training data, we've developed a new model, which we call π0.5, that exhibits meaningful generalization to entirely new environments. We believe this represents a significant step toward truly generalizable physical intelligence. Our current model is far from perfect: its goal is not to demonstrate new skills or high dexterity, but to generalize to new settings, such as cleaning up a kitchen or bedroom in a home that was not seen in the training data. In our experiments, π0.5 can perform a variety of tasks in entirely new homes. It does not always succeed on the first try, but it often exhibits a hint of the flexibility and resourcefulness with which a person might approach a new challenge.
The individual tasks that π0.5 performs vary in difficulty, from rearranging objects (e.g., putting dishes in the sink) to much more intricate behaviors, such as using a sponge to wipe down a spill. We show some of the more complex stages of these tasks below, and videos of the long-horizon behaviors appear later in the post.
The main principle behind π0.5 is co-training on heterogeneous data: by training our VLA model on a variety of different data sources, we can teach it not only how to physically perform diverse skills, but also how to understand the semantic context of each skill (e.g., if the task is to clean the kitchen, what are appropriate objects to pick up and put away, and where to put them), infer the high-level structure of a task (e.g., the steps required to make a bed), and even transfer physical behaviors from other robots (e.g., simpler robots that have one arm or no mobile base, or data from robots in less diverse environments).
Co-training is conceptually straightforward: because VLAs are derived from general vision-language models (VLMs), they can be trained on examples that consist of any combination of actions, images, text, and other multimodal annotations such as bounding boxes. This includes general multimodal tasks, such as image captioning, visual question answering, and object detection; robot tasks, such as robot demonstrations with actions; and "high-level" robot examples, consisting of observations labeled with the appropriate semantic behavior (e.g., an observation of an unmade bed with the label "pick up the pillow"). We also include "verbal instruction" demonstrations, where a person coaches the robot through a complex task by telling it what to do step by step in natural language. The model makes both high-level inferences about the next semantic step to perform, analogous to chain-of-thought reasoning, and low-level predictions that output motor commands for the robot's joints:
The π0.5 co-training mixture: multimodal data (verbal instructions, subtask commands, object detection, and multimodal web data) and robot action data (in-the-wild mobile robot, in-the-wild static robot, in-office static robot, and general robot data) are combined to train a cross-modal VLA policy that can be deployed out-of-the-box in new homes.
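To make the co-training recipe concrete, the sketch below shows one way such heterogeneous examples could be represented under a single schema, so that one model can consume web data, "high-level" robot examples, and robot demonstrations alike. The class and field names (and the action dimensions) are illustrative assumptions for exposition, not our actual data format.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class CoTrainingExample:
    """One example in the mixture; fields that don't apply are left empty.

    Web data supervises only text, "high-level" robot examples map an
    observation to the next semantic step, and robot demonstrations
    supervise chunks of low-level actions -- all under one schema.
    """
    images: list[np.ndarray]                     # camera frames or a web image
    prompt: str                                  # task prompt, question, or caption query
    target_text: Optional[str] = None            # answer, caption, boxes, or subtask label
    target_actions: Optional[np.ndarray] = None  # (horizon, action_dim) chunk of joint actions

dummy_image = np.zeros((224, 224, 3), dtype=np.uint8)

# Web visual question answering: image and question in, text out.
vqa = CoTrainingExample(images=[dummy_image],
                        prompt="What object is on the counter?",
                        target_text="a sponge")

# "High-level" robot example: observation labeled with the next semantic step.
high_level = CoTrainingExample(images=[dummy_image],
                               prompt="clean up the bedroom",
                               target_text="pick up the pillow")

# Robot demonstration: observation and subtask labeled with an action chunk.
demo = CoTrainingExample(images=[dummy_image],
                         prompt="pick up the pillow",
                         target_actions=np.zeros((50, 14), dtype=np.float32))
```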
While the basic principles of co-training are not new, training a VLA that can generalize broadly requires the right mixture of co-training tasks. Just as a person needs an appropriate curriculum to learn the conceptual and practical aspects of a new job, VLAs need a "curriculum" provided by the mixture of co-training tasks to enable generalization at all of the necessary levels of abstraction. In our experiments, we trained versions of the π0.5 model that exclude different parts of the full training mixture: the "no WD" version excludes multimodal Web Data (question answering, captioning, and object detection), the "no ME" version excludes Multi-Environment data collected with non-mobile robots (e.g., static robots placed in many other homes), the "no CE" version excludes Cross-Embodiment data collected as part of the original π0 training set, and the "no ME or CE" version excludes both sources of robot data, leaving only the mobile manipulation data collected with the same robots that we use in our experiments (about 400 hours).
We evaluated two experimental conditions: full cleaning tasks, such as putting away dishes in the sink or cleaning up items off the floor of a bedroom, and an out-of-distribution (OOD) evaluation that tasks the robot with moving specific objects indicated in the prompt into a drawer. For both evaluations, we measure the success rate, averaged over individual subtasks (e.g., the percentage of objects that were moved into their proper place), as well as the language following rate, which indicates the fraction of cases where the robot's behavior correctly accords with the user's prompt. In all cases, data from other robots (ME and CE) makes a large difference in policy performance. In the OOD case, we also see a significant benefit from including web data (WD), which greatly improves the robot's ability to correctly identify new object categories that were not in the data. More details on these experiments are included in the accompanying paper.
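As a point of reference for how these two metrics are computed, the sketch below illustrates the arithmetic: both are simple fractions pooled over scored subtasks and commands. The per-episode records here are hypothetical and for illustration only.

```python
# Hypothetical per-episode scoring: each subtask is marked as completed or not,
# and each commanded behavior is marked as following the prompt or not.
episodes = [
    {"subtask_success": [1, 1, 0, 1], "followed_language": [1, 1, 1, 0]},
    {"subtask_success": [0, 1, 1], "followed_language": [1, 0, 1]},
]

def rate(key: str) -> float:
    """Fraction of positive outcomes, pooled over all episodes."""
    outcomes = [score for episode in episodes for score in episode[key]]
    return sum(outcomes) / len(outcomes)

print(f"success rate: {rate('subtask_success'):.2f}")                # 0.71
print(f"language following rate: {rate('followed_language'):.2f}")   # 0.71
```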
To better quantify just how much generalization π0.5 can achieve, we conducted a scaling study where we vary the number of different environments seen in the training data. We also include a baseline model in these comparisons that was trained directly on data from the test environment in addition to all of the other data sources. This model (shown with a horizontal green line) provides a sense for how well a VLA could do in this scene if the challenge of generalizing to new environments is removed.
Evaluating how performance scales with the number of training environments, when co-training with the other datasets in our training mixture. When using all of the available training environments (rightmost point on the graph), our model (yellow) attains performance similar to that of a baseline trained directly on the test environments (green).
These results show not only that the generalization performance of π0.5 steadily increases with the number of distinct environments in the training set, but also that after only about 100 training environments, it approaches the performance of the baseline model that was trained on the test environments directly. This suggests that our recipe can attain effective generalization with relatively accessible amounts of mobile manipulation training data.
π0.5 is based on the π0 VLA, but because it is co-trained on tasks that require outputting a variety of label types, including actions and text, we can use the same model to control the robot at both the high and low level. When we run π0.5, we first ask it to output a "high-level" action expressed in text, and then ask it to follow this high-level action by choosing an appropriate robot motor command, in the form of a 50-step (1-second) "action chunk" of continuous low-level joint actions. This approach follows our recently developed Hi Robot system, except that the same model is used for both the high-level decisions and the low-level motor control, in a kind of "chain of thought" process.
The model itself includes both discrete auto-regressive token decoding and continuous decoding via flow matching, as in π0. The discrete decoding pathway is used for inferring high-level actions, while the continuous flow-matching pathway is used for low-level motor commands, as illustrated in the diagram below.
The π0.5 inference process: a high-level prompt is first decoded into a subtask prediction, and this subtask then serves as the low-level command for a 300M-parameter action expert that outputs continuous actions.
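As a rough sketch of this two-stage inference procedure, the following Python illustrates how the same model could be queried twice per decision: once via discrete decoding to produce a textual subtask, and once via flow matching, integrating a learned velocity field from noise to a 50-step action chunk. The method names, dimensions, number of integration steps, and the stub model are illustrative assumptions, not our actual API.

```python
import numpy as np

HORIZON, ACTION_DIM, FLOW_STEPS = 50, 14, 10  # 1 s of actions; dims and steps are placeholders

def run_pi05_step(model, observation, high_level_prompt: str) -> np.ndarray:
    """One high-level / low-level inference cycle, as described in the text."""
    # 1) Discrete autoregressive decoding: the VLA predicts the next semantic
    #    subtask as text (e.g., "pick up the pillow").
    subtask = model.decode_text(observation, prompt=high_level_prompt)

    # 2) Continuous decoding via flow matching: starting from Gaussian noise,
    #    integrate the model's learned velocity field (here with simple Euler
    #    steps) to obtain a (HORIZON, ACTION_DIM) chunk of joint actions
    #    conditioned on the chosen subtask.
    actions = np.random.randn(HORIZON, ACTION_DIM)
    for step in range(FLOW_STEPS):
        t = step / FLOW_STEPS
        velocity = model.predict_velocity(observation, subtask, actions, t)
        actions = actions + velocity / FLOW_STEPS

    return actions  # executed on the robot before the cycle repeats

class StubModel:
    """Stand-in with the assumed interface, for illustration only."""
    def decode_text(self, observation, prompt):
        return "pick up the pillow"
    def predict_velocity(self, observation, subtask, actions, t):
        return -actions  # toy velocity field that pulls the sample toward zero

chunk = run_pi05_step(StubModel(), observation=None, high_level_prompt="clean up the bedroom")
print(chunk.shape)  # (50, 14)
```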
We evaluated π0.5 by asking it to control mobile manipulators to clean new homes that were never seen in the training data. This is an exceptionally difficult test for a VLA: while there have been impressive demonstrations of VLA generalization, such as following new semantic commands, interactively following human instructions, and chaining together distinct primitive skills, such demonstrations typically take place in the same or very similar environment as the training data. Our recent π0-FAST model was able to generalize to new environments with the DROID setup, but for relatively simple skills like moving individual objects. Our experiments involved placing a robot equipped with π0.5 into an entirely new home and asking it to put away dishes, make the bed, or clean up a bedroom floor. These are long tasks that require not only using complex behaviors (such as using a sponge to clean a spill), but also understanding the semantics of the task and breaking it down into individual parts, with each stage interacting with the correct object. We show example evaluations of π0.5 in the videos below.
Examples of our model completing long-horizon tasks in new kitchens and bedrooms.
All experiments were done in homes that were not in the training data. The policies are reactive, and can handle both variability in the environment and perturbations. In the videos below, we test what happens when people interfere with the robot.
Lastly, the π0.5 model can accept language commands at various levels of granularity, from high-level prompts like "put the dishes in the sink" to detailed individual commands instructing the model to pick up specific objects or move in specific directions. We show some examples of language following in the videos below.
We include detailed videos from our rigorous empirical evaluation below, with examples of successful and failed episodes of our model. Importantly, as with all the videos on this page, none of the scenes in the videos below are from the training data. Complete results from all experiments can be found in the full article.
We showed that VLAs can enable broad generalization even for complex and extended robotic skills, like cleaning a kitchen or bedroom. Our π0.5 model can enable a robot to clean a new home that was never seen in the training data. π0.5 is far from perfect, and it often makes mistakes both in its high-level semantic inferences and in its motor commands. However, by allowing robots to learn from a variety of knowledge sources, we hope that the π0.5 recipe will bring us closer to broadly generalizable and flexible physical intelligence. There is a lot left to do: while our robots can improve from verbal feedback, in the future they could also use their autonomous experience to get better with even less supervision, or explicitly request help or advice in unfamiliar situations. There is also a lot left to do to improve the transfer of knowledge, both in the technical aspects of how the models are structured and in the diversity of data sources that our models can employ.
If you are interested in collaborating, please reach out. We are particularly excited to work with companies that are scaling up data collection with robots deployed in real-world applications and are looking to collaborate on autonomy.
We are also hiring! If you'd be interested in joining us, please get in touch.
For researchers interested in our work, collaborations, or other queries, please write to research@physicalintelligence.company.