



One of the most exciting (and perhaps controversial) phenomena in large language models is emergence. As models and datasets become bigger, some capabilities, such as in-context learning and effective chain-of-thought reasoning, begin to appear only above a particular scale. Among the capabilities that can emerge at scale is the ability to leverage data more effectively, both through compositionality and generalization, and by utilizing other data sources, such as synthetic data produced via RL. As we scale up foundation models, they become generalists that can soak up diverse data sources in ways that smaller models cannot. In this post, we’ll discuss some of our recent results showing that transfer from human videos to robotic tasks emerges in robotic foundation models as we scale up the amount of robot training data. Based on this finding, we developed a method for using egocentric data from humans to improve our models, providing a roughly 2x improvement on tasks where robot data is limited.
Modern vision-language-action (VLA) models like π0.5 can achieve broad open world generalization by training on large and diverse datasets, which include many different robots and data from the web. Do these models acquire emergent capabilities to leverage new data sources as they are scaled up?
We focus specifically on egocentric human videos as one such data source, which can be recorded with wearable cameras. This kind of data is easy to collect, but using it to train robots presents a challenge known as the domain gap: humans and robots look different and move in different ways, and robots can't imitate humans directly. In prior work, using such data has typically required some form of manual alignment, such as masking out parts of the image or even converting human hands into robot hands with generative models. Some methods even change the robot hardware to better align with human motions, for example by using humanoid robots. While these methods can be effective, what they gain in transferability they tend to lose in generality. We wanted to see if simply scaling up robotic foundation models could enable emergent human-to-robot transfer without any explicit transfer learning mechanism.
In our first experiment, we fine-tuned the full π0.5 model with human data. We use a simple human-robot co-fine-tuning recipe, where we treat the human video data like any of our existing robot embodiments, with actions given by 3D hand positions, and without any special transfer learning method. We combine the human data with the most relevant robot data and fine-tune on this mixture, then evaluate the policies on the specific settings demonstrated only in the human data. For example, in the sorting eggs task, the robot data covers placing eggs in cartons, while the human data shows how differently colored eggs should be sorted across multiple cartons, which is the scenario used for evaluation. Similarly, for the dresser-tidying task, the robot data spans diverse bedroom scenes, whereas the human data shows how to arrange the target dresser in its specific scene, with items placed into appropriate containers (e.g., jewelry in the jewelry box, hair ties in the organizer).
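To make the recipe concrete, here is a minimal sketch of the co-fine-tuning mixture, assuming a simple batch sampler over robot and human examples; the class and function names, the `human_fraction` mixing weight, and the `embodiment` tag are illustrative and not taken from the actual π0.5 training code.

```python
# Hypothetical sketch of the human-robot co-fine-tuning mixture; all names are
# illustrative, and the real π0.5 training code is not shown here.
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    images: List          # camera frames for one timestep
    instruction: str      # language command, e.g. "sort the eggs into the cartons"
    actions: List         # robot action targets, or 3D hand positions for human clips
    embodiment: str       # e.g. "robot" or "human_hands"

def make_cotraining_sampler(robot_examples, human_examples, human_fraction=0.3):
    """Sample fine-tuning batches from a fixed robot/human mixture.

    `human_fraction` is an assumed mixing weight, not the value used in our experiments.
    """
    def sample_batch(batch_size):
        batch = []
        for _ in range(batch_size):
            source = human_examples if random.random() < human_fraction else robot_examples
            batch.append(random.choice(source))
        return batch
    return sample_batch
```

The key point the sketch illustrates is that human examples pass through the same training objective as any other embodiment; there is no masking, retargeting, or other explicit alignment step.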
To our surprise, we found that this simple recipe actually works quite well: simply by including the human video data in fine-tuning, the performance of our policy went up by about 2x across a suite of 4 generalization scenarios present only in the human data. This is surprising because we did not include any special mechanism to facilitate transfer. Simply using the pre-trained π0.5 model and co-training on the combined data was enough to enable emergent human-to-robot transfer. We found this same recipe could be extended to even more tasks, from organizing a toolbox to sorting fruit:
But what made this work? Was it really the case that the pre-trained π0.5 was key to enabling this emergent transfer? We wanted to dig into this result more closely and understand how exactly the transfer of knowledge from human data relates to the diversity and scale of robot data used during pre-training of the robotic foundation model. That is, does the ability to effectively learn from human data emerge by scaling up robotic foundation model pre-training?
To answer this question, we measured the robot’s performance on the scenarios demonstrated in the human data, comparing it to a policy fine-tuned without the human data. We see that the improvement in performance from adding human data increases as we scale up pre-training.
One particular instance where this is clearly evident is the sorting eggs task, where the model fine-tuned without human videos stops improving after about 60% of the pre-training dataset, while performance continues to improve with additional pre-training when fine-tuning with human videos. That means that adding more robot data in pre-training actually improves the model’s ability to absorb human data in fine-tuning!
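As a rough illustration of how this comparison can be organized, the sketch below evaluates policies pre-trained on increasing fractions of the robot dataset, fine-tuned with and without human data, and reports the ratio of the two success rates; `evaluate_policy` is a hypothetical helper and the fractions are placeholders, not our experimental settings.

```python
# Hypothetical sketch of the scaling comparison; `evaluate_policy` is assumed to
# return a success rate for a policy pre-trained on `pretrain_fraction` of the
# robot data and then fine-tuned with or without the human videos.
def human_data_gain(evaluate_policy, task, pretrain_fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Ratio of success with vs. without human data at each pre-training scale."""
    gains = {}
    for frac in pretrain_fractions:
        with_human = evaluate_policy(task, pretrain_fraction=frac, use_human_data=True)
        without_human = evaluate_policy(task, pretrain_fraction=frac, use_human_data=False)
        gains[frac] = with_human / max(without_human, 1e-6)
    return gains  # ratios that grow with `frac` indicate emergent transfer
```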
To better understand why this happens, we can examine the representations that the model uses for the images in the human and robot examples. Visualizations of the model features projected onto a 2D plot are shown below. For models with small-scale pre-training, or no pre-training at all (left, middle), we see that robot data and human data are represented with very different features, suggesting that the model has not successfully aligned human and robot examples. But when we scale up the pre-training dataset, the features line up much better, indicating emergent human-robot alignment. Note that pre-training uses only robot data; the human data is seen only in fine-tuning and does not change between the different models. The emergent human-robot alignment stems only from the increase in the quantity and diversity of robot data from other tasks.
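To give a concrete picture of how such a visualization can be produced, here is a minimal sketch, assuming an `encode_image` helper that exposes the model's visual features for a single frame (not a public API) and using PCA for the 2D projection, since the choice of projection is not specified above.

```python
# Sketch of the feature-alignment check; `encode_image` is an assumed helper
# that returns a feature vector for one image, and PCA stands in for whatever
# 2D projection was actually used.
import numpy as np
from sklearn.decomposition import PCA

def project_features(encode_image, robot_frames, human_frames):
    """Project robot and human image features onto a shared 2D plane."""
    robot_feats = np.stack([encode_image(f) for f in robot_frames])
    human_feats = np.stack([encode_image(f) for f in human_frames])
    coords = PCA(n_components=2).fit_transform(
        np.concatenate([robot_feats, human_feats], axis=0))

    robot_2d = coords[: len(robot_feats)]
    human_2d = coords[len(robot_feats):]
    # Overlapping point clouds suggest the model represents human and robot
    # observations similarly; well-separated clusters indicate a remaining domain gap.
    return robot_2d, human_2d
```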
Our finding on the emergence of human-to-robot transfer paints a promising picture for scaling up vision-language-action models. These results suggest that, as with large language models, scaling up VLAs might lead not only to better performance, but also to new capabilities. These capabilities could enable leveraging new, previously hard-to-use data sources and enable more effective transfer across domains, which in turn would allow scaling up robotic foundation models even more. Effectively using human video might represent just one of many such capabilities, and it’s exciting to imagine what new capabilities might be unlocked as we continue to scale up our robotic foundation models.
If you are interested in collaborating, please reach out. We are particularly excited to work with companies that are scaling up data collection with robots deployed in real-world applications and are looking to collaborate on autonomy.
We are also hiring! If you'd be interested in joining us, please get in touch.
For researchers interested in our work, collaborations, or other queries, please write to research@physicalintelligence.company.