Robots have developed the skill to learn by watching videos, study reveals

A new model lets robots quickly mimic human actions. Scientists are using it to train robots to do daily chores the way we do.
Rupendra Brahambhatt
An illustration of a robot doing chores


Are you among those who often dream of a day when a robot will do all the everyday household chores for you? A team of researchers from Carnegie Mellon University (CMU) has figured out how to turn your dream into reality.

In their latest study, they proposed a model that allowed them to train robots to do household tasks by showing them videos of people doing ordinary activities in their homes, such as picking up a phone or opening a drawer.

So far, scientists have been training robots by physically showing them how a task is done or training them for weeks in a simulated environment. Both these methods take a lot of time and resources and often fail. 

The CMU team claims that their proposed model, Vision-Robotics Bridge (VRB), can make a robot learn a task in just 25 minutes, without involving any humans or a simulated environment.

This work could drastically improve the way robots are trained and “could enable robots to learn from the vast amount of internet and YouTube videos available,” said Shikhar Bahl, one of the study authors and a Ph.D. student at CMU’s School of Computer Science.

Robots have learned to watch and learn

VRB is an advanced version of WHIRL (In-the-Wild Human Imitating Robot Learning), a model that researchers used previously to train robots. 

The difference between WHIRL and VRB is that the former requires a human to perform a task in front of a robot in a particular environment. After watching the human, the robot could perform the task in the same environment.

However, in VRB, no human is required, and with some practice, a trainee robot can mimic human operations even in a setting different from that shown in the video. 

The model is built around affordance, a concept that describes the actions an object makes possible. Designers rely on affordances to make products user-friendly and intuitive.

“For VRB, affordances define where and how a robot might interact with an object based on human behavior. For example, as a robot watches a human open a drawer, it identifies the contact points — the handle — and the direction of the drawer's movement — straight out from the starting location. After watching several videos of humans opening drawers, the robot can determine how to open any drawer,” the researchers note.
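To make the idea concrete, here is a minimal Python sketch of an affordance as a contact point plus a direction of movement. It is an illustration only, not the CMU team's model: the `Affordance` class, the `average_affordance` and `waypoints` helpers, and the pixel values are all hypothetical, and simple averaging stands in for the learned model described in the study.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple


@dataclass
class Affordance:
    """Hypothetical container: where to touch and which way to move afterwards."""
    contact: Tuple[float, float]    # contact point (x, y) in image pixels
    direction: Tuple[float, float]  # movement direction (dx, dy) after contact


def average_affordance(observations: List[Affordance]) -> Affordance:
    """Combine affordances seen in several videos by simple averaging.

    Assumption: averaging is a stand-in for illustration; the study
    trains a visual model rather than averaging labels.
    """
    n = len(observations)
    if n == 0:
        raise ValueError("need at least one observation")
    cx = sum(a.contact[0] for a in observations) / n
    cy = sum(a.contact[1] for a in observations) / n
    dx = sum(a.direction[0] for a in observations) / n
    dy = sum(a.direction[1] for a in observations) / n
    norm = hypot(dx, dy)
    if norm == 0:
        raise ValueError("observed directions cancel out")
    # Return a unit-length direction so step sizes are easy to control.
    return Affordance((cx, cy), (dx / norm, dy / norm))


def waypoints(aff: Affordance, step: float, count: int) -> List[Tuple[float, float]]:
    """Points starting at the contact point and moving along the direction."""
    return [
        (aff.contact[0] + aff.direction[0] * step * i,
         aff.contact[1] + aff.direction[1] * step * i)
        for i in range(count + 1)
    ]


# Three made-up "videos" of a drawer handle being pulled straight out.
seen = [
    Affordance((100.0, 50.0), (0.0, 2.0)),
    Affordance((110.0, 50.0), (0.0, 1.0)),
    Affordance((90.0, 50.0), (0.0, 3.0)),
]
drawer = average_affordance(seen)
path = waypoints(drawer, step=10.0, count=3)
```

In this toy setup, `drawer` ends up with the handle as its contact point and a unit vector pointing straight out, and `path` is the short sequence of points a robot gripper could follow from the handle.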

During their study, the researchers first made the robots watch videos from large video datasets such as Ego4D and EPIC-KITCHENS. These extensive datasets were developed to train AI programs to learn human actions.

Then they used affordance to make the robots understand the contact points and steps that make an action complete, and finally, they tested two robot platforms in multiple real-world settings for 200 hours. 

Both robots successfully performed 12 tasks that humans perform almost daily in their homes, such as opening a can of soup, picking up a phone, lifting a lid, opening a door, and pulling out a drawer.

The CMU team wrote in their paper, “Vision-Robotics Bridge (VRB) is a scalable approach for learning useful affordances from passive human video data and deploying them on many different robot learning paradigms.”

In the future, they hope to use VRB to train robots for more complex multi-step tasks.

You can read the study here.

Study Abstract: Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it remains unclear how current models can be used on a robot directly. In this paper, we aim to bridge this gap by leveraging videos of human interactions in an environment-centric manner. Utilizing internet videos of human behavior, we train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show how to seamlessly integrate our affordance model with four robot learning paradigms including offline imitation learning, exploration, goal-conditioned learning, and action parameterization for reinforcement learning. We show the efficacy of our approach, which we call Vision-Robotics Bridge (VRB) as we aim to seamlessly integrate computer vision techniques with robotic manipulation, across 4 real-world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
