Designing robots that function in the real world is a monumental task. It can take dozens or even hundreds of hours to train a robot to accomplish a single action. Stringing tasks together to accomplish anything meaningful—say, doing laundry, or cooking and serving a meal—remains elusive.
A core challenge is the amount of programming and training robots require. “There is an enormous amount of effort involved, and systems tend to be fragile and unadaptable,” said Deepak Pathak, Raj Reddy Assistant Professor in the School of Computer Science at Carnegie Mellon University, and co-founder of robotic intelligence firm Skild AI.
An emerging tool for dealing with this issue? Reinforcement learning (RL), which allows robots to teach themselves through experimentation and trial and error. “It shifts robot locomotion and behavior away from traditional methods that involve detailed strategies in code and empirical tuning,” said Marco da Silva, vice president and general manager for the Spot robot at Boston Dynamics.
As a result, reinforcement learning is taking center stage in robotics, allowing engineers to build better robots faster by letting machines interact with an environment and learn from the outcomes. The approach also comes with concerns, however, including whether these systems can avoid costly training errors that undermine performance and create safety risks.
A New Movement
Researchers have traditionally relied on supervised learning to teach robots new tricks. It allows a robot to accomplish a task through imitation or behavioral cloning. This typically requires labels and cross-validation. “You give the robot both the situation and the correct action,” explained Changliu Liu, an assistant professor at Carnegie Mellon’s Robotics Institute. Yet there’s an enormous downside. “You need lots of human demonstrations, which often come from teleoperation or curated video datasets.”
That’s where reinforcement learning enters the picture. It greatly reduces the need for a human in the loop. Part of the appeal is that RL doesn’t require labeled data or direct human oversight. Instead, it uses a reward signal—positive or negative—to guide behavior over time. “The robot tries something, gets feedback, and adjusts,” Liu said. “It may even discover novel strategies no human has considered.”
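In code, that try-feedback-adjust loop is only a few lines. The following sketch uses tabular Q-learning on a toy Gymnasium environment; the environment, hyperparameters, and episode count are illustrative assumptions, not anything the researchers quoted here actually use.

```python
# Minimal tabular Q-learning: the agent tries an action, gets a reward
# signal, and nudges its value estimate. Environment and hyperparameters
# are placeholders for illustration.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)
q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration rate

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Explore occasionally; otherwise exploit the current best estimate.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        # Adjust the estimate toward the observed reward plus future value.
        q[state, action] += alpha * (reward + gamma * np.max(q[next_state]) - q[state, action])
        state = next_state
        done = terminated or truncated
```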
Pioneers in reinforcement learning include Andrew Barto and Richard Sutton, whose contributions were recognized with the 2024 ACM A.M. Turing Award.
To conduct RL sessions, researchers typically use simulators or digital twins. Physics simulation environments like MuJoCo, PyBullet, and Isaac Sim are used to train robots virtually at massive scale. Running on GPUs, they simulate physical factors including motion across timesteps, friction and contact dynamics, and articulated joint movements. Researchers measure progress by tracking cumulative reward over time: if the number grows, the robot is performing better.
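A minimal evaluation loop shows what tracking cumulative reward looks like in practice. This sketch assumes a MuJoCo-backed Gymnasium environment ("Humanoid-v4") and substitutes random actions where a trained policy would normally go.

```python
# Hypothetical evaluation loop: accumulate per-episode reward in a
# MuJoCo-backed environment (requires the gymnasium[mujoco] extra).
import gymnasium as gym

env = gym.make("Humanoid-v4")
returns = []
for episode in range(10):
    obs, _ = env.reset(seed=episode)
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # a trained policy would act here
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    returns.append(total_reward)

# If the average return climbs as training proceeds, the policy is improving.
print(sum(returns) / len(returns))
```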
Because these tools accurately model complex physics and produce movements that closely mirror the real thing, researchers can significantly compress training cycles. However, skills learned in simulation typically require additional fine-tuning—a process known as sim-to-real transfer—to address gaps and ensure the robot performs reliably in diverse, real-world situations. This blended approach maximizes the likelihood that robots will operate correctly without damaging objects, falling, or endangering humans.
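One widely used way to narrow that sim-to-real gap is domain randomization: varying physical parameters across simulated episodes so the learned policy tolerates mismatch with the real world. The article's sources don't detail their exact methods, so the parameters and ranges below are hypothetical.

```python
# Sketch of domain randomization: sample new physics parameters for each
# simulated training episode. Names and ranges are illustrative only.
import random

def sample_physics_params(rng: random.Random) -> dict:
    return {
        "ground_friction": rng.uniform(0.4, 1.2),
        "payload_mass_kg": rng.uniform(0.0, 5.0),      # unexpected extra load
        "motor_strength_scale": rng.uniform(0.8, 1.1),
        "sensor_noise_std": rng.uniform(0.0, 0.02),
    }

rng = random.Random(0)
params = sample_physics_params(rng)  # applied to the simulator before each episode
```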
The real-world results are impressive. At Oregon State University, Alan Fern, professor and co-director of the Dynamic Robotics and Artificial Intelligence Laboratory, has tapped reward-based learning to teach robots to walk on different terrain, negotiate steps, and handle other motor functions that humans take for granted. With RL, “You give the robot a bigger and more refined reward as it accomplishes tasks,” he said. “Over time and iterations, the neural network gets pushed to improve.”
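In practice, that bigger and more refined reward is usually a shaped reward function. The sketch below is an assumption made for illustration (reward forward progress, penalize wasted energy and falls), not the laboratory's actual reward.

```python
# Illustrative shaped reward for legged locomotion. Terms and weights are
# hypothetical; real reward functions are tuned extensively.
def locomotion_reward(forward_velocity, joint_torques, fell_over):
    reward = 1.0 * forward_velocity                       # progress toward the goal
    reward -= 0.005 * sum(t * t for t in joint_torques)   # discourage wasted effort
    if fell_over:
        reward -= 10.0                                     # large penalty for falling
    return reward
```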
Boston Dynamics is also shifting away from code and supervised learning in favor of RL. Its quadruped robot Spot can “acquire complex behaviors without specific human-provided solutions,” da Silva said. “Through millions of simulations, we fine-tune policies that help the robot navigate diverse terrains and perform tasks with precision.” This includes navigating complex obstacles, wet floors, and extreme environmental conditions.
Learned Behaviors
Self-learning algorithms are already paying dividends. For instance, Boston Dynamics’ Spot robot now works at a brewery in Leuven, Belgium, where it autonomously handles more than 1,800 inspections each week. The robot uses visual and thermal monitoring to identify excessive wear in machines. This allows technicians to conduct repairs proactively, rather than after a critical failure shuts down production.
Boston Dynamics conducted extensive simulations and hardware testing to ensure the robot would function correctly under a wide range of situations and environments, da Silva said. The simulations included disturbances, modeling errors, obstacles, and other variations. “These scenarios become part of our training and evaluation datasets,” he noted. Over time, this has helped Boston Dynamics refine behavior and improve performance.
Reinforcement learning doesn’t replace supervised learning; it complements it, Fern said. It also fits into a broader picture that includes using LLMs as robot brains—a feature that simplifies robot-human communication. “The most effective training approaches now use a hybrid model that starts with imitation learning to build a basic policy. It’s then possible to enhance and fine-tune performance through reinforcement learning and make changes and adaptations on the device,” Fern explained.
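A compressed sketch of that hybrid recipe: pretrain a policy network on demonstration pairs with a supervised behavioral-cloning loss, then fine-tune it with a policy-gradient update. The network size, stand-in data, and REINFORCE-style update are assumptions for illustration, not a specific lab's pipeline.

```python
# Stage 1: behavioral cloning; Stage 2: reinforcement learning fine-tuning.
# All data here is random stand-in data for illustration.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stage 1: imitate (observation, expert action) pairs from demonstrations.
demo_obs, demo_act = torch.randn(256, 8), torch.randn(256, 2)
for _ in range(100):
    loss = nn.functional.mse_loss(policy(demo_obs), demo_act)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: REINFORCE-style fine-tuning on rollouts collected from the robot
# or simulator; obs, actions, and returns would come from those rollouts.
def reinforce_step(obs, actions, returns):
    dist = torch.distributions.Normal(policy(obs), 0.1)
    loss = -(dist.log_prob(actions).sum(-1) * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```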
Robotic self-learning is also effective for accumulating general-purpose, transferable knowledge, Pathak said. Constant reprogramming and code changes aren’t just untenable; they also make it harder to reach a more general state of robot intelligence. “Just like language models predict the next word, you can train a robot to predict the next visual frame, the next state. Over time, this helps build an internal model of how the world works.”
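That analogy can be made concrete with a small dynamics model trained to predict the next state from the current state and action, much as a language model predicts the next token. The dimensions and random training data below are placeholders.

```python
# Sketch of next-state prediction: learn a model of how the world responds
# to actions. Dimensions and data are illustrative stand-ins.
import torch
import torch.nn as nn

state_dim, action_dim = 16, 4
world_model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
    nn.Linear(128, state_dim),
)
opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)

# Stand-in for logged robot experience: (state, action, next_state) tuples.
states, actions = torch.randn(1024, state_dim), torch.randn(1024, action_dim)
next_states = torch.randn(1024, state_dim)

for _ in range(200):
    pred = world_model(torch.cat([states, actions], dim=-1))
    loss = nn.functional.mse_loss(pred, next_states)  # next-state prediction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```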
Despite enormous advances in self-learning algorithms, there’s still a lot of work to do to improve robotic motor control, dexterity, and physical interactions, Liu said. The challenge is particularly formidable for humanoid robots. How soon the field can reach a critical mass of data—similar to what enabled ChatGPT a few years ago—remains to be seen. A core goal for researchers is to develop more robust peer-to-peer knowledge sharing frameworks and deeper integration with LLMs and vision-language models (VLMs).
Even then, humans will likely have to remain in the loop—not just for safety, but also to align objectives with performance. This includes issuing verbal commands and physical corrections, such as praise or even “petting” as reward signals, Liu noted. Critical questions also remain regarding transparency, potential training biases, explainability, and trust. Without adequate human oversight, RL-trained robots could cause unforeseen harm or damage.
Nevertheless, robotics is poised to leap forward as autonomous learning methods advance. “Instead of being able to work in just a few scenarios, robots will be able to adapt to unseen surroundings,” Pathak said. “That being said, general-purpose robots that can do every kind of task that humans can are still far away.”
Samuel Greengard is an author and journalist based in West Linn, OR, USA.