News Release

How can we create intelligent robots like those in science fiction?

The researchers present a survey of Embodied AI focused on achieving robots' behavioral intelligence.

Peer-Reviewed Publication

Journal Center of Harbin Institute of Technology

An overview of Embodied AI


Following the process of robot behavior, the authors categorize Embodied AI into three modules: embodied perception, embodied decision-making, and embodied execution.


Credit: Weinan Zhang, Harbin Institute of Technology

While recent advancements in artificial intelligence (AI) have shown remarkable capabilities in language, vision, and speech processing, these technologies are largely "disembodied." The authors argue this disembodied nature is insufficient for creating the general-purpose intelligent robots often envisioned in science fiction.

 

They illustrate this using the complex instruction: "clean the room." A classic, disembodied AI can process parts of this task—it can interpret the audio (speech), understand the command's meaning (NLP), and detect objects in a static image (CV). However, this passive analysis is where its capabilities end.

 

An embodied agent, by contrast, must solve the entire problem. This begins with Embodied Perception; as the robot moves, it perceives far more information than a static view allows (for instance, finding a toy hidden behind a box). It then uses Embodied Decision-Making, knowing the correct sequence (e.g., throw away trash before arranging toys) and how to handle problems (like searching for that missing item). Finally, it performs Embodied Execution—the physical acts of walking, grasping a bottle, or opening a door.

 

To bridge the gap from passive analysis to behavioral intelligence, a comprehensive new survey from a team of researchers provides a structural framework for the field of Embodied AI. The survey, titled "Embodied AI: A Survey on the Evolution from Perceptive to Behavioral Intelligence," systematically maps the field to guide future research.

 

The authors propose that achieving intelligent behavior is a process that can be categorized into three modules.

The framework begins with Embodied Perception, which the authors categorize based on its relationship with robot behavior. The first is “perception for behavior,” which focuses on the perception tasks primarily utilized for robot actions. This includes object perception—sensing an object's geometric shape, articulated structure, and physical properties to enable manipulation—and scene perception, which involves building models of the environment, such as metric or topological maps, to guide mobility. The second and more distinct area is “behavior for perception,” which involves incorporating the robot's own behavior into the perception process. The survey details how an agent can use mobility to actively move and obtain more information about objects and scenes, or use manipulation to interact with an object to discover its properties, such as its articulated structure.
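To make the "behavior for perception" idea concrete, the sketch below shows one way an agent could use its own mobility to gather more informative views of a scene. It is a minimal illustration, not code from the survey; the class and method names (SceneBelief, robot.move_to, robot.capture_image) are hypothetical assumptions.

```python
# Minimal sketch (hypothetical, not from the survey) of "behavior for perception":
# the agent moves to the viewpoint its current belief is least certain about.
from dataclasses import dataclass, field


@dataclass
class SceneBelief:
    """Running estimate of the scene, e.g. an occupancy or object map."""
    observed_views: list = field(default_factory=list)

    def update(self, observation):
        self.observed_views.append(observation)

    def uncertainty(self, viewpoint) -> float:
        # Placeholder score: in practice this could measure unexplored or occluded regions.
        return 1.0 / (1 + sum(1 for v in self.observed_views if v["pose"] == viewpoint))


def active_perception(robot, candidate_viewpoints, steps=5):
    """Move to the most informative viewpoint, observe, and update the belief."""
    belief = SceneBelief()
    for _ in range(steps):
        # Mobility used in service of perception: pick the least-explored viewpoint.
        target = max(candidate_viewpoints, key=belief.uncertainty)
        robot.move_to(target)                      # hypothetical robot interface
        belief.update({"pose": target, "image": robot.capture_image()})
    return belief
```

The same loop structure applies to manipulation-driven perception, where interacting with an object (for instance, pulling a drawer handle) reveals its articulated structure.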

 

The second module, Embodied Decision-Making, addresses how the agent generates a sequence of behaviors to complete a human instruction based on its observations. The survey categorizes this crucial step into two primary domains: Navigation and Task Planning. Navigation involves reasoning out a sequence of mobility commands (e.g., 'turn left,' 'move straight') to move through an environment, while Task Planning generates a sequence of manipulation skills (e.g., 'open the microwave,' 'grasp the bottle'), including integrated navigation steps. The authors emphasize that the fundamental challenge in this module is real-world grounding: unlike purely digital decision-making, an embodied agent must account for numerous real-world constraints, such as physical feasibility, object affordance, and preconditions.
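As a rough illustration of what real-world grounding means for task planning, the following sketch keeps only those plan steps whose target objects are actually present in the observed scene. The skill names and the planner itself are hypothetical, used only to show the shape of the problem, not the survey's method.

```python
# Hypothetical sketch of grounded task planning for "clean the room":
# an instruction is decomposed into skills, and steps whose preconditions
# fail (here, a missing target object) are dropped from the plan.

def plan_clean_room(scene_objects):
    """Return an ordered list of (skill, target) steps for 'clean the room'."""
    candidate_steps = [
        ("navigate_to", "trash"),
        ("grasp", "trash"),
        ("navigate_to", "bin"),
        ("place_in", "bin"),
        ("navigate_to", "toy"),
        ("grasp", "toy"),
        ("place_on", "shelf"),
    ]
    # Real-world grounding: keep only steps whose target exists in the scene.
    # Physical feasibility and affordance checks would filter steps similarly.
    return [(skill, obj) for skill, obj in candidate_steps if obj in scene_objects]


print(plan_clean_room({"trash", "bin", "toy", "shelf"}))
```

In a real agent, such a plan would also be revised online, for example by inserting a search behavior when an expected object (the missing toy) cannot be found.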

 

The final module, Embodied Execution, translates the generated decision into physical action. The survey focuses this discussion on manipulation skill learning, defining it as learning a behavior policy that maps skill descriptions and environmental observations to a concrete action, typically an embodiment-independent 7-DoF trajectory for a robot arm. The authors review the two primary algorithmic approaches used to train this policy: Imitation Learning (IL), which learns from human demonstrations, and Reinforcement Learning (RL), which learns through trial-and-error interaction. The survey states that the key research problem in this area is achieving generalization—across varied objects, scenes, skills, and instructions. It also highlights a critical trend: a shift away from training isolated, single-skill models and toward developing General-Purpose Execution Models, which, as a direct application of multimodal large language models, can handle multiple skills within a single model.
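The policy interface described above can be summarized as a function from a skill description and an observation to a 7-DoF action (position, orientation, and gripper state). The sketch below is an illustrative assumption only: the class names and the random stand-in policy are not from the survey, and in practice the policy would be a network trained with imitation or reinforcement learning.

```python
# Hypothetical sketch of a manipulation policy interface: maps a skill
# description and an observation to one 7-DoF action
# (x, y, z, roll, pitch, yaw, gripper).
import numpy as np


class ManipulationPolicy:
    def act(self, skill_description: str, observation: np.ndarray) -> np.ndarray:
        """Return one 7-DoF action; a real policy would be a learned model."""
        raise NotImplementedError


class RandomPolicy(ManipulationPolicy):
    """Stand-in policy; IL or RL training would replace this with a learned one."""
    def act(self, skill_description, observation):
        return np.random.uniform(-1.0, 1.0, size=7)


policy = RandomPolicy()
action = policy.act("grasp the bottle", observation=np.zeros((64, 64, 3)))
print(action.shape)  # (7,)
```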

 

By providing this comprehensive three-module framework, the survey aims to structure the research landscape, systematically identify key challenges, and offer a clear roadmap for the field. The authors hope this structural approach will guide the community's efforts in developing the next generation of general-purpose intelligent agents.


Disclaimer: AAAS and EurekAlert! are not responsible for the accuracy of news releases posted to EurekAlert! by contributing institutions or for the use of any information through the EurekAlert system.