Article Highlight | 14-Nov-2025

Deep MARL-based resilient motion planning for decentralized space manipulator

Beijing Institute of Technology Press Co., Ltd

Space manipulators play an important role in on-orbit servicing and planetary surface operations, where reliability is a key issue. In the extreme environment of space, space manipulators are susceptible to a variety of unknown disturbances. Because a manipulator can rarely be repaired immediately, a failure means it cannot complete its mission as expected, which may cause serious losses and hazards. Resilience, i.e., the manipulator's ability to recover from failures or disturbances and continue to complete its tasks, is therefore a core capability for its future development. The motion planning unit serves as the computing terminal that generates the manipulator's joint motion trajectories. Compared with traditional motion planning, learning-based motion planning has gradually become a hot spot in current research. However, regardless of the research approach, the single robotic manipulator is treated as one independent agent, which leaves it with insufficient flexibility under conditions such as external force disturbance, observation noise, and mechanical failure. In a research article recently published in Space: Science & Technology, a team led by Professor Chengchao Bai from Harbin Institute of Technology puts forward the idea of "discretization of the traditional single manipulator": different discretization forms are derived by analyzing the joint relationships of the multi-degree-of-freedom single manipulator, yielding a single-manipulator representation composed of multiple new subagents. The authors hope this work can provide a new reference for improving the manipulator's resilient recovery ability.

First, the core idea, "discretization of the traditional single manipulator," is presented in detail. The overall idea is shown in Fig. 1.

Before applying a multiagent learning algorithm to a single manipulator's motion planning, the manipulator must first be discretized into a multiagent form, which is achieved by constructing a joint graph. The joints of a manipulator with n degrees of freedom (DOFs) are regarded as the nodes V = {1, 2, …, n} of the graph, and the links between joints are regarded as the edges ε. The joint graph of the manipulator can then be expressed as an undirected graph G = (V, ε). Each agent selects joint nodes from the joint graph to construct its own subjoint graph and can only control the joint nodes in that subjoint graph. The observation of each agent contains 3 parts of information: the state of each joint node in its subjoint graph (joint angle, angular velocity, torque, etc.), the information obtained by communicating with other agents within k joint nodes (i.e., the states of all nodes within k joint nodes of its subjoint graph), and the global task information (such as the distance from the end of the manipulator to the target point). On the basis of the joint graph, the paper divides the manipulator into 3 cases: a single agent, two agents, and three agents (as shown in Fig. 2).
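As a rough illustration of this joint-graph discretization, the sketch below builds the chain joint graph for an n-DOF manipulator, partitions its nodes into per-agent subjoint graphs, and collects the neighbors within a communication distance k. The contiguous equal split, the function names, and the printed example are illustrative assumptions, not the paper's exact partition (which is shown in Fig. 2).

```python
import numpy as np

def build_joint_graph(n_dof):
    """Chain joint graph G = (V, eps): joints are nodes, links between joints are edges."""
    nodes = list(range(1, n_dof + 1))
    edges = [(i, i + 1) for i in range(1, n_dof)]
    return nodes, edges

def partition_agents(nodes, n_agents):
    """Assumed contiguous split of the joint nodes into one subjoint graph per agent."""
    return [list(chunk) for chunk in np.array_split(nodes, n_agents)]

def k_hop_neighbors(sub_nodes, all_nodes, k):
    """Joint nodes within k hops of an agent's subjoint graph along the chain."""
    near = {u for v in sub_nodes for u in all_nodes if 0 < abs(u - v) <= k}
    return sorted(near - set(sub_nodes))

# Example: 7-DOF manipulator discretized into 3 agents, communication distance k = 2
nodes, edges = build_joint_graph(7)
sub_graphs = partition_agents(nodes, 3)                     # e.g. [[1, 2, 3], [4, 5], [6, 7]]
comm_nodes = [k_hop_neighbors(sg, nodes, k=2) for sg in sub_graphs]
print(sub_graphs, comm_nodes)
```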

When it comes to multiagent reinforcement learning (MARL) for a single manipulator, the vector q = {q_1, q_2, …, q_n} represents a point in the configuration space C of an n-joint manipulator. If the manipulator is divided into a agents, the configuration of the manipulator can be expressed as q = {q_i}, i = 1, …, a. The set of joint angles that all agents can obtain through communication is denoted by q^k = {q_i^k}, i = 1, …, a. Motion planning computes a collision-free trajectory of the manipulator between a given initial configuration q_start and target configuration q_goal. The multiagent motion planning process of a single manipulator is modeled as a multiagent Markov decision process, represented by the tuple (G, S, U, P, R, γ), where S, U, P, R, and γ are the set of state vectors of each agent, the set of action vectors of each agent, the transition function, the reward function, and the discount factor, respectively. The goal of multiagent reinforcement learning for a single manipulator is to train a neural network motion planner π_Φi, parameterized by Φ_i, for each agent. The three parts of the observation s_i^t of each agent are the angles q_i^t of its joint nodes, the change in angle between the current and previous time step Δq_i^t = q_i^t − q_i^(t−1), and the 3-dimensional vector from the end position of the manipulator to the target position P_t = P_goal − P_end^t.
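To make the observation structure concrete, here is a minimal sketch of how one agent's observation s_i^t could be assembled from the three parts just described; the concatenation order and the helper name agent_observation are assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

def agent_observation(q_t, q_prev, own_idx, comm_idx, p_end, p_goal):
    """Assemble s_i^t from the three parts described above (ordering assumed).

    q_t, q_prev   : joint angles of the whole manipulator at times t and t-1
    own_idx       : indices of the joints in this agent's subjoint graph
    comm_idx      : indices of joints reachable within k hops (communication)
    p_end, p_goal : 3-D end-effector position and target position at time t
    """
    own_q  = q_t[own_idx]                       # q_i^t: angles of own joint nodes
    own_dq = q_t[own_idx] - q_prev[own_idx]     # Δq_i^t = q_i^t - q_i^(t-1)
    comm_q = q_t[comm_idx]                      # states obtained by communication
    p_vec  = p_goal - p_end                     # P_t = P_goal - P_end^t
    return np.concatenate([own_q, own_dq, comm_q, p_vec])
```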

The algorithm adopted in this paper is similar to the multiagent deep deterministic policy gradient (MADDPG) algorithm in that both use the framework of centralized training with decentralized execution. A centralized action-state value function neural network is set up for all agents. The network takes the joint action U and the joint observation {S_t, S_t^k} as input and outputs Q values for all agents. To alleviate overestimation by the Q function, 2 Q functions, denoted Q_θ1 and Q_θ2, are trained in the same way, following the double deep Q-network (DQN). In addition, random exploration noise is added to the mean action output by the actor neural network. Together, these elements form a centralized multiagent soft actor-critic (CenMASAC) algorithm.
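The following PyTorch sketch illustrates the centralized twin-Q critic idea described above: one module holds two Q networks over the joint observation and joint action, and the TD target takes the minimum of the two estimates to curb overestimation. The hidden-layer sizes follow the critic described in the simulation section; the class and function names and the simplified target formula (no entropy term) are assumptions rather than the paper's full CenMASAC implementation.

```python
import torch
import torch.nn as nn

class CentralizedTwinCritic(nn.Module):
    """Two centralized Q networks (Q_theta1, Q_theta2) over joint observations and actions."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=(256, 256, 64)):
        super().__init__()
        def make_q():
            dims = [joint_obs_dim + joint_act_dim, *hidden, 1]
            layers = []
            for i in range(len(dims) - 2):
                layers += [nn.Linear(dims[i], dims[i + 1]), nn.ELU()]
            layers.append(nn.Linear(dims[-2], dims[-1]))
            return nn.Sequential(*layers)
        self.q1, self.q2 = make_q(), make_q()

    def forward(self, joint_obs, joint_act):
        x = torch.cat([joint_obs, joint_act], dim=-1)
        return self.q1(x), self.q2(x)

def td_target(reward, next_q1, next_q2, done, gamma=0.99):
    """Bootstrapped target using min(Q1, Q2) to alleviate value overestimation."""
    return reward + gamma * (1.0 - done) * torch.min(next_q1, next_q2)
```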

Then, the simulation results are presented. The training environment is a planar-obstacle environment built on the MuJoCo physics simulation engine. The manipulator is a 7-DOF Kinova Jaco2 manipulator, as shown in Fig. 3. The goal of training is to move the end of the manipulator from the starting position P_start to the target position P_goal without colliding with the obstacle. To avoid unreachable starting and target configurations, the authors randomly generated 10,000 pairs of collision-free starting and target configurations in advance. Each agent in the CenMASAC algorithm corresponds to an actor network, a fully connected neural network with 3 hidden layers of 256 nodes each; the hidden layers use the ELU activation function, and the output layer uses tanh. The critic of the CenMASAC algorithm is a fully connected neural network with 3 hidden layers of 256, 256, and 64 nodes, respectively, all with ELU activations. Planning is considered successful when the distance between the end position of the manipulator and the target position is less than 5 cm.

The simulation considers the influence of different agent numbers and communication distances k on the training results. At the end of training, 100 pairs of initial and target configurations are randomly generated again, and the success rate over these 100 plans gives the true success rate without added noise. The results are shown in Table 3: as the number of agents increases, training becomes more difficult and the success rate decreases. As for resilience under different disturbances, the experimental results are shown in Table 4. They show that the multiagent configurations recover more resiliently than the single agent across all k values and agent numbers, and that they also resist observation interference well when k = 0, 2, and 3. Furthermore, the joints near the root of the manipulator have a large influence on motion planning. The communication distance k has little influence on the success rate of motion planning but a large influence on the training difficulty of multiagent reinforcement learning.
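For reference, a per-agent actor with the architecture described above (3 hidden layers of 256 ELU units and a tanh output) and the 5-cm success check might look like the following sketch; the class name, the [-1, 1] action bounds, and the evaluation helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AgentActor(nn.Module):
    """Per-agent actor: 3 hidden layers of 256 ELU units, tanh on the output layer."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, act_dim), nn.Tanh(),   # mean action, bounded in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

def is_success(p_end, p_goal, tol=0.05):
    """Planning counts as successful if the end effector is within 5 cm of the target."""
    return float(torch.norm(p_goal - p_end)) < tol
```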

Finally, the discussion and conclusions are presented. This paper proposes a discretized definition of a single manipulator from a brand-new perspective and successfully applies multiagent reinforcement learning to single-manipulator motion planning. By comparing the manipulator's behavior under unknown disturbances such as joint locking, observation noise, and motion disturbance, the effectiveness of the proposed method is verified, and a new research direction is provided for the resilient recovery of a single manipulator. However, some open problems remain:

1. When the number of agents is 3 or more, training multiagent coordination becomes more difficult. The multiagent reinforcement learning algorithm for a single manipulator needs further study.

2. The current motion planning problem requires the manipulator to reach the target position without constraining the target pose. For this task, the 7-DOF manipulator has considerable DOF redundancy. When the redundancy of the manipulator is reduced, the effect of multiagent motion planning needs further study.

 
