Safe Robot Control through Multi-Task Offline Reinforcement Learning with Multi-Scale Distribution Debiasing
1 College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Jinzhong, China
2 College of Artificial Intelligence, Taiyuan University of Technology, Jinzhong, China
* Corresponding Author: Li Wang. Email:
Computers, Materials & Continua 2026, 88(1), 67 https://doi.org/10.32604/cmc.2026.079959
Received 31 January 2026; Accepted 24 March 2026; Issue published 08 May 2026
Abstract
Robots perform diverse tasks in real-world scenarios. In safety-critical applications, robot control must prioritize satisfying safety constraints in addition to achieving high performance. Offline safe reinforcement learning avoids risky online exploration by training from a given dataset. However, most existing methods overlook two issues in offline data. First, non-zero cost signals are typically sparse, which leads to inaccurate cost value estimates and makes it difficult to impose effective safety constraints on the policy. Second, an imbalanced dataset biases policy learning toward unsafe behaviors. To address these challenges, we propose an actor-critic method ARMOR (multi-scAle Reweighting with Multi-task Offline cRitic). The multi-task critic treats reward, long-term cost, and short-term cost as multiple tasks, learns shared representations to capture common state information, and leverages dense reward signals to stabilize learning under sparse cost signals. To mitigate dataset imbalance, ARMOR performs counterfactual reasoning with the short-term cost to upweight critical safe transitions near the risk boundary and assigns higher weights to low-cost trajectories. It then performs multi-scale reweighting by combining transition-level and trajectory-level weights to debias the data distribution and emphasize safe demonstrations. The actor is parameterized by a conditional diffusion policy and trained via weighted behavior cloning. ARMOR additionally incorporates a reward-guided objective and a long-term cost constraint to improve the reward-cost trade-off. Extensive experiments on continuous-control robot tasks show that ARMOR achieves competitive performance under safety constraints, with clear advantages in several challenging environments. Furthermore, ARMOR exhibits zero-shot adaptation capability, making it suitable for practical deployment.
Robot control is a key technology for autonomous systems. It is increasingly deployed in safety-critical domains, including robotic surgery [1], industrial automation [2], and Automated Guided Vehicle (AGV) transportation [3], where failures may cause serious harm. Ensuring that safety constraints are satisfied during decision-making is therefore a prerequisite for real-world deployment. Reinforcement learning (RL) has received a great deal of attention in robot control [4], but online exploration can be unsafe and expensive. Offline safe reinforcement learning (OSRL) addresses this concern by learning from fixed datasets without risky interaction.
In OSRL, the offline dataset is collected by one or more behavior policies and consists of a set of trajectories. Each trajectory is a sequence of transitions, each of which provides a reward signal and a cost signal. The reward quantifies task performance, while the cost represents safety violations, such as collisions, damage, or entering hazardous regions. The objective of OSRL is to learn a policy that maximizes the expected reward while satisfying a predefined cost limit.
Offline safe reinforcement learning faces two major challenges. First, at the transition level, non-zero cost signals are typically sparse because safety violations do not occur at every time step, as shown in Fig. 1a. Consequently, cost value function learning is dominated by zero-cost samples, which leads to underestimated cost values and makes it difficult to enforce constraints during policy optimization. Second, at the trajectory level, offline datasets are often imbalanced. Trajectories that satisfy the cost limit are rare, which hinders the learning of safe policies, as shown in Fig. 1b. To quantify these two challenges, we report dataset statistics for the public offline safe RL datasets released in [5]. The datasets are collected by a suite of policies trained with different cost constraints using various safe RL algorithms. Table 1 reports the number of transitions, the non-zero-cost transition rate, and the fraction of trajectories that satisfy the cost limit.

Figure 1: Visualization of reward and cost for the AntRun task, based on the datasets in [5]. (a) Each point represents (step, cost) or (step, reward) of a transition in an episode. (b) Each point represents (cost return, reward return) of a trajectory in the dataset. Only data on the left side of the dashed line is feasible.

Accurate value estimation is a central challenge in offline reinforcement learning. Prior offline RL methods mitigate value estimation errors through behavior-regularized policy learning that restricts actions to the given data and conservative value estimation [6,7]. Building on these ideas, offline safe RL further introduces a cost critic and optimizes a Lagrangian objective. Primal-Dual-Critic Algorithm (PDCA) runs a primal-dual procedure over a critics-estimated Lagrangian [8]. Constrained Offline Policy Optimization (COPO) applies an offline cost-projection with confidence bounds to better account for distributional shift [9]. Other approaches explicitly handle out-of-distribution (OOD) behaviors for safety. Constraints Penalized Q-Learning (CPQ) treats OOD actions as unsafe and updates the policy using only safe state-action pairs [10]. Constraint-Conditioned Actor-Critic (CCAC) employs a constraint-conditioned variational autoencoder with a classifier to generate and identify unsafe OOD data, and uses such samples to regularize critics and policy learning [11]. Complementary to these, Lee et al. [12] proposed a method that optimizes the policy in the stationary distribution space under conservative cost constraints. Variational Optimization with Conservative Estimation (VOCE) utilizes variational formulations with pessimistic reward and cost value estimation to reduce OOD extrapolation errors [13]. Despite these advances, most methods assume that the dataset provides sufficiently informative cost supervision. In practice, sparse non-zero costs violate this assumption, which biases cost value estimates and weakens constraint enforcement.
Imperfect datasets complicate offline safe RL because the data distribution can be dominated by unsafe trajectories. Recent work attempts to reshape the offline distribution by generating or augmenting data. Generative Trajectory Augmentation (GTA) augments trajectories through diffusion-based denoising with guidance toward amplified returns, producing high-reward data [14]. AdaptDiffuser generates expert data with reward-gradient guidance, selects high-quality samples via a discriminator, and iteratively fine-tunes the diffusion planner [15]. For offline safe RL, a trajectory-classification approach partitions trajectories into desirable and undesirable subsets and trains a policy to generate desirable trajectories using classifier-provided desirability scores [16]. OASIS, short for cOnditionAl diStributIon Shaping, employs a conditional diffusion model, conditioned on reward and cost thresholds, to reshape the offline distribution toward safer and more rewarding regions [17]. SafeDiffuser embeds control barrier function constraints into the denoising process to enforce safety specifications during diffusion-based data generation [18]. However, distribution debiasing via data generation can be limited under severe dataset imbalance, since the generator is difficult to train and may fail to reliably produce safe and informative samples.
Due to their ability to represent complex distributions, diffusion models have been explored for offline decision-making by modeling policies as action generators [19]. In robotics, Chi et al. [20] generate robot behavior by modeling visuomotor control as a conditional denoising diffusion process. Several methods further incorporate safety into diffusion-based offline policies. Trajectory-based REal-time Budget Inference (TREBI) transforms policy optimization into a trajectory distribution optimization problem, using diffusion-based planning with dynamic cost budgets to guide action generation [21]. FeasIbility-guided Safe Offline RL (FISOR) leverages reachability analysis to translate hard safety requirements into feasible-region identification and derives an energy-guided diffusion formulation for weighted behavior cloning [22]. Constrained Diffusion Policy (CDP) maps diffusion samples onto a constrained manifold via a mirror diffusion model, thereby generating actions that satisfy safety constraints [23]. In safety-critical autonomous driving, Uncertainty-based Alternative Diffusion Policy (UADP) trains two alternative diffusion policies with an ensemble Q critic and selects actions with lower uncertainty to reduce risk [24]. However, many diffusion-based methods are trained by matching the offline data distribution and are thus sensitive to dataset quality. Reliable improvement beyond the behavior policy is challenging, especially under sparse cost supervision and severe dataset imbalance.
To tackle these challenges, we propose ARMOR (multi-scAle Reweighting with Multi-task Offline cRitic), an actor-critic method that integrates multi-scale distribution debiasing and a multi-task critic into a conditional diffusion policy. ARMOR provides a generative offline control approach that jointly optimizes task performance and safety. The main contributions are summarized as follows:
• We propose a multi-task critic with a shared trunk that treats reward, long-term cost, and short-term cost as multiple tasks. The shared trunk learns shared representations to capture common state features, leveraging dense reward feedback to enable more reliable value estimation under sparse non-zero cost supervision.
• We present a multi-scale debiasing strategy that combines trajectory-level weighting with counterfactual transition-level weighting to mitigate dataset imbalance. At the trajectory level, we upweight low-cost trajectories. At the transition level, we assign transition weights by comparing short-term costs under counterfactual action perturbations, emphasizing safety-critical transitions near the risk boundary.
• We incorporate the above designs into a conditional diffusion actor. The actor is trained with weighted behavior cloning, augmented with a return-guided objective from the reward critic and a cost constraint from the long-term cost critic to achieve a better reward-cost trade-off.
• We evaluate ARMOR on eight tasks across two standard robot control benchmarks. Results show that ARMOR improves returns while satisfying cost limits and exhibits zero-shot adaptation.
We model safe reinforcement learning for continuous-control robotics as a Constrained Markov Decision Process (CMDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, c, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition dynamics, $r$ the reward function, $c$ the cost function, and $\gamma \in (0, 1)$ the discount factor.
In safe robotic control, the policy is required to maximize the reward return while satisfying a cost constraint. Therefore, the CMDP objective can be written as:
$$\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le \kappa,$$
where $\tau$ denotes a trajectory generated by the policy $\pi$ and $\kappa$ is the cost limit.
Unlike the online setting, offline reinforcement learning assumes that the agent only has access to a fixed dataset $\mathcal{D} = \{(s_t, a_t, r_t, c_t, s_{t+1})\}$ collected by one or more behavior policies, and cannot interact further with the environment during training.
In this section, we present ARMOR (multi-scAle Reweighting with Multi-task Offline cRitic) as illustrated in Fig. 2.

Figure 2: ARMOR overview.
To represent the remaining cost budget at each step, we first introduce the cost threshold construction, in which each transition is augmented with a cumulative cost. Then, we describe the proposed multi-task critic with a shared trunk, which jointly learns reward, long-term cost, and short-term cost. The reward head guides performance optimization, the long-term cost head estimates cumulative cost to ensure policy safety, and the short-term cost head captures imminent violations to highlight the risk boundary. The shared trunk learns shared representations to ease risk representation learning, and PCGrad mitigates gradient conflicts across tasks. Next, we formulate multi-scale reweighting, which combines trajectory-level debiasing with counterfactual transition-level weighting. Trajectory weights reshape the effective training distribution toward low-cost behaviors. Counterfactual comparisons use the short-term cost critic to identify transitions that are sensitive to small action perturbations, focusing learning on safety-critical transitions near the constraint boundary. Finally, we integrate these components into a conditional diffusion actor, optimized through weighted behavior cloning, reward-guided improvement, and a Lagrangian penalty induced by the long-term cost critic. This setup allows for stable offline training and a stronger reward-safety trade-off.
3.1 Cost Threshold Construction
Many previous methods distribute the cost limit across time steps, either by discounting or by uniformly splitting the total, and use this per-step allocation to determine constraint violations at each step. While this approach simplifies constraint evaluation, it introduces two main issues. First, safety cost signals are often sparse, with the per-step cost equal to zero at most time steps, so a fixed per-step budget provides little informative supervision. Second, a static allocation ignores how the remaining budget actually evolves along a trajectory, so the per-step constraint can be too loose early on and too tight later.
ARMOR instead introduces a per-transition cost threshold that tracks the remaining budget: each transition is augmented with the cumulative cost incurred so far, and the threshold is obtained by subtracting this cumulative cost from the episode cost limit. Conditioning the value functions and the policy on this threshold lets them reason about how much budget remains at each step.
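To make the construction concrete, the following is a minimal sketch of how each transition could be augmented with the remaining budget; the function and field names (`attach_cost_thresholds`, `cost_limit`, the `d` key) are illustrative, not the paper's implementation.

```python
def attach_cost_thresholds(trajectory, cost_limit):
    """Augment each transition with the remaining cost budget.

    trajectory: list of dicts with keys 's', 'a', 'r', 'c', 's_next'.
    The threshold starts at the episode cost limit and shrinks by the
    cost accumulated so far, floored at zero.
    """
    remaining = float(cost_limit)
    augmented = []
    for tr in trajectory:
        tr = dict(tr, d=max(remaining, 0.0))  # budget available before acting
        augmented.append(tr)
        remaining -= tr["c"]                  # spend budget on incurred cost
    return augmented
```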
3.2 Multi-Task Critic with a Shared Trunk
We transfer dense reward supervision to cost-related tasks via a shared representation, thereby reducing the negative impact of cost sparsity. This design eases risk representation learning and improves feature generalization for value estimation. Moreover, by sharing features across objectives, ARMOR reuses supervision more effectively, leading to more data-efficient value estimation in the offline setting.
Specifically, we formulate critic learning as multi-task representation learning. The critic takes the state, the cost threshold, and the action as input and predicts three quantities: the reward value, the long-term (discounted cumulative) cost value, and the short-term cost value over a near horizon. The critic is implemented as a shared trunk that maps the input to a common representation, followed by three task-specific heads, one per objective. Each head is trained with its own temporal-difference objective, with double Q-learning [25] used to mitigate overestimation. We jointly optimize all critic parameters by summing the per-head losses and applying PCGrad [26] to the shared-trunk gradients, as illustrated in Fig. 3.
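As a concrete illustration, the following PyTorch sketch shows one way to structure the shared trunk and the three heads; the layer sizes and names are assumptions rather than the reported architecture.

```python
import torch
import torch.nn as nn

class MultiTaskCritic(nn.Module):
    """Shared trunk with reward, long-term cost, and short-term cost heads."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        in_dim = state_dim + 1 + action_dim  # state, cost threshold, action
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.q_reward = nn.Linear(hidden, 1)      # dense reward task
        self.q_cost_long = nn.Linear(hidden, 1)   # discounted cumulative cost
        self.q_cost_short = nn.Linear(hidden, 1)  # near-horizon cost

    def forward(self, state, threshold, action):
        z = self.trunk(torch.cat([state, threshold, action], dim=-1))
        return self.q_reward(z), self.q_cost_long(z), self.q_cost_short(z)
```

Because all three heads read the same representation, gradients from the dense reward task shape features that the sparse cost tasks reuse.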

Figure 3: Multi-task gradients and PCGrad. (a) Gradient conflicts arise between tasks when their gradients point in opposing directions; (b) PCGrad projects each conflicting gradient onto the normal plane of the other before averaging.
Intuitively, when the gradients of two heads conflict, we subtract the conflicting component from the original gradient. After completing the gradient projection for all heads, we average the corrected gradients to obtain the joint gradient used for updating the shared trunk.
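The projection step can be sketched as follows, in the style of PCGrad [26], projecting each running task gradient against the other tasks' gradients in turn; the flat-gradient interface is an assumption for brevity.

```python
import torch

def pcgrad(per_task_grads):
    """Average task gradients after removing pairwise conflicting components.

    per_task_grads: list of flat gradient tensors over the shared trunk,
    one per critic head. A conflict is a negative inner product.
    """
    projected = []
    for i, g_i in enumerate(per_task_grads):
        g = g_i.clone()
        for j, g_j in enumerate(per_task_grads):
            if i == j:
                continue
            dot = torch.dot(g, g_j)
            if dot < 0:  # conflict: subtract the component along g_j
                g -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
        projected.append(g)
    return torch.stack(projected).mean(dim=0)  # joint trunk gradient
```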
3.3 Multi-Scale Reweighting
Offline safe reinforcement learning is often limited by the data distribution. Under strict constraints, few trajectories satisfy the safety requirements, resulting in an imbalanced training distribution. ARMOR addresses this issue by reshaping the training distribution with multi-scale weights: trajectory-level weighting corrects behavioral bias, and transition-level weighting emphasizes safe transitions near the constraint boundary.
Trajectory-level debiasing. For each trajectory in the dataset, we compute its cost return and assign a weight that grows as the cost return falls below the cost limit, so feasible, low-cost trajectories are emphasized and constraint-violating trajectories are suppressed. The trajectory weight can be regarded as an importance shift toward feasible trajectories: reweighting the behavior distribution by these weights yields an effective training distribution that concentrates on low-cost behaviors while retaining coverage of the original data. Since the weights are normalized over the dataset, the reweighted objective remains a properly scaled behavior cloning loss.
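One plausible instantiation of the trajectory-level weights is an exponential tilt away from constraint violation; the functional form and temperature below are illustrative assumptions, not ARMOR's published equation.

```python
import numpy as np

def trajectory_weights(cost_returns, cost_limit, temperature=10.0):
    """Illustrative trajectory weights: downweight constraint violators.

    cost_returns: array of per-trajectory cost returns C(tau).
    Feasible trajectories (C <= limit) keep weight 1 before normalization;
    infeasible ones decay exponentially with the size of the violation.
    """
    violation = np.maximum(np.asarray(cost_returns) - cost_limit, 0.0)
    w = np.exp(-violation / temperature)
    return w / w.mean()  # normalize so weights average to one
```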
Counterfactual transition reweighting. It is not possible to judge the quality of all transitions based only on trajectory cost, since high-cost trajectories may also contain critical decision transitions. We therefore introduce a transition-level counterfactual weight that highlights critical safe transitions near the risk boundary, where the observed action is safe, but small local perturbations could lead to constraint violations.
For each transition, we perform a counterfactual comparison using the short-term cost head: the observed action is compared against small perturbations of that action, and the critic predicts the short-term cost of each. If the perturbed actions incur noticeably higher predicted short-term cost than the observed action, the transition lies near the risk boundary, where the logged behavior is safe but nearby actions are not. Such transitions receive larger weights, since they carry the most information about how to act safely under risk.
The final weight is obtained by combining the two scales, followed by normalization and clipping to a bounded range.
These weights are used to reweight the actor’s behavior cloning loss, explicitly injecting safety-aware data preference from the distribution side rather than only relying on penalties in the objective.
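A hedged sketch of the transition-level counterfactual weight, combined with the trajectory weight, is shown below. The perturbation scale, the number of samples, and the `1 + margin` mapping are illustrative choices, and `critic` is assumed to return the three head outputs as in the earlier sketch.

```python
import torch

@torch.no_grad()
def multi_scale_weights(critic, s, d, a, traj_w, sigma=0.1, n_perturb=8,
                        w_min=0.1, w_max=5.0):
    """Combine counterfactual transition weights with trajectory weights."""
    _, _, h_obs = critic(s, d, a)                   # short-term cost of logged action
    noise = sigma * torch.randn(n_perturb, *a.shape)
    h_pert = torch.stack(
        [critic(s, d, (a + eps).clamp(-1.0, 1.0))[2] for eps in noise]
    ).mean(dim=0)                                   # avg short-term cost nearby
    margin = (h_pert - h_obs).clamp(min=0.0).squeeze(-1)
    w = (1.0 + margin) * traj_w                     # boundary emphasis x trajectory scale
    w = w / w.mean()                                # normalize over the batch
    return w.clamp(w_min, w_max)                    # clip extreme weights
```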
Since the weights depend on critic estimates, which are unreliable early in training, we enable reweighting only after a warm-up period and smooth the weights with an exponential moving average (EMA) to avoid instability.
3.4 Conditional Diffusion Actor
Diffusion models exhibit strong expressive capacity for modeling complex data distributions. ARMOR parameterizes the actor as a conditional diffusion policy. This formulation provides a unified framework for weighted behavior cloning, critic-based policy improvement, and explicit safety constraints. Specifically, the policy generates actions by iteratively denoising Gaussian noise, conditioned on the current state and the cost threshold.
We follow DDPM [27] on the action space. Let $a^0$ denote an action from the dataset and $a^1, \dots, a^N$ the latents of the forward process, which gradually corrupts $a^0$ with Gaussian noise under a variance schedule $\beta_1, \dots, \beta_N$:
$$q(a^k \mid a^{k-1}) = \mathcal{N}\!\left(a^k;\, \sqrt{1-\beta_k}\, a^{k-1},\, \beta_k \mathbf{I}\right),$$
where $k$ indexes the diffusion step. Defining $\alpha_k = 1 - \beta_k$ and $\bar{\alpha}_k = \prod_{i=1}^{k} \alpha_i$, a noisy action can be sampled in closed form as $a^k = \sqrt{\bar{\alpha}_k}\, a^0 + \sqrt{1-\bar{\alpha}_k}\, \epsilon$ with $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The reverse process is parameterized by a conditional noise predictor $\epsilon_\theta(a^k, k, s, d)$ that denoises step by step, conditioned on the state and the cost threshold. The actor is trained with the weighted denoising objective
$$\mathcal{L}_{\mathrm{bc}}(\theta) = \mathbb{E}_{k,\, (s, d, a^0) \sim \mathcal{D},\, \epsilon}\!\left[ w(s, a^0)\, \big\| \epsilon - \epsilon_\theta(a^k, k, s, d) \big\|^2 \right],$$
where $w(s, a^0)$ is the multi-scale weight from Section 3.3, so that minimizing the diffusion loss implements weighted behavior cloning.
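A minimal training-loss sketch for the weighted denoising objective follows; `eps_model` and the precomputed `alphas_bar` schedule are assumed to exist with the signatures shown.

```python
import torch
import torch.nn.functional as F

def weighted_diffusion_bc_loss(eps_model, a0, s, d, w, alphas_bar, N):
    """Weighted behavior cloning via the DDPM denoising loss (sketch)."""
    B = a0.shape[0]
    k = torch.randint(1, N + 1, (B,), device=a0.device)   # random diffusion step
    ab = alphas_bar[k - 1].unsqueeze(-1)                  # cumulative alpha_bar_k
    eps = torch.randn_like(a0)
    a_k = ab.sqrt() * a0 + (1.0 - ab).sqrt() * eps        # closed-form noising
    pred = eps_model(a_k, k, s, d)                        # conditional noise prediction
    per_sample = F.mse_loss(pred, eps, reduction="none").mean(dim=-1)
    return (w * per_sample).mean()                        # multi-scale reweighting
```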
Behavior cloning typically fails to produce a policy that outperforms the dataset. To address this, we introduce a policy improvement objective based on the reward value function. Specifically, we sample actions from the current diffusion policy and maximize the reward head's estimate on these sampled actions, pushing generation toward higher-value regions within the data support.
At the same time, policy improvement must satisfy cost constraints. We use the long-term cost critic to evaluate actions sampled from the policy and require the estimated discounted cost to remain within the cost threshold.
Considering all of the above, the optimization of the actor can be formulated as a constrained problem:
$$\min_{\theta} \; \mathcal{L}_{\mathrm{bc}}(\theta) - \eta\, \mathbb{E}_{(s, d) \sim \mathcal{D},\, a \sim \pi_\theta}\!\left[ Q_r(s, d, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{(s, d) \sim \mathcal{D},\, a \sim \pi_\theta}\!\left[ Q_c(s, d, a) \right] \le d,$$
where $Q_r$ and $Q_c$ are the reward and long-term cost heads of the critic and $\eta$ balances imitation against reward-guided improvement. We transform this constrained problem into an unconstrained optimization by applying Lagrangian relaxation. By introducing the dual variable $\lambda \ge 0$, the actor minimizes
$$\mathcal{L}(\theta, \lambda) = \mathcal{L}_{\mathrm{bc}}(\theta) - \eta\, \mathbb{E}\!\left[ Q_r(s, d, a) \right] + \lambda\, \mathbb{E}\!\left[ Q_c(s, d, a) - d \right],$$
where $\lambda$ is updated by dual ascent on the estimated constraint violation.
The objectives above define a standard primal-dual optimization of a Lagrangian relaxation. The actor minimizes a behavior-regularized objective with reward guidance and a Lagrangian penalty, while the dual variable is updated by approximate dual ascent. The multi-scale weights are normalized and clipped to prevent extreme gradient amplification. These design choices empirically stabilize training and are consistent with common convergence conditions for stochastic primal-dual methods, such as bounded stochastic gradients and suitably chosen step sizes. We report the evolution of the dual variable during training in Appendix B (Fig. A1).
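For concreteness, one primal-dual step under the formulation above could look as follows; `policy.weighted_bc_loss`, `policy.sample`, and the threshold-conditioned `lam_net` are hypothetical interfaces matching the earlier sketches in this section.

```python
def primal_dual_step(policy, critic, lam_net, batch, opt_pi, opt_lam, eta=1.0):
    """One actor (primal) and one dual-variable update for the Lagrangian."""
    s, d, a0, w = batch["s"], batch["d"], batch["a"], batch["w"]

    bc = policy.weighted_bc_loss(a0, s, d, w)     # weighted behavior cloning
    a_pi = policy.sample(s, d)                    # actions from the diffusion actor
    q_r, q_c, _ = critic(s, d, a_pi)
    lam = lam_net(d)                              # lambda >= 0, conditioned on threshold

    actor_loss = bc - eta * q_r.mean() + (lam.detach() * q_c).mean()
    opt_pi.zero_grad(); actor_loss.backward(); opt_pi.step()

    # Dual ascent: grow lambda where the estimated cost exceeds the threshold.
    violation = q_c.detach() - d
    lam_loss = -(lam_net(d) * violation).mean()
    opt_lam.zero_grad(); lam_loss.backward(); opt_lam.step()
```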
3.5 Deployment of ARMOR on Robotic Systems
ARMOR trains a conditional diffusion model that is deployed on a robotic system to enable autonomous control. At each decision step, the robot acquires its current state from onboard sensing, updates the remaining cost budget to obtain the cost threshold, and feeds both into the diffusion actor, which denoises Gaussian noise into a control action for the low-level controller to execute, as sketched below.
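A minimal deployment loop, assuming a hypothetical `env` whose step API returns cost alongside reward and a `policy.sample` denoising interface:

```python
# Hypothetical interfaces: env.step returns (state, reward, cost, done);
# policy.sample runs reverse denoising conditioned on state and threshold.
state = env.reset()
remaining_budget = cost_limit            # episode-level safety budget
done = False
while not done:
    d = max(remaining_budget, 0.0)       # current cost threshold
    action = policy.sample(state, d)     # reverse denoising -> control action
    state, reward, cost, done = env.step(action)
    remaining_budget -= cost             # spend budget on incurred cost
```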
We evaluate ARMOR on the public benchmarks. In addition to the main comparison against representative baselines, we conduct an ablation study to quantify the contributions of the proposed multi-task critics and multi-scale reweighting, and perform sensitivity analyses on hyperparameters of the diffusion policy and the reweighting scheme. Furthermore, we demonstrate the zero-shot adaptation capability of ARMOR to different cost limits without retraining.
We conduct experiments on continuous-control robotic tasks using the public benchmarks Bullet-Safety-Gym [28] and Safety-Gymnasium [29], which are commonly used in previous works.
In Bullet-Safety-Gym, we focus on the Run task with three robot types: Ant, Ball, and Drone. In this task, the agent is rewarded for traversing a corridor between two boundaries at high speed, while crossing the boundaries or exceeding the velocity limit incurs penalties (Fig. 4a).

Figure 4: Tasks in Bullet-Safety-Gym and Safety-Gymnasium. (a) Run; (b) AntVelocity; (c) Walker2dVelocity; (d) HalfCheetahVelocity; (e) Circle; (f) Goal; (g) Push; (h) Button.
In Safety-Gymnasium, we consider two groups of tasks: the Velocity group and the Navigation group. In the Velocity group, the Ant, Walker2d, and HalfCheetah robots aim to maximize forward displacement, and a cost is incurred when the robot exceeds the velocity limit (Fig. 4b–d).
The Navigation group includes tasks Circle, Goal, Push, and Button with Point and Car robots (Fig. 4e–h). In Circle, clockwise motion along a circular track yields reward, whereas leaving the boundary-defined safe region produces a cost. In Goal, the agent navigates toward a target location while avoiding traps and preventing collisions with hazardous objects. In Push, the agent must push a box to the goal while steering around obstacles and avoiding traps. In Button, the agent must tap the correct target button among multiple buttons, and entering traps or collisions with moving obstacles trigger a cost.
To provide a comprehensive evaluation, we compare ARMOR with the following offline baselines. This allows us to assess ARMOR from multiple perspectives, including whether it improves over behavior cloning, how it compares with representative Q-learning methods under sparse cost supervision, what additional benefits it brings within generative policy learning, and whether it can surpass data-generation approaches without relying on additional synthesized data.
• Imitation Learning: BC, behavior cloning that imitates trajectories in the datasets.
• Q-Learning-Based Algorithms: BCQL, a Lagrangian-based extension of BCQ [6]; CPQ [10], a Q-learning method that treats out-of-distribution (OOD) actions as unsafe and learns policies from safe transitions only.
• Generative Modeling Algorithm: FISOR [22], a feasibility-guided method with a diffusion model.
• Data Generation Algorithms: OASIS [17], which employs a conditional diffusion model to generate datasets and guides the data distribution towards a target domain; CCAC [11], which generates and identifies unsafe OOD data to train adaptive safe policies.
Our evaluation metrics include the normalized cost return and the normalized reward return:
$$R_{\mathrm{norm}} = \frac{R_\pi - r_{\min}}{r_{\max} - r_{\min}}, \qquad C_{\mathrm{norm}} = \frac{C_\pi}{\kappa},$$
where $R_\pi$ and $C_\pi$ are the evaluated reward and cost returns, $r_{\max}$ and $r_{\min}$ are the maximum and minimum empirical reward returns in the dataset, and $\kappa$ is the cost limit. A policy is considered safe when its normalized cost does not exceed 1.
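In code, the two metrics reduce to simple arithmetic (function and argument names illustrative):

```python
def normalized_metrics(reward_return, cost_return, r_min, r_max, cost_limit):
    """Benchmark-style normalization [5]; a policy is safe if c_norm <= 1."""
    r_norm = (reward_return - r_min) / (r_max - r_min)
    c_norm = cost_return / cost_limit
    return r_norm, c_norm
```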
The cost limit is set per task according to its difficulty.

Table 3 reports the normalized return and normalized cost of all methods across three task groups.

Overall, ARMOR satisfies the cost constraint in most environments while attaining competitive returns. Furthermore, it demonstrates a distinct advantage in more challenging scenarios, where the environment is more complex. This outcome is primarily attributed to the reweighting mechanism, which reshapes the training distribution to emphasize safer and more informative transitions, aligning critic evaluation and diffusion policy learning. In addition, the shared representation provides a common feature basis for multiple critic heads, reducing reliance on sparse cost signals and improving the stability and accuracy of value estimation. For tasks with explicit goals, we further report success rate alongside normalized cost in Appendix C (Table A2).
BCQL employs a Lagrangian approach to balance performance and safety, whereas CPQ adopts a conservative update rule and updates the value function only on state-action pairs classified as safe. However, neither method explicitly addresses distribution bias in offline data. When the proportion of safe samples in the dataset is low, the available supervisory signal becomes insufficient, which hinders learning policies that satisfy safety constraints while maintaining performance. OASIS synthesizes training data using reward and cost models, an inverse dynamics model, and a conditional diffusion generator. Since this pipeline involves multiple learned components, modeling errors introduced at any stage can propagate through subsequent data generation and policy optimization, degrading the resulting policy and increasing the likelihood of constraint violations. CCAC updates the cost critic conservatively using augmented data to improve the reliability of constraint estimation, but it does not account for the potential sparsity of cost signals. When cost is sparse, the cost critic may fail to converge to an accurate estimate, which weakens constraint guidance during policy optimization and limits the ability to provide consistent safety guarantees. FISOR tends to enforce feasibility more strictly, yet in some tasks it yields very low, even negative, normalized returns, suggesting that it may converge to overly conservative policies that are undesirable in offline safe RL.
We observe that ARMOR does not satisfy the cost constraint on CarButton1. Several baselines also violate the constraint in this environment, indicating that the dataset coverage and the sharp constraint boundary make it challenging for offline methods. While FISOR satisfies the constraint on CarButton1, its normalized return is negative, implying an extremely conservative policy. This behavior is not aligned with the goal of offline safe RL, which seeks both feasibility and high utility.
We conduct ablation experiments to evaluate the effectiveness of each component in ARMOR. We consider the following variants: (i) w/o Reweighting: removing both trajectory-level and counterfactual reweighting by setting all sample weights to a uniform value; (ii) w/o Multi-Task Critic: replacing the shared-trunk multi-task critic with independently trained critics; (iii) w/o Both: removing the multi-task critic and multi-scale reweighting together; (iv) Gaussian Policy: replacing the conditional diffusion actor with a simple Gaussian policy; (v) w/o Warm-up & EMA: disabling the reweighting warm-up and EMA smoothing; (vi) w/o PCGrad & Multi-Term Cost Critic: removing gradient projection and the short-term cost head from the critic.
Fig. 5 summarizes the ablation results on Navigation tasks. Since offline safe RL prioritizes constraint satisfaction before return maximization, we primarily analyze the normalized cost. ARMOR achieves the lowest cost on three of four tasks and maintains competitive returns. Removing multi-scale reweighting leads to higher normalized cost, indicating that reshaping the training distribution is crucial for safety under offline data limitations. Ablating the multi-task critic also harms feasibility, suggesting that jointly learning reward and multi-horizon costs with conflict-aware optimization produces more reliable value estimates for policy learning. Removing both the multi-task critic and reweighting results in poorer constraint satisfaction, highlighting that conditional generation alone is insufficient. Replacing the diffusion actor with a simple Gaussian policy causes constraint violations in most tasks, showing that the representation capacity of the diffusion model is indispensable. Disabling the reweighting warm-up and EMA tends to increase cost, due to inaccurate value estimates in the initial training phase and unstable weights. Removing PCGrad and the multi-term cost critic degrades constraint satisfaction, demonstrating the need to mitigate gradient interference in the shared critic and to capture both long-term and short-term safety signals.

Figure 5: Ablations on Navigation tasks. The dashed line represents the normalized cost limit.
We study the sensitivity of ARMOR to three important hyperparameters: the diffusion denoising steps N, the clipping range of the multi-scale weights, and the short-term cost discount.
The denoising steps N control the granularity of the reverse diffusion process and thus affect both the expressiveness of the action generator and the inference cost. We sweep N over several settings and report the results in Fig. 6.

Figure 6: Effect of the diffusion denoising steps N during training. We report mean normalized return and cost.
We evaluate three clipping ranges for the weights; results are shown in Fig. 7.

Figure 7: Effect of the weight clipping range.

The short-term cost discount determines the effective horizon of the short-term cost head: smaller values emphasize imminent violations, while larger values make the short-term estimate approach the long-term cost.

Another advantage of our approach is that it can adapt to different cost limits without retraining. In ARMOR, the cost limit is explicitly used when constructing the trajectory-level weights, so training with different limits induces policies with distinct safety-performance preferences. To examine generalization under changing constraints, we train two policies under different cost limits and evaluate each of them under both limits at test time, without any retraining (Fig. 8).

Figure 8: Zero-shot adaptation across cost limits. The dashed line represents the cost limit.
We propose ARMOR, a conditional diffusion policy augmented with multi-scale reweighting and a multi-task critic. ARMOR learns a shared representation via the multi-task critic to enable reliable value estimation. In addition, multi-scale reweighting is introduced into the conditional diffusion policy objective, injecting safety preferences from the data-distribution side. Experiments demonstrate that ARMOR achieves strong performance under cost constraints across multiple continuous-control robotics tasks. A practical limitation of ARMOR is its real-time inference overhead (Appendix C, Table A3), since diffusion-based action generation requires multi-step denoising for each decision. A further limitation is that the current study assumes accurate reward and cost signals and does not consider noisy supervision that may arise in practical deployment. Future work could focus on accelerating inference and improving robustness to noisy feedback, which is essential for deploying ARMOR in real-world robotic systems.
Acknowledgement: The authors sincerely thank all those who supported this research.
Funding Statement: This work was supported by the Joint Fund for Regional Innovation and Development of the National Natural Science Foundation of China (No. U22A20167) and the Special Project for Guiding the Transformation of Scientific and Technological Achievements in Shanxi Province (No. 202404021301033).
Author Contributions: The authors confirm contribution to the paper as follows: methodology, Chengjing Li; software, Chengjing Li; validation, Xiaoyan Zhao; investigation, Chengjing Li; data curation, Xiaoyan Zhao; writing—original draft preparation, Chengjing Li; writing—review and editing, Li Wang and Xiaoyan Zhao; visualization, Chengjing Li; supervision, Li Wang. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The data that support the findings of this study are available from the corresponding author, upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
Appendix A Gradient Conflict Statistics
To examine whether gradient conflict is common, we measure the cosine similarity between the gradients induced by the three critic heads. At each training step, we compute cosine similarities for each of the three head pairs and count the number of pairs whose cosine similarity is negative. Table A1 reports the fraction of conflicting pairs for each task. The results indicate that gradient conflict is non-negligible. Gradient conflict occurs in roughly half of the training steps, and the case of two conflicting pairs arises frequently. This pattern is expected because the critic optimizes one reward objective and two cost objectives, while reward-driven gradients often compete with cost-driven gradients. Overall, these statistics support the motivation for applying PCGrad in the shared critic trunk to mitigate destructive interference among reward and multi-term cost learning.
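The conflict statistic itself is straightforward to compute; below is a sketch of the per-step count over the three head pairs, assuming flat trunk gradients as in the earlier PCGrad sketch.

```python
import itertools
import torch

def count_conflicting_pairs(per_task_grads):
    """Number of head pairs with negative gradient cosine similarity."""
    conflicts = 0
    for g_i, g_j in itertools.combinations(per_task_grads, 2):
        cos = torch.dot(g_i, g_j) / (g_i.norm() * g_j.norm() + 1e-12)
        if cos < 0:
            conflicts += 1
    return conflicts  # ranges over 0..3 for three critic heads
```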


Figure A1: Evolution of the dual variable
Appendix B Implementation Details
The diffusion actor uses an MLP denoiser conditioned on the state and the cost threshold.
We employ a multi-task critic with a shared MLP trunk and three task-specific heads. The critic input concatenates the state, the cost threshold, and the action.
The dual variable is implemented as an MLP that takes the cost threshold as input and outputs a non-negative value.
We report the evolution of the learned dual variable during training in Fig. A1.
Appendix C More Experiment Results
For tasks with explicit goals, such as Goal, Push, and Button, we report success rate together with normalized cost to assess performance under safety constraints. Success is defined by the environment-provided success signal for reaching the goal, and the success rate is computed over the evaluation episodes (Table A2).

Table A3 reports the inference time of ARMOR and Q-learning style offline safe RL baselines, measured on an NVIDIA RTX 3090 GPU. ARMOR exhibits the largest latency among the compared methods. This overhead is expected because the actor is a conditional diffusion policy and requires multiple denoising steps at test time. In contrast, Q-learning style baselines select actions with a single forward pass through the policy. Despite this overhead, the per-action latency remains modest in absolute terms, and it can be reduced further by using fewer denoising steps at deployment.

References
1. Saeidi H, Opfermann JD, Kam M, Wei S, Léonard S, Hsieh MH, et al. Autonomous robotic laparoscopic surgery for intestinal anastomosis. Sci Robot. 2022;7(62):eabj2908. doi:10.1126/scirobotics.abj2908.
2. Wu J, Huang Y, Lai Y, Yang S, Zhang C. Obstacle avoidance inspection method of cable tunnel for quadruped robot based on particle swarm algorithm and neural network. Sci Rep. 2025;15(1):36065. doi:10.1038/s41598-025-19903-w.
3. Nie J, Zhang G, Lu X, Wang H, Sheng C, Sun L. Obstacle avoidance method based on reinforcement learning dual-layer decision model for AGV with visual perception. Control Eng Pract. 2024;153(8):106121. doi:10.1016/j.conengprac.2024.106121.
4. Radosavovic I, Xiao T, Zhang B, Darrell T, Malik J, Sreenath K. Real-world humanoid locomotion with reinforcement learning. Sci Robot. 2024;9(89):eadi9579. doi:10.1126/scirobotics.adi9579.
5. Liu Z, Guo Z, Lin H, Yao Y, Zhu J, Cen Z, et al. Datasets and benchmarks for offline safe reinforcement learning. J Data-Centric Mach Learn Res. 2024;1(12):1–29. doi:10.52202/079017-2494.
6. Fujimoto S, Meger D, Precup D. Off-policy deep reinforcement learning without exploration. In: Proceedings of the 36th International Conference on Machine Learning; 2019 Jun 9–15; Long Beach, CA, USA. p. 2052–62.
7. Kumar A, Zhou A, Tucker G, Levine S. Conservative Q-learning for offline reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems; 2020 Dec 6–12; virtual. p. 1179–91.
8. Hong K, Li Y, Tewari A. A primal-dual-critic algorithm for offline constrained reinforcement learning. In: Proceedings of the 27th International Conference on Artificial Intelligence and Statistics; 2024 May 2–4; Valencia, Spain. p. 280–8.
9. Polosky N, Da Silva BC, Fiterau M, Jagannath J. Constrained offline policy optimization. In: Proceedings of the 39th International Conference on Machine Learning; 2022 Jul 17–23; Baltimore, MD, USA. p. 17801–10.
10. Xu H, Zhan X, Zhu X. Constraints penalized Q-learning for safe offline reinforcement learning. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence; 2022 Feb 22–Mar 1; virtual. p. 8753–60. doi:10.1609/aaai.v36i8.20855.
11. Guo Z, Zhou W, Wang S, Li W. Constraint-conditioned actor-critic for offline safe reinforcement learning. In: Proceedings of the 13th International Conference on Learning Representations; 2025 Apr 24–28; Singapore.
12. Lee J, Paduraru C, Mankowitz DJ, Heess N, Precup D, Kim KE, et al. COptiDICE: offline constrained reinforcement learning via stationary distribution correction estimation. In: Proceedings of the 10th International Conference on Learning Representations; 2022 Apr 25–29; virtual.
13. Guan J, Chen G, Ji J, Yang L, Zhou A, Li Z, et al. VOCE: variational optimization with conservative estimation for offline safe reinforcement learning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems; 2023 Dec 10–16; New Orleans, LA, USA. p. 33758–80.
14. Lee J, Yun S, Yun T, Park J. GTA: generative trajectory augmentation with guidance for offline reinforcement learning. In: Proceedings of the 38th International Conference on Neural Information Processing Systems; 2024 Dec 10–15; Vancouver, BC, Canada. p. 56766–801.
15. Liang Z, Mu Y, Ding M, Ni F, Tomizuka M, Luo P. AdaptDiffuser: diffusion models as adaptive self-evolving planners. In: Proceedings of the 40th International Conference on Machine Learning; 2023 Jul 23–29; Honolulu, HI, USA. p. 20725–45.
16. Gong Z, Kumar A, Varakantham P. Offline safe reinforcement learning using trajectory classification. In: Proceedings of the 39th AAAI Conference on Artificial Intelligence; 2025 Feb 25–Mar 4; Philadelphia, PA, USA. p. 16880–7. doi:10.1609/aaai.v39i16.33855.
17. Yao Y, Cen Z, Ding W, Lin H, Liu S, Zhang T, et al. OASIS: conditional distribution shaping for offline safe reinforcement learning. In: Proceedings of the 38th International Conference on Neural Information Processing Systems; 2024 Dec 10–15; Vancouver, BC, Canada. p. 78451–78.
18. Xiao W, Wang TH, Gan C, Hasani R, Lechner M, Rus D. SafeDiffuser: safe planning with diffusion probabilistic models. In: Proceedings of the 11th International Conference on Learning Representations; 2023 May 1–5; Kigali, Rwanda.
19. Ajay A, Du Y, Gupta A, Tenenbaum JB, Jaakkola TS, Agrawal P. Is conditional generative modeling all you need for decision-making? In: Proceedings of the 11th International Conference on Learning Representations; 2023 May 1–5; Kigali, Rwanda.
20. Chi C, Xu Z, Feng S, Cousineau E, Du Y, Burchfiel B, et al. Diffusion policy: visuomotor policy learning via action diffusion. Int J Robot Res. 2025;44(10–11):1684–704. doi:10.1177/02783649241273668.
21. Lin Q, Tang B, Wu Z, Yu C, Mao S, Xie Q, et al. Safe offline reinforcement learning with real-time budget constraints. In: Proceedings of the 40th International Conference on Machine Learning; 2023 Jul 23–29; Honolulu, HI, USA. p. 21127–52.
22. Zheng Y, Li J, Yu D, Yang Y, Li SE, Zhan X, et al. Safe offline reinforcement learning with feasibility-guided diffusion model. In: Proceedings of the 12th International Conference on Learning Representations; 2024 May 7–11; Vienna, Austria.
23. Ha T, Cha H, Ji D. CDP: constrained diffusion policies with mirror diffusion model for safety-assured imitation learning. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems; 2025 Oct 19–25; Hangzhou, China. p. 9838–45. doi:10.1109/IROS60139.2025.11246518.
24. Huang X, Wang X, Cheng Y. Uncertainty-based alternative diffusion policy for safe autonomous driving. IEEE Trans Intell Transp Syst. 2025;26(11):18854–63. doi:10.1109/TITS.2025.3587341.
25. Hasselt H. Double Q-learning. In: Proceedings of the 24th International Conference on Neural Information Processing Systems; 2010 Dec 6–9; Vancouver, BC, Canada.
26. Yu T, Kumar S, Gupta A, Levine S, Hausman K, Finn C. Gradient surgery for multi-task learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems; 2020 Dec 6–12; virtual. p. 5824–36.
27. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In: Proceedings of the 34th International Conference on Neural Information Processing Systems; 2020 Dec 6–12; virtual. p. 6840–51.
28. Gronauer S. Bullet-Safety-Gym: a framework for constrained reinforcement learning. Munich, Germany: Technical University of Munich; 2022. doi:10.14459/2022md1639974.
29. Ji J, Zhou J, Zhang B, Dai J, Pan X, Sun R, et al. OmniSafe: an infrastructure for accelerating safe reinforcement learning research. J Mach Learn Res. 2024;25(285):1–6.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

