Safe Robot Control through Multi-Task Offline Reinforcement Learning with Multi-Scale Distribution Debiasing

Chengjing Li; Li Wang; Xiaoyan Zhao

doi:10.32604/cmc.2026.079959

icon Open Access

ARTICLE

Safe Robot Control through Multi-Task Offline Reinforcement Learning with Multi-Scale Distribution Debiasing

Chengjing Li¹, Li Wang^2,*, Xiaoyan Zhao²

1 College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Jinzhong, China
2 College of Artificial Intelligence, Taiyuan University of Technology, Jinzhong, China

* Corresponding Author: Li Wang. Email: email

Computers, Materials & Continua 2026, 88(1), 67 https://doi.org/10.32604/cmc.2026.079959

Received 31 January 2026; Accepted 24 March 2026; Issue published 08 May 2026

Abstract

Robots perform diverse tasks in real-world scenarios. In safety-critical applications, robot control must prioritize satisfying safety constraints in addition to achieving high performance. Offline safe reinforcement learning avoids risky online exploration by training from a given dataset. However, most existing methods overlook two issues in offline data. First, non-zero cost signals are typically sparse, which leads to inaccurate cost value estimates and makes it difficult to impose effective safety constraints on the policy. Second, an imbalanced dataset biases policy learning toward unsafe behaviors. To address these challenges, we propose an actor-critic method ARMOR (multi-scAle Reweighting with Multi-task Offline cRitic). The multi-task critic treats reward, long-term cost, and short-term cost as multiple tasks, learns shared representations to capture common state information, and leverages dense reward signals to stabilize learning under sparse cost signals. To mitigate dataset imbalance, ARMOR performs counterfactual reasoning with the short-term cost to upweight critical safe transitions near the risk boundary and assigns higher weights to low-cost trajectories. It then performs multi-scale reweighting by combining transition-level and trajectory-level weights to debias data distribution and emphasize safe demonstrations. The actor is parameterized by a conditional diffusion policy and trained via weighted behavior cloning. ARMOR additionally incorporates a reward-guided objective and a long-term cost constraint to improve the reward-cost trade-off. Extensive experiments on continuous-control robot tasks show that ARMOR achieves competitive performance under safety constraints, with clear advantages in several challenging environments. Furthermore, ARMOR exhibits zero-shot adaptation capability, making it suitable for practical deployment.

Keywords

Robotics; offline safe reinforcement learning; diffusion model; imbalanced data

1 Introduction

Robot control is a key technology for autonomous systems. It is increasingly deployed in safety-critical domains, including robotic surgery [1], industrial automation [2], and Automated Guided Vehicle (AGV) transportation [3], where failures may cause serious harm. Therefore, ensuring safety satisfaction during decision-making is a priority requirement for real-world deployment. Reinforcement learning (RL) has received a great deal of attention in the field of robotic control [4], but online exploration can be unsafe and expensive. Offline safe reinforcement learning (OSRL) addresses this concern by learning from fixed datasets without risky interaction.

In OSRL, the offline dataset is collected by one or more behavior policies and consists of a set of trajectories. Each trajectory is a sequence of transitions, each of which provides a reward signal and a cost signal. The reward quantifies task performance, while the cost represents safety violations, such as collisions, damage, or entering hazardous regions. The objective of OSRL is to learn a policy that maximizes the expected reward while satisfying a predefined cost limit.

Offline safe reinforcement learning faces two major challenges. Firstly, at the transition level, non-zero cost signals are typically sparse because safety violations do not occur at every time step, as shown in Fig. 1a. Consequently, cost value function learning is dominated by zero-cost samples, which leads to the underestimation of cost estimates and further makes it difficult to enforce constraints during policy optimization. Secondly, at the trajectory level, offline datasets are often imbalanced. Trajectories that satisfy the cost limit are rare, which hinders the learning of safe policies, as shown in Fig. 1b. To quantify these two challenges, we report dataset statistics for the public offline safe RL datasets released in [5]. The datasets are collected by a suite of policies, which are trained with different cost constraints using various safe RL algorithms. Table 1 reports the number of transitions, the non-zero cost transition rate Pnz, and the feasible trajectory ratio Pfeas for each task, where feasible trajectories refer to trajectories whose cumulative cost satisfies the safety constraint. These statistics directly confirm that sparse cost supervision and dataset imbalance are pervasive in offline safe RL.

images

Figure 1: Visualization of reward and cost for the AntRun task, based on the datasets in [5]. (a) Each point represents (step, cost) or (step, reward) of a transition in an episode. (b) Each point represents (cost return, reward return) of a trajectory in the dataset. Only data on the left side of the dashed line is feasible.

images

Accurate value estimation is a central challenge in offline reinforcement learning. Prior offline RL methods mitigate value estimation errors through behavior-regularized policy learning that restricts actions to the given data and conservative value estimation [6,7]. Building on these ideas, offline safe RL further introduces a cost critic and optimizes a Lagrangian objective. Primal-Dual-Critic Algorithm (PDCA) runs a primal-dual procedure over a critics-estimated Lagrangian [8]. Constrained Offline Policy Optimization (COPO) applies an offline cost-projection with confidence bounds to better account for distributional shift [9]. Other approaches explicitly handle out-of-distribution (OOD) behaviors for safety. Constraints Penalized Q-Learning (CPQ) treats OOD actions as unsafe and updates the policy using only safe state-action pairs [10]. Constraint-Conditioned Actor-Critic (CCAC) employs a constraint-conditioned variational autoencoder with a classifier to generate and identify unsafe OOD data, and uses such samples to regularize critics and policy learning [11]. Complementary to these, Lee et al. [12] proposed a method that optimizes the policy in the stationary distribution space under conservative cost constraints. Variational Optimization with Conservative Estimation (VOCE) utilizes variational formulations with pessimistic reward and cost value estimation to reduce OOD extrapolation errors [13]. Despite these advances, most methods assume that the dataset provides sufficiently informative cost supervision. In practice, sparse non-zero costs violate this assumption, which biases cost value estimates and weakens constraint enforcement.

Imperfect datasets complicate safe offline RL because the data distribution can be dominated by unsafe trajectories. Recent work attempts to reshape the offline distribution by generating or augmenting data. Generative Trajectory Augmentation (GTA) augments trajectories by diffusion-based denoising with guidance toward amplified returns, producing high-reward data [14]. AdaptDiffuser generates expert data with reward-gradient guidance and selects high-quality data via a discriminator, iteratively finetunes the diffusion planner [15]. For offline safe RL, trajectory classification partitions trajectories into desirable and undesirable subsets, training a policy to generate desirable trajectories via classifier-provided desirability scores [16]. OASIS, short for cOnditionAl diStributIon Shaping, employs a conditional diffusion model, conditioning on reward and cost thresholds to reshape the offline distribution toward safer and more rewarding regions [17]. SafeDiffuser embeds control barrier function constraints into the denoising process to enforce safety specifications during diffusion-based data generation [18]. However, distribution debiasing via data generation can be limited under severe dataset imbalance, since the generator is difficult to train and may fail to reliably produce safe and informative samples.

Due to their ability to represent complex distributions, diffusion models have been explored for offline decision-making by modeling policies as action generators [19]. In robotics, Chi et al. [20] generate robot behavior by modeling visuomotor control as a conditional denoising diffusion process. Several methods further incorporate safety into diffusion-based offline policies. Trajectory-based REal-time Budget Inference (TREBI) transforms policy optimization into a trajectory distribution optimization problem, using diffusion-based planning with dynamic cost budgets to guide action generation [21]. FeasIbility-guided Safe Offline RL (FISOR) leverages reachability analysis to translate hard safety requirements into feasible-region identification and derives an energy-guided diffusion formulation for weighted behavior cloning [22]. Constrained Diffusion Policy (CDP) maps diffusion samples onto a constrained manifold via a mirror diffusion model, thereby generating actions that satisfy safety constraints [23]. In safety-critical autonomous driving, Uncertainty-based Alternative Diffusion Policy (UADP) trains two alternative diffusion policies with an ensemble Q critic and selects actions with lower uncertainty to reduce risk [24]. However, many diffusion-based methods are trained by matching the offline data distribution and are thus sensitive to dataset quality. Reliable improvement beyond the behavior policy is challenging, especially under sparse cost supervision and severe dataset imbalance.

To tackle these challenges, we propose ARMOR (multi-scAle Reweighting with Multi-task Offline cRitic), an actor-critic method that integrates multi-scale distribution debiasing and a multi-task critic into a conditional diffusion policy. ARMOR provides a generative offline control approach that jointly optimizes task performance and safety. The main contributions are summarized as follows:

• We propose a multi-task critic with a shared trunk that treats reward, long-term cost, and short-term cost as multiple tasks. The shared trunk learns shared representations to capture common state features, leveraging dense reward feedback to enable more reliable value estimation under sparse non-zero cost supervision.

• We present a multi-scale debiasing strategy that combines trajectory-level weighting with counterfactual transition-level weighting to mitigate dataset imbalance. At the trajectory level, we upweight low-cost trajectories. At the transition level, we assign transition weights by comparing short-term costs under counterfactual action perturbations, emphasizing safety-critical transitions near the risk boundary.

• We incorporate the above designs into a conditional diffusion actor. The actor is trained with weighted behavior cloning, augmented with a return-guided objective from the reward critic and a cost constraint from the long-term cost critic to achieve a better reward-cost trade-off.

• We evaluate ARMOR on eight tasks across two standard robot control benchmarks. Results show that ARMOR improves returns while satisfying cost limits and exhibits zero-shot adaptation.

2 Problem Formulation

We model safe reinforcement learning for continuous-control robotics using a Constrained Markov Decision Process (CMDP) ℳ=(𝒮,𝒜,P,r,c,γ), where 𝒮 is the state space, including variables like velocities, body poses, and task-relevant quantities such as target and obstacle positions. The action space 𝒜 is typically continuous and corresponds to control commands, including torques, angular velocities, or acceleration. The transition function P(st+1∣st,at) represents the environment dynamics, determined by physics and contact interactions. The reward function r:𝒮×𝒜→R quantifies task accomplishment, such as forward progress, staying on a desired track, or reaching a goal. The cost function c:𝒮×𝒜→R≥0 quantifies safety violations that are critical in robot deployment, such as exceeding speed limits, leaving the safe region, or collisions. we denote the realized reward and cost at time t as rt≜r(st,at) and ct≜c(st,at). γ∈(0,1) is a discount factor. The policy π(a∣s) maps the current state to a control action. Given a trajectory τ={st,at,rt,ct}t=0T, the reward return and the cost return are defined as R(τ)=∑t=0T(γr)trt and C(τ)=∑t=0T(γc)tct, where we allow different discount factors for reward and cost.

In safe robotic control, the policy is required to maximize the reward return while satisfying a cost constraint. Therefore, the CMDP objective can be written as:

maxπ Eτ∼π[R(τ)]s.t.Eτ∼π[C(τ)]≤d,(1)

where d is a prescribed cost limit.

Unlike the online setting, offline reinforcement learning assumes that the agent only has access to a fixed dataset 𝒟={τi}i=1N collected by one or several behavior policies πB(a|s). Thus, the OSRL problem can be formulated as using an offline dataset 𝒟 to learn a policy π that satisfies Eq. (1).

3 Method

In this section, we present ARMOR (multi-scAle Reweighting with Multi-task Offline cRitic) as illustrated in Fig. 2.

images

Figure 2: ARMOR overview.

In order to represent the remaining cost budget at each steps, we first introduce the cost threshold construction, where each transition is augmented with a cumulative cost. Then, we describe the proposed multi-task critic with a shared trunk, which jointly learns reward, long-term cost and short-term cost. The reward head guides performance optimization, the long-term cost head estimates cumulative cost to ensure policy safety, and the short-term cost head captures imminent violations to highlight the risk boundary. The shared trunk learns shared representations to ease risk representation learning, and PCGrad mitigates gradient conflicts across multi-task. Next, we formulate multi-scale reweighting, which combines trajectory-level debiasing with counterfactual transition-level weighting. Trajectory weights reshape the effective training distribution toward low-cost behaviors. Counterfactual comparisons use the short-term cost critic to identify transitions that are sensitive to small action perturbations. This focuses learning on safety-critical transitions near the constraint boundary. Finally, we integrate these components into a conditional diffusion actor, optimised through weighted behavior cloning, reward-guided improvement and a Lagrangian penalty induced by the long-term cost critic. This setup allows for stable offline training and a stronger reward-safety trade-off.

3.1 Cost Threshold Construction

Many previous methods typically distribute the cost limit across each time step either by discounting or uniformly splitting the total, and use this distribution to determine constraint violations at each time step. While this approach simplifies constraint evaluation, it introduces two main issues. First, safety cost signals are often sparse, with ct being zero for most time steps. Distributing the cost limit uniformly across steps fails to align with the true cumulative risk process. Second, the CMDP constraint is inherently based on cumulative cost over a trajectory, and comparing step costs to step limits does not accurately reflect the cumulative constraint, leading to potential bias in constraint evaluation.

ARMOR introduces κt to represent the cost threshold from the current time step to the trajectory termination. Specifically, for any trajectory in the offline dataset, at time step t, we define the cumulative cost from time step t to the trajectory’s end as the threshold for that step, i.e., κt=∑k=tTck. κt dynamically changes over time, enabling a more accurate representation of the remaining cost budget at different stages within a trajectory. This approach aligns more closely with the CMDP’s cumulative cost constraint and provides consistent contextual information for critic and conditioned actor learning.

3.2 Multi-Task Critic

We transfer dense reward supervision to cost-related tasks via a shared representation, thereby reducing the negative impact of cost sparsity. This design eases risk representation learning and improves feature generalization for value estimation. Moreover, by sharing features across objectives, ARMOR reuses supervision more effectively, leading to more data-efficient value estimation in the offline setting.

Specifically, we formulate critic learning as multi-task representation learning. The critic takes (st,κt,at) as input and outputs three action-values:

Qr(st,κt,at;θr)≈E[∑k=0T(γr)krt+k∣st,κt,at,π],Qcu(st,κt,at;θcu)≈E[∑k=0T(γcu)kct+k∣st,κt,at,π], u∈{L,S}.(2)

where Qr is the reward critic that estimates the expected return, QcL is the long-term cost critic that estimates the expected cost, and QcS is the short-term cost critic that predicts imminent risk to identify safety-critical transitions. The discount factors γr, γcL, and γcS correspond to the reward, long-term cost, and short-term cost, respectively. γcS<γcL encourages the short-term head to emphasize imminent risk.

The critic is implemented as a shared trunk h=fψ(s,κ,a) followed by three task heads, each head maintains two Q networks [25]. Following conservative design for offline learning, we use the minimum reward estimate and the maximum cost estimate when constructing the Bellman targets. For a transition (st,κt,at,rt,ct,st+1,κt+1), let at+1∼πϕ(⋅∣st+1,κt+1) be an action sampled from the current policy. The targets are

yr=rt+γrmini∈{1,2}Qr,i(st+1,κt+1,at+1;θr−),ycu=ct+γcumaxi∈{1,2}Qcu,i(st+1,κt+1,at+1;θcu−), u∈{L,S}.(3)

θ− denotes the target network. Each head is trained by a temporal difference (TD) loss:

ℒf = E(st,κt,at)∼𝒟[(Qf(st,κt,at;θf)−yf)2], f∈{r,cL,cS}.(4)

We jointly optimize all critic parameters θ=(ψ,θr,θcL,θcS). Learning multiple tasks simultaneously may lead to gradient conflicts, and we empirically observe such conflicts during training (Appendix A). To address the gradient conflicts across objectives, we apply PCGrad [26] to the gradients on the shared trunk parameters. Specifically, we perform backpropagation for each task loss ℒi separately to compute the corresponding gradients gi. Then, for each head i, we calculate the gradient inner product gi⋅gj with other heads j. If the inner product is positive, the gradient remains unchanged. If the inner product is negative, we project gi onto the orthogonal space of gj (Fig. 3):

gi←gi−gi⋅gj‖gj‖2gj.(5)

images

Figure 3: Multi-task gradients and PCGrad. (a) Gradient conflicts arise between tasks i and j as well as i and k, while the gradients of tasks j and k do not conflict; (b–d) Illustrate the PCGrad for each task.

Intuitively, when the gradients of two heads conflict, we subtract the conflicting component from the original gradient. After completing the gradient projection for all heads, we average the corrected gradients to obtain the joint gradient used for updating the shared trunk.

3.3 Multi-Scale Reweighting

Offline safe reinforcement learning is often limited by data distribution. Under strict constraints, few trajectories satisfy the safety requirements, resulting in an imbalanced training distribution. ARMOR addresses this issue by reshaping the training distribution with multi-scale weights, including trajectory-level weighting to correct behavioral bias and transition-level weighting to emphasize safe transitions near the constraint boundary.

Trajectory-level debiasing. For each trajectory τi∈𝒟, we compute its cumulative cost Ci=∑t∈τict and assign

witraj = 1+β⋅I[Ci≤d],(6)

where β>0 controls how strongly the training emphasizes low-cost trajectories, and I[⋅] denotes the indicator function that equals 1 if the condition in the brackets is satisfied. All transitions within τi inherit the same witraj, which biases learning toward feasible behaviors without discarding unsafe trajectories, thus preserving diversity and coverage.

The trajectory weight can be regarded as an importance shift toward feasible trajectories. Let ν(τ) denote the dataset distribution over trajectories in 𝒟. Define the feasible set F={τ:C(τ)≤d}, and let p=ν(F), where p≪1 typically holds under data imbalance. With trajectory-level debiasing, define the weight function wtraj(τ)=1+βI[τ∈F] and the induced reweighted distribution νβ(τ)=(wtraj(τ)ν(τ))/Eτ∼ν[wtraj(τ)]. Then the probability mass of feasible trajectories under νβ is νβ(F) =((1+β)p)/((1+β)p+(1−p)), which is strictly increasing in β for 0<p<1. Moreover, letting C¯F=Eτ∼ν[C(τ)∣τ∈F] and C¯U=Eτ∼ν[C(τ)∣τ∉F], we obtain

Eτ∼νβ[C(τ)]=(1+β)pC¯F+(1−p)C¯U(1+β)p+(1−p),ddβEτ∼νβ[C(τ)]=p(1−p)(C¯F−C¯U)((1+β)p+(1−p))2.(7)

Since C¯F<C¯U, increasing β strictly decreases Eτ∼νβ[C(τ)], i.e., Eq. (6) performs an importance shift toward safe behaviors.

Counterfactual transition reweighting. It is not possible to judge the quality of all transitions based only on trajectory cost, since high-cost trajectories may also contain critical decision transitions. We therefore introduce a transition-level counterfactual weight that highlights critical safe transitions near the risk boundary, where the observed action is safe, but small local perturbations could lead to constraint violations.

For each transition (st,at,κt), we sample K perturbed actions at(k)=clip(at+ϵ(k),amin,amax) with ϵ(k)∼𝒩(0,σ2I), and evaluate them using the short-term cost critic. We define a counterfactual safety margin

mt=[maxk=1,…,KQcS(st,κt,at(k))−κt]+⋅I[QcS(st,κt,at)≤κt],(8)

where [⋅]+ denotes the positive-part operator. This margin becomes large when the current action is predicted safe but nearby actions can be unsafe. We then map mt to a bounded weight via normalization and clipping:

wtcf = min(1+mtE[m]+ε,wmax),(9)

where ε is a constant used to avoid division by zero. E[m] is estimated as the mean safety margin over a mini-batch, i.e., E[m]≈1|B|∑t∈Bmt, where B is a mini-batch of transitions sampled from the dataset 𝒟. Intuitively, for an action that is considered safe, if a small perturbation leads to a constraint violation, it indicates that the transition is near the risk boundary and the current action has successfully avoided the risk. In this case, we assign a larger value to wtcf, guiding the algorithm to focus on learning state-action pairs that are near the boundary but still safe.

The final weight is obtained by combining the two scales after normalization and clipping:

wt = clip(witrajwtcfE[witrajwtcf]+ε,wmin,wmax),t∈τi.(10)

These weights are used to reweight the actor’s behavior cloning loss, explicitly injecting safety-aware data preference from the distribution side rather than only relying on penalties in the objective.

Since wtcf relies on an evolving cost critic, we enable multi-scale reweighting after a warmup phase and periodically refresh the weights, while applying exponential moving average (EMA) smoothing to reduce weight oscillations.

3.4 Conditional Diffusion Actor

Diffusion models exhibit strong expressive capacity for modeling complex data distributions. ARMOR parameterizes the actor as a conditional diffusion policy. This formulation provides a unified framework for weighted behavior cloning, policy improvement based on the critic, and explicit safety constraint. Specifically, the policy generates actions at conditioned on (st,κt).

We follow the DDPM [27] on the action space. Let at0 denote the dataset action and define a forward noising process

q(atn∣at0)=𝒩(atn;α¯nat0,(1−α¯n)I),(11)

where n=1,…,N, {αn}n=1N is the noise schedule, and α¯n=∏i=1nαi. The denoising process is parameterized by a conditional distribution:

pϕ(atn−1∣atn,st,κt)=𝒩(atn−1;μϕ(atn,st,κt,n),Σϕ(atn,st,κt,n)),(12)

where ϕ represents the parameters of the policy network. On the offline dataset 𝒟, the diffusion policy learns the behavior distribution through noise regression, attempting to recover the actions as accurately as possible given (st,κt). Let ϵ^ϕ(atn,st,κt,n) be the network’s predicted noise, and ϵ∼𝒩(0,I) be the true noise. The weighted diffusion behavior cloning loss is given by

ℒBC(ϕ)=E(st,at0,κt)∼𝒟[wt‖ϵ^ϕ(atn,st,κt,n)−ϵ‖22],(13)

where wt is the weights used to emphasize safe samples. This term ensures that the policy update does not deviate from the distribution supported by the offline data, providing a optimization foundation for subsequent value guidance and safety constraints.

Behavior cloning typically fails to produce a policy that outperforms the dataset. To address this, we introduce a policy improvement objective based on the reward value function. Specifically, we sample actions atπ∼πϕ(⋅∣st,κt) from the current diffusion policy and use the reward critic Qr(st,κt,atπ) to drive the policy towards higher return actions.

At the same time, policy improvement must satisfy cost constraints. We use the long-term cost critic QcL(st,κt,atπ) to evaluate the expected cost of the action, with the cost threshold κt serving as the upper bound for the feasible domain.

Considering all of the above, the optimization of the actor can be formulated as a constrained problem:

minϕℒBC(ϕ)−ηE(st,κt)∼𝒟,atπ∼πϕ[Qr(st,κt,atπ)]s.t.E(st,κt)∼𝒟,atπ∼πϕ[QcL(st,κt,atπ)]≤κt,(14)

where η>0 is the reward guidance coefficient. This formulation clearly captures the optimization logic of the proposed method by maximizing reward while maintaining diffusion-based behavior cloning constraints and ensuring cost feasibility. The constraint QcL(st,κt,atπ)≤κt can be interpreted as a practical surrogate for the original CMDP cost bound. The intuition is that QcL estimates the future cumulative cost, while κt represents the remaining cost budget. Therefore, enforcing QcL(st,κt,atπ)≤κt is directly analogous to enforcing that the cumulative cost should remain below the cost limit. Under accurate cost estimation and sufficient data coverage, this surrogate is aligned with controlling the cumulative cost along the trajectory.

We transform this constrained problem into an unconstrained optimization by applying Lagrangian relaxation. By introducing the dual variable λ(κt), we obtain the actor objective:

ℒ(ϕ)=ℒBC(ϕ)−ηE[Qr(st,κt,atπ)]+E[λ(κt)(QcL(st,κt,atπ)−κt)],(15)

where λ(⋅) is parameterized by a small network. In this minimization objective, when the predicted cost exceeds the threshold, the penalty term pushes the policy away from high-risk actions. When the constraint is satisfied, the penalty weakens, allowing reward-driven policy improvement to take effect. The dual network is updated through approximate dual ascent, and its loss is given by:

ℒλ=−E(st,κt)∼𝒟,atπ∼πϕ[λ(κt)(QcL(st,κt,atπ)−κt)].(16)

Eqs. (15) and (16) define a standard primal–dual optimization of a Lagrangian relaxation. The actor minimizes a behavior-regularized objective with reward guidance and a Lagrangian penalty, while the dual variable is updated by approximate dual ascent. The multi-scale weights are normalized and clipped to prevent extreme gradient amplification. These design choices empirically stabilize training and are consistent with common convergence conditions for stochastic primal-dual methods such as bounded stochastic gradients and suitably chosen step sizes. We report the evolution of λ(κt) across tasks in Appendix B Fig. A1, which empirically supports the stability of the primal-dual optimization.

3.5 Deployment of ARMOR on Robotic Systems

ARMOR trains a conditional diffusion model that is deployed on a robotic system to enable autonomous control. The robot acquires its current state st through sensors at each time step. The cost threshold κ is set to the predefined cost limit d initially, and is updated after each action based on the incurred cost. The robot uses the current state and cost threshold to determine the action. After executing the action, the robot updates its state and computes the new cost threshold, completing a feedback loop that allows for closed-loop control. Due to the design of ARMOR, this process ensures that the robot balances task performance with safety constraints.

4 Experiments

We evaluate ARMOR on the public benchmarks. In addition to the main comparison against representative baselines, we conduct an ablation study to quantify the contributions of the proposed multi-task critics and multi-scale reweighting, and perform sensitivity analyses on hyperparameters of the diffusion policy and the reweighting scheme. Furthermore, we demonstrate the zero-shot adaptation capability of ARMOR to different cost limits without retraining.

4.1 Environments

We conduct experiments on continuous-control robotic tasks using the public benchmarks Bullet-Safety-Gym [28] and Safety-Gymnasium [29], which are commonly used in previous works.

In Bullet-Safety-Gym, we focus on the Run task with three robot types: Ant, Ball, and Drone. In this task, the agent is rewarded for traversing a corridor between two boundaries at high speed, while crossing the boundaries or exceeding the velocity limit incurs penalties (Fig. 4a).

images

Figure 4: Tasks in Bullet-Safety-Gym and Safety-Gymnasium. (a) Run; (b) AntVelocity; (c) Walker2dVelocity; (d) HalfCheetahVelocity; (e) Circle; (f) Goal; (g) Push; (h) Button.

In Safety-Gymnasium, we consider two groups of tasks: the Velocity group and the Navigation group. In the Velocity group, the Ant, Walker2d, and HalfCheetah robots are used for tasks, where the objective is to maximize forward displacement and a cost is incurred when overspeed (Fig. 4b–d).

The Navigation group includes tasks Circle, Goal, Push, and Button with Point and Car robots (Fig. 4e–h). In Circle, clockwise motion along a circular track yields reward, whereas leaving the boundary-defined safe region produces a cost. In Goal, the agent navigates toward a target location while avoiding traps and preventing collisions with hazardous objects. In Push, the agent must push a box to the goal while steering around obstacles and avoiding traps. In Button, the agent must tap the correct target button among multiple buttons, and entering traps or collisions with moving obstacles trigger a cost.

4.2 Baselines

To provide a comprehensive evaluation, we compare ARMOR with the following offline baselines. This allows us to assess ARMOR from multiple perspectives, including whether it improves over behavior cloning, how it compares with representative Q-learning methods under sparse cost supervision, what additional benefits it brings within generative policy learning, and whether it can surpass data-generation approaches without relying on additional synthesized data.

• Imitation Learning: BC, behavior cloning that imitates trajectories in the datasets.

• Q-Learning-Based Algorithms: BCQL, a Lagrangian-based extension of BCQ [6]; CPQ [10], a Q-learning method that identifies Out Of Distribution (OOD) actions as unsafe and learns policies with safe transitions.

• Generative Modeling Algorithm: FISOR [22], a feasibility-guided method with a diffusion model.

• Data Generation Algorithms: OASIS [17], which employs a conditional diffusion model to generate datasets and guides the data distribution towards a target domain; CCAC [11], which generates and identifies unsafe OOD data to train adaptive safe policies.

4.3 Metrics

Our evaluation metrics include the normalized cost return and the normalized reward return.

Rnormalized=Rπ−RminRmax−Rmin,Cnormalized=Cπd,(17)

where Rπ is the reward return of the policy π, Rmax and Rmin are the maximum and minimum reward returns within the given dataset, respectively. Cπ is the cost return of the policy π, and d is the cost limit. A policy is safe if Cnormalized≤1, and we pursue a higher reward return under this constraint.

According to difficulty, the cost limit is set to 5 for all Bullet-Safety-Gym tasks. For Safety-Gymnasium, Velocity tasks have their cost limits set to 10, and other tasks are set to 20. Table 2 lists the key hyperparameters. For a fair comparison, the common hyperparameters are kept consistent across methods, while method-specific hyperparameters are set according to the official implementations. More implementation details can be found in Appendix B.

images

4.4 Overall Performance

Table 3 reports the normalized return and normalized cost of all methods across three task groups.

images

Overall, ARMOR satisfies the cost constraint in most environments while attaining competitive returns. Furthermore, it demonstrates a distinct advantage in more challenging scenarios, where the environment is more complex. This outcome is primarily attributed to the reweighting mechanism, which reshapes the training distribution to emphasize safer and more informative transitions, aligning critic evaluation and diffusion policy learning. In addition, the shared representation provides a common feature basis for multiple critic heads, reducing reliance on sparse cost signals and improving the stability and accuracy of value estimation. For tasks with explicit goals, we further report success rate alongside normalized cost in Appendix C (Table A2).

BCQL employs a Lagrangian approach to balance performance and safety, whereas CPQ adopts a conservative update rule and updates the value function only on state action pairs classified as safe. However, neither method explicitly addresses distribution bias in offline data. When the proportion of safe samples in the dataset is low, the available supervisory signal becomes insufficient, which hinders learning policies that satisfy safety constraints while maintaining performance. OASIS synthesizes training data using reward and cost models, an inverse dynamics model, and a conditional diffusion generator. Since this pipeline involves multiple learned components, modeling errors introduced at any stage can propagate through subsequent data generation and policy optimization, which can degrade the resulting policy and increase the likelihood of constraint violations. CCAC updates the cost critic conservatively using augmented data to improve the reliability of constraint estimation, but it does not account for the potential sparsity of cost signals. When cost is sparse, the cost critic may fail to converge to an accurate estimate, which weakens constraint guidance during policy optimization and limits the ability to provide consistent safety guarantees. FISOR tends to enforce feasibility more strictly, yet in some tasks it yields very low even negative normalized return, suggesting that it may converge to overly conservative policies that are undesirable in offline safe RL.

We observe that ARMOR does not satisfy the cost constraint on CarButton1. Several baselines also violate the constraint on this environment, indicating that the dataset coverage and the sharp constraint boundary make it challenging for offline methods. While FISOR satisfies the constraint on CarButton1, its normalized return is negative, implying an extremely conservative policy. This behavior is not aligned with the goal of offline safe RL, which seeks both feasibility and high utility.

4.5 Ablation Study

We conduct ablation experiments to evaluate the effectiveness of each component in ARMOR. We consider the following variants: (i) w/o Reweighting: removing both trajectory-level and counterfactual reweighting by setting wt≡1; (ii) w/o multi-critic: replacing the multi-task critic with three independent critics and disabling PCGrad; (iii) w/o multi-critic+reweighting: keeping only the conditional diffusion actor and removing both the multi-task critic and reweighting; (iv) w/o diffusion: replacing the diffusion actor with a simple Gaussian policy; (v) w/o warmup+EMA: disabling the reweighting warmup and the exponential moving average used to smooth counterfactual weights; (vi) w/o PCGrad: removing PCGrad in the shared critic trunk; (vii) w/o multi-cost-critic: replacing the multi-term cost critic with a single-head cost critic that has multi-horizon targets; (viii) armor: the full method.

Fig. 5 summarizes the ablation results on Navigation tasks. Since offline safe RL prioritizes constraint satisfaction before return maximization, we primarily analyze the normalized cost. armor achieves the lowest cost on three of four tasks and maintains competitive returns. Removing multi-scale reweighting leads to higher normalized cost, indicating that reshaping the training distribution is crucial for safety under offline data limitations. Ablating the multi-task critic also harms feasibility, suggesting that jointly learning reward and multi-horizon costs with conflict-aware optimization produces more reliable value estimates for policy learning. Removing both the multi-task critic and reweighting results in poorer performance in constraint satisfaction, highlighting that conditional generation alone is insufficient. Replacing the diffusion actor with a simple Gaussian policy causes constraint violation in most tasks, showing that the representation capacity of the diffusion model is indispensable. Disabling the reweighting warm-up and EMA tends to cost increases due to inaccurate value estimation during the initial training phase and unstable weights. Removing PCGrad and multi-term cost critic degrades constraint satisfaction, demonstrating the need to mitigate gradient interference in the shared critic and to capture both long-term and short-term safety signals.

images

Figure 5: Ablations on Navigation tasks. The dashed line represents the normalized cost limit.

4.6 Hyperparameter Choices

We study the sensitivity of ARMOR to three important hyperparameters: the diffusion denoising steps N, the clipping range [wmin,wmax] used in multi-scale reweighting and the discount factor γcS for the short-term cost critic.

The denoising steps N control the granularity of the reverse diffusion process and thus affect both the expressiveness of the action generator and the inference cost. We sweep N∈{10,20,30} and plot the training curves of normalized return and normalized cost in Fig. 6. We observe that larger N typically yields smoother curves with reduced variance, consistent with the intuition that finer denoising improves sampling stability. However, larger N increases inference time. Considering the trade-off between performance and runtime, we use N=20 as the default setting in all main experiments.

images

Figure 6: Effect of diffusion denoising step N during training. We report mean ± std across 3 seeds for (a) normalized return and (b) normalized cost.

We evaluate three clipping ranges for weights: [0.9,1.5], [0.8,2.0], and [0.6,2.5]. A smaller lower bound suppresses gradients from low-confidence transitions more aggressively, but may reduce effective coverage and harm generalization. A larger upper bound emphasizes high-weight transitions, but may increase training instability and overfitting risk. Fig. 7 shows that the three ranges lead to broadly similar learning curves, suggesting that ARMOR is not overly sensitive to moderate clipping changes. We further demonstrated the performance of different clipping ranges during evaluation (Table 4), and selected [0.8,2.0] as the robust default range.

images

Figure 7: Effect of the weight clipping range [wmin,wmax] during training. We report mean ± std across 3 seeds for (a) normalized return and (b) normalized cost.

images

The short-term cost discount γcS sets the temporal scope of the short-term cost critic QcS, and consequently controls how counterfactual reasoning identifies near-boundary transitions for reweighting. When γcS is too small, the short-term cost target approaches to the immediate cost. Since costs are typically sparse and the cost threshold κt aggregates future costs, the counterfactual margin maxk=1,…,KQcS(st,κt,at(k))−κt becomes positive less frequently, making the counterfactual weights degenerate toward 1. In contrast, when γcS is too large, QcS becomes less specialized for imminent risk, which blurs the risk boundary and weakens the emphasis on critical safe transitions. We sweep γcS∈{0.1,0.3,0.5,0.7} and report the average normalized reward and normalized cost in Table 5. γcS=0.3 yields a strong reward-cost balance across tasks, achieving best reward while maintaining competitive normalized cost. Because of this, we use γcS=0.3 as the default setting.

images

4.7 Zero-Shot Adaptation

Another advantage of our approach is that it can adapt to different cost limits without retraining. In ARMOR, the cost limit is explicitly used when constructing the trajectory-level weights, so training with different limits induces policies with distinct safety-performance preferences. To examine generalization under changing constraints, we train two policies with cost limits of 15 and 30, and evaluate each policy under multiple limits {15,20,25,30}, which cover both tightened and relaxed constraints. As shown in Fig. 8, ARMOR exhibits zero-shot adaptability when the constraint is relaxed. Even when the constraint is tightened, it maintains safety on most tasks, indicating robust safety control without fine-tuning. This performance arises from treating the dynamically updated constraint as an explicit input during training, which enables zero-shot adaptation to new cost limits. We further compare against CCAC trained with a cost limit of 15. While CCAC can generalize in some simpler tasks, it struggles on more challenging tasks and fails to achieve safe decision-making even under relaxed limits.

images

Figure 8: Zero-shot adaptation across cost limits. The dashed line represents the cost limit.

5 Conclusions

We propose ARMOR, a conditioned diffusion policy augmented with multi-scale reweighting and a multi-task critic. ARMOR learns a shared representation via the multi-task critic to enable reliable value estimation. In addition, multi-scale reweighting is introduced to the conditional diffusion policy objective, which injects safety constraints from the data distribution. Experiments demonstrate that ARMOR achieves strong performance under cost constraints across multiple continuous-control robotics tasks. A practical limitation of ARMOR is its real-time inference overhead (Appendix C, Table A3), since diffusion-based action generation requires multi-step denoising for each decision. A further limitation is that the current study assumes accurate reward and cost signals, and does not consider noisy supervision that may arise in practical deployment. Future work could focus on accelerating inference and improving robustness to noisy feedback, which is essential for deploying ARMOR in real-world robotic systems.

Acknowledgement: The authors sincerely thank all those who supported to this research.

Funding Statement: This work was supported by the Joint Fund for Regional Innovation and Development of the National Natural Science Foundation of China (No. U22A20167) and the Special Project for Guiding the Transformation of Scientific and Technological Achievements in Shanxi Province (No. 202404021301033).

Author Contributions: The authors confirm contribution to the paper as follows: methodology, Chengjing Li; software, Chengjing Li; validation, Xiaoyan Zhao; investigation, Chengjing Li; data curation, Xiaoyan Zhao; writing—original draft preparation, Chengjing Li; writing—review and editing, Li Wang and Xiaoyan Zhao; visualization, Chengjing Li; supervision, Li Wang. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

Appendix A Gradient Conflict Statistics

To examine whether gradient conflict is common, we measure the cosine similarity between the gradients induced by the three critic heads. At each training step, we compute cosine similarities for each of the three head pairs and count the number of pairs whose cosine similarity is negative. Table A1 reports the fraction of conflicting pairs for each task. The results indicate that gradient conflict is non-negligible. Gradient conflict occurs in roughly half of the training steps, and the case of two conflicting pairs arises frequently. This pattern is expected because the critic optimizes one reward objective and two cost objectives, while reward-driven gradients often compete with cost-driven gradients. Overall, these statistics support the motivation for applying PCGrad in the shared critic trunk to mitigate destructive interference among reward and multi-term cost learning.

images

Figure A1: Evolution of the dual variable λ(κt) during training.

Appendix B Implementation Details

The diffusion actor uses an MLP denoiser conditioned on state st, cost threshold κt, and diffusion timestep n. The cost condition embedding uses a 3-layer MLP, and the timestep uses sinusoidal positional embedding. The denoiser takes the concatenation of the noisy action, the state, and the two embeddings as input, then applies a 3-layer MLP trunk of width 256 with Mish activations, followed by a linear projection to the action dimension. The actor is optimized using Adam with learning rate 1×10−4.

We employ a multi-task critic with a shared MLP trunk and three task-specific heads. The critic input concatenates the state, the cost threshold, and the action [st;κt;at]. The shared trunk is a 2-layer MLP with hidden sizes 256 and Mish activations. Each task head is a 2-layer MLP with hidden size 128 and Mish activations, outputting a scalar Q-value. We use a target critic to soft update at every training step: θtarg←τθ+(1−τ)θtarg, where τ=0.005. The critic is optimized using Adam with learning rate 1×10−3.

The dual variable is implemented as a MLP that takes the cost threshold κt as input. To ensure λ(κ)≥0, we treat the network output as log⁡λ and compute λ=exp⁡(clip(log⁡λ,−20,5)). The MLP has two hidden layers of width 256 with ReLU activations. We use Adam with learning rate 1×10−4. The dual network is optimized jointly with the actor and is updated once per training step.

We report the evolution of the learned dual variable λ across tasks in Fig. A1. As described in Eqs. (15) and (16), λ increases when the predicted long-term cost exceeds the threshold and decreases otherwise, thereby adjusting the penalty strength. The curves quickly increase in early training and then stabilize, which empirically supports the numerical stability of the primal-dual optimization.

Appendix C More Experiment Results

For tasks with explicit goals, such as Goal, Push, and Button, we report success rate together with normalized cost to assess performance under safety constraints. Success is defined by the environment-provided success signal for reaching the goal, and the success rate is computed over 20 evaluation episodes. As shown in Table A2, all results are reported as the mean ± std across 3 seeds. Overall, the conclusions drawn from success rate are broadly consistent with those based on normalized return in Table 3. ARMOR achieves high success rates while satisfying the safety constraint on most tasks, demonstrating a favorable safety-performance trade-off.

images

Table A3 reports the inference time of ARMOR and Q-learning style offline safe RL baselines, measured on an NVIDIA RTX 3090 GPU. ARMOR exhibits the largest latency among the compared methods. This overhead is expected because the actor is a conditional diffusion policy and requires multiple denoising steps at test time. In contrast, Q-learning style baselines select actions with a single forward pass through the policy. Despite this overhead, the per-action latency is still on the order of 10−2 seconds, suggesting that ARMOR is still feasible for moderate control frequencies in real-world robotics.

images

References

1. Saeidi H, Opfermann JD, Kam M, Wei S, Léonard S, Hsieh MH, et al. Autonomous robotic laparoscopic surgery for intestinal anastomosis. Sci Robot. 2022;7(62):eabj2908. doi:10.1126/scirobotics.abj2908. [Google Scholar] [PubMed] [CrossRef]

2. Wu J, Huang Y, Lai Y, Yang S, Zhang C. Obstacle avoidance inspection method of cable tunnel for quadruped robot based on particle swarm algorithm and neural network. Sci Rep. 2025;15(1):36065. doi:10.1038/s41598-025-19903-w. [Google Scholar] [PubMed] [CrossRef]

3. Nie J, Zhang G, Lu X, Wang H, Sheng C, Sun L. Obstacle avoidance method based on reinforcement learning dual-layer decision model for AGV with visual perception. Control Eng Pract. 2024;153(8):106121. doi:10.1016/j.conengprac.2024.106121. [Google Scholar] [CrossRef]

4. Radosavovic I, Xiao T, Zhang B, Darrell T, Malik J, Sreenath K. Real-world humanoid locomotion with reinforcement learning. Sci Robot. 2024;9(89):eadi9579. doi:10.1126/scirobotics.adi9579. [Google Scholar] [PubMed] [CrossRef]

5. Liu Z, Guo Z, Lin H, Yao Y, Zhu J, Cen Z, et al. Datasets and benchmarks for offline safe reinforcement learning. J Data-Centric Mach Learn Res. 2024;1(12):1–29. doi:10.52202/079017-2494. [Google Scholar] [CrossRef]

6. Fujimoto S, Meger D, Precup D. Off-policy deep reinforcement learning without exploration. In: Proceedings of the 36th International Conference on Machine Learning; 2019 Jun 9–15; Long Beach, California, USA. p. 2052–62. [Google Scholar]

7. Kumar A, Zhou A, Tucker G, Levine S. Conservative q-learning for offline reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems; 2020 Dec 6–12; virtual. p. 1179–91. [Google Scholar]

8. Hong K, Li Y, Tewari A. A primal-dual-critic algorithm for offline constrained reinforcement learning. In: Proceedings of the 27th International Conference on Artificial Intelligence and Statistics; 2024 May 2–4; Palau de Congressos, Valencia, Spain. p. 280–8. [Google Scholar]

9. Polosky N, Da Silva BC, Fiterau M, Jagannath J. Constrained offline policy optimization. In: Proceedings of the 39th International Conference on Machine Learning; 2022 Jul 17-23; Baltimore, MD, USA. p. 17801–10. [Google Scholar]

10. Xu H, Zhan X, Zhu X. Constraints penalized q-learning for safe offline reinforcement learning. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence; 2022 Feb 22–Mar 1; virtual. p. 8753–60. doi:10.1609/aaai.v36i8.20855. [Google Scholar] [CrossRef]

11. Guo Z, Zhou W, Wang S, Li W. Constraint-conditioned actor-critic for offline safe reinforcement learning. In: Proceedings of the 13th International Conference on Learning Representations; 2025 Apr 24–28; Singapore. [Google Scholar]

12. Lee J, Paduraru C, Mankowitz DJ, Heess N, Precup D, Kim KE, et al. COptiDICE: offline constrained reinforcement learning via stationary distribution correction estimation. In: Proceedings of the 10th International Conference on Learning Representations; 2022 Apr 25–29; virtual. [Google Scholar]

13. Guan J, Chen G, Ji J, Yang L, Zhou A, Li Z, et al. Voce: variational optimization with conservative estimation for offline safe reinforcement learning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems; 2023 Dec 10–16; New Orleans, LA, USA. p. 33758–80. [Google Scholar]

14. Lee J, Yun S, Yun T, Park J. GTA: generative trajectory augmentation with guidance for offline reinforcement learning. In: Proceedings of the 38th International Conference on Neural Information Processing Systems; 2024 Dec 10–15; Vancouver, BC, Canada. p. 56766–801. [Google Scholar]

15. Liang Z, Mu Y, Ding M, Ni F, Tomizuka M, Luo P. AdaptDiffuser: Diffusion models as adaptive self-evolving planners. In: Proceedings of the 40th International Conference on Machine Learning; 2023 Jul 23-29; Honolulu, HI, USA. p. 20725–45. [Google Scholar]

16. Gong Z, Kumar A, Varakantham P. Offline safe reinforcement learning using trajectory classification. In: Proceedings of the 39th AAAI Conference on Artificial Intelligence; 2025 Feb 25–Mar 4; Philadelphia, PA, USA. p. 16880–7. doi:10.1609/aaai.v39i16.33855. [Google Scholar] [CrossRef]

17. Yao Y, Cen Z, Ding W, Lin H, Liu S, Zhang T, et al. OASIS: conditional distribution shaping for offline safe reinforcement learning. In: Proceedings of the 38th International Conference on Neural Information Processing Systems; 2024 Dec 10–15; Vancouver, BC, Canada. p. 78451–78. doi:10.52202/079017-2494. [Google Scholar] [CrossRef]

18. Xiao W, Wang TH, Gan C, Hasani R, Lechner M, Rus D. Safediffuser: safe planning with diffusion probabilistic models. In: Proceedings of the 11th International Conference on Learning Representations; 2023 May 1–5; Kigali, Rwanda. [Google Scholar]

19. Ajay A, Du Y, Gupta A, Tenenbaum JB, Jaakkola TS, Agrawal P. Is conditional generative modeling all you need for decision-making? In: Proceedings of the 11th International Conference on Learning Representations; 2023 May 1–5; Kigali, Rwanda. [Google Scholar]

20. Chi C, Xu Z, Feng S, Cousineau E, Du Y, Burchfiel B, et al. Diffusion policy: visuomotor policy learning via action diffusion. Int J Robot Res. 2025;44(10–11):1684–704. doi:10.1177/02783649241273668. [Google Scholar] [CrossRef]

21. Lin Q, Tang B, Wu Z, Yu C, Mao S, Xie Q, et al. Safe offline reinforcement learning with real-time budget constraints. In: Proceedings of the 40th International Conference on Machine Learning; 2023 Jul 23–29; Honolulu, HI, USA. p. 21127–52. [Google Scholar]

22. Zheng Y, Li J, Yu D, Yang Y, Li SE, Zhan X, et al. Safe offline reinforcement learning with feasibility-guided diffusion model. In: Proceedings of the 12th International Conference on Learning Representations; 2024 May 7–11; Vienna, Austria. [Google Scholar]

23. Ha T, Cha H, Ji D. CDP: constrained diffusion policies with mirror diffusion model for safety-assured imitation learning. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems; 2025 Oct 19–25; Hangzhou, China. p. 9838–45. doi:10.1109/IROS60139.2025.11246518. [Google Scholar] [CrossRef]

24. Huang X, Wang X, Cheng Y. Uncertainty-based alternative diffusion policy for safe autonomous driving. IEEE Trans Intell Transp Syst. 2025;26(11):18854–63. doi:10.1109/TITS.2025.3587341. [Google Scholar] [CrossRef]

25. Hasselt H. Double Q-learning. In: Proceedings of the 24th International Conference on Neural Information Processing Systems; 2010 Dec 6–9; Vancouver, BC, Canada. [Google Scholar]

26. Yu T, Kumar S, Gupta A, Levine S, Hausman K, Finn C. Gradient surgery for multi-task learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems; 2020 Dec 6–12; virtual. p. 5824–36. [Google Scholar]

27. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In: Proceedings of the 34th International Conference on Neural Information Processing Systems; 2020 Dec 6–12; virtual. p. 6840–51. [Google Scholar]

28. Gronauer S. Bullet-Safety-Gym: a framework for constrained reinforcement learning. In: AAMAS ’24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent System. New York, NY, USA: The Association for Computing Machinery (ACM); 2022. doi:10.14459/2022md1639974. [Google Scholar] [CrossRef]

29. Ji J, Zhou J, Zhang B, Dai J, Pan X, Sun R, et al. Omnisafe: an infrastructure for accelerating safe reinforcement learning research. J Mach Learn Res. 2024;25(285):1–6. [Google Scholar]

Cite This Article

APA Style

Li, C., Wang, L., Zhao, X. (2026). Safe Robot Control through Multi-Task Offline Reinforcement Learning with Multi-Scale Distribution Debiasing. Computers, Materials & Continua, 88(1), 67. https://doi.org/10.32604/cmc.2026.079959

Vancouver Style

Li C, Wang L, Zhao X. Safe Robot Control through Multi-Task Offline Reinforcement Learning with Multi-Scale Distribution Debiasing. Comput Mater Contin. 2026;88(1):67. https://doi.org/10.32604/cmc.2026.079959

IEEE Style

C. Li, L. Wang, and X. Zhao, “Safe Robot Control through Multi-Task Offline Reinforcement Learning with Multi-Scale Distribution Debiasing,” Comput. Mater. Contin., vol. 88, no. 1, pp. 67, 2026. https://doi.org/10.32604/cmc.2026.079959

BibTex EndNote RIS

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table of Content

Safe Robot Control through Multi-Task Offline Reinforcement Learning with Multi-Scale Distribution Debiasing

Abstract

Keywords

References

Cite This Article

782

192

0

Related articles

Further Information

Guidelines

Follow Us

Join Us

Contact Us

WhatsApp:

Share Link