Open Access

ARTICLE

Actor–Critic Trajectory Controller with Optimal Design for Nonlinear Robotic Systems

Nien-Tsu Hu1,*, Hsiang-Tung Kao1, Chin-Sheng Chen1, Shih-Hao Chang2

1 Graduate Institute of Automation Technology, National Taipei University of Technology, Taipei, 10608, Taiwan
2 Computer Science and Information Engineering, National Taipei University of Technology, Taipei, 10608, Taiwan

* Corresponding Author: Nien-Tsu Hu. Email: email

Computers, Materials & Continua 2026, 87(1), 83 https://doi.org/10.32604/cmc.2025.074993

Abstract

Trajectory tracking for nonlinear robotic systems remains a fundamental yet challenging problem in control engineering, particularly when both precision and efficiency must be ensured. Conventional control methods are often effective for stabilization but may not directly optimize long-term performance. To address this limitation, this study develops an integrated framework that combines optimal control principles with reinforcement learning for a single-link robotic manipulator. The proposed scheme adopts an actor–critic structure, where the critic network approximates the value function associated with the Hamilton–Jacobi–Bellman equation, and the actor network generates near-optimal control signals in real time. This dual adaptation enables the controller to refine its policy online without explicit system knowledge. Stability of the closed-loop system is analyzed through Lyapunov theory, ensuring boundedness of the tracking error. Numerical simulations on the single-link manipulator demonstrate that the method achieves accurate trajectory following while maintaining low control effort. The results further show that the actor–critic learning mechanism accelerates convergence of the control policy compared with conventional optimization-based strategies. This work highlights the potential of reinforcement learning integrated with optimal control for robotic manipulators and provides a foundation for future extensions to more complex multi-degree-of-freedom systems. The proposed controller is further validated in a physics-based virtual Gazebo environment, demonstrating stable adaptation and real-time feasibility.

Keywords

Reinforcement learning; optimal control; actor–critic algorithm; trajectory tracking; nonlinear systems; robotic manipulator

1  Introduction

In recent years, intelligent and optimized methodologies for nonlinear system control have drawn significant attention. Owing to the complexity of nonlinear dynamics, traditional linear control approaches are often inadequate in practical applications. To address unknown dynamics and uncertainties, artificial neural networks (ANNs), especially radial basis function neural networks (RBFNNs), have been widely used for nonlinear system modelling and controller design. With Lyapunov-based adaptation schemes, such RBFNN-based controllers can guarantee that all closed-loop signals are semi-globally uniformly ultimately bounded (SGUUB) [1].

On the other hand, the mathematical foundation of optimal control is based on the Hamilton–Jacobi–Bellman (HJB) equation, which is generally intractable for nonlinear systems. Approximation methods such as Galerkin schemes have been developed to provide feasible solutions, laying the groundwork for the integration of adaptive dynamic programming (ADP) and reinforcement learning (RL) into nonlinear control [2]. From a theoretical standpoint, the foundations of nonlinear analysis and stability are systematically presented in Khalil’s Nonlinear Systems, which covers Lyapunov stability, backstepping, and input–output stability [3]. Similarly, the comprehensive framework of optimal control, extending from linear quadratic regulation (LQR) to nonlinear HJB approaches, is well established in Lewis’s Optimal Control [4].

More recently, advanced RL-based methods have been introduced for robotic systems subject to complex constraints. For instance, Peng et al. developed an event-sampled critic learning strategy to solve the optimal tracking control problem of motion-constrained robot manipulators, reducing computational burden while ensuring stability [5]. Cheng and Dong proposed a single-critic neural network approach for robotic manipulators with input saturation and disturbances, with stability guaranteed through Lyapunov theory [6]. Hu et al. further extended actor–critic network designs to handle manipulators with dynamic disturbances, ensuring SGUUB stability and accurate trajectory tracking [7].

In addition, Wen et al. proposed optimized adaptive tracking and backstepping frameworks that integrate actor–critic RL into nonlinear control. In these works, both virtual and actual control inputs are designed as optimal solutions at each recursive step, and the resulting controllers simultaneously achieve trajectory tracking and performance optimization [8,9]. In the broader context of multi-agent systems (MASs), Wen et al. further introduced an identifier–actor–critic algorithm based on fuzzy logic systems for MASs with unknown nonlinear dynamics, addressing state coupling and ensuring stability under Lyapunov analysis [10]. These representative studies highlight how filtered or composite errors can be exploited to structure the optimal tracking problem in both single-system and multi-agent settings.

Beyond these developments, many tracking designs employ the reference error to streamline stability analysis and shape tracking dynamics [11]. For transient and steady-state guarantees, prescribed-performance control (PPC) provides Lyapunov-based performance envelopes that accommodate uncertainties and actuator limitations in robotic manipulators; recent works further strengthen PPC with Barrier Lyapunov Function (BLF)-based designs, assigned/appointed settling time, and neural adaptations [12–14]. On the optimal-learning side, actor–critic reinforcement learning has been applied to robotic manipulators to enhance practical tracking performance with stability-oriented designs, and RL-based controllers have also been developed for nonlinear systems such as single-link manipulators [15,16]. Moreover, model-based actor–critic learning has been exploited for robotic impedance control in complex interactive environments, where the actor–critic framework is used to learn impedance behavior for safe and efficient robot–environment interaction [17].

In parallel, integral and off-policy reinforcement learning techniques have been investigated to address optimal tracking under disturbances, input constraints, and model uncertainty. For perturbed discrete-time systems, H∞ tracking control frameworks based on On/Off policy Q-learning have been proposed, where model-free Q-learning algorithms are shown to guarantee convergence of the learned policy in the presence of external disturbances [18]. For perturbed bilateral teleoperators with variable time delay, nonlinear RISE-based integral RL algorithms have been developed by combining robust integral action with actor–critic structures to achieve both optimal performance and coordination tracking [19]. Off-policy integral RL-based optimal tracking schemes have also been proposed for nonzero-sum game systems with unknown dynamics, further relaxing model requirements while maintaining stability and disturbance attenuation [20].

Together, these contributions indicate that the integration of RBFNNs, Lyapunov stability analysis, HJB approximations, and actor–critic reinforcement learning provides an effective pathway for solving nonlinear optimal tracking problems. At the same time, most existing RL-based optimal tracking schemes still exhibit several limitations: many require partial model knowledge or rely on persistence of excitation (PE)-type conditions to guarantee convergence of the actor–critic networks, and the transient dynamics are usually determined implicitly by Lyapunov designs rather than by an explicitly tunable reference-error structure. In addition, the interaction between optimal cost design, actuator constraints, and reference-error shaping has not been fully explored in the context of nonlinear robotic systems.

To clarify how the proposed scheme is tailored to address the above gaps, we explicitly match each limitation with a corresponding design element of our method.

(i)   Reliance on partial model knowledge: many RL/HJB/ADP controllers still require certain model components for policy/value formulation and stability analysis; our approach integrates an online actor–critic structure with RBFNN approximation to reduce dependence on precise model parameters while retaining Lyapunov-based stability analysis (see Sections 3 and 4).

(ii)   Persistence of excitation (PE) requirements: theoretical convergence of adaptive/learning laws typically relies on PE; we adopt a standard PE condition for weight convergence analysis and provide implementation-oriented discussions/experiments to illustrate stable learning under practical excitation levels (Assumption 2 in Section 3.2; results in Sections 5 and 6).

(iii)   Implicit transient tuning: transient behavior is often tuned implicitly through multiple gains without a clear performance knob; we introduce a structured reference-error design with an explicit gain to shape tracking dynamics in a transparent manner, while preserving the Lyapunov proof (see Sections 2 and 4; parameter study in Section 5).

(iv)   Lack of a structured reference-error mechanism in optimal learning designs: by embedding the reference-error into the learning/control formulation, the proposed method establishes a one-to-one link between tracking-dynamics shaping and optimal-learning updates, which is validated in both numerical simulations and physics-based environments (see Sections 2, 3, 5 and 6).

Motivated by these developments and limitations, this work proposes a reinforcement learning–based optimal tracking control framework for robotic manipulators, with a particular focus on trajectory tracking under nonlinear dynamics. The key novelty of this work lies in augmenting the actor–critic optimal control framework with a tunable reference error structure, enabling direct adjustment of convergence speed while preserving Lyapunov stability guarantees and real-time implementability in both numerical and physics-based environments.

Contributions of this work are summarized as follows:

1.   A structured reference-error formulation that explicitly shapes tracking transients with a tunable gain.

2.   A Lyapunov-stability-oriented actor–critic learning design with online approximation.

3.   Comparative studies and validations in both numerical simulations and physics-based environments.

2  Problem Formulation

2.1 System Model

The single-link robotic manipulator is widely used as a benchmark system to evaluate nonlinear control methods. Its dynamic equation can be expressed as

$$J\ddot{q}(t)+B\dot{q}(t)+Mgl\sin(q(t))=\tau(t)\tag{1}$$

where $q(t)$ denotes the angular position of the manipulator, $\dot{q}(t)$ and $\ddot{q}(t)$ are the angular velocity and acceleration, respectively, and $\tau(t)$ is the input torque. The parameters are defined as follows: $J$ represents the total rotational inertia, $B$ is the damping coefficient, $M$ is the mass of the link, $l$ is the distance from the pivot to the center of mass, and $g$ is the gravitational acceleration.

Following the definition of the manipulator dynamics in Eq. (1), the tracking error with respect to a desired trajectory qd(t) is introduced as

$$e(t)=q_{d}(t)-q(t)\tag{2}$$

To enforce convergence of the tracking error, the reference error is defined as

$$r(t)=\dot{e}(t)+\Lambda e(t)\tag{3}$$

where Λ is a constant gain specifying the convergence rate.

Substituting Eq. (1) into the above, the closed-loop error channel can be rearranged as

$$J\dot{r}(t)=-Br(t)-\tau(t)+h(t)\tag{4}$$

where

$$h(t)=J\big(\ddot{q}_{d}(t)+\Lambda\dot{e}(t)\big)+B\big(\dot{q}_{d}(t)+\Lambda e(t)\big)+Mgl\sin(q(t))\tag{5}$$

and the augmented state is given by

$$\chi(t)=\begin{bmatrix}e(t)\\ r(t)\end{bmatrix}\tag{6}$$

Introducing the auxiliary input

$$u(t)=h(t)-\tau(t)\tag{7}$$

the error dynamics reduce to

$$\dot{r}(t)=-\frac{B}{J}r(t)+\frac{1}{J}u(t)\tag{8}$$

Combining with Eq. (3), the augmented system can be expressed in compact form as

$$\dot{\chi}(t)=\begin{bmatrix}-\Lambda & 1\\ 0 & -\dfrac{B}{J}\end{bmatrix}\chi(t)+\begin{bmatrix}0\\ \dfrac{1}{J}\end{bmatrix}u(t)=f(\chi)+g(\chi)u(t)\tag{9}$$

Assumption 1: The system dynamics $f(\chi)$ and the control gain $g(\chi)$ in Eq. (9) are locally Lipschitz and bounded on a compact set $\Omega\subset\mathbb{R}^{n}$ that contains all closed-loop trajectories $\chi(t)$. Moreover, $g(\chi)$ is nonsingular for all $\chi\in\Omega$.

Remark 1: Although Eq. (9) is written in a compact state-space form that is linear in the error state vector χ and the control input u, this does not mean that the overall robotic system is linear. As shown in the original manipulator model in Eq. (1), the dynamics contain inherently nonlinear terms, including the inertia matrix, Coriolis/centrifugal forces, and gravity, which depend nonlinearly on the joint position and velocity (e.g., trigonometric functions). Eq. (9) is obtained by defining the tracking and reference errors and regrouping the dynamics into a convenient control-affine representation for controller design, without invoking any small-angle approximation or local linearization of the plant. In addition, the control policy generated by the actor network is a nonlinear function of the state, so the closed-loop dynamics obtained by substituting this policy into Eq. (9) remain nonlinear. Therefore, the proposed actor–critic reinforcement learning controller is still designed and analyzed in the context of nonlinear robotic systems, even though the error dynamics are expressed in a linear-like form in Eq. (9).
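For readers who want to trace Eqs. (1)–(9) numerically, the following minimal Python sketch (our own illustration, not code from the paper) assembles the tracking error, the reference error, the augmented state, and the control-affine pair $f(\chi)$, $g(\chi)$ for the single-link model; the parameter values, helper names, and use of NumPy are implementation assumptions.

```python
import numpy as np

# Nominal single-link parameters (placeholders; Section 5.1 uses J = 1, B = 2, Mgl = 10)
J, B, Mgl = 1.0, 2.0, 10.0
Lam = 5.0  # reference-error gain Lambda in Eq. (3)

def error_state(q, qdot, qd, qd_dot):
    """Tracking error e, reference error r, and augmented state chi (Eqs. (2), (3), (6))."""
    e = qd - q
    e_dot = qd_dot - qdot
    r = e_dot + Lam * e
    return e, e_dot, r, np.array([e, r])

def affine_dynamics(chi):
    """Control-affine form of Eq. (9): chi_dot = f(chi) + g(chi) * u."""
    A = np.array([[-Lam, 1.0],
                  [0.0, -B / J]])
    return A @ chi, np.array([0.0, 1.0 / J])

def torque_from_u(u, q, e, e_dot, qd_dot, qd_ddot):
    """Recover the joint torque from the auxiliary input via Eqs. (5) and (7): tau = h - u."""
    h = J * (qd_ddot + Lam * e_dot) + B * (qd_dot + Lam * e) + Mgl * np.sin(q)
    return h - u
```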

2.2 Optimal Control Design

To measure tracking performance and input usage, we introduce the performance index in terms of a running-cost surrogate:

$$J=\int_{t}^{\infty}L\big(\chi(\tau),u(\tau)\big)\,d\tau\tag{10}$$

where the running cost $L(\cdot)$ is defined as

$$L(\chi,u)=\chi^{T}Q\chi+u^{T}u\tag{11}$$

where $Q=g(\chi)g^{T}(\chi)\in\mathbb{R}^{n\times n}$ is a positive definite matrix, ensuring that both state deviation and input magnitude are penalized.

Definition 1: A control strategy $u$ is considered to belong to the admissible set $\Psi(\Omega)$, written as $u\in\Psi(\Omega)$, provided that $u$ is continuous on $\Omega$ with $u(0)=0$, ensures stability of the augmented error system in Eq. (9), and renders the performance index in Eq. (10) finite for all $\chi\in\Omega$.

Here, $J$ in Eq. (10) represents the general performance index for an admissible policy $u(t)$. The optimal performance index is denoted as $J^{*}(\chi)$, which corresponds to the minimum cost achievable from the error state $\chi$ under the optimal policy $u^{*}$.

According to optimal control theory, the Hamilton–Jacobi–Bellman (HJB) equation for this problem can be written as

$$H(\chi,u^{*},\nabla J^{*})=L(\chi,u^{*})+\nabla J^{*T}\dot{\chi}=\chi^{T}Q\chi+u^{*T}u^{*}+\nabla J^{*T}\big(f(\chi)+g(\chi)u^{*}\big)=0\tag{12}$$

where $\nabla J^{*}(\chi)$ denotes the gradient of $J^{*}(\chi)$ with respect to the error state.

By applying the optimality condition $\partial H/\partial u^{*}=0$, the corresponding control law takes the form

$$u^{*}(t)=-\frac{1}{2}g^{T}(\chi)\nabla J^{*}(\chi)\tag{13}$$

Thus, the solution to the control problem requires finding the optimal policy $u^{*}(t)$ through the gradient $\nabla J^{*}(\chi)$. Since $\nabla J^{*}(\chi)$ is generally difficult to obtain in closed form for nonlinear systems, reinforcement learning with an actor–critic structure is adopted in the next section to approximate this term online.
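As a small worked illustration of Eqs. (11)–(13), the sketch below evaluates the running cost, the control implied by a given value-gradient estimate, and the corresponding Hamiltonian; here grad_J is a stand-in for any available approximation of $\nabla J^{*}(\chi)$ (in this paper, the critic of Section 3), and the function names are ours.

```python
import numpy as np

def running_cost(chi, u, g):
    """L(chi, u) = chi^T Q chi + u^T u with Q = g g^T, as in Eq. (11)."""
    Q = np.outer(g, g)
    return chi @ Q @ chi + float(u) ** 2

def policy_from_gradient(grad_J, g):
    """u = -1/2 g^T(chi) grad_J(chi), the form of the optimal policy in Eq. (13)."""
    return -0.5 * g @ grad_J

def hamiltonian(chi, u, grad_J, f, g):
    """H(chi, u, grad_J) = L(chi, u) + grad_J^T (f + g u), Eq. (12); zero at the optimum."""
    return running_cost(chi, u, g) + grad_J @ (f + g * u)
```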

3  Optimal Control with Reinforcement Learning

3.1 Radial Basis Function Neural Network (RBFNN) Approximation

To facilitate the construction of the optimal tracking control law, the optimal performance function $J^{*}(\chi)$ can be decomposed as

$$J^{*}(\chi)=\beta\|\chi(t)\|^{2}-\beta\|\chi(t)\|^{2}+J^{*}(\chi)=\beta\|\chi(t)\|^{2}+J_{0}(\chi)\tag{14}$$

where $\beta>0$ is a design constant and $J_{0}(\chi)=J^{*}(\chi)-\beta\|\chi(t)\|^{2}$.

Since neural networks have the capability to approximate continuous functions with arbitrary accuracy on a compact domain, the residual term J0(χ) can be approximated by a neural network as

$$J_{0}(\chi)=W^{T}B(\chi)+\varepsilon(\chi)\tag{15}$$

where $W\in\mathbb{R}^{m}$ is the ideal weight vector, $B(\chi)\in\mathbb{R}^{m}$ is the basis vector, and $m$ indicates the number of nodes. The function $\varepsilon(\chi)$ represents the approximation error.

By substituting the approximation Eq. (15) into the decomposition Eq. (14), the gradient of $J^{*}(\chi)$ can be expressed as

$$\nabla J^{*}(\chi)=2\beta\chi(t)+\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}W+\frac{\partial\varepsilon(\chi)}{\partial\chi}\tag{16}$$

According to the optimal control policy derived in Eq. (13), the optimal input can be rewritten as

$$u^{*}(t)=-\beta g^{T}(\chi)\chi(t)-\frac{1}{2}g^{T}(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}W-\frac{1}{2}g^{T}(\chi)\frac{\partial\varepsilon(\chi)}{\partial\chi}\tag{17}$$

By substituting Eqs. (16) and (17) into Eq. (12), the following result can be obtained:

$$H(\chi,u^{*},\nabla J^{*})=-(\beta^{2}-1)\chi^{T}(t)Q(\chi)\chi(t)+2\beta\chi^{T}(t)f(\chi)+W^{T}\frac{\partial B(\chi)}{\partial\chi}\big(f(\chi)-\beta Q(\chi)\chi(t)\big)-\frac{1}{4}W^{T}\Theta(\chi)W+\omega(t)=0\tag{18}$$

where

$$\Theta(\chi)=\frac{\partial B(\chi)}{\partial\chi}Q(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\in\mathbb{R}^{m\times m}$$

$$\omega(t)=\left(\frac{\partial\varepsilon(\chi)}{\partial\chi}\right)^{T}\left(f(\chi)+g(\chi)u^{*}+\frac{1}{4}Q\frac{\partial\varepsilon(\chi)}{\partial\chi}\right)$$

Remark 2: The number of radial basis function (RBF) nodes has a direct impact on both the approximation capability and the closed-loop performance. In this work, Gaussian RBFNNs are adopted for both the critic and actor, and the centers are uniformly distributed over the estimated reachable region of the state vector. Preliminary simulations with different node numbers (e.g., 25, 36, and 49) indicate that 36 nodes provide a good trade-off between approximation accuracy and online computational complexity; therefore, this configuration is used in the final simulations.

From the viewpoint of function approximation, too few basis functions lead to insufficient representation power and large residual errors, which may slow down learning and degrade tracking performance. On the other hand, using too many RBF nodes increases the network dimension and may cause overfitting and high-variance responses. In our simulations, when the number of nodes is excessively large, the learned control policy tends to exhibit sharper variations and high-frequency oscillations, resulting in “spiky” tracking trajectories and more aggressive control actions. A similar trade-off between approximation accuracy and model complexity has been reported in adaptive neural network control of robot manipulators, where an optimal number of hidden nodes is sought to avoid both under-fitting and over-fitting of the unknown dynamics [21].
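One possible realization of the Gaussian basis $B(\chi)$ and its Jacobian $\partial B(\chi)/\partial\chi$ used in Eqs. (15)–(17) is sketched below, with the 36 nodes of Remark 2 placed on a uniform 6 × 6 grid over the error state; the grid layout, the common width, and the function names are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def make_rbf(n_per_dim=6, lo=-6.0, hi=6.0, mu=1.0):
    """Gaussian RBF basis on a uniform grid (n_per_dim**2 = 36 nodes for a 2-D state)."""
    axis = np.linspace(lo, hi, n_per_dim)
    centers = np.array([[cx, cy] for cx in axis for cy in axis])   # shape (m, 2)

    def basis(chi):
        """B(chi) in R^m with B_i = exp(-||chi - c_i||^2 / mu^2)."""
        diff = chi - centers
        return np.exp(-np.sum(diff ** 2, axis=1) / mu ** 2)

    def basis_jacobian(chi):
        """dB/dchi in R^{m x 2}: row i equals -2 (chi - c_i) B_i(chi) / mu^2."""
        diff = chi - centers
        B = np.exp(-np.sum(diff ** 2, axis=1) / mu ** 2)
        return (-2.0 / mu ** 2) * diff * B[:, None]

    return basis, basis_jacobian, centers

basis, basis_jac, centers = make_rbf()
print(basis(np.zeros(2)).shape, basis_jac(np.zeros(2)).shape)      # (36,) (36, 2)
```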

3.2 Actor–Critic Implementation

In practice, the ideal weight vector W and the approximation error ε(χ) are unknown. To address this, an actor–critic reinforcement learning structure is employed.

The critic network approximates the gradient of the optimal performance index as

$$\nabla\hat{J}(\chi)=2\beta\chi(t)+\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{c}(t)\tag{19}$$

where $\nabla\hat{J}(\chi)$ denotes the estimated gradient of the optimal performance index $J^{*}(\chi)$, and $\hat{W}_{c}(t)$ is the critic weight vector adapted online.

Accordingly, the actor generates the control input using both the critic information and its own adaptive weights:

$$u(t)=-\beta g^{T}(\chi)\chi(t)-\frac{1}{2}g^{T}(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)\tag{20}$$

where $\hat{W}_{a}(t)$ is the actor weight vector updated during the learning process.

Therefore, the actor–critic pair collaborates such that the critic approximates the gradient $\nabla J^{*}(\chi)$, while the actor produces the near-optimal control input $u(t)$ in real time.

The Bellman residual is defined in terms of the Hamiltonian as

$$\phi(t)=H(\chi,u,\nabla\hat{J})-H(\chi,u^{*},\nabla J^{*})\tag{21}$$

Since the Hamiltonian at the optimal policy $u^{*}$ satisfies the HJB condition, the above reduces to

$$\phi(t)=H(\chi,u,\nabla\hat{J})\tag{22}$$

Expanding Eq. (22), the Bellman residual can be written explicitly as

$$\phi(t)=\chi^{T}Q\chi+\left\|\beta g^{T}(\chi)\chi(t)+\frac{1}{2}g^{T}(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)\right\|^{2}+\left(2\beta\chi(t)+\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{c}(t)\right)^{T}\left(f(\chi)-\beta Q\chi(t)-\frac{1}{2}Q\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)\right)\tag{23}$$

To minimize the Bellman residual error $\phi(t)$, the positive function

$$\Phi(t)=\phi^{2}(t)\tag{24}$$

is introduced. For stability, it is required that $\dot{\Phi}\le 0$.

The critic weight updating rule is obtained by applying a normalized negative gradient descent:

$$\dot{\hat{W}}_{c}(t)=-\frac{\alpha_{c}\,\varsigma(t)}{1+\|\varsigma(t)\|^{2}}\,\phi(t)=-\frac{\alpha_{c}\,\varsigma(t)}{1+\|\varsigma(t)\|^{2}}\left(\varsigma^{T}(t)\hat{W}_{c}(t)-\left(\beta^{2}-1\right)\chi^{T}Q\chi+2\beta\chi^{T}f(\chi)+\frac{1}{4}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\right)\tag{25}$$

where αc>0 is the critic learning rate, and

$$\varsigma(t)=\frac{\partial B(\chi)}{\partial\chi}\left(f(\chi)-\beta Q(\chi)\chi(t)-\frac{1}{2}Q(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)\right)\in\mathbb{R}^{m\times 1}$$

The actor weight adaptation rule is derived from stability analysis to ensure that the optimized control input remains stable. The actor weights are updated according to

$$\dot{\hat{W}}_{a}(t)=\frac{1}{2}\frac{\partial B(\chi)}{\partial\chi}Q\chi(t)-\alpha\Theta(\chi)\hat{W}_{a}(t)+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\Theta(\chi)\hat{W}_{a}(t)\varsigma^{T}(t)\hat{W}_{c}(t),\tag{26}$$

where α>0 acts as the gain parameter used in the actor update process.

Assumption 2: In this work, we adopt the standard concept of persistence of excitation (PE) as a requirement for ensuring sufficient richness in the regressor signal. Let $\varsigma(t)$ denote the regressor signal used in the critic and actor update laws. We assume that there exist positive constants $\xi_{1}>0$, $\xi_{2}>0$ and $T>0$ such that, for all $t$,

$$\xi_{1}I_{m}\le\int_{t}^{t+T}\varsigma(\tau)\varsigma^{T}(\tau)\,d\tau\le\xi_{2}I_{m}$$

where $I_{m}\in\mathbb{R}^{m\times m}$ is the identity matrix. In other words, the PE condition requires that the regressor $\varsigma(t)$ be sufficiently rich in every time window of length $T$, so that all directions in the parameter space are repeatedly excited and the critic–actor weights can, in principle, be identified as in the classical actor–critic analysis of Vamvoudakis and Lewis [22].
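To make the learning laws concrete, the following sketch performs one forward-Euler step of the control and update laws in Eqs. (20), (25) and (26), given the current state, the drift $f(\chi)$, the input map $g(\chi)$, and the basis Jacobian; the explicit Euler discretization with step dt and all variable names are our implementation assumptions and are not part of the analysis above.

```python
import numpy as np

def actor_critic_step(chi, f, g, dB, Wc, Wa,
                      beta=40.0, alpha=5.0, alpha_c=10.0, dt=0.02):
    """One discretized update of the critic and actor laws, Eqs. (25) and (26)."""
    Q = np.outer(g, g)                       # Q(chi) = g g^T
    Theta = dB @ Q @ dB.T                    # Theta(chi) = (dB/dchi) Q (dB/dchi)^T

    # Actor control input, Eq. (20)
    u = -beta * (g @ chi) - 0.5 * g @ (dB.T @ Wa)

    # Regressor and Bellman residual (Eq. (23) written in the compact form used in Eq. (25))
    varsigma = dB @ (f - beta * (Q @ chi) - 0.5 * Q @ (dB.T @ Wa))
    phi = (varsigma @ Wc
           - (beta ** 2 - 1.0) * (chi @ Q @ chi)
           + 2.0 * beta * (chi @ f)
           + 0.25 * (Wa @ Theta @ Wa))
    norm = 1.0 + varsigma @ varsigma

    # Critic update, Eq. (25): normalized negative gradient of the squared residual
    Wc_dot = -alpha_c * varsigma * phi / norm

    # Actor update, Eq. (26)
    Wa_dot = (0.5 * dB @ (Q @ chi)
              - alpha * (Theta @ Wa)
              + (alpha_c / (4.0 * norm)) * (Theta @ Wa) * (varsigma @ Wc))

    return u, Wc + dt * Wc_dot, Wa + dt * Wa_dot
```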

4  Stability Analysis

In this section, the stability of the closed-loop system is analyzed using a Lyapunov-based method. Consider the following Lyapunov candidate function:

$$V(t)=\frac{1}{2}\chi^{T}(t)\chi(t)+\frac{1}{2}\tilde{W}_{a}^{T}(t)\tilde{W}_{a}(t)+\frac{1}{2}\tilde{W}_{c}^{T}(t)\tilde{W}_{c}(t)\tag{27}$$

where $\tilde{W}_{a}(t)=\hat{W}_{a}(t)-W$ and $\tilde{W}_{c}(t)=\hat{W}_{c}(t)-W$ denote the estimation errors of the actor and critic weight vectors, respectively, with $W$ the ideal weight vector.

The time derivative of V is

$$\dot{V}(t)=\chi^{T}(t)\dot{\chi}(t)+\tilde{W}_{a}^{T}(t)\dot{\hat{W}}_{a}(t)+\tilde{W}_{c}^{T}(t)\dot{\hat{W}}_{c}(t)=\chi^{T}(t)\big(f(\chi)+g(\chi)u(t)\big)+\tilde{W}_{a}^{T}(t)\dot{\hat{W}}_{a}(t)+\tilde{W}_{c}^{T}(t)\dot{\hat{W}}_{c}(t)\tag{28}$$

Applying Eqs. (20), (25) and (26) to Eq. (28), the following equation can be formulated as:

$$\begin{aligned}\dot{V}(t)=&-\beta\chi^{T}(t)Q(\chi)\chi(t)-\frac{1}{2}\chi^{T}(t)Q(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)+\chi^{T}(t)f(\chi)+\frac{1}{2}\tilde{W}_{a}^{T}(t)\frac{\partial B(\chi)}{\partial\chi}Q(\chi)\chi(t)\\&-\alpha\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\varsigma^{T}(t)\hat{W}_{c}(t)\\&-\tilde{W}_{c}^{T}(t)\frac{\alpha_{c}\varsigma(t)}{1+\|\varsigma(t)\|^{2}}\left(\varsigma^{T}(t)\hat{W}_{c}(t)-(\beta^{2}-1)\chi^{T}(t)Q(\chi)\chi(t)+2\beta\chi^{T}(t)f(\chi)+\frac{1}{4}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\right)\end{aligned}\tag{29}$$

Based on $\tilde{W}_{a}(t)=\hat{W}_{a}(t)-W$, we obtain the expressions below:

$$\frac{1}{2}\tilde{W}_{a}^{T}(t)\frac{\partial B(\chi)}{\partial\chi}Q(\chi)\chi(t)-\frac{1}{2}\chi^{T}(t)Q(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)=-\frac{1}{2}\chi^{T}(t)Q(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}W\tag{30}$$

$$-\alpha\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)=-\frac{\alpha}{2}\tilde{W}_{a}^{T}(t)\Theta(\chi)\tilde{W}_{a}(t)-\frac{\alpha}{2}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)+\frac{\alpha}{2}W^{T}\Theta(\chi)W\tag{31}$$

According to Young’s and Cauchy inequalities [10], we obtain the results below:

$$\chi^{T}(t)f(\chi)\le\frac{1}{2}\|\chi(t)\|^{2}+\frac{1}{2}\|f(\chi)\|^{2}\tag{32}$$

$$-\frac{1}{2}\chi^{T}(t)Q(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}W\le\frac{1}{4}\chi^{T}(t)Q(\chi)\chi(t)+\frac{1}{4}W^{T}\Theta(\chi)W\tag{33}$$

Substituting Eqs. (30)–(33) into Eq. (29), the following inequality follows:

$$\begin{aligned}\dot{V}(t)\le&-\chi^{T}(t)\left(\left(\beta-\frac{1}{4}\right)Q(\chi)-\frac{1}{2}I_{n}\right)\chi(t)+\frac{1}{2}\|f(\chi)\|^{2}-\frac{\alpha}{2}\tilde{W}_{a}^{T}(t)\Theta(\chi)\tilde{W}_{a}(t)-\frac{\alpha}{2}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\\&+\frac{2\alpha+1}{4}W^{T}\Theta(\chi)W+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\varsigma^{T}(t)\hat{W}_{c}(t)\\&-\tilde{W}_{c}^{T}(t)\frac{\alpha_{c}\varsigma(t)}{1+\|\varsigma(t)\|^{2}}\left(\varsigma^{T}(t)\hat{W}_{c}(t)-(\beta^{2}-1)\chi^{T}(t)Q(\chi)\chi(t)+2\beta\chi^{T}(t)f(\chi)+\frac{1}{4}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\right).\end{aligned}\tag{34}$$

Based on (18), we obtain the equation shown below:

$$-(\beta^{2}-1)\chi^{T}(t)Q(\chi)\chi(t)+2\beta\chi^{T}(t)f(\chi)=-W^{T}\varsigma(t)-\frac{1}{2}W^{T}\Theta(\chi)\hat{W}_{a}(t)+\frac{1}{4}W^{T}\Theta(\chi)W-\omega(t)\tag{35}$$

By inserting (35) into (34), the following result is obtained:

$$\begin{aligned}\dot{V}(t)\le&-\chi^{T}(t)\left(\left(\beta-\frac{1}{4}\right)Q(\chi)-\frac{1}{2}I_{n}\right)\chi(t)+\frac{1}{2}\|f(\chi)\|^{2}-\frac{\alpha}{2}\tilde{W}_{a}^{T}(t)\Theta(\chi)\tilde{W}_{a}(t)-\frac{\alpha}{2}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\\&+\frac{2\alpha+1}{4}W^{T}\Theta(\chi)W+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\varsigma^{T}(t)\hat{W}_{c}(t)\\&-\tilde{W}_{c}^{T}(t)\frac{\alpha_{c}\varsigma(t)}{1+\|\varsigma(t)\|^{2}}\left(\varsigma^{T}(t)\tilde{W}_{c}(t)-\frac{1}{2}W^{T}\Theta(\chi)\hat{W}_{a}(t)+\frac{1}{4}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)+\frac{1}{4}W^{T}\Theta(\chi)W-\omega(t)\right).\end{aligned}\tag{36}$$

Using the fact that:

$$\frac{1}{4}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)-\frac{1}{2}W^{T}\Theta(\chi)\hat{W}_{a}(t)+\frac{1}{4}W^{T}\Theta(\chi)W=\frac{1}{4}\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)-\frac{1}{4}W^{T}\Theta(\chi)\tilde{W}_{a}(t)\tag{37}$$

Applying Eq. (37), Eq. (38) is obtained:

$$\begin{aligned}\dot{V}(t)\le&-\chi^{T}(t)\left(\left(\beta-\frac{1}{4}\right)Q(\chi)-\frac{1}{2}I_{n}\right)\chi(t)+\frac{1}{2}\|f(\chi)\|^{2}-\frac{\alpha}{2}\tilde{W}_{a}^{T}(t)\Theta(\chi)\tilde{W}_{a}(t)-\frac{\alpha}{2}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\\&+\frac{2\alpha+1}{4}W^{T}\Theta(\chi)W-\frac{\alpha_{c}}{1+\|\varsigma(t)\|^{2}}\tilde{W}_{c}^{T}(t)\varsigma(t)\varsigma^{T}(t)\tilde{W}_{c}(t)+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\varsigma^{T}(t)\hat{W}_{c}(t)\\&-\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)W^{T}\Theta(\chi)\tilde{W}_{a}(t)+\frac{\alpha_{c}}{1+\|\varsigma(t)\|^{2}}\tilde{W}_{c}^{T}(t)\varsigma(t)\omega(t)\end{aligned}\tag{38}$$

Using the facts that

$$\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\varsigma^{T}(t)\hat{W}_{c}(t)-\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)\tilde{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)=\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\frac{\partial B(\chi)}{\partial\chi}g(\chi)W^{T}\varsigma(t)g^{T}(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)\tag{39}$$

$$\frac{\alpha_{c}}{1+\|\varsigma(t)\|^{2}}\tilde{W}_{c}^{T}(t)\varsigma(t)\omega(t)\le\frac{\alpha_{c}}{2\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)\varsigma^{T}(t)\tilde{W}_{c}(t)+\frac{\alpha_{c}}{2\left(1+\|\varsigma(t)\|^{2}\right)}\omega^{2}(t)\tag{40}$$

Substituting Eqs. (39) and (40) into Eq. (38) yields:

$$\begin{aligned}\dot{V}(t)\le&-\chi^{T}(t)\left(\left(\beta-\frac{1}{4}\right)Q(\chi)-\frac{1}{2}I_{n}\right)\chi(t)-\frac{\alpha}{2}\tilde{W}_{a}^{T}(t)\Theta(\chi)\tilde{W}_{a}(t)-\frac{\alpha_{c}}{2\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)\varsigma^{T}(t)\tilde{W}_{c}(t)\\&+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\frac{\partial B(\chi)}{\partial\chi}g(\chi)W^{T}\varsigma(t)g^{T}(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)+\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)W^{T}\Theta(\chi)\tilde{W}_{a}(t)\\&-\frac{\alpha}{2}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)+C(t)\end{aligned}\tag{41}$$

where

$$C(t)=\frac{2\alpha+1}{4}W^{T}\Theta(\chi)W+\frac{1}{2}\|f(\chi)\|^{2}+\frac{\alpha_{c}}{2\left(1+\|\varsigma(t)\|^{2}\right)}\omega^{2}(t)$$

Using Young’s and Cauchy inequalities, we obtain the bounds shown below:

$$\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{a}^{T}(t)\frac{\partial B(\chi)}{\partial\chi}g(\chi)W^{T}\varsigma(t)g^{T}(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\hat{W}_{a}(t)\le\frac{1}{32}\tilde{W}_{a}^{T}(t)\frac{\partial B(\chi)}{\partial\chi}g(\chi)W^{T}\varsigma(t)\varsigma^{T}(t)Wg^{T}(\chi)\left(\frac{\partial B(\chi)}{\partial\chi}\right)^{T}\tilde{W}_{a}(t)+\frac{\alpha_{c}^{2}}{2}\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\tag{42}$$

$$\frac{\alpha_{c}}{4\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)W^{T}\Theta(\chi)\tilde{W}_{a}(t)\le\frac{1}{32\left(1+\|\varsigma(t)\|^{2}\right)}\tilde{W}_{c}^{T}(t)\varsigma(t)W^{T}\Theta(\chi)W\varsigma^{T}(t)\tilde{W}_{c}(t)+\frac{\alpha_{c}^{2}}{2}\tilde{W}_{a}^{T}(t)\Theta(\chi)\tilde{W}_{a}(t)\tag{43}$$

By substituting the inequalities in Eqs. (42) and (43) into Eq. (41), we obtain:

$$\begin{aligned}\dot{V}(t)\le&-\chi^{T}(t)\left(\left(\beta-\frac{1}{2}\right)Q(\chi)-I_{n}\right)\chi(t)-\left(\frac{\alpha}{2}-\frac{\alpha_{c}^{2}}{2}-\frac{1}{32}W^{T}\varsigma(t)\varsigma^{T}(t)W\right)\tilde{W}_{a}^{T}(t)\Theta(\chi)\tilde{W}_{a}(t)\\&-\frac{1}{1+\|\varsigma(t)\|^{2}}\left(\frac{\alpha_{c}}{2}-\frac{1}{32}W^{T}\Theta(\chi)W\right)\tilde{W}_{c}^{T}(t)\varsigma(t)\varsigma^{T}(t)\tilde{W}_{c}(t)-\left(\frac{\alpha}{2}-\frac{\alpha_{c}^{2}}{2}\right)\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)+C(t)\end{aligned}\tag{44}$$

Let

$$c_{1}=\left(\beta-\frac{1}{2}\right)\lambda_{Q\min}-1$$

$$c_{2}=\left(\frac{\alpha}{2}-\frac{\alpha_{c}^{2}}{2}-\frac{\xi_{1}}{32}\|W\|^{2}\right)\lambda_{\Theta\min}$$

$$c_{3}=\left(\frac{\alpha_{c}}{2}-\frac{\lambda_{\Theta\max}}{32}\|W\|^{2}\right)\xi_{2}$$

$$\gamma_{1}=\sup\{C(t)\}$$

where $\lambda_{Q\min}$ denotes the smallest eigenvalue of $Q(\chi)$, and $\lambda_{\Theta\min}$ and $\lambda_{\Theta\max}$ represent the smallest and largest eigenvalues of $\Theta(\chi)$, respectively. Then Eq. (44) can be expressed as

$$\dot{V}(t)\le-c_{1}\chi^{T}(t)\chi(t)-c_{2}\tilde{W}_{a}^{T}(t)\tilde{W}_{a}(t)-c_{3}\tilde{W}_{c}^{T}(t)\tilde{W}_{c}(t)+\gamma_{1}-\left(\frac{\alpha}{2}-\frac{\alpha_{c}^{2}}{2}\right)\hat{W}_{a}^{T}(t)\Theta(\chi)\hat{W}_{a}(t)\tag{45}$$

equivalently,

$$\dot{V}(t)\le-c_{L}V(t)+\gamma_{1}\tag{46}$$

where cL=min{c1,c2,c3}>0.

Lemma 1 [1]: Consider a scalar function $V(t)\in\mathbb{R}$ that remains positive and continuous, and whose initial value $V(0)$ is bounded. Suppose $V(t)$ evolves according to $\dot{V}\le-aV(t)+b$, where $a>0$ and $b>0$ are given constants. Then

$$V(t)\le e^{-at}V(0)+\frac{b}{a}\left(1-e^{-at}\right).\tag{47}$$

Applying Lemma 1 to Eq. (46), Eq. (48) is obtained:

$$V(t)\le e^{-c_{L}t}V(0)+\frac{\gamma_{1}}{c_{L}}\left(1-e^{-c_{L}t}\right).\tag{48}$$

Result. Inequality Eq. (48) implies that the tracking error $e(t)$ and the weight estimation errors $\tilde{W}_{c}(t)$ and $\tilde{W}_{a}(t)$ satisfy the SGUUB property.
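As a quick numerical sanity check of Lemma 1 and the bound in Eq. (48), the snippet below integrates the worst-case dynamics $\dot{V}=-aV+b$ with forward Euler and compares the result with the closed-form envelope of Eq. (47); the constants are arbitrary illustrative values.

```python
import numpy as np

a, b, V0, dt, T_end = 2.0, 0.5, 3.0, 1e-3, 5.0
t = np.arange(0.0, T_end, dt)

V = np.empty_like(t)
V[0] = V0
for k in range(1, len(t)):                 # Euler integration of V_dot = -a V + b
    V[k] = V[k - 1] + dt * (-a * V[k - 1] + b)

envelope = np.exp(-a * t) * V0 + (b / a) * (1.0 - np.exp(-a * t))   # Eq. (47)
# In this worst case the bound is attained with equality, so the gap is only
# the small forward-Euler discretization error.
print(float(np.max(np.abs(V - envelope))))
```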

Finally, our proposed actor–critic trajectory controller with optimal design for nonlinear robotic systems is shown in Fig. 1.


Figure 1: Our proposed actor–critic trajectory controller with optimal design for nonlinear robotic systems

5  Numerical Simulation Results

5.1 Link Manipulator

The effectiveness of the control scheme is assessed on a single-link robotic manipulator governed by Eq. (1). The system parameters follow Section 2.1 and are fixed as J=1, B=2, and Mgl=10. The sampling period is T=0.02 s, and t0=0, tf=20 s. The initial condition is q(0)=5, q˙(0)=5. The desired trajectory is specified by

$$q_{d}(t)=\frac{200}{49}\sin(0.7t)+\frac{20}{7}t$$

For the neural network approximation, an RBFNN with 36 units is adopted. The basis vector takes the form

$$B(e)=\left[B_{1}(e),B_{2}(e),\ldots,B_{36}(e)\right]^{T}$$

with each Gaussian basis function given by

$$B_{i}(e)=\exp\left(-\frac{\|e-\varsigma_{i}\|^{2}}{\mu_{i}^{2}}\right),\quad i=1,\ldots,36$$

where $\mu_{i}=1$, and the centers $\varsigma_{i}$ are uniformly distributed over the range $[-6,6]$.

The control design parameters are selected as β=40.0, critic learning rate αc=10.0, and actor learning rate α=5. The initial conditions of the weights are set as

$$\hat{W}_{c}(0)=[0.3,\ldots,0.3]^{T}$$

$$\hat{W}_{a}(0)=[0.5,\ldots,0.5]^{T}$$

Two controllers are compared under identical conditions. The first follows the method in [8], which uses the tracking error $e(t)=q_{d}(t)-q(t)$ directly within the Optimal+RL framework. The second uses the method proposed in this paper; the key difference is the additional reference error defined in Eq. (3), evaluated with Λ=5 and Λ=10, after which the manipulator dynamics are reformulated in terms of the composite state in Eq. (6) for the actor–critic optimal-tracking design.

For the baseline controller based on [8], we reimplemented the algorithm under exactly the same settings as those described above for the proposed method. In particular, both controllers use the same single-link manipulator dynamics in Section 2, the same desired trajectory, the same parameters, and the same RBFNN structure (number of nodes, centers and widths) for the critic and actor. Moreover, the critic and actor learning rates, the initial weight vectors, and the initial state of the system are chosen to be identical for the baseline and the proposed controller. Therefore, the performance differences observed in this subsection mainly originate from the structural modification introduced in this work, rather than from arbitrary retuning of the underlying parameters.
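Before turning to the results, the self-contained sketch below shows one way the proposed controller can be simulated under the settings listed above, combining the plant of Eq. (1), the reference error of Eq. (3), a 6 × 6 grid of Gaussian nodes (one reasonable reading of the stated center range), and the update laws of Eqs. (25) and (26) with forward-Euler integration; this is our own reconstruction for illustration, not the authors' simulation code, and the gains may require retuning for numerical stability of the naive discretization.

```python
import numpy as np

# Plant and design parameters from this subsection
J, B, Mgl = 1.0, 2.0, 10.0
Lam, beta, alpha, alpha_c = 5.0, 40.0, 5.0, 10.0
dt, T_end, mu = 0.02, 20.0, 1.0

axis = np.linspace(-6.0, 6.0, 6)
centers = np.array([[cx, cy] for cx in axis for cy in axis])        # 36 Gaussian nodes
Wc, Wa = 0.3 * np.ones(36), 0.5 * np.ones(36)                       # initial weights
q, qdot = 5.0, 5.0                                                  # initial state

def rbf_jac(chi):
    d = chi - centers
    Bv = np.exp(-np.sum(d ** 2, axis=1) / mu ** 2)
    return (-2.0 / mu ** 2) * d * Bv[:, None]                        # dB/dchi, 36 x 2

A = np.array([[-Lam, 1.0], [0.0, -B / J]])
g = np.array([0.0, 1.0 / J])
Q = np.outer(g, g)

for step in range(int(T_end / dt)):
    t = step * dt
    qd = (200.0 / 49.0) * np.sin(0.7 * t) + (20.0 / 7.0) * t
    qd_dot = (20.0 / 7.0) * np.cos(0.7 * t) + 20.0 / 7.0
    qd_ddot = -2.0 * np.sin(0.7 * t)

    e, e_dot = qd - q, qd_dot - qdot
    chi = np.array([e, e_dot + Lam * e])                             # Eqs. (3) and (6)
    f = A @ chi
    dB = rbf_jac(chi)
    Theta = dB @ Q @ dB.T

    u = -beta * (g @ chi) - 0.5 * g @ (dB.T @ Wa)                    # Eq. (20)
    varsigma = dB @ (f - beta * (Q @ chi) - 0.5 * Q @ (dB.T @ Wa))
    phi = (varsigma @ Wc - (beta ** 2 - 1.0) * (chi @ Q @ chi)
           + 2.0 * beta * (chi @ f) + 0.25 * Wa @ Theta @ Wa)
    norm = 1.0 + varsigma @ varsigma
    Wc_new = Wc + dt * (-alpha_c * varsigma * phi / norm)            # Eq. (25)
    Wa_new = Wa + dt * (0.5 * dB @ (Q @ chi) - alpha * (Theta @ Wa)
                        + (alpha_c / (4.0 * norm)) * (Theta @ Wa) * (varsigma @ Wc))  # Eq. (26)
    Wc, Wa = Wc_new, Wa_new

    tau = J * (qd_ddot + Lam * e_dot) + B * (qd_dot + Lam * e) + Mgl * np.sin(q) - u  # Eq. (7)
    qddot = (tau - B * qdot - Mgl * np.sin(q)) / J                   # Eq. (1)
    q, qdot = q + dt * qdot, qdot + dt * qddot                       # forward-Euler plant step
```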

Fig. 2 presents position and velocity tracking results. The method proposed in this paper produces markedly faster transients; increasing Λ from 5 to 10 further accelerates the initial convergence without degrading steady-state tracking. Fig. 3 shows the position and velocity errors. Consistent with Fig. 2, the error trajectories of the proposed method reach zero much earlier and settle within a tighter band. Fig. 4 depicts the control torque, which remains bounded in all cases; the proposed method yields smoother steady-state actuation. Figs. 5 and 6 report the Euclidean norms of the actor and critic weights, confirming bounded adaptation. Fig. 7 plots the instantaneous cost and indicates faster cost decay for the proposed method.


Figure 2: Position and velocity tracking of the single-link manipulator: (a) position comparison between the method in [8] and the proposed method; (b) velocity comparison between the method in [8] and the proposed method. All curves are generated from our own simulations based on the controller structure reported in [8]


Figure 3: Tracking errors of the single-link manipulator: (a) position error; (b) velocity error


Figure 4: Control input


Figure 5: Norm of actor Neural Network (NN) weights


Figure 6: Norm of critic NN weights


Figure 7: Cost function

These results, summarized in Tables 1 and 2, substantiate that incorporating the reference error into the Optimal+RL design substantially improves transient speed and reduces post-tracking error; moreover, Λ serves as a convergence gain that allows explicit shaping of the trajectory-tracking response.
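The exact metric definitions behind the settling-time and post-tracking-error entries in Tables 1 and 2 are not reproduced in the text, so the helper functions below adopt one common convention purely as an assumption: settling time is the first instant after which the error magnitude stays inside a tolerance band, and the post-tracking error is the largest error magnitude over the final portion of the run.

```python
import numpy as np

def settling_time(t, e, tol=0.02):
    """First time after which |e| never leaves the +/- tol band (np.inf if it never settles)."""
    inside = np.abs(e) <= tol
    for k in range(len(t)):
        if inside[k:].all():
            return float(t[k])
    return np.inf

def post_tracking_error(e, tail_fraction=0.2):
    """Maximum |e| over the final tail_fraction of the simulation horizon."""
    k0 = int((1.0 - tail_fraction) * len(e))
    return float(np.max(np.abs(e[k0:])))
```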


Influence of the Reference-Error Gain

To further investigate the influence of the reference-error gain Λ on the closed-loop behavior, an additional simulation study is carried out on the single-link manipulator with five different values, Λ=1,2,5,10,20, while all other parameters are kept identical. The resulting position and velocity trajectories are shown in Fig. 8, where Fig. 8a depicts the joint position and Fig. 8b depicts the joint velocity for the five choices of Λ. The corresponding position and velocity tracking errors are plotted in Fig. 9a and b, respectively, and the associated control inputs are reported in Fig. 10.


Figure 8: Position and velocity responses of the single-link manipulator under different values of the reference-error gain: (a) joint position; (b) joint velocity


Figure 9: Position and velocity tracking errors of the single-link manipulator under different values of the reference-error gain: (a) position error; (b) velocity error


Figure 10: Control input of the single-link manipulator under different values of the reference-error gain

The cases Λ=1 and Λ=2 exhibit relatively slow convergence and longer settling times, although their control inputs in Fig. 10 are comparatively smooth and far from the saturation bounds. When Λ is increased to 5 and 10, the position and velocity errors decay much more rapidly, and the trajectories in Fig. 8 closely follow the desired motion with only moderate growth in the control effort. For Λ=20, the tracking errors converge the fastest among all tested values, but the corresponding control signal in Fig. 10 becomes more aggressive and approaches the saturation limits more frequently, which may cause larger overshoot or high-frequency components in practical implementations.

These observations confirm that Λ provides a simple and effective tuning knob to balance transient speed and control effort in the proposed actor–critic controller. In the subsequent simulations, Λ=5 and Λ=10 are therefore adopted as representative “moderate” and “aggressive” settings, respectively: they clearly illustrate the trade-off between convergence rate and control activity, while avoiding the excessively sluggish response observed for very small Λ and the overly aggressive control behavior associated with excessively large Λ.

5.2 Nonlinear Mass–Spring–Damper

The second experiment considers a single-Degrees of Freedom (DOF) nonlinear oscillator described by

$$\ddot{q}(t)=-c\dot{q}(t)-kq(t)-k_{3}q^{3}(t)+u(t)$$

with damping c=0.5, linear stiffness k=2.0, and cubic stiffness k3=0.5. The sampling period is T=0.02 s and the simulation horizon is t0=0 and tf=50 s. The reference trajectory is chosen as

$$q_{d}(t)=\cos(t),\qquad\dot{q}_{d}(t)=-\sin(t)$$

and the initial condition is $q(0)=\dot{q}(0)=0$.
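A minimal sketch of this Duffing-type plant, integrated with a fixed-step fourth-order Runge–Kutta scheme at the stated sampling period, is given below; the zero input in the loop is only a placeholder for the actor output of Section 3, and the function names are ours.

```python
import numpy as np

c, k, k3, dt = 0.5, 2.0, 0.5, 0.02   # damping, linear and cubic stiffness, sampling period

def plant(x, u):
    """x = [q, q_dot]; q_ddot = -c q_dot - k q - k3 q^3 + u."""
    q, qd = x
    return np.array([qd, -c * qd - k * q - k3 * q ** 3 + u])

def rk4_step(x, u):
    s1 = plant(x, u)
    s2 = plant(x + 0.5 * dt * s1, u)
    s3 = plant(x + 0.5 * dt * s2, u)
    s4 = plant(x + dt * s3, u)
    return x + (dt / 6.0) * (s1 + 2 * s2 + 2 * s3 + s4)

x = np.zeros(2)                       # q(0) = q_dot(0) = 0
for _ in range(int(50.0 / dt)):
    x = rk4_step(x, u=0.0)            # placeholder input; replace with the learned policy
```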

For function approximation, an RBFNN with 36 Gaussian nodes is used, with centers uniformly distributed over $[-6,6]$. The control design parameters are selected as β=40.0, critic learning rate αc=5.0, and actor learning rate α=40. The initial conditions of the weights are set as

$$\hat{W}_{c}(0)=[0.3,\ldots,0.3]^{T}$$

$$\hat{W}_{a}(0)=[0.5,\ldots,0.5]^{T}$$

Three tests are performed under identical settings: the method in [8], and the method proposed in this paper using Eq. (3) with Λ=5 and Λ=10. The tracking-time and post-tracking error metrics are defined as in Section 5.1.

As in Section 5.1, the baseline controller from [8] in this example is also reimplemented using exactly the same controller parameters as the proposed method, so that the comparison is carried out under identical settings.

The position and velocity tracking responses of the three controllers are shown in Fig. 11, and the corresponding tracking errors are plotted in Fig. 12. The control inputs generated by the proposed method and the baseline are depicted in Fig. 13. The evolution of the actor and critic weight norms is illustrated in Figs. 14 and 15, and the associated cost function trajectories are reported in Fig. 16. The settling time and steady-state error of the position and velocity tracking are summarized in Tables 3 and 4, respectively.


Figure 11: Position and velocity tracking of the nonlinear mass–spring–damper system: (a) position comparison between the method in [8] and the proposed method; (b) velocity comparison between the method in [8] and the proposed method. All curves are obtained from our own simulations based on the reference setup in [8]


Figure 12: Tracking errors of the nonlinear mass–spring–damper system: (a) position error; (b) velocity error [8]


Figure 13: Control input [8]


Figure 14: Norm of actor NN weights [8]


Figure 15: Norm of critic NN weights [8]


Figure 16: Cost function [8]


From Figs. 11–16 and Tables 3 and 4, the method in [8] shows poor performance under the chosen nonlinear reference, whereas the proposed reference-error based design achieves faster convergence, smaller steady-state errors, and smoother control effort, as visible in the insets of the velocity and torque plots.

6  Simulation in Virtual Environment—Single-Link Manipulator

To further evaluate the proposed optimal reinforcement learning control strategy, a virtual simulation platform was constructed in the Gazebo environment, which provides physics-based visualization and real-time dynamic interaction. The manipulator model and control framework were configured following the experimental setup described in [15], where the same single-link robotic system was adopted as a benchmark for validation.

The developed Gazebo environment and manipulator model are illustrated in Fig. 17, which replicates realistic joint friction, gravitational effects, and torque constraints.


Figure 17: Gazebo simulation environment of the single-link robotic manipulator

The plant dynamics follow the model of Eq. (1):

$$J\ddot{q}+B\dot{q}+Mgl\sin(q)=\tau$$

The parameters are set as $J=1.0$, $B=2.0$, and $Mgl=10.0$.

The reference trajectory and its derivatives are defined as follows:

$$q_{d}(t)=\frac{200}{49}\sin(0.7t)+\frac{20}{7}t,\qquad\dot{q}_{d}(t)=\frac{20}{7}\cos(0.7t)+\frac{20}{7}$$

Simulation settings are summarized below: β=20.0, μ=1.0, α=5.0, αc=5.0, and 36 RBF nodes.

The time horizon is $t\in[0,20]$ s, with 1000 uniformly spaced samples. The neural network centers are uniformly distributed within $[-6,6]$.

In this example, the baseline method of [8] is implemented with the same controller-related parameters as the proposed scheme, including the discount factor, learning rates, RBFNN structure, initial weight vectors and initial state, ensuring that both controllers are tested under the same conditions.
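For completeness, the skeleton below illustrates how a torque-level controller of this kind can be connected to a Gazebo-simulated joint through ROS using rospy; the topic names, the use of a Float64 effort command, and the compute_torque placeholder are assumptions about the simulated model's interface, not the exact setup used for the results reported here.

```python
import math
import rospy
from sensor_msgs.msg import JointState
from std_msgs.msg import Float64

state = {"q": 0.0, "qdot": 0.0}

def joint_cb(msg):
    # Assumes the single link is the first joint listed in the JointState message.
    state["q"], state["qdot"] = msg.position[0], msg.velocity[0]

def desired(t):
    # Reference q_d(t) = (200/49) sin(0.7 t) + (20/7) t and its time derivative.
    return ((200.0 / 49.0) * math.sin(0.7 * t) + (20.0 / 7.0) * t,
            (20.0 / 7.0) * math.cos(0.7 * t) + 20.0 / 7.0)

def compute_torque(q, qdot, qd, qd_dot):
    # Placeholder: evaluate the actor-critic control law of Section 3 here.
    return 0.0

rospy.init_node("actor_critic_single_link")
rospy.Subscriber("/joint_states", JointState, joint_cb)              # standard joint feedback topic
cmd = rospy.Publisher("/link_effort_controller/command",             # hypothetical effort interface
                      Float64, queue_size=1)
rate = rospy.Rate(50)                                                # 50 Hz, i.e., the 0.02 s period

t0 = rospy.get_time()
while not rospy.is_shutdown():
    t = rospy.get_time() - t0
    qd, qd_dot = desired(t)
    cmd.publish(Float64(data=compute_torque(state["q"], state["qdot"], qd, qd_dot)))
    rate.sleep()
```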

The results obtained in the Gazebo environment are presented in Figs. 18–23. The tracking response in Fig. 18a,b demonstrates that the proposed controller achieves accurate and stable motion tracking, closely following the desired trajectory under virtual dynamics. The position and velocity errors shown in Fig. 19a,b converge rapidly to zero, verifying that the controller maintains robustness and precision even with virtual actuator dynamics and friction. The control input shown in Fig. 20 remains smooth and bounded throughout the operation, indicating that the control effort required in the simulated environment is practical and stable. The evolution of actor and critic weights in Figs. 21 and 22 remains bounded, confirming the stability and learning consistency of the proposed actor–critic mechanism during the entire simulation.


Figure 18: Position and velocity tracking of the single-link manipulator: (a) position comparison between the method in [8] and the proposed method; (b) velocity comparison between the method in [8] and the proposed method. All curves are generated from our own simulations based on the controller structure reported in [8]


Figure 19: Tracking errors of the single-link manipulator: (a) position error; (b) velocity error [8]


Figure 20: Control input [8]


Figure 21: Norm of actor NN weights [8]


Figure 22: Norm of critic NN weights [8]


Figure 23: Cost function [8]

The quantitative performance evaluation is summarized in Tables 5 and 6, showing that the proposed controller significantly improves both transient and steady-state accuracy compared with the benchmark approach [8]. The results verify that the actor–critic optimal control strategy can be effectively implemented in a physics-based virtual manipulator and maintains stable adaptation during real-time operation.


In summary, the Gazebo-based virtual validation confirms the effectiveness and practicality of the proposed controller. The system maintains stable adaptation, bounded control input, and high tracking precision under realistic simulation conditions. The cost function convergence further verifies that the control policy achieves near-optimal performance while ensuring smooth and efficient actuation. These findings demonstrate that the control algorithm can be reliably extended from numerical experiments to physical robotic platforms with consistent performance.

Quantitatively, the proposed method reduces settling time by approximately 90% and steady-state error by 85% compared to [8], confirming the improved real-time performance.

7  Conclusion

This paper presented an actor–critic optimal tracking framework for nonlinear systems that augments the classical tracking error with a reference error. By reformulating the plant dynamics in the composite error state, the proposed method enables explicit shaping of the convergence rate through the gain Λ while preserving a control-affine structure amenable to reinforcement learning. A Lyapunov-based design was adopted to derive the update laws and to ensure closed-loop boundedness of the tracking signals and adaptive weights.

Simulation studies on two benchmarks—a single-link robotic manipulator and a nonlinear mass–spring–damper—demonstrated clear performance gains over the reference approach in [8]. With identical neural approximators and learning gains, the proposed controller achieved substantially faster transients and smaller steady-state errors. On the manipulator, sub-second zero-crossing of the error was obtained for Λ=5 and Λ=10, whereas ref. [8] exhibited a much slower approach. On the oscillator, the proposed method tracked the periodic reference within fractions of a second and maintained a tighter residual band. These improvements were accompanied by bounded control inputs and weight norms, confirming stable adaptation. The trade-off is that larger Λ values yield sharper initial responses and higher early control demand; nevertheless, they provide a direct and tunable mechanism to accelerate convergence.

Overall, the results verify that incorporating the reference error into the Optimal+RL design offers a simple yet effective route to enhance tracking accuracy and convergence speed, without sacrificing stability guarantees. Future work will extend the method to multi-DOF manipulators and implement it on a physical UR5 robot using Robot Operating System (ROS) 2 to evaluate robustness under sensor noise and actuator saturation.

Acknowledgement: Not applicable.

Funding Statement: This work was supported in part by the National Science and Technology Council under Grant NSTC 114-2221-E-027-104.

Author Contributions: Nien-Tsu Hu: Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Validation, Writing—review and editing. Hsiang-Tung Kao: Data curation, Formal analysis, Investigation, Software, Validation, Writing—original draft, Writing—review and editing. Chin-Sheng Chen and Shih-Hao Chang: Supervision, Visualization. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The datasets and source codes generated and analyzed during the current study are not publicly available at this time but are available from the corresponding author upon reasonable request.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Ge SS, Hang CC, Zhang T. Adaptive neural network control of nonlinear systems by state and output feedback. IEEE Trans Syst, Man, Cybern B. 1999;29(6):818–28. doi:10.1109/3477.809035.

2. Beard RW, Saridis GN, Wen JT. Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation. Automatica. 1997;33(12):2159–77. doi:10.1016/S0005-1098(97)00128-3.

3. Khalil HK. Nonlinear systems. 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall; 2002. 750 p.

4. Lewis FL, Vrabie DL, Syrmos VL. Optimal control. 3rd ed. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2012. 540 p.

5. Peng Z, Cheng H, Shi K, Zou C, Huang R, Li X, et al. Optimal tracking control for motion constrained robot systems via event-sampled critic learning. Expert Syst Appl. 2023;234:121085. doi:10.1016/j.eswa.2023.121085.

6. Cheng G, Dong L. Optimal control for robotic manipulators with input saturation using single critic network. In: Proceedings of the 2019 Chinese Automation Congress (CAC); 2019 Nov 22–24; Hangzhou, China. p. 2344–9. doi:10.1109/cac48633.2019.8996999.

7. Hu Y, Cui L, Chai S. Optimal tracking control for robotic manipulator using actor-critic network. In: Proceedings of the 2021 40th Chinese Control Conference (CCC); 2021 Jul 26–28; Shanghai, China. p. 1556–61.

8. Wen G, Philip Chen CL, Ge SS, Yang H, Liu X. Optimized adaptive nonlinear tracking control using actor–critic reinforcement learning strategy. IEEE Trans Ind Inf. 2019;15(9):4969–77. doi:10.1109/tii.2019.2894282.

9. Wen G, Ge SS, Tu F. Optimized backstepping for tracking control of strict-feedback systems. IEEE Trans Neural Netw Learn Syst. 2018;29(8):3850–62. doi:10.1109/TNNLS.2018.2803726.

10. Wen G, Philip Chen CL, Feng J, Zhou N. Optimized multi-agent formation control based on an identifier–actor–critic reinforcement learning algorithm. IEEE Trans Fuzzy Syst. 2018;26(5):2719–31. doi:10.1109/tfuzz.2017.2787561.

11. Kim YH, Lewis FL, Dawson DM. Intelligent optimal control of robotic manipulators using neural networks. Automatica. 2000;36(9):1355–64. doi:10.1016/S0005-1098(00)00045-5.

12. Bu X. Prescribed performance control approaches, applications and challenges: a comprehensive survey. Asian J Control. 2023;25(1):241–61. doi:10.1002/asjc.2765.

13. Ghanooni P, Habibi H, Yazdani A, Wang H, MahmoudZadeh S, Ferrara A. Prescribed performance control of a robotic manipulator with unknown control gain and assigned settling time. ISA Trans. 2024;145(2):330–54. doi:10.1016/j.isatra.2023.12.011.

14. Zhao K, Xie Y, Xu S, Zhang L. Adaptive neural appointed-time prescribed performance control for the manipulator system via barrier Lyapunov function. J Frankl Inst. 2025;362(2):107468. doi:10.1016/j.jfranklin.2024.107468.

15. Rahimi F, Ziaei S, Esfanjani RM. A reinforcement learning-based control approach for tracking problem of a class of nonlinear systems: applied to a single-link manipulator. In: Proceedings of the 2023 31st International Conference on Electrical Engineering (ICEE); 2023 May 9–11; Tehran, Iran. p. 58–63. doi:10.1109/icee59167.2023.10334874.

16. Rahimi Nohooji H, Zaraki A, Voos H. Actor–critic learning based PID control for robotic manipulators. Appl Soft Comput. 2024;151:111153. doi:10.1016/j.asoc.2023.111153.

17. Zhao X, Han S, Tao B, Yin Z, Ding H. Model-based actor–critic learning of robotic impedance control in complex interactive environment. IEEE Trans Ind Electron. 2022;69(12):13225–35. doi:10.1109/tie.2021.3134082.

18. Nam D, Huy D. H∞ tracking control for perturbed discrete-time systems using On/Off policy Q-learning algorithms. Chaos Solitons Fractals. 2025;197:116459. doi:10.1016/j.chaos.2025.116459.

19. Dao PN, Nguyen VQ, Duc HAN. Nonlinear RISE based integral reinforcement learning algorithms for perturbed Bilateral Teleoperators with variable time delay. Neurocomputing. 2024;605:128355. doi:10.1016/j.neucom.2024.128355.

20. Zhao JG, Chen FF. Off-policy integral reinforcement learning-based optimal tracking control for a class of nonzero-sum game systems with unknown dynamics. Optim Control Appl Meth. 2022;43(6):1623–44. doi:10.1002/oca.2916.

21. Liu C, Zhao Z, Wen G. Adaptive neural network control with optimal number of hidden nodes for trajectory tracking of robot manipulators. Neurocomputing. 2019;350:136–45. doi:10.1016/j.neucom.2019.03.043.

22. Vamvoudakis KG, Lewis FL. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica. 2010;46(5):878–88. doi:10.1016/j.automatica.2010.02.018.




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.