Open Access
ARTICLE
Graph-Based Constrained PPO for Low-Latency and Energy-Aware AI Agent Migration in Internet of Vehicular Agents
1 School of Automation, Guangdong University of Technology and Key Laboratory of Intelligent Detection and the Internet of Things in Manufacturing, Ministry of Education, Guangzhou, China
2 School of Automation, Guangdong University of Technology, Guangzhou, China
* Corresponding Author: Ming Li.
(This article belongs to the Special Issue: AI-Driven Optimization for Secure and Sustainable Edge IoT Services)
Computers, Materials & Continua 2026, 88(2), 61 https://doi.org/10.32604/cmc.2026.083294
Received 01 April 2026; Accepted 03 May 2026; Issue published 15 June 2026
Abstract
The Internet of Vehicular Agents (IoVA) interconnects distributed AI agents across vehicular networks to deliver real-time intelligent services for vehicular users. Due to the limited computing capacity of vehicles, AI agents are deployed on nearby RoadSide Units (RSUs) to perform computation-intensive inference. As vehicles traverse RSU coverage boundaries, AI agents must migrate to target RSUs to maintain service continuity. However, the communication and computing resources at each RSU are shared among multiple co-served vehicles, creating coupled allocation decisions that jointly determine system latency and energy consumption. To address this challenge, we propose a low-latency and energy-aware AI agent migration framework that models the end-to-end system latency and vehicle energy consumption in the IoVA. Since the cumulative nature of energy consumption introduces long-term constraints that cannot be handled by instantaneous optimization, we formulate the resource allocation problem as a constrained Markov decision process and develop a Graph-based Constrained Proximal Policy Optimization (GCPPO) algorithm to solve it. GCPPO employs a bidirectional graph attention network to extract the relational features between heterogeneous vehicles and RSUs, thereby enabling topology-aware resource allocation, and adopts a Lagrangian dual mechanism to adaptively enforce the long-term energy constraints. Simulation results demonstrate the effectiveness and scalability of the proposed algorithm.
1 Introduction
Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding, complex reasoning, and content generation [1,2]. AI agents leverage LLMs as their cognitive core, evolving from task-specific tools into autonomous entities capable of perceiving, reasoning, and acting across diverse domains [3]. In vehicular networks, AI agents are increasingly deployed to deliver real-time intelligent services [4,5]. However, the rapidly changing network topology and fluctuating wireless channel conditions impose stringent requirements on service continuity [6]. The Internet of Vehicular Agents (IoVA) has emerged as a promising paradigm in which distributed AI agents are seamlessly interconnected and dynamically coordinated across the vehicular environment [7]. In the IoVA, AI agents continuously perceive the surrounding environment and user requests to construct real-time situational awareness. Leveraging this awareness, the AI agents formulate context-aware decisions and convert them into executable actions, thereby delivering intelligent services to vehicular users [7].
However, AI agent decision-making relies on computation-intensive inference that far exceeds the limited computing capacity of vehicles [8]. To sustain the inference process, AI agents are deployed on RoadSide Units (RSUs) with sufficient computing resources [9]. As vehicles continuously traverse RSU coverage boundaries, AI agents must migrate to target RSUs to maintain service continuity [10]. The resulting service latency and energy consumption are jointly determined by the allocation of communication and computing resources at the RSUs. Although increasing resource allocation can effectively reduce service latency, it also raises energy consumption, which is bounded by strict vehicle energy budgets [11]. Therefore, it remains a significant challenge to optimize resource allocation for AI agent migration in the IoVA while jointly reducing service latency and satisfying vehicle energy constraints.
Traditional optimization methods for resource allocation in vehicular networks typically rely on accurate instantaneous channel state information and quasi-static network assumptions [12,13]. Nevertheless, the high mobility of vehicles introduces significant channel estimation errors [14], and the dynamically changing network topology further limits the applicability of these methods [6,15]. Deep Reinforcement Learning (DRL)-based methods offer a promising alternative by learning effective policies without prior knowledge of system dynamics [15,16]. Despite this advantage, most existing DRL approaches encode the environment state as a concatenated observation vector [13,17]. This flat representation fails to capture the topological relationships among vehicles and RSUs, and generalizes poorly as the network scales. Furthermore, conventional DRL methods lack systematic constraint-handling mechanisms [18], making the learned policies prone to violating vehicle energy constraints in the highly dynamic IoVA.
To address the above challenges, at the system level, we develop a low-latency and energy-aware AI agent migration framework in the IoVA. The framework jointly models end-to-end service latency and vehicle energy consumption under dynamic channel conditions. We further formulate the multi-vehicle resource allocation problem as a Constrained Markov Decision Process (CMDP) with long-term cumulative energy budgets. At the algorithmic level, we design the Graph-based Constrained Proximal Policy Optimization (GCPPO) algorithm to learn an effective policy for the formulated CMDP. The main contributions of this paper are summarized as follows:
• We develop a low-latency and energy-aware AI agent migration framework in the IoVA, which jointly characterizes the end-to-end service latency and vehicle energy consumption in dynamic vehicular environments. Specifically, the latency model captures the latency across the communication, inference, and migration phases, while the energy model captures both the transmission and circuit power consumption of each vehicle over uplink and downlink channels.
• We formulate the multi-vehicle resource allocation problem for AI agent migration as a CMDP, which aims to minimize the long-term average system latency subject to cumulative energy constraints.
• We design a novel GCPPO algorithm to solve the formulated CMDP. GCPPO leverages a bidirectional Graph Attention Network (GAT) to capture the relational features between heterogeneous vehicles and RSUs, and incorporates a Lagrangian dual method to adaptively enforce the long-term energy constraints. Simulation results demonstrate that GCPPO reduces average system latency by
The rest of the paper is organized as follows: Section 2 reviews the related work. Section 3 introduces the proposed low-latency and energy-aware AI agent migration framework in the IoVA. In Section 4, we present the architecture of the GCPPO algorithm. Section 5 provides simulation results to demonstrate the performance of the GCPPO algorithm. In Section 6, we conclude the paper.
2 Related Work
2.1 AI Agents in Vehicular Networks
Driven by the rapid advancement of LLMs, researchers have increasingly explored integrating AI agents into vehicular networks [3,4,19,20]. To address the substantial computing demand of AI agents, the authors in [21] proposed a cloud-edge collaborative architecture that distributed multimodal LLM inference between edge servers and the cloud for intelligent driver assistance. Furthermore, the authors in [22] developed an Agent-as-a-Service paradigm, where AI agents autonomously performed computing and communication tasks, effectively reducing service latency for edge-assisted autonomous driving.
Given the high mobility of vehicles, AI agent migration across edge servers has emerged as a promising approach to maintain service continuity [5,23]. Specifically, the authors in [23] proposed a generative diffusion-based contract design to incentivize RSU participation in AI agent migration. To address security threats during AI agent migration, the authors in [5] developed a secure online migration framework with trust assessment, effectively mitigating network attacks while maintaining low migration latency. However, the above works focus on incentive mechanisms and migration security, while overlooking vehicle energy consumption, which critically constrains vehicles with limited resources across successive migrations in the IoVA.
2.2 Constrained Deep Reinforcement Learning for Resource Optimization
Resource optimization in wireless networks typically involves long-term constraints (e.g., energy budgets and service quality guarantees) that conventional DRL methods fail to satisfy [18,24]. Constrained DRL (CDRL) methods address this limitation by formulating the problem as a CMDP, which extends the MDP with cumulative cost constraints [25]. Among CDRL methods, Lagrangian dual optimization is the most widely adopted approach, which solves the CMDP by converting constraints into penalty terms in the objective function [18,25]. For instance, the authors in [24] employed Lagrangian dual optimization to achieve near-optimal network capacity through joint UAV altitude control and channel access under energy harvesting constraints. Since Lagrangian dual methods may still produce infeasible actions during execution [18], recent studies have incorporated safety mechanisms to provide stronger constraint guarantees [26,27]. For instance, the authors in [28] embedded a safety layer that projects each action onto the feasible set to satisfy latency constraints for edge offloading. However, the above CDRL methods rely on flat state representations, which cannot capture the spatial relationships among interacting entities. When applied to IoVA scenarios with heterogeneous vehicles and RSUs under dynamic topologies, this limitation degrades the expressiveness and scalability of the learned policies.
3 Low-Latency and Energy-Aware AI Agent Migration Framework
In this section, we present the proposed low-latency and energy-aware AI agent migration framework in the IoVA, as illustrated in Fig. 1. The framework considers a system where AI agents are deployed on RSUs to perform computation-intensive inference, and migrate to target RSUs as vehicles traverse coverage boundaries. We first model the service latency and vehicle energy consumption, and then formulate the resource allocation problem.

Figure 1: Illustration of the proposed low-latency and energy-aware AI agent migration framework in the IoVA.
We consider an IoVA system that consists of a set
In the communication phases, each vehicle exchanges data with its serving RSU through shared wireless bandwidth. Let
where
In the computation phase, the RSU allocates its computing resources to the AI agents for inference [8]. Specifically, the AI agent first encodes the multimodal perception data into a joint representation of the driving environment and task intent. This representation is then processed in parallel during prefill to construct the Key-Value (KV) cache for the session. Based on this cache, the AI agent performs autoregressive decoding to generate the response. The encoding, prefill, and decode stages process data volumes of
Since the AI agent is deployed on the RSU, it must be migrated upon RSU handover to preserve service continuity [30]. Let
Thus, the end-to-end service latency for vehicle
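As a rough illustration of how the communication, inference, and migration phases described above combine into one latency figure, consider the following Python sketch. It is not the paper's model: the Shannon-type rate expression, the cycle-count representation of the encoding, prefill, and decode workloads, the migration link, and every function and parameter name are assumptions introduced for this example.

```python
import math


def link_rate(bandwidth_hz, share, tx_power_w, channel_gain, noise_psd_w_per_hz):
    """Assumed Shannon-type rate (bit/s) over the bandwidth fraction `share`."""
    allocated_hz = share * bandwidth_hz
    snr = tx_power_w * channel_gain / (noise_psd_w_per_hz * allocated_hz)
    return allocated_hz * math.log2(1.0 + snr)


def service_latency(up_bits, down_bits, up_rate, down_rate,
                    stage_cycles, rsu_cycles_per_s, compute_share,
                    agent_bits=0.0, migration_rate=float("inf")):
    """Sum of communication, inference, and migration latency in seconds.

    stage_cycles: assumed workloads (CPU/GPU cycles) of the encoding,
                  prefill, and decode stages.
    compute_share: fraction of the RSU computing capacity given to this agent.
    agent_bits / migration_rate: assumed migration payload and RSU-to-RSU
                  link rate; the default means no migration in this slot.
    """
    communication = up_bits / up_rate + down_bits / down_rate
    inference = sum(stage_cycles) / (compute_share * rsu_cycles_per_s)
    migration = agent_bits / migration_rate
    return communication + inference + migration


# Example: 1 Mbit up and down, three inference stages, half of an 8 GHz RSU.
latency_s = service_latency(1e6, 1e6, 1e6, 2e6,
                            [1e9, 2e9, 1e9], 8e9, 0.5,
                            agent_bits=1e9, migration_rate=1e9)
```

The sketch makes the coupling visible: a larger bandwidth share raises `link_rate` and shortens the communication term, while a larger `compute_share` shortens the inference term, and both shares are drawn from resources that co-served vehicles also need.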
Since vehicles in the IoVA operate on limited onboard batteries while RSUs are grid-powered, we model only the vehicle-side energy consumption during uplink and downlink communication [11].
In the uplink phase, the vehicle transmits data to the serving RSU through its radio frequency chain [29]. Due to the limited amplifier efficiency, the power amplifier draws input power that exceeds the intended transmit power. The active transmit chain further consumes static circuit power from the supporting circuitry and a bandwidth-dependent dynamic component that arises from baseband signal processing across the allocated bandwidth. The resulting uplink energy consumption can be expressed as
where
In the downlink phase, the vehicle receives the service response without power amplification. The receive chain consumes a demodulation power
Thus, the total energy consumption of vehicle
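The energy terms described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's equations: the power amplifier input is taken as transmit power divided by an efficiency factor, the dynamic component is assumed to scale linearly with the allocated bandwidth, and the receive chain is reduced to a single power value. All names are hypothetical.

```python
def uplink_energy(tx_power_w, pa_efficiency, static_power_w,
                  dyn_power_w_per_hz, allocated_bw_hz, uplink_time_s):
    """Vehicle-side uplink energy in joules.

    pa_efficiency: assumed amplifier efficiency in (0, 1]; the amplifier
                   draws tx_power_w / pa_efficiency from the battery.
    dyn_power_w_per_hz: assumed coefficient of the bandwidth-dependent
                   baseband processing power.
    """
    pa_input_w = tx_power_w / pa_efficiency
    dynamic_w = dyn_power_w_per_hz * allocated_bw_hz
    return (pa_input_w + static_power_w + dynamic_w) * uplink_time_s


def downlink_energy(rx_power_w, downlink_time_s):
    """Vehicle-side downlink energy: receive-chain power times reception time."""
    return rx_power_w * downlink_time_s


def total_energy(uplink_j, downlink_j):
    """Per-slot vehicle energy as the sum of both directions."""
    return uplink_j + downlink_j


# Example: 0.2 W transmit power at 40% efficiency over 10 MHz for 2 s.
e_up = uplink_energy(0.2, 0.4, 0.1, 1e-8, 1e7, 2.0)
e_total = total_energy(e_up, downlink_energy(0.05, 1.0))
```

Under these assumptions, raising the transmit power shortens the uplink time through a higher rate but increases the amplifier draw, which is the latency and energy trade-off the optimization has to balance.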
We aim to optimize the resource allocation in the IoVA to minimize the cumulative service latency across all vehicles while satisfying per-vehicle energy budget constraints. However, increasing the transmit power of a vehicle reduces its latency but raises its energy consumption, while allocating more bandwidth to one vehicle limits the resources available to others. To balance these competing objectives, we jointly determine the computing resource allocation ratio
Constraints (9b,c) ensure that the computing and bandwidth allocation ratios assigned by each RSU do not exceed unity. Constraint (9d) bounds the uplink transmit power of each vehicle in its feasible range, where
4 Graph-Based Constrained Proximal Policy Optimization Algorithm
The resource allocation problem in the IoVA involves a non-convex objective, temporally coupled energy constraints, and non-stationary system dynamics, which make it intractable for conventional optimization approaches. We therefore reformulate it as a CMDP [25], characterized by the tuple
(1) State space: At each time slot
(2) Action space: At each time slot
(3) Reward function: Since the goal is to minimize the cumulative service latency, the immediate reward is defined as
(4) Cost function: Among the constraints in (9), constraints (9b–d) are enforced via per-slot action projection, and constraint (9e) is determined by the physical distance between each vehicle and its nearest RSU. However, constraint (9f) couples decisions across all time slots. To encode this long-term constraint, we define the immediate cost as
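The per-slot action projection mentioned for constraints (9b–d) could look like the following Python sketch. The paper does not specify the projection operator, so the proportional rescaling used here (rather than, say, a Euclidean projection), the list-based data layout, and all names are assumptions made for illustration.

```python
def project_ratios(ratios, serving_rsu):
    """Rescale allocation ratios so that the sum at each RSU is at most one.

    ratios[v]: raw ratio proposed by the policy for vehicle v.
    serving_rsu[v]: index of the RSU that serves vehicle v.
    Negative proposals are clipped to zero before rescaling.
    """
    totals = {}
    for ratio, rsu in zip(ratios, serving_rsu):
        totals[rsu] = totals.get(rsu, 0.0) + max(ratio, 0.0)
    # Dividing by max(total, 1) leaves already-feasible allocations unchanged.
    return [max(ratio, 0.0) / max(totals[rsu], 1.0)
            for ratio, rsu in zip(ratios, serving_rsu)]


def clip_power(powers, p_min, p_max):
    """Clamp each uplink transmit power into its feasible range."""
    return [min(max(p, p_min), p_max) for p in powers]


# Example: vehicles 0 and 1 share RSU 0 and over-request; vehicle 2 is alone.
bandwidth = project_ratios([0.8, 0.6, 0.3], [0, 0, 1])
power = clip_power([0.05, 0.5, 2.0], 0.1, 1.0)
```

Handling these instantaneous constraints by projection leaves only the cumulative energy constraint (9f) to be enforced through the cost signal.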
4.2 Architecture of the GCPPO Algorithm
4.2.1 Bidirectional Graph Attention Network for State Representation
To capture the relational structure among vehicles and RSUs in the IoVA, we represent the system state as a bipartite graph
The bipartite graph
where
where
Since the aggregations in (11) only access immediate neighbors, we stack L layers to expand the receptive field to L-hop neighbors, which enables each vehicle node to incorporate information from co-served vehicles that compete for the same RSU resources. After the final layer, we concatenate all vehicle node embeddings to obtain the state vector
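To make the message-passing idea concrete, the sketch below implements one simplified bidirectional attention layer on a vehicle-RSU bipartite graph in plain Python. It follows the standard single-head GAT scoring rule [31] (a LeakyReLU over a learned vector applied to the concatenated projections, followed by a softmax over neighbors) and is only an assumption about how the two directions could be arranged; the output nonlinearity, multiple heads, and the exact form of (11) are omitted, and all names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def attend(targets, sources, neighbors, W_t, W_s, a):
    """Each target node aggregates its source neighbors with attention weights."""
    zt = [matvec(W_t, h) for h in targets]
    zs = [matvec(W_s, h) for h in sources]
    out = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:                      # isolated node keeps its own projection
            out.append(zt[i])
            continue
        # Score each edge from the concatenated target and source projections.
        scores = [leaky_relu(sum(ak * x for ak, x in zip(a, zt[i] + zs[j])))
                  for j in nbrs]
        top = max(scores)                 # subtract the max for a stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        out.append([sum(al * zs[j][k] for al, j in zip(alphas, nbrs))
                    for k in range(len(zs[0]))])
    return out


def bidirectional_layer(vehicles, rsus, veh_nbrs, rsu_nbrs, v2r, r2v):
    """Vehicle-to-RSU pass followed by RSU-to-vehicle pass.

    v2r and r2v are (W_target, W_source, a) parameter tuples, one per direction.
    """
    rsu_new = attend(rsus, vehicles, rsu_nbrs, *v2r)
    veh_new = attend(vehicles, rsu_new, veh_nbrs, *r2v)
    return veh_new, rsu_new


# Example: two vehicles served by one RSU, identity projections, zero scores.
I2 = [[1.0, 0.0], [0.0, 1.0]]
params = (I2, I2, [0.0, 0.0, 0.0, 0.0])
veh_emb, rsu_emb = bidirectional_layer([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0]],
                                       [[0], [0]], [[0, 1]], params, params)
```

In this toy example each vehicle embedding ends up containing the features of the other vehicle, routed through the shared RSU node, which is the co-served vehicle information that the stacked layers are meant to expose.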
4.2.2 Graph-Based PPO for Energy-Constrained AI Agent Migration
The actor network
where
Since both the policy objective and the constraint enforcement rely on accurate value estimates, the reward critic and cost critic are trained to minimize the mean squared error between their predictions and the empirical returns, with the critic loss functions given by
where
where
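The training quantities discussed in this subsection can be sketched generically as follows. This is a textbook-style PPO-Lagrangian outline built from generalized advantage estimation [32], the clipped surrogate of PPO [33], and a projected dual ascent step; it is not claimed to match the paper's exact losses, and the penalized-advantage form, step sizes, and all names are assumptions.

```python
def gae(signals, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates; `values` has one more entry than `signals`.

    The same routine serves rewards (with the reward critic) and costs
    (with the cost critic).
    """
    advantages = [0.0] * len(signals)
    running = 0.0
    for t in reversed(range(len(signals))):
        delta = signals[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratios, adv_reward, adv_cost, multiplier, clip_eps=0.2):
    """PPO clipped objective on the penalized advantage A_r - multiplier * A_c."""
    total = 0.0
    for ratio, a_r, a_c in zip(ratios, adv_reward, adv_cost):
        adv = a_r - multiplier * a_c
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(ratios)


def update_multiplier(multiplier, avg_cost, budget, step=0.05):
    """Dual ascent: grow the multiplier when the average cost exceeds the
    budget, shrink it otherwise, and keep it non-negative."""
    return max(0.0, multiplier + step * (avg_cost - budget))


# Example: the multiplier rises after a budget violation and later decays to zero.
lam_1 = update_multiplier(0.0, avg_cost=1.2, budget=1.0, step=0.5)
lam_2 = update_multiplier(lam_1, avg_cost=0.0, budget=1.0, step=0.5)
```

Because the multiplier is learned from the observed constraint violation, no penalty weight has to be fixed in advance, which is the property the Lagrangian dual mechanism is meant to provide.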

5 Simulation Results
We consider an IoVA system in which vehicles travel along a road segment at speeds of 30 to 120 km/h and are served by uniformly deployed RSUs. The initial positions of vehicles are randomly distributed along the road, and each vehicle maintains a constant speed throughout each episode. Channel gains between each vehicle and RSU combine a distance-dependent path loss with exponent

For the GCPPO algorithm, the actor and critic networks share a two-layer bidirectional GAT encoder. All algorithms are implemented in PyTorch and trained for
To validate the effectiveness of the proposed GCPPO algorithm in the IoVA, we compare it with the following baseline algorithms:
• CPO [34]: Constrained Policy Optimization (CPO) solves the CMDP via trust region updates with linearized cost constraints, providing first-order feasibility guarantees.
• PPO-Lag [35]: PPO-Lagrangian (PPO-Lag) augments the standard PPO objective with a Lagrangian penalty for energy constraint enforcement. It serves as an ablation variant of GCPPO without the bidirectional GAT encoder, isolating the contribution of the graph-based state representation.
• SAC-Lag [35,36]: Soft Actor-Critic with Lagrangian (SAC-Lag) extends the entropy-regularized off-policy framework with dual variables for cost constraint satisfaction.
• DAPA: Demand-Aware Proportional Allocation (DAPA) is a heuristic baseline that allocates computing and bandwidth resources at each RSU in proportion to the service demand of served vehicles, and adjusts each vehicle’s transmit power according to its demand priority and residual energy.
• Random: This baseline samples resource allocation decisions uniformly at random from the feasible action space.
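For readers who want a concrete picture of the DAPA heuristic, the Python sketch below shows one plausible reading of it. Only the proportional-to-demand sharing is stated in the text; the power rule, which scales the transmit power by a normalized demand priority and the remaining fraction of the energy budget, is a hypothetical choice made for this example, as are all names.

```python
def proportional_shares(demands, serving_rsu):
    """Split each RSU's resource among its vehicles in proportion to demand."""
    totals = {}
    for demand, rsu in zip(demands, serving_rsu):
        totals[rsu] = totals.get(rsu, 0.0) + demand
    return [demand / totals[rsu] if totals[rsu] > 0 else 0.0
            for demand, rsu in zip(demands, serving_rsu)]


def heuristic_power(demand, max_demand, residual_energy, energy_budget,
                    p_min, p_max):
    """Hypothetical power rule: more power for high-priority vehicles that
    still have most of their energy budget left."""
    priority = demand / max_demand
    remaining = max(0.0, min(1.0, residual_energy / energy_budget))
    return p_min + (p_max - p_min) * priority * remaining


# Example: vehicles 0 and 1 share RSU 0; vehicle 2 is alone at RSU 1.
shares = proportional_shares([2.0, 6.0, 5.0], [0, 0, 1])
power_w = heuristic_power(5.0, 10.0, 0.5, 1.0, 0.1, 1.1)
```

A rule of this kind reacts only to the current demand and residual energy, so it cannot anticipate future slots in the way a learned policy with a long-term cost signal can.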
In Fig. 2, we present the training performance of GCPPO and the baseline algorithms. As shown in Fig. 2a, GCPPO achieves the highest converged reward and converges in

Figure 2: Comparison of training performance among GCPPO and baseline algorithms. (a) Reward curves for different algorithms. (b) Cost curves for different algorithms under the energy constraint threshold. (c) Convergence trajectories of different algorithms in the reward-cost plane.
In Fig. 3, we present the evaluation results on independent test episodes. As shown in Fig. 3a, GCPPO achieves the lowest average system latency, reducing it by

Figure 3: Performance evaluation of GCPPO and baseline algorithms for low-latency and energy-aware AI agent migration optimization. (a) Average system latency. (b) Average energy consumption. (c) Constraint satisfaction rate.
Fig. 4 illustrates the impact of the energy budget on GCPPO in the IoVA. As the budget increases from

Figure 4: Reward curves and constraint satisfaction rates of the GCPPO algorithm under different energy budgets.
Figs. 5 and 6 illustrate the impact of wireless bandwidth and vehicle transmit power in the IoVA, respectively. As bandwidth increases from

Figure 5: Impact of wireless bandwidth on the performance of GCPPO and baseline algorithms. (a) System latency curves and normalized average latency. (b) Energy consumption curves and normalized average energy consumption.

Figure 6: Impact of vehicle transmit power on the performance of GCPPO and baseline algorithms. (a) System latency curves and normalized average latency. (b) Energy consumption curves and normalized average energy consumption.
In Fig. 7, we evaluate the scalability of all algorithms as the number of vehicles increases from

Figure 7: Impact of the number of vehicles on the performance of GCPPO and baseline algorithms, where the scalability exponent is estimated via least-squares linear regression in log-log space. (a) Per-vehicle latency curves and scalability exponent. (b) Per-vehicle energy consumption curves and scalability exponent.
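The scalability exponent named in the caption can be computed as the slope of an ordinary least-squares line fitted to the logarithms of the data. The following Python sketch shows that calculation; the function name and the sample numbers are illustrative and are not taken from the paper's results.

```python
import math


def scalability_exponent(num_vehicles, metric):
    """Slope of the least-squares line through (log N, log metric).

    If the metric grows as c * N**k, the returned value estimates k.
    """
    xs = [math.log(n) for n in num_vehicles]
    ys = [math.log(m) for m in metric]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic example: a metric that grows with the square root of N.
counts = [10, 20, 40, 80]
synthetic = [2.0 * n ** 0.5 for n in counts]
exponent = scalability_exponent(counts, synthetic)
```

An exponent near zero means the per-vehicle metric is almost insensitive to the number of vehicles, while larger values indicate faster degradation as the network grows.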
As shown in Fig. 7a, GCPPO achieves the lowest latency scalability exponent of
6 Conclusion
In this paper, we have proposed a low-latency and energy-aware AI agent migration framework in the IoVA, jointly characterizing the end-to-end service latency and vehicle energy consumption across communication, inference, and migration phases. To solve the formulated CMDP, we have designed the GCPPO algorithm, which leverages a bidirectional GAT encoder to capture the relational structure among vehicles and RSUs, thereby enabling topology-aware resource allocation. It further incorporates a Lagrangian dual mechanism to adaptively enforce the long-term energy constraints without requiring predefined penalty weights. Simulation results demonstrate the effectiveness and scalability of GCPPO, which achieves a
Acknowledgement: None.
Funding Statement: This work was supported by the 2024 Guangdong Province Education Science Planning Project (Higher Education Special Project) under Grant 2024GXJK621.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Kanyang Jiang, Yingkai Kang and Ming Li; methodology, Kanyang Jiang and Yingkai Kang; software, Kanyang Jiang and Yingkai Kang; validation, Kanyang Jiang, Yingkai Kang and Ming Li; formal analysis, Kanyang Jiang and Yingkai Kang; investigation, Kanyang Jiang and Yingkai Kang; resources, Ming Li; data curation, Kanyang Jiang and Yingkai Kang; writing—original draft preparation, Kanyang Jiang and Yingkai Kang; writing—review and editing, Ming Li; visualization, Kanyang Jiang and Yingkai Kang; supervision, Ming Li; project administration, Ming Li. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: Not applicable.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Chang Y, Wang X, Wang J, Wu Y, Yang L, Zhu K, et al. A survey on evaluation of large language models. ACM Trans Intell Syst Technol. 2024;15(3):39. doi:10.1145/3641289.
2. Plaat A, Wong A, Verberne S, Broekens J, Van Stein N, Bäck T. Multi-step reasoning with large language models, a survey. ACM Comput Surv. 2025;58(6):160. doi:10.1145/3774896.
3. Guo T, Chen X, Wang Y, Chang R, Pei S, Chawla NV, et al. Large language model based multi-agents: a survey of progress and challenges. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24. CA, USA: International Joint Conferences on Artificial Intelligence Organization; 2024. p. 8048–57.
4. Mahmud D, Hajmohamed H, Almentheri S, Alqaydi S, Aldhaheri L, Khalil RA, et al. Integrating LLMs with ITS: recent advances, potentials, challenges, and future directions. IEEE Trans Intell Transp Syst. 2025;26(5):5674–709. doi:10.1109/TITS.2025.3528116.
5. Wen X, Wen J, Xiao M, Kang J, Zhang T, Li X, et al. Defending against network attacks for secure AI agent migration in vehicular metaverses. IEEE Internet Things J. 2026;13(3):4153–66. doi:10.1109/JIOT.2025.3633501.
6. Clancy J, Mullins D, Deegan B, Horgan J, Ward E, Eising C, et al. Wireless access for V2X communications: research, challenges and opportunities. IEEE Commun Surv Tutor. 2024;26(3):2082–119. doi:10.1109/COMST.2024.3384132.
7. Wang Y, Guo S, Pan Y, Su Z, Chen F, Luan TH, et al. Internet of agents: fundamentals, applications, and challenges. IEEE Trans Cognit Commun Netw. 2026;12:4476–501. doi:10.1109/TCCN.2025.3623369.
8. Zheng Y, Chen Y, Qian B, Shi X, Shu Y, Chen J. A review on edge large language models: design, execution, and applications. ACM Comput Surv. 2025;57(8):1–35. doi:10.1145/3719664.
9. Qu G, Chen Q, Wei W, Lin Z, Chen X, Huang K. Mobile edge intelligence for large language models: a contemporary survey. IEEE Commun Surv Tutor. 2025;27(6):3820–60. doi:10.36227/techrxiv.172115025.57884352/v1.
10. Chen Z, Huang S, Min G, Ning Z, Li J, Zhang Y. Mobility-aware seamless service migration and resource allocation in multi-edge IoV systems. IEEE Trans Mob Comput. 2025;24(7):6315–32. doi:10.1109/TMC.2025.3540407.
11. Qiu B, Wang Y, Xiao H, Zhang Z. Deep reinforcement learning-based adaptive computation offloading and power allocation in vehicular edge computing networks. IEEE Trans Intell Transp Syst. 2024;25(10):13339–49. doi:10.1109/TITS.2024.3391831.
12. Shui T, Saad W, Hu Y, Chen M. Resilient vehicular communications under imperfect channel state information. IEEE Trans Wirel Commun. 2026;25:6442–59. doi:10.1109/twc.2025.3625199.
13. Xu Y, Zhu K, Xu H, Ji J. Deep reinforcement learning for multi-objective resource allocation in multi-platoon cooperative vehicular networks. IEEE Trans Wirel Commun. 2023;22(9):6185–98. doi:10.1109/twc.2023.3240425.
14. Wang P, Wu W, Liu J, Chai G, Feng L. Joint spectrum and power allocation for V2X communications with imperfect CSI. IEEE Trans Vehic Technol. 2023;72(12):16338–53. doi:10.1109/tvt.2023.3299691.
15. Ju Y, Chen Y, Cao Z, Liu L, Pei Q, Xiao M, et al. Joint secure offloading and resource allocation for vehicular edge computing network: a multi-agent deep reinforcement learning approach. IEEE Trans Intell Transp Syst. 2023;24(5):5555–69. doi:10.1109/TITS.2023.3242997.
16. Li P, Wang X, Li C, Iqbal M, Al-Dulaimi A, Chih-Lin I, et al. Deep reinforcement learning-based task scheduling and resource allocation for vehicular edge computing: a survey. IEEE Trans Intell Transp Syst. 2025;26(12):21472–501. doi:10.1109/tits.2025.3607910.
17. Ji M, Wu Q, Fan P, Cheng N, Chen W, Wang J, et al. Graph neural networks and deep reinforcement learning-based resource allocation for V2X communications. IEEE Internet Things J. 2025;12(4):3613–28. doi:10.1109/JIOT.2024.3469547.
18. Wachi A, Shen X, Sui Y. A survey of constraint formulations in safe reinforcement learning. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24. CA, USA: International Joint Conferences on Artificial Intelligence Organization; 2024. p. 8262–71.
19. Zhang R, Xiong K, Du H, Niyato D, Kang J, Shen X, et al. Generative AI-enabled vehicular networks: fundamentals, framework, and case study. IEEE Netw. 2024;38(4):259–67. doi:10.1109/MNET.2024.3391767.
20. Xie G, Xiong Z, Zhang X, Xie R, Guo S, Guizani M, et al. GAI-IoV: bridging generative AI and vehicular networks for ubiquitous edge intelligence. IEEE Trans Wirel Commun. 2024;23(10):12799–814. doi:10.1109/TWC.2024.3396276.
21. Hu Y, Ye D, Kang J, Wu M, Yu R. A cloud-edge collaborative architecture for multimodal LLM-based advanced driver assistance systems in IoT networks. IEEE Internet Things J. 2025;12(10):13208–21. doi:10.1109/jiot.2024.3509628.
22. Li B, Liu T, Wang W, Zhao C, Wang S. Agent-as-a-service: an AI-native edge computing framework for 6G networks. IEEE Netw. 2025;39(2):44–51. doi:10.1109/mnet.2024.3520987.
23. Zhong Y, Kang J, Wen J, Ye D, Nie J, Niyato D, et al. Generative diffusion-based contract design for efficient AI twin migration in vehicular embodied AI networks. IEEE Trans Mobile Comput. 2025;24(5):4573–88. doi:10.1109/tmc.2025.3526230/mm1.
24. Khairy S, Balaprakash P, Cai LX, Cheng Y. Constrained deep reinforcement learning for energy sustainable multi-UAV based random access IoT networks with NOMA. IEEE J Selected Areas Commun. 2021;39(4):1101–15. doi:10.1109/JSAC.2020.3018804.
25. Altman E. Constrained Markov decision processes. Abingdon, UK: Routledge; 1999. doi:10.1201/9781315140223.
26. Koursioumpas N, Magoula L, Petropouleas N, Thanopoulos AI, Panagea T, Alonistioti N, et al. A safe deep reinforcement learning approach for energy efficient federated learning in wireless communication networks. IEEE Trans Green Commun Netw. 2024;8(4):1862–74. doi:10.1109/TGCN.2024.3372695.
27. Gao Z, Hao H, Gao F, Zhao R. Constrained reinforcement-learning-enabled policies with augmented Lagrangian for cooperative intersection management. IEEE Internet Things J. 2025;12(5):5396–411. doi:10.1109/jiot.2024.3487854.
28. Huang H, Ye Q, Zhou Y. Safety-critical offloading with constrained reinforcement learning for multi-access edge computing. ACM Trans Sens Netw. 2025;21(2):1–37. doi:10.1145/3715695.
29. Jang Y, Jeong S, Kang J. Energy-efficient vehicular edge computing with one-by-one access scheme. IEEE Wirel Commun Lett. 2024;13(1):39–43. doi:10.1109/LWC.2023.3318632.
30. Kang Y, Wen J, Kang J, Zhang T, Du H, Niyato D, et al. Hybrid-generative diffusion models for attack-oriented twin migration in vehicular metaverses. IEEE Trans Vehic Technol. 2025;74(9):14720–34. doi:10.1109/tvt.2025.3566034.
31. Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph attention networks. arXiv:1710.10903. 2018.
32. Schulman J, Moritz P, Levine S, Jordan MI, Abbeel P. High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438. 2016.
33. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. arXiv:1707.06347. 2017.
34. Achiam J, Held D, Tamar A, Abbeel P. Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning. London, UK: PMLR; 2017. p. 22–31.
35. Ray A, Achiam J, Amodei D. Benchmarking safe exploration in deep reinforcement learning. arXiv:1910.01708. 2019.
36. Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning. London, UK: PMLR; 2018. p. 1861–70.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

