Open Access
ARTICLE
A3TD: A Deep Reinforcement Learning Algorithm for Joint Resource Allocation in RIS-Aided CNOMA-D2D Networks
Software School, Nanchang Hangkong University, Nanchang, China
* Corresponding Author: Chen Sun. Email:
Computers, Materials & Continua 2026, 88(1), 80 https://doi.org/10.32604/cmc.2026.079214
Received 16 January 2026; Accepted 13 March 2026; Issue published 08 May 2026
Abstract
This paper investigates the joint resource allocation problem in Reconfigurable Intelligent Surface (RIS)-assisted cooperative non-orthogonal multiple access device-to-device (CNOMA-D2D) cellular networks. To tackle the high-dimensional, non-convex joint optimization of power control, RIS phase configuration, and channel assignment, we propose an integrated user pairing strategy, PIP-UP, which quantifies pairing utility through three factors neglected in existing methods: phase alignment, interference suppression, and power difference. Furthermore, we develop a hybrid deep reinforcement learning algorithm, A3TD, combining the parallel exploration capability of Asynchronous Advantage Actor-Critic (A3C) with the stable continuous optimization of Twin Delayed Deep Deterministic Policy Gradient (TD3). This integration enables efficient and robust joint optimization of D2D channel allocation, transmit power, and RIS phase shifts. Simulation results demonstrate that the proposed A3TD algorithm significantly outperforms the Actor-Critic (AC), Deep Deterministic Policy Gradient (DDPG), and TD3 baselines in terms of sum rate and convergence speed, validating its effectiveness for resource management in complex RIS-assisted CNOMA-D2D networks.
1 Introduction
The relentless densification of 5G-and-beyond networks necessitates efficient resource allocation techniques to meet soaring capacity demands. Device-to-device (D2D) communication, leveraging underutilized macro-cell resources, presents a promising solution to enhance spectral efficiency and energy sustainability. However, this paradigm introduces severe intra-cell interference due to co-channel D2D transmissions, fundamentally undermining resource allocation efficacy in dense deployments [1]. While cooperative non-orthogonal multiple access (CNOMA) mitigates this challenge via superposition coding and successive interference cancellation (SIC), its performance hinges critically on user pairing strategies that simultaneously manage channel heterogeneity and interference dynamics [2]. Reconfigurable intelligent surfaces (RIS) introduce a transformative degree of freedom for wireless environment control [3]. By intelligently adjusting signal reflection phases, RIS can enhance desired signals, suppress interference, and reshape channel correlations. Its integration with CNOMA-D2D networks unlocks new potentials but also compounds the resource allocation challenge.
Predominant research on user pairing in NOMA/CNOMA systems relies heavily on heuristic methods centered on channel gains. For instance, pairing schemes often maximize inter-user channel gain differences [4] or apply combinatorial optimization like the Hungarian algorithm for spectral efficiency [5,6]. Strategies assuming fixed user roles [7] or those focusing solely on channel strength ratios [8,9] further lack adaptability in dynamic environments. Even recent advancements employing game theory [10] or multi-agent reinforcement learning (RL) [11,12] for pairing and resource allocation often incur high computational complexity and fail to jointly optimize with RIS. These approaches largely neglect the phase characteristics of user channels with RIS, which are crucial for constructive/destructive signal superposition in SIC, and the dynamic power constraints necessary to maintain the decoding order in the power domain of D2D pairs.
Existing studies in RIS-aided CNOMA systems often treat user pairing and RIS beamforming separately. Some optimize pairing based on conventional channel state information [13,14], while others focus on joint power and phase optimization without deeply coupling with user pairing dynamics [15,16]. Consequently, the synergistic impact of RIS-induced phase alignment on SIC decoding reliability within paired users remains inadequately explored. Moreover, existing optimization methods, ranging from block coordinate descent [17,18], meta-RL [19] to deep RL [20,21] and hybrid deep RL [22], struggle with the coupled, high-dimensional action space, often suffering from slow convergence, training instability, or limited generalization.
Existing user pairing strategies in NOMA/CNOMA systems predominantly rely on channel gain disparities [4,5,11], often neglecting the phase characteristics introduced by RIS, a critical factor for SIC performance in reflected environments. Similarly, prior DRL-based resource allocation methods either suffer from high variance (e.g., A3C in continuous domains) or slow convergence (e.g., TD3 in high-dimensional spaces) when applied in isolation [19–22]. This gap motivates an integrated design philosophy: pairing must account for phase coherence and power dynamics, while optimization must balance exploration efficiency with update stability.
To incorporate the critical resource management factors of phase alignment and power control in RIS-aided CNOMA-D2D networks, we propose a Phase matching, Interference suppression, and Power control-based User Pairing (PIP-UP) mechanism. Inspired by the complementary strengths of A3C in parallelized exploration [23] and TD3 in deterministic policy refinement [24], we propose A3TD, a hybrid framework that concurrently addresses the combinatorial and continuous aspects of RIS-aided resource allocation. This approach is not only tailored to CNOMA-D2D networks but also embodies a generalizable methodology for joint optimization in other RIS-enhanced or multi-agent wireless systems where phase alignment and interference coordination are pivotal. A comparison of existing works and the proposed method on resource allocation in RIS-aided NOMA-D2D networks is shown in Table 1.

To bridge these gaps, this paper proposes a holistic framework for joint resource optimization in RIS-aided CNOMA-D2D networks. Our work makes the following key contributions:
• PIP-UP Strategy quantifies pairing suitability through three metrics: (1) Phase Alignment Degree (PAD), leveraging RIS to enhance channel coherence for robust SIC; (2) Interference Degree (ID), measuring and suppressing mutual interference; (3) Power Difference Factor (PDF), ensuring adherence to NOMA’s power-domain decoding constraints. PIP-UP moves beyond gain-only metrics, explicitly incorporating phase and power dynamics to improve pairing reliability.
• A3TD Algorithm harnesses A3C’s multi-threaded parallel exploration for broad and efficient sampling of the state-action space, while employing TD3’s dual-critic architecture and delayed policy updates to ensure stable and precise optimization in continuous action domains. This enables the simultaneous optimization of D2D channel allocation, transmit power, and RIS phase shifts.
• Comprehensive Performance Validation: Through extensive simulations, we demonstrate that the proposed A3TD algorithm, coupled with the PIP-UP strategy, significantly outperforms state-of-the-art baseline algorithms (including AC, DDPG, and TD3) in terms of system sum rate and convergence speed. The results validate the effectiveness of our framework in RIS-assisted CNOMA-D2D networks.
The remainder of this paper is organized as follows: Section 2 presents the system model and problem formulation. Section 3 details the proposed PIP-UP strategy and the A3TD algorithm. Section 4 discusses simulation results and performance analysis. Finally, Section 5 concludes the paper and outlines future research directions.
2 System Model and Problem Formulation
As illustrated in Fig. 1, we consider a downlink RIS-aided CNOMA-D2D communication system comprising a single-antenna base station (BS), an RIS unit equipped with an

Figure 1: Illustration of an RIS-aided CNOMA-D2D cell.
Each CNOMA-D2D pair comprises a D2D transmitter (Tx) and two receivers (a strong user

The channel gain of the direct link from the BS to cellular user
where
The signal transmitted by the BS to cellular user
where
The signal-to-interference-plus-noise ratio (SINR) for user
and the achievable rate of
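The SINR and rate expressions above are truncated in this extraction. As a hedged reconstruction consistent with the surrounding definitions (symbols assumed here: transmit power $p_c$, effective BS-to-user channel $h_c$, aggregate co-channel D2D interference $I_{\mathrm{D2D}}$, noise power $\sigma^2$, bandwidth $B$), the standard forms would be:

```latex
\gamma_c = \frac{p_c \lvert h_c \rvert^2}{I_{\mathrm{D2D}} + \sigma^2},
\qquad
R_c = B \log_2\!\left(1 + \gamma_c\right)
```

These are not the paper's exact equations, only the canonical underlay-interference forms they describe.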
CNOMA transmission Phase I (Direct Transmission):
where
Thus, the received signal at the strong user
CNOMA transmission Phase II (Cooperative Relaying): After successfully decoding the signal of the weak user
The transmitted signal from the strong user to the weak user
where
Hence, the received signal at the weak user during the second phase can be expressed as:
where
For the weak user
The expressions for inter-pair D2D interference
The achievable rate for the D2D user pair
The optimization objective of the RIS-aided CNOMA-D2D network is to maximize the sum rate of all D2D links, which is formulated as P1:
where:
• C1 enforces a minimum SINR requirement for each cellular user to guarantee its quality of service (QoS);
• C2 ensures minimum SINR requirements for both the strong and weak users in each D2D pair to maintain D2D link reliability;
• C3 requires that each D2D pair reuses the channel of exactly one cellular user;
• C4 imposes transmit power constraints on each D2D transmitter during both the direct transmission and cooperative relaying phases;
• C5 limits the transmit power from BS to each cellular user;
• C6 mandates that the power allocation coefficient for the strong user exceeds that for the weak user (
• C7 specifies a continuous phase-shift model for the RIS reflection coefficients.
Underlay spectrum sharing between CUs and D2D pairs boosts spectral efficiency via spatial reuse but introduces mutual co-channel interference. This trade-off is managed by QoS constraints C1 (CUs) and C2 (D2D users), which maximize D2D sum rate while ensuring all links meet SINR thresholds for balanced coexistence.
Constraint C3 (each D2D pair reuses exactly one CU channel) is motivated by three factors: (i) Complexity control—multi-channel reuse adds binary variables, making the non-convex problem intractable for conventional solvers and DRL algorithms; (ii) Interference localization—single-channel reuse confines interference to one CU and co-channel D2D pairs, simplifying coordination and enhancing RIS phase alignment; (iii) CNOMA compatibility—power-domain NOMA and SIC are designed for a single shared channel; multi-channel operation contradicts NOMA principles and increases transceiver complexity. This widely adopted assumption [1,4,7] ensures a tractable yet realistic evaluation framework.
Compared to discrete phase-shift models, constraint C7 enables finer and more precise phase configuration, maximizing the optimization space for phase alignment. Consequently, the system can accurately match the channel phase characteristics among users, thereby enhancing the reliability of SIC decoding and supporting the construction of a continuous action space. The continuous phase-shift model is a well-established assumption in the vast majority of RIS-aided communication literature [12,14,16,22].
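The continuous phase-shift model referenced by C7 is conventionally written as follows (a standard form, assuming an $N$-element RIS with reflection matrix $\boldsymbol{\Theta}$; the paper's exact notation is not visible in this extraction):

```latex
\boldsymbol{\Theta} = \mathrm{diag}\!\left(e^{j\theta_1}, e^{j\theta_2}, \ldots, e^{j\theta_N}\right),
\qquad \theta_n \in [0, 2\pi), \; n = 1, \ldots, N
```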
3.1.1 Unified Pairing Weight (UPW)
In an RIS-aided CNOMA-D2D network, user pairing is critical for achieving efficient resource allocation. The proposed PIP-UP strategy integrates three key factors, namely phase alignment, interference, and power difference, to derive a joint pairing weight based on these metrics.
(1) Phase Alignment Degree (PAD) quantifies the similarity in phase between the equivalent channels of two users:
where
(2) Interference Degree (ID) quantifies the channel correlation and mutual interference level between two users:
where
(3) Power Difference Factor (PDF) quantifies the disparity in power requirements between two users:
where
UPW is proposed to combine the three key metrics—PAD, ID, and PDF—into a single measure of pairing suitability between two users in an RIS-aided CNOMA-D2D network. Based on the normalized versions of these metrics, the UPW is defined as:
where
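The UPW definition itself is truncated in this extraction. A plausible form consistent with the surrounding description (normalized metrics combined with tunable weights $\omega_1, \omega_2, \omega_3$; the weights and signs are our assumptions, since higher PAD and PDF favor pairing while higher ID penalizes it) would be:

```latex
\mathrm{UPW}_{i,j}
= \omega_1 \,\widetilde{\mathrm{PAD}}_{i,j}
- \omega_2 \,\widetilde{\mathrm{ID}}_{i,j}
+ \omega_3 \,\widetilde{\mathrm{PDF}}_{i,j},
\qquad \omega_1 + \omega_2 + \omega_3 = 1
```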
3.1.2 PIP-UP Algorithm Description and Complexity Analysis
Based on UPW, a greedy user pairing algorithm is proposed to iteratively select the best user pair under the current conditions to obtain a locally optimal solution. As illustrated in Algorithm 1, after the composite channel is computed for each receiving user
Users

The time complexity of the PIP-UP user pairing strategy is determined by the cumulative computational load of three sequential steps: candidate pairing matrix construction, three-dimensional core metric calculation with joint pairing weight derivation, and greedy iterative optimization. For M D2D pairs, the candidate pairing matrix must be traversed
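The greedy selection step of PIP-UP can be sketched as follows. This is a minimal illustration of the described loop (pick the highest-UPW candidate, remove both users, repeat), not the authors' implementation; the function name and the dense-matrix input are assumptions.

```python
import numpy as np

def pip_up_pairing(upw):
    """Greedy user pairing from a Unified Pairing Weight matrix.

    `upw` is a symmetric (U x U) matrix of pairing weights; diagonal
    entries are ignored. Repeatedly selects the unpaired (i, j) with the
    highest UPW until fewer than two users remain, yielding the locally
    optimal pairing described in Algorithm 1 of the text.
    """
    upw = np.asarray(upw, dtype=float)
    available = set(range(upw.shape[0]))
    pairs = []
    while len(available) >= 2:
        # Scan all remaining candidate pairs for the best weight.
        best, best_w = None, -np.inf
        for i in available:
            for j in available:
                if i < j and upw[i, j] > best_w:
                    best, best_w = (i, j), upw[i, j]
        pairs.append(best)
        available -= set(best)  # both users leave the candidate pool
    return pairs
```

Each greedy round rescans the remaining candidates, which matches the cubic-order cost implied by the complexity analysis above.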
3.2 Reinforcement Learning Based Resource Allocation Algorithm
In an RIS-aided CNOMA-D2D network, resource allocation requires joint optimization across multiple dimensions. Conventional optimization methods face significant limitations in such high-dimensional and complex environments, suffering from high computational complexity and slow convergence. To address this challenge efficiently, the resource allocation problem is formulated as a decision-making process of deep reinforcement learning (DRL). Specifically, it is modeled as a multi-agent Markov Decision Process (MDP), leveraging DRL’s capabilities in parallel exploration and policy optimization to achieve efficient resource allocation.
The reward function is defined as the instantaneous total throughput at time
where
The state
where
The action space
where
State space: channel gains
The channel gain matrix
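The MDP formulation above (state from channel gains, powers, and phases; continuous actions; sum rate as reward) can be sketched as a toy environment. The class name, the interference-free rate model, and all dimensions are illustrative assumptions, not the paper's exact formulation; channel assignment is omitted for brevity.

```python
import numpy as np

class CnomaD2dEnv:
    """Toy sketch of the Section 3.2 MDP for resource allocation.

    State: flattened channel gains plus current powers and RIS phases.
    Action: continuous vector [D2D powers | RIS phase shifts].
    Reward: instantaneous sum rate under a simplified rate model.
    """

    def __init__(self, n_pairs=3, n_ris=8, noise=1e-3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_pairs, self.n_ris, self.noise = n_pairs, n_ris, noise
        self.gains = None

    def reset(self):
        # Draw small-scale fading gains for each D2D pair.
        self.gains = self.rng.rayleigh(scale=1.0, size=self.n_pairs)
        return self._state(np.zeros(self.n_pairs), np.zeros(self.n_ris))

    def _state(self, powers, phases):
        return np.concatenate([self.gains, powers, phases])

    def step(self, action):
        powers = np.clip(action[: self.n_pairs], 0.0, 1.0)
        phases = np.mod(action[self.n_pairs:], 2 * np.pi)
        # Simplified reward: sum of Shannon rates, ignoring cross-pair
        # interference terms that the full model would include.
        sinr = powers * self.gains ** 2 / self.noise
        reward = float(np.sum(np.log2(1.0 + sinr)))
        return self._state(powers, phases), reward
```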
3.3 A3TD Deep Reinforcement Learning
To jointly optimize power allocation, phase configuration, and channel assignment in complex channel environments and thereby maximize spectral efficiency, this paper proposes an A3TD deep reinforcement learning algorithm, which integrates the strengths of both A3C and TD3. It leverages A3C’s capability of multi-threaded parallel exploration to efficiently explore complex environments, while exploiting TD3’s optimization in continuous action space to enable rapid iteration and stable convergence in resource allocation. The architecture of the A3TD algorithm is illustrated in Fig. 2.

Figure 2: Architecture of the proposed A3TD hybrid deep reinforcement learning framework. Left: A3C module employs multiple parallel actors to explore the environment asynchronously, generating diverse experience trajectories. Right: TD3 module refines the policy using dual critics and delayed updates for stable and precise optimization in continuous action spaces. Center: a shared replay buffer stores transitions collected by A3C and feeds them to TD3, decoupling exploration from learning.
The A3C module accelerates agents’ learning through multi-threaded parallel training. Multiple threads interact independently with the environment to generate experience samples; after updating their local AC network parameters, they asynchronously synchronize gradients with the A3C global network.
The TD3 module utilizes the samples generated by A3C: the Critic networks evaluate action values to generate Q-values, while the Actor network determines actions to guide resource allocation and maximize the system sum rate.
During training, the two modules operate in a synergistic and complementary manner: A3C provides diverse samples and exploration direction, while TD3 refines the policy updates through accurate action evaluation, leading to more efficient network performance and faster convergence.
3.3.1 Multi-Threaded A3C-Based Exploration Module
This module is responsible for generating diverse experience data. Its policy network optimizes the probability distribution over actions by maximizing the following objective function:
where
where
Meanwhile, the Critic network estimates the state-value function
where
The parameter update formula for the A3C Critic network is:
where
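The A3C update equations above are truncated in this extraction. The standard forms they describe (our reconstruction, with policy $\pi_\theta$, value function $V_\phi$, discount factor $\gamma$, and one-step advantage $A$) are:

```latex
\nabla_{\theta} J(\theta)
= \mathbb{E}\!\left[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t)\right],
\qquad
A(s_t, a_t) = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)
```

with the Critic trained to minimize $L(\phi) = \big(r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)\big)^2$.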
The experience collected by A3C,
where a batch of data
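The experience bridge between the A3C workers and the TD3 learner can be sketched as a shared replay buffer. This is a single-threaded illustration (a real asynchronous version would guard `push`/`sample` with a lock); the class name and capacity are assumptions.

```python
import random
from collections import deque

class SharedReplayBuffer:
    """Buffer storing transitions collected by A3C workers for TD3.

    Workers push (state, action, reward, next_state) tuples; the TD3
    learner samples uniform mini-batches, decoupling exploration from
    policy refinement as described in the text.
    """

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```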
3.3.2 TD3-Based Policy Refinement Module
This module employs dual Critic networks and delayed policy update mechanism to improve optimization stability and efficiency, which is suitable for resource allocation problems involving high-dimensional continuous action space.
The Critic network estimates action values and updates the target values:
where
The loss function for the Critic network is calculated as:
The training objective of the Critic network is to minimize the error between estimated and target values. The parameters
The Actor network directly generates action
where
The TD3 enhances training stability through delayed updates of the Actor network and target networks. The Critic network is updated at every step, while the Actor network is updated every two steps. The target Actor network is updated as:
The target Critic networks are updated as:
where
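The TD3 mechanics described above — clipped double-Q targets and Polyak-averaged target networks — can be sketched numerically. Function names and the scalar interface are illustrative; a full implementation would operate on network outputs rather than scalars.

```python
import numpy as np

def td3_target(reward, next_q1, next_q2, gamma=0.99, done=False):
    """Clipped double-Q target from Section 3.3.2.

    Takes the minimum of the two target-Critic estimates to curb value
    overestimation; `gamma` is the discount factor and `done` zeroes the
    bootstrap term at episode boundaries.
    """
    q_min = np.minimum(next_q1, next_q2)
    return reward + (0.0 if done else gamma) * q_min

def soft_update(target_param, source_param, tau=0.005):
    """Polyak averaging for target networks: theta' <- tau*theta + (1-tau)*theta'."""
    return tau * np.asarray(source_param) + (1.0 - tau) * np.asarray(target_param)
```

In the delayed-update schedule from the text, `td3_target` would drive a Critic update every step, while the Actor and `soft_update` calls run only every second step.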
3.3.3 A3TD Algorithm Description and Complexity Analysis
As illustrated in Algorithm 2, the A3TD algorithm combines A3C’s asynchronous parallelism with TD3’s stability through a hybrid framework. It initializes separate networks: A3C’s global Actor (

The state space is determined by the channel, power, and phase information of M D2D pairs, K cellular users, and
4 Simulation Results and Analysis
This section presents a comprehensive performance evaluation of the proposed A3TD algorithm within an RIS-assisted CNOMA-D2D network through systematic simulation experiments. The simulation platform is built using Python and the PyTorch deep learning framework for model construction and DRL algorithm training. To benchmark our approach, the proposed A3TD algorithm is compared against three state-of-the-art deep reinforcement learning baseline algorithms:
• AC algorithm [25] implements an Actor-Critic architecture with independent agent learning via an online experience buffer.
• TD3 algorithm [26] extends DDPG with dual Critic networks and delayed policy updates to mitigate value overestimation and enhance training stability.
• DDPG algorithm [27] incorporates experience replay and target network mechanisms with soft parameter updates, optimized for continuous action space resource allocation.
The evaluation employs two key metrics: (i) the system sum rate, representing the total achievable data rate calculated according to Formula (16), and (ii) the convergence time, defined as the number of training episodes required for an algorithm to achieve stable performance. An algorithm is considered converged when the moving average of its sum rate remains within 97% of its peak observed performance for 200 consecutive episodes.
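The convergence criterion above can be made concrete as follows. The running-peak and cumulative-moving-average details are our reading of the stated rule, not the authors' code.

```python
import numpy as np

def converged_episode(sum_rates, window=200, ratio=0.97):
    """First episode at which the convergence rule is met.

    Declares convergence when the moving average of the sum rate stays
    within `ratio` of its peak observed value for `window` consecutive
    episodes; returns None if the run never converges.
    """
    ma = np.cumsum(sum_rates) / np.arange(1, len(sum_rates) + 1)
    peak, streak = -np.inf, 0
    for ep, v in enumerate(ma):
        peak = max(peak, v)
        streak = streak + 1 if v >= ratio * peak else 0
        if streak >= window:
            return ep - window + 1  # start of the stable stretch
    return None
```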
D2D user positions are randomly generated within a single cell with the RIS deployed in the center area, which is a circular area centered at the base station with a 100-m radius—a setup commonly adopted in related works such as [13]. The SINR calculation accounts for both strong and weak user components in the NOMA scheme.
As illustrated in Table 3, the simulation parameters align with both the practical application scenarios of RIS-aided CNOMA-D2D communication systems and the standard simulation specifications in the field. The hyperparameters of A3TD are given in Table 4.

Fig. 3 illustrates the relationship between convergence time and the number of D2D pairs. As network density increases, the convergence episodes for all algorithms rise due to heightened environmental complexity. The proposed A3TD algorithm consistently achieves the shortest convergence time, with its advantage becoming more pronounced in denser scenarios. For instance, in a network with 20 D2D pairs, A3TD reduces convergence episodes by approximately 30% to 45% compared to the baselines. In dense scenarios characterized by an increasing number of D2D pairs, the significantly faster convergence exhibited by A3TD is fundamentally attributed to the A3C component’s multi-threaded parallel sampling capability. This architecture enables concurrent exploration from diverse environmental states, drastically accelerating the collection of experiences within the high-dimensional state-action space. Consequently, it breaks the sample temporal correlation bottleneck inherent in traditional serial sampling, allowing the algorithm to learn effective resource allocation patterns much more rapidly.

Figure 3: Relationship between D2D pair number and convergence episode.
Fig. 4a depicts the sum rate performance as the number of RIS reflecting elements increases from 16 to 49. All algorithms benefit significantly from a larger RIS, as it provides greater flexibility for phase adjustments, enabling precise signal path optimization and the creation of virtual line-of-sight links.

Figure 4: Sum rate, energy efficiency, and convergence episode with different numbers of RIS units.
Fig. 4b shows the corresponding convergence behavior. As the RIS dimension grows, posing a higher-dimensional non-convex optimization challenge, the convergence episodes for all baselines increase markedly. A3TD maintains the fastest convergence, achieving a 23% to 42% reduction in episodes with 49 RIS elements.
As the number of RIS reflecting elements increases—causing the dimensionality of the optimization problem to escalate sharply—the superior convergence and sum-rate performance of A3TD can be attributed to the synergistic effect within its architecture. The extensive exploration driven by A3C ensures that the algorithm avoids getting trapped in the proliferating local optima caused by dimensional expansion. Meanwhile, the TD3 component performs robust policy refinement along the explored directions in this high-dimensional continuous space, leveraging conservative value estimation (by taking the minimum of dual critic networks) and delayed policy updates, thereby achieving steady performance improvement.
The effect of maximum transmit power is analyzed in Fig. 5. As shown in Fig. 5a, increasing power from 0.5 to 2.0 W improves the sum rate for all algorithms, following the trend predicted by the Shannon capacity formula. However, the marginal gain diminishes at higher power levels due to the concurrent increase in co-channel interference within the multi-user D2D network.

Figure 5: Sum rate and convergence episode with different maximum transmission power.
Fig. 5b reveals that transmit power variations have a relatively minor impact on convergence time compared to network density or RIS size. This is because power scaling primarily affects the reward magnitude without altering the dimensionality or fundamental complexity of the state/action space. A3TD consistently delivers the highest sum rate across all power levels, and its performance advantage remains stable, verifying the robustness of its hybrid architecture under different power constraints.
Fig. 6 investigates performance under varying numbers of available channels. Fig. 6a confirms that ample channel resources (increasing from 5 to 20) significantly alleviate co-channel interference, leading to substantial sum rate improvements for all algorithms. In resource-constrained scenarios (5 channels), algorithmic performance is similar. However, as resources become abundant, the superior optimization capability of A3TD becomes more evident, widening the performance gap.

Figure 6: Sum rate and convergence episode with different channel numbers.
Correspondingly, Fig. 6b shows that convergence is faster in resource-rich environments. A constrained action space forces agents into repeated iterations to resolve allocation conflicts, while a broader action space allows algorithms like A3TD to more readily discover high-reward strategies, accelerating convergence.
The evolution of spectral efficiency during training under different algorithms is compared in Fig. 7. The learning curves distinctly highlight the differences in convergence speed, stability, and final performance. The A3TD algorithm demonstrates the most rapid convergence and the highest stability. The AC algorithm exhibits high variance and policy drift. While DDPG and TD3 improve stability through experience replay and target networks, and TD3 achieves higher final performance than DDPG via its twin critics, A3TD surpasses them all.

Figure 7: Sum spectral efficiency in training episodes (Moving average).
The rapid convergence and high stability of A3TD demonstrated in the learning curves are a direct manifestation of its hybrid design advantages. The A3C component provides diverse, decorrelated training samples, accelerating the initial learning phase. Concurrently, the TD3 component leverages these samples for stable policy optimization, thereby mitigating the inherent instability often associated with AC algorithms, as well as avoiding the insufficient exploration or slow convergence issues that may plague DDPG/TD3 in complex environments. In contrast, the AC framework suffers from inefficient exploration and policy instability; DDPG exhibits shortcomings in both exploration and high-dimensional optimization; and TD3, although more stable, remains limited in exploration efficiency.
4.3 Design Guideline and Key Trade-Off
When the action space contains both discrete and continuous dimensions, a single DRL algorithm is often suboptimal. Our A3TD framework uses a discrete-parallel-exploration-oriented A3C algorithm to efficiently sample diverse state-action combinations and break temporal correlations, and employs a continuous-control-oriented TD3 algorithm with double critics and delayed updates to achieve stable and precise policy learning. This "explorer + refiner" synergy is transferable to any problem where exploration breadth and optimization precision are both critical.
A3C’s multi-threaded asynchronous exploration significantly improves sample diversity but can introduce delayed and stale gradients, potentially destabilizing the shared global model. Our implementation mitigates this via experience replay buffering between A3C and TD3, but the fundamental tension remains: more exploration often comes at the cost of less stable learning. Researchers adopting this hybrid approach should carefully balance the number of parallel threads, the frequency of global updates, and the replay buffer size.
5 Conclusion
This paper has presented a deep reinforcement learning (DRL) framework for the joint optimization of multi-dimensional resources in RIS-aided cooperative NOMA device-to-device (CNOMA-D2D) networks, with the objective of maximizing the system sum rate. The core of this framework is a novel hybrid algorithm, A3TD, which seamlessly integrates the A3C and TD3 paradigms. The A3TD algorithm effectively addresses the high-dimensional, non-convex challenge of simultaneous D2D channel assignment, transmit power control, and RIS phase-shift configuration by reformulating it as an MDP. This approach leverages the strong function approximation capability of deep neural networks alongside the adaptive, experience-driven optimization of reinforcement learning, thereby circumventing the prohibitive computational complexity of traditional methods. Extensive simulation results validate the superior efficacy of the proposed framework, demonstrating significant outperformance over state-of-the-art baselines in terms of spectral efficiency, energy efficiency, and convergence speed.
Looking forward, while the proposed scheme shows great promise, its practical deployment deserves concrete consideration. For example, in smart factories, metallic infrastructure causes severe signal blockage, making direct links unreliable. Deploying low-cost RIS panels on ceilings or production lines creates virtual LoS paths to shadowed devices. D2D communication among AGVs and robots enables direct cooperative relaying. Our PIP-UP strategy pairs devices with aligned RIS phases and sufficient power disparity, ensuring robust SIC in fast-changing industrial channels. The A3TD algorithm then jointly optimizes transmit power, RIS phase shifts, and channel assignment without manual reconfiguration.
Future research will focus on enhancing the algorithm’s robustness and applicability by incorporating critical real-world constraints. Key directions include: (i) investigating the impact of user mobility on time-varying channel states and dynamic RIS reconfiguration; (ii) developing distributed optimization strategies for scenarios involving multiple cooperating RISs [30]; (iii) accounting for practical impairments such as imperfect channel state information (CSI) and hardware distortions in transceivers and RIS elements; and (iv) exploring joint network planning problems to determine the optimal number and strategic placement of RIS panels within the cellular infrastructure.
Acknowledgement: Not applicable.
Funding Statement: This research was funded by the National Natural Science Foundation of China, grant number 62362052.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Zongchuan Li and Chen Sun; methodology, Zongchuan Li and Chen Sun; software, Zongchuan Li; validation, Zongchuan Li; formal analysis, Zongchuan Li and Chen Sun; investigation, Zongchuan Li and Chen Sun; resources, Chen Sun and Jian Shu; data curation, Zongchuan Li; writing—original draft preparation, Zongchuan Li and Chen Sun; writing—review and editing, Chen Sun; visualization, Zongchuan Li; supervision, Chen Sun and Jian Shu; project administration, Jian Shu; funding acquisition, Jian Shu. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The data that support the findings of this study are available from the Corresponding Author, Chen Sun, upon reasonable request.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Zhao X, Liu F, Zhang YJ, Chen SC, Gan J. Energy-efficient power allocation for full-duplex device-to-device underlaying cellular networks with NOMA. Electronics. 2023;12(16):3433–47. doi:10.3390/electronics12163433. [Google Scholar] [CrossRef]
2. Nair RB, Kirthiga S. Ant colony optimization based user pairing, power allocation and relaying in a cooperative NOMA system. Wirel Pers Commun. 2025;140(3–4):905–43. doi:10.1007/s11277-025-11752-0. [Google Scholar] [CrossRef]
3. Chen YL, Ai B, Zhang HL, Niu Y, Song LY, Han Z, et al. Reconfigurable intelligent surface assisted device-to-device communications. IEEE Trans Wirel Commun. 2021;20(5):2792–804. [Google Scholar]
4. Jia J, Tian QZ, Du A, Chen J, Wang XW. DE-based resource allocation for D2D-assisted NOMA systems. Soft Comput. 2024;28(4):3071–82. doi:10.1007/s00500-023-09266-7. [Google Scholar] [CrossRef]
5. Khan MAA, Kaidi HM, Ahmad N, Ur Rehman M. Sum throughput maximization scheme for NOMA-enabled D2D groups using deep reinforcement learning in 5G and beyond networks. IEEE Sens J. 2023;23(13):15046–57. doi:10.1109/jsen.2023.3276799. [Google Scholar] [CrossRef]
6. Dinh P, Arfaoui MA, Sharafeddine S, Assi C, Ghrayeb A. Joint user pairing and power control for C-NOMA with full-duplex device-to-device relaying. IEEE Trans Wirel Commun. 2023;22(5):3103–15. doi:10.1109/globecom38437.2019.9013180. [Google Scholar] [CrossRef]
7. Gu XH, Zhang GA, Zhuo BT, Duan W, Wang J, Wen MW, et al. On the performance of cooperative NOMA downlink: a RIS-aided D2D perspective. IEEE Trans Cogn Commun Netw. 2023;9(6):1612–24.
8. Yang G, Liao YT, Liang YC, Tirkkonen O. Reconfigurable intelligent surface empowered underlaying device-to-device communication. In: Proceedings of the 2021 IEEE Wireless Communications and Networking Conference; 2021 Mar 29–Apr 1; Nanjing, China. Piscataway, NJ, USA: IEEE Press; 2021. p. 1–6.
9. Yang G, Liao YT, Liang YC, Tirkkonen O, Wang GP, Zhu X. Reconfigurable intelligent surface empowered device-to-device communication underlaying cellular networks. IEEE Trans Commun. 2021;69(11):7797–805.
10. Amer A, Hoteit S, Ben Othman J. Throughput maximization in multi-slice cooperative NOMA-based system with underlay D2D communications. Comput Commun. 2024;217(4):134–51. doi:10.1016/j.comcom.2024.01.030.
11. Vishnoi V, Budhiraja I, Gupta S, Kumar N. A deep reinforcement learning scheme for sum rate and fairness maximization among D2D pairs underlaying cellular network with NOMA. IEEE Trans Veh Technol. 2023;72(10):13506–22. doi:10.1109/tvt.2023.3276647.
12. Chandra KR, Borugadda S. Multi agent deep reinforcement learning with deep Q-network based energy efficiency and resource allocation in NOMA wireless systems. In: Proceedings of the 2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT); 2023 Apr 5–7; Trichirappalli, India. p. 1–8.
13. Ao SY, Niu Y, Han Z, Zhong ZD, Ai B, Wang N, et al. Resource allocation for RIS-assisted device-to-device communications in heterogeneous cellular networks. IEEE Trans Veh Technol. 2023;72(9):11748–55. doi:10.1109/tvt.2023.3267032.
14. Sultana A, Moniruzzaman M. Spectrum efficiency maximization of reconfigurable intelligent surface assisted device-to-device networks: an actor-critic approach. In: Proceedings of the 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC); 2023 Sep 5–8; Toronto, ON, Canada. p. 1–7.
15. Liu Y, Xu K, Xia XC, Xie W, Ma N, Xu JH. Joint power control and passive beamforming optimization in RIS-assisted anti-jamming communication. Front Inf Technol Electron Eng. 2024;25(4):537–50. doi:10.1631/fitee.2200646.
16. Wang YZ, Sun MY, Cui QM, Chen KC, Liao YX. RIS-aided proactive mobile network downlink interference suppression: a deep reinforcement learning approach. Sensors. 2023;23(14):6550. doi:10.3390/s23146550.
17. Dong GQ, Yang Z, Feng YH, Lyu B. Exploiting RIS-aided cooperative non-orthogonal multiple access with full-duplex relaying. IEICE Trans Fundam Electron Commun Comput Sci. 2023;106(7):1014–8. doi:10.1587/transfun.2022eal2067.
18. Liu YK, Chen W, Tang HY, Wang KL. Resource allocation in the RIS assisted SCMA cellular network coexisting with D2D communications. IEEE Access. 2023;11:39978–89. doi:10.1109/access.2023.3269284.
19. Zhai Q, Dong LM, Liu CX, Li Y, Cheng W. Resource management for active RIS aided multi-cluster SWIPT cooperative NOMA networks. IEEE Trans Netw Serv Manag. 2024;21(4):4421–33. doi:10.1109/tnsm.2024.3395298.
20. Saikia P, Singh K, Taghizadeh O, Huang WJ, Biswas S. Meta reinforcement learning-based spectrum sharing between RIS-assisted cellular communications and MIMO radar. IEEE Trans Cogn Commun Netw. 2024;10(1):168–79. doi:10.1109/tccn.2023.3319543.
21. Liu Y, Li Y, Li L, He M. NOMA resource allocation method based on prioritized dueling DQN-DDPG network. Symmetry. 2023;15(6):1170. doi:10.3390/sym15061170.
22. Ji Z, Qin Z, Parini CG. Reconfigurable intelligent surface aided cellular networks with device-to-device users. IEEE Trans Commun. 2022;70(3):1808–19. doi:10.1109/tcomm.2022.3145570.
23. Mnih V, Puigdomenech Badia A, Mirza M, Graves A, Lillicrap T, Harley T, et al. Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning; 2016 Jun 19–24; New York, NY, USA. p. 1928–37.
24. 3GPP. TR 38.901: study on channel model for frequencies from 0.5 to 100 GHz (Release 17). [cited 2025 Dec 1]. Available from: https://www.3gpp.org/ftp/Specs/archive/38_series/38.901/38901-h10.zip.
25. Wang X, Zhang H, Long K. Power control based on DRL algorithm for D2D-enabled networks. In: Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM); 2021 Dec 7–11; Madrid, Spain. p. 1–5.
26. Liu XY, Xu JX, Zheng KC, Zhang GL, Liu J, Shiratori N. Throughput maximization with an AoI constraint in energy harvesting D2D-enabled cellular networks: an MSRA-TD3 approach. IEEE Trans Wirel Commun. 2025;24(2):1448–66. doi:10.1109/twc.2024.3509475.
27. Guo L, Jia J, Chen J, Du A, Wang X. Deep reinforcement learning empowered joint mode selection and resource allocation for RIS-aided D2D communications. Neural Comput Appl. 2023;35(25 Suppl):18231–49. doi:10.1007/s00521-023-08745-0.
28. 3GPP. TS 38.104: NR; base station radio transmission and reception. [cited 2026 Jan 1]. Available from: https://www.3gpp.org/ftp/Specs/archive/38_series/38.104/38104-j30.zip.
29. 3GPP. TS 38.101: NR; user equipment (UE) radio transmission and reception; part 1: range 1 standalone. [cited 2025 Dec 1]. Available from: https://www.etsi.org/deliver/etsi_ts/138100_138199/13810101/19.03.01_60/ts_13810101v190301p.pdf.
30. Zhang S, Tong X, Chi K, Gao W, Chen X, Shi Z. Stackelberg game-based multi-agent algorithm for resource allocation and task offloading in MEC-enabled C-ITS. IEEE Trans Intell Transp Syst. 2025;26(10):17940–51. doi:10.1109/tits.2025.3553487.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.