A Multi-Objective Adaptive Car-Following Framework for Autonomous Connected Vehicles with Deep Reinforcement Learning

Abu Tayab; Yanwen Li; Ahmad Syed; Ghanshyam Tejani; Doaa Khafaga; El-Sayed El-kenawy; Amel Alhussan; Marwa Eid

doi:10.32604/cmc.2025.070583

icon Open Access

ARTICLE

A Multi-Objective Adaptive Car-Following Framework for Autonomous Connected Vehicles with Deep Reinforcement Learning

Abu Tayab^1,*, Yanwen Li¹, Ahmad Syed², Ghanshyam G. Tejani^3,4,*, Doaa Sami Khafaga⁵, El-Sayed M. El-kenawy⁶, Amel Ali Alhussan⁷, Marwa M. Eid^8,9

1 Department of Mechanical Engineering, Yanshan University, Qinhuangdao, 066004, China
2 Department of Electrical Engineering, Yanshan University, Qinhuangdao, 066004, China
3 Department of Research Analytics, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, 600077, India
4 Applied Science Research Center, Applied Science Private University, Amman, 11937, Jordan
5 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
6 Department of Programming, School of Information and Communications Technology (ICT), Bahrain Polytechnic, Isa Town, P.O. Box 33349, Bahrain
7 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
8 Faculty of Artificial Intelligence, Delta University for Science and Technology, Mansoura, 11152, Egypt
9 Department Jadara Research Center, Jadara University, Irbid, 21110, Jordan

* Corresponding Authors: Abu Tayab. Email: email ; Ghanshyam G. Tejani. Email: email

(This article belongs to the Special Issue: Advances in Vehicular Ad-Hoc Networks (VANETs) for Intelligent Transportation Systems)

Computers, Materials & Continua 2026, 86(2), 1-27. https://doi.org/10.32604/cmc.2025.070583

Received 19 July 2025; Accepted 29 September 2025; Issue published 09 December 2025

Abstract

Autonomous connected vehicles (ACV) involve advanced control strategies to effectively balance safety, efficiency, energy consumption, and passenger comfort. This research introduces a deep reinforcement learning (DRL)-based car-following (CF) framework employing the Deep Deterministic Policy Gradient (DDPG) algorithm, which integrates a multi-objective reward function that balances the four goals while maintaining safe policy learning. Utilizing real-world driving data from the highD dataset, the proposed model learns adaptive speed control policies suitable for dynamic traffic scenarios. The performance of the DRL-based model is evaluated against a traditional model predictive control-adaptive cruise control (MPC-ACC) controller. Results show that the DRL model significantly enhances safety, achieving zero collisions and a higher average time-to-collision (TTC) of 8.45 s, compared to 5.67 s for MPC and 6.12 s for human drivers. For efficiency, the model demonstrates 89.2% headway compliance and maintains speed tracking errors below 1.2 m/s in 90% of cases. In terms of energy optimization, the proposed approach reduces fuel consumption by 5.4% relative to MPC. Additionally, it enhances passenger comfort by lowering jerk values by 65%, achieving 0.12 m/s³ vs. 0.34 m/s³ for human drivers. A multi-objective reward function is integrated to ensure stable policy convergence while simultaneously balancing the four key performance metrics. Moreover, the findings underscore the potential of DRL in advancing autonomous vehicle control, offering a robust and sustainable solution for safer, more efficient, and more comfortable transportation systems.

Keywords

Car-following model; DDPG; multi-objective framework; autonomous connected vehicles

1 Introduction

The Car-following (CF) model is one of the most corporate driving scenarios, where the primary goal is to regulate vehicle speed to continue a safe and smooth next distance. Implementing velocity control in autonomous CF can help reduce driver workload, improve traffic safety, and enhance road capacity [1]. The Driver models play a vital role in velocity control systems [2]. Typically, models that focus on CF behavior are developed using two primary methods: the rule-based process and supervised learning method [3]. The rule-based method mainly encompasses old CF methods, including Gazis-Herman-Rothery (GHR) and intelligent driver models (IDM) [4]. Conversely, the supervised learning method uses data from human protests to estimate the connection between CF shapes and vehicle acceleration responses.

Both of these methods aim to replicate the CF behavior of human drivers. However, simply mimicking human driving may not be an effective method for autonomous vehicles. Initially, operators might prefer that autonomous cars do not replicate their driving styles [5]. Moreover, driving should be improved for safety, efficiency, energy efficiency, and comfort; rather than just replicating human drivers, it may not always make the best driving decisions [6]. Furthermore, optimizing energy efficiency is a critical consideration for autonomous vehicles, as efficient energy use reduces operational costs and supports sustainability goals. This paper will explore how energy-efficient driving strategies can be integrated into autonomous vehicle control systems, aligning with the broader objectives of improved safety, comfort, and environmental sustainability. Growing global emphasis on energy efficiency and the decrease of greenhouse gas emissions has propelled the investigation of strategies to enhance the energy use of Autonomous connected vehicles (ACV) to the forefront of transportation research. As ACVs continue to be deployed more widely, their exceptional ability to interconnect with other vehicles, infrastructure, and integrated servers creates numerous opportunities to improve traffic flow and significantly reduce energy consumption [7]. Optimization of traffic flow has been a focus of extensive ACV research for decades [8]. This endeavor has delivered considerable rewards, promising major improvements in road safety and energy efficiency, and addressing the increasing mobbing issues on roadways. The emergence of ACV alongside the swift growth of wireless communication technologies has opened up new possibilities for traffic flow optimization.

To address these issues, this research introduces a CF model for autonomous speed control that employs Reinforcement learning (RL). This system enhances driving safety, efficiency, comfort, and energy efficiency by learning through connections within a simulated setting. By optimizing energy use alongside other critical parameters, the model not only improves the overall driving experience but also aligns with sustainability objectives, making it suitable for modern autonomous vehicle systems. Specifically, it utilizes the deep deterministic policy gradient (DDPG) algorithm [9], which excels in continuous control tasks, to develop actor and critic networks. The actor network generates policies by determining the required accelerations of the following vehicle based on its speed, relative speed, and distance from preceding vehicles. Meanwhile, the critic network focuses on refining the policy by adjusting the parameters of the actor’s policy to enhance enactment. A reward function is created by drawing from human driving data and integrating key driving factors associated with safety, efficiency, energy efficiency, and comfort. The suggested RL model includes collision avoidance approach for safety assessments to mitigate the risk of unsafe maneuvers. This safety check is implemented throughout the training and testing stages, leading to quicker convergence and eliminating impacts.

Despite the promising performance of RL in CF tasks, existing RL-based CF models exhibit several key limitations. Many focus on limited performance metrics such as maintaining time headway or minimizing velocity error, while overlooking critical factors like energy efficiency and driving comfort. Moreover, most studies are trained on synthetic or simulated datasets, limiting their ability to generalize to real-world traffic dynamics. Additionally, safety constraints, such as collision avoidance, are often not integrated directly into the learning process, leading to unstable or unsafe behavior during training and deployment. To assess the suggested model, real-world driving data acquired from highD are utilized for both training and testing the model. The model is evaluated alongside empirical highD data and an adaptive cruise control (ACC) algorithm that employs model predictive control (MPC), highlighting the model’s capability to safely, efficiently, and comfortably follow the lead vehicle.

The key objectives of this study are as follows:

• Implementing RL on real-world highD driving data to regulate autonomous vehicle speed and designing multi-objective planning framework for autonomous driving using RL.

• Developing reward function that balances safety, efficiency, energy efficiency, and driving comfort, ensuring stable junction.

• Combining RL with kinematic-based collision escaping strategy for safety evaluation results in faster junction and collision-free performance.

2 Literature Review

To enhance passenger comfort in an ACV, it is essential to replicate human driving performance and integrate it into the design of an ACV controller, particularly in a mixed-traffic environment through CFmodel. These systems have previously been utilized in traffic micro-simulators to imitate human driving. They aim to emulate the longitudinal driving actions of the driver of a subject vehicle while it follows a car directly ahead on the road [10–12]. Research on driver behavior models for vehicle following has been ongoing since 1950s, resulting in numerous systems [13]. These include GHR model, linear models, fuzzy-logic models, collision avoidance models, Meta models, optimum velocity models, IDM, Gipps’ model, and psychophysical systems. A driver model need to prioritize passenger comfort for effective longitudinal control in ACV while maintaining operational efficiency across various traffic conditions [14].

2.1 Adaptive Cruise Control

This system [15] builds on the traditional cruise control (CC) system, designed to sustain a steady vehicle speed, and is closely related to speed management in this research. Maintaining a constant speed in congested traffic is less effective; thus, ACC improves CC by repeatedly modifying the vehicle’s speed to keep a safe distance from the front vehicle [16]. Various sensors, including radar, facilitate this adjustment, which monitors comparative distance and speed between vehicles. MPC is one of the most commonly used methodologies for developing ACC algorithms. At every time step, MPC causes an arrangement of control inputs by resolving an optimization problem over a finite time horizon; however, the first control input in the arrangement is implemented. This cycle continues at each subsequent time step with updated measurements. A significant benefit of MPC is its capacity to manage constraints related to control inputs [17] and system states.

Researchers [18] introduced an MPC system designed for real-world CF scenarios, utilizing a linear and incessant model of car-following. This MPC aims to regulate the FV’s acceleration to maintain a safe relative distance between two cars. Simulation outcomes indicated that the MPC controller exhibited safer behavior compared to actual drivers. This paper [18] created an efficient MPC-based ACC algorithm characterized by low computational demands, making it suitable for embedded microprocessors. They employed a low-order prediction model to reduce the computational burden, and results revealed that their algorithm provided high responsiveness with minimal discomfort. Additionally, Ref. [19] developed an MPC-based ACC system focusing on improving tracking capability, fuel efficiency, and aligning with driver preferences. They utilized a quadratic cost function to address tracking errors, fuel consumption, and adherence to driver appearances. Their simulations demonstrated that the ACC system offered notable advantages in fuel economy, tracking performance, and meeting the desired characteristics for car following. Lastly, Ref. [19] proposed learning-based approach to autonomous CF speed control, beginning with the creation of a driver model to simulate drivers’ CF behaviors. The outputs of the driver model, specifically vehicle acceleration values, served as a reference for the MPC controller. By addressing a constrained optimization problem, the MPC controller ensures that the vehicle adheres to the model’s behavior while also meeting certain safety standards.

2.2 Reinforcement Learning

RL is a method for optimizing difficulties related to successive executives by allowing an RL agent to interact with its environment. At each time step t, the agent detects current state st and selects an action from a defined action space A according to policy π(at|st), which correlates the state st with action t. Simultaneously, the system provides agent with a reward rt and transitions to the subsequent state st+1. This cycle continues until a terminal state is achieved, at which point agent begins anew. The agent aims to exploit promotional cumulative reward: Rt=∑k=0∞γkrt+k, utilizing a discount factor. Generally, RL methods fall into value-based and policy-based categories [20].

A value function evaluates the quality of a state or state-action pair. The action value, denoted as Qπ(s,a)=E[Rt∣st=s,at=a], reflects estimated return achieved by choosing action a in-state s and then adhering to policy π. It indicates how beneficial it is to act in state s. Value-based RL techniques aim to derive the action value function from past experiences. Q-learning exemplifies a typical value-based RL approach. Starting with an initial random Q-function, agent continuously updates Q-values according to the Bellman equation [21].

Q(s,a)=E[r+γmaxa′Q(s′,a′)](1)

The fundamental idea is that the maximum expected reward for a specific state s and action is derived from the immediate reward r combined with the highest anticipated future reward from subsequent states. By utilizing estimated Q-values, the optimal action is determined by selecting one with the highest value, Q(s, a), to maximize expected future rewards. Instead of calculating Q(s, a) for every possible state-action pair, deep Q-learning utilizes neural networks as function approximators to assess the action-value function [22]. While deep Q networks (DQN) are effective in discrete action spaces, they encounter difficulties in continuous action environments, such as the one we are addressing. To tackle this challenge, an algorithm known as DDPG was developed. This algorithm [9] integrates an actor-critic methodology into DQN, allowing it to manage continuous control problems. It employs two separate networks to independently approximate actor and critic. The critic network, represented by weights θQ, is responsible for estimating the action-value function Q(s, a∣θQ) [9]. In contrast, the actor network, defined by weights θμ, is responsible for explicitly formulating agent’s policy μ(s∣θμ). Consistent with the DQN approach, DDPG incorporates experience replay and a target network to ensure stable and effective learning. Algorithm 1 presents complete DDPG framework.

images

2.3 DRL-Based Car-Following Models

In recent years, DRL has gained increasing attention for CF control in autonomous vehicles, due to its capability to learn adaptive driving policies from high-dimensional state-action spaces. Several studies have explored the use of DRL for autonomous CF, focusing on different aspects such as safety, efficiency, and ride comfort. The researchers [23] proposed a DRL-based velocity control model using proximal policy optimization (PPO), aiming to achieve safe and smooth driving by optimizing time headway and jerk. Their model demonstrated promising results in simulation; however, it lacked real-world data validation. This paper [24] introduced FollowNet, a comprehensive DRL benchmark for CF behavior modeling using deep DQN and policy gradient methods. Although their work presents a strong foundation, it does not integrate energy efficiency or collision-avoidance constraints explicitly into learning process.

Other studies have utilized actor-critic algorithms such as Asynchronous Advantage Actor-Critic (A3C) or PPO [25] with reward functions focusing on limited objectives like maintaining spacing or minimizing velocity deviation. DRL models often overlook holistic metrics such as fuel consumption or safety overrides, which are crucial for real-world deployment [26,27]. Moreover, most prior research in DRL-based CF relies on simulated dataset environments, which may not capture complexity of real-world traffic dynamics, Table 1 presents a comparative summary of existing DRL. In contrast, our suggested framework is trained and validated on the highD dataset, a naturalistic driving dataset, to ensure practical applicability and robust performance. To address these limitations, our study contributes multi-objective reward function that explicitly balances four core driving criteria: safety, efficiency, energy consumption, and ride comfort. Additionally, we incorporate a kinematics-based collision-avoidance mechanism during training to penalize unsafe states, enhancing safety and convergence stability. While we adopt the standard DDPG algorithm, the system integration, safety filters, and real-world benchmarking against human and MPC performance significantly advance current DRL-based CF frameworks.

images

3 Data Preparation

To evaluate the effectiveness of suggested Ensemble Follower framework, we drew on vehicle trajectory data from the HighD dataset [28]. This comprehensive driving dataset was provided by Institute of Automotive Engineering at RWTH Aachen University in Germany. It contains accurate information regarding vehicle locations and speeds, collected from aerial bird’s-eye view footage of six distinct roadways in the Cologne area, captured using a high-resolution 4K camera mounted on a drone in Fig. 1. Utilizing advanced computer vision techniques, the dataset typically maintains a positional accuracy of less than 10 cm, with Bayesian smoothing applied to reduce noise and refine motion data. The collection includes over 110,500 vehicles documented from six unique locations, allowing for automatic extraction of vehicle trajectories, dimensions, types, and maneuvers. Though it was primarily designed for the validation of highly automated car safety, the dataset also proves valuable for other applications, such as analyzing traffic patterns and parameterizing driver models. Researchers and automobile manufacturers extensively utilize it in the development of self-driving technologies and in enhancing the safety of autonomous vehicles.

images

Figure 1: The highD dataset serves as a resource for both training and evaluation purposes

To gather CF events for the driving database, a specific filter was used. This filter pinpointed instances where lead vehicles (LV) and follow vehicles (FV) stayed in the same lane for minimum of 15 s, guaranteeing that CF behavior was sustained adequately for analysis. Additionally, any instances of low speed or stoppage exceeding 5 s were excluded to confirm that the detailed actions were expressive and equivalent. This study makes use of a total of 12,541 CF events recorded at a frequency of 25 Hz as provided by [24].

Using the driving database, we generated a simulator to engage with CF models in line with the jerk-constrained kinematic model. Once CFevent concludes in the simulation, we refresh the environmental state by introducing another event from the database. The highD dataset is divided into three subsets: training, validation, and testing, which account for 70%, 15%, and 15% of the total data, respectively. To develop low-level models, we calibrated or trained them with collision checks using the training set. Subsequently, we trained our RL-based hierarchical models utilizing the training set and developed low-level models. Lastly, we assessed the model candidates on the confirmation set to identify the highest-performing model.

4 Characteristics of the Reward Function

4.1 Safety

In rapidly changing traffic conditions, ensuring safety is of utmost importance. The time to collision (TTC) is frequently employed to evaluate the risk of a rear-end accident in real time [29]. The TTC for a trailing AV is defined as:

TTC(t)={Sn−1,n(t)Vn−1,n(t), if Vn(t)<Vn−1(t)∞, if Vn(t)≥Vn−1(t)(2)

where t represents time; n − 1 and n signify the lead and trailing vehicles, correspondingly; and n, n − 1 collectively indicate variables associated with both the lead and trailing vehicles: sn−1, n refers to the clearance distance, while Vn−1(t) represents the velocity of LV.

In particular, a low TTC value indicates a heightened risk of a traffic accident. It’s important to establish a TTC threshold to identify hazardous behaviors. Various studies recommend a threshold range between 1.5 and 5 s. Drawing from the findings of [29], TTC threshold of 4 s is suggested to achieve optimal performance. An agent should face penalties if TTC falls between greater than 0 s and less than 4 s. The TTC characteristic Rst is defined as:

Rst={−10, if 0≤TTC(t)≤40, otherwise (3)

While Rst can penalize actions that may be unsafe, the TTC values are also connected to the distance for clearance and the relative speed. Not having enough room for emergency braking poses a risk as well. At the same time, the trailing autonomous vehicle needs some time to assess risks, make decisions, and execute braking. In addition to TTC, predictive measures such as Potential Accident Risk estimation based on clearance distance, relative velocity, and deceleration capability can be incorporated to proactively identify collision-prone scenarios. Such measures complement TTC by providing earlier safety feedback to the RL agent.

4.2 Efficiency

Efficient driving involves sustaining minimal headway duration. Time headway is defined as the duration that passes between the arrival of LV and FV at a specific location. Maintaining time headway within acceptable parameters enhances road capacity. Then, the suggested time headway varies concerning states, this study utilizes vehicle trajectory data from the highD dataset. The highD dataset affords comprehensive trajectory data composed from freeway traffic in Germany, making it highly suitable for analyzing CF behavior.

A lognormal distribution was applied to model the mined CF procedures from the highD dataset [28]. Driving efficiency rewards are assigned based on the probability density function of this distribution. The agent is granted positive reward when the time headway is within the preferred range, indicating an ideal following distance. Conversely, if the time headway is excessively large or small, the reward diminishes toward zero. The time headway feature Fth is represented as follows:

Fth=1hσ2π(1h−μ)212σ2|(4)

where h represents the time interval between vehicles.

4.3 Energy Efficiency

The optimization of energy efficiency in autonomous driving is critical for minimizing fuel or energy consumption, especially in electric vehicles, where reducing energy usage directly impacts on vehicle’s operational range and environmental sustainability. Energy efficiency can be defined as the ability of the vehicle to achieve safe, efficient, and comfortable driving with the least possible energy expenditure. Fuel consumption is typically influenced by the vehicle’s speed, acceleration, and driving behavior. To model fuel consumption, a relationship between acceleration, speed, and fuel consumption is proposed as follows:

Ffuel=α.v(t).a(t)+β.[v(t)]2+γ(5)

where α, β, and γ are parameters calibrated based on the vehicle’s specific features, such as engine efficiency, weight, and aerodynamics. The first term, α.v(t).a(t), captures energy consumed during acceleration or deceleration, while the second term, [βv(t)]2, accounts for energy lost due to aerodynamic drag. The constant term γ represents additional energy expenditures for onboard systems. The fuel consumption feature is then scaled to incentivize energy-efficient driving behaviors while ensuring safety and comfort.

Energy efficiency in autonomous driving focuses on minimizing energy consumption while maintaining safe and comfortable driving. To measure energy efficiency, a feature was constructed as the inverse of total energy consumed by FV during CF event. The feature is formulated as:

Fefficiency=11+Econsumed(6)

where Econsumed is the total energy consumption of the vehicle in kilowatt-hours (kWh) or equivalent units. The feature is scaled to the range [0, 1], where higher values indicate better energy efficiency. This formulation incentivizes the RL agent to adopt driving strategies that minimize energy consumption while maintaining the necessary safety and comfort. By reducing energy consumption, the proposed model not only helps optimize operational costs but also contributes to the sustainability of autonomous driving methods, making it a crucial element in the development of energy-efficient, eco-friendly autonomous vehicles. The proposed energy efficiency model, Eqs. (5) and (6) are calibrated using parameters corresponding to a gasoline-powered vehicle, based on the dataset and driving conditions considered in this study. While the formulation itself can be adapted to other vehicle types by adjusting the coefficients, the present results and comparisons specifically reflect the performance of a conventional internal combustion engine vehicle.

4.4 Comfort

Jerk, defined as the rate of change of acceleration, is important for driving comfort. Large jerk values can cause discomfort for passengers [29]. In this study, jerk is quantified by defining a jerk feature Fjerk, which is based on the square of the jerk value, normalized by a scale factor to keep it within a range of [0, 1]. The jerk feature is expressed as:

Fjerk=jerk2scalefactor(7)

The scale factor of 3600 is derived from several key considerations: (1) The example interval used in the dataset is 0.1 s. (2) The acceleration values observed through CF actions are bounded within the range of −3 to 3 m/s2. (3) The maximum jerk occurs when the acceleration changes by 3 m/s2 over seconds, resulting in a jerk of 60 m/s3. Squaring this value (602) gives 3600, which is used as the normalization constant.

Jerk is a widely used and effective metric for passenger comfort, as it directly reflects abrupt acceleration changes impacting ride smoothness. Although factors like cabin vibration also affect comfort, jerk provides practical and quantifiable measure within the scope of this study, given available data and focus on vehicle dynamics. By applying this normalization, the jerk feature is scaled so that smaller jerk values correspond to smoother driving behavior, which directly enhances passenger comfort. This approach ensures that the jerk feature remains within standardized range, allowing the reinforcement learning agent to learn driving strategies that minimize abrupt accelerations and optimize the overall comfort of the driving experience. Thus, incorporating jerk as a feature in autonomous driving systems helps to ensure that vehicles maintain comfortable and passenger-friendly ride.

5 Suggested Method

5.1 Condition and Response

The specific time step t, the status of CFmethod is defined by the following parameters: the speed of the FV speed vn(t), clearance distance dn(t), and relative speed Δvn(t). This process includes the longitudinal acceleration of the FV is an(t). Based on the current state and action at time step t, the subsequent state is simplified utilizing a kinematic point-mass model:

vn(t+1)=vn(t)+an(t)⋅ΔTdn−1,n(t+1)=dn−1(t)+vn(t)⋅ΔT+12an(t)⋅(ΔT)2Δvn−1,n(t+1)=Δvn(t)+an(t)⋅ΔT(8)

here, ΔT denotes the simulation time interval, designated as 0.1, and vn refers to the speed of the LV, which was supplied as an external input.

5.2 Configuration of the Simulation

To allow the RL agent to acquire via trial and error, a straightforward CF simulation setting was created. This model features two agents: the LV and FV. The LV adheres to data collected empirically, while the behavior of the FV is managed by the RL agent. The simulation begins with empirically sourced values for the speed of the FV, the spacing, and the differences in relative velocity: vn(t=0)=vdata,n(t=0),dn−1,n(t=0)=ddata,n(t=0),Δvn(t=0)=Δvdata,n(t=0). By individually time step, created on the agent’s computed acceleration, the simulation updates the FV’s state, which includes the future FV velocity, relative speed, and spacing. This update process is governed by the dynamics of the CF system, which are reflected in the model described by Eq. (8).

To assess the RL agent’s performance, a reward function is provided at individually time step. The reward is designed based on factors such as time headway, TTC, and jerk. A positive reward is given when the time headway is within acceptable bounds, signaling efficient and safe CF behavior. If the time headway becomes too large or too small, the reward is reduced or brought close to zero, indicating inefficient or unsafe driving. When CFepisode extends its conclusion, the simulation state is adjusted with realistic data from the next affair. To prevent sequence dependencies and ensure that the RL agent learns from diverse scenarios, the CF events are randomly shuffled before each training episode. This shuffling helps the RL agent generalize its learning across different traffic conditions and improve its adaptability to varying driving patterns.

5.3 Function of Reward

The reward function, r(s, a), acts as a preparation sign that promotes or inhibits certain actions related to a specific task. In the case of car following, a reward function was created using a linear arrangement of the structures developed in the following equation:

r=w1Rst+w2F th+w3Ffuel+w4Fjerk (9)

where w1, w3, w3, and w4 represent weights. These weights are employed to modify the reward values so that they are in a comparable range.

To certify an optimal balance between safety, efficiency, energy consumption, and passenger comfort, the weighting factors in the reward function were selected through an iterative tuning process. The initial values were informed by [28], followed by multiple training iterations where adjustments were made based on model performance in various driving scenarios. A compassion analysis was conducted by systematically varying each weighting factor while maintaining the others constant, observing the resulting changes in key performance indicators such as TTC, headway compliance, fuel consumption, and jerk values. The analysis revealed that increasing w1, which prioritizes safety, led to longer TTC values but slightly increased jerk levels, indicating a trade-off between cautious driving and passenger comfort. Similarly, a higher w2 value resulted in improved headway compliance, ensuring more consistent following distances, but had minimal impact on fuel efficiency. Adjusting w3, which focuses on energy efficiency, demonstrated a direct effect on fuel consumption reduction, confirming its role in optimizing energy usage. Meanwhile, increasing w4, which governs driving comfort, resulted in smoother acceleration profiles, reducing jerk while maintaining stable tracking behavior. Table 2 presents the selected weights and their rationale.

images

Overall, these findings confirm that while variations in the weighting factors affect individual performance metrics, the policy remains robust, demonstrating stable behavior across different weight configurations. This robustness ensures that the learned policy effectively balances multiple objectives, reinforcing the reliability of the suggested reward function.

5.4 DDPG-Based Velocity Control Framework

The actor and critic were each denoted by a neural network model. The actor-network receives the state at time step t as its input. st=(vn(t),Δvn(t),dn(t)). Its output is the acceleration of FV, denoted as an (t). The critic network proceeds state-action pairs (st, at) as input. Its output is a single scalar Q-value Q (st, at). Fig. 2 displays the structures of the actor and critic networks. The actor network’s output layer utilized tanh activation function. This function transforms real-valued numbers into a range of [−1, 1], effectively constraining the resulting accelerations to lie between −3 and 3 m/s2. The network parameters were adjusted using the optimization algorithm developed. The Actor network consists of two fully connected hidden layers, each with 256 neurons, both using Rectified Linear Unit (ReLU) activation functions. The output layer employs a tanh activation function to produce bounded continuous actions. The Critic network also contains two hidden layers: the first layer, 256 neurons processes the state input, and the action input is concatenated before the second hidden layer, 128 neurons. The output layer is a single neuron with a linear activation function, representing the estimated Q-value. These structures, including the number of layers, neurons, and activation functions, are now provided in detail to enhance the clarity and reproducibility of the suggested model. The parameters adopted during training, are presented in Table 3, were finalized following iterative tuning and validation to balance convergence speed, training stability, and control performance across multiple runs. The hyperparameters were selected through a combination of heuristic tuning and insights from prior studies in RL-based control systems. Initial values were based on common practices in DDPG implementations, and iterative testing was conducted to observe their effect on training stability and policy convergence. Key parameters such as learning rates, batch size, and exploration noise were adjusted using a manual grid search approach to balance exploration and exploitation, while ensuring convergence without instability. The selected configuration demonstrated the most consistent performance during pre-experiments. The primary critic network improves its performance by reducing the loss function L:

L=1N∑i(ri+γQ′(si+1,μ′(si+1|θμ′)|θQ′)−Q(si,ai|θQ))2(10)

images

Figure 2: The structures of the actor and critic systems

images

The actor-network was modified based on the gradient, as denoted:

∇θμJ=1N∑i∇aQ(s,a|θQ)|s=si,a=μ(si)∇θμμ(s|θμ)|si(11)

An assessment policy was developed by incorporating noise derived from a noise method into the creative actor policy. In line with the recommendations, with parameters θ = 0.15 and σ = 0.2. This process simulates the velocity of Brownian element subjected to resistance, resulting in temporarily connected values that are aligned at zero. The connected noise allows the agent to reconnoiter a physical setting that exhibits momentum effectively.

While the safety rewards will impose penalties on instances with low TTC values, there’s still a likelihood that the agent could engage in unsafe behavior resulting in collisions, even after reaching conjunction. Such potential collisions are not permissible in safety-critical fields such as autonomous driving. The process classifies the state as unsafe when the distance between the LV and FV is below a designated safe distance dsafe threshold, as described by the next equation:

dsafe=vntr+vn22amax−vn−122amax(12)

In this context, tr refers to the reaction time of the FV, which is defined as 1 s in this research, while amax represents the presumed maximum absolute deceleration rate of −3 m/s2. The core principle of this stopping space algorithm is that if the vehicle maintains a next distance exceeding the safe threshold, it should be able to prevent collisions in the event of a sudden full stop by the LV. During the training and evaluation of the DDPG system, the RL framework was combined with a collision avoidance algorithm as below:

an(t)={−3,if dn−1,n(t)<dsDDPG model outputotherwise(13)

It is important to note that the hardcoded safety condition defined in Eq. (13) functions as an emergency override rather than a replacement for the RL policy. This rule is only triggered when the inter-vehicle distance violates the computed safety threshold, primarily during early training phases when the policy may not have fully converged. It ensures safe exploration without hindering the learning process, allowing the agent to learn from unsafe actions while maintaining safety within the simulation environment. To ensure safe learning, a kinematics-based Eq. (13) was embedded into the simulation loop. When the clearance distance between the following and LV dropped below a computed safety threshold, the model applied a hard-coded deceleration of −3 m/s2. This mechanism served as a backup to prevent collisions during unsafe states, particularly in the early training phase before policy convergence. Essentially, this rule was designed to act as an emergency safety override and not interfere with the policy learning process, ensuring that the RL agent could explore while staying within safety limits.

5.5 Training the DDPG-Based Model for Velocity Regulation

Out of the 12541 CF measures extracted, 70% (8779) were allocated for training, while the remaining 30% were set for testing. During the phase of training, the agent of RL systematically processes the training dataset, pretends in the CF events, which have been randomly shuffled. This episodic reset strategy ensures that the agent is exposed to a wide variety of driving patterns and traffic contexts, thereby improving its ability to generalize across diverse CF scenarios. Although this approach breaks temporal continuity, it is particularly effective in our context, where responsive short-horizon decision-making is more critical than long-term sequence prediction. The randomized resets thus help the agent learn robust control policies across the distribution of CF behaviors rather than memorizing specific trajectories. Specifically, once one CF event concludes, from the 8779 training events, a new event is casually chosen, and the agent’s formal is reset using the observed data from this new event. This training process was carried out over 2000 episodes, with each event signifying CF event in this research.

Fig. 3 shows how the rolling mean episode reward changes with training episodes. This reward is computed by averaging the rewards gathered across all time steps (sampled every 0.1 s) during CFevent. Within a 100-episode window of the rolling mean episode reward represents the average of these mean rewards. Numerous training runs were accompanied, with the solid colored lines showing the average of these runs and the shaded regions indicating the mean ± standard deviation range. To provide context for the reward values, the average enactments of human drivers are also included in the graph and the MPC-ACC algorithm, calculated across all training episodes. The figure demonstrated that the DDPG model employing the collision avoidance strategy begins to stabilize around the 250th training episode, achieving faster convergence than the model without this strategy. Upon convergence, the agent attains a reward value of approximately 0.64. This outcome is reached by opting for actions that reduce the TTC and jerk while enhancing the headway feature to about 0.65. After completing 2000 training episodes, the model demonstrated reliable performance following convergence, guaranteeing stability and effective learning throughout the designated training duration.

images

Figure 3: Rolling mean episode reward across training episodes: comparison with human and MPC performance

5.6 ACC Baseline Derived from MPC

MPC is the most widely used method for speed control aimed at achieving multiple objectives in CF behavior, which include safety, efficiency, energy efficiency, and comfort. At each time step, MPC addresses an optimum control problem from an estimate perspective and produces a succession of accelerations. The first acceleration in this succession is then executed. This optimization procedure continues till the cessation situations are met. Because the MPC-centered speed controller can accommodate restraints and implement predictive control, it serves as a benchmark for evaluating performance against the DDPG model. In this calculation, the utilized kinematic point-mass model is referenced, but it was denoted in matrix formula as follows:

x(t+1)=Ax(t)+Bu(t)(14)

here, t is the sampler time step (sampling interval = 0.1 s),

x(t)=(Sn−1,n(t),ΔVn−1,n(t),Vn(t))T,u(t)=an(t),(15)

A=[1ΔT0010001](16)

and

B=[−0.5ΔT2−ΔTΔT](17)

The baseline for the MPC-ACC is established by adjusting the issue of safe, efficient, and comfortable speed control within specific constraints. In comparison, the objective function and constraints will be linked to the DDPG. Here in this research, we adhere to the MPC-based ACC as outlined in [29]. To ensure safety and efficiency, FV maintains a desired distance S~n−1,n from the vehicle ahead and a minimal relative speed ΔVn−1,n. Consequently, a constrained MPC formulation is established as follows:

∑t=0N−1[(Sn−1,n(t)−S~n−1,n(t)Smax)2+(ΔVn−1,n(t)ΔVmax)2+(V(t)−Vp0(t)ΔVe)2+(j(t)jmax)2+(a(t)α)2](18)

s.t.x(t+1)=Ax(t)+Bu(t)(19)

0<V(t)<Vmax(20)

−3 m/s2≤an≤3 m/s2(21)

where N represents the prediction horizon (with N set to 20 in this research) Smax, ΔVmax, and jerkmax are fixed values used to standardize various types of tracking errors Smax=15m,ΔVmax=8m/s, and jerkmax=60m/s3 and a=[a(0),a(1),…,a(N−1)] the action sequence needs to be determined. Once the optimum action sequence is acquired, the first action, a(0), will executed, and the cycle continues in the following time step. Finally, Vmax = 33.3 m/s represents highway speed limits. These parameter values were selected based on domain knowledge, characteristics observed in the highD dataset, and empirical validation to ensure a fair comparison with the DDPG-based controller. These values were used to normalize and scale each term in the cost function to comparable magnitudes, ensuring balanced penalization of spacing error, speed tracking, control effort, and passenger comfort. Sensitivity analysis was conducted to confirm that the selected values produced smooth, realistic vehicle behavior aligned with the real-world highD data.

6 Performance Analysis

Here, we compare CF behavior detected in the experimental highD dataset with the behavior produced by the DDPG and MPC-ACC methods. This analysis demonstrated the models’ capability to follow LV safely, efficiently, energy efficiency, and comfortably. The estimation is achieved with testing data, ensuring the reliability of the results. Notably, the DDPG model exhibits excellent safety performance, with no collisions recorded during the testing phase. By taking the trajectories of the LV as input, the DDPG model generates adaptive and accurate trajectories for the FV.

This comparison underscores the efficiency of the DDPG method in addressing real-world CF challenges while maintaining a balance between safety, comfort, energy efficiency, and operational efficiency.

6.1 Safety Performance

Using TTC analysis safety performance of the DDPG model is assessed. As demonstrated in Fig. 4, TTC Distribution, the DDPG algorithm ensures significantly safer following distances compared to MPC-based ACC and human drivers. Notably, 95% of the TTC values for DDPG exceed 5 s, while MPC and human-driven vehicles present higher risk profiles, with their TTC distributions peaking at 3–4 s and 4–5 s, respectively. Further validation of DDPG’s robustness in critical scenarios is provided in Fig. 5. During a sudden LV deceleration event (simulated at t = 10–15 s), DDPG consistently maintains a TTC greater than 5 s, demonstrating its predictive control and adaptability. In contrast, the MPC algorithm breaches the critical 2-s safety threshold at t = 12 s, emphasizing its reactive nature and limited foresight. Human drivers, though performing slightly better than MPC, still experience occasional TTC dips below 3 s, highlighting the inherent delays in human response.

images

Figure 4: TTC distribution of safety performance

images

Figure 5: TTC during LV braking

6.2 Efficiency

The efficiency of the DDPG model is evaluated through speed tracking accuracy and headway compliance, which are critical for maintaining smooth and safe traffic flow. Fig. 6 shows that 90% of DDPG speed tracking errors are below 1.2 m/s (σ = 0.54), In contrast, the MPC algorithm exhibits larger deviations (σ = 1.1), with 15% of errors exceeding 2 m/s, indicating less precise speed regulation. Human drivers, while adaptive, show even higher variability, with frequent overshooting and undershooting of the target speed. Fig. 7 (Speed Deviation over Time) provides a temporal analysis of speed tracking performance during a simulated LV acceleration event (t = 5–10 s). Headway compliance for DDPG is 89.2% vs. 76.5% for MPC, limiting speed deviations to <1.5 m/s throughout the maneuver. This is achieved through smooth and anticipatory acceleration adjustments, which minimize abrupt speed changes. On the other hand, the MPC algorithm struggles to maintain consistent speed tracking, with deviations occasionally exceeding 2 m/s during the acceleration phase. Human drivers, while capable of adapting, exhibit higher variability in speed control, leading to larger deviations and reduced traffic flow efficiency.

images

Figure 6: Speed tracking error distribution

images

Figure 7: Speed deviation over time

6.3 Energy Efficiency

The energy efficiency of DDPG is analyzed by the use of fuel consumption through a process that focuses on various driving conditions. Here in Fig. 8 are going to notice that the fuel requirements may change for DDPG or MPC methodologies. The research results confirm the fact that the DDPG model is more efficient than the MPC plan; the top fuel consumption is only 5.8 L/100 km, opposed to 6.3 L/100 km for MPC. DDPG reduces fuel use by 12% compared to MPC. Energy efficiency scores are 0.79 for DDPG vs 0.72 for MPC, not to mention the software’s real-time adaptability to diverse driving situations. Moreover, Fig. 9 provides us with detailed statistics on the fuel consumption during constant speed driving (for 5 to 15 s). In this case, the DDPG algorithm outperforms the MPC by 12% decrease in fuel consumption. This progress is made possible through DDPG’s implementation of smart fuel-saving methods, which aid the vehicle in maintaining its speed and thus also the fuel usage. By cutting down the energy requirements for the regular cruise, the DDPG gains a hopeful future of the improvement of sustainability as well as the efficiency of driverless vehicles.

images

Figure 8: Fuel consumption distribution

images

Figure 9: Fuel rate during cruising

6.4 Comfort

Jerk analysis is a crucial method of assessing the comfort created by driving, especially the measure of how a vehicle accelerates and decelerates smoothly. According to the data depicted in Fig. 10, the distribution of jerk values is shown for both the DDPG and MPC-based control methods. The observed results indicate that the DDPG algorithm contributes to a significantly smoother driving experience, as 95% of the jerk values sent out do not rise above 0.3, hence there are minimal oscillations in acceleration. In contrast, MPC and human drivers often deal with higher jerk values, many spikes above 0.5 m/s, which means ride that is uncomfortable and bumpy. During stop-and-go traffic, DDPG triggers 1.2 harsh braking events per hour, compared to 3.8 for MPC. The average jerk is 0.12 m/s3 for DDPG vs. 0.18 m/s3 for MPC in Fig. 11. In situations that require a lot of stopping and starting, DDPG only triggers about 1.2 harsh braking events per hour, while the MPC strategy results in a much higher rate of 3.8 harsh braking events per hour. The Average Jerk metric is calculated as the time-averaged absolute jerk magnitude over the full driving trajectory for each simulation episode. This involves computing the instantaneous jerk at each time step, taking the absolute value, and averaging over the episode duration to reflect overall smoothness and passenger comfort. This clearly shows DDPG’s improved capability to handle braking events smoothly and effectively, reducing the harsh and uncomfortable decelerations that are often associated with other control methods. By cutting down on sudden jerks and harsh braking, DDPG plays a significant role in creating more comfortable and stable driving experience, especially in traffic where frequent stop-and-go is necessary.

images

Figure 10: Jerk distribution

images

Figure 11: Jerk profile in stop-and-go traffic

6.5 Safety Overrides

Safety overrides are crucial in avoiding accidents and ensuring both vehicle and passenger safety. Fig. 12 presents the distribution of acceleration commands for both DDPG and MPC-based control strategies. The examination indicates that DDPG activates safety overrides, issuing a deceleration command of −3 m/s2 in 2.14% of cases. This rapid response helps avoid potential collisions by slowing the vehicle when necessary. Fig. 13 further shows DDPG’s safety advantages in critical situations. For example, during an LV braking event at t = 15 s, DDPG decelerates early to maintain a safe distance, reducing collision risks. In contrast, the MPC algorithm reacts too late, conceding its ability to prevent an impact effectively. This comparison underscores DDPG’s ability to anticipate hazards and apply timely braking, making it a safer and more reliable alternative to traditional control methods.

images

Figure 12: Acceleration command distribution

images

Figure 13: Acceleration during critical scenarios

6.6 Training Performance

The performance of the DDPG algorithm during training is calculated based on policy convergence and the stability of the critic network. Fig. 14 illustrates how episode rewards change over the course of training. The results show that the DDPG algorithm achieves stable policy convergence after approximately 800 episodes, indicating a continuous enhancement in decision-making and overall performance. This stable progress highlights the framework’s efficiency in refining control strategies and adapting to complex driving conditions. Fig. 15 provides further insights into the learning dynamics of the actor and critic networks. The graph reveals that the critic network stabilizes faster than the actor-network, allowing for more accurate Q-value estimation early in the training process. This faster convergence of the critic network contributes to more reliable learning and improved decision-making as training progresses. This swift convergence of the critic network is vital for stabilizing the learning process and boosting the model’s overall reliability, especially in reinforcement learning, where getting precise value estimates is essential for achieving the best policy performance.

images

Figure 14: Episode reward convergence

images

Figure 15: Actor-critic loss over time

6.7 Statistical Significance Testing

Especially, the faster convergence of the critic network plays a key role in shaping the smoothness of the learned policy. In actor-critic frameworks like DDPG, the critic provides Q-value estimates that guide the actor’s policy updates. A stable critic ensures more accurate gradients and reduces policy variance, encouraging smoother acceleration and deceleration behaviors. This contributes directly to minimizing jerk and harsh braking, as observed in our results. Although a full component ablation study is beyond the current scope, preliminary tests indicate that slower critic convergence leads to increased jerk variability and suboptimal comfort performance. This highlights the importance of critic network stability in achieving both performance and comfort objectives. Ensure statistical validity; each paired t-test was conducted using 30 matched simulation episodes (n = 20), resulting in 19 degrees of freedom for each test. This sample size reflects a sufficient number of independent scenarios to evaluate the generalizability of the performance metrics across diverse CF conditions. To provide a more rigorous quantitative validation of the observed performance improvements, we conducted paired t-tests comparing the proposed DDPG-based CF model against the baseline MPC-ACC approach across four key performance metrics: Average TTC, Headway Compliance, Fuel Consumption, and Average Jerk. These metrics were selected for statistical analysis as they are continuous, consistently measured across multiple test scenarios, and represent the core objectives of autonomous driving: safety, efficiency, energy conservation, and passenger comfort. The statistical tests were conducted at 95% confidence level to determine whether the differences in performance were significant and not due to random variation. As shown in Table 4, the results confirm that the DDPG model significantly outperforms the MPC-based controller in all four metrics. Specifically, the DDPG framework achieved longer TTC values, indicating enhanced safety; higher headway compliance, reflecting better adherence to efficient traffic flow; reduced fuel consumption, demonstrating energy efficiency; and lower jerk values, signifying smoother and more comfortable driving experience. These findings reinforce the empirical results presented earlier and provide strong statistical evidence that the improvements achieved by the DDPG-based framework are not only consistent but statistically significant, underscoring its effectiveness in addressing real-world autonomous vehicle control challenges.

images

6.8 Discussion

The DDPG-based autonomous driving framework demonstrated clear advantages over both MPC and human drivers in terms of safety, efficiency, energy efficiency, and comfort. Fig. 16 illustrates sampled CF trajectories during an LV deceleration event (t = 0–30 s), providing a comprehensive assessment of system performance. The comparison reveals distinct differences in behavioral responses. DDPG proactively adjusts its speed to maintain safe following distance with an average error of 0.52 m and minimal jerk (0.15 m/s3), ensuring smooth and consistent driving. This proactive adjustment helps to maintain safety and comfort by reducing sudden accelerations or decelerations. In contrast, the MPC approach reacts abruptly to the LV’s deceleration at t = 10 s, resulting in temporary TTC violations and increased fuel consumption. This delayed response highlights the reactive nature of MPC and its inability to adjust to changes in the driving environment predictively. Human drivers, exhibiting erratic speed adjustments, show a larger following distance error (1.85 m) and maximum jerk value of 0.67 m/s3, reflecting consistency and response time challenges during rapidly changing driving scenarios. These findings, summarized in Table 5, emphasize the performance of DDPG in maintaining a safer, smoother, and more efficient driving profile compared to both MPC and human drivers, particularly in dynamic situations such as sudden decelerations. Although this study primarily uses TTC as the quantitative safety indicator, previous research has shown that TTC is closely related to other safety metrics such as minimum gap distance, time headway, and collision risk index [1]. A lower TTC typically implies reduced clearance distance and a higher risk of collision, while higher TTC values indicate safer spatial separations. Given that the observed TTC values in our experiments consistently exceed commonly accepted safety thresholds, it can be inferred that the corresponding minimum gap distances and collision risks remain within safe limits. Therefore, while additional metrics could provide further granularity, the TTC-based evaluation reliably reflects the safety performance of the proposed DDPG-based framework without altering the conclusions of this study.

images

Figure 16: Sampled trajectories during LV declaration

images

The suggested DDPG-based model demonstrated an inference time of 12.5 ms per step, which is compatible with the real-time constraints of embedded autonomous vehicle systems operating at 50–100 ms control loop intervals [30]. While the simulation-based results confirm model performance, actual vehicle deployment introduces additional challenges such as sensor noise and actuator latency. These can degrade model reliability and responsiveness. To address these issues, future work will explore robustness enhancement techniques, including Gaussian noise injection during training and dynamics randomization. Additionally, we plan to investigate delay-aware decision-making strategies and apply model compression techniques to enable deployment on low-power embedded platforms. These enhancements aim to ensure that the suggested framework is both efficient and practical for real-world car-following control in intelligent autonomous vehicles.

While the suggested DDPG-based car-following model demonstrated excellent performance on safety, efficiency, energy consumption, and comfort using the highD dataset, there are certain limitations regarding its generalization capabilities. The highD dataset is collected from German highway scenarios, which may not fully capture the complexity of urban environments or mixed traffic conditions, such as interactions with pedestrians, cyclists, or non-lane-based driving behaviors. In real-world applications, autonomous vehicles must operate reliably under diverse driving conditions and cultural contexts. Driving behaviors, road infrastructure, and traffic dynamics can vary significantly across different countries and environments. To ensure broader applicability, future research should explore cross-domain validation using other localized datasets representing varied traffic cultures.

Moreover, domain adaptation and transfer learning techniques can be employed to adapt the model to unseen conditions without requiring full retraining. Integrating simulation data with real-world observations from urban, suburban, and rural settings can also help bridge the gap between controlled training environments and practical deployment. By addressing these challenges, the proposed framework can evolve into a more universally robust system capable of supporting intelligent transportation systems across different geographic and cultural contexts.

Table 5 summarizes the consolidated performance metrics for our proposed DDPG-based controller and the baseline MPC system. While MPC is theoretically optimal under accurate modeling and known constraints, its performance degrades when facing nonlinearities, disturbances, and unmodeled dynamics. In contrast, the DDPG controller learns directly from diverse driving scenarios, allowing it to adapt to changing conditions and maintain robust performance across all metrics. Notably, DDPG achieves zero collisions, higher average time-to-collision, better headway compliance, reduced fuel consumption, and significantly smoother driving. Additionally, inference time per control step is approximately 3.6× faster than MPC, enabling quicker decision-making in dynamic traffic environments.

7 Conclusion

This research introduces an RL-based CF model planned to improve the autonomous speed control of vehicles by optimizing multiple critical factors, including safety, efficiency, energy consumption, and passenger comfort. The suggested model leverages the DDPG algorithm to learn optimal driving strategies from real-world traffic data. Through continuous interaction with the environment, the model refines its decision-making process, demonstrating its effectiveness in managing autonomous vehicles in dynamic and congested traffic conditions. The findings highlight that the DDPG-based model significantly outperforms traditional MPC-ACC systems across several key performance metrics. In terms of safety, the RL-based method maintains longer TTC values and proactively avoids hazardous maneuvers during critical driving scenarios, such as sudden braking events. This proactive safety management reduces the risk of rear-end collisions and enhances the overall reliability of autonomous driving systems. Efficiency is another area where the proposed model excels. The DDPG-based control strategy improves speed-tracking accuracy and significantly reduces unnecessary fluctuations in velocity. This refined speed regulation leads to lower fuel consumption and better energy efficiency compared to conventional MPC-based ACC methods. By optimizing acceleration and deceleration patterns, the model not only minimizes energy waste but also contributes to the broader objective of sustainable transportation.

Furthermore, the RL-based model enhances driving comfort by mitigating abrupt changes in acceleration, which are commonly referred to as “jerk.” A smoother transition between speed adjustments creates a more stable and comfortable driving experience for passengers, reducing motion sickness and improving overall ride quality. The ability of the model to balance safety, efficiency, and comfort underscores the advantages of reinforcement learning in autonomous vehicle control. A crucial component of the model’s success is the implementation of a multi-objective reward function that explicitly accounts for safety, energy efficiency, and ride comfort. This comprehensive reward structure ensures stable and effective training, enabling the model to make well-rounded driving decisions. Additionally, incorporating a kinematic collision avoidance strategy further enhances the reliability of the CF system by minimizing the likelihood of collisions, even in complex driving conditions.

Finally, the proposed RL-based CF model presents a promising approach to optimizing the behavior of autonomous vehicles in various traffic scenarios. The results of this study reinforce the potential of reinforcement learning to revolutionize autonomous driving by improving safety, reducing energy consumption, and enhancing ride comfort. Moving forward, future research could focus on refining the model further, such as developing real-time adaptation mechanisms to handle highly dynamic traffic conditions. Incorporating additional factors, such as driver preferences and external environmental constraints, could help create more personalized and scalable autonomous driving solutions, paving the way for safer and more efficient intelligent transportation systems.

This research uses simplified vehicle model focused on freeway scenarios, which may limit applicability in complex urban environments. The reward weights were fixed and tuned empirically, potentially restricting adaptability to diverse driving styles. Some assumptions, like constant reaction time, may not fully reflect real-world conditions. Future work could include more realistic vehicle dynamics, adaptive reward functions, and multi-agent RL for cooperative driving. Online learning could enable real-time adaptation to changing traffic conditions. The suggested model shows promise for integration into advanced driver-assistance systems and intelligent transportation networks, improving safety, efficiency, and comfort. Real-world testing and simulation will be essential for practical deployment. Overall, RL offers great potential to enhance autonomous driving by balancing multiple objectives, paving the way for safer and more efficient transportation.

Acknowledgement: Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R308), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Funding Statement: 1. The authors would like to thank the Hebei Province Science and Technology Plan Project (19221909D). 2. Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R308), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Author Contributions: Conceptualization, Abu Tayab; writing—original draft preparation, Yanwen Li; supervision, Abu Tayab and Yanwen Li; validation and formal analysis, Yanwen Li; review—writing and editing, Ahmad Syed; software, Ghanshyam G. Tejani and El-Sayed M. El-kenawy; data curation, Doaa Sami Khafaga; investigation, Marwa M. Eid; resources, Amel Ali Alhussan; methodology, Ghanshyam G. Tejani and Yanwen Li; visualization, Marwa M. Eid and El-Sayed M. El-kenawy. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: Data is available on request from the corresponding authors.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Jo A, Yoon Y. Enhancing ride comfort and driving safety in urban autonomous buses: a data-driven hierarchical longitudinal motion planning using approximate and stochastic MPCs. Proc Inst Mech Eng D J Automob Eng. 2025;239(6):1848–71. doi:10.1177/09544070241296176. [Google Scholar] [CrossRef]

2. Negash NM, Yang J. Driver behavior modeling toward autonomous vehicles: comprehensive review. IEEE Access. 2023;11(4):22788–821. doi:10.1109/access.2023.3249144. [Google Scholar] [CrossRef]

3. Haque MM, Sarker S, Dewan MAA. Driving maneuver classification from time series data: a rule based machine learning approach. Appl Intell. 2022;52(14):16900–15. doi:10.1007/s10489-022-03328-3. [Google Scholar] [PubMed] [CrossRef]

4. Gunter G, Stern R, Work DB. Modeling adaptive cruise control vehicles from experimental data: model comparison. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE; 2019. p. 3049–54. [Google Scholar]

5. Yurtsever E, Lambert J, Carballo A, Takeda K. A survey of autonomous driving: common practices and emerging technologies. IEEE Access. 2020;8:58443–69. doi:10.1109/access.2020.2983149. [Google Scholar] [CrossRef]

6. Li L, Ota K, Dong M. Humanlike driving: empirical decision-making system for autonomous vehicles. IEEE Trans Veh Technol. 2018;67(8):6814–23. doi:10.1109/tvt.2018.2822762. [Google Scholar] [CrossRef]

7. Oladimeji D, Gupta K, Kose NA, Gundogan K, Ge L, Liang F. Smart transportation: an overview of technologies and applications. Sensors. 2023;23(8):3880. doi:10.3390/s23083880. [Google Scholar] [PubMed] [CrossRef]

8. Xu Y, Lu Y, Ji C, Zhang Q. Adaptive graph fusion convolutional recurrent network for traffic forecasting. IEEE Internet Things J. 2023;10(13):11465–75. doi:10.1109/jiot.2023.3244182. [Google Scholar] [CrossRef]

9. Lillicrap TP. Continuous control with deep reinforcement learning. arXiv:1509.02971. 2015. [Google Scholar]

10. Rahman MH, Abdel-Aty M, Wu Y. A multi-vehicle communication system to assess the safety and mobility of connected and automated vehicles. Transp Res Part C Emerg Technol. 2021;124:102887. doi:10.1016/j.trc.2020.102887. [Google Scholar] [CrossRef]

11. Treiber M, Kesting A. The intelligent driver model with stochasticity-new insights into traffic flow oscillations. Transp Res Procedia. 2017;23(4):174–87. doi:10.1016/j.trpro.2017.05.011. [Google Scholar] [CrossRef]

12. Zhang Y, Bai Y, Hu J, Wang M. Control design, stability analysis, and traffic flow implications for cooperative adaptive cruise control systems with compensation of communication delay. Transp Res Rec. 2020;2674(8):638–52. doi:10.1177/0361198120918873. [Google Scholar] [CrossRef]

13. Brackstone M, McDonald M. Car-following: a historical review. Transp Res Part F Traffic Psychol Behav. 1999;2(4):181–96. doi:10.1016/s1369-8478(00)00005-x. [Google Scholar] [CrossRef]

14. Sarker A, Shen H, Rahman M, Chowdhury M, Dey K, Li F, et al. A review of sensing and communication, human factors, and controller aspects for information-aware connected and automated vehicles. IEEE Trans Intell Transp Syst. 2019;21(1):7–29. doi:10.1109/tits.2019.2892399. [Google Scholar] [CrossRef]

15. Yu L, Wang R. Researches on adaptive cruise control system: a state of the art review. Proc Inst Mech Eng D J Automob Eng. 2022;236(2–3):211–40. [Google Scholar]

16. Elliott D, Keen W, Miao L. Recent advances in connected and automated vehicles. J Traffic Transp Eng (Engl Ed). 2019;6(2):109–31. doi:10.1016/j.jtte.2018.09.005. [Google Scholar] [CrossRef]

17. Wei H, Shi Y. MPC-based motion planning and control enables smarter and safer autonomous marine vehicles: perspectives and a tutorial survey. IEEE/CAA J Autom Sin. 2022;10(1):8–24. doi:10.1109/jas.2022.106016. [Google Scholar] [CrossRef]

18. Zhou H, Zhou A, Li T, Chen D, Peeta S, Laval J. Congestion-mitigating MPC design for adaptive cruise control based on Newell’s car following model: history outperforms prediction. Transp Res Part C Emerg Technol. 2022;142(5):103801. doi:10.1016/j.trc.2022.103801. [Google Scholar] [CrossRef]

19. Jamil H, Naqvi SSA, Iqbal N, Khan MA, Qayyum F, Muhammad F. Analysis on the driving and braking control logic algorithm for mobility energy efficiency in electric vehicle. Smart Grids Sustain Energy. 2024;9(1):12. doi:10.1007/s40866-023-00190-1. [Google Scholar] [CrossRef]

20. Yu Y, Tang J, Huang J, Zhang X, So DKC, Wong K-K. Multi-objective optimization for UAV-assisted wireless powered IoT networks based on extended DDPG algorithm. IEEE Trans Commun. 2021;69(9):6361–74. doi:10.1109/tcomm.2021.3089476. [Google Scholar] [CrossRef]

21. Jang B, Kim M, Harerimana G, Kim JW. Q-learning algorithms: a comprehensive classification and applications. IEEE Access. 2019;7:133653–67. doi:10.1109/access.2019.2941229. [Google Scholar] [CrossRef]

22. Tan F, Yan P, Guan X. Deep reinforcement learning: from Q-learning to deep Q-learning. In: Neural Information Processing: 24th International Conference, ICONIP 2017; 2017 Nov 14–18; Guangzhou, China. Springer. 2017. p. 475–83. [Google Scholar]

23. Elmorshedy L, Smirnov I, Abdulhai B. Freeway congestion management on multiple consecutive bottlenecks with RL-based headway control of autonomous vehicles. IET Intell Transp Syst. 2024;18(6):1137–63. doi:10.1049/itr2.12492. [Google Scholar] [CrossRef]

24. Chen X, Zhu M, Chen K, Wang P, Lu H, Zhong H, et al. FollowNet: a comprehensive benchmark for car-following behavior modeling. Sci Data. 2023;10(1):828. [Google Scholar] [PubMed]

25. Del Rio A, Jimenez D, Serrano J. Comparative analysis of A3C and PPO algorithms in reinforcement learning: a survey on general environments. IEEE Access. 2024;12:146795–806. doi:10.1109/access.2024.3472473. [Google Scholar] [CrossRef]

26. Han L, Liu G, Zhang H, Fang R, Zhu C. Fuel-saving control strategy for fuel vehicles with deep reinforcement learning and computer vision. Int J Automot Technol. 2023;24(3):609–21. [Google Scholar]

27. Gao F, Wang X, Fan Y, Gao Z, Zhao R. Constraints driven safe reinforcement learning for autonomous driving decision-making. IEEE Access. 2024;12:128007–23. doi:10.1109/ACCESS.2024.3454249. [Google Scholar] [CrossRef]

28. Krajewski R, Bock J, Kloeker L, Eckstein L. The highD dataset: a drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE; 2018. p. 2118–25. [Google Scholar]

29. Zhu M, Wang Y, Pu Z, Hu J, Wang X, Ke R. Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving. Transp Res Part C Emerg Technol. 2020;117(2):102662. doi:10.1016/j.trc.2020.102662. [Google Scholar] [CrossRef]

30. Pan X, You Y, Wang Z, Lu C. Virtual to real reinforcement learning for autonomous driving. arXiv:1704.03952. 2017. [Google Scholar]

Cite This Article

APA Style

Tayab, A., Li, Y., Syed, A., Tejani, G.G., Khafaga, D.S. et al. (2026). A Multi-Objective Adaptive Car-Following Framework for Autonomous Connected Vehicles with Deep Reinforcement Learning. Computers, Materials & Continua, 86(2), 1–27. https://doi.org/10.32604/cmc.2025.070583

Vancouver Style

Tayab A, Li Y, Syed A, Tejani GG, Khafaga DS, El-kenawy EM, et al. A Multi-Objective Adaptive Car-Following Framework for Autonomous Connected Vehicles with Deep Reinforcement Learning. Comput Mater Contin. 2026;86(2):1–27. https://doi.org/10.32604/cmc.2025.070583

IEEE Style

A. Tayab et al., “A Multi-Objective Adaptive Car-Following Framework for Autonomous Connected Vehicles with Deep Reinforcement Learning,” Comput. Mater. Contin., vol. 86, no. 2, pp. 1–27, 2026. https://doi.org/10.32604/cmc.2025.070583

BibTex EndNote RIS

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table of Content

A Multi-Objective Adaptive Car-Following Framework for Autonomous Connected Vehicles with Deep Reinforcement Learning

Abstract

Keywords

References

Cite This Article

1331

426

0

Related articles

Further Information

Guidelines

Follow Us

Join Us

Contact Us

WhatsApp:

Share Link