This work was developed as a Project Work for the Autonomous Agents and Intelligent Robotics course, taught by Professor Giorgio Battistelli, as part of the Master's Degree in Artificial Intelligence at the University of Florence, Italy.
The main objective is to evaluate and compare different Reinforcement Learning algorithms for solving an Autonomous Platoon Control problem by partially reproducing, on a smaller scale, the experimental results obtained in the following reference paper:
Autonomous Platoon Control with Integrated Deep Reinforcement Learning and Dynamic Programming, Tong Liu, Lei Lei, Kan Zheng, Kuan Zhang; 2022.
Autonomous Platoon Control is a highly important task for the future of intelligent transportation systems. Through automated coordination of vehicles in platoon formation, it is possible to optimize traffic flow, reduce fuel consumption, and improve road safety. The main challenge is maintaining an optimal distance between a system of queued vehicles while they adapt to the leader's speed variations.
The setup of this problem follows exactly the one implemented in the reference paper, with the only simplification being the presence of a single agent vehicle and a single preceding vehicle, the leader. All vehicles follow first-order dynamics:

$$\dot{p}_i(t) = v_i(t), \qquad \dot{v}_i(t) = acc_i(t), \qquad \dot{acc}_i(t) = -\frac{1}{\tau}\,acc_i(t) + \frac{1}{\tau}\,u_i(t)$$

where $p_i(t)$, $v_i(t)$ and $acc_i(t)$ are the position, velocity and acceleration of vehicle $i$, $u_i(t)$ is its control input, and $\tau$ is the driveline dynamics time constant.
To prevent divergences and unrealistic acceleration spikes that could compromise the agent's training, constraints are imposed on the agent's acceleration and control input: $acc_{min} \le acc_i(t) \le acc_{max}$ and $u_{min} \le u_i(t) \le u_{max}$, with the bounds listed in the parameter table below.
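In the environment this simply amounts to clamping both quantities to their bounds. A minimal sketch using the limits from the parameter table further below (the helper name is an assumption, not the actual environment code):

```python
import numpy as np

ACC_MIN, ACC_MAX = -2.6, 2.6   # acceleration limits [m/s^2]
U_MIN, U_MAX = -2.6, 2.6       # control input limits [m/s^2]

def clamp_dynamics(acc, u):
    """Keep acceleration and control input within their allowed ranges."""
    return np.clip(acc, ACC_MIN, ACC_MAX), np.clip(u, U_MIN, U_MAX)
```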
The success of the platoon control task strongly depends on maintaining the correct distance between vehicles. In the reference paper, the headway $d_i(t)$ is defined as the bumper-to-bumper distance between two consecutive vehicles:

$$d_i(t) = p_{i-1}(t) - p_i(t) - L_{i-1}$$

where $p_{i-1}(t)$ and $p_i(t)$ are the positions of the preceding vehicle and of vehicle $i$, and $L_{i-1}$ is the length of the preceding vehicle.
At any time instant $t$, the desired headway is defined through a constant time-gap policy:

$$d_{r,i}(t) = r_i + h\,v_i(t)$$

where $r_i$ is the standstill distance and $h$ is the constant time gap between vehicle $i$ and its predecessor.
Optimal platoon control is achieved when each vehicle adjusts its motion dynamics so as to maintain the desired distance from the preceding vehicle over time. Platoon control can therefore be cast as a minimization problem in which the agent minimizes two error signals: the position gap-keeping error $e_{p,i}(t) = d_i(t) - d_{r,i}(t)$, which measures the deviation from the desired headway, and the velocity error $e_{v,i}(t) = v_{i-1}(t) - v_i(t)$, which must vanish for the desired headway to be not only reached but also maintained over time.
The state space consists, at each timestep $k$, of three elements: the position gap-keeping error, the velocity error, and the agent's acceleration, i.e. $s_k = [\,e_{p,i}(k),\ e_{v,i}(k),\ acc_i(k)\,]$.

The action space consists of a single value: the control input $u_i(k)$, bounded in $[u_{min}, u_{max}]$.
The system evolves according to two distinct discrete-time dynamic models for the leader and the followers, obtained with sampling interval $T$:

Leader:

$$p_0(k+1) = p_0(k) + T\,v_0(k), \quad v_0(k+1) = v_0(k) + T\,acc_0(k), \quad acc_0(k+1) = \Big(1 - \tfrac{T}{\tau}\Big)acc_0(k) + \tfrac{T}{\tau}\,u_0(k)$$

Follower $i$:

$$e_{p,i}(k+1) = e_{p,i}(k) + T\,e_{v,i}(k) - h\,T\,acc_i(k), \quad e_{v,i}(k+1) = e_{v,i}(k) + T\,\big(acc_{i-1}(k) - acc_i(k)\big), \quad acc_i(k+1) = \Big(1 - \tfrac{T}{\tau}\Big)acc_i(k) + \tfrac{T}{\tau}\,u_i(k)$$
For the leader, the evolution depends only on its current state and control input. For the follower, however, the evolution depends on its own state, its own control input, and the acceleration of the preceding vehicle. This dependence on the predecessor's acceleration allows the follower to anticipate speed variations of the preceding vehicle, thus making the system more stable.
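As a concrete illustration of the follower model, below is a minimal Euler-discretized update consistent with the equations above; the function name and default values are assumptions, not the actual environment code.

```python
import numpy as np

def follower_step(state, u, acc_prev, T=0.1, tau=0.1, h=1.0):
    """One Euler-discretized transition of the follower state.

    state    : np.array [e_p, e_v, acc] (gap error, velocity error, acceleration)
    u        : control input of the follower
    acc_prev : acceleration of the preceding vehicle (the leader here)
    """
    e_p, e_v, acc = state
    e_p_next = e_p + T * e_v - h * T * acc              # gap error shrinks as the gap closes
    e_v_next = e_v + T * (acc_prev - acc)               # relative velocity driven by both accelerations
    acc_next = (1.0 - T / tau) * acc + (T / tau) * u    # first-order driveline lag towards u
    return np.array([e_p_next, e_v_next, acc_next])
```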
A Huber-like reward function is used to evaluate the agent's behavior: it penalizes the position gap-keeping error and the velocity error (normalized by their nominal maximum values), the magnitude of the control input, and the jerk, combining an absolute-value regime and a quadratic regime that are switched at a fixed reward threshold (the corresponding scale and threshold values are listed in the parameter table below).

The parameters $a$, $b$, and $c$ weight the individual terms of the reward:

- $a$ balances the importance of the velocity error relative to the position gap-keeping error
- $b$ penalizes overly aggressive control inputs, promoting smoother behavior
- $c$ penalizes sudden acceleration changes (jerk), contributing to driving comfort
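The exact expression used in the paper is not reproduced here. Purely as an illustration, the sketch below shows one plausible Huber-like shape, assuming an absolute-value regime for large errors and a scaled quadratic regime near the optimum, with the normalization, scale, and threshold values taken from the parameter table; the parameter names and the switching rule are assumptions.

```python
def huber_like_reward(e_p, e_v, u, jerk,
                      a=0.1, b=0.1, c=0.1,
                      e_p_nom=15.0, e_v_nom=10.0,
                      scale=5e-3, threshold=-0.4483):
    """Hypothetical Huber-like reward sketch (not the paper's exact formula)."""
    # Absolute-value regime: linear penalty, dominant for large errors.
    r_abs = -(abs(e_p) / e_p_nom + a * abs(e_v) / e_v_nom
              + b * u ** 2 + c * jerk ** 2)
    if r_abs < threshold:
        return r_abs
    # Quadratic regime near the optimum, scaled by the reward scale (assumed structure).
    return -scale * (e_p ** 2 + a * e_v ** 2 + b * u ** 2 + c * jerk ** 2)
```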
Given the expected cumulative reward $\mathbb{E}\!\left[\sum_{k=0}^{K-1} \gamma^{k} r_k\right]$, the agent's objective is to find a policy that maximizes it over the episode horizon.
The reference paper proposes an integrated approach combining Deep Reinforcement Learning and Dynamic Programming, using an algorithm called FH-DDPG-SS. This method is based on DDPG (Deep Deterministic Policy Gradient) and is designed to handle a complex multi-agent system with multiple vehicles in platoon.
In this work, we implemented a simplified single-agent environment, where the agent must adjust its dynamics to match those of a single leading vehicle, which are set in advance. Therefore, we have a context with only two vehicles, one of which is the agent itself. The agent is trained using two different Q-Learning algorithms, whose performances will be compared:
- Tabular Q-Learning: This represents the most "classical" approach to Reinforcement Learning, where the Q-function is explicitly represented as a table. While in DQL the state space is continuous, in Tabular Q-Learning both the state space and the action space are uniformly quantized. The Q-Table, holding a value for each possible state-action pair, is initialized with random values in the range [-0.1, 0.1], and its update follows the Bellman equation (a minimal sketch of this update is given right after this list):

  $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$

  where $Q(s_t, a_t)$ is the current Q-value for the state-action pair, $\alpha$ is the learning rate, $r_t$ is the immediate reward, $\gamma$ is the discount factor, $\max_{a} Q(s_{t+1}, a)$ is the maximum Q-value attainable in the next state, and the bracketed term $[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)]$ is the TD error.
- Deep Q-Learning (DQL): Deep Q-Learning extends classic Q-Learning by using a deep neural network to approximate the Q-function, making it possible to work with a continuous state space. The implementation for this problem includes uniform quantization of the action space over the interval $[u_{min}, u_{max}]$; an Experience Replay Buffer to store and sample state transitions; a Target Network, updated periodically, to stabilize learning by keeping the bootstrap targets fixed over time; and an ε-greedy policy to balance exploration and exploitation.
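For reference, a minimal NumPy sketch of the tabular update described above; the table size, learning rate, and discount factor are illustrative assumptions.

```python
import numpy as np

N_STATES, N_ACTIONS = 10 ** 3, 10      # e.g. 3 state variables with 10 bins each, 10 actions
rng = np.random.default_rng(0)
Q = rng.uniform(-0.1, 0.1, size=(N_STATES, N_ACTIONS))   # random init in [-0.1, 0.1]

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Single Bellman update of the Q-table for the transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # best value reachable from the next state
    td_error = td_target - Q[s, a]              # temporal-difference error
    Q[s, a] += alpha * td_error
```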
The implementation was developed in Python using PyTorch for DQL and NumPy for Tabular Q-Learning. Training was monitored through Weights & Biases.
The `EnvPlatoon` class implements the structure of the platoon environment. It contains numerous attributes that define the specific behavior of the system, imposing limits and constraints and sizing the variables that appear in the equations defining each vehicle's dynamics.
The class includes a `reset` function, called at the beginning of each episode, which initializes a specific leader action pattern, resets the rewards, and randomly samples an initial agent state within the intervals defined for each element of the state.
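As an example, the random initialization could look like the following sketch; the bounds come from the parameter table, while the uniform sampling and the initial acceleration range are assumptions about the actual reset logic.

```python
import numpy as np

def sample_initial_state(rng, ep_max=2.0, ev_max=1.5, acc_max=2.6):
    """Sample a random initial agent state within the configured intervals."""
    e_p = rng.uniform(-ep_max, ep_max)    # initial position gap-keeping error [m]
    e_v = rng.uniform(-ev_max, ev_max)    # initial velocity error [m/s]
    acc = rng.uniform(-acc_max, acc_max)  # initial acceleration [m/s^2] (assumed range)
    return np.array([e_p, e_v, acc])

# Example usage:
# rng = np.random.default_rng(seed=0)
# initial_state = sample_initial_state(rng)
```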
The `step` function applies the first-order dynamics to the agent, performing the state transition: acceleration, position gap-keeping error, and velocity error are calculated for the next timestep, and the reward is computed by the `compute_reward` function. While analyzing the reference paper, I noticed the absence of a mechanism to penalize the agent when its state would lead to a negative distance from the leader, which in reality would result in a collision. Therefore, in my implementation I included an additional `collision_penalty` in the reward calculation that depends on the distance from the leader.
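A possible way to wire such a penalty into the reward, shown here only as a sketch: the penalty shape and weight are my assumptions, since the text above only states that the penalty depends on the distance from the leader.

```python
def add_collision_penalty(base_reward, headway, weight=1.0):
    """Penalize the agent when the bumper-to-bumper distance becomes non-positive.

    The penalty grows with the amount of overlap, so it depends on the
    distance from the leader; shape and weight are illustrative assumptions.
    """
    if headway <= 0.0:
        return base_reward - weight * (1.0 + abs(headway))
    return base_reward
```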
| Description | Value |
|---|---|
| Interval for each timestep | 0.1 s |
| Total timesteps in each episode | 100 |
| Number of vehicles | 2 |
| Driveline dynamics time constant | 0.1 s |
| Time gap | 1 s |
| Maximum initial position gap-keeping error | 2 m |
| Maximum initial velocity error | 1.5 m/s |
| Minimum acceleration | -2.6 m/s^2 |
| Maximum acceleration | 2.6 m/s^2 |
| Minimum control input | -2.6 m/s^2 |
| Maximum control input | 2.6 m/s^2 |
| Reward coefficient for the position gap-keeping error term | 0.1 |
| Reward coefficient for the velocity error term | 0.1 |
| Reward coefficient for the jerk term | 0.1 |
| Reward scale | 5e-3 |
| Nominal maximum position gap-keeping error | 15 m |
| Nominal maximum velocity error | 10 m/s |
| Reward threshold | -0.4483 |
While the reference paper uses real driving data extracted from the Next Generation Simulation (NGSIM) dataset, for the scope of this work I implemented a simple pattern generator that creates diverse but controlled scenarios for testing the agent's behavior. The `LeaderPatternGenerator` class creates seven different leader movement patterns that the agent must learn to follow (a minimal sketch of one pattern is shown below the list):
- Uniform Motion (constant velocity)
- Uniformly Accelerated Motion with positive acceleration
- Uniformly Accelerated Motion with negative acceleration
- Sinusoidal Motion, simulating traffic flow patterns
- Stop-and-Go pattern typical of traffic situations
- Acceleration-Deceleration sequence
- Smooth random changes in acceleration
Visualization of the different leader patterns.
Each pattern is designed to test different aspects of the agent's learning capabilities and its ability to maintain proper distance in various driving scenarios.
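As an illustration, the sketch below generates one such pattern, the sinusoidal one, as a sequence of leader accelerations; the function name, amplitude, and period are assumptions rather than the actual `LeaderPatternGenerator` parameters.

```python
import numpy as np

def sinusoidal_pattern(num_steps=100, T=0.1, amplitude=1.0, period=5.0):
    """Leader acceleration profile oscillating like slow traffic waves.

    Returns one acceleration per timestep, well within the +/- 2.6 m/s^2 limits.
    """
    t = np.arange(num_steps) * T
    return amplitude * np.sin(2.0 * np.pi * t / period)
```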
The `DQNAgent` class implements the Deep Q-Learning agent. The underlying `DeepQNetwork` is a simple Multilayer Perceptron that has been tested with both one and two hidden layers. The implementation includes a configurable experience replay buffer. Action selection is performed using an ε-greedy policy that balances exploration and exploitation.
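To make the structure concrete, here is a minimal sketch of what the Q-network and the ε-greedy selection could look like; the layer sizes, the number of discrete actions, and the helper names are assumptions, not the actual `DQNAgent`/`DeepQNetwork` code.

```python
import torch
import torch.nn as nn

class DeepQNetwork(nn.Module):
    """Minimal MLP Q-network: state in, one Q-value per discrete action out."""

    def __init__(self, state_dim=3, num_actions=10, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def select_action(q_net, state, epsilon, num_actions=10):
    """ε-greedy action selection over the quantized control inputs.

    state: 1-D float tensor of shape (state_dim,).
    """
    if torch.rand(1).item() < epsilon:
        return torch.randint(num_actions, (1,)).item()    # explore: random action index
    with torch.no_grad():
        return q_net(state).argmax().item()                # exploit: greedy action index
```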
The `TabularQAgent` class follows the same logic for implementing the Tabular Q-Learning agent, but instead of a target network, the Q-Table is simply defined as a NumPy array and updated using the Bellman equation. The state and action spaces are discretized uniformly, creating a finite table of state-action values.
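As an illustration of the discretization, the following sketch maps a continuous state to a Q-table row index using 10 bins per dimension; the bounds are taken from the nominal maxima and acceleration limits in the parameter table, while the flattening scheme and function name are assumptions.

```python
import numpy as np

def discretize_state(state, bins=10,
                     lows=(-15.0, -10.0, -2.6), highs=(15.0, 10.0, 2.6)):
    """Map a continuous state [e_p, e_v, acc] to a single Q-table row index."""
    idx = 0
    for value, lo, hi in zip(state, lows, highs):
        b = int(np.clip((value - lo) / (hi - lo) * bins, 0, bins - 1))  # bin index in [0, bins-1]
        idx = idx * bins + b                                            # row-major flattening
    return idx  # integer in [0, bins**3 - 1]
```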
Since performance metrics such as the cumulative reward might not provide the best insight into the efficiency and quality of an agent's training, I decided to implement a visualization system using `Panda3D`. The renderer allows episode visualization, monitoring the agent's position relative to where it should be in order to maintain an optimal distance from the leader, marked by a red line and the text "DESIRED". This visualization tool helped in calibrating the reward terms and assigning them weights appropriate to their magnitudes. Below are two example frames from the visualization.
Rendering of an episode using Panda3D.
It is important to note that the implementation of a visualization system required a more realistic approach than one that just considers vehicles as points. Therefore, the equations were modified to account for vehicle lengths, transforming the simple point-to-point distance into a more realistic bumper-to-bumper distance. This distance is measured from the front bumper of the following vehicle to the rear bumper of the leading vehicle, which is essential for accurate collision detection and more realistic platooning behavior. This modification not only enhanced the visualization but also made the simulation more realistic by ensuring that the desired distances maintained by the agent consider the physical dimensions of the vehicles.
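The corresponding distance computation is straightforward; a minimal sketch, assuming positions refer to each vehicle's front bumper:

```python
def bumper_to_bumper_distance(p_leader, p_follower, leader_length):
    """Distance from the follower's front bumper to the leader's rear bumper.

    Positions are assumed to be measured at each vehicle's front bumper; with a
    different reference point the vehicle-length correction changes accordingly.
    """
    return p_leader - p_follower - leader_length
```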
The training experiments were conducted in two main phases. In the first phase, I performed several training sessions with the leader movement pattern fixed at constant velocity (Uniform Motion). This allowed me to validate both the environment and the DQL learning method in a simplified setting, and to learn how modifications of each hyperparameter influence the training performance.
Validation average score (over 100 episodes) of the four best DQL agents in a system with single leader pattern.
After reaching a satisfactory training performance in this basic scenario, I proceeded to the second phase where I introduced the seven different leader patterns in order to create a more challenging and diverse platooning task. Throughout this phase, a hyperparameter analysis was conducted in order to identify the optimal configuration for each learning method, compare the performance of different parameter settings, and evaluate the robustness of both approaches under varying initial conditions.
Validation average score (over 100 episodes) of the three best DQL agents in a system with seven different leader patterns.
Validation average score (over 100 episodes) of the two best Tabular Q-Learning agents compared with the best DQL agent.
| Method | Mean Episode Reward |
|---|---|
| DDPG | -0.0680 |
| FH-DDPG | -0.0736 |
| HCFS | -0.0673 |
| FH-DDPG-SS | -0.0600 |
| Tabular QL | -0.1221 |
| Deep QL | -0.1998 |
State evolution of different training episodes (1000, 5000, 7000, 9000) of the three best DQL agents:
The state evolution across different DQL agent training episodes reveals interesting patterns in the learning process. At episode 1000, there is a clear tendency towards divergence, likely caused by the acceleration converging to zero, leading to no variations in velocity. This results in the agent stabilizing at a speed that is too low to keep pace with the leader, causing the distance to diverge.
At episode 5000 there is better convergence of all three state elements towards zero, although acceleration still shows a noticeable jerk. At episode 7000, while the green run shows oscillations converging to sub-optimal values, the other two runs demonstrate more convincing performance with reduced state oscillations and jerk.
State evolution of different training episodes (1000, 5000, 8000, 10400) of the two best Tabular Q-Learning agents compared with the best DQL agent:
These Tabular Q-Learning runs follow a similar training trend to the DQL ones. Particularly interesting is episode 10400, characterized by very dense oscillations that nonetheless remain contained within a remarkably small range of values. These runs generally performed better, likely due to the relatively small size of both the action and state spaces, each discretized into 10 bins. This suggests that the discrete approach can achieve more stable and consistent performance than the continuous state space approach, despite, or more probably because of, its simpler structure.
Hyperparameter analysis: Deep Q-Network hidden layer sizes.
This Project Work presented a simplified implementation of an Autonomous Platoon Control environment using two different Q-Learning approaches. A comparison between them and a hyperparameter analysis were conducted, leading to key findings about these Reinforcement Learning methods.
Future work could focus on extending the system to handle multiple following vehicles, implementing more sophisticated Reinforcement Learning algorithms such as PPO, and testing the system with real traffic data from the NGSIM dataset.