Multi-Scene Traffic Light Detection and Fault Identification via Dual-Attention Image Fusion

Yuxiao Shi; Jinglin Zhang; Yuxia Li

doi:10.32604/cmes.2026.078601

icon Open Access

ARTICLE

Multi-Scene Traffic Light Detection and Fault Identification via Dual-Attention Image Fusion

Yuxiao Shi¹, Jinglin Zhang², Yuxia Li^2,*

1 Department of Civil, Environmental and Geomatic Engineering, University College London, London, UK
2 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, China

* Corresponding Author: Yuxia Li. Email: email

(This article belongs to the Special Issue: Recent Advances in Geospatial Artificial Intelligence (GeoAI) Models, Approaches, and Applications)

Computer Modeling in Engineering & Sciences 2026, 147(1), 28 https://doi.org/10.32604/cmes.2026.078601

Received 04 January 2026; Accepted 28 February 2026; Issue published 27 April 2026

Abstract

Traffic light detection and fault identification using images from road traffic cameras are important for intelligent traffic management and urban safety monitoring. However, images collected in real traffic environments show clear differences in camera view, lighting conditions, weather, and background complexity. As a result, traffic lights vary greatly in scale, spatial location, and appearance, which reduces detection accuracy in complex scenes. To deal with this problem, this paper presents a multi-scene traffic light detection and fault identification framework based on dual-attention image fusion. Large-scale road camera data from the Chengdu Traffic Management Bureau are used, together with the Bosch Small Traffic Lights (BSTL) dataset. A traffic light and fault dataset is first built and expanded. Then, to improve detection under complex backgrounds and scale changes, a DCSPx module is designed to combine global features with local features in the backbone network. At the same time, a dual-attention mechanism is introduced. This mechanism includes position attention and channel attention, which helps improve effective feature learning. Based on these components, a Multi-Dual-Attention YOLOv5xp (DA-YOLOv5xp) algorithm is developed by combining detection results from images captured at different phases. In addition, a time-domain-based fault judgment method is proposed to detect abnormal traffic light states under real operating conditions. Experiments on real traffic camera data show that the method reaches an F1-score of 95.83% for traffic light detection and 96.97% for fault identification, and it performs better than several existing detection models.

Keywords

Traffic light detection; fault identification; few-shot learning; deep learning; object detection; attention mechanism

1 Introduction

Road traffic safety is an important part of intelligent transportation systems, and traffic signal lights are one of the key elements in road safety. Accurate detection of traffic signal lights directly affects daily travel safety for urban residents and also influences the operating efficiency of urban transportation systems. For this reason, timely detection and identification of traffic signal light faults are an important task in smart transportation and smart city development.

In recent years, deep learning methods have been widely applied in the field of computer vision. The detection and recognition of traffic signal light faults can be formulated as an object detection problem. At present, many studies have focused on classical detection frameworks, such as the region-based convolutional neural network (R-CNN) series and single-stage detection algorithms. However, traffic signal light detection still faces challenges related to complex scenes, small target sizes, occlusion, and illumination variations.

From the perspective of traffic authorities, timely awareness of abnormal traffic signal display conditions is essential for maintaining traffic order and intersection safety. In real-world operations, roadside surveillance cameras continuously capture images of traffic signal lights under diverse lighting, weather, and viewing conditions. As a result, traffic signal images may exhibit abnormal visual patterns, such as all signal lamps appearing dark or multiple lamps appearing illuminated simultaneously within a single image. These phenomena, referred to in this study as traffic signal “fault”, are defined at the image level and do not necessarily imply physical hardware failure. Currently, the identification of such abnormal visual states relies mainly on manual inspection of camera feeds or post-event review, which is inefficient and difficult to scale across large urban networks. Therefore, traffic management agencies have a practical need for vision-based methods that can automatically detect traffic signal lights and recognize abnormal display states directly from roadside camera images. In this study, traffic light detection refers specifically to fixed roadside traffic signal lamps (including vehicular and pedestrian signals at intersections) captured by stationary surveillance cameras. The objective is to identify image-level abnormal display patterns rather than to diagnose underlying hardware malfunctions. Most data used in this work are obtained from the official roadside camera database of the Chengdu Traffic Management Bureau.

Girshick et al. [1] proposed the R-CNN framework, which performs object detection through a two-stage process: first generating region proposals in the input image, followed by feature extraction using CNNs and classification using classifiers such as SVM and Softmax, with bounding box regression applied to localize target objects. On the Pascal VOC 2012 dataset, this method reached a mean average precision (mAP) of 53.30%. Later, Redmon et al. [2] proposed the single-stage regression-based YOLO algorithm. This algorithm runs fast, but its detection accuracy is relatively low. It also has weaker localization accuracy and limited ability to detect many small objects in dense scenes.

Recent studies have also emphasized deployment-oriented traffic monitoring architectures that integrate perception with communication infrastructure. For example, Bakirci [3] proposes an Internet-of-Things (IoT)-enabled UAV-based real-time traffic mobility analysis framework, combining embedded deep-learning inference with IoT data transmission to support smart-city traffic monitoring. This line of work highlights the importance of coupling vision algorithms with practical sensing-and-networking pipelines for real-world operation. In comparison, our work targets continuous, intersection-level monitoring using fixed roadside surveillance cameras, which are widely deployed and naturally suited for large-scale, long-term supervision of traffic signal display states. From a system perspective, UAV-based monitoring can provide flexible, on-demand coverage, while roadside cameras offer persistent coverage at regulated intersections; therefore, the proposed traffic-signal detection and image-level abnormal-state identification module can be viewed as a complementary component that can be integrated into broader ITS/IoT infrastructures for city-scale traffic management.

Wang et al. [4] studied multi-target detection and proposed a multi-task generative adversarial network (CMTGAN). This method increases image resolution beyond the original scale. As a result, the images become clearer and contain more detail, so detection performance improves compared with common methods. Park et al. [5] used a P-Outlier Exposure method to reduce false detections in deep learning networks. Yoneda et al. [6] proposed an image processing approach based on spatial distribution characteristics of vehicle signals, reducing false detections and misidentification of traffic signal lights in autonomous driving scenarios. Masaki et al. [7] noted that traditional computer vision and machine learning methods are often sensitive to illumination changes or fail to effectively extract key features of traffic signal lights, and therefore proposed a method based on hue information segmentation to identify traffic signal light regions. Wang et al. [8] introduced an improved YOLOv3 model, incorporating an attention mechanism in the spatial dimension of feature maps to enhance small-object detection performance.

To further improve detection accuracy under complex conditions, some studies have integrated feature fusion networks and attention mechanisms. By replacing the backbone network and introducing a multi-scale detection strategy, Li et al. [9] improves detection speed while reducing the number of model parameters of YOLOv4-L. By optimizing feature extraction for small aerial targets, the method improves detection performance on UAV-captured datasets, achieving noticeable gains in accuracy compared with the baseline model. Compared with the original YOLOv5 model, the improved YOLO-G model by Qiu et al. [10] enhances feature pyramid design to prevent the loss of shallow semantic information and introduces attention mechanisms to improve target feature representation. By introducing an attention mechanism, mixed-domain attention over feature maps is achieved. Yue and Quan [11], based on SSD, replaced the backbone network and proposed a feature fusion approach. This method integrates features from different scales to obtain more comprehensive information, thereby enhancing the detection capability for small targets. Liu et al. [12] proposed a self-learning ensemble-based traffic signal fault detection system that enables rapid and accurate identification of signal operating states through deep learning and dynamic threshold adjustment. Zhang [13] proposed a fault-diagnosis and warning maintenance management system for traffic signals, which relies on image data collected by surveillance cameras. This system can determine whether traffic signals are functioning normally, but it cannot locate the exact position of faulty signal lights.

Over the past two years, traffic light perception technology has been continuously evolving towards powerful small object detection capabilities and deployment-oriented reliability. Recent research has explored more data-driven strategies, such as methods combining color enhancement with reinforcement learning/integrated learning, to improve recognition accuracy in the presence of varying lighting conditions and heterogeneous signal layouts [14]. Meanwhile, the YOLO series detectors are still widely used, but their functions are constantly being enhanced, with the addition of attention mechanisms and improvements in localization loss, to reduce false negatives and false positives in complex urban backgrounds, where traffic lights occupy only a small number of pixels [15]. In addition to architectural improvements, the research also emphasizes the coverage of the dataset, the consistency of evaluation, and the generalization to real roads; for example, recent benchmark analyses have highlighted the inconsistencies between different datasets/indicators and proposed correlation evaluation methods to distinguish lane signals in the presence of multiple traffic lights simultaneously [16]. Finally, video-based frameworks are increasingly incorporating temporal models. For instance, recent methods combine a detector, a feature extractor, and a recurrent module to improve the stability of state classification in the presence of occlusion and noise. This indicates that combining spatial and temporal cues is of great significance for practical applications [17].

In this study, we focus on traffic signal light fault detection in road images and construct a deep learning model to identify traffic signal lights and classify their fault types. The main contributions of this work are as follows:

1. Improved traffic signal light detection under complex scenarios.

A multi-scene traffic light detection framework is proposed based on the DA-YOLOv5xp net-work and the MDYxp fusion strategy. First, the DA-YOLOv5xp model incorporates a DCSPx module constructed with dense connections and dilated convolutions, to efficiently reuse multi-scale features and expand the receptive field. Simultaneously, a dual-attention mechanism (combining position and channel attention) is integrated to suppress background interference and enhance sensitivity to small targets, improving the baseline F1-score by 4.74%. Second, to address complex environmental interference, a Multi-Dual-Attention YOLOv5xp (MDYxp) al-gorithm is developed. This algorithm employs a frequency-based spatio-temporal alignment strategy to fuse detection results from multi-phase images, significantly reducing false positives caused by transient noise and raising the detection F1-score to 95.83%.

2. Spatio-temporal fault identification method.

A spatio-temporal domain-based fault detection and recognition algorithm is developed to identify and classify traffic signal light faults. Experimental results show that the method achieves an F1-score of 96.97%, with accuracy and recall of 97.96% and 96.00%, respectively.

2 YOLOv5-Based Object Detection Model

The YOLOv5 model uses a backbone network based on the CSPNet [18] architecture. With the CSPNet design, YOLOv5 reduces the computation caused by ResUnit modules while keeping good efficiency and detection accuracy. The backbone network includes a main trunk and three CSP modules. After these CSP modules, an extra decoding part is added to provide shallow feature information for the decoding stage.

In YOLOv5, the decoding stage restores feature maps by upsampling. To reduce network complexity and improve runtime efficiency, the CSP structure in the decoding stage is simplified by replacing ResUnit modules with convolutional layers. During feature fusion, shallow features from the backbone network are combined with deeper semantic features. Subsequently, the detection head processes three feature maps of different scales to perform object detection.

The object detection module consists of three components: a bounding box regression module, a non-maximum suppression module, and three-scale detection heads. Feature maps of three different resolutions are first fed into the detection module. Each feature map is divided into grids of size S×S, and each grid cell predicts multiple candidates bounding boxes and the corresponding probabilities of containing traffic signal lights. Bounding boxes with probabilities below a predefined threshold are discarded. The remaining candidate boxes are further filtered using the non-maximum suppression module to obtain the final predicted bounding boxes of traffic signal lights.

Finally, the predicted bounding boxes of traffic signal lights and their corresponding ground-truth labels are incorporated into the loss function as follows:

Loss=λcoord∑i=0S2∑j=0Bρi,jobjli,jbox+λnoobj∑i=0S2∑j=0Bρi,jnoobjlli,jobj+λobj∑i=0s2∑j=0Bρi,jobjli,j°bj(1)

where: B denotes the number of candidate bounding boxes contained in each image grid cell. λcoord, λnoobj, and λobj are weighting coefficients. The indicator variable ρi,jobj takes a value of 1 or 0. When ρi,jobj=1, it indicates that the candidate bounding box at pixel location (i,j) contains a target. When ρi,jobj=0, it indicates that no target is present in the candidate bounding box at location (i,j). The indicator variable ρi,jnoobj also takes a value of 1 or 0. When ρi,jnoobj=1, it indicates that no target is present in the candidate bounding box (i,j). When ρi,jnoobj=0, it indicates that the candidate bounding box at location (i,j) contains a target. li,jbox and li,jobjare intermediate variables, defined as follows:

li,jbox=1−|b⋂bgt||b⋃bgt|+p2(b,bgt)c2+α[4π2(arctan⁡wgthgt−arctan⁡wh)2](2)

li,jobj=|b⋂bgt||b⋃bgt|⋅log⁡[sigmoid(Pc)]+(1−|b⋂bgt||b⋃bgt|)⋅log⁡[1−sigmoid(Pc)](3)

When ρi,jnoobj=0, it indicates that the candidate bounding box at location (i,j) contains a target.

Where: b and bgt denote the predicted bounding box of the traffic signal light and its corresponding ground-truth bounding box. wgt and hgt represent the absolute differences in X-axis and Y-axis between the ground-truth bounding box and the predicted bounding box. w and h denote the absolute differences in width and height of the predicted bounding box. p represents the Euclidean distance between the center coordinates of the predicted bounding box and the ground-truth bounding box. α is a weighting coefficient, and c is a normalization constant. Pc denotes the predicted probability that a candidate bounding box contains a traffic signal light.

The sigmoid function is defined as:

sigmoid(x)=11+e−x(4)

3 Traffic Signal Light Detection and Fault Identification Algorithms

3.1 Feature Analysis of Traffic Signal Light Images Captured by Roadside Cameras

Traffic light images captured by roadside cameras show large changes over time, varied backgrounds, and complex scenes. Because of this, their spatial and geometric characteristics are also complex. Spatial features mainly describe how traffic signal light targets relate to other objects in the scene, such as partial occlusion or disturbance from objects with similar appearances. Geometric features mainly describe the properties of the traffic signal lights themselves, including their proportion in the full image and their color patterns. Camera-level data split to prevent leakage. To avoid data leakage caused by sharing the same viewpoint across training and testing, we perform the data partition at the camera level. In the Chengdu roadside surveillance dataset, each data “set” corresponds to a unique fixed camera ID deployed at a specific intersection location, which implies a fixed viewpoint and scene layout. This research reserves 23 camera sets as the test set. All remaining camera sets used for training are disjointed from the test set in terms of camera ID and viewpoint; therefore, no images captured by any test camera are included in training. After camera-level splitting, we sample frames for detector training from the training cameras (and additionally include a selected subset of BSTL [19] to increase viewpoint diversity). This camera-disjoint protocol ensures that performance metrics reflect generalization to unseen camera viewpoints rather than memorization of the same scene. To increase viewpoint and appearance diversity beyond fixed roadside cameras, we additionally incorporated a subset of the public Bosch Small Traffic Lights (BSTL) dataset into the training data. Specifically, 49 images were selected from BSTL and used only for training the traffic-light detector, resulting in a total of 215 training images (166 images from the Chengdu dataset + 49 BSTL images). The BSTL images were not used in the Chengdu test set and were included solely to improve generalization across different traffic-light appearances and backgrounds. Since the detection task in this work is formulated as single-class traffic-light object detection, we unify annotations from different sources using a consistent label schema. All traffic-light instances in BSTL (regardless of their signal color/state attributes) are mapped to a single class, “traffic light”, consistent with the Chengdu annotations. Bounding-box annotations from BSTL are converted to the same format used for YOLO training (normalized [xcenter, ycenter, width, height]).

Specifically, the characteristics of traffic signal light images can be summarized as follows:

1. Temporal variation:

From a time perspective, roadside camera datasets include images captured at different times from daytime to nighttime, as shown in Fig. 1a,b. The biggest effect comes from lighting changes, which cause large differences in image brightness.

2. Environmental complexity:

From an environmental perspective, urban scenes captured by roadside cameras are often complex. Overexposure can occur because of reflections from nearby glass building facades (Fig. 1c). In contrast, bad weather such as rainy or overcast days can cause underexposure and add more noise.

3. Scale characteristics of traffic signal lights:

From a scale perspective, traffic signal lights captured by roadside cameras usually take up only a small part of the whole image. As a result, there are only a few pixels that contain geometric feature information, so feature extraction and detection become more difficult.

4. Scene diversity:

From the perspective of scene diversity, roadside cameras are installed in many different locations, so the captured scenes vary a lot. As a result, the spatial relationships between traffic signal lights and other objects in the images are often complex.

images

Figure 1: Different images captured by the same camera (a) Daytime; (b) Nighttime; (c) Overexposure; (d) Rainy.

Roadside cameras generate large volumes of image data. However, interference from time changes, spatial complexity, and lighting changes often reduces image quality significantly. Consequently, conventional deep learning–based detection networks frequently struggle to accurately detect traffic signal lights, which in turn adversely affects fault identification performance. For this reason, effectively using roadside camera images of different quality while also improving traffic signal light detection accuracy remains a key challenge that needs to be addressed.

3.2 Design of the Multi-DA-YOLOv5xp (MDYxp) Algorithm

Traffic signal lights captured by roadside cameras show complex spatial and geometric characteristics. For this reason, the network model needs a stronger ability to use multi-scale features.

Fig. 2 illustrates the overall architecture of the MDYxp framework. Inspired by the capability of the SSD [20] network to utilize multi-scale features, and drawing on the DenseNet [21] architecture, this study first designs a DA-YOLOv5xp model based on a dense connectivity network (a DenseNet variant).

images

Figure 2: The main structure of Multi-DA-Yolov5xp (MDYxp).

In the proposed network, the original CSP blocks in YOLOv5 are replaced with DCSP blocks. In these blocks, the ResUnit modules are replaced by a DenseNet-based structure called D-DenseUnit. This structure stacks multi-scale feature maps more directly. With this design, the use of multi-scale features becomes more efficient, while the number of parameters is reduced. At the same time, the receptive field of the network is increased, so important information is kept during decision making. The structure of the DCSP block is shown in Fig. 3, where x indicates the number of D-DenseUnits in one DCSP module.

images

Figure 3: DCSPx module.

At the same time, compared with YOLOv5 [22], which simplifies the CSP block in the head stage, the proposed DA-YOLOv5xp keeps the full DCSP block while using D-DenseUnit to reduce the number of parameters. With this design, the network can better recover feature information after feature extraction. In the network, the CBS module uses a BN–ReLU–Conv (3 × 3) structure, while the SPP module applies spatial pyramid pooling.

The attention mechanism is inspired by human visual perception. When people look at an image, they usually focus on specific regions instead of treating all visual information in the same way. For this reason, this study introduces a dual-attention mechanism [23]. It includes position attention and channel attention. This design helps the network focus more on traffic signal light targets and improves detection accuracy.

DA-YOLOv5xp adds channel attention and position attention in the middle layers of the network. This improves global information extraction and makes the network more sensitive to traffic signal light targets. After the backbone network, the model splits into two parallel paths. One path applies channel attention to assign weights to channel information. The other path applies position attention to assign weights to position information. The feature maps from the two paths are then added element by element, and a convolutional layer is used to fuse them so the information from both attention mechanisms is combined.

Roadside cameras capture many traffic signal light images at different times. These images are similar in scene content, but they are taken at different times, such as early morning, noon, and nighttime. They can also be affected by vehicle headlights, reflections from building surfaces, overexposure, underexposure, and other noise, so image quality can vary a lot. To make better use of these highly similar images captured by the same camera, and to avoid overfitting and the influence of low-quality images, this study proposes a fusion algorithm called MDYxp. When only a single frame is available, the system bypasses the fusion stage and outputs the DA-YOLOv5xp detection result directly; MDYxp therefore serves as an optional refinement when multi-frame data exists.

The MDYxp algorithm constitutes an integrated detection refinement process built upon the DA-YOLOv5xp detection network. Taking traffic signal light images captured by a single camera as an example, the main steps of the algorithm are as follows:

Step 1:

Traffic signal light images captured by the same roadside camera at different time periods are input into the detection network. The network outputs label sets predicted for different time periods within the same scene, forming a predicted traffic signal light coordinate set L.

Step 2:

For a predicted result set Lj within L, each traffic signal light coordinate in Liis examined to determine whether it spatially intersects with any traffic signal light coordinate in Lj.

If an intersection exists, the two labels are merged (i.e., the two labels are combined and replaced by the merged label).

If a traffic signal light coordinate in Li does not intersect with any coordinate in Lj, it is added to Lj.

Step 3:

Step 2 is repeated until all traffic signal light coordinate sets have been fully processed.

Step 4:

The number of times each merged label appears is counted. If this count exceeds a predefined threshold r, the label is considered to correspond to a traffic signal light and is retained; otherwise, it is regarded as a non–traffic signal light and removed.

The fusion frequency threshold r controls how many times a consistent detection must appear across temporally separated frames before being retained. We select r empirically on the Chengdu roadside-camera data via repeated comparison experiments. Setting r too low retains sporadic false positives caused by snapshot timing and transient noise, whereas setting r too high may remove true traffic lights that are intermittently missed under occlusion or glare. Based on repeated experiments on the 23-camera evaluation setting, r = 11 provides the best trade-off by exploiting the spatio-temporal stability of fixed traffic lights while suppressing non-persistent detections.

3.3 Structure of the DCSPx Module

Due to the relatively low quality of images captured by roadside cameras, the small size of traffic signal light targets, and frequent occlusion, the local features of traffic signal lights are often disturbed. Therefore, this study enhances detection accuracy by improving the network’s capability to capture global contextual information, thereby strengthening its ability to extract spatial information related to traffic signal lights.

To improve the acquisition of global information, a DCSPx module is designed, as illustrated in Fig. 3, where x denotes the number of D-DenseUnit structures contained in the module.

The DCSPx module has two parallel branches. One branch increases the receptive field and improves the extraction of global information. The other branch keeps local information and maintains the network’s ability to capture the geometric features of traffic signal lights. This structure allows local information to be effectively combined with global information.

The first branch mainly uses the D-DenseUnit. In the D-DenseUnit, the early layers use a standard dense connection design. Each layer connects to later layers through skip connections. Each layer follows the structure BN–ReLU–Conv (1 × 1)–BN–ReLU–Conv (3 × 3). After that, three dilated convolution layers are added. Each one follows BN–ReLU–Conv (1 × 1)–BN–ReLU–DConv (3 × 3, rate = 2, 4, or 8). These dilated layers expand the receptive field. They also reuse features by processing feature maps from earlier layers, which supports multi-scale feature fusion. Compared with the common ResUnit structure, the D-DenseUnit uses skip connections more often, so the number of parameters is reduced.

To keep the network’s ability to capture local information and the geometric features of traffic signal lights, the DCSP module adds a second branch. This branch keeps shallow-layer features, which helps improve multi-scale feature extraction. In this branch, the original input features are passed forward directly. Then, after convolution, these features are fused with the global features extracted by the first branch.

The DCSPx module is inspired by DenseNet. It is built with D-DenseUnits and skip connections. In the first branch, D-DenseUnits use dense connections and also add three dilated convolution layers. This design reuses features, combines multi-scale information, and improves global context extraction. At the same time, skip connections help reduce the number of parameters. The second branch preserves shallow local features and fuses them with the global features extracted by the first branch. By strengthening the network’s ability to capture global contextual information and integrating multi-scale features, the DCSPx module improves the detection of spatial features of traffic signal light targets while preserving the network’s capacity to capture their geometric characteristics.

3.4 Dual-Attention Mechanism

The dual-attention mechanism (Dual Attention) mainly consists of channel attention and position attention. The essence of the dual-attention mechanism is to generate two weight matrices, which assign weights to the channel-wise and spatial information of feature maps, respectively, thereby enhancing useful information while suppressing irrelevant or interfering information. This subsection details the structures and characteristics of the two attention mechanisms.

Fig. 4 illustrates the structure of the position attention module (PAM). In PAM, the input feature map is processed through two parallel branches. One branch generates the attention probability map of size (H×W)×(H×W) by constructing the query (Q) and key (K) representations. In the other branch, the feature map is used as the value (V) representation. Here, V, Q, and K denote the value features, query features, and key features, respectively, while C, H, and W represent the channel number, height, and width of the feature map. The overall structure of PAM can be expressed by Eqs. (5) and (6):

Att=softmax(Q(C×HW)⋅K(HW×C))(5)

Fout=(V(C×HW)⋅Att)⋅reshape(C×H×W)+Input(C×H×W)(6)

In Eqs. (5) and (6), the subscripts indicate the corresponding tensor shapes.

images

Figure 4: The main structure of PAM.

The subscripts indicate the corresponding tensor shapes.

Fout denotes the final output feature map.

InputC×H×W represents the original feature map input to the position attention module.

reshape(C×H×W) denotes the operation that transforms the feature map into the shape.

Fig. 5 illustrates the structure of the channel attention mechanism (CAM). CAM focuses on information across the channel dimension of feature maps. In this network structure, CAM generates a probability map of size C×C, which helps enhance feature discriminability. The overall structure of CAM can be expressed by Eqs. (7) and (8):

Att=softmax(Q(C×HW)⋅K(HW×C))(7)

Fout=(Att⋅VC×HW)⋅reshape(C×H×W)+InputC×H×W(8)

images

Figure 5: The main structure of CAM.

3.5 Analysis of Traffic Signal Light Fault States

According to the Chinese traffic industry standard Technical Requirements and Testing Methods for Road Traffic Signal Lamps (GB 14887–2011) [24], traffic signal light fault includes conditions such as complete lamp blackout and incorrect indication. In this study, two major fault scenarios are addressed: (i) complete extinction of all signal lamps, and (ii) incorrect indications caused by multiple lamps being illuminated simultaneously. The characteristics of these fault states are mainly manifested as follows:

1. None of the red, yellow, and green lamps is illuminated.

2. Two or more lamps among red, yellow, and green are illuminated simultaneously.

In practical applications as well as in the experiments and evaluations presented later in this paper, traffic signal light states are classified into two categories: normal and faulty. Since the data used in this study are vehicle-capture images acquired by roadside cameras, for the two fault scenarios described above, some “false fault” cases may occur in single images.

Fig. 6 shows an example that is easily misclassified as a case where no signal lamp is illuminated. The traffic signal light highlighted by the blue bounding box is shown. In the first image, none of the lamps are illuminated. If a judgment is made based on this single image, the signal light would be incorrectly classified as faulty. However, by observing the same traffic signal light at different time instants, the signal light is not actually faulty and only exhibits the situation shown in the first image for a short duration. This phenomenon occurs because the data used in this study consist of vehicle-capture images; during image acquisition, the camera may capture the signal light at the moment when it is switching states, resulting in a “fake all-black” appearance.

images

Figure 6: An example of “Fake all-black”.

Fig. 7 shows an example that is easily misclassified as a case where multiple lamps are illuminated simultaneously. Focusing on the traffic signal light in the image, the first image shows the yellow and red lights illuminated at the same time. In reality, such images occur very rarely. This phenomenon arises because, during vehicle capture, the camera happens to record the instant when the signal transitions between red and yellow. Such transition moments are extremely brief, and even drivers and pedestrians cannot easily perceive them with the human eye. Therefore, these situations do not cause confusion in traffic operation and should not be regarded as faulty.

images

Figure 7: An example of Fake multiple-lights-on.

3.6 Traffic Light Fault Detection Method

To address the problem of distinguishing true fault from “false fault,” this study proposes a time-domain–based traffic light fault detection method, which ensures a high level of detection accuracy.

Based on the issues discussed above, in order to minimize time consumption and computational cost while maintaining high detection accuracy, the proposed fault identification algorithm avoids the use of deep learning networks and instead adopts traditional OpenCV-based methods. To further reduce missed detections and false alarms, and considering that a large number of images are captured by the same roadside camera in real application scenarios, a time-domain multi-image alignment mechanism is designed. In addition, to improve detection accuracy while reducing computational overhead, a two-stage fault detection strategy, consisting of brightness screening followed by color-based judgment is proposed.

The fault identification stage uses hierarchical brightness screening (θv1 = 180, θv2 = 150) followed by a pixel-count threshold (T = 80) to determine whether red/yellow/green regions are activated. The brightness thresholds are determined from brightness-distribution statistics of cropped traffic-light regions and stepwise screening experiments, aiming to mitigate illumination-induced artifacts such as “fake all-black” frames captured during signal switching, glare, or underexposure. The pixel-count threshold T is chosen by sampling the effective activated pixel area of traffic lights across different viewing distances and resolutions; empirically, T = 80 suppresses colorful background noise while preserving sensitivity to small distant traffic lights with limited illuminated pixels.

The proposed traffic light fault detection procedure consists of the following steps:

Step 1: Traffic light segmentation.

Using the post-processing method proposed in Section 3.2, detection bounding boxes are obtained.

From the image dataset captured by the same roadside camera, 500 images are randomly selected, and traffic light regions are cropped according to the detected bounding boxes.

Step 2: Color space conversion.

One cropped traffic light image is selected, and the image is converted from the RGB color space to the HSV color space.

Step 3: Hierarchical brightness screening.

First, the maximum brightness value of the image is obtained. If the brightness of all pixels is below the predefined threshold, the HSV values of all pixels are set to (0, 0, 0).

Step 4: Pixel-level color recognition.

For each pixel, the HSV values are denoted as (h,s,v).

If h<30 or 156<h<180, the pixel is classified as red;

If 78<h<95, the pixel is classified as green;

If 26<h<34, the pixel is classified as yellow.

Step 5: Traffic light color identification.

Three indicators: Rsymbol, Gsymbol, and Ysymbol are assigned to each traffic light, corresponding to red, green, and yellow, respectively. All indicators are initially set to false, and a threshold T is defined. The number of pixels belonging to each color is counted.

If the number of red pixels exceeds T, Rsymbol is set to true;

If the number of green pixels exceeds T, Gsymbol is set to true;

If the number of yellow pixels exceeds T, Ysymbol is set to true.

Step 6: Traffic light fault identification.

Based on the values of Rsymbol, Gsymbol, and Ysymbol obtained in Step 5:

If only one symbol is true, the traffic light image is classified as normal; if two or more symbols are true, the traffic light image is classified as multiple lights illuminated; if all symbols are false, the traffic light image is classified as all lights off.

Step 7: Iterative processing.

Steps 2 through 6 are repeated until all 500 selected images have been processed.

Step 8: Integrated decision.

For all traffic light images processed above, the proportions of each classification type are calculated.

If the proportion of images classified as all lights off exceeds a predefined threshold, the traffic light is considered damaged, and the damage type is identified as “indicator all black.”

If the proportion of images classified as multiple lights illuminated exceeds the threshold, the traffic light is considered damaged, and the damage type is identified as “multiple indicators illuminated.”

In all other cases, the traffic lights are considered not damaged.

The main steps of the time-domain–based traffic light fault detection method are illustrated in Fig. 8.

images

Figure 8: The flowchart for fault detection.

4 Experiments and Results Analysis

4.1 Experimental Dataset

The experimental data were collected from 131 sets of roadside camera recordings captured in the main urban area of Chengdu, covering various traffic scenarios such as elevated bridges, intersections, arterial roads, and auxiliary roads. The total number of images is 10,000.

To prevent adverse effects on network performance caused by single-scene repetition, 23 sets comprising 5728 images were selected from the dataset as the test set for the traffic light fault detection pipeline. From the remaining 108 sets, 35 sets with no traffic lights were removed because they were not valid for this task. From the remaining 83 sets, data cleaning and screening were carried out. Then, two images were selected from each set, which produced 166 images in total. Through this process, the test set and part of the training set were selected from the original large-scale dataset, which contained redundant and invalid data.

To further improve detection accuracy, the training set also includes data from the public Bosch Small Traffic Lights Dataset (BSTL). The BSTL dataset was mainly built for autonomous driving and contains traffic light images captured on urban roads and intersections in Boston. A selected subset of BSTL is added to the 166-image training set. This addition increases the diversity of traffic light target characteristics.

The training set for deep neural network training includes 215 images. The test set for traffic light detection and fault detection includes 23 data groups, with 5728 images in total.

The traffic-signal datasets collected from a limited number of locations may exhibit strong scene bias, we employ a camera-disjoint split to avoid data leakage: no camera viewpoint appears in both training and testing. During training, we use early stopping (patience = 100) and track training losses to reduce the risk of overfitting. In addition, we report ablation results and evaluate performance across multiple scene categories (general, occlusion, and interference) to demonstrate that improvements persist under different background complexity and visibility conditions. These settings are intended to ensure that the reported metrics reflect generalization to unseen camera viewpoints rather than memorization of specific scenes.

4.2 Ablation Experiments for Traffic Light Object Detection Algorithms

To verify the effectiveness of the DCSPx module, the dual-attention mechanism, and the fusion algorithm, ablation experiments were carried out. These experiments were performed step by step to evaluate the contributions of the network structure and each module.

In this study, YOLOv5x is used as the baseline model. Table 1 shows the ablation results. YOLOv5x is the baseline model. YOLOv5xp is the model that adds the D-DenseUnit into the YOLOv5x backbone. DA-YOLOv5xp is the model that further adds the dual-attention mechanism. MDYxp is the full framework that also includes the multi-image fusion strategy. As shown in Table 1, when compared with the baseline YOLOv5x, the DCSPx module, the dual-attention mechanism, and the fusion algorithm each improve performance. DP-YOLOv5xp denotes the model position attention module (PAM) only; DC-YOLOv5xp denotes the model with channel attention (CAM) only; DA-YOLOv5xp combines both PAM and CAM.

images

The results for YOLOv5xp show that adding the DCSPx module improves the network’s ability to use multi-scale information. As a result, precision, recall, and F1-score increase. In terms of model size, replacing the residual network with a densely connected network reduces the number of parameters, so the model becomes smaller.

For fairness, all variants are saved and measured using the same protocol (inference-only parameter weights). The reported model size excludes checkpoint states and any training-time buffers, and no post-training compression (e.g., quantization or pruning) is used.

Among the tested network variants, DA-YOLOv5xp shows the best performance. Its F1-score increases by 4.74% compared with YOLOv5x. The results suggest that the attention mechanism improves precision and recall in a more balanced way. When compared with earlier variants, DA-YOLOv5xp also shows a clear increase in recall, rising by 8.22% over the baseline YOLOv5x. At the same time, the model size is reduced by about 50% compared with the baseline model.

After adding the MDYxp fusion algorithm, detection performance improves further. Precision and recall increase by 8.54% and 9.52%, and the F1-score increases by 9.02%. The largest gain is in recall. This is mainly because, under different scenarios, the network pays attention to different traffic light targets. With multi-scene fusion, traffic lights that are hard to detect in one scene can be detected in another scene and then confirmed. Also, the frequency-based alignment method removes most false detections, so detection accuracy improves further. Since the fusion algorithm does not change the network structure, the number of parameters stays the same.

To provide a more intuitive presentation of the results, the test images used in the ablation experiments are further categorized into three types according to scene characteristics: general scenes, referring to images in which traffic lights are clearly visible with limited background interference; occlusion scenes, referring to images in which traffic lights are partially occluded by background objects; and interference scenes, referring to images containing background objects with visual features similar to traffic lights. In the following sections, qualitative results for these three scene categories are presented separately.

(1) General Scenes

Under general scene conditions, the main factor that affects traffic light detection is the small size of many traffic light targets, so their geometric features are often not clear. Fig. 9 shows detection results for typical general scenes. As shown in the figure, most traffic light targets are detected by all network variants. However, some very small targets contain only a few pixels in the traffic light region. Because of this, the geometric feature information is limited. With the MDYxp fusion strategy, recall is higher, so detection of these small targets is improved. One example is the small traffic light at the far right of the first row in the figure.

images

Figure 9: Example of general scene detection: (a) Ground Truth; (b) YOLOv5x detection result; (c) YOLOv5xp detection result; (d) DA-YOLOv5xp detection result; (e) MDYxp detection result.

(2) Occlusion Scenes

In occlusion scenes, traffic light detection is mainly affected by objects that block the traffic light. This blocking interferes with the geometric features of the traffic light targets. Fig. 10 shows two cases where traffic lights are occluded. In the first row, the traffic light in the upper right is blocked by a horizontal structure. In the second row, three traffic lights on the right are partly blocked by road signs, while the traffic light on the left is blocked by a horizontal structure.

images

Figure 10: Examples of object detection in occluded scenes: (a) Ground Truth; (b) YOLOv5x detection results; (c) YOLOv5xp detection results; (d) DA-YOLOv5xp detection results; (e) MDYxp detection results.

In the first row, YOLOv5x fails to detect the occluded traffic light and also produces false detections in nearby areas. YOLOv5xp detects the occluded traffic light, but false detections still appear. In contrast, DA-YOLOv5xp and MDYxp detect all traffic lights correctly. In the second row, YOLOv5x misses two occluded traffic lights, while the other network variants detect all targets correctly.

These results show that DA-YOLOv5xp focuses more on global features and uses spatial information more effectively. Because of this, it can still extract traffic light features under occlusion, which improves detection accuracy. Based on DA-YOLOv5xp, the MDYxp algorithm also combines detection results from multiple scenes, so detection performance improves further.

(3) Interference Scenes

Interference scenes often contain background objects that look like traffic lights. In these scenes, accurate detection depends on two points. The model needs to extract geometric features correctly. It also needs to judge whether a target is truly a traffic light using spatial information. Fig. 11 shows typical interference scenes. When compared with the ground truth, both YOLOv5x and YOLOv5xp produce false detections in these cases. With the dual-attention mechanism, DA-YOLOv5xp uses global context information more effectively, so it produces fewer false detections. In addition, MDYxp uses a fusion strategy based on how often a detection appears across images. For this reason, wrong detections are removed, which improves robustness and further reduces false positives.

images

Figure 11: Example of detection in a disturbed scene: (a) Ground Truth; (b) Detection result of YOLOv5x; (c) Detection result of YOLOv5xp; (d) Detection result of DA-YOLOv5xp; (e) Detection result of MDYxp.

Across the three scene categories of general scenes, occlusion scenes, and interference scenes, the results show that networks using D-DenseUnit perform better than those using ResUnit. They also show that the dual-attention mechanism improves global information extraction, so it suppresses interference more effectively. In addition, the multi-image fusion strategy increases recall, so the network can detect more traffic lights across multiple scenes.

In addition, to examine the effect of the fusion strategy more clearly, Fig. 12 shows the fusion results and the fusion process for the DA-YOLOv5xp model and the MDYxp method. The figure includes partial scenes from three cameras during the fusion process, which is the deep neural network detection stage. During fusion, some scenes exhibit missed detections, such as those shown in the first-row image (d) and second-row image (e). However, in the post-processing results, these missed detections are recovered by leveraging information from other scenes. Conversely, some scenes also contain false detections, such as those shown in the first-row image (c) and second-row image (c). Benefiting from the frequency-based alignment mechanism, these false detection bounding boxes are eliminated from the final results. Finally, some failure cases remain in the post-processing stage, as illustrated in the third-row images, where a false detection persists in the final output. The primary reason for such failures is that, during detection, the network consistently misclassifies the same background object across most scenes. As a result, this type of false detection cannot be removed by the frequency-based alignment mechanism.

images

Figure 12: Multi-scene fusion results display. (a) Ground Truth; (b) MDYxp; (c) Multi-scene detection process scene 1; (d) Multi-scene detection process scene 2; (e) Multi-scene detection process scene 3.

4.3 Performance Comparison of Different Detection Networks

To further demonstrate the capability of the proposed detection method (MDYxp) for traffic light detection, this study conducts a comparative analysis with several existing detection networks. Specifically, YOLOv5x, SSD, Fast R-CNN [25], TPH-YOLOv5x [26], DA-YOLOv5xp, and MDYxp are compared.

Table 2 presents the horizontal comparison results on the dataset used in this study. It can be observed that, although the precision of the proposed deep learning-based traffic light detection model DA-YOLOv5xp is slightly lower than that of some individual methods, it achieves relatively high F1-score and recall. The MDYxp fusion method attains superior performance across all evaluation metrics.

images

4.4 Experimental Results and Analysis of Fault Detection

In practical applications, it is only necessary to report whether a traffic light is faulty. Therefore, in this section, the two fault types are merged and evaluated together.

Table 3 presents the results of the fault detection experiments. As shown in the table, the fault detection method achieves satisfactory performance, with high values of precision and recall. Among the final 50 faulty traffic lights, 48 faulty traffic lights are correctly identified (TP = 48), 2 faulty traffic lights are missing (FN = 2), and 1 faulty traffic light is falsely identified (FP = 1).

images

Fig. 13 illustrates examples of fault detection results. In the first row, the image on the left contains two fault types, both of which are correctly identified (highlighted by green bounding boxes). The image on the right corresponds to the “all lights off” fault type and is also correctly identified. The second row shows images without fault, all of which are correctly classified. To further analyze the fault detection performance, cases of missed detection and false detection are examined.

images

Figure 13: Effectiveness of fault detection.

Fig. 14 shows a failure case of fault detection. In the left image, the green bounding box indicates a faulty traffic light that is successfully detected, while the red bounding box marks a faulty traffic light that is missing. As shown in the detection results in Fig. 14, the primary reason for this missed detection is that, during the traffic light detection stage, the two traffic light targets were not correctly detected. As a result, these traffic lights were not sent to the fault detection pipeline, which caused fault identification to fail.

images

Figure 14: Fault detection failure example 1.

Fig. 15 shows a false fault detection case. In the left image, the target marked by a green bounding box is wrongly identified as a suspected faulty traffic light. However, if the image is checked carefully, there is no traffic light at the location marked by the green box. The right image explains why this happens. In the traffic light detection stage, this region is wrongly detected as a traffic light target. Then, because the region is very dark, it is further classified as the “all lights off” fault type.

images

Figure 15: Fault detection failure example 2.

The proposed fault detection method works in most cases and reaches strong evaluation results, with F1-score, precision, and recall all above 96.00%. Since traffic light fault patterns are relatively clear, traditional computer vision methods can detect faults effectively once traffic lights are detected correctly. The small number of failure cases mainly come from errors in the earlier traffic light detection stage, such as missed detections or false detections of traffic light targets. The failure examples in Figs. 14 and 15 indicate that fault identification errors are largely inherited from upstream detection errors (missed detections or false detections of traffic-light targets). In practical roadside monitoring, low-confidence detections can arise due to glare, underexposure, occlusion, motion blur, or snapshot timing at signal switching. To improve robustness in such conditions, the proposed system follows a recall-first strategy in the detection stage and then verifies detections through temporal consistency. The MDYxp fusion module applies frequency-based alignment across multiple frames (r = 11), which removes sporadic false positives while recovering targets that are intermittently missed in single frames. The subsequent time-domain fault identification further aggregates evidence over multiple frames, reducing sensitivity to single-frame artifacts (e.g., “fake all-black” or “fake multiple-lights-on”). This design enables stable end-to-end performance even when individual frames yield low-confidence or noisy detections.

5 Conclusion

This study addresses practical engineering challenges by proposing an integrated framework for traffic light detection and traffic light fault detection based on the YOLOv5x network. Because the original YOLOv5x exhibits limited detection accuracy for traffic light targets, particularly in terms of recall, several targeted improvements are introduced.

First, a DCSPx-based backbone network is designed to enhance the integration of spatial and geometric features of traffic light targets, thereby improving detection performance. In addition, a dual-attention mechanism is incorporated to strengthen the use of global contextual information and further improve detection accuracy. To address variations in traffic light appearance caused by different camera viewpoints, lighting conditions, and weather, a multi-image fusion strategy is proposed for real-world applications. By using images captured at different times and under different conditions, this strategy reduces both missed detections and false detections.

A series of ablation experiments are conducted to evaluate the effectiveness of the proposed improvements. The results show that DA-YOLOv5xp, which integrates the DCSPx module and the dual-attention mechanism, achieves the best performance among the tested network models. Compared with the baseline YOLOv5x model, the F1-score, precision, and recall improve by 4.74%, 0.83%, and 8.22%, respectively, while the model size is reduced by 357 MB. Building on this model, the MDYxp fusion algorithm further enhances detection performance and achieves an F1-score of 0.95833.

Comparative experiments with several existing detection networks further demonstrate that the proposed method delivers the best performance in terms of F1-score and recall. This study also investigates typical traffic light fault cases and identifies situations that may lead to misclassification. Based on this analysis, a time-domain traffic light fault detection algorithm is proposed. The algorithm is described in detail and validated experimentally. The results indicate that the fault detection method achieves an F1-score of 0.96970, with precision and recall of 0.97959 and 0.96000, respectively.

This study is evaluated primarily under the data collection setting of the Chengdu Traffic Management Bureau, in which each fixed roadside camera provides continuous or densely sampled time-series images. Accordingly, the proposed MDYxp strategy is formulated as a multi-image fusion refinement module, and its effectiveness is demonstrated in scenarios where multiple frames from the same camera are available. For cold-start or single-frame inputs, such as those from newly installed cameras, intermittent transmission, or sparse sampling, the fusion stage cannot be directly applied. In these cases, the framework relies solely on the base detector, DA-YOLOv5xp, and the robustness gains provided by temporal aggregation are correspondingly reduced. Future work will explicitly benchmark the single-frame setting, extend the fusion mechanism to sparse and short sequences through adaptive decision thresholds, and explore complementary strategies, including uncertainty-aware reporting and cross-time priors, to improve reliability when only limited frames are available.

Acknowledgement: The authors would like to thank the Chengdu Traffic Management Bureau for providing the road traffic camera data used in this study, and the contributors of the Bosch Small Traffic Lights (BSTL) dataset for making their benchmark publicly available. We also acknowledge the valuable discussions and technical support provided by our colleagues during dataset construction, model implementation, and experimental.

Funding Statement: This research was supported by the Science and Technology Program of Sichuan Province, China (No. 2024YFFK0414).

Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Yuxiao Shi and Yuxia Li; methodology, Yuxiao Shi and Jinglin Zhang; algorithm design, Yuxiao Shi; model construction, Yuxia Li; software, Yuxiao Shi and Jinglin Zhang; data preprocessing, Jinglin Zhang; data curation, Yuxiao Shi; experimental design, Yuxiao Shi and Jinglin Zhang; validation, Yuxiao Shi, Jinglin Zhang, and Yuxia Li; formal analysis, Yuxiao Shi and Jinglin Zhang; investigation, Yuxiao Shi; resources, Yuxia Li; visualization, Yuxiao Shi; writing—original draft preparation, Yuxiao Shi; writing—review and editing, Yuxiao Shi, Jinglin Zhang, and Yuxia Li; supervision, Yuxia Li; project administration, Yuxia Li. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: The Bosch Small Traffic Lights (BSTL/BSTLD) dataset used in this study is openly available from the Heidelberg Collaboratory for Image Processing (HCI) at [https://github.com/bosch-ros-pkg/bstld]. The road traffic camera images and the associated annotations provided by the Chengdu Traffic Management Bureau (used to construct the traffic-light contamination dataset in this work) are subject to third-party data-sharing restrictions and therefore cannot be released publicly. Derived data (e.g., processed annotations/splits) and trained model weights can be made available from the corresponding author (Yuxia Li) upon reasonable request, subject to approval by the data provider where applicable.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23–28; Columbus, OH, USA. p. 580–7. doi:10.1109/CVPR.2014.81. [Google Scholar] [CrossRef]

2. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27–30; Las Vegas, NV, USA. p. 779–88. doi:10.1109/CVPR.2016.91. [Google Scholar] [CrossRef]

3. Bakirci M. Internet of Things-enabled unmanned aerial vehicles for real-time traffic mobility analysis in smart cities. Comput Electr Eng. 2025;123(2):110313. doi:10.1016/j.compeleceng.2025.110313. [Google Scholar] [CrossRef]

4. Wang H, Wang J, Bai K, Sun Y. Centered multi-task generative adversarial network for small object detection. Sensors. 2021;21(15):5194. doi:10.3390/s21155194. [Google Scholar] [PubMed] [CrossRef]

5. Park HR, Hwang KH, Ha YG. An object detection model robust to out-of-distribution data. In: 2021 IEEE International Conference on Big Data and Smart Computing (BigComp); 2021 Jan 17–20; Jeju Island, Republic of Korea. p. 275–8. doi:10.1109/bigcomp51126.2021.00057. [Google Scholar] [CrossRef]

6. Yoneda K, Kuramoto A, Suganuma N, Asaka T, Aldibaja M, Yanase R. Robust traffic light and arrow detection using digital map with spatial prior information for automated driving. Sensors. 2020;20(4):1181. doi:10.3390/s20041181. [Google Scholar] [PubMed] [CrossRef]

7. Masaki S, Hirakawa T, Yamashita T, Fujiyoshi H. Distant traffic light recognition using semantic segmentation. Transp Res Rec J Transp Res Board. 2021;2675(11):97–103. doi:10.1177/03611981211016467. [Google Scholar] [CrossRef]

8. Wang K, Liu M, Ye Z. An advanced YOLOv3 method for small-scale road object detection. Appl Soft Comput. 2021;112(1):107846. doi:10.1016/j.asoc.2021.107846. [Google Scholar] [CrossRef]

9. Li G, Zhang Y, Wang H, Chen X. An effective detection algorithm for small UAV based on lightweight YOLOv4-L approach. Appl Soft Comput. 2025;184(3):113841. doi:10.1016/j.asoc.2025.113841. [Google Scholar] [CrossRef]

10. Qiu T-H, Wang L, Wang P, Bai Y-E. Research on object detection algorithm based on improved YOLOv5. Comput Eng Appl. 2022;58(13):63–73. (In Chinese). doi:10.20944/preprints202408.1218.v1. [Google Scholar] [CrossRef]

11. Yue H, Quan Z. Improved small target detection algorithm based on SSD. In: Proceedings of the 2024 5th International Conference on Computer Vision and Deep Learning (CVIDL); 2024 Jul 19–21; Chongqing, China. p. 1421–5. doi:10.1109/CVIDL62147.2024.10603510. [Google Scholar] [CrossRef]

12. Liu Y-T, Fan Y-M, Zhang L, Li G. Anomaly detection of traffic signal lights in network based on self-learning. J Chongqing Jiaotong Univ. 2021;40(3):27–33. doi:10.3969/j.issn.1674-0696.2021.03.05. [Google Scholar] [CrossRef]

13. Zhang L. Method for realizing traffic signal lamp fault automatic detection and alarm operation and maintenance management system. In: Proceedings of the 34th China (Tianjin) IT, Network, Information Technology, Electronics, Instrumentation Innovation Academic Conference; 2020 Aug 16–18; Tianjin, China. p. 37–40. (In Chinese). [Google Scholar]

14. Lin HY, Chen YC. Traffic light detection using ensemble learning by boosting with color-based data augmentation. Int J Transp Sci Technol. 2024:1–15. doi:10.1016/j.ijtst.2024.10.012. [Google Scholar] [CrossRef]

15. Song J, Hu T, Gong Z, Zhang Y, Cui M. TLDM: an enhanced traffic light detection model based on YOLOv5. Electronics. 2024;13(15):3080. doi:10.3390/electronics13153080. [Google Scholar] [CrossRef]

16. Polley N, Pavlitska S, Boualili Y, Rohrbeck P, Stiller P, Bangaru AK, et al. TLD-READY: traffic light detection—Relevance estimation and deployment analysis. arXiv:2409.07284. 2024. [Google Scholar]

17. Khaled LB, Rahman M, Ebu IA, Ball JE. FlashLightNet: an end-to-end deep learning framework for real-time detection and classification of static and flashing traffic light states. Sensors. 2025;25(20):6423. doi:10.3390/s25206423. [Google Scholar] [PubMed] [CrossRef]

18. Wang CY, Mark Liao HY, Wu YH, Chen PY, Hsieh JW, Yeh IH. CSPNet: a new backbone that can enhance learning capability of CNN. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2020 Jun 14–19; Seattle, WA, USA. p. 1571–80. doi:10.1109/cvprw50498.2020.00203. [Google Scholar] [CrossRef]

19. Behrendt K, Novak L, Botros R. A deep learning approach to traffic lights: detection, tracking, and classification. In: 2017 IEEE International Conference on Robotics and Automation (ICRA); 2017 May 29–Jun 3; Singapore. p. 1370–7. doi:10.1109/ICRA.2017.7989163. [Google Scholar] [CrossRef]

20. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: single shot MultiBox detector. In: Computer vision—ECCV 2016. Cham, Switzerland: Springer; 2016. p. 21–37. doi:10.1007/978-3-319-46448-0_2. [Google Scholar] [CrossRef]

21. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA. p. 2261–9. doi:10.1109/CVPR.2017.243. [Google Scholar] [CrossRef]

22. Ultralytics. YOLOv5 [Internet]. 2022 [cited 2026 Feb 1]. Available from: https://github.com/ultralytics/yolov5. [Google Scholar]

23. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, et al. Dual attention network for scene segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15–20; Long Beach, CA, USA: IEEE; 2020. p. 3141–9. doi:10.1109/CVPR.2019.00326. [Google Scholar] [CrossRef]

24. GB 14887-2011. Road traffic signal lamps-Technical requirements and testing methods. Beijing, China: General Administration of Quality Supervision, Inspection and Quarantine of the People’s Republic of China; Beijing, China: Standardization Administration of the People’s Republic of China; 2011. (In Chinese). [Google Scholar]

25. Girshick R. Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV); 2015 Dec 7–13; Santiago, Chile. p. 1440–8. doi:10.1109/iccv.2015.169. [Google Scholar] [CrossRef]

26. Zhu X, Lyu S, Wang X, Zhao Q. TPH-YOLOv5: improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); 2021 Oct 11–17; Montreal, BC, Canada. p. 2778–88. doi:10.1109/iccvw54120.2021.00312. [Google Scholar] [CrossRef]

Cite This Article

APA Style

Shi, Y., Zhang, J., Li, Y. (2026). Multi-Scene Traffic Light Detection and Fault Identification via Dual-Attention Image Fusion. Computer Modeling in Engineering & Sciences, 147(1), 28. https://doi.org/10.32604/cmes.2026.078601

Vancouver Style

Shi Y, Zhang J, Li Y. Multi-Scene Traffic Light Detection and Fault Identification via Dual-Attention Image Fusion. Comput Model Eng Sci. 2026;147(1):28. https://doi.org/10.32604/cmes.2026.078601

IEEE Style

Y. Shi, J. Zhang, and Y. Li, “Multi-Scene Traffic Light Detection and Fault Identification via Dual-Attention Image Fusion,” Comput. Model. Eng. Sci., vol. 147, no. 1, pp. 28, 2026. https://doi.org/10.32604/cmes.2026.078601

BibTex EndNote RIS

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Table of Content

Multi-Scene Traffic Light Detection and Fault Identification via Dual-Attention Image Fusion

Abstract

Keywords

References

Cite This Article

420

167

0

Related articles

Further Information

Guidelines

Follow Us

Join Us

Contact Us

WhatsApp:

Share Link