Open Access

ARTICLE


Toward Efficient Traffic-Sign Detection via SlimNeck and Coordinate-Attention Fusion in YOLO-SMM

Hui Chen1, Mohammed A. H. Ali1,*, Bushroa Abd Razak1, Zhenya Wang2, Yusoff Nukman1, Shikai Zhang1, Zhiwei Huang1, Ligang Yao3, Mohammad Alkhedher4

1 Department of Mechanical Engineering, Faculty of Engineering, University of Malaya, Kuala Lumpur, 50603, Malaysia
2 Department of Mechanical Engineering, Tsinghua University, Beijing, 100084, China
3 School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou, 350108, China
4 Mechanical and Industrial Engineering Department, Abu Dhabi University, Zayed City, Abu Dhabi, 59911, United Arab Emirates

* Corresponding Author: Mohammed A. H. Ali.

Computers, Materials & Continua 2026, 86(2), 1-26. https://doi.org/10.32604/cmc.2025.067286

Abstract

Accurate and real-time traffic-sign detection is a cornerstone of Advanced Driver-Assistance Systems (ADAS) and autonomous vehicles. However, existing one-stage detectors miss distant signs, and two-stage pipelines are impractical for embedded deployment. To address this issue, we present YOLO-SMM, a lightweight one-stage framework that augments the YOLOv8 baseline with three targeted modules. (1) SlimNeck replaces the PAN/FPN neck with a CSP-OSA/GSConv fusion block, reducing parameters and FLOPs without compromising multi-scale detail. (2) The MCA module introduces row- and column-aware weights to selectively amplify small sign regions in cluttered scenes. (3) MPDIoU augments the CIoU loss with a corner-distance term, supplying stable gradients for sub-20-pixel boxes and tightening localization. Evaluated on the German Traffic Sign Recognition Benchmark (GTSRB), YOLO-SMM attains 96.3% mAP50 and 93.1% mAP50-90 at 90.6 frames per second (FPS), a gain of roughly +1.0% mAP and +8.3 FPS over the YOLOv8 baseline at 640 × 640 input resolution. Under identical conditions, it also surpasses YOLOv7, YOLOv5, RetinaNet, EfficientDet, and Faster R-CNN in both accuracy and speed.

Keywords

Traffic sign detection; YOLOv8; YOLOv5; YOLOv7; SlimNeck; modified coordinate attention; MPDIoU

1  Introduction

The ability to accurately detect traffic signs is a critical component in intelligent transportation systems and autonomous driving. This capability allows vehicles to recognize and comply with regulatory, warning, and guide signs, thereby enhancing road safety and enabling informed navigation decisions. Advanced driver-assistance systems (ADAS) that can localize and identify traffic signs in real time have been shown to improve traffic flow, reduce accident risk, and support higher levels of autonomy.

However, reliable traffic sign detection remains a significant challenge. Traffic signs often appear as small-scale targets—sometimes only a few pixels in size—within complex scenes affected by motion blur, occlusion, variable lighting, and diverse weather conditions [1]. These characteristics make accurate and efficient detection particularly difficult, especially for small and distant signs.

Early deep-learning approaches to traffic sign detection primarily relied on two-stage detectors such as Faster R-CNN, which achieved high accuracy through region proposal and refinement, albeit at the cost of high computational complexity and latency. In contrast, one-stage detectors—such as SSD and the YOLO series—reformulated detection as a single-pass regression problem, allowing for real-time inference with frame rates exceeding 30 FPS. Nevertheless, these models often struggle to recall small targets after multiple downsampling operations [2].

To overcome these limitations, recent versions of YOLO (e.g., v5, v7, and v8) have introduced improved backbones, multi-scale prediction heads, and more lightweight neck designs. Among them, YOLOv8 represents the latest evolution of the YOLO family, combining an anchor-free detection paradigm with a decoupled head and flexible modular architecture. These features are particularly beneficial for small object detection, as they eliminate the dependency on hand-crafted anchor boxes and enable easier customization for task-specific enhancements. Moreover, YOLOv8 provides robust community support, excellent scalability, and efficient deployment capabilities, making it a strong foundation for practical and extensible applications.

Despite these advancements, a trade-off persists between detection accuracy, model complexity, and real-time performance. In particular, the recall of small signs in cluttered environments still lags behind that of larger objects. Therefore, there remains a need for lightweight, accurate, and generalizable architectures that address these challenges efficiently.

As powerful as these methods are, they share four major limitations when applied to small-object traffic sign detection:

(a)   Small-sign recall degrades sharply under blur, low contrast, or heavy occlusion, since high-frequency details are lost in deep feature maps.

(b)   Multi-scale fusion networks (FPN, PANet, BiFPN) improve small-object representation but incur substantial FLOPs and parameter overhead, limiting real-time feasibility on embedded platforms.

(c)   Lightweight architectures (MobileNet, ShuffleNet, GhostNet, EfficientDet) reduce model size but often sacrifice feature richness, leading to missed detections of subtle sign patterns.

(d)   IoU-based losses (GIoU, DIoU, CIoU, SIoU) ameliorate coarse box regression but remain insensitive to corner misalignments on very small boxes, resulting in loose localization and higher false negatives.

To address these limitations, we propose YOLO-SMM, an enhanced YOLOv8 detector that combines:

(a)   SlimNeck—a CSP-OSA and GSConv-based neck that delivers efficient one-shot aggregation of multi-scale features, preserving fine-grained details for small signs while drastically reducing neck computation.

(b)   Modified Coordinate Attention (MCA)—a position-aware attention module that factorizes channel weighting into horizontal and vertical contexts, sharpening focus on tiny sign regions without discarding spatial cues.

(c)   MPDIoU Loss—a novel IoU-based regression loss that augments CIoU with normalized corner-point distance penalties, providing stable, fine-grained gradients to tightly align predicted boxes with ground truth even for sub-20 px objects.

The remainder of this paper is organized as follows. Section 2 reviews related work on traffic sign detection, multi-scale fusion, lightweight architectures, attention mechanisms, and IoU-based losses. Section 3 details the proposed SlimNeck, MCA, and MPDIoU components within the YOLOv8 framework. Section 4 describes the experimental setup, dataset splits, and evaluation metrics, and presents quantitative and qualitative results. Section 5 discusses the implications of our findings and potential avenues for further improvement. Finally, Section 6 concludes the paper and outlines future research directions.

2  Related Work

Traffic signs are critical roadside devices that regulate driver behavior, alert motorists to potential hazards, and provide navigational instructions. Their timely perception is therefore essential to ensuring road safety and facilitating the smooth flow of traffic. On-board traffic-sign recognition (TSR) systems execute two sequential tasks: (i) localization, in which candidate sign regions are identified within a cluttered road scene; and (ii) classification, in which each region is assigned a semantic label (e.g., Stop, Speed-Limit 50 km h−1). Accurate localization is instrumental in preventing missed or delayed detections, whereas reliable classification is crucial for ensuring that autonomous vehicles and advanced driver-assistance systems (ADAS) respond with the appropriate maneuver.

Early TSR pipelines were dominated by handcrafted descriptors and shallow classifiers. Popular choices include Histogram of Oriented Gradients (HOG) or Scale-Invariant Feature Transform (SIFT) for region description, coupled with Support Vector Machines (SVM) or Random Forests for sign/non-sign discrimination and category assignment. Although such methods are lightweight and interpretable, their performance deteriorates under illumination changes, occlusions, and perspective distortion—conditions that frequently arise in real-world driving. The limitations of hand-engineered features motivated the transition to deep convolutional detectors, whose capacity for hierarchical representation learning has largely supplanted traditional TSR techniques in recent years.

The seminal Faster R-CNN of Ren et al. (2015) couples a region-proposal network with region-wise refinement; it achieves high accuracy but incurs multi-stage latency that hinders real-time use [3]. Single-stage detectors eschew proposals and regress boxes densely. The YOLO family (Redmon & Farhadi, 2016–2018) demonstrated that a carefully designed one-stage pipeline can achieve video-rate inference with only a modest mAP loss [4,5]. However, early YOLO variants struggled with objects below 32 × 32 px, a critical limitation for distant traffic signs [6].

Preserving fine spatial cues typically requires a feature-pyramid neck. The FPN of Lin et al. (2017) fuses features top-down, PANet (Liu et al., 2018) adds a bottom-up path that shortens the information flow, and BiFPN (Tan et al., 2020) introduces learnable bidirectional weights and underlies EfficientDet [7]. While these designs improve small-object recall, they also increase FLOPs and memory requirements by 20%–100%; in practice, such heavy necks hinder deployment on edge hardware.

Depth-wise and grouped convolutions, first commercialized in MobileNet V1/V2 (Howard et al., 2017; Sandler et al., 2018) and ShuffleNet (Zhang et al., 2018), reduce parameters by an order of magnitude but sacrifice some representational power [8,9]. GhostNet (Han et al., 2020) synthesizes "ghost" feature maps with inexpensive linear transforms, targeting feature-map redundancy rather than network width [10]. Building on these ideas, Li et al. (2022) introduced GSConv, which combines a partial standard convolution with a depth-wise branch and a channel shuffle, preserving accuracy while roughly halving parameters [11]. However, previous studies seldom incorporated such operators into the fusion neck, leaving over-parameterized pyramid layers.

Channel reweighting (SE block; Hu et al., 2018) and joint channel-spatial attention (CBAM; Woo et al., 2018) suppress irrelevant activations, but both lose positional cues after global pooling [12,13]. Coordinate Attention (CA; Hou et al., 2021) factorizes pooling into horizontal and vertical directions, embedding location into the attention weights and improving performance on small objects [14]. However, CA is typically injected near the backbone tail, so its benefit fades in the early, high-resolution stages where sign pixels first emerge.

The quality of bounding-box regression has a significant impact on mAP for small targets. GIoU (Generalized IoU; Rezatofighi et al., 2019) retains gradients when two boxes do not overlap, while DIoU (Distance-IoU) and CIoU (Complete-IoU) of Zheng et al. (2020) add center-distance and aspect-ratio terms, respectively [15,16]. However, these penalties do not explicitly constrain corner alignment, so the optimizer may converge on boxes that overlap well yet deviate at the edges; for 20 × 20-pixel signs, such edge errors reduce the Intersection over Union (IoU) disproportionately. YOLOv5 (Jocher et al., 2020) pairs a CSP backbone with a PAN neck and reaches 120 frames per second (FPS) on a T4 GPU, yet still misses distant signs [2]. YOLOv7 (Wang et al., 2022) employs an Extended-ELAN backbone and a "bag-of-freebies" training recipe, yielding the highest real-time AP of its day but with a 36.9M-parameter footprint [17]. YOLOv8-n (Jocher et al., 2023) moves to an anchor-free head and prunes parameters to 3.0M, yet its standard neck remains a computational bottleneck and its default CIoU loss is unchanged.

3  Methodology

3.1 Method Overview

We adopt YOLOv8 as the base detection framework owing to its high detection accuracy, flexible and modular architecture, and anchor-free design, which significantly benefits small object localization. This choice facilitates seamless integration of our proposed modules (SlimNeck, MCA, and MPDIoU) and ensures that our improvements are built upon a strong, widely recognized foundation.

3.2 The Network Structure of YOLOv8-SlimNeck-MCA-MPDIoU

As illustrated in Fig. 1, the proposed YOLOv8-SlimNeck-MCA-MPDIoU network incorporates three primary enhancements: SlimNeck, MCA (Modified Coordinate Attention), and MPDIoU. SlimNeck replaces the standard YOLOv8 neck with a more efficient multi-scale fusion pipeline, reducing redundant computation while preserving the features that matter for small-object detection. MCA is integrated to selectively accentuate salient regions through coordinate-wise attention, enhancing feature distinctiveness against background noise. Finally, the bounding-box regression head adopts MPDIoU, which adds a corner-distance penalty on top of the standard IoU term, yielding more precise localization and faster convergence. Together, these modifications produce a lightweight yet powerful architecture that is more robust in detecting traffic signs of varying sizes: SlimNeck retains essential features, MCA augments discriminative cues, and MPDIoU refines box regression, resulting in higher accuracy and faster inference.


Figure 1: The YOLOv8-SlimNeck-modified coordinate attention-MPDIoU neck architecture

3.3 SlimNeck

As illustrated in Fig. 2a, the conventional YOLOv8 neck (feature pyramid) is effective for multi-scale feature fusion but introduces significant computational redundancy, which can impede the detection of small objects. In traffic sign detection, many targets are small and densely spaced, and an overly complex neck can attenuate the high-resolution features they depend on. To address this, the original neck is replaced with SlimNeck, a lightweight feature-pyramid structure designed to maintain strong multi-scale representation while significantly reducing computation. As depicted in Fig. 2b, SlimNeck streamlines the architecture to prioritize the fusion of essential features, improving the propagation of fine-grained details for small traffic signs.


Figure 2: Neck architecture comparison: (a) original YOLOv8 neck; (b) proposed SlimNeck (CSP-OSA + GSConv fusion)

(1)   Structure and Information Flow: Integrating OSA and CSP

Structure: SlimNeck is constructed by integrating the One-Shot Aggregation (OSA) concept from VoVNet with the Cross-Stage Partial (CSP) design. In an OSA module, the input feature passes through a series of transformations, and the input is then concatenated with all of the transformed outputs in a single operation. Let X denote the input feature tensor and T_i the output of the i-th transformation (e.g., a convolution) in the block. The feature maps are aggregated through a one-shot concatenation of the input and all transformed outputs:

$$U = \mathrm{Concat}(X, T_1, T_2, \ldots, T_n)\tag{1}$$

where X is the input feature map of dimension C × H × W (C channels, height H, width W; C typically ranges from 16 to 512, while H and W vary with the input image size), and T_i (i = 1, 2, ..., n) are the feature maps produced by the successive transformations (e.g., convolutions), each with the same dimensions as X.

In our CSP-OSA fusion, the input X is first split into two branches: one branch undergoes the series of transformations producing T_1, T_2, ..., T_n (which are concatenated with X as in OSA), while the other branch (V) bypasses directly. The concatenation of the aggregated features U with the bypass V is then fed into a fusion function H (such as a 1 × 1 convolution) to produce the SlimNeck output:

$$F_{\mathrm{out}} = H(\mathrm{Concat}(U, V))\tag{2}$$

where U is the aggregated feature map from Eq. (1), containing all features obtained through the different transformations; V is the bypass (skip-connection) branch, with the same dimensions as the input feature map X; and H is the fusion function, usually a 1 × 1 convolution, used to generate the final SlimNeck output.

This design preserves the original input content via V while enriching it with new features via U, improving information flow and reducing duplicate feature processing. By aggregating features in one shot and partially preserving the input, SlimNeck avoids unnecessary repetitive transformations, which is especially helpful for retaining the features of small objects.
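To make the information flow concrete, the following is a minimal PyTorch sketch of a CSP-OSA fusion block following Eqs. (1) and (2). The channel sizes, the number of transformations n, the use of a 1 × 1 convolution to form the two CSP branches, and the BatchNorm/SiLU choices are illustrative assumptions rather than the exact implementation used in YOLO-SMM.

```python
import torch
import torch.nn as nn

class CSPOSABlock(nn.Module):
    """Illustrative CSP-OSA fusion block (Eqs. (1)-(2)).

    One branch applies n successive 3x3 convolutions and concatenates the
    branch input with all transformed outputs in one shot (OSA); the other
    branch bypasses directly (CSP); a 1x1 convolution acts as the fusion
    function H. Layer choices are assumptions, not the paper's exact design.
    """
    def __init__(self, c_in: int, c_out: int, n: int = 3):
        super().__init__()
        c_half = c_in // 2
        self.split = nn.Conv2d(c_in, 2 * c_half, kernel_size=1)  # form the two CSP branches
        self.transforms = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_half, c_half, 3, padding=1),
                          nn.BatchNorm2d(c_half), nn.SiLU())
            for _ in range(n)
        ])
        # U holds (n + 1) * c_half channels; the bypass V adds another c_half
        self.fuse = nn.Conv2d((n + 2) * c_half, c_out, kernel_size=1)  # fusion function H

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_osa, v = self.split(x).chunk(2, dim=1)    # OSA-branch input X and bypass V
        feats, t = [x_osa], x_osa
        for conv in self.transforms:                # T_1, ..., T_n applied in series
            t = conv(t)
            feats.append(t)
        u = torch.cat(feats, dim=1)                 # Eq. (1): U = Concat(X, T_1, ..., T_n)
        return self.fuse(torch.cat([u, v], dim=1))  # Eq. (2): F_out = H(Concat(U, V))

# Example: fuse a 64-channel, 80 x 80 feature map (stride-8 scale at 640 x 640 input)
y = CSPOSABlock(64, 64)(torch.randn(1, 64, 80, 80))  # -> torch.Size([1, 64, 80, 80])
```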

(2)   GSConv Convolutional Operator

GSConv for Light-Weight Design: To further slim down the neck, we replace standard convolutional layers with Grouped Spatial Convolution (GSConv) modules. A standard convolution with kernel size k and input/output channel counts C_i and C_o has a parameter count of:

$$P_{\mathrm{Conv}} = C_i \times C_o \times k^2\tag{3}$$

where C_i is the number of input channels, C_o is the number of output channels after convolution, and k is the convolution kernel size, typically 3 or 5.

GSConv adopts a split-and-shuffle strategy to reduce this cost. In GSConv, roughly half of the output channels are produced by a regular convolution on a reduced channel subset, and the other half are generated by a cheaper depthwise convolution (whose complexity is linear in C_i). The two results are then concatenated and channel-shuffled to mix information. The approximate parameter count is:

$$P_{\mathrm{GSConv}} \approx \left(C_i \times \tfrac{C_o}{2} \times k^2\right) + \left(C_i \times k^2\right)\tag{4}$$

which is about half of P_Conv for large C_o. Despite this reduction in complexity, GSConv preserves expressive power similar to a full convolution by combining the partial convolution and depthwise operations. By utilizing GSConv throughout SlimNeck, we substantially cut down the number of parameters and FLOPs in the neck, making the model more efficient for real-time detection on embedded devices.
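The split-and-shuffle idea can be sketched as follows in PyTorch. Note that the depthwise branch in this sketch operates on the output of the dense branch (C_o/2 channels), so its parameter count is (C_o/2) × k², which coincides with the C_i × k² term of Eq. (4) when C_i ≈ C_o/2; the normalization and activation choices are assumptions.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of the GSConv operator described above: half of the output
    channels come from a standard convolution, the other half from a cheap
    depthwise convolution applied to that result, followed by a two-group
    channel shuffle. BatchNorm/SiLU are assumed choices."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, stride: int = 1):
        super().__init__()
        c_half = c_out // 2                              # c_out is assumed even
        self.dense = nn.Sequential(                      # standard conv -> C_o / 2 channels
            nn.Conv2d(c_in, c_half, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(                      # depthwise conv on the dense output
            nn.Conv2d(c_half, c_half, k, 1, k // 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.dense(x)
        y2 = self.cheap(y1)
        y = torch.cat([y1, y2], dim=1)                   # C_o channels in total
        b, c, h, w = y.shape                             # channel shuffle: interleave the halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```

As a rough check of Eq. (4): with C_i = 128, C_o = 256, and k = 3, a standard convolution needs 128 × 256 × 9 ≈ 295 K weights, whereas the sketch above uses 128 × 128 × 9 + 128 × 9 ≈ 149 K, roughly half.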

Detection Head Scales: Following the SlimNeck feature fusion, the model outputs multi-scale feature maps for detection, as in YOLOv8. We employ three detection heads operating at strides 8, 16, and 32, which detect small, medium, and large objects, respectively. Table 1 summarizes the characteristics of these detection layers (for an example input resolution of 640 × 640). The highest-resolution feature map (stride 8) is crucial for small traffic signs, providing fine detail, whereas the lower-resolution maps (strides 16 and 32) cover larger signs and more contextual information.
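For reference, the head resolutions implied by these strides at a 640 × 640 input can be checked directly; the small/medium/large assignment mirrors the description above.

```python
# Detection-grid sizes at a 640 x 640 input for the three head strides.
for stride, role in ((8, "small signs"), (16, "medium signs"), (32, "large signs")):
    g = 640 // stride
    print(f"stride {stride:2d}: {g} x {g} grid ({g * g} cells) -> {role}")
# stride  8: 80 x 80 grid (6400 cells) -> small signs
# stride 16: 40 x 40 grid (1600 cells) -> medium signs
# stride 32: 20 x 20 grid (400 cells)  -> large signs
```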


3.4 A Modified Coordinate Attention (MCA)

While a light neck has been demonstrated to enhance efficiency, ensuring the network focuses on traffic sign regions is equally important. In this study, we propose a modified Coordinate Attention (MCA) module to enhance feature representation through channel-wise reweighting guided by spatial coordinates, as shown in Fig. 3. Attention mechanisms, such as Squeeze-and-Excitation (SE) and CBAM, have demonstrated that adaptively recalibrating feature responses can substantially enhance performance. However, SE blocks compress spatial information into a single channel descriptor, thereby losing positional cues. CBAM modules apply 2D spatial attention in a coarse manner. Coordinate Attention (CA) was recently proposed as a solution to these limitations by factorizing channel attention into two coordinate directions, thereby enabling the network to preserve precise location information. The present MCA builds upon Coordinate Attention to better highlight traffic sign features, especially for small signs that require accurate spatial localization.


Figure 3: Architecture of the modified coordinate attention (MCA) module: coordinate-wise pooling, bottleneck transform, and attention reweighting

Consider an input feature map X ∈ ℝ^{C×H×W}. We perform global average pooling along both the vertical and horizontal directions to obtain:

$$E_h(c, w) = \frac{1}{H}\sum_{y=1}^{H} X(c, y, w), \qquad E_w(c, h) = \frac{1}{W}\sum_{x=1}^{W} X(c, h, x)\tag{5}$$

where X is the input feature map with dimensions C × H × W (C channels, height H, width W); E_h ∈ ℝ^{C×1×W} aggregates the activation of each column for every channel, and E_w ∈ ℝ^{C×H×1} encodes the activation of each row. We then concatenate these two outputs and pass them through a 1 × 1 convolution layer that reduces the channel dimension by a ratio r. The feature is then split back into two branches, each passing through a Sigmoid activation to yield the horizontal attention map g_w and the vertical attention map g_h. Finally, these two attention maps are broadcast to the full spatial dimensions of the original feature map and multiplied element-wise with X:

$$Y_{c,h,w} = X_{c,h,w} \cdot g_h(c, h) \cdot g_w(c, w)\tag{6}$$

where g_h and g_w are the vertical (row-wise) and horizontal (column-wise) attention weights obtained after the Sigmoid activation, with dimensions C × H × 1 and C × 1 × W, respectively. Their values lie between 0 and 1 and represent the importance of each position.

The output Y has the same shape as X, but each channel’s activation at position (h, w) has been scaled by factors that depend on its horizontal and vertical coordinates. In effect, MCA can highlight features that align with important rows and columns associated with traffic signs, yielding a more precise attention map than conventional channel-only or spatial-only attention. This coordinate attention mechanism preserves the positional context of signs (e.g., a small sign’s feature can be amplified at its specific location without being averaged out), thereby improving the network’s ability to detect and localize small traffic signs in complex backgrounds.
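A compact PyTorch sketch of the MCA computation in Eqs. (5) and (6) is given below. The shared 1 × 1 bottleneck with reduction ratio r and the BatchNorm/SiLU choices follow standard coordinate attention and are assumptions where the text does not specify exact layers.

```python
import torch
import torch.nn as nn

class MCA(nn.Module):
    """Sketch of the modified coordinate attention in Eqs. (5)-(6):
    directional average pooling, a shared 1x1 bottleneck (reduction r),
    and per-direction sigmoid gates g_h, g_w that rescale the input."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        c_mid = max(8, channels // r)
        self.pool_rows = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> E_w: (B, C, H, 1)
        self.pool_cols = nn.AdaptiveAvgPool2d((1, None))  # average over height -> E_h: (B, C, 1, W)
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, c_mid, kernel_size=1),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.to_h = nn.Conv2d(c_mid, channels, kernel_size=1)
        self.to_w = nn.Conv2d(c_mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        e_w = self.pool_rows(x)                          # per-row context, Eq. (5)
        e_h = self.pool_cols(x).permute(0, 1, 3, 2)      # per-column context, reshaped to (B, C, W, 1)
        y = self.reduce(torch.cat([e_w, e_h], dim=2))    # shared 1x1 bottleneck over H + W positions
        y_w, y_h = torch.split(y, [h, w], dim=2)         # split back into the two directions
        g_h = torch.sigmoid(self.to_h(y_w))              # (B, C, H, 1): weight per row
        g_w = torch.sigmoid(self.to_w(y_h)).permute(0, 1, 3, 2)  # (B, C, 1, W): weight per column
        return x * g_h * g_w                             # Eq. (6)
```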

3.5 Minimum-Point-Distance IoU (MPDIoU)

Accurate localization is imperative for traffic sign detection, so the bounding-box regression loss is also enhanced. Most object detectors employ a variant of Intersection over Union (IoU) as the localization loss, but IoU-based losses have known limitations in guiding regression when predictions deviate from the ground truth. When two boxes do not overlap, the IoU is zero and provides no gradient on their relative placement; even when they overlap, the IoU measures only the overlap ratio and is insensitive to how the boxes are misaligned in position or shape. Enhanced variants such as GIoU, DIoU, and CIoU address some of these issues; CIoU, for instance, adds penalties for center distance and aspect-ratio discrepancy. However, there are scenarios, particularly for small objects, where CIoU still falls short: if a predicted box is near the ground-truth center but significantly mismatched in size or edges, the center-distance term becomes small and fails to penalize the misalignment adequately, slowing the refinement of box corners during training.

We propose the Minimum-Point-Distance IoU (MPDIoU) loss, which introduces a corner-based penalty to overcome these shortcomings. The key idea is to directly measure the discrepancy at the corners of the predicted and ground-truth bounding boxes, providing richer localization feedback than the center distance alone. Let (x_min, y_min) and (x_max, y_max) be the top-left and bottom-right corners of the predicted box, and (x_min^gt, y_min^gt) and (x_max^gt, y_max^gt) be those of the ground-truth box. We define a corner distance penalty as the sum of squared differences between corresponding corners:

$$d_{\mathrm{mix}}^{2} = (x_{\min} - x_{\min}^{gt})^2 + (y_{\min} - y_{\min}^{gt})^2 + (x_{\max} - x_{\max}^{gt})^2 + (y_{\max} - y_{\max}^{gt})^2\tag{7}$$

where (x_min, y_min) and (x_max, y_max) denote the top-left and bottom-right corners of the predicted box, and the gt superscript marks the ground-truth counterparts. The d_mix^2 term aggregates the misalignment in both position and size of the predicted box relative to the target. We incorporate it into the IoU loss with a normalization factor to ensure scale invariance. Let C be the diagonal length of the smallest enclosing box that covers both the predicted and ground-truth boxes (as in the normalization used by DIoU/CIoU). The MPDIoU loss is then defined as:

$$\mathcal{L}_{\mathrm{MPDIoU}} = 1 - \mathrm{IoU} + \lambda \frac{d_{\mathrm{mix}}^{2}}{C^{2}}\tag{8}$$

where IoU is the standard intersection-over-union between the predicted and ground-truth boxes; λ is a hyperparameter, typically between 0 and 1, that controls the influence of the corner-point penalty; and C is the diagonal length of the minimum enclosing box, used to normalize the distance term.

By design, MPDIoU provides improved sensitivity and more stable gradients during training. Even when the boxes do not overlap (IoU = 0), the d_mix term is non-zero, so the loss continues to decrease as the predicted box moves closer to the ground truth; this resolves the zero-gradient issue of the plain IoU loss. Moreover, when two boxes overlap substantially but have mismatched sizes or aspect ratios, MPDIoU imposes a higher penalty than CIoU, because the corner distances (x_min − x_min^gt), etc., capture those misalignments. In essence, the loss does not degenerate to only IoU or only center distance in these edge cases, but consistently evaluates the full shape alignment. As a result, MPDIoU accelerates the convergence of box regression and improves final localization accuracy, particularly for small traffic signs where even a few pixels of misalignment can strongly affect IoU. Together with the SlimNeck and MCA modules, this loss forms a comprehensive approach to enhancing both the precision and recall of traffic-sign detection (see Fig. 1 for the architecture overview).
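A minimal PyTorch sketch of Eqs. (7) and (8) for corner-format boxes is shown below; the default λ value is illustrative only, since the ablation in Table 2 is reported for the α-weighted variant defined next.

```python
import torch

def mpdiou_loss(pred: torch.Tensor, gt: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Sketch of the MPDIoU loss in Eqs. (7)-(8).

    pred, gt: (N, 4) boxes in (x_min, y_min, x_max, y_max) format.
    lam is the corner-penalty weight (lambda in Eq. (8)); 0.5 is illustrative.
    """
    # Plain IoU
    inter_wh = (torch.min(pred[:, 2:], gt[:, 2:]) - torch.max(pred[:, :2], gt[:, :2])).clamp(min=0)
    inter = inter_wh[:, 0] * inter_wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)

    # Eq. (7): summed squared distances between corresponding corners
    d_mix2 = ((pred - gt) ** 2).sum(dim=1)

    # C^2: squared diagonal of the smallest box enclosing both boxes
    enc_wh = torch.max(pred[:, 2:], gt[:, 2:]) - torch.min(pred[:, :2], gt[:, :2])
    c2 = (enc_wh ** 2).sum(dim=1) + 1e-7

    return 1.0 - iou + lam * d_mix2 / c2                 # Eq. (8)
```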

We define MPDIoU as an extension of CIoU, incorporating a corner penalty term to optimize alignment for small-object bounding boxes. Let the predicted box be (x_p, y_p, w_p, h_p) and the ground-truth box (x_g, y_g, w_g, h_g) in center-width-height format, with their top-left and bottom-right corner coordinates defined as:

$$P_1 = \left(x_p - \frac{w_p}{2},\, y_p - \frac{h_p}{2}\right), \qquad P_2 = \left(x_p + \frac{w_p}{2},\, y_p + \frac{h_p}{2}\right)\tag{9}$$

$$G_1 = \left(x_g - \frac{w_g}{2},\, y_g - \frac{h_g}{2}\right), \qquad G_2 = \left(x_g + \frac{w_g}{2},\, y_g + \frac{h_g}{2}\right)\tag{10}$$

The corner error distance is then defined as:

$$d_{\mathrm{corner}} = \frac{1}{2}\left[\lVert P_1 - G_1 \rVert^{2} + \lVert P_2 - G_2 \rVert^{2}\right]\tag{11}$$

After normalizing this term to the range [0, 1] using the diagonal D of the smallest enclosing box, MPDIoU is formulated as:

$$\mathrm{MPDIoU} = \mathrm{CIoU} - \alpha\,\frac{d_{\mathrm{corner}}}{D^{2}}\tag{12}$$

where α is the weight hyperparameter for the corner alignment term. This structure provides the model with fine-grained corner-level tuning capabilities, which are particularly critical for sub-pixel-level localization of small objects.
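For completeness, the CIoU-based formulation of Eqs. (9)-(12) can be sketched as follows for boxes given in center format. The CIoU part is re-implemented here in its standard form for self-containment, and α = 0.5 mirrors the best setting in the Table 2 ablation; during training one would typically minimize 1 − MPDIoU.

```python
import math
import torch

def mpdiou(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Sketch of Eqs. (9)-(12): MPDIoU = CIoU - alpha * d_corner / D^2.

    pred, gt: (N, 4) boxes in center format (x, y, w, h).
    """
    # Eqs. (9)-(10): top-left and bottom-right corners
    p1, p2 = pred[:, :2] - pred[:, 2:] / 2, pred[:, :2] + pred[:, 2:] / 2
    g1, g2 = gt[:, :2] - gt[:, 2:] / 2, gt[:, :2] + gt[:, 2:] / 2

    # IoU
    inter = (torch.min(p2, g2) - torch.max(p1, g1)).clamp(min=0).prod(dim=1)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter + 1e-7
    iou = inter / union

    # Standard CIoU = IoU - rho^2 / D^2 - a * v
    d2 = ((torch.max(p2, g2) - torch.min(p1, g1)) ** 2).sum(dim=1) + 1e-7  # D^2: squared enclosing diagonal
    rho2 = ((pred[:, :2] - gt[:, :2]) ** 2).sum(dim=1)                     # squared center distance
    v = (4 / math.pi ** 2) * (torch.atan(gt[:, 2] / gt[:, 3]) - torch.atan(pred[:, 2] / pred[:, 3])) ** 2
    a = v / (1 - iou + v + 1e-7)
    ciou = iou - rho2 / d2 - a * v

    # Eq. (11) then Eq. (12)
    d_corner = 0.5 * (((p1 - g1) ** 2).sum(dim=1) + ((p2 - g2) ** 2).sum(dim=1))
    return ciou - alpha * d_corner / d2
```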

To investigate the impact of the corner penalty weight α in MPDIoU on localization accuracy, we conducted ablation experiments with varying α values (see Table 2). The results show that as α increases from 0 (equivalent to CIoU) to 0.5, both mAP50 and mAP50-90 steadily improve, rising from 94.7% to 96.3% and from 90.1% to 93.1%, respectively. However, when α is further increased to 0.8 or 1.0, performance declines. This indicates that a moderate corner weight optimally enhances the model’s convergence for small object bounding box alignment.


4  Experiment

4.1 GTSRB Dataset

The German Traffic Sign Recognition Benchmark (GTSRB) is a widely recognized dataset for traffic sign research. It was originally introduced by Stallkamp et al. The dataset comprises tens of thousands of real-world images captured under diverse lighting, weather, and viewing conditions. These images encompass 43 distinct classes, including speed limits, prohibitions, and warnings. See Fig. 4 for a visual representation of the classes. While GTSRB was initially developed for classification purposes, wherein images are cropped and assigned to specific categories, it has been adapted for detection tasks by annotating bounding boxes. This annotated version preserves real-world challenges, such as small or distant signs, partial occlusions, and cluttered backgrounds, thereby establishing a realistic benchmark for evaluating the resilience of detection algorithms.


Figure 4: Traffic sign categories in GTSRB dataset

Despite encompassing 43 categories, the dataset exhibits moderate imbalance, as certain sign types are observed more frequently than others. To address this, the refined GTSRB detection set was split into training, validation, and test sets at an approximate ratio of 8:1:1 (Table 3), ensuring diverse examples in every partition. This approach maintains a broad variety of traffic signs, demanding advanced feature extraction and robustness to noise. Consequently, GTSRB effectively assesses a detector’s capacity to localize small-scale targets and manage challenging scenarios, reflecting the complexities of actual road environments.


4.2 Training and Implementation Metrics

The model was trained on the GTSRB and TT100K datasets using PyTorch 1.11. The hardware platform consisted of a single NVIDIA GeForce RTX 4080 GPU paired with an Intel Core i7-11700 CPU, running on the Ubuntu 20.04 operating system.

Optimizer: The AdamW optimizer was used, with a momentum coefficient of 0.9 and a weight decay of 1 × 10−4.

Learning Rate: The initial learning rate was set to 1 × 10−4, with a Cosine Annealing scheduling strategy for gradual decay, reaching a minimum learning rate of 1 × 10−6.

Batch Size: The batch size was set to 32 (i.e., 32 images processed per GPU).

Training Epochs: The model was trained for 300 epochs on each dataset, with the model achieving the highest mAP50 on the validation set selected as the final version.

Pre-Training Strategy: The YOLOv8 backbone network was initialized with COCO pre-trained weights, while other modules, such as SlimNeck and MCA, were randomly initialized.

Data Augmentation: During training, data augmentation techniques were employed, including Mosaic augmentation, random rotation (±10°), brightness variation (±20%), scale scaling (0.8–1.2×), random cropping, and random occlusion. These strategies effectively enhanced the model’s robustness in complex environments.
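For readers reproducing this setup with the Ultralytics YOLOv8 training interface, the configuration would look roughly like the sketch below. The model and dataset YAML file names are hypothetical, the custom SlimNeck/MCA/MPDIoU modules require a modified model definition and loss hook that are not part of the stock library, and the mapping of the augmentation settings to library arguments is an approximation.

```python
# Hypothetical reproduction of the Section 4.2 setup with the Ultralytics API.
from ultralytics import YOLO

model = YOLO("yolo_smm.yaml")      # custom YOLO-SMM definition (assumed file name)
model.load("yolov8n.pt")           # COCO-pretrained backbone weights

model.train(
    data="gtsrb.yaml",             # dataset config (assumed file name)
    epochs=300,
    batch=32,
    imgsz=640,
    optimizer="AdamW",
    lr0=1e-4,                      # initial learning rate
    lrf=0.01,                      # final LR factor: 1e-6 / 1e-4
    cos_lr=True,                   # cosine-annealing schedule
    momentum=0.9,
    weight_decay=1e-4,
    degrees=10.0,                  # random rotation of +/- 10 degrees
    scale=0.2,                     # scale jitter of roughly 0.8-1.2x
    mosaic=1.0,                    # Mosaic augmentation enabled
)
```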

4.3 Evaluation Metrics

To quantitatively assess the performance of a traffic sign detection model, it is necessary to employ a comprehensive set of metrics that capture both localization accuracy and recognition robustness. In this study, we adopt indicators that are widely used in object detection benchmarks, supplemented by considerations specific to small-object detection. The primary evaluation criteria encompass Precision, Recall, Average Precision (AP), and mean Average Precision (mAP) under standardized Intersection over Union (IoU) thresholds. Furthermore, the frames per second (FPS) metric is reported to assess the real-time feasibility, which is imperative for traffic sign detection systems implemented on embedded platforms.

(1)   Precision and Recall

Precision (Prec) is defined as the proportion of predicted bounding boxes that are correct detections. Formally:

$$P = \frac{TP}{TP + FP}\tag{13}$$

where TP (true positives) are correctly identified targets and FP (false positives) are incorrect detections. High precision indicates that most predicted signs are indeed correct, reducing spurious alerts.

Recall (Rec) measures the proportion of ground-truth traffic signs that are successfully detected.

$$\mathrm{Rec} = \frac{TP}{TP + FN}\tag{14}$$

where FN (false negatives) are missed signs. A high recall signals effective coverage, crucial for safety-related scenarios where missing a speed limit or warning sign can be severe.

(2)   Average Precision (AP) and mean AP (mAP)

AP quantifies the area under the Precision–Recall curve for a specific class, providing a single numeric score for that category’s detection quality. In practice, detections are sorted by confidence scores, and precision is interpolated at varying recall thresholds.

mAP extends AP to multiple classes, computing the mean of the per-class AP values. In traffic sign detection, each of the 43 GTSRB categories contributes to mAP, reflecting the model’s ability to localize and classify a diverse range of sign types accurately.

(3)   Intersection over Union (IoU) Threshold

The IoU metric measures the overlap between predicted and ground-truth bounding boxes, defined as:

$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}\tag{15}$$

A common threshold (e.g., IoU ≥ 0.5) determines whether a detection is treated as a true positive or a false positive.

For more stringent evaluation, we may also report AP@IoU = 0.5 (the “PASCAL VOC” criterion) and AP@IoU = 0.5:0.95 (the “COCO” criterion), the latter computing an average of APs at IoU thresholds from 0.5 to 0.95 in 0.05 increments. Including stricter IoU thresholds is especially relevant for traffic signs, which are often small and require precise localization.
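As a small, self-contained illustration of Eq. (15) and the IoU ≥ 0.5 true-positive rule (from which Eqs. (13) and (14) are then computed), consider the following; the example boxes are made up.

```python
def iou(box_a, box_b):
    """Eq. (15) for two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

pred, gt = (100, 100, 140, 140), (105, 102, 138, 141)
score = iou(pred, gt)                      # ~0.77
is_true_positive = score >= 0.5            # PASCAL VOC criterion
print(f"IoU = {score:.2f}, TP at IoU>=0.5: {is_true_positive}")
```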

4.4 Ablation Experiments

All ablation experiments in this study were conducted on a workstation equipped with an Intel Core i7-11700 CPU (four cores, eight threads) and an NVIDIA GeForce RTX 4080 GPU, supported by 24 GB of system memory. The hardware configuration is summarized as follows:

In terms of the software environment, Ubuntu 20.04 (Linux) was utilized as the operating system, with PyTorch 1.11 serving as the primary deep learning framework. The supplementary libraries and tools encompassed CUDA 11.3, cuDNN 8.0.4, and OpenCV 4.6.0.6, thereby ensuring efficient GPU acceleration and image preprocessing, as shown in Table 4. The integrated development environment (IDE) offered a seamless debugging experience and facilitated efficient experimental record-keeping.


This configuration was selected to balance computational capability and practical resource constraints, allowing for the performance of iterative model refinements and ablation analyses within a reasonable runtime. Subsequent experiments, including the evaluation of the effects of SlimNeck, MCA, and MPDIoU modules, were executed under these uniform conditions to ensure consistency and reproducibility.

5  Results and Discussion

To evaluate the effectiveness of the proposed YOLO-SMM model, we conducted comprehensive experiments to assess its performance across multiple dimensions, including detection accuracy, inference speed, and robustness to varying input conditions. These experiments encompass resolution sensitivity analysis, cross-dataset generalization, and ablation studies, with comparisons against state-of-the-art detection algorithms on the GTSRB dataset, as well as additional datasets such as TT100K and LISA. The key metrics evaluated include mAP50, mAP50-90, precision, recall, parameter count, GFLOPs, and FPS, providing a holistic view of YOLO-SMM’s performance and efficiency. The following sections detail these experimental results, highlighting the contributions of the proposed SlimNeck, MCA, and MPDIoU components and their impact on real-world traffic sign detection scenarios.

5.1 Sensitivity Analysis of Proposed Model

To validate the robustness and scalability of YOLO-SMM across different input image sizes, we evaluated its detection accuracy and inference speed at three commonly used input resolutions (320 × 320, 416 × 416, and 640 × 640), as shown in Fig. 5. Table 5 presents the comparative results for mAP50 and FPS.


Figure 5: Performance of YOLO-SMM at different input resolutions


Table 6 presents the comparative results of YOLO-SMM against current state-of-the-art detection algorithms on the GTSRB dataset, including YOLOv5, YOLOv7, YOLOv8, RetinaNet, EfficientDet, and Faster R-CNN. The results demonstrate that YOLO-SMM achieves the highest performance in both mAP50 and mAP50-90, reaching 96.3% and 93.1%, respectively, while also attaining a precision of 94.1% and a recall of 93.0%. Despite having one of the smallest parameter counts (2.66 M) and the lowest computational load (7.49 GFLOPs), YOLO-SMM achieves an inference speed of 90.6 FPS, significantly surpassing larger models such as YOLOv7 and Faster R-CNN.


These findings indicate that our proposed enhancements (SlimNeck, MCA, and MPDIoU) not only improve detection performance, particularly in the consistent accuracy of small object recognition, but also effectively reduce model complexity and inference latency, demonstrating substantial potential for embedded deployment applications.

5.2 Statistical Validation of Proposed Model

To validate that the performance improvements of YOLO-SMM over the YOLOv8 baseline are statistically significant and not due to random variations, we conducted paired significance testing on per-class Average Precision (AP) scores using the GTSRB dataset, as shown in Table 7.


Specifically, a paired t-test between YOLO-SMM and YOLOv8 across 10 representative traffic sign categories yielded a test statistic of t = 20.16 and a p-value of 8.45 × 10−9, indicating that the observed improvement is statistically significant at the 1% level.

In addition, we conducted a non-parametric Wilcoxon signed-rank test, which produced a p-value of 0.00195, further confirming the statistical significance of the results.
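The two tests can be reproduced with SciPy given the per-class AP values of the two models; the arrays below are placeholders for illustration only, not the paper's actual per-class scores.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Placeholder per-class AP values for 10 categories (illustrative only).
ap_yolov8  = np.array([0.941, 0.932, 0.948, 0.925, 0.939, 0.951, 0.944, 0.930, 0.936, 0.947])
ap_yolosmm = np.array([0.958, 0.949, 0.962, 0.944, 0.955, 0.966, 0.960, 0.947, 0.953, 0.963])

t_stat, p_t = ttest_rel(ap_yolosmm, ap_yolov8)   # paired t-test on per-class AP
w_stat, p_w = wilcoxon(ap_yolosmm, ap_yolov8)    # Wilcoxon signed-rank test
print(f"paired t-test: t = {t_stat:.2f}, p = {p_t:.2e}")
print(f"Wilcoxon signed-rank: p = {p_w:.5f}")
```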

These findings provide strong evidence that the accuracy improvements achieved by YOLO-SMM are consistent and reliable, reinforcing the empirical results reported in earlier sections.

Although the proposed detector outperforms the baseline in every accuracy metric, its efficiency is not sacrificed; in fact, it improves throughput from 82.3 FPS to 90.6 FPS while trimming parameters by 11.6% (3.01 M → 2.66 M) and FLOPs by 8.7% (8.20 → 7.49). These gains stem mainly from the SlimNeck redesign: by replacing the original three-scale PAN/FPN neck with a one-shot aggregation + CSP topology built on GSConv blocks, redundant convolution channels are pruned and feature reuse is maximized. The resulting neck preserves the high-resolution pathway required for small traffic signs yet halves the per-scale convolution cost, striking a superior speed-accuracy balance. In practical terms, the model can process a 640 × 640 camera stream at more than 90 frames per second on a single RTX 4080, meeting the latency budgets of embedded ADAS platforms without resorting to aggressive input down-sampling.

Beyond efficiency, the Modified Coordinate Attention (MCA) module and the MPDIoU loss jointly elevate small-object performance. MCA injects direction-aware channel weights into each stage of the backbone, steering the network’s focus toward rows and columns that carry discriminative sign patterns even when the target occupies only a few pixels. MPDIoU augments conventional IoU optimization with a corner-distance penalty, forcing tighter box alignment and reducing localization errors that disproportionately harm IoU on tiny objects. As shown in Fig. 6, the synergy of SlimNeck’s fine-grained feature fusion, MCA’s position-sensitive re-weighting, and MPDIoU’s alignment-aware regression improves mAP50-90 by +0.9% (Fig. 6a), precision by +5% (Fig. 6b), and recall by +1.6% (Fig. 6c) compared with YOLOv8-n, while still delivering a lighter, faster model. Consequently, the detector is well suited for real-time traffic-sign perception stacks where both millisecond-level latency and high recognition fidelity are mission-critical.


Figure 6: The variation curves of precision, recall, and mAP50

5.3 Comparison with Other Algorithms

As shown in Table 6, the YOLO-SMM model demonstrates competitive performance compared to other advanced algorithms. It achieves the highest precision (87.9%) and recall (80.5%) among the models listed, along with an impressive mAP50 of 90.1%.

Table 8 gives a holistic, cross-framework comparison on GTSRB, while Fig. 7 displays qualitative detections on representative urban scenes. The proposed YOLOv8-SMM achieves the best mAP50 (96.3%) and the highest runtime throughput (90.6 FPS) even though it carries the smallest parameter budget (2.66 M) and the lowest computational load (7.49 GFLOPs) among all single-stage competitors. Relative to its parent YOLOv8-n, mAP50 rises by +1.6 pp with −11.6% parameters and −8.7% FLOPs; against the heavier one-stage baselines (YOLOv7-t, YOLOv5-s) it secures gains of +3.6 pp and +4.7 pp, while running 24 FPS and 19 FPS faster, respectively. Two-stage (Faster R-CNN) and transformer-based (DETR) detectors approach our accuracy (92.7%–94.0%) but incur 10×–50× more parameters and fall below 70 FPS, underscoring the real-time advantage of the proposed design.



Figure 7: The detection results of YOLOv8-SMM, YOLOv8, YOLOv5, YOLOv7

The visual study in Fig. 7 further highlights the practical effect of these numbers. In each row, YOLOv8-SMM (column 1) correctly localizes and classifies every traffic sign, even those that occupy fewer than 20 × 20 px, while its peers miss at least one instance (e.g., YOLOv8 drops the blue “no way, turn right” sign in row 4) or generate false positives (e.g., YOLOv5 and YOLOv7 confuse tree textures with speed-limit plates in row 2). Confidence scores are consistently higher for SMM (0.89–0.92) than for the other YOLO variants (0.74–0.86), reflecting the tighter localization delivered by the MPDIoU loss and the sharper feature focus produced by MCA. Qualitatively, SlimNeck’s fine-grained multi-scale fusion preserves the high-frequency details of distant plates, allowing the network to retain robust predictions under motion blur, low contrast, and occlusion (row 3). Taken together, the quantitative edge in Table 8 and the near-perfect visual outcomes in Fig. 7 confirm that the synergistic trio of SlimNeck, MCA, and MPDIoU elevates both efficiency and reliability, making YOLOv8-SMM an attractive choice for real-time, on-board traffic-sign perception systems where every missed or spurious alert directly impacts driving safety.

5.4 Cross-Dataset Generalization

To further validate the robustness of YOLO-SMM across diverse data distributions, we conducted transfer tests on two public traffic sign detection datasets: TT100K (Chinese traffic sign images) and LISA (American traffic sign images). Table 9 and Fig. 8 present the mAP50 performance of YOLOv8 and YOLO-SMM across the three datasets.



Figure 8: Cross-dataset performance comparison: GTSRB, TT100K, and LISA

The results show that YOLO-SMM achieves an mAP50 of 89.8% on TT100K, a +3.6 percentage-point improvement over YOLOv8, and a +2.8 percentage-point improvement on the LISA dataset. These consistent gains demonstrate that the proposed combination of modules (SlimNeck, MCA, and MPDIoU) adapts well across different countries, image styles, sign shapes, and environmental conditions.

These findings further confirm that YOLO-SMM not only excels on the original GTSRB dataset but also possesses robust potential for cross-scenario deployment.

5.5 Testing in Complex Scenarios

To comprehensively evaluate the robustness and performance boundaries of YOLO-SMM, we constructed an additional test set covering real-world traffic scenarios such as low nighttime illumination, blurry rainy days, partial occlusion, excessive lighting, and motion blur. As shown in Figs. 9 and 10, YOLO-SMM can still accurately detect traffic signs under most challenging conditions, demonstrating strong generalization ability. Notably, under low-light and occlusion scenarios, its recall rate outperforms that of YOLOv8.


Figure 9: YOLOv8 detection results under challenging conditions (low light, motion blur, etc.)


Figure 10: YOLO-SMM detection results under challenging conditions (low light, motion blur, etc.)

To further analyze performance boundaries, we collected representative failure cases from the GTSRB and TT100K test sets, as illustrated in Fig. 11. These include: (a) extremely small signs (e.g., under 10 × 10 pixels), (b) motion blur, (c) extreme lighting conditions (e.g., backlight glare), and (d) partial occlusion.


Figure 11: Failure cases and comparison between YOLOv8 and YOLO-SMM

The results show that YOLO-SMM generally outperforms YOLOv8, especially in handling occlusion and blur (Fig. 11a,b). However, in edge cases such as very small objects or harsh lighting, both models exhibit some false negatives (Fig. 11c,d). These results suggest that future work could explore integrating super-resolution modules or lightweight fine-grained attention to improve performance under extreme conditions.

This dual analysis—on both diverse scenes and failure cases—not only validates the robustness of YOLO-SMM in real-world deployments but also provides concrete directions for further enhancement.

6  Conclusion

In this paper, we propose YOLO-SMM, an enhanced YOLOv8 model tailored for real-time traffic sign detection, with a particular focus on small-object performance. The integration of pivotal innovations, namely SlimNeck, Modified Coordinate Attention (MCA), and MPDIoU loss, has yielded substantial enhancements in the accuracy and efficiency of detection. The proposed SlimNeck module is a lightweight and efficient multi-scale fusion pipeline that significantly reduces computational overhead while preserving fine-grained features critical for small traffic signs. The MCA module facilitates precise spatial attention, thereby ensuring that the network can focus on the most relevant regions, even in the presence of cluttered backgrounds. Additionally, the MPDIoU loss addresses the challenges posed by corner misalignment in small object bounding boxes, thereby ensuring stable gradients and tighter localization for traffic signs.

Extensive experimentation on the GTSRB dataset demonstrated that YOLO-SMM outperforms the baseline YOLOv8 as well as other state-of-the-art methods, such as YOLOv7 and YOLOv5, across all evaluation metrics. The model demonstrated notable efficacy, attaining an impressive 96.3% mAP50, while concurrently maintaining processing speeds of 90.6 FPS and a reduced parameter count. These results substantiate the efficacy of our approach, thereby establishing it as a viable solution for implementation in embedded systems characterized by stringent real-time and resource constraints.

In subsequent studies, we aspire to investigate additional optimizations to enhance small-object detection under challenging real-world conditions, such as extreme weather and occlusion. Furthermore, the proposed method is planned to be extended to more complex traffic scenarios involving dynamic object recognition, multi-object tracking, and vehicle interaction modeling. In addition, we are exploring hardware acceleration techniques to further minimize inference time, ensuring that YOLO-SMM is ready for seamless integration into autonomous driving and advanced driver-assistance systems.

Acknowledgement: The authors would like to thank the University of Malaya and the Ministry of Higher Education, Malaysia, for supporting this work under research grants FRGS/1/2023/TK10/UM/02/3 and GPF020A-2023.

Funding Statement: This work is supported by the University of Malaya and the Ministry of Higher Education, Malaysia, via Fundamental Research Grant Scheme No. FRGS/1/2023/TK10/UM/02/3.

Author Contributions: Hui Chen: Conceptualization, Methodology, Software, Writing—original draft, Investigation, Validation, Visualization; Mohammed A. H. Ali: Conceptualization, Supervision, Methodology, Writing—review & editing, Validation, Funding acquisition; Bushroa Abd Razak: Supervision; Zhenya Wang: Writing—review; Yusoff Nukman: Supervision, Review & editing; Shikai Zhang: Writing—review & editing; Zhiwei Huang: Writing—review & editing; Ligang Yao: Writing—review; Mohammad Alkhedher: Funding acquisition, Writing—review & editing. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The authors do not have permission to share data.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Zhang J, Zou X, Kuang LD, Wang J, Sherratt RS, Yu X. CCTSDB 2021: a more comprehensive traffic sign detection benchmark. Hum-Centric Comput Inf Sci. 2022;12:23. [Google Scholar]

2. Kaur R, Singh S. A comprehensive review of object detection with deep learning. Digit Signal Process. 2023;132(33):103812. doi:10.1016/j.dsp.2022.103812. [Google Scholar] [CrossRef]

3. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. doi:10.1109/tpami.2016.2577031. [Google Scholar] [PubMed] [CrossRef]

4. Vijayakumar A, Vairavasundaram S. YOLO-based object detection models: a review and its applications. Multimed Tools Appl. 2024;83(35):83535–74. doi:10.1007/s11042-024-18872-y. [Google Scholar] [CrossRef]

5. Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI, USA. p. 6517–25. [Google Scholar]

6. Chen H, Ali MAH, Nukman Y, Razak BA, Turaev S, Chen Y, et al. Computational methods for automatic traffic signs recognition in autonomous driving on road: a systematic review. Results Eng. 2024;24:103553. doi:10.1016/j.rineng.2024.103553. [Google Scholar] [CrossRef]

7. Tan M, Pang R, Le QV. Efficientdet: scalable and efficient object detection. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13–19; Seattle, WA, USA. p. 10781–90. [Google Scholar]

8. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2018 Jun 18–23; Salt Lake City, UT, USA. p. 4510–20. [Google Scholar]

9. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. 2017. [Google Scholar]

10. Han K, Wang Y, Tian Q, Guo J, Xu C, Xu C. GhostNet: more features from cheap operations. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13–19; Seattle, WA, USA. p. 1577–86. [Google Scholar]

11. Li H, Li J, Wei H, Liu Z, Zhan Z, Ren Q. Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J Real-Time Image Proc. 2024;21(3):62. doi:10.1007/s11554-024-01436-6. [Google Scholar] [CrossRef]

12. Woo S, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, editors. Lecture notes in computer science. Cham, Switzerland: Springer International Publishing; 2018. p. 3–19. doi:10.1007/978-3-030-01234-2_1. [Google Scholar] [CrossRef]

13. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18–23; Salt Lake City, UT, USA. p. 7132–41. [Google Scholar]

14. Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20–25; Nashville, TN, USA. p. 13713–22. [Google Scholar]

15. Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019 Jun 15–20; Long Beach, CA, USA. p. 658–66. [Google Scholar]

16. Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU loss: faster and better learning for bounding box regression. Proc AAAI Conf Artif Intell. 2020;34(7):12993–3000. doi:10.1609/aaai.v34i07.6999. [Google Scholar] [CrossRef]

17. Wang CY, Bochkovskiy A, Liao HYM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. p. 7464–75. [Google Scholar]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.