Open Access

ARTICLE


Intelligent Semantic Segmentation with Vision Transformers for Aerial Vehicle Monitoring

Moneerah Alotaibi*

Department of Computer Science, College of Science and Humanities Dawadmi, Shaqra University, Dawadmi, 11911, Saudi Arabia

* Corresponding Author: Moneerah Alotaibi

Computers, Materials & Continua 2026, 86(1), 1-20. https://doi.org/10.32604/cmc.2025.069195

Abstract

Advanced traffic monitoring systems encounter substantial challenges in vehicle detection and classification due to the limitations of conventional methods, which often demand extensive computational resources and struggle with diverse data acquisition techniques. This research presents a novel approach for vehicle classification and recognition in aerial image sequences, integrating multiple advanced techniques to enhance detection accuracy. The proposed model begins with preprocessing using Multiscale Retinex (MSR) to enhance image quality, followed by Expectation-Maximization (EM) Segmentation for precise foreground object identification. Vehicle detection is performed using the state-of-the-art YOLOv10 framework, while feature extraction incorporates Maximally Stable Extremal Regions (MSER), Dense Scale-Invariant Feature Transform (Dense SIFT), and Zernike Moments Features to capture distinct object characteristics. Feature optimization is further refined through a Hybrid Swarm-based Optimization algorithm, ensuring optimal feature selection for improved classification performance. The final classification is conducted using a Vision Transformer, leveraging its robust learning capabilities for enhanced accuracy. Experimental evaluations on benchmark datasets, including UAVDT and the Unmanned Aerial Vehicle Intruder Dataset (UAVID), demonstrate the superiority of the proposed approach, achieving an accuracy of 94.40% on UAVDT and 93.57% on UAVID. The results highlight the efficacy of the model in significantly enhancing vehicle detection and classification in aerial imagery, outperforming existing methodologies and offering a statistically validated improvement for intelligent traffic monitoring systems.

Keywords

Machine learning; semantic segmentation; remote sensors; deep learning; object monitoring system

1  Introduction

The rapid urbanization and growing number of vehicles have intensified the need for advanced traffic monitoring systems [1]. Accurate vehicle detection and classification are vital to intelligent transportation systems (ITS), supporting traffic control and security. Traditional methods using handcrafted features and classical machine learning often lack robustness in aerial imagery due to issues like illumination changes, occlusion, and perspective distortion. These approaches also demand high computational resources and struggle with adaptability across diverse platforms like drones and satellites. Recent advancements in deep learning, particularly semantic segmentation and transformer architectures, offer improved solutions. This study proposes a semantic segmentation framework for aerial vehicle monitoring by integrating Vision Transformers (ViTs) with enhanced preprocessing and optimization techniques. Multiscale Retinex (MSR) improves image contrast, while EM segmentation isolates foreground objects. YOLOv10 provides fast and precise vehicle detection [2]. A combination of MSER, Dense SIFT, and Zernike Moments captures detailed features, which are refined using a Hybrid Swarm-based Optimization strategy. ViTs then process spatial dependencies efficiently. Experiments on UAVDT and UAVID datasets demonstrate superior performance over conventional methods [3], confirming the framework's effectiveness for real-world, scalable traffic surveillance. Unlike prior approaches that rely solely on handcrafted descriptors or on deep learning models, our method uniquely integrates region-, texture-, and shape-based features with a novel PSO–ACO-based optimization and classifies them using a Vision Transformer, enabling a robust and scalable framework for aerial vehicle monitoring.

UAV-based vehicle detection methods fall into two main categories: traditional handcrafted techniques and deep learning frameworks. Handcrafted methods (e.g., SIFT, MSER, HOG) are interpretable and lightweight but struggle with aerial challenges like scale, rotation, and illumination changes. Deep learning models such as YOLOv5, YOLOv8, and Faster R-CNN offer higher accuracy but require large datasets, lack interpretability, and generalize poorly without retraining. Vision Transformers (ViTs) have shown promise in capturing spatial dependencies but are rarely combined with handcrafted features in UAV traffic monitoring. Existing fusion methods often neglect intelligent feature selection, leading to redundancy and inefficiency. To overcome these gaps, this study introduces a unified framework that integrates MSR preprocessing and EM-based segmentation, multi-domain handcrafted feature extraction (MSER, Dense SIFT, and Zernike Moments), hybrid PSO–ACO optimization, and a ViT classifier. This hybrid approach balances interpretability, performance, and efficiency, offering a robust solution for real-time aerial vehicle recognition. The rest of this article is structured as follows: Section 2 reviews existing aerial vehicle detection and classification methods, highlighting their limitations. Section 3 details the proposed framework, including preprocessing, segmentation, feature extraction, and classification. Section 4 presents experimental analysis, evaluating performance on benchmark datasets against state-of-the-art methods. Finally, Section 5 summarizes contributions and outlines future research directions.

2  Related Work

Aerial vehicle monitoring has become increasingly important in transportation, surveillance, and urban planning. Challenges like scale variation, occlusion, and complex backgrounds affect detection accuracy. Traditional methods use handcrafted features, while deep learning offers automated, more effective solutions. This section reviews both approaches, outlining their strengths, limitations, and improvement areas.

2.1 Machine Learning Based Approaches

Ayush Kumar et al. [4] addressed small vehicle detection in aerial imagery using PCA for dimensionality reduction, improving accuracy by reducing misclassifications. Among deep models, ResNet50 outperformed MobileNetv1 (85% vs. 76.25%), emphasizing model selection. However, ambiguity in detecting small objects remained a challenge. Qureshi et al. [5] proposed vehicle detection via channel differencing, blob detection, and shape matching, but the method lacked robustness against varying angles and dataset diversity. Lin et al. [6] introduced the VAID dataset with 6000 aerial images and achieved 88% accuracy using a similar approach, though its performance dropped under non-uniform conditions. Chen et al. [7] combined a Deformable Part Model (DPM) with a CNN to enhance structural feature capture and robustness, especially in occluded or angled views.

2.2 Deep Learning Based Approaches

Zhuang et al. [8] addressed inefficiencies in UAV detection caused by patch-based image splitting and proposed a lightweight multi-task classification (MTC) network to enhance speed. However, its performance in dense scenes and its accuracy trade-offs remain unexamined. Zuraimi et al. [9] combined YOLOv4 with DeepSORT, achieving 82.08% precision, but lacked solutions for occlusion and lighting issues. Yang et al. [10] introduced BFEN with SLPN and PNW, reaching 88.71% mAP on DETRAC, though it was not tested in rural settings and its computational complexity was unclear. Rani et al. [11] merged Faster R-CNN and SSD, with performance dropping from 85.22% (KITTI) to 64.83% (PASCAL2007), revealing limitations under varied conditions. Basak and Suresh [12] achieved 87.56% on CityCam using CNNs in low-res traffic scenes, but environmental variability was not assessed. Lyu et al. [13] used a CNN-based segmentation model on UAVID, achieving 85% accuracy, though they struggled with small object tracking and temporal consistency.

Unlike these methods, our proposed approach addresses small object detection, occlusions, and scene diversity. It employs MSR for illumination enhancement, EM for accurate segmentation, and YOLOv10 for robust detection. MSER, Dense SIFT, and Zernike Moments ensure rich feature extraction, optimized via Hybrid Swarm-based selection. Finally, the Vision Transformer handles scale variance and background clutter through self-attention, delivering high accuracy and generalization in UAV-based vehicle monitoring. While existing works have explored individual use of descriptors or CNN-based architectures, few have combined handcrafted multi-view features with transformer-based classification. To the best of our knowledge, this is the first approach that fuses MSER, Dense SIFT, and Zernike features; applies a hybrid PSO–ACO optimization for redundancy reduction; and leverages a Vision Transformer for final recognition in UAV-based traffic monitoring.

3  Materials and Methods

The proposed framework offers an efficient solution for aerial vehicle detection and classification. It starts with converting image sequences into frames, followed by noise removal and MSR for contrast enhancement. EM segmentation isolates foreground objects, and YOLOv10 performs accurate vehicle detection. Features are extracted using MSER, Dense SIFT, and Zernike Moments, then refined via Hybrid Swarm-Based Optimization. Finally, the Vision Transformer classifies vehicles using its attention mechanism. This streamlined pipeline improves both accuracy and efficiency, as illustrated in Fig. 1.


Figure 1: Intelligent traffic monitoring framework integrating preprocessing, segmentation, detection, and classification for accurate real-time analysis

3.1 Image Preprocessing

Preprocessing is crucial in aerial vehicle detection due to common issues like illumination variation, shadows, and low contrast from atmospheric or sensor limitations. To address this, Multiscale Retinex (MSR) is applied as an advanced enhancement technique that normalizes illumination while preserving details [14]. Unlike conventional methods, MSR enhances both dark and bright regions by estimating reflectance across multiple Gaussian scales, making vehicles more visible in challenging backgrounds. This makes MSR highly suitable for aerial imagery. The mathematical formulation is shown in Eq. (1):

$R(x,y) = \sum_{s=1}^{S} w_s \left[\log I(x,y) - \log\!\left(I(x,y) * G_s(x,y)\right)\right]$ (1)

here, $R(x,y)$ is the enhanced reflectance, $I(x,y)$ is the input image, $G_s(x,y)$ is the Gaussian function at scale $s$ with standard deviation $\sigma_s$, $*$ denotes convolution, $S$ is the number of scales, and $w_s$ are scale weights summing to 1. MSR's multi-scale design enhances details at various levels while suppressing noise. A normalization step, shown in Eq. (2), balances contrast across the image:

$I_{MSR}(x,y) = \alpha\,\dfrac{R(x,y) - \min(R)}{\max(R) - \min(R)} + \beta$ (2)

where IMSR(x,y) is the final enhanced image, α and β are scaling parameters to control brightness and contrast adaptation, and min(R) and max(R) define the minimum and maximum reflectance values, respectively. This transformation ensures that the enhanced image remains within the valid intensity range, preventing overexposure or excessive darkening. The output of the preprocessing can be seen in Fig. 2.


Figure 2: Preprocessing via MSR (a) Original frame (b) Preprocessed frame
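For illustration, the following is a minimal sketch of the MSR enhancement in Eqs. (1) and (2) using OpenCV and NumPy; the scale set `sigmas` and the scaling parameters `alpha` and `beta` are illustrative defaults rather than the values used in this work.

```python
import cv2
import numpy as np

def multiscale_retinex(image, sigmas=(15, 80, 250), alpha=255.0, beta=0.0):
    """Multiscale Retinex (Eq. 1) followed by min-max normalization (Eq. 2)."""
    img = image.astype(np.float64) + 1.0                 # avoid log(0)
    weights = [1.0 / len(sigmas)] * len(sigmas)          # equal scale weights summing to 1

    reflectance = np.zeros_like(img)
    for w, sigma in zip(weights, sigmas):
        surround = cv2.GaussianBlur(img, (0, 0), sigma)  # I(x,y) * G_s(x,y)
        reflectance += w * (np.log(img) - np.log(surround))

    r_min, r_max = reflectance.min(), reflectance.max()  # Eq. (2): rescale to a valid range
    enhanced = alpha * (reflectance - r_min) / (r_max - r_min + 1e-8) + beta
    return np.clip(enhanced, 0, 255).astype(np.uint8)
```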

3.2 Segmentation: Expectation-Maximization (EM)

Segmentation is vital in aerial vehicle detection to separate vehicles from complex backgrounds, enabling accurate feature extraction and classification. Given challenges like lighting variation and occlusions, the Expectation-Maximization (EM) algorithm is applied as a robust probabilistic segmentation method [15]. Unlike traditional thresholding or edge-based techniques, EM uses Gaussian mixtures to model pixel intensities and iteratively refines clusters, ensuring adaptive and precise segmentation, as shown in Eq. (3):

$P\!\left(I(x,y)\mid\theta\right) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(I(x,y)\mid\mu_k,\sigma_k^2\right)$ (3)

where $P(I(x,y)\mid\theta)$ represents the probability of pixel intensity $I(x,y)$ given the model parameters $\theta$, $\pi_k$ is the weight of the $k$-th Gaussian component, ensuring $\sum_{k=1}^{K}\pi_k = 1$, and $\mathcal{N}(I(x,y)\mid\mu_k,\sigma_k^2)$ denotes a Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$ for the $k$-th cluster. The EM algorithm iteratively refines the segmentation through the posterior probability that each pixel belongs to a specific Gaussian component, using Bayes' theorem as given in Eq. (4).

$\gamma_k(x,y) = \dfrac{\pi_k\, \mathcal{N}\!\left(I(x,y)\mid\mu_k,\sigma_k^2\right)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\!\left(I(x,y)\mid\mu_j,\sigma_j^2\right)}$ (4)

where γk(x,y) represents the probability that pixel I(x,y) belongs to cluster k. By leveraging the EM algorithm, aerial images are effectively partitioned into meaningful regions, isolating vehicles from complex backgrounds with high accuracy. The results of the segmentation can be depicted in Fig. 3.


Figure 3: Segmentation via EMS (a) Original Image (b) Segmented Output
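A minimal sketch of intensity-based EM segmentation is given below using scikit-learn's GaussianMixture, which is fitted with the EM algorithm; the two-component setting and the brighter-component foreground heuristic are simplifying assumptions, not the exact configuration of the proposed pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def em_segmentation(gray, n_components=2, random_state=0):
    """Fit a Gaussian mixture to pixel intensities (Eq. 3) and label each pixel
    by its most probable component (Eq. 4)."""
    pixels = gray.reshape(-1, 1).astype(np.float64)
    gmm = GaussianMixture(n_components=n_components, random_state=random_state)
    labels = gmm.fit(pixels).predict(pixels).reshape(gray.shape)
    # Assumption: the brighter mixture component corresponds to foreground objects
    foreground = int(np.argmax(gmm.means_.ravel()))
    return (labels == foreground).astype(np.uint8) * 255
```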

Comparative Evaluation of Segmentation Methods

To validate the effectiveness of EM segmentation in aerial scenes, we compared it against two commonly used unsupervised segmentation methods: Otsu's thresholding and K-means clustering. All three methods were evaluated on UAVDT frames, using Intersection over Union (IoU) and Dice Similarity Coefficient (DSC) to measure performance. Table 1 shows the quantitative comparison of the segmentation strategies.

[Table 1: Quantitative comparison of segmentation methods (IoU and DSC) on UAVDT frames]

The results demonstrate that the EM algorithm consistently provides better foreground object separation, particularly in complex aerial backgrounds with varied lighting and occlusion. This strong performance supports its integration in our proposed vehicle monitoring pipeline.
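For reference, the two metrics can be computed from binary masks as follows; `pred` and `gt` are assumed to be same-sized arrays with foreground pixels set to 1.

```python
import numpy as np

def iou_and_dice(pred, gt):
    """Intersection over Union and Dice Similarity Coefficient for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2.0 * inter / total if total else 1.0
    return iou, dice
```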

3.3 Vehicle Detection

Accurate vehicle detection is essential for aerial surveillance applications, including traffic monitoring and urban planning. However, challenges such as scale variation, occlusion, and dense traffic complicate the task of aerial detection. This study employs YOLOv10, a cutting-edge model designed for fast and precise detection in complex scenes. Unlike region-based methods, YOLO processes the entire image in one pass, enhancing efficiency for large-scale UAV imagery. YOLOv10 boosts performance through an upgraded backbone, an improved Feature Pyramid Network (FPN), and an anchor-free design, making it particularly effective at detecting small vehicles. The detection pipeline consists of feature extraction, bounding box regression, and classification, utilizing a lightweight convolutional backbone:

$F_l = \phi\!\left(W_l * I + b_l\right)$ (5)

where, Fl represents the feature maps at layer l, Wl and bl denote the weights and biases of the convolutional layer, ϕ is the activation function (e.g., SiLU or Leaky ReLU). Following feature extraction, the detection head generates bounding box predictions by regressing the object coordinates (x, y, w, h) along with the confidence score C. The bounding box regression is defined as in Eq. (6):

$\hat{B} = \left(\sigma(t_x) + c_x,\; \sigma(t_y) + c_y,\; p_w e^{t_w},\; p_h e^{t_h}\right)$ (6)

where $\hat{x}, \hat{y}, \hat{w}, \hat{h}$ are the predicted bounding box center coordinates, width, and height, $(c_x, c_y)$ denote the coordinates of the cell in the feature grid, $(p_w, p_h)$ represent prior anchor dimensions, $t_x, t_y, t_w, t_h$ are the predicted offset values, and $\sigma(\cdot)$ is the sigmoid function ensuring the output remains in a valid range. The final classification stage assigns a confidence score $C$ to each detected object using the SoftMax activation function as defined in Eq. (7):

$C = \dfrac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$ (7)

where $z_i$ represents the raw class score for a specific category (i.e., vehicle) and $N$ is the total number of classes. This probability distribution ensures accurate classification of detected vehicles. Table 2 shows the parameters used for the YOLOv10 algorithm. The outcome of YOLOv10 can be seen in Fig. 4.

[Table 2: Parameters used for the YOLOv10 detector]


Figure 4: Precise vehicle detection outcomes using YOLOv10 for intelligent traffic monitoring
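A minimal inference sketch is shown below, assuming the Ultralytics package with YOLOv10 support; the checkpoint name `yolov10n.pt` is a generic pretrained weight and stands in for the model fine-tuned in this work.

```python
from ultralytics import YOLO  # assumes an Ultralytics build with YOLOv10 support

model = YOLO("yolov10n.pt")   # generic pretrained checkpoint, stand-in for the fine-tuned model

def detect_vehicles(frame, conf_threshold=0.25):
    """Single-pass detection; returns (x1, y1, x2, y2, score, class_id) tuples."""
    results = model(frame, conf=conf_threshold, verbose=False)
    boxes = results[0].boxes
    return [(*map(float, xyxy), float(score), int(cls))
            for xyxy, score, cls in zip(boxes.xyxy, boxes.conf, boxes.cls)]
```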

3.4 Feature Extraction

Feature extraction plays a vital role in vehicle classification by transforming raw images into meaningful representations. This study employs MSER, Dense SIFT, and Zernike Moments to boost detection accuracy. MSER detects stable regions under lighting changes, Dense SIFT captures multi-scale textures, and Zernike Moments offer rotation-invariant shape descriptors. Their integration enhances recognition by addressing scale variation, clutter, and occlusion in aerial imagery.

3.4.1 Maximally Stable Extremal Regions (MSER)

Maximally Stable Extremal Regions (MSER) is a region-based feature extraction method known for its stability under varying conditions, making it valuable for aerial vehicle detection. It identifies consistent regions despite changes in illumination, scale, or viewpoint by analyzing image areas with intensity values significantly different from their surroundings. By selecting stable regions across intensity thresholds, MSER effectively isolates vehicle-like structures in complex aerial scenes. Mathematically, given an intensity function I(x, y), an extremal region R satisfies the condition described in Eq. (8).

$\forall\, p \in R,\ \forall\, q \in \partial R:\quad I(p) > t \ \ \text{or} \ \ I(p) < t$ (8)

The stability of region R is determined by evaluating its area change rate across varying thresholds, given in Eq. (9):

$S(R) = \dfrac{\left|R_t \setminus R_{t-\Delta t}\right|}{\left|R_{t-\Delta t}\right|}$ (9)

where |Rt| represents the number of pixels in the region at threshold t, and Δt is the threshold variation step. Regions with minimal S(R) values are selected as maximally stable, ensuring the detection of consistent and repeatable features. The results of the MSER can be seen in Fig. 5.


Figure 5: MSER-based feature extraction over vehicles
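The region extraction step can be sketched with OpenCV's MSER implementation as follows; the `delta`, `min_area`, and `max_area` values are illustrative and correspond to the threshold step and region-size limits discussed above.

```python
import cv2

def extract_mser_regions(gray, delta=5, min_area=60, max_area=14400):
    """Detect maximally stable extremal regions on a grayscale image."""
    mser = cv2.MSER_create()
    mser.setDelta(delta)        # threshold step, corresponds to Delta-t in Eq. (9)
    mser.setMinArea(min_area)   # discard regions smaller than min_area pixels
    mser.setMaxArea(max_area)   # discard regions larger than max_area pixels
    regions, bboxes = mser.detectRegions(gray)
    return regions, bboxes
```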

3.4.2 Dense Scale-Invariant Feature Transform (Dense SIFT)

Dense SIFT enhances vehicle recognition by capturing fine-grained texture and structural details across multiple scales. Unlike traditional SIFT, which operates on keypoints, Dense SIFT applies descriptors uniformly across the image grid [16], ensuring better coverage. This is particularly effective for aerial detection where scale, orientation, and lighting vary. The dense computation yields a richer structural representation, boosting classification in cluttered scenes. Mathematically, it begins by constructing a scale-space representation of image $I(x,y)$ using a Gaussian function $G(x,y,\sigma)$:

$L(x,y,\sigma) = G(x,y,\sigma) * I(x,y)$ (10)

where $G(x,y,\sigma) = \dfrac{1}{2\pi\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}}$. This ensures that features remain invariant to scale changes. Gradients are then calculated within local regions to form orientation histograms, with the dominant gradient given in Eq. (11):

$\theta(x,y) = \tan^{-1}\!\left(L_y / L_x\right)$ (11)

Dense SIFT’s strength lies in its ability to provide a detailed texture description, making it particularly effective in aerial vehicle detection where shape alone may not be sufficient for accurate classification. The results of the SIFT can be depicted in Fig. 6.


Figure 6: Dense SIFT-based feature extraction of vehicles over aerial images
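A minimal sketch of dense descriptor computation with OpenCV is shown below; SIFT descriptors are evaluated on a regular grid of keypoints, and the `step` and `patch_size` values are illustrative choices.

```python
import cv2

def dense_sift(gray, step=8, patch_size=16):
    """Compute SIFT descriptors on a regular grid instead of detected keypoints."""
    sift = cv2.SIFT_create()
    h, w = gray.shape[:2]
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    keypoints, descriptors = sift.compute(gray, keypoints)
    return keypoints, descriptors   # descriptors: one 128-D vector per grid point
```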

3.4.3 Zernike Moments

Zernike Moments (ZM) are compact and discriminative shape descriptors that capture global structure while ensuring rotational invariance, which is ideal for UAV-based vehicle detection where objects vary in orientation and scale. They represent image regions using orthogonal Zernike polynomials, reducing redundancy and preserving essential shape features [17]. Mathematically, Zernike Moments of order $n$ and repetition $m$ for an image function $I(x,y)$ are defined as:

$Z_{nm} = \dfrac{n+1}{\pi} \iint_{\rho \le 1} I(\rho,\theta)\, V_{nm}^{*}(\rho,\theta)\, \rho\, d\rho\, d\theta$ (12)

where $V_{nm}(\rho,\theta)$ is the Zernike polynomial given by $V_{nm}(\rho,\theta) = R_{nm}(\rho)\, e^{jm\theta}$, and $R_{nm}(\rho)$ is the radial polynomial $R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \dfrac{(-1)^s (n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\, \rho^{\,n-2s}$. Here, $\rho$ and $\theta$ represent the polar coordinates of the image region, and $V_{nm}^{*}(\rho,\theta)$ denotes the complex conjugate of the Zernike polynomial. The output of ZM can be seen in Fig. 7.


Figure 7: ZM-based feature extraction
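A compact sketch of Zernike moment extraction is given below using the mahotas library, which is not among the dependencies listed in Section 4 and is used here purely for illustration.

```python
import numpy as np
import mahotas  # illustrative dependency, not listed in Section 4

def zernike_shape_descriptor(patch, degree=8):
    """Rotation-invariant Zernike moment magnitudes (Eq. 12) of a vehicle patch."""
    radius = min(patch.shape[:2]) // 2   # unit-disk radius taken as half the patch size
    return mahotas.features.zernike_moments(patch.astype(np.uint8), radius, degree=degree)
```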

3.5 Feature Optimization

Feature optimization improves aerial vehicle detection by selecting relevant features and reducing redundancy. This study uses Hybrid Swarm-Based Optimization (HSO), combining PSO’s global search with ACO’s efficient pathfinding. HSO enhances classification accuracy and avoids premature convergence by refining features from MSER, Dense SIFT, and Zernike Moments. PSO updates positions using personal and global bests to optimize feature subsets:

$v_i^{t+1} = w\, v_i^{t} + c_1 r_1 \left(p_{best,i} - x_i^{t}\right) + c_2 r_2 \left(g_{best} - x_i^{t}\right)$ (13)

$x_i^{t+1} = x_i^{t} + v_i^{t+1}$ (14)

where vit and xit represent the velocity and position of the ith particle at iteration t, w is the inertia weight, c1 and c2 are cognitive and social coefficients, and r1, r2 are random numbers between 0 and 1. The ACO component refines feature selection by constructing an optimal subset through pheromone-guided probabilistic selection.

$P_i = \dfrac{\tau_i^{\alpha}\, \eta_i^{\beta}}{\sum_{j \in S} \tau_j^{\alpha}\, \eta_j^{\beta}}$ (15)

where $P_i$ is the probability of selecting feature $i$, $\tau_i$ is the pheromone level, $\eta_i$ is the heuristic information (feature importance score), and $\alpha$, $\beta$ control the influence of pheromone and heuristic information. A 3D graph of the hybrid swarm-based optimized features across vehicle classes is shown in Fig. 8.


Figure 8: 3D hybrid swarm-based optimized features graph across vehicle classes

Table 3 lists the key parameter values used in the hybrid PSO–ACO optimization process, selected based on experimental tuning and literature guidance.

[Table 3: Key parameter values for the hybrid PSO–ACO optimization]

The optimization terminates when either the maximum number of iterations is reached or no improvement in classification accuracy is observed over 10 consecutive iterations. The final subset is chosen based on the highest validation accuracy achieved using the Vision Transformer classifier. Algorithm 1 below shows the flow of the hybrid swarm optimization.

[Algorithm 1: Hybrid swarm-based optimization for feature selection]
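The sketch below illustrates, under simplifying assumptions, how Eqs. (13)–(15) can be combined for binary feature selection; the surrogate fitness function (cross-validated logistic regression accuracy) and all parameter defaults are illustrative and differ from the ViT-based evaluation used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def hybrid_pso_aco_select(X, y, n_particles=20, n_iters=30,
                          w=0.7, c1=1.5, c2=1.5, alpha=1.0, beta=1.0,
                          evaporation=0.1, seed=0):
    """Simplified hybrid PSO-ACO feature selection following Eqs. (13)-(15)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]

    positions = rng.standard_normal((n_particles, n_features))  # continuous particle positions
    velocities = np.zeros((n_particles, n_features))
    pheromone = np.ones(n_features)                              # tau_i, one value per feature
    # eta_i: absolute correlation of each feature with the labels as a crude relevance score
    heuristic = np.nan_to_num(np.abs(np.corrcoef(X.T, y)[:-1, -1])) + 1e-6

    def to_mask(pos):
        # ACO-style bias (Eq. 15): selection probability grows with pheromone and heuristic
        bias = (pheromone ** alpha) * (heuristic ** beta)
        bias = bias / bias.max()
        return (1.0 / (1.0 + np.exp(-pos))) * bias > 0.5

    def fitness(mask):
        if not mask.any():
            return 0.0
        clf = LogisticRegression(max_iter=200)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()

    pbest = positions.copy()
    pbest_fit = np.array([fitness(to_mask(p)) for p in positions])
    gbest = pbest[np.argmax(pbest_fit)].copy()

    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, n_features))
        velocities = (w * velocities
                      + c1 * r1 * (pbest - positions)            # Eq. (13)
                      + c2 * r2 * (gbest - positions))
        positions = positions + velocities                       # Eq. (14)

        for i, pos in enumerate(positions):
            fit = fitness(to_mask(pos))
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos.copy(), fit
        gbest = pbest[np.argmax(pbest_fit)].copy()

        # evaporate and reinforce pheromone on features kept by the best particle
        pheromone = (1.0 - evaporation) * pheromone + evaporation * to_mask(gbest)

    return to_mask(gbest)
```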

3.6 Classification

The Vision Transformer (ViT) is used for classifying detected vehicles in aerial imagery, utilizing self-attention to capture long-range spatial dependencies and patterns. Unlike CNNs, ViT learns global relationships across the entire image, making it effective for handling scale, occlusion, and perspective variations [18]. The process starts by dividing the image into non-overlapping patches, which are linearly embedded and passed, along with a classification token, into the transformer encoder. Multi-head self-attention (MHSA) then models dependencies between all patches, formulated as:

$A_{ij} = \dfrac{\exp\!\left(\frac{(W_Q X_i)(W_K X_j)^{T}}{\sqrt{d_k}}\right)}{\sum_{j=1}^{N} \exp\!\left(\frac{(W_Q X_i)(W_K X_j)^{T}}{\sqrt{d_k}}\right)}$ (16)

where Xi represents the embedded feature of the ith patch, WQ, WK, and WV are the trainable query, key, and value projection matrices, and dk is the key dimensionality for normalization. This attention mechanism assigns a dynamic weight Aij to each patch, allowing the network to focus on the most relevant features. To further refine feature representation, each transformed patch undergoes a feed-forward transformation incorporating non-linearity and residual connections, formulated as follows:

$Z_i^{(l+1)} = W_2\, \sigma\!\left(W_1 Z_i^{(l)} + b_1\right) + b_2 + Z_i^{(l)}$ (17)

where Zi(l) is the transformed feature vector at layer l, W1 and W2 are weight matrices, b1 and b2 are biases, and σ(⋅) represents a non-linear activation function. Residual connection ensures gradient stability and preserves crucial low-level information while allowing deep feature learning. The final classification is performed by extracting the learned class token and mapping it to output probabilities via a SoftMax function. To enhance robustness, second-order moment-based entropy regularization is introduced to penalize uncertain predictions, defined as:

$L_{reg} = -\lambda \sum_{i=1}^{C} p_i \log p_i + \alpha \sum_{i=1}^{C} p_i^{2}$ (18)

where pi represents the predicted probability for class i, λ controls entropy regularization, and α adjusts the impact of second-order probability smoothing. This technique ensures confident predictions while mitigating the impact of ambiguous instances.
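As a minimal PyTorch sketch of the regularizer in Eq. (18), the term can be added to a standard cross-entropy objective as below; the weights `lam` and `alpha` are placeholders, not the settings used in this work.

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, targets, lam=0.1, alpha=0.05):
    """Cross-entropy plus the entropy / second-order term of Eq. (18)."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)   # -sum_i p_i log p_i
    second_order = probs.pow(2).sum(dim=-1)                    # sum_i p_i^2
    reg = lam * entropy + alpha * second_order
    return F.cross_entropy(logits, targets) + reg.mean()
```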

Table 4 provides a structured overview of the critical hyperparameters and configurations used in the Vision Transformer-based vehicle classification model. To train the Vision Transformer classifier, we initialized the model with pretrained weights from the ImageNet-21k dataset. The transformer was then fine-tuned on the optimized feature vectors obtained from our feature extraction and Hybrid Swarm Optimization pipeline. We used the AdamW optimizer with an initial learning rate of 3 × 10⁻⁴ and applied a cosine annealing learning rate schedule to gradually reduce the learning rate to 1 × 10⁻⁶ over the training process.

[Table 4: Hyperparameters and training configuration of the Vision Transformer classifier]

The model was trained for a maximum of 100 epochs with a batch size of 32, and early stopping was implemented with a patience of 10 epochs, monitoring the validation F1-score. Training was conducted using PyTorch 1.12 on an NVIDIA RTX 3060 GPU (6 GB VRAM). The data augmentation included feature-level mixing and standardization. This configuration enabled stable convergence while preventing overfitting and allowed the Vision Transformer to generalize effectively across both UAVDT and UAVID datasets.
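A minimal sketch of this training configuration is shown below; `model` is a placeholder for the ViT classifier, and the early-stopping logic is indicated in comments with hypothetical helper names.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_training_setup(model, max_epochs=100):
    """AdamW with an initial LR of 3e-4, annealed to 1e-6 over training."""
    optimizer = AdamW(model.parameters(), lr=3e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=max_epochs, eta_min=1e-6)
    return optimizer, scheduler

# Early stopping on the validation F1-score with a patience of 10 epochs:
# best_f1, wait, patience = 0.0, 0, 10
# for epoch in range(100):
#     train_one_epoch(model, train_loader, optimizer)   # placeholder helpers
#     val_f1 = evaluate_f1(model, val_loader)
#     scheduler.step()
#     if val_f1 > best_f1:
#         best_f1, wait = val_f1, 0
#     else:
#         wait += 1
#         if wait >= patience:
#             break
```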

Vanishing Gradient Consideration

A potential concern in deep learning models is the vanishing gradient problem, especially in architectures with multiple stacked layers. However, the Vision Transformer (ViT) used in our classification module is specifically designed to mitigate such issues. ViT employs residual connections and layer normalization, which facilitate better gradient flow across layers. Furthermore, the self-attention mechanism in ViT does not rely on deep convolutional stacks, thereby reducing gradient attenuation risks. The use of the AdamW optimizer with adaptive learning rates also helps stabilize training and maintain effective backpropagation. As a result, our approach does not suffer from vanishing gradients during training, and stable convergence was observed across all experiments.

3.7 Proposed Approach

Algorithm 2 presents the full vehicle detection and classification pipeline, integrating all stages from aerial video input through preprocessing, segmentation, detection, feature extraction, optimization, and final classification. This unified approach ensures high accuracy by leveraging advanced techniques at each step. Key advantages include:

•   Integrated Pipeline: Combines enhancement, segmentation, detection, feature selection, and classification in one cohesive framework.

•   Robust Aerial Performance: MSR and EM segmentation improve visibility under varied lighting and clutter.

•   Optimized Features: HSO selects the most relevant features, boosting accuracy and efficiency.

•   Transformer-Based Learning: Vision Transformer captures spatial and contextual cues beyond CNNs.

•   Scalable Design: Modular architecture allows easy adaptation to other detection tasks or datasets.

This algorithm offers a validated and practical solution for intelligent aerial vehicle monitoring through end-to-end integration and advanced feature optimization. Algorithm 2 below shows the complete proposed methodology.

[Algorithm 2: Complete vehicle detection and classification pipeline]
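For orientation, the sketch below strings the per-module sketches from the previous sections into a single loop over video frames; it is a simplified illustration of the pipeline, with feature extraction, optimization, and classification omitted for brevity.

```python
import cv2

def process_sequence(video_path):
    """Frame-by-frame sketch of the pipeline, reusing the helper sketches above."""
    cap = cv2.VideoCapture(video_path)
    detections = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        enhanced = multiscale_retinex(frame)                          # Section 3.1
        gray = cv2.cvtColor(enhanced, cv2.COLOR_BGR2GRAY)
        foreground = em_segmentation(gray)                            # Section 3.2 (kept for clarity)
        for x1, y1, x2, y2, score, cls in detect_vehicles(enhanced):  # Section 3.3
            # Sections 3.4-3.6: per-box feature extraction, hybrid optimization,
            # and ViT classification would be applied here (omitted for brevity)
            detections.append((x1, y1, x2, y2, score, cls))
    cap.release()
    return detections
```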

4  Experimental Results

The proposed approach was implemented and evaluated in the Python 3.9 environment, leveraging advanced deep learning and image processing libraries to ensure efficiency and accuracy. The key dependencies utilized include:

•   PyTorch 1.12 (for Vision Transformer-based vehicle classification)

•   OpenCV 4.6 (for image preprocessing and feature extraction)

•   scikit-learn 1.0 (for feature optimization and selection)

•   pymc 4.1 (for Expectation-Maximization-based segmentation)

The experiments were conducted on an Intel Core i7-12700H 2.70 GHz processor with 32 GB RAM and an NVIDIA RTX 3060 GPU (6 GB VRAM) to ensure high computational efficiency. The proposed model was evaluated on the UAVDT and UAVID datasets, demonstrating its effectiveness in aerial vehicle monitoring. A comparative analysis with existing state-of-the-art methods further validates the robustness of the approach.

4.1 Dataset Description

This subsection introduces the benchmark datasets used to evaluate the proposed framework: UAVDT and UAVID, both widely recognized for aerial object recognition. Sections 4.1.1 and 4.1.2 detail their characteristics and challenges. These datasets validate the robustness and generalization of our approach across diverse conditions.

4.1.1 UAVDT Dataset

Introduced by Du et al. [19], UAVDT comprises approximately 10 h of UAV video footage, from which 100 video sequences (~80,000 frames) were selected. A total of ~0.84 million vehicle bounding boxes are annotated, with rich metadata including illumination conditions (daylight, night, fog), flying altitude, camera view, occlusion levels, and vehicle categories. Videos are captured at 30 fps with a resolution of 1080 × 540 pixels. This dataset supports object detection (DET), single-object tracking (SOT), and multi-object tracking (MOT) tasks.

4.1.2 UAVID Dataset

Published by Lyu et al. [13], UAVID consists of 30 high-resolution 4K video sequences (later extended to 42 sequences) captured from oblique angles. A subset of 300 images is densely annotated for eight semantic categories: building, road, tree, low vegetation, static car, moving car, human, and background clutter. This dataset addresses challenges such as scale variation, moving object recognition, and temporal label consistency, and uses mean Intersection-over-Union (mIoU) as the evaluation metric.

4.2 Model Evaluation and Experimental Results

The k-fold cross-validation strategy was used to assess the proposed model’s performance on UAVDT and UAVID datasets, ensuring robust generalization across diverse scenarios. The evaluation demonstrated high effectiveness in vehicle detection and classification, with UAVDT achieving 94.40% accuracy, as shown in Fig. 9, while Precision, Recall, and F1-scores are detailed in Table 5.


Figure 9: Confusion matrix result for the UAVDT dataset

[Table 5: Precision, Recall, and F1-scores on the UAVDT dataset]

We calculate Precision, Recall, and F1-score (F-measure) using the Vision Transformer. These metrics are then visualized through curves that depict their respective values.

For the UAVID dataset, the confusion matrix is provided in Fig. 10, with the Precision, Recall, and F1-scores in Table 6, while Table 7 shows the comparison with SOTA methods.


Figure 10: Confusion matrix result for the UAVID dataset

[Table 6: Precision, Recall, and F1-scores on the UAVID dataset]

[Table 7: Comparison with state-of-the-art methods]

The comparison table highlights that the proposed method achieves the highest accuracy on both the UAVDT (94.40%) and UAVID (93.57%) datasets. Prior methods generally show lower performance, with most results ranging between 75% and 92%. Several existing works report results on only one dataset, limiting their generalization.

4.3 Computational Complexity Assessment and Justification

The complexity of preprocessing and segmentation depends on pixel count per frame, while YOLOv10 detection scales with frames and bounding boxes, showing quadratic growth. Feature extraction relies on keypoints, and HSO is influenced by feature vectors and iterations. Vision Transformer classification, based on embeddings, generally has cubic complexity. Table 8 summarizes execution times and module-wise efficiency.

[Table 8: Execution times and module-wise computational complexity]

Real-Time System Feasibility

To assess the system’s practical usability in real-world traffic monitoring, we evaluated the average processing time of each module on an NVIDIA RTX 3060 GPU (6 GB VRAM). Table 9 presents the breakdown of runtime:

[Table 9: Module-wise runtime breakdown]

The system achieves an average speed of ~24 FPS, confirming its suitability for real-time traffic monitoring. Since HSO is performed offline, it does not affect inference. Feature extraction supports CPU parallelization, and ViT inference benefits from GPU acceleration and batching, with future deployment feasible on edge devices via model pruning and quantization.

4.4 Discussion/Limitation

Although the proposed hybrid framework demonstrates promising results in aerial vehicle recognition, there remain several avenues for improvement and further exploration. First, while the current system performs well on the UAVDT and UAVID datasets, future work should focus on evaluating the model's generalizability across additional datasets that cover more extreme variations, such as occlusion-heavy or low-resolution nighttime traffic footage. Second, the computational cost associated with multi-stage feature extraction and optimization, although mitigated through HSO, can be further reduced by exploring lightweight descriptor alternatives or model pruning strategies for real-time edge deployment. Third, integrating a self-adaptive feature selection mechanism or reinforcement learning-based optimization may help further reduce redundancy and dynamically adjust to diverse environments. Lastly, incorporating spatiotemporal information, such as object tracking and temporal attention modules, could expand the system's capabilities toward video-based behavior analysis and anomaly detection in UAV-based surveillance. Fig. 11 shows some limitations of the proposed system.


Figure 11: Visual examples of failure cases under poor lighting, motion blur, and occlusion

4.5 Ablation Study and Module-Wise Impact Analysis

To evaluate the individual contribution of each module within the proposed framework, an ablation study was conducted by systematically removing or replacing specific components, and the results are summarized in Table 10.

[Table 10: Ablation study of module-wise contributions]

The ablation study evaluates the individual and combined impact of each module in the proposed framework on classification accuracy across UAVDT and UAVID datasets. Removing key components like MSR preprocessing, EM segmentation, or HSO optimization leads to a noticeable performance drop, confirming their contribution. Replacing YOLOv10 or ViT with alternate models (e.g., YOLOv8, ResNet50) also results in decreased accuracy, highlighting the effectiveness of the chosen architecture. The full model achieves the highest accuracy (94.40% on UAVDT), validating the synergy of all components. Overall, the study demonstrates that the integration of all modules is crucial for optimal performance.

5  Conclusion

In this work, we proposed a multi-stage vehicle detection and classification framework that integrates advanced feature extraction, optimization, and deep learning-based classification. The methodology leverages YOLOv10 for precise detection, with MSER, Dense SIFT, and Zernike Moments for feature extraction, followed by Hybrid Swarm-based Optimization and classification using a Vision Transformer model. Evaluations on benchmark datasets achieved high accuracy, including 94.40% on UAVDT and 93.57% on UAVID, highlighting the robustness and efficiency of the framework in complex aerial imagery. Looking ahead, future directions include extending the framework with multi-modal aerial data (e.g., thermal or LiDAR) for enhanced robustness and adopting lightweight Transformer variants for real-time edge deployment. Additionally, incorporating self-supervised strategies and exploring multi-drone collaborative monitoring can broaden its applicability. These advancements position the framework as a promising solution for practical scenarios such as smart city traffic management and large-scale aerial surveillance.

Acknowledgement: The author would like to thank the Deanship of Scientific Research at Shaqra University for supporting this work.

Funding Statement: The author received no specific funding for this study.

Availability of Data and Materials: All datasets used in this study are publicly available.

Ethics Approval: Not applicable.

Conflicts of Interest: The author declares no conflicts of interest to report regarding the present study.

References

1. Zhang Z, Li G. UAV imagery real-time semantic segmentation with global-local information attention. Sensors. 2025;25(6):1786. doi:10.3390/s25061786. [Google Scholar] [PubMed] [CrossRef]

2. Trivedi J, Devi MS, Solanki B. Step towards intelligent transportation system with vehicle classification and recognition using speeded-up robust features. Arch Tech Sci. 2023;28(1):39–56. doi:10.59456/afts.2023.1528.039J. [Google Scholar] [CrossRef]

3. Verdhan A, Saini V, Kukreti DC, Negi R, Bohra M, Kumar I. Real-time vehicle classification using deep neural networks based model. In: Proceedings of the 2024 First International Conference on Technological Innovations and Advance Computing (TIACOMP); 2024 Jun 29–30; Bali, Indonesia. p. 101–6. doi:10.1109/TIACOMP64125.2024.00027. [Google Scholar] [CrossRef]

4. Ayush Kumar CS, Maharana AD, Krishnan SM, Hanuma SS, Sowmya V, Ravi V. Vehicle detection from aerial imagery using principal component analysis and deep learning. In: Proceedings of the 13th International Conference on Innovations in Bio-Inspired Computing and Applications; 2022 Dec 15–17; Seattle, WA, USA. p. 129–40. doi:10.1007/978-3-031-27499-2_12. [Google Scholar] [CrossRef]

5. Qureshi AM, Jalal A. Vehicle detection and tracking using Kalman filter over aerial images. In: Proceedings of the 4th International Conference on Advancements in Computational Sciences (ICACS); 2023 Feb 20–22; Lahore, Pakistan. p. 1–6. doi:10.1109/ICACS55311.2023.10089701. [Google Scholar] [CrossRef]

6. Lin HY, Tu KC, Li CY. VAID: an aerial image dataset for vehicle detection and classification. IEEE Access. 2020;8:212209–19. doi:10.1109/ACCESS.2020.3040290. [Google Scholar] [CrossRef]

7. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2018;40(4):834–48. doi:10.1109/TPAMI.2017.2699184. [Google Scholar] [PubMed] [CrossRef]

8. Zhuang S, Hou Y, Wang D. Towards efficient object detection in large-scale UAV aerial imagery via multi-task classification. Drones. 2025;9(1):29. doi:10.3390/drones9010029. [Google Scholar] [CrossRef]

9. Zuraimi MA, Zaman FH. Vehicle detection and tracking using YOLO and DeepSORT. In: Proceedings of the IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE); 2021 Apr 3–4; Penang, Malaysia. p. 23–9. doi:10.1109/ISCAIE51753.2021.9431784. [Google Scholar] [CrossRef]

10. Yang J, Xie X, Wang Z, Zhang P, Zhong W. Bi-directional information guidance network for UAV vehicle detection. Complex Intell Syst. 2024;10(4):5301–16. doi:10.1007/s40747-024-01429-9. [Google Scholar] [CrossRef]

11. Rani S, Dalal S. Comparative analysis of hybrid and deep learning models for vehicle tracking. In: Proceedings of the 2024 Second International Conference on Advanced Computing & Communication Technologies (ICACCTech); 2024 Nov 16–17; Rai, India. p. 835–41. doi:10.1109/ICACCTech65084.2024.00137. [Google Scholar] [CrossRef]

12. Basak S, Suresh S. Vehicle detection and type classification in low resolution congested traffic scenes using image super resolution. Multimed Tools Appl. 2023;83(8):21825–47. doi:10.1007/s11042-023-16337-2. [Google Scholar] [CrossRef]

13. Lyu Y, Vosselman G, Xia GS, Yilmaz A, Yang MY. UAVid: a semantic segmentation dataset for UAV imagery. arXiv:1810.10438. 2018. doi:10.1016/j.isprsjprs.2020.05.009. [Google Scholar] [CrossRef]

14. Pandey P, Saurabh P, Verma B, Tiwari B. A multi-scale retinex with color restoration (MSR-CR) technique for skin cancer detection. In: Soft computing for problem solving. Singapore: Springer; 2019. p. 465–73. doi:10.1007/978-981-13-1595-4_37. [Google Scholar] [CrossRef]

15. Li X, Zhong Z, Wu J, Yang Y, Lin Z, Liu H. Expectation-maximization attention networks for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision 2019; 2019 Oct 27–Nov 2; Seoul, Republic of Korea. p. 9167–76. doi:10.1109/ICCV.2019.00926. [Google Scholar] [CrossRef]

16. Li Y, Leong W, Zhang H. YOLOv10-based real-time pedestrian detection for autonomous vehicles. In: Proceedings of the 2024 IEEE 8th International Conference on Signal and Image Processing Applications (ICSIPA); 2024 Sep 3–5; Kuala Lumpur, Malaysia. p. 1–6. doi:10.1109/ICSIPA62061.2024.10686546. [Google Scholar] [CrossRef]

17. Li X, Li X, Li Z, Xiong X, Khyam MO, Sun C. Robust vehicle detection in high-resolution aerial images with imbalanced data. IEEE Trans Artif Intell. 2021;2(3):238–50. doi:10.1109/TAI.2021.3081057. [Google Scholar] [CrossRef]

18. Mandal M, Shah M, Meena P, Devi S, Vipparthi SK. AVDNet: a small-sized vehicle detection network for aerial visual data. IEEE Geosci Remote Sens Lett. 2020;17(3):494–8. doi:10.1109/LGRS.2019.2923564. [Google Scholar] [CrossRef]

19. Du D, Qi Y, Yu H, Yang Y, Duan K, Li G, et al. The unmanned aerial vehicle benchmark: object detection and tracking. In: Proceedings of the 15th European Conference on Computer Vision (ECCV); 2018 Sep 8–14; Munich, Germany. p. 370–86. [Google Scholar]

20. Wang B, Gu Y. An improved FBPN-based detection network for vehicles in aerial images. Sensors. 2020;20(17):4709. doi:10.3390/s20174709. [Google Scholar] [PubMed] [CrossRef]

21. Wang P, Jiao B, Yang L, Yang Y, Zhang S, Wei W, et al. Vehicle re-identification in aerial imagery: dataset and approach. In: Proceedings of the IEEE/CVF International Conference on Computer Vision 2019; 2019 Oct 27–Nov 2. Seoul, Republic of Korea. p. 460–9. doi:10.1109/TCSVT.2023.3298788. [Google Scholar] [CrossRef]

22. Ma B, Liu Z, Jiang F, Yan Y, Yuan J, Bu S. Vehicle detection in aerial images using rotation-invariant cascaded forest. IEEE Access. 2019;7:59613–23. doi:10.1109/ACCESS.2019.2915368. [Google Scholar] [CrossRef]

23. Almujally NA, Qureshi AM, Alazeb A, Rahman H, Sadiq T, Alonazi M. A novel framework for vehicle detection and tracking in night ware surveillance systems. IEEE Access. 2024;12:88075–85. doi:10.1109/ACCESS.2024.3417267. [Google Scholar] [CrossRef]

24. Atik ME, Duran Z, Ipbuker C. Comparative analysis of deep learning-based techniques for UAV image semantic segmentation. In: Proceedings of the 2024 IEEE India Geoscience and Remote Sensing Symposium (InGARSS); 2024 Dec 2–5; Goa, India. p. 828–35. [Google Scholar]

25. Mittal P, Sharma A, Singh R, Sangaiah AK. On the performance evaluation of object classification models in low altitude aerial data. J Supercomput. 2022;78(12):14548–70. doi:10.1007/s11227-022-04469-5. [Google Scholar] [PubMed] [CrossRef]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.