Open Access

ARTICLE

Enhanced Lightweight Architecture for Real-Time Detection of Agricultural Pests and Diseases

Wang Cheng1, Zhuodong Liu2, Xiangyu Li3,*

1 School of Electronic and Information Engineering, Anhui University, Hefei, 230000, China
2 School of Economics and Management, Beijing Jiaotong University, Beijing, 100044, China
3 Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China

* Corresponding Author: Xiangyu Li. Email: email

Computers, Materials & Continua 2026, 87(2), 38 https://doi.org/10.32604/cmc.2025.074250

Abstract

Smart pest control is crucial for building farm resilience and ensuring sustainable agriculture in the face of climate change and environmental challenges. To achieve effective intelligent monitoring systems, agricultural pest and disease detection must overcome three fundamental challenges: feature degradation in dense vegetation environments, limited detection capability for sub-32×32-pixel targets, and inadequate bounding box regression for irregular pest morphologies. This study proposes YOLOv12-KMA, a novel detection framework that addresses these limitations through four synergistic architectural innovations, specifically optimized for agricultural environments. First, we introduce efficient multi-head attention (C3K2-EMA), which reduces noise interference by 41% through selective regional attention while maintaining O(knd) computational complexity versus O(n²d) for standard attention. Second, we develop A2C2f-KAN modules embedding Kolmogorov-Arnold networks (KAN) with B-spline activation functions, achieving 15% better feature representation for small targets without global distortion. Third, we propose the minimum point distance intersection over union (MPDIoU) loss, which resolves the aspect ratio degeneration issue in complete intersection over union (CIoU) loss, accelerating convergence by 23% for irregular pest shapes. Fourth, we implement the dynamic sampling (DySample) module, which reduces computational overhead by 72% while preserving 94% feature fidelity compared to conventional interpolation methods. Comprehensive validation on 8742 annotated agricultural images demonstrates significant improvements: a 2.6 percentage point increase in mean average precision (mAP)@0.5 (91.0% → 93.6%) and a 3.2 percentage point gain in mAP@0.5:0.95, with precision and recall improvements of 4.8% and 2.4%, respectively. Statistical analysis confirms significance (p < 0.001) with large effect sizes (η² = 0.78).
The optimized architecture maintains real-time performance at 159 frames per second (FPS) on consumer hardware, enabling practical deployment in precision agriculture monitoring systems.

Keywords

Agricultural pest detection; smart pest control; attention mechanism; Kolmogorov-Arnold networks; object detection

1  Introduction

Agricultural pest and disease detection represents a critical challenge in precision agriculture, where early identification and accurate localization are essential for effective crop protection and yield optimization. The increasing global food demand, coupled with the impacts of climate change, has intensified the need for automated monitoring systems capable of detecting minute pest infestations and disease symptoms in a complex field environment. Conventional visual inspection methods exhibit inherent limitations, including high labor dependency, inter-observer variability, and inadequate scalability for commercial agricultural enterprises, which necessitates the deployment of automated computer vision systems.

Although you only look once (YOLO) architectures have achieved remarkable progress in general object detection, agricultural pest and disease monitoring presents domain-specific challenges that exceed current YOLO capabilities [1]. These challenges include: (1) ultra-small target detection, with pest eggs and initial lesions occupying fewer than 32×32 pixels; (2) severe background complexity from overlapping foliage and dynamic lighting; (3) high morphological variability within pest categories; and (4) stringent real-time constraints for in-field deployment. Consequently, while general-purpose lightweight YOLO variants provide a strong foundation, they lack the specialized architectural components needed to address the domain-specific challenges of agricultural imagery. Substantial and targeted modifications are required to achieve agricultural-grade detection performance. The mere incorporation of advanced modules such as attention mechanisms or novel networks is insufficient; their integration must be explicitly designed and optimized to handle extreme background clutter, ultra-small target sizes, and the irregular morphology of pests and diseases.

To address these challenges, numerous studies have focused on enhancing YOLO-based architectures through various approaches. In backbone network improvements, researchers have extensively modified the C3 module, a core component for feature extraction in YOLO networks. Coordinate attention mechanisms have been successfully employed to improve spatial localization accuracy [2], while the integration of channel and spatial attention mechanisms, such as squeeze-and-excitation (SE) and convolutional block attention module (CBAM) [3], has proven effective in enhancing target region focus [4,5]. These attention-based enhancements have demonstrated particular efficacy in detecting micro-lesions and insect eggs in natural agricultural settings [6,7]. For ultra-small object detection, several specialized approaches have been developed. Tian et al. proposed MD-YOLO, incorporating multi-scale dense connections to improve small target pest detection [8,9], while Xu et al. developed SSV2-YOLO specifically to detect sugarcane aphids in unstructured environments [10]. These studies highlight the critical importance of preserving fine-grained features during the downsampling process, particularly for targets smaller than 32×32 pixels that are characterized by low confidence scores and localization instability.

It is worth mentioning that existing YOLO-based pest detection models have achieved noticeable progress through backbone optimization, attention integration, and multi-scale feature fusion. These efforts have improved the recognition of small and complex agricultural targets to a certain extent. However, most current approaches, such as the RS-YOLO model proposed by Hou et al. [11], the SPD-YOLOv7 introduced by Feng et al. [12], and the CBF-YOLO framework developed by Zhu et al. [2], still face limitations in balancing detection precision, robustness, and computational efficiency, especially under dense foliage, uneven illumination, and morphological diversity. Therefore, there remains a critical need for a unified and lightweight framework capable of robust pest detection across diverse agricultural environments.

To address these limitations, this study develops YOLOv12-KMA, an improved architecture that integrates domain-adaptive attention, nonlinear feature modeling, optimized localization, and detail-preserving upsampling to achieve agricultural-grade detection accuracy and real-time performance. The main contributions of this study are summarized as follows:

•   We develop YOLOv12-KMA, a lightweight detection framework integrating C3K2-EMA attention modules and A2C2f-KAN architectures to address ultra-small target detection (<32×32 pixels), dense vegetation interference, and morphological variability in agricultural environments through multi-level feature extraction and adaptive attention mechanisms.

•   To address the aspect ratio degeneration problem inherent in conventional complete intersection over union (CIoU) loss, where predicted and ground truth boxes sharing identical aspect ratios but differing in dimensional parameters lead to suboptimal gradient signals, we introduce the minimum point distance intersection over union (MPDIoU) loss function that decouples center distance and aspect ratio penalties. This formulation achieves 23% faster convergence for irregular pest shapes while enhancing localization precision in dense detection scenarios with overlapping targets and deformable morphologies.

•   To overcome feature degradation and computational inefficiency in traditional interpolation methods, we implement dynamic sampling (DySample) upsampling with wavelet constraints to preserve fine-grained morphological features. Specifically, DySample employs dynamic point sampling to reorganize feature maps, thereby maintaining critical structural details, i.e., wing veins and antennae. This approach reduces computational overhead by 72% while achieving 94% feature fidelity retention. Consequently, real-time deployment at 159 frames per second (FPS) is enabled, making the framework suitable for practical agricultural monitoring applications.

The remainder of this paper is organized as follows: Section 2 reviews related work in agricultural object detection. Section 3 presents the dataset and experimental setup. Section 4 details the YOLOv12-KMA architecture and its components. Section 5 presents experimental results and analysis. Section 6 concludes with discussion and future directions.

2  Related Work

2.1 YOLO-Based Agricultural Pest Detection and Attention Mechanisms

The evolution of YOLO architectures has significantly advanced agricultural pest and disease detection capabilities, with recent developments addressing small target detection challenges through various architectural improvements. Yu et al. developed CPD-YOLO for cotton pest detection, integrating unmanned aerial vehicle (UAV) and smartphone imaging with bidirectional feature pyramid networks (FPN) [13]. Hou et al. proposed RS-YOLO, utilizing parallel bottleneck structures and Efficient Multi-Scale Attention (EMA) modules for small pest detection [11]. Feng et al. introduced SPD-YOLOv7, employing Space-to-Depth convolution to preserve fine-grained features [12]. Zhu et al.'s CBF-YOLO combined Bi-PAN and feature purification modules for complex agricultural environments [2].

Attention mechanisms have emerged as crucial components to enhance feature representation in agricultural applications. The squeeze-and-excitation (SE) attention mechanism has been integrated into agricultural detection systems to enhance channel-wise relationships, while the convolutional block attention module (CBAM) combines channel and spatial attention for dense vegetation environments [3]. Coordinate attention mechanisms improve spatial localization accuracy for irregularly shaped targets [2]. The EMA mechanism demonstrates superior performance through selective regional attention, reducing computational complexity while maintaining accuracy [14].

2.2 Small Object Detection and Advanced Neural Architectures

Small object detection represents a critical challenge in agricultural monitoring, particularly for targets occupying fewer than 32×32 pixels in field imagery. While recent YOLO variants have made progress, including SSV2-YOLO for sugarcane aphids [10], LSOD-YOLO for lightweight detection [9], and YOLO-TP for microscopic insects [14], these methods still face two fundamental limitations: inadequate feature preservation during multi-scale processing and insufficient representational capacity for irregular micro-targets.

The feature preservation challenge stems largely from traditional upsampling methods, which suffer from information loss and computational inefficiency during feature pyramid construction [15]. Although dynamic upsampling approaches have shown promise in adaptively reorganizing feature maps, they primarily address spatial resolution without enhancing feature discrimination capabilities. Concurrently, the representational limitation persists because conventional activation functions lack the flexibility to model the complex, non-linear patterns characteristic of small agricultural targets.

Kolmogorov-Arnold Networks (KANs) offer a potential solution through learnable activation functions with B-spline implementations, providing superior local adaptation for irregular morphologies [16,17]. However, the synergistic integration of advanced upsampling techniques with KAN architectures for agricultural object detection remains unexplored, which presents an opportunity to simultaneously address both feature preservation and representation challenges.

2.3 Loss Function Optimization and Research Gaps

Bounding box regression optimization remains critical in agricultural object detection due to irregular morphologies and varying aspect ratios of pest and disease targets. The CIoU loss function demonstrates degeneration issues when predicted and ground truth boxes share identical aspect ratios but differ in dimensional parameters, particularly problematic for agricultural targets with similar shape characteristics. Recent developments have introduced distance-based intersection over union (IoU) variants that show promise for irregular target localization [9].

Despite significant advances, several critical limitations persist in current approaches: (1) existing attention mechanisms often fail to adequately suppress agricultural background noise while preserving target features; (2) conventional upsampling methods suffer from computational inefficiency and feature degradation; (3) traditional loss functions exhibit degeneration issues with irregular agricultural targets; (4) advanced neural architectures like KANs remain unexplored in agricultural applications. These limitations motivate the development of YOLOv12-KMA, which addresses each challenge through targeted architectural innovations while maintaining computational efficiency for practical agricultural deployment.

3  Materials and Methods

3.1 Dataset and Experimental Setup

This study used a comprehensive dataset of agricultural pests and diseases, consisting of 8742 expertly annotated images depicting key conditions such as rust spots, aphid colonies, powdery mildew, and borer holes. The dataset combines the 6500-image crop disease subset of the publicly available PlantVillage benchmark with 2242 additional crop images collected from various regions across Asia, improving its relevance and applicability to Asian agricultural contexts. The dataset encompasses six distinct categories: rice leaf folder, rice water weevil, rice leafhopper, bacterial leaf blight, brown spot, and rice blast, with a balanced distribution across these classes; each category contains approximately 1457 images, about 16.7% of the total. All images were captured under varying environmental conditions to ensure the model's robustness and real-world applicability. The dataset was partitioned into training (70%), validation (15%), and testing (15%) sets while maintaining balanced class distributions. YOLOv12 was selected as the baseline model due to its optimal balance between parameter efficiency, with its 2.6 million parameters encompassing the backbone, upsampling, and detection head modules, and inference speed (>150 FPS), making it suitable for agricultural edge computing applications. All experiments were conducted on an NVIDIA RTX 4090 graphics processing unit (GPU) with the PyTorch 1.12.1 framework.
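For reproducibility, the 70/15/15 class-balanced partition described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline; the file names and the six-class toy list are hypothetical stand-ins for the actual dataset.

```python
import random
from collections import defaultdict

def stratified_split(samples, train=0.70, val=0.15, seed=42):
    """Partition (path, label) pairs into train/val/test sets so that each
    class keeps roughly the global 70/15/15 proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    splits = {"train": [], "val": [], "test": []}
    for label, items in by_class.items():
        rng.shuffle(items)
        n = len(items)
        n_train, n_val = int(n * train), int(n * val)
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits

# Toy class-balanced list: 6 classes x 100 hypothetical image paths
data = [(f"img_{c}_{i}.jpg", c) for c in range(6) for i in range(100)]
parts = stratified_split(data)
```

Splitting per class (rather than globally) is what keeps the class distribution balanced across the three subsets, as the paper requires.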

3.2 YOLOv12-KMA Framework

The proposed YOLOv12-KMA framework, shown in Fig. 1, introduces four key architectural modifications designed to address specific challenges in agricultural pest and disease detection: (1) the C3K2-EMA module for noise suppression, (2) the DySample module for high-resolution feature generation, (3) the A2C2f-KAN module with Kolmogorov-Arnold networks, and (4) the MPDIoU loss function for improved bounding box regression.


Figure 1: YOLOv12-KMA architecture

The C3K2-EMA module replaces the initial C3K2 module in the backbone, integrating EMA to address noise interference from fluctuating lighting and vegetation occlusion. The EMA mechanism operates through two-stage attention: coarse-grained regional filtering followed by fine-grained attention within selected regions, achieving linear complexity O(knd) versus quadratic O(n²d) for standard attention. Algorithm 1 below shows the core logic of the C3K2-EMA module, including regional partitioning, coarse-grained filtering, fine-grained attention calculation, and lightweight convolutional enhancement.

Algorithm 1: Core logic of the C3K2-EMA module

To enhance ultra-small target detection (<32×32 pixels), we incorporate the lightweight DySample upsampling module in the neck network to generate 160×160 feature maps [15,18]. The DySample module employs dynamic point sampling to reorganize feature maps, achieving O(C) computational complexity compared to O(C²) for kernel-based methods, ultimately resulting in a significant 72% reduction in floating-point operations while consistently maintaining 94% feature fidelity.

The A2C2f-KAN module substitutes original A2C2f modules with Kolmogorov-Arnold networks featuring B-spline activation functions for localized pixel-level adjustments. KANs provide superior representational capabilities through learnable activation functions, enabling adaptive feature modeling for irregular agricultural target shapes. Finally, the MPDIoU loss function replaces traditional CIoU loss to address degeneration issues when predicted and ground truth boxes share identical aspect ratios, accelerating convergence by 23% for irregular pest shapes [14].

To address the aspect ratio degeneration problem in conventional CIoU loss, which is

$\mathrm{CIoU} = 1 - \mathrm{IoU} + \dfrac{\rho^2(b, b^{gt})}{c^2} + \alpha v$, (1)

where $v = \dfrac{4}{\pi^2}\left(\arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w}{h}\right)^2$, $\alpha = \dfrac{v}{(1 - \mathrm{IoU}) + v}$, and "gt" stands for "ground truth", we propose the MPDIoU loss with the following mathematical formulation:

$\mathrm{MPDIoU} = 1 - \mathrm{IoU} + \dfrac{\sum_{i=1}^{4} d_i^2}{H^2 + W^2}$, (2)

where $d_i = |x_i - x_i^{gt}|$ denotes the coordinate-wise distance between the predicted and ground truth boxes, and $H$ and $W$ are the height and width of the input image. The key improvement in MPDIoU is the decoupling of the aspect ratio penalty from the center point distance measurement. The CIoU loss incorporates an aspect ratio term $v$ that can lead to gradient degeneration when predicted and ground truth boxes share identical aspect ratios but differ in dimensional parameters. MPDIoU avoids this issue by using a direct, normalized sum of squared distances between the four corner points, providing more accurate gradient signals for irregular agricultural target shapes.
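To make the formulation concrete, the following minimal Python sketch implements Eq. (2) as stated, treating $d_i$ as the coordinate-wise distances of boxes given as (x1, y1, x2, y2). It is an illustration of the loss, not the authors' training code.

```python
def iou(box_p, box_g):
    """Standard IoU for non-degenerate axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    return inter / (area_p + area_g - inter)

def mpdiou_loss(box_p, box_g, img_h, img_w):
    """Eq. (2): 1 - IoU + sum_i d_i^2 / (H^2 + W^2), with d_i the
    coordinate-wise distances between the two boxes."""
    penalty = sum((p - g) ** 2 for p, g in zip(box_p, box_g))
    return 1.0 - iou(box_p, box_g) + penalty / (img_h ** 2 + img_w ** 2)

# Same aspect ratio, different size: the CIoU aspect-ratio term v would
# vanish here, but the MPDIoU corner penalty remains informative.
pred, gt = (10, 10, 30, 30), (10, 10, 50, 50)
loss = mpdiou_loss(pred, gt, img_h=640, img_w=640)
```

The example pair shares an aspect ratio of 1:1 yet differs in size, which is exactly the degenerate case for the $v$ term in Eq. (1); the corner-distance penalty keeps the loss informative.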

3.3 Training Configuration and Evaluation Metrics

The training process employed mixed precision optimization using the AdamW optimizer with cosine annealing learning rate scheduling (initial: 0.01, final: 0.0001) over 150 epochs [19]. Data augmentation included Mosaic augmentation, random scaling (0.8–1.2), horizontal flipping, and rotation (±15°) to enhance model robustness. Input images were standardized to 640×640 pixels with a batch size of 32. Model performance was evaluated using standard object detection metrics: precision, recall, average precision (AP), and mean average precision (mAP) at IoU thresholds of 0.5 and 0.5:0.95. Computational efficiency was assessed through frames per second (FPS), floating point operations (FLOPs), and parameter count analysis. Statistical significance testing employed paired t-tests with Bonferroni correction, while effect sizes were calculated using eta-squared measures. To ensure domain-appropriate data augmentation, we implemented agriculturally-adapted Mosaic augmentation with three key modifications: (1) ecological constraint filtering prevented biologically implausible pest-disease combinations by validating species co-occurrence probabilities using agricultural expert knowledge bases; (2) contextual consistency checks maintained realistic spatial distributions through semantic-aware cropping that preserved natural pest habitats (e.g., limiting soil-borne pathogens to root regions); (3) annotation integrity verification automatically filtered generated samples with bounding box IoU conflicts exceeding a 0.7 threshold.

This adapted approach was validated through ablation studies in Section 5.3, showing 11.2% improvement in rare pest detection accuracy compared to standard Mosaic implementation, while reducing false positives from unrealistic pest associations by 9.8%. The agricultural validation specifically addressed challenges illustrated in Fig. 2, including occlusion scenarios and overlapping lesions common in field conditions.
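Returning to the optimization settings of this section, the cosine annealing schedule (initial rate 0.01, final rate 0.0001, 150 epochs) can be written explicitly; this is a generic sketch of the standard schedule rather than the exact scheduler implementation used in training.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=150, lr_init=0.01, lr_final=0.0001):
    """Cosine annealing: decays smoothly from lr_init at the first epoch
    to lr_final at the last epoch."""
    t = epoch / (total_epochs - 1)  # normalized training progress in [0, 1]
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * t))

schedule = [cosine_annealing_lr(e) for e in range(150)]
```

The schedule starts exactly at 0.01, ends exactly at 0.0001, and decreases monotonically in between, which avoids the abrupt drops of step schedules.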


Figure 2: Noise interference in agricultural environment: (a) Strong leaf surface reflections under intense illumination, (b) Image blurring and distortion in densely vegetated areas with insufficient light, (c) Occlusion caused by interlacing branches and foliage, (d) Target overlap within dense lesion areas

4  Algorithm Design

4.1 Challenges in Agricultural Image Processing

Agricultural environments present unique challenges for computer vision systems due to complex environmental conditions and variable imaging scenarios. In farmland environments, images frequently suffer from noise degradation caused by fluctuating lighting conditions, leaf surface reflections, and vegetation occlusion. These challenges manifest in several critical issues that significantly impact detection performance, as illustrated in Fig. 2.

Strong leaf surface reflections under intense solar illumination create specular highlights that obscure pest and disease features, leading to false-negative detections. Image blurring and distortion occur in densely vegetated areas with insufficient light penetration, reducing the visibility of small-scale pest infestations and early-stage disease symptoms. Additionally, occlusion caused by overlapping branches and foliage creates complex visual patterns that interfere with target identification, while dense lesion clusters often exhibit overlapping characteristics that challenge conventional detection algorithms.

When convolutional neural networks (CNNs) process such complex agricultural scenes, noise interference can be amplified during feature extraction, disrupting long-range dependencies between disease spots or pest features. This noise amplification impairs the model’s ability to detect and recognize early-stage pests and diseases, often resulting in misdiagnosis or complete detection failures. Consequently, traditional CNN architectures, due to their limited generalization capabilities, frequently demonstrate suboptimal performance in real-world agricultural monitoring scenarios.

4.2 C3K2-EMA: Efficient Multi-Head Attention Integration

To address noise interference and enhance feature correlation in agricultural imagery, we introduce the C3K2-EMA module that integrates an EMA mechanism to strengthen global correlation of lesion features across leaf surfaces [14]. The EMA mechanism adaptively focuses on core lesion regions while significantly reducing computational overhead compared to standard self-attention approaches.

The EMA mechanism operates through a dual-stage attention process. First, it filters irrelevant key-value pairs at a coarse regional level, thereby eliminating redundant background information. Subsequently, it retains only the most informative regions for fine-grained attention processing. This selective approach reduces computational costs from quadratic O(n2d) complexity in standard self-attention to linear O(knd) complexity, where k represents the number of selected regions (typically kn). The parameter k was optimized through grid search on the validation set to achieve a balance between computational efficiency and detection accuracy. Specifically, we evaluated k values ranging from 8 to 32 with a step size of 8, and found that k=16 was the most suitable for the agricultural scenario. This configuration retains 95% of the key feature information while reducing noise interference by 41% compared to the standard attention mechanism. For detailed comparative experiments on model performance under different k values, refer to Section 5.3. This selection is consistent with the findings in [14], which suggest that for agricultural image processing, an approximate value of kn, where n is the total number of feature points, should be adopted. The architecture of the EMA module is illustrated in Fig. 3.


Figure 3: EMA module architecture showing the dual-stage attention mechanism with regional filtering and fine-grained attention processing

The mathematical formulation of the EMA mechanism follows:

$A = QK^{T}$, (3)

$I = \mathrm{topkIndex}(A)$, (4)

$K_g = \mathrm{gather}(K, I)$, (5)

$V_g = \mathrm{gather}(V, I)$, (6)

$O = \mathrm{Attention}(Q, K_g, V_g) + \mathrm{LCE}(V)$, (7)

where Q, K, and V represent region-level query, key, and value matrices derived from input features. The topkIndex operation selects the most semantically relevant regions based on attention scores, while the gather operations extract corresponding key and value features. The lightweight convolutional enhancement (LCE) module employs a 5×5 depthwise separable convolution with stride 1 to preserve fine-grained texture features while suppressing dense vegetation background noise.
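The routing in Eqs. (3)–(7) can be sketched in a few lines of numpy. This toy version treats regions as plain feature vectors and stands in the LCE branch with a user-supplied callable (defaulting to zero), so it is illustrative only, not the module implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ema_attention(Q, K, V, k, lce=lambda v: 0.0):
    """Sketch of Eqs. (3)-(7). Q, K, V: (n_regions, d) region descriptors.
    Each query attends to only its top-k regions, so the cost scales as
    O(k*n*d) rather than O(n^2*d). `lce` stands in for the lightweight
    convolutional enhancement branch (a 5x5 depthwise conv in the paper)."""
    A = Q @ K.T                          # Eq. (3): region-level affinity
    I = np.argsort(-A, axis=1)[:, :k]    # Eq. (4): top-k region indices
    out = np.empty_like(Q)
    for i in range(Q.shape[0]):
        Kg, Vg = K[I[i]], V[I[i]]        # Eqs. (5)-(6): gather keys/values
        w = softmax(Q[i] @ Kg.T / np.sqrt(Q.shape[1]))
        out[i] = w @ Vg                  # Eq. (7): attention on selected regions
    return out + lce(V)

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8                      # k << n keeps the routing sparse
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
O = ema_attention(Q, K, V, k)
```

Setting k = n recovers dense attention exactly, which is a convenient sanity check that the coarse-to-fine routing only prunes, rather than alters, the attention computation.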

The C3K2-EMA architecture begins with implicit relative position encoding using depthwise separable convolution, followed by sequential application of the EMA mechanism and a multi-layer perceptron (MLP) module for cross-region association modeling and per-position embedding. This design offers dual advantages: implementation of dynamic sparse attention reduces background interference in farmland scenes while lowering computational demands, and branched cross-layer connections of the C3K2 module enrich gradient flow, facilitating comprehensive learning of residual characteristics of the injury. The complete C3K2-EMA module structure is shown in Fig. 4.


Figure 4: A2C2f, A2C2f-KAN, and VoVGSCSPC module structures: (a) A2C2f module combining standard and depthwise separable convolution, (b) A2C2f-KAN structure with parallel processing and EMA integration, (c) VoVGSCSPC module implementing single-step feature aggregation

4.3 Enhanced Detection Head with High-Resolution Feature Maps

The original YOLOv12 architecture employs three detection heads corresponding to feature map sizes of 80 × 80, 40 × 40, and 20 × 20 for detecting small, medium, and large-scale targets, respectively. However, detection capability for ultra-small lesions or pest eggs (<32×32 pixels) remains limited due to inherent feature loss, and rich detail information in shallow feature maps is consistently underutilized.

To address this limitation, we introduce an additional 160 × 160 high-resolution feature map obtained through enhanced upsampling and feature concatenation. This smaller-scale detection head significantly improves early-stage lesion identification accuracy while leveraging high-resolution spatial information crucial for detecting minute agricultural targets. The high-resolution feature map enables detection of pest eggs, fungal spores, and micro-lesions that are typically missed by conventional multi-scale detection architectures due to their limited representation capacity.

The inclusion of the 160 × 160 feature map increases network complexity in the neck module, which necessitates careful optimization to maintain computational efficiency. We integrate the A2C2f module with slim-neck techniques to reduce model complexity while preserving detection precision. The A2C2f module combines standard convolution with depthwise separable convolution, utilizing channel shuffle operations to integrate information across different convolution outputs, ensuring effective cross-channel information interaction with minimal computational overhead.

Building upon this foundation, the A2C2f-KAN structure processes input features through two parallel A2C2f modules and one KAN module, with outputs aggregated via element-wise summation. This configuration reduces computational load and parameter counts while preserving detection accuracy for overlapping lesions, achieving an optimal balance between efficiency and performance in agricultural pest detection scenarios. The KAN-Bottleneck module structure is detailed in Fig. 5.


Figure 5: KAN-Bottleneck module showing the integration of Kolmogorov-Arnold networks with B-spline activations

This localized sensitivity is particularly advantageous for small agricultural pests, as it allows the network to precisely model intricate morphological features such as insect antennae and wing venation without being distorted by global background noise.

The Kolmogorov-Arnold representation theorem guarantees that any multivariate continuous function can be expressed as a finite composition of univariate functions and additions. In our A2C2f-KAN module, this is implemented through B-spline activations:

$\phi(x) = \sum_{i} c_i B_i(x)$, (8)

where $B_i(x)$ are B-spline basis functions with local support, and $c_i$ are learnable coefficients. This formulation provides three key advantages over traditional ReLU or Swish functions for agricultural texture analysis:

•   Local sensitivity: Each B-spline basis $B_i(x)$ responds only to a limited input range, enabling precise modeling of fine-grained agricultural textures (e.g., insect wing venation, disease spot edges) without global distortion.

•   Adaptive smoothness: The continuous differentiability of B-splines ensures stable gradient propagation during training, overcoming the abrupt saturation of ReLU ($f(x) = \max(0, x)$) and the fixed smoothness of Swish ($f(x) = x \cdot \mathrm{sigmoid}(\beta x)$).

•   Enhanced representational capacity: Learnable coefficients $c_i$ allow the activation shape to adapt specifically to the complex, non-linear patterns characteristic of agricultural pests and diseases.
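A toy numpy sketch of Eq. (8) helps make the local-support property concrete. For simplicity it uses degree-1 (hat) B-splines on a uniform knot grid, whereas practical KAN layers typically use cubic splines; the coefficients here are hand-set rather than learned.

```python
import numpy as np

def kan_activation(x, coeffs, centers, h):
    """Eq. (8): phi(x) = sum_i c_i * B_i(x). Here B_i is a degree-1 (hat)
    B-spline of half-width h centered at centers[i]; its local support
    means each coefficient shapes phi only near its own knot."""
    x = np.asarray(x, dtype=float)
    phi = np.zeros_like(x)
    for c, mu in zip(coeffs, centers):
        phi += c * np.clip(1.0 - np.abs(x - mu) / h, 0.0, None)
    return phi

centers = np.linspace(-2.0, 2.0, 9)   # uniform knot grid with spacing 0.5
x = np.array([-1.3, 0.0, 0.25, 1.1])

# Sampling ReLU at the knots as coefficients makes the learnable
# activation reproduce ReLU on the grid; tuning the c_i instead lets the
# activation's shape adapt to local texture statistics.
phi = kan_activation(x, np.maximum(centers, 0.0), centers, h=0.5)
```

Because the hat bases form a partition of unity on the grid interior, adjusting one $c_i$ perturbs $\phi$ only within one knot spacing of that basis, which is the locality argument made in the bullets above.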

In contrast, conventional convolutional neural networks employ fixed activation functions such as ReLU or SiLU, which cannot adapt their morphology to the data and thus struggle to represent the complex, nonlinear features of small targets, such as the fine textures of insect eggs or the edges of disease spots. The B-spline parameterization addresses both requirements at once: local support enables sensitive responses to subtle local variations in input features without inducing unnecessary global fluctuations, which is crucial for capturing fine details of targets under 32 × 32 pixels, while continuous differentiability ensures stable gradient computation during backpropagation, facilitating effective model optimization.

4.4 DySample: Lightweight Dynamic Upsampling

Traditional nearest-neighbor interpolation methods suffer from significant limitations in agricultural feature fusion networks, including feature degradation, aliasing artifacts, and computational inefficiency. To overcome these challenges, we introduce DySample, a lightweight upsampling module based on dynamic point sampling that fundamentally redesigns the upsampling strategy [15]. The DySample module architecture is illustrated in Fig. 6.


Figure 6: Lightweight upsampling module DySample, showing dynamic point sampling mechanism and feature reorganization process

The DySample module reconceptualizes the upsampling process from the perspective of point sampling, circumventing traditional dynamic convolution [20]. The module incorporates several key mechanisms: it begins with bilinear interpolation initialization to establish reasonable initial sampling positions, ensuring that sampling points are more evenly distributed after upsampling, thereby improving sampling accuracy. Furthermore, to mitigate potential boundary prediction errors caused by overlapping sampling points, the module introduces a range factor to control the offset scope and enhance the stability of the sampling process. Additionally, by grouping feature channels and generating independent sampling sets for each group, the module further improves the flexibility and performance of the upsampling operation.

The operational workflow begins with a low-resolution input feature map X ∈ ℝ^(H×W×C). A linear transformation layer (Linear1) generates a dynamic range adjustment factor δ, constrained to [0, 0.5], for sampling range control. Subsequently, another linear layer (Linear2) produces initial offsets O_syn ∈ ℝ^(H×W×2s²), where s is the upsampling factor. The dynamic offset O_dyn is derived through:

O_dyn = O_syn · δ, (9)

S = G + O_dyn, (10)

X′ = grid_sample(X, S) (11)

where G represents the fixed grid positions and S defines the dynamic sampler coordinates. The adaptive sampler precisely covers semantic regions of interest, enabling feature reorganization through grid sampling to yield the high-resolution output X′ ∈ ℝ^(sH×sW×C).
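Under Eqs. (9)–(11), the sampling step itself is a point-wise bilinear lookup at offset grid positions. The NumPy sketch below mimics this flow for a single feature map with a scalar range factor δ; the per-position Linear1/Linear2 predictions of the actual module are replaced by precomputed `offsets`, so the function names and shapes here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def bilinear_sample(x, coords):
    """Sample feature map x (H, W, C) at fractional coords (N, 2) in pixel units."""
    H, W, _ = x.shape
    cy = np.clip(coords[:, 0], 0, H - 1)
    cx = np.clip(coords[:, 1], 0, W - 1)
    y0, x0 = np.floor(cy).astype(int), np.floor(cx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (cy - y0)[:, None], (cx - x0)[:, None]
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def dysample_upsample(x, offsets, delta, scale=2):
    """DySample-style upsampling: fixed grid G plus bounded dynamic offsets,
    followed by point sampling (S = G + O_dyn, X' = grid_sample(X, S))."""
    H, W, C = x.shape
    # fixed grid G: each output pixel maps back to a fractional input position
    ys, xs = np.meshgrid(np.arange(scale * H), np.arange(scale * W), indexing="ij")
    grid = np.stack([(ys + 0.5) / scale - 0.5, (xs + 0.5) / scale - 0.5], axis=-1)
    grid = grid.reshape(-1, 2)
    # dynamic offsets O_dyn = O_syn * delta, with delta bounded in [0, 0.5]
    sampler = grid + offsets.reshape(-1, 2) * delta
    out = bilinear_sample(x, sampler)
    return out.reshape(scale * H, scale * W, C)
```

With all offsets at zero the module degenerates to plain bilinear upsampling, which matches the bilinear-initialization behavior described above.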

The dual-layer routing attention mechanism implemented in our framework is shown in Fig. 7, demonstrating how feature maps are partitioned into regions to minimize computational overhead during efficient attention computation.

images

Figure 7: Dual-layer routing attention mechanism illustrating region partitioning, attention computation, and fine-grained feature fusion

This dynamic point-sampling mechanism eliminates fixed-rule pixel manipulation, enhancing reorganization efficiency compared to nearest-neighbor interpolation. Crucially, by avoiding dynamic kernel generation, the DySample module achieves a computational complexity of O(C), compared to O(C²) for kernel-based methods such as Content-Aware ReAssembly of FEatures (CARAFE), resulting in a 72% reduction in floating-point operations while maintaining 94% feature fidelity in agricultural pest detection scenarios.

5  Experimental Results and Analysis

5.1 Experimental Configuration and Performance Metrics

All experiments were conducted on a high-performance computing platform equipped with an Intel i9-10850K central processing unit (CPU), an NVIDIA GeForce RTX 4090 GPU (24 GB video random access memory (VRAM)), and 64 GB double data rate 4 (DDR4) random access memory (RAM). The software environment comprised Ubuntu 20.04 long-term support (LTS), Compute Unified Device Architecture (CUDA) 11.8, Python 3.8, and PyTorch 1.12.1. Input images were standardized to 640×640 pixel resolution with a batch size of 32. Training employed the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, a weight decay of 0.0005, and cosine annealing scheduling over 150 epochs.

Model performance was evaluated using standard object detection metrics derived from the confusion matrix framework. The binary confusion matrix components include True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), as shown in Table 1, enabling computation of the key performance indicators.

images

Based on the confusion matrix components, the following performance metrics were calculated:

Precision = TP / (TP + FP), (12)

Recall = TP / (TP + FN), (13)

F1-Score = (2 × Precision × Recall) / (Precision + Recall), (14)

AP = ∫₀¹ p(r) dr, (15)

mAP = (1/N) Σ_{i=1}^{N} AP_i (16)
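As a concrete check of Eqs. (12)–(16), the helpers below compute Precision, Recall, and F1 from raw counts, and AP as the area under a precision–recall curve using the all-point interpolation common in detection benchmarks. This is a generic sketch; the paper does not specify its exact AP integration scheme.

```python
import numpy as np

def detection_metrics(tp, fp, fn):
    """Precision, Recall, and F1 from confusion-matrix counts (Eqs. 12-14)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve (Eq. 15),
    via all-point interpolation over sorted recall values."""
    r = np.concatenate([[0.0], recalls, [1.0]])
    p = np.concatenate([[0.0], precisions, [0.0]])
    p = np.maximum.accumulate(p[::-1])[::-1]   # enforce non-increasing precision
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP (Eq. 16) then follows by averaging `average_precision` over the N classes.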

To ensure the statistical robustness and generalizability of the performance improvements, a 5-fold cross-validation strategy was employed. The entire dataset was randomly partitioned into five distinct folds, and the model was trained and evaluated five times, each time using a different fold as the test set and the remaining four as the training set. The performance metrics (Precision, Recall, mAP@0.5, etc.) reported in Section 5.2 are the mean values across these five independent runs. Statistical significance of the improvements over the baseline (YOLOv12) was further validated by performing a paired t-test on the paired results from each fold. Additionally, the 95% confidence intervals (CI) for the mean performance metrics were calculated to quantify the estimation uncertainty. This comprehensive evaluation design provides stronger support for the replicability of the claimed enhancements.
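The fold-level comparison described above can be reproduced in a few lines: a paired t statistic on per-fold differences, plus a 95% confidence interval using the two-sided Student-t quantile for n − 1 = 4 degrees of freedom (2.776 for 5 folds). The fold values in the usage example are invented placeholders, not the study's data.

```python
import math

def paired_t_and_ci(baseline, improved, t_crit=2.776):
    """Paired t statistic and 95% CI for the mean per-fold gain.
    t_crit = 2.776 is the two-sided 97.5% Student-t quantile for 4 d.o.f. (5 folds)."""
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    se = sd / math.sqrt(n)
    return mean / se, (mean - t_crit * se, mean + t_crit * se)

# hypothetical per-fold mAP@0.5 values (placeholders, not the paper's folds)
base = [90.8, 91.2, 90.9, 91.1, 91.0]
ours = [93.5, 93.8, 93.4, 93.7, 93.6]
t_stat, (ci_lo, ci_hi) = paired_t_and_ci(base, ours)
```

The gain is declared significant at the 5% level when the t statistic exceeds the critical value, i.e., when the CI on the mean difference excludes zero.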

5.2 Comparative Performance Analysis

Comprehensive benchmarking against contemporary YOLO variants demonstrates the superiority of YOLOv12-KMA across multiple performance dimensions. Fig. 8 presents detailed comparative results showing significant improvements in detection accuracy while maintaining computational efficiency.

images images

Figure 8: Performance comparison of YOLOv12-KMA with state-of-the-art models: (a) Precision comparison, (b) Recall comparison, (c) mAP@0.5 comparison, (d) Computational efficiency (Parameters vs. giga floating-point operations (GFLOPs))

As illustrated in Table 2 and Fig. 8, YOLOv12-KMA achieves consistent improvements across all five cross-validation folds. Mean precision increased from 92.6% (95% CI: 92.2%–93.0%) to 95.8% (95% CI: 95.4%–96.2%), a significant gain of 3.2 percentage points (p<0.001). Similarly, mean recall improved from 88.3% (95% CI: 87.8%–88.8%) to 90.7% (95% CI: 90.2%–91.2%), a 2.4 percentage point increase, and mAP@0.5 rose by 2.6 percentage points from 91.0% (95% CI: 90.5%–91.5%) to 93.6% (95% CI: 93.1%–94.1%). The paired t-test on fold-level results confirmed significance (p<0.001) with a large effect size (η²=0.78), demonstrating the robustness of the improvements.

images

Furthermore, compared to CPD-YOLO as shown in Table 2, YOLOv12-KMA achieved a 1.6% increase in Precision (94.2% → 95.8%), a 2.2% increase in Recall (88.5% → 90.7%), and a 1.8% increase in mAP@0.5 (91.8% → 93.6%). When compared to RS-YOLO, it yielded a 2.0% gain in Precision (93.8% → 95.8%), a 2.3% improvement in Recall (88.4% → 90.7%), and a 2.8% increase in mAP@0.5 (90.8% → 93.6%). Against SPD-YOLO, the proposed model achieved a 2.7% increase in Precision (93.1% → 95.8%), a 1.6% improvement in Recall (89.1% → 90.7%), and a 3.2% gain in mAP@0.5 (90.4% → 93.6%). These gains indicate that YOLOv12-KMA maintains robust detection capability for agricultural pests and diseases even when compared with other recently proposed models.

Statistical analysis confirms the significance of these improvements (p<0.001) with large effect sizes (η²=0.78), establishing YOLOv12-KMA's superiority across all evaluation metrics while preserving the real-time processing capability essential for latency-sensitive agricultural deployments.

5.3 Ablation Study and Component Analysis

Systematic ablation experiments quantify the individual contributions of each architectural innovation to overall detection performance. The key components compared in Table 3 are the Kolmogorov-Arnold Network (A), the MPDIoU loss function (B), and the DySample upsampling module (C).

images

Progressive component integration yields consistent accuracy gains across cross-validation folds. As shown in Tables 3 and 4, integrating the Kolmogorov-Arnold network (Component A) increased mean mAP@0.5 by 1.2% (95% CI: 0.8%–1.6%); adding the MPDIoU loss (Component B) further improved mAP@0.5 by 0.9% (95% CI: 0.5%–1.3%); and incorporating the DySample module (Component C) added 0.5% (95% CI: 0.2%–0.8%). The full model (A+B+C) achieved a mean mAP@0.5 of 93.6% (95% CI: 93.1%–94.1%), demonstrating synergistic effects.

images

As illustrated in Table 5, YOLOv12-KMA achieved its best performance on the agricultural dataset when k = 16; all subsequent experiments in this study therefore used k = 16.

images

5.4 Loss Function Comparative Analysis

Experimental evaluation of MPDIoU against conventional loss functions reveals significant advantages in convergence stability and localization accuracy. Fig. 9 and Table 6 illustrate the comparative convergence behavior across different loss formulations.

images

Figure 9: Comparison of loss function curves showing MPDIoU convergence advantages over CIoU, EIoU, and SIoU

images

As shown in Fig. 9, the convergence curves from 5-fold cross-validation demonstrate that MPDIoU converges faster and more stably: within the first 15 epochs, MPDIoU reached a loss threshold of 0.1 on average 23.1% sooner (95% CI: 21.5%–24.7%) than CIoU (p<0.01). Table 6 shows that MPDIoU achieved a mean mAP@0.5 of 93.6% (95% CI: 93.1%–94.1%), surpassing all other loss formulations.

As shown in Table 6, MPDIoU demonstrates superior performance across all metrics, achieving 1.1% higher mAP@0.5 and 1.3% better mAP@0.5:0.95 compared to baseline IoU. The minimum point distance formulation effectively addresses aspect ratio degeneration issues inherent in CIoU, providing more robust gradient signals for irregular agricultural target shapes and improving convergence stability.
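For reference, the published MPDIoU formulation subtracts from the IoU the squared distances between the two top-left and the two bottom-right corner pairs, each normalized by the squared image diagonal. Below is a minimal single-box sketch with boxes as (x1, y1, x2, y2) tuples; the function name and interface are illustrative, not the paper's training code.

```python
def mpdiou_loss(box_a, box_b, img_w, img_h):
    """1 - MPDIoU: IoU penalized by normalized squared distances between
    the matching top-left and bottom-right corners of the two boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection and union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union else 0.0
    norm = img_w ** 2 + img_h ** 2          # squared image diagonal
    d1 = (ax1 - bx1) ** 2 + (ay1 - by1) ** 2   # top-left corner distance
    d2 = (ax2 - bx2) ** 2 + (ay2 - by2) ** 2   # bottom-right corner distance
    return 1.0 - (iou - d1 / norm - d2 / norm)
```

Because the two corner pairs jointly determine position, width, and height, the penalty cannot degenerate when the predicted box shares the target's aspect ratio, unlike CIoU's aspect-ratio term.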

To further validate the robustness of MPDIoU’s convergence advantages under different learning conditions, we conducted additional experiments with adaptive learning rate schedules (Cyclical LR and ReduceLROnPlateau) across all loss functions. As shown in Table 7, MPDIoU maintained consistent convergence improvements (18.2%–23.5% faster than CIoU) under varying scheduling strategies, confirming that its performance gain is independent of specific learning rate configurations.

images

Through comparative experiments with multiple learning rate strategies, we demonstrate that MPDIoU consistently maintains a convergence advantage (average acceleration of 20.2%) across different scheduling environments, with uniform statistical significance. This result reinforces the core position of MPDIoU’s structural improvements—its mathematical formulation optimally addresses inherent limitations of CIoU, rather than relying on specific hyperparameter configurations.

As shown in Table 8, the agriculturally-adapted Mosaic enhancement significantly improves performance in 5-fold cross-validation. Compared to the standard Mosaic, the rare pest accuracy (RPA) increases by an average of 11.2% (95% CI: 9.8%–12.6%), and the false positive rate (FPR) decreases by 9.8% (95% CI: 8.5%–11.1%). The statistical significance is verified by a paired t-test (p<0.001). This improvement is attributed to the ecological constraint filtering, which effectively eliminates unreasonable pest-disease combinations (such as the false association between soil pathogens and foliar pests), and the context-consistency check ensures the authenticity of the spatial distribution of lesions, thereby reducing the occlusion false alarms in Fig. 2c,d.

images

5.5 Upsampling Module Evaluation

Comparative analysis of upsampling strategies confirms DySample’s efficiency advantages over conventional methods. Fig. 10 presents a systematic evaluation of different DySample module configurations.

images

Figure 10: Comparative results: (a) Performance metrics vs. component integration, (b) Individual component contributions to mAP@0.5

As shown in Table 9, the dual DySample module implementation achieved a mean mAP@0.5 improvement of 0.7% (95% CI: 0.3%–1.1%) while reducing computational load by 4.6% (mean GFLOPs: 6.5 ± 0.2 → 6.2 ± 0.2), confirming efficiency gains with stable performance.

images

5.6 Visual Analysis and Detection Quality Assessment

Qualitative and visual analyses further substantiate the quantitative improvements of YOLOv12-KMA, particularly in challenging agricultural scenarios. The training convergence curves provide a visual testament to the model’s enhanced stability and learning efficiency.

Figs. 11 and 12 illustrate the progressive gains in precision and recall, respectively, highlighting a smoother and more consistent learning trajectory for YOLOv12-KMA compared to the baseline. Similarly, Fig. 13 (mAP@0.5) and Fig. 14 (mAP@0.5:0.95) confirm that the architectural enhancements lead to superior overall detection accuracy across different IoU thresholds throughout the training process.

images

Figure 11: Precision comparison showing convergence stability improvements with architectural enhancements

images

Figure 12: Recall comparison demonstrating progressive performance gains through component integration

images

Figure 13: mAP@0.5 comparison illustrating overall detection accuracy improvements

images

Figure 14: mAP@0.5:0.95 comparison across different IoU thresholds

To elucidate the mechanism behind these improvements, we analyzed feature discrimination using attention heatmaps. As depicted in Fig. 15, the EMA mechanism significantly enhances lesion feature discrimination. Untreated feature maps exhibit diffuse activation (σ = 0.36), whereas EMA-enhanced maps concentrate activation (σ = 0.21), a 41% improvement in contrast resolution. The resulting intensified pseudo-color gradients precisely delineate lesion boundaries in rice blast and brown spot pathologies, confirming the effectiveness of our attention-based enhancements.

images

Figure 15: Comparative analysis of pest identification heatmap effectiveness, showing enhanced feature discrimination through EMA mechanism

The practical impact of these enhancements is clearly demonstrated in direct comparisons of detection outputs. Fig. 16 showcases the improved performance on rice pest detection, where YOLOv12-KMA achieves higher confidence scores and more accurate localization; for instance, rice leaf roller detection confidence increases from 0.82 to 0.91 (a 10.9% gain). Similarly, Fig. 17 illustrates superior rice disease detection, with brown spot recognition improving from 0.78 to 0.92 (a 17.9% increase). Under challenging conditions such as occlusion, YOLOv12-KMA demonstrates a 12.5% higher IoU in overlapping lesion segmentation and precision-recall gains of up to 15.3% in dense spot detection, underscoring its robustness against unpredictable field variability in complex agricultural environments.

images

Figure 16: Comparative analysis of rice pest detection performance before and after model enhancement

images

Figure 17: Comparative analysis of rice disease detection performance before and after model enhancement


The distribution histogram of IoU for the ground-truth bounding boxes, used to quantify the visual enhancement effect, is shown in Fig. 18.

images

Figure 18: Comparative model performance across evaluation benchmarks

5.7 Computational Efficiency and Real-Time Performance

YOLOv12-KMA maintains excellent computational efficiency while delivering superior detection accuracy. The model achieves 159 FPS on NVIDIA RTX 4090 GPU, enabling real-time agricultural monitoring applications. Memory consumption remains stable at 2.8 GB during inference, making the framework suitable for edge deployment scenarios common in precision agriculture.

The 72% reduction in upsampling computational overhead through DySample module as illustrated in Fig. 19, combined with linear complexity attention mechanisms, ensures that accuracy improvements do not compromise deployment feasibility. These characteristics establish YOLOv12-KMA as a practical solution for large-scale agricultural monitoring systems that require high accuracy and real-time processing capabilities.

images

Figure 19: YOLOv12-KMA computational time breakdown

6  Conclusions and Future Work

This study successfully develops and validates YOLOv12-KMA, a specialized framework that significantly advances agricultural pest and disease detection by overcoming key environmental and technical challenges. Through the synergistic integration of four architectural innovations—the C3K2-EMA module which cuts noise by 41%, DySample module upsampling reducing computational load by 72% while preserving feature fidelity, the A2C2f-KAN architecture enhancing small-target representation by 15%, and the MPDIoU loss function accelerating convergence by 23%—the proposed model demonstrates superior performance and efficiency. Validated on a comprehensive dataset of 8742 images, our framework achieves a 2.6 percentage point increase in mAP@0.5 to 93.6%, alongside a 3.2% gain in precision and a 2.4% improvement in recall, with statistical analysis confirming high significance (p<0.001) and large effect size (η2=0.78). Crucially, these accuracy gains are achieved while maintaining practical real-time performance at 159 FPS on consumer hardware with only 2.6 M parameters. Visual analysis corroborates these findings, revealing confidence score increases of up to 17.9% and enhanced robustness with up to 15.3% higher precision-recall gains under challenging occlusion conditions. Looking forward, further research is required to expand the model’s universality across more diverse crop types, climates, and geographic regions.

Future efforts will focus on multi-temporal and multi-spectral data integration for earlier disease detection, advanced model compression techniques like knowledge distillation for deployment on resource-constrained devices, and the development of predictive models for pest outbreak forecasting. Ultimately, this work lays the foundation for a lightweight, robust, and real-time monitoring system poised to make a substantial impact on precision agriculture, sustainable agriculture, and global food security.

Acknowledgement: Not applicable.

Funding Statement: This research received no external funding.

Author Contributions: Conceptualization: Wang Cheng, Zhuodong Liu, and Xiangyu Li; Methodology: Wang Cheng and Zhuodong Liu; Software: Wang Cheng and Zhuodong Liu; Validation: Zhuodong Liu and Xiangyu Li; Formal Analysis: Wang Cheng, Zhuodong Liu, and Xiangyu Li; Investigation: Wang Cheng, Zhuodong Liu, and Xiangyu Li; Data Curation: Zhuodong Liu and Xiangyu Li; Writing—Original Draft Preparation: Wang Cheng; Visualization: Wang Cheng and Zhuodong Liu; Writing—Review and Editing: Zhuodong Liu and Xiangyu Li; Supervision: Xiangyu Li; Project Administration: Xiangyu Li. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: Dataset available on request from the authors.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Huang Y, He J, Liu G, Li D, Hu R, Hu X, et al. YOLO-EP: a detection algorithm to detect eggs of Pomacea canaliculata in rice fields. Ecol Inform. 2023;77:102211. doi:10.1016/j.ecoinf.2023.102211. [Google Scholar] [CrossRef]

2. Zhu L, Li X, Sun H, Han Y. Research on CBF-YOLO detection model for common soybean pests in complex environment. Comput Electron Agric. 2024;216(1):108515. doi:10.1016/j.compag.2023.108515. [Google Scholar] [CrossRef]

3. Zhang R, Liu T, Liu W, Yuan C, Seng X, Guo T, et al. YOLO-CRD: a lightweight model for the detection of rice diseases in natural environment. Phyton-Int J Exp Bot. 2024;93:1275–96. doi:10.32604/phyton.2024.052397. [Google Scholar] [CrossRef]

4. Chen Z, Zhang Y, Xing S. YOLO-LE: a lightweight and efficient UAV aerial image target detection model. Comput Mater Contin. 2025;84(1):1787–803. doi:10.32604/cmc.2025.065238. [Google Scholar] [CrossRef]

5. Zhao C, Bai C, Yan L, Xiong H, Suthisut D, Pobsuk P, et al. AC-YOLO: multi-category and high-precision detection model for stored grain pests based on integrated multiple attention mechanisms. Expert Syst Appl. 2024;255:124659. doi:10.1016/j.eswa.2024.124659. [Google Scholar] [CrossRef]

6. Ding JY, Jeon WS, Rhee SY, Zou CM. An improved YOLO detection approach for pinpointing cucumber diseases and pests. Comput Mater Contin. 2024;81(3):3989–4014. doi:10.32604/cmc.2024.057473. [Google Scholar] [CrossRef]

7. Chutichaimaytar P, Zhang Z, Kaewirakulpong K, Ahamed T. An improved small object detection CTB-YOLO model for early detection of tip-burn and powdery mildew symptoms in coriander (Coriandrum sativum) for indoor environment using an edge device. Smart Agricult Technol. 2025;12(1):101142. doi:10.1016/j.atech.2025.101142. [Google Scholar] [CrossRef]

8. Tian Y, Wang S, Li E, Yang G, Liang Z, Tan M. MD-YOLO: multi-scale Dense YOLO for small target pest detection. Comput Electron Agric. 2023;213:108233. doi:10.1016/j.compag.2023.108233. [Google Scholar] [CrossRef]

9. Wang H, Liu J, Zhao J, Zhang J, Zhao D. Precision and speed: lSOD-YOLO for lightweight small object detection. Expert Syst Appl. 2025;269:126440. doi:10.1016/j.eswa.2025.126440. [Google Scholar] [CrossRef]

10. Xu W, Xu T, Alex Thomasson J, Chen W, Karthikeyan R, Tian G, et al. A lightweight SSV2-YOLO based model for detection of sugarcane aphids in unstructured natural environment. Comput Electron Agric. 2023;211:107961. doi:10.1016/j.compag.2023.107961. [Google Scholar] [CrossRef]

11. Hou S, Pang Y, Wang J, Hou J, Wang B. RS-YOLO: a highly accurate real-time detection model for small-target pest. Smart Agricult Technol. 2025;12:101212. doi:10.1016/j.atech.2025.101212. [Google Scholar] [CrossRef]

12. Feng Z, Shi R, Jiang Y, Han Y, Ma Z, Ren Y. SPD-YOLO: a method for detecting maize disease pests using improved YOLOv7. Comput Mater Contin. 2025;84(2):3559–75. doi:10.32604/cmc.2025.065152. [Google Scholar] [CrossRef]

13. Gao Z, Li Z, Zhang C, Wang Y, Su J. Double self-attention based fully connected feature pyramid network for field crop pest detection. Comput Mater Contin. 2025;83(3):4353–71. doi:10.32604/cmc.2025.061743. [Google Scholar] [CrossRef]

14. Li B, Liu L, Jia H, Zang Z, Fu Z, Xi J. YOLO-TP: a lightweight model for individual counting of Lasioderma serricorne. J Stored Prod Res. 2024;109:102456. doi:10.1016/j.jspr.2024.102456. [Google Scholar] [CrossRef]

15. Suzauddola M, Zhang D, Zeb A, Chen J, Wei L, Rayhan AS. Advanced deep learning model for crop-specific and cross-crop pest identification. Expert Syst Appl. 2025;274:126896. doi:10.1016/j.eswa.2025.126896. [Google Scholar] [CrossRef]

16. Anam I, Arafat N, Hafiz MS, Jim JR, Kabir MM, Mridha M. A systematic review of UAV and AI integration for targeted disease detection, weed management, and pest control in precision agriculture. Smart Agricult Technol. 2024;9:100647. doi:10.1016/j.atech.2024.100647. [Google Scholar] [CrossRef]

17. Bi H, Li T, Xin X, Shi H, Li L, Zong S. Non-destructive estimation of wood-boring pest density in living trees using X-ray imaging and edge computing techniques. Comput Electron Agric. 2025;233(1):110183. doi:10.1016/j.compag.2025.110183. [Google Scholar] [CrossRef]

18. Lu Y, Liu P, Tan C. MA-YOLO: a pest target detection algorithm with multi-scale fusion and attention mechanism. Agronomy. 2025;15(7):1549. doi:10.3390/agronomy15071549. [Google Scholar] [CrossRef]

19. Huang Y, Liu Z, Zhao H. YOLO-YSTs: an improved YOLOv10n-based method for real-time field pest detection. Agronomy. 2025;15(3):575. doi:10.3390/agronomy15030575. [Google Scholar] [CrossRef]

20. Zheng Y, Zheng W, Du X. Paddy-YOLO: an accurate method for rice pest detection. Comput Electron Agric. 2025;238:110777. doi:10.1016/j.compag.2025.110777. [Google Scholar] [CrossRef]




cc Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.