Open Access

ARTICLE


APPLE_YOLO: Apple Detection Method Based on Channel Pruning and Knowledge Distillation in Complicated Environments

Xin Ma1,2, Jin Lei3,4,*, Chenying Pei4, Chunming Wu4

1 Department of Aircraft Control and Information Engineering, Jilin University of Chemical Technology, Jilin, 132022, China
2 Micro Engineering and Micro Systems Laboratory, School of Mechanical and Aerospace Engineering, Jilin University, Changchun, 130025, China
3 School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, 710129, China
4 Key Laboratory of Modern Power System Simulation and Control & Renewable Energy Technology, Ministry of Education, Northeast Electric Power University, Jilin, 132012, China

* Corresponding Author: Jin Lei.

(This article belongs to the Special Issue: Advances in Object Detection: Methods and Applications)

Computers, Materials & Continua 2026, 86(2), 1-17. https://doi.org/10.32604/cmc.2025.069353

Abstract

This study proposes a lightweight apple detection method employing cascaded knowledge distillation (KD) to address the critical challenges of excessive parameters and high deployment costs in existing models. We introduce a Lightweight Feature Pyramid Network (LFPN) integrated with Lightweight Downsampling Convolutions (LDConv) to substantially reduce model complexity without compromising accuracy. A Lightweight Multi-channel Attention (LMCA) mechanism is incorporated between the backbone and neck networks to effectively suppress complex background interference in orchard environments. Furthermore, model size is compressed via Group_Slim channel pruning combined with a cascaded distillation strategy. Experimental results demonstrate that the proposed model achieves a 1% higher Average Precision (AP) than the baseline while remaining extremely lightweight (only about 800 k parameters). Notably, the two-stage KD version achieves over 20 Frames Per Second (FPS) on Central Processing Unit (CPU) devices, confirming its practical deployability in real-world applications.

Keywords

LMCA; LFPN; LDConv; group_slim; distillation

1  Introduction

The quality attributes of apples (e.g., size, sweetness, firmness) are heavily affected by harvesting methods [1]. Traditional harvesting methods have low efficiency, whereas harvesting robots can alleviate labor shortages, reduce costs, and improve apple quality [2]. Recent advances in computer vision and semiconductor technologies have enabled vision-based fruit detection systems, where deep learning methods outperform conventional image processing in complex scenarios [3]. The YOLO (You Only Look Once) architecture excels in agricultural applications by achieving a strong speed-accuracy trade-off at low deployment cost; current research therefore develops tailored lightweight solutions for chips with varying computing capabilities to enhance both real-time detection accuracy and cost-effectiveness on edge devices.

The key challenge in visual detection research lies in balancing model lightweighting and detection performance, particularly for deployment on resource-constrained edge devices. Although an enhanced YOLOv8n model [4] improved mAP50 by 1.61% while maintaining computational efficiency, its counting accuracy of 69.39% remains insufficient for high-precision harvesting applications, and it does not consider adaptive compression strategies for different hardware platforms. The NBR-DF [5] approach with YOLOv5 reached 96.4% AP through depth fusion and attention mechanisms, yet its fixed architecture lacks flexibility for hardware-specific optimization. For extreme lightweight deployment, Faster-YOLO-AP [6] compressed parameters to 0.66 M and FLOPs to 2.29 G while maintaining 84.12% mAP, demonstrating excellent edge device compatibility, yet its rigid design cannot adapt to different hardware capabilities. FBoT-Net [7] incorporated focal bottleneck transformer modules to enhance local and global feature representation, achieving 47.3% AP on small-scale apples, and a DETR-based approach [8] achieved 97.12% AP50 with a 51% speed improvement; however, Transformer architectures remain challenging on memory-limited devices. While APPLE_YOLO [9] reached 99.13% mAP, its computational demands make it unsuitable for low-power devices. Critically, existing methods employ static compression strategies that cannot dynamically adapt to different hardware constraints. Unlike these hardware-agnostic approaches, our method provides a series of deployment strategies specifically tailored to various hardware resources, with the goal of optimizing the balance between economic cost and detection performance. The proposed method enhances detection accuracy in complex environments by strengthening multi-scale feature fusion and incorporating an attention mechanism, improving the recognition capability for densely clustered targets. In summary, our contributions are as follows.

•   YOLOv8 is made lightweight through a cascaded compression pipeline, and deployment solutions tailored to different hardware resources are proposed to achieve improved detection performance and cost-effectiveness in real-world applications.

•   The APPLE_YOLO model was developed as a comprehensive apple detection framework that incorporates KD, Group_Slim, a novel LFPN, LDConv, and LMCA.

•   Both schemes can achieve an inference speed of over 20 FPS on the CPU.

2  Materials and Methods

2.1 Dataset Analysis

We utilized an apple image dataset (Fig. 1a) collected using Redmi K50 smartphones from orchards in Yantai, Shandong and Panjin, Liaoning, capturing various lighting conditions and seasonal variations for our experiments. The dataset comprises 3376 images (2260 for training, 408 for validation, and 708 for testing), predominantly containing small-to-medium sized targets with dense detection characteristics. To validate model generalizability, we employed a public apple dataset [10] (Fig. 1b) consisting of 4195 images (3356 for training and 839 for validation) featuring apples of two distinct colors captured under diverse lighting conditions and viewing angles.


Figure 1: Data samples from two different apple datasets

2.2 Overall Architecture

Fig. 2 illustrates the architecture of the proposed APPLE_YOLO framework. We propose a two-stage simplification approach. Stage 1 integrates LMCA and LDConv into the LFPN backbone with initial KD. Stage 2 applies Group_Slim pruning to the model from Stage 1, followed by secondary distillation, enhancing detection performance (Fig. 3).


Figure 2: An APPLE_YOLO framework used in this research


Figure 3: Cascade distillation process diagram

2.2.1 LFPN

The FPN structure (Fig. 4a) and detection heads in YOLOv8n show limitations in apple detection due to ineffective feature scale alignment, parameter redundancy, and complex background interference. Since apples are typically small to medium in size, the default P5 head, designed for large objects, is computationally redundant and contributes little to detection accuracy. Inspired by Feng et al. [11], we propose an LFPN (Fig. 4b) by analyzing bounding box sizes and dataset characteristics. Because aligning detection heads with target scales can reduce parameters and improve accuracy, the LFPN replaces the original P5 head with P2 and P6 detection heads that better match the size characteristics of small to medium-sized apples. In terms of channel configuration, the original 1024-dimensional channels in the P5 layer are adjusted to 128-dimensional channels in the P2 layer and 762-dimensional channels in the P6 layer. This adjustment decreases model parameters while enhancing feature alignment capability, making the model more suitable for apple detection, as the sketch below illustrates.
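To make the effect of the channel adjustment concrete, the following back-of-the-envelope sketch (our own illustration, not the authors' code; it assumes each branch is a single bias-free 3 × 3 convolution with the stated width) compares the weight count of the original 1024-channel P5 branch with the combined 128-channel P2 and 762-channel P6 branches.

```python
# Illustrative weight count for a single 3x3 convolution feeding a detection
# branch: params = C_in * C_out * k * k (bias and the rest of the head omitted;
# this only illustrates the direction of the channel adjustment).
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    return c_in * c_out * k * k

p5_branch = conv_params(1024, 1024)                            # original P5 width
lfpn_branches = conv_params(128, 128) + conv_params(762, 762)  # P2 + P6 widths

print(f"P5 branch:      {p5_branch:,} weights")
print(f"P2 + P6 branch: {lfpn_branches:,} weights")
print(f"reduction:      {1 - lfpn_branches / p5_branch:.1%}")
```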


Figure 4: A comparative analysis of various FPNs

2.2.2 LDConv

The traditional convolutional operations in YOLOv8’s neck network exhibit feature extraction limitations: their fixed kernel sizes and restricted receptive fields make it difficult for the network to maintain computational efficiency while achieving coordinated optimization between local detail enhancement and global context preservation. By integrating average pooling with grouped convolution techniques, we develop an LDConv module (Fig. 5) to address these challenges.


Figure 5: Lightweight downsampling convolution integrating grouping ideas

The dual-path architecture integrates the computational efficiency of average pooling with the feature discriminability of grouped convolutions. The upper path employs average pooling for dimensionality reduction, decreasing the number of parameters while preserving critical local features through a pixel importance assessment mechanism. The lower path incorporates grouped convolution as the core feature extractor, which partitions input feature maps into multiple groups and performs convolution independently within each group. This design reduces both parameters and computational cost to 1/G (G: the number of groups) compared to standard convolution, enhancing the hardware-friendly nature of the module. The output of the grouped convolution is then weighted and aggregated using the importance weights generated by the upper path, achieving soft pruning of low-weight features and further suppressing redundant parameters. This hardware-conscious design achieves a parameter reduction while maintaining original detection accuracy, making it suitable for edge device deployment where both computational resources and detection performance are considerations. By dynamically adjusting the weighted multiplication operations based on real-time feature importance evaluation, it suppresses interference from complex backgrounds, further enhancing the module’s efficient feature extraction capabilities.
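A minimal PyTorch sketch of this dual-path idea is given below. It reflects our interpretation of the description above rather than the authors' released code: the module name LDConvSketch, the single-channel importance map, and the BatchNorm/SiLU choices are assumptions.

```python
import torch
import torch.nn as nn

class LDConvSketch(nn.Module):
    """Sketch of the dual-path lightweight downsampling idea: the upper path
    average-pools the input and produces per-pixel importance weights, the lower
    path applies a stride-2 grouped convolution, and the grouped-conv output is
    re-weighted by the importance map (soft pruning of low-weight features)."""

    def __init__(self, c_in: int, c_out: int, groups: int = 4):
        super().__init__()
        # Upper path: cheap downsampling + pixel importance assessment.
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.importance = nn.Sequential(
            nn.Conv2d(c_in, 1, kernel_size=1),  # 1-channel importance map
            nn.Sigmoid(),
        )
        # Lower path: grouped convolution, roughly 1/G of the standard cost.
        self.gconv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2,
                               padding=1, groups=groups)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.importance(self.pool(x))      # B x 1 x H/2 x W/2
        y = self.act(self.bn(self.gconv(x)))   # B x C_out x H/2 x W/2
        return y * w


x = torch.randn(1, 64, 80, 80)
print(LDConvSketch(64, 128, groups=4)(x).shape)  # torch.Size([1, 128, 40, 40])
```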

2.2.3 LMCA

In object detection, complex backgrounds (e.g., cluttered textures, occlusions, or lighting variations) can obscure targets, resulting in false detections and missed detections. Dense or overlapping objects may also cause feature confusion and localization errors, reducing accuracy. To address these issues, we propose an LMCA module (Fig. 6) to enhance the network's ability to capture critical target information. The input feature map (H × W × C) undergoes spatial average pooling along the horizontal and vertical directions using (H, 1) and (1, W) kernels, respectively, producing two feature sets. Bidirectional interaction connections are then established to enable deep cross-dimensional semantic fusion. The enhanced feature map is obtained by summing the fused horizontal and vertical features, with their combined output represented by Eq. (1).

$f=\left(\frac{1}{H}\sum_{m=1}^{H}F_{:,:,c}(m,:)\right)_{c}+\left(\frac{1}{W}\sum_{n=1}^{W}F_{:,:,c}(:,n)\right)_{c}$ (1)

where $F_{:,:,c}$ represents the fixed-channel 3D feature slice. LMCA uses a 1 × 1 convolution and the Sigmoid function to generate dynamic spatial weights P, which adaptively weight the feature map F. To facilitate cross-dimensional fusion, LMCA designs a bidirectional cross-excitation pathway: decomposing the directional feature f into horizontal/vertical components ($f_{X1}$, $f_{X2}$) and performing multi-scale matrix multiplication with cross-dimensionally compressed forms of the spatial weights ($P_{X2}$, $P_{X1}$), producing dimensionally aligned interactive features. Simultaneously, a global-local joint calibration is introduced: compressing the input feature F through Global Average Pooling (GAP) and multiplying it with the spatial weight to form channel scaling coefficients; the original input F is then element-wise multiplied with the aforementioned cross-dimensional interaction output, ultimately obtaining the optimized feature Y through triple fusion (Eq. (2)).

$Y=F\odot\left(\Gamma_{X1}+\Gamma_{X2}\right)\odot\left(\frac{1}{HW}\sum_{i,j}P\odot\mathrm{GAP}(F)\right)$ (2)

where $\Gamma_{X1}=f_{X1}\times P_{X2}$ and $\Gamma_{X2}=f_{X2}\times P_{X1}$ represent horizontal and vertical information enhancement, respectively; $f_{X1}$ is a horizontal feature and $P_{X2}$ is the vertical spatial weight. DCN [12] uses offset masks to constrain kernel displacements, boosting flexibility. LMCA adjusts these masks through multi-channel perception of key feature regions. Embedded in DCN's offset layer, it generates a spatially adaptive weight tensor; multiplying this tensor with the original masks enables adaptive offset tuning and enhances feature extraction. We integrated multiple enhanced DCNv2 modules (Fig. 7), proposed DCNv2_LMCA, and used it as the teacher model in distillation.
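The following simplified PyTorch sketch illustrates how the directional pooling, spatial weighting, and GAP-based calibration of Eqs. (1)-(2) can be combined. It is a hedged approximation of our reading of the text, not the authors' exact implementation: the cross-excitation and fusion operators, the 1 × 1 convolution layout, and the module name LMCASketch are assumptions.

```python
import torch
import torch.nn as nn

class LMCASketch(nn.Module):
    """Simplified directional-pooling attention: horizontal/vertical average
    pooling produces directional descriptors, 1x1 convolutions with Sigmoid turn
    them into spatial weights, and a GAP branch supplies channel scaling; the
    input is re-weighted by their fusion."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eq. (1): directional average pooling along H and W.
        f_h = x.mean(dim=3, keepdim=True)            # B x C x H x 1
        f_w = x.mean(dim=2, keepdim=True)            # B x C x 1 x W
        # Directional spatial weights (1x1 conv + Sigmoid).
        p_h = self.sigmoid(self.conv_h(f_h))         # B x C x H x 1
        p_w = self.sigmoid(self.conv_w(f_w))         # B x C x 1 x W
        # Cross-excitation: each direction weighted by the other, broadcast to H x W.
        cross = f_h * p_w + f_w * p_h                # B x C x H x W
        # Global-local calibration: GAP-derived channel scaling.
        gap = self.sigmoid(self.conv_c(x.mean(dim=(2, 3), keepdim=True)))
        return x * cross * gap                       # triple fusion


x = torch.randn(1, 128, 40, 40)
print(LMCASketch(128)(x).shape)  # torch.Size([1, 128, 40, 40])
```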


Figure 6: Lightweight multi-channel attention


Figure 7: A feature extraction module integrating deformable convolution and LMCA

2.2.4 Knowledge Distillation

During lightweight model optimization, dense object detection often suffers performance degradation due to feature confusion and localization ambiguity caused by clustered targets, occlusion, and background interference. We propose a knowledge distillation strategy that resolves two key issues: the failure of traditional Softmax-based classification distillation due to background noise in overlapping regions and task misalignment, and the inability of discrete probability-based localization distillation to capture continuous spatial relationships. We introduce a binary classification distillation loss (Eq. (3)) that aligns target-background confidence between teacher and student models, enhancing foreground detection while suppressing background interference.

$R_{cls}(x)=\sum_{i=1}^{n}\sum_{j=1}^{K}P_{ij}\,\mathrm{BCE}\left(p_{i,j}^{s},\,p_{i,j}^{t}\right)$ (3)

where $p^{t}=\frac{1}{1+e^{-F^{t}}}$ and $p^{s}=\frac{1}{1+e^{-F^{s}}}$, the weight $P_{ij}=\left|p_{i,j}^{t}-p_{i,j}^{s}\right|$ with $P\in\mathbb{R}^{n\times K}$; $p_{i,j}^{s}$ and $p_{i,j}^{t}$ respectively represent the predicted probability values of the student (S) and teacher (T) models for the j-th category on the i-th anchor box, and BCE(·) denotes the standard binary cross-entropy function.

Furthermore, based on the Generalized Intersection over Union (GIoU) metric [13], we calculate the GIoU value between the teacher's bounding box $b_{i}^{t}$ and the student's bounding box $b_{i}^{s}$, which is subsequently integrated with Eq. (3) to construct the total distillation loss function (Eq. (4)):

$R_{total}(x)=\beta_{1}R_{cls}(x)+\beta_{2}\sum_{i=1}^{n}\max_{j}\left(P_{ij}\right)\left(1-\mathrm{GIoU}\left(b_{i}^{t},b_{i}^{s}\right)\right)$ (4)

where $\beta_{1}$ and $\beta_{2}$ are hyperparameters. Adopting the channel-wise distillation (CWD) loss [14] (Eq. (5)) for feature distillation enhances per-channel information exploitation and synchronizes cross-network channel activations.

$\theta\left(h_{c,i}\right)=\frac{\exp\left(h_{c,i}/T_{1}\right)}{\sum_{i=1}^{W\cdot H}\exp\left(h_{c,i}/T_{1}\right)}$ (5)

$H\left(h^{T},h^{S}\right)=\frac{T_{1}^{2}}{C}\sum_{c=1}^{C}\sum_{i=1}^{W\cdot H}\theta\left(h_{c,i}^{T}\right)\cdot\log\left[\frac{\theta\left(h_{c,i}^{T}\right)}{\theta\left(h_{c,i}^{S}\right)}\right]$ (6)

where $h_{c}\in\mathbb{R}^{W\times H}$ and $T_{1}$ controls the scaling. $H(h^{T},h^{S})$ (Eq. (6)) measures the channel distribution divergence between the T and S networks, where $h^{T}$ and $h^{S}$ denote their respective activation maps. Activation values are transformed into probability distributions via $\theta(\cdot)$, with i denoting spatial positions and $c\in[1,C]$ indexing channels. The overall KD loss is shown in Eq. (7).

$R=\eta_{1}\times\left|Y_{teacher}-Y_{student}\right|+\eta_{2}\times R_{total}(x)+\eta_{3}\times H\left(h^{T},h^{S}\right)$ (7)

where the hyperparameters $\eta_{1}$, $\eta_{2}$, and $\eta_{3}$ control the weights of the individual loss components, and Y represents the model prediction values.
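For illustration, a minimal sketch of the classification-distillation term of Eq. (3) and the channel-wise feature term of Eqs. (5)-(6) is shown below. It is our simplified reading, not the authors' training code; the tensor shapes and temperature default are assumptions. The localization term of Eq. (4) would additionally weight (1 − GIoU) between matched teacher/student boxes by the per-anchor maximum of $P_{ij}$.

```python
import torch
import torch.nn.functional as F

def cls_distill_loss(logits_s: torch.Tensor, logits_t: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (3): weighted binary-classification distillation.
    logits_* are (n_anchors, K) raw class scores from the student/teacher heads."""
    p_s, p_t = torch.sigmoid(logits_s), torch.sigmoid(logits_t)
    weight = (p_t - p_s).abs()                          # P_ij: emphasizes disagreement
    bce = F.binary_cross_entropy(p_s, p_t.detach(), reduction="none")
    return (weight.detach() * bce).sum()

def cwd_loss(feat_t: torch.Tensor, feat_s: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sketch of Eqs. (5)-(6): channel-wise distillation. feat_* are (B, C, H, W)
    activation maps; each channel becomes a spatial distribution compared with KL."""
    b, c, h, w = feat_t.shape
    t = F.softmax(feat_t.reshape(b, c, -1) / tau, dim=-1)
    log_t = F.log_softmax(feat_t.reshape(b, c, -1) / tau, dim=-1)
    log_s = F.log_softmax(feat_s.reshape(b, c, -1) / tau, dim=-1)
    return (tau ** 2 / c) * (t * (log_t - log_s)).sum()

# Example shapes: 100 anchors with 1 class; one pair of 64-channel 40x40 feature maps.
print(cls_distill_loss(torch.randn(100, 1), torch.randn(100, 1)))
print(cwd_loss(torch.randn(2, 64, 40, 40), torch.randn(2, 64, 40, 40)))
```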

2.2.5 Group_Slim

To meet mobile computational and storage constraints, we employ model pruning. While non-sparsity-inducing methods like Group_norm [15] reduce training time, they incur significant detection degradation under high pruning ratios. We address this by integrating sparse group Lasso [16] into Slim pruning [17], enforcing sparsity through group-wise weight pruning. The sparse group lasso is shown in Eq. (8); the pruned model’s loss function comprises the base model’s loss plus dual regularization terms.

$L(\theta)=L_{0}(\theta)+\lambda_{1}\sum_{i=0}^{g}\left\|W_{i}\right\|_{0}+\lambda_{2}\sum_{j=0}^{N}\left\|w_{j}\right\|_{0}$ (8)

where $L(\theta)$ and $L_{0}(\theta)$ represent the pruned and unpruned loss functions, respectively, and $\lambda_{1}$ and $\lambda_{2}$ are regularization hyperparameters. $W_{i}$ and $w_{j}$ denote the weight matrix of the i-th group and the j-th individual weight, respectively. $\left\|W_{i}\right\|_{0}$ and $\left\|w_{j}\right\|_{0}$ represent the number of nonzero elements in the i-th group and the indicator of whether the j-th weight is nonzero, respectively.

The Group_Slim pruning process is shown in Fig. 8. First, the weights of the convolution layers in the network are divided into multiple groups, with each group corresponding to a convolution kernel. The sparse group lasso technique is then employed to eliminate insignificant groups. The weights in each group undergo sparse training, after which the model is Slim-pruned. Finally, the pruned model undergoes recovery training. The fundamental objective is to eliminate input and output channels that fall below a specified threshold, yielding a more efficient and streamlined network architecture. For multi-branch models, the technique also accounts for the additional branches; a sketch of the sparsity regularization is given below.
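As a hedged illustration of Eq. (8), the sketch below adds a sparse-group-lasso-style penalty to a model's training loss. Because the ℓ0 count is not differentiable, we use the standard differentiable surrogate (a group ℓ2 norm per convolution kernel plus an element-wise ℓ1 term, with an ℓ1 term on BatchNorm scaling factors for Slim-style channel selection); the exact grouping, penalties, and coefficients used in the paper may differ.

```python
import torch
import torch.nn as nn

def sparse_group_lasso_reg(model: nn.Module,
                           lam_group: float = 1e-3,
                           lam_elem: float = 1e-4) -> torch.Tensor:
    """Differentiable surrogate of the regularization in Eq. (8): group L2 per
    convolution kernel plus element-wise L1, and L1 on BatchNorm gammas so that
    Slim-style channel pruning can threshold low-importance channels.
    The coefficients here are placeholders, not the paper's settings."""
    reg = torch.zeros((), device=next(model.parameters()).device)
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight                                  # (out_ch, in_ch/groups, kH, kW)
            group_norms = w.flatten(1).norm(p=2, dim=1)   # one group per output kernel
            reg = reg + lam_group * group_norms.sum() + lam_elem * w.abs().sum()
        elif isinstance(m, nn.BatchNorm2d):
            reg = reg + lam_elem * m.weight.abs().sum()   # BN gammas drive channel pruning
    return reg

# During sparse training: total_loss = detection_loss + sparse_group_lasso_reg(model)
```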


Figure 8: A Slim pruning method integrating sparse group lasso

2.3 Experimental Parameters

For deploying the APPLE_YOLO model, we used an Ubuntu 22.04 platform with an NVIDIA RTX A5000 GPU (24 GB) and an Intel Xeon Platinum 8362 CPU (14 vCPUs), under Python 3.8, CUDA 11.3, and PyTorch 1.12. The key parameters are: input size = 640 × 640, learning rate = 0.001, batch size = 16, workers = 16, epochs = 200, close_mosaic = 10, optimizer = SGD. Pruning settings: reg = 0.02, reg_decay = 0.08, sparse iterations = 500, pruning recovery = 300 iterations. Mosaic augmentation was disabled for the last 10 epochs (close_mosaic = 10). Experiments show optimal performance at speed_up = 3.0. To avoid excessive sparsification and feature degradation in the lightweight model, global pruning is intentionally disabled. We use cascaded distillation with pruning disabled (False) in Stage 1 and enabled (True) in Stage 2. Distillation lasts 300 epochs, with mosaic augmentation turned off at epoch 250. Loss ratios are logic_loss = 2.0 and feature_loss = 1.2, with layers 21, 24, 27, and 30 used as distillation layers for both teacher and student models and a fixed attenuation constant. To maximize hardware efficiency, batch sizes of 16 (CPU, with the PyTorch model exported to NCNN) and 64 (GPU) are used to evaluate model inference speed.
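As a rough illustration of the CPU deployment path (pt -> NCNN) mentioned above, the snippet below assumes the Ultralytics export API; the checkpoint name apple_yolo.pt is a placeholder, and the commented TensorRT line is the analogous export for GPU/edge devices.

```python
from ultralytics import YOLO

# Hypothetical export workflow; "apple_yolo.pt" is a placeholder checkpoint name.
model = YOLO("apple_yolo.pt")
model.export(format="ncnn", imgsz=640)      # PyTorch .pt -> NCNN for CPU inference
# model.export(format="engine", imgsz=640)  # TensorRT engine for GPU/edge deployment
```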

2.4 Evaluation Indicators

To simplify the experimental analysis, Precision (Eq. (9)), Recall (Eq. (10)), and AP (Eq. (11)) are used as evaluation indices in this paper.

$\mathrm{Precision}=\frac{TP}{TP+FP}$ (9)

$\mathrm{Recall}=\frac{TP}{TP+FN}$ (10)

$AP=\int_{0}^{1}p(r)\,dr$ (11)

where TP is True Positive, FP is False Positive, FN is False Negative, and p(r) is the function of the precision-recall (PR) curve. Furthermore, this article selects parameters (Param), model size, FPS, and GFLOPS as additional evaluation criteria, where GFLOPS denotes the amount of computation required by the model and FPS is the number of image frames processed per second.
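For reference, a minimal sketch of these metrics is given below (our own illustration; the actual evaluation follows the standard YOLO AP protocol, whose interpolation scheme may differ from the plain trapezoidal integration used here, and the sample numbers are made up).

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Eqs. (9)-(10): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Eq. (11): area under the precision-recall curve (trapezoidal rule)."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

print(precision_recall(tp=90, fp=10, fn=20))   # (0.9, 0.818...)
r = np.array([0.0, 0.5, 0.8, 1.0])
p = np.array([1.0, 0.9, 0.7, 0.4])
print(average_precision(r, p))                 # ~0.825
```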

3  Analysis of Experimental Results

3.1 Ablation Experiment

The lightweight ablation study (Table 1) shows that the Phase I optimized model (S4) improves AP(0.5) and AP(0.5:0.95) by 1.1% and 2.9%, respectively, over the baseline YOLOv8n (S0), while reducing parameters by 25% and model size by 20%. The LFPN module (m0) contributes most significantly to both lightweighting and detection performance, maintaining a 0.5% AP gain in S1 compared to S0 despite major parameter reduction. LDConv (m2) replacement of backbone convolutions effectively suppresses parameter increases from the LMCA mechanism (m1) with negligible AP loss. Initial distillation using S2 as the student and its s-class variant as the teacher improves AP(0.5:0.95) by 3% over S0. In Phase II (S5–S6), pruning (m4) achieves the most substantial lightweighting effects, compressing parameters, GFLOPS, and model size by 63%, 67%, and 58%, respectively, while still delivering about 0.8% AP gain versus S0. Subsequent secondary distillation (m5) effectively restores detection performance compromised by pruning operations.


The curves of the teacher model and the distilled student model are compared in Fig. 9. The green, red, and blue curves correspond to the teacher, student, and distilled model, respectively. The blue curve (distilled model) clearly outperforms the red curve (student model) during initial distillation and maintains a slight advantage in secondary distillation. Fig. 10 compares the pruned channels: the x-axis denotes the name of each network layer, while the y-axis represents the number of parameters; red indicates the parameter count of the original model's channels, and blue reflects the parameter count after pruning. To complete the pruning, some network layers that cannot be pruned need to be skipped. This paper skips the downsampling convolution layer in LDConv, as well as the specific convolution and Distribution Focal Loss layers in the detection head.


Figure 9: Comparison of distillation curves between teacher and student models


Figure 10: Comparison of the number of channels before and after pruning

3.2 Comparative Experiment

We evaluate APPLE_YOLO against comparable YOLO algorithms (Table 2). While YOLOv9 demonstrates detection performance comparable to the OUR2 model, it requires more parameters, computation (GFLOPS), and storage; conversely, YOLOv12 and YOLOv13 models with similar parameter scales yield inferior AP values relative to our proposed solution. Although our approach has marginally slower GPU inference than YOLOv11, YOLOv10, YOLOv8n, and YOLOv5n, it offers notable accuracy advantages, with AP improvements ranging from 1.3% to 3%. Our method thus establishes a balanced trade-off between detection precision and operational efficiency. Fig. 11 visually presents the comparative results of the key metrics from Table 2, where the subfigures display the evolution curves of precision, recall, AP(0.5), and AP(0.5:0.95); the red and blue curves represent the initially distilled model and the final pruned-distilled model, respectively. To reduce computational costs, we adopted model convergence rather than a fixed number of training epochs as the stopping criterion. Our approach achieves accuracy and AP comparable to YOLOv9 while significantly reducing model size and computational requirements, with only marginally inferior recall compared to the other algorithms.


Figure 11: Comparison of different YOLO algorithms

This study evaluates multiple advanced backbone networks (Table 3); networks with low FPS_CPU are denoted by "–". As the backbone for YOLOv8, CSWin Transformer [18] achieves optimal performance (followed by Swin Transformer [19]), albeit with slower inference speeds. The reversible column network (RevCol) [20] demonstrates faster inference than MobileNetv3 [21] (47.5 FPS on CPU/990 FPS on GPU), but provides only marginal detection improvement over DEYOv2. Furthermore, we include a comparative analysis of SSD [22], YOLOX [23], and Cascade-RCNN [24]. In terms of AP(0.5), Faster R-CNN achieves the second-highest accuracy after OUR1/OUR2, yet suffers from a large model size (160 MB) and slow inference speed (<30 FPS), while DEYOv3 (DEYOv1.5n weights) exhibits faster inference (second only to the RevCol backbone) but inferior accuracy. Heatmap visualization (Fig. 12) shows that, compared to YOLOv8n, our approach focuses more accurately on apple features while effectively suppressing background interference. Our proposed method therefore demonstrates a superior balance between detection accuracy and model efficiency.


Figure 12: Comparison of apple feature visualization results

Furthermore, we assess the model's generalization ability using the public apple dataset, randomly selecting several images to visualize its detection results (Fig. 13). As shown in Table 4, compared with the dataset we created, the model shows no significant change in the number of parameters, model size, complexity, or inference speed across the two datasets. Nonetheless, detection performance decreases noticeably, primarily due to the scene disparity between the dataset used in this study and the public dataset; in addition, the larger proportion of immature apple samples in the latter leads to a considerable number of missed and false detections. Nevertheless, both OUR1 and OUR2 exhibit superior detection performance compared to the baseline algorithm, while achieving a significant reduction in parameters and overall complexity.


Figure 13: Comparison of different datasets


Fig. 14 presents a comprehensive performance comparison of Param, FPS, and AP. Fig. 14a compares parameters, AP(0.5:0.95), and FPS, while Fig. 14b evaluates model size, AP(0.5), and FPS. Color-coded algorithms (red ellipse: OUR1, blue: OUR2) with size-scaled circles (smaller = lighter) demonstrate our model's superior mobile deployment suitability versus the alternatives in Tables 2 and 3. Our models were further validated on a Raspberry Pi 5 (Fig. 15), reaching 20 FPS with TensorRT acceleration and model pruning and demonstrating real-time detection with maintained accuracy.


Figure 14: Comprehensive comparison of different algorithms


Figure 15: Edge device deployment (remote control by RealVNC software)

4  Conclusion

Lightweight detection models are crucial for deploying deep learning models on resource-constrained devices. This paper proposes two deployment schemes that account for varying hardware resource limitations. Even without a GPU, the model proposed in this research can still be deployed and achieve an inference speed of 20 FPS. Detailed validation of generalization performance was conducted on multiple datasets, and the proposed model reduces the number of parameters and simplifies model complexity. The two schemes show an average improvement of 1.4% in model identification capability. Future research should address potential limitations and biases within datasets, such as limited variability in illumination conditions, limited representation of different species, and a narrow range of scales and viewing angles; these factors may introduce model biases and constrain generalization in real-world settings. It is therefore essential to enhance dataset diversity by encompassing varied lighting conditions, multi-species characteristics, and multi-scale/multi-angle scenarios. Meanwhile, forthcoming work should focus on optimizing the efficiency of attention mechanisms to alleviate computational bottlenecks during inference and on developing scalable solutions suited for large-scale deployment.

Acknowledgement: We gratefully acknowledge the Jilin Provincial Department of Education for funding and our co-authors for their valuable insights.

Funding Statement: This research was funded by the Jilin Provincial Department of Education Project Fund, grant number JJKH20240315KJ, and the National Natural Science Foundation of China under Grant 52175538.

Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization and methodology: Xin Ma and Jin Lei; software: Chenying Pei; validation: Chunming Wu. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The dataset and code can be accessed at the following URL: https://github.com/JinLei101/1012 (accessed on 24 September 2025).

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Zhang X, Xiang W, Dong F, Tahir MM, Yang W, Zhang D, et al. Overexpression of apple MdGAMYB promotes early flowering and increases plant height in transgenic Arabidopsis and tomato. Sci Hortic. 2024;328:112880. doi:10.1016/j.scienta.2024.112880. [Google Scholar] [CrossRef]

2. Hu G, Zhou J, Chen Q, Luo T, Li P, Chen Y, et al. Effects of different picking patterns and sequences on the vibration of apples on the same branch. Biosyst Eng. 2024;237(6):26–37. doi:10.1016/j.biosystemseng.2023.11.010. [Google Scholar] [CrossRef]

3. Wang T, Ma Z, Yang T, Zou S. PETNet: a YOLO-based prior enhanced transformer network for aerial image detection. Neurocomputing. 2023;547(3):126384. doi:10.1016/j.neucom.2023.126384. [Google Scholar] [CrossRef]

4. Jin T, Han X, Wang P, Zhang Z, Guo J, Ding F. Enhanced deep learning model for apple detection, localization, and counting in complex orchards for robotic arm-based harvesting. Smart Agric Technol. 2025;10(7):100784. doi:10.1016/j.atech.2025.100784. [Google Scholar] [CrossRef]

5. Kaukab S, Ghodki BM, Ray H, Kalnar YB, Narsaiah K, Brar JS. Improving real-time apple fruit detection: multi-modal data and depth fusion with non-targeted background removal. Ecol Inform. 2024;82(17):102691. doi:10.1016/j.ecoinf.2024.102691. [Google Scholar] [CrossRef]

6. Liu Z, Abeyrathna RMRD, Sampurno RM, Nakaguchi VM, Ahamedet T. Faster-YOLO-AP: a lightweight apple detection algorithm based on improved YOLOv8 with a new efficient PDWConv in orchard. Comput Electron Agric. 2024;223(8):109118. doi:10.1016/j.compag.2024.109118. [Google Scholar] [CrossRef]

7. Sun M, Zhao R, Yin X, Xu L, Ruan C, Jia W. FBoT-Net: focal bottleneck transformer network for small green apple detection. Comput Electron Agric. 2023;205:107609. doi:10.1016/j.compag.2022.107609. [Google Scholar] [CrossRef]

8. Ji W, Zhai K, Xu B, Wu J. Green apple detection method based on multidimensional feature extraction network model and transformer module. J Food Prot. 2025;88(1):100397. doi:10.1016/j.jfp.2024.100397. [Google Scholar] [PubMed] [CrossRef]

9. Karthikeyan M, Subashini TS, Srinivasan R, Santhanakrishnan C, Ahilan A. YOLOAPPLE: augment Yolov3 deep learning algorithm for apple fruit quality detection. Signal Image Video Process. 2024;18(1):119–28. doi:10.1007/s11760-023-02710-z. [Google Scholar] [CrossRef]

10. Sekharamantry P, Melgani F, Malacarne J, Ricci R, Almeida R, Marcato J. A seamless deep learning approach for apple detection, depth estimation, and tracking using YOLO models enhanced by multi-head attention mechanism. Computers. 2024;13(3):83. doi:10.3390/computers13030083. [Google Scholar] [CrossRef]

11. Feng C, Zhong Y, Gao Y, Scott M, Huang W. TOOD: task-aligned one-stage object detection. In: IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10–17; Montreal, QC, Canada. p. 3490–9. [Google Scholar]

12. Zhu X, Hu H, Lin S, Dai J. Deformable convnets v2: More deformable, better results. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15–20; Long Beach, CA, USA. p. 9300–8. [Google Scholar]

13. Hamid R, Nathan T, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: a metric and a loss for bounding box regression. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15–20; Long Beach, CA, USA. p. 658–66. [Google Scholar]

14. Shu C, Liu Y, Gao J, Yan Z, Shen C. Channel-wise knowledge distillation for dense prediction. In: IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10–17; Montreal, QC, Canada. p. 5291–300. [Google Scholar]

15. Fang G, Ma X, Song M, Mi M, Wang X. DepGraph: towards any structural pruning. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. p. 16091–101. [Google Scholar]

16. Friedman J, Hastie T, Tibshirani R. A note on the group lasso and a sparse group lasso. arXiv:1001.0736. 2010. [Google Scholar]

17. Liu Z, Li J, Shen Z, Huang G, Yan S, Zhang C. Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. p. 2755–63. [Google Scholar]

18. Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, et al. CSWin transformer: a general vision transformer backbone with cross-shaped windows. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022 Jun 18–24; New Orleans, LA, USA. p. 12114–24. [Google Scholar]

19. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10–17; Montreal, QC, Canada; 2021. p. 9992–10002. [Google Scholar]

20. Cai Y, Zhou Y, Han Q, Sun J, Kong X, Li J, et al. Reversible column networks. arXiv:2212.11696. 2022. [Google Scholar]

21. Koonce B. MobileNetV3. In: Convolutional neural networks with swift for tensorflow. Berkeley, CA, USA: Apress; 2021. p. 125–44. doi:10.1007/978-1-4842-6168-2_11. [Google Scholar] [CrossRef]

22. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C, et al. SSD: single shot multibox detector. In: Proceedings of the 14th European Conference on Computer Vision (ECCV); 2016 Oct 11–14; Amsterdam, The Netherlands. [Google Scholar]

23. Zheng G, Liu S, Wang F, Li Z, Sun J. YOLOX: exceeding YOLO series in 2021. arXiv:2107.08430. 2021. [Google Scholar]

24. Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2018 Jun 18–23; Salt Lake City, UT, USA. p. 6154–62. [Google Scholar]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.