Open Access
ARTICLE
Deep Learning-Based Toolkit Inspection: Object Detection and Segmentation in Assembly Lines
1 Department of Mechanical Engineering, National Chung Cheng University, 168, University Rd., Min Hsiung, Chia Yi, 62102, Taiwan
2 Department of Biomedical Imaging, Chennai Institute of Technology, Sarathy Nagar, Chennai, 600069, India
3 Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, 147001, India
4 Director of Technology Development, Hitspectra Intelligent Technology Co., Ltd., Kaohsiung, 80661, Taiwan
* Corresponding Author: Hsiang-Chen Wang. Email:
Computers, Materials & Continua 2026, 86(1), 1-23. https://doi.org/10.32604/cmc.2025.069646
Received 27 June 2025; Accepted 21 August 2025; Issue published 10 November 2025
Abstract
Modern manufacturing processes have become increasingly reliant on automation because of the accelerated transition from Industry 3.0 to Industry 4.0. Manual inspection of products on assembly lines remains inefficient, error-prone, and inconsistent, underscoring the need for a reliable automated inspection system. Leveraging both object detection and image segmentation, this research proposes a vision-based solution for detecting various kinds of tools in a toolkit using deep learning (DL) models. Two Intel RealSense D455f depth cameras were arranged in a top-down configuration to capture both RGB and depth images of the toolkits. After applying multiple capture constraints and enhancing the images through preprocessing and augmentation, a dataset of 3300 annotated RGB-D images was generated. Candidate DL models were selected through a comprehensive assessment of mean Average Precision (mAP), precision-recall balance, inference latency (target ≥30 FPS), and computational burden, resulting in a preference for YOLO and Region-based Convolutional Neural Network (R-CNN) variants over Vision Transformer (ViT)-based models because of the latter's higher latency and resource requirements. YOLOv5, YOLOv8, YOLOv11, Faster R-CNN, and Mask R-CNN were trained on the annotated dataset and evaluated using key performance metrics (precision, recall, accuracy, and F1-score). YOLOv11 delivered the best overall balance, achieving 93.0% precision, 89.9% recall, and a 90.6% F1-score in object detection, as well as 96.9% precision, 95.3% recall, and a 96.5% F1-score in instance segmentation, with an average inference time of 25 ms per frame (≈40 FPS), confirming real-time performance. Building on these results, a YOLOv11-based Windows application was deployed in a real-time assembly line environment, where it accurately processed live video streams to detect and segment tools within toolkits, demonstrating its practical effectiveness in industrial automation. Beyond detection and segmentation, the application precisely measures socket dimensions by applying edge detection to the YOLOv11 segmentation masks. This enables specification-level quality control directly on the assembly line, strengthening real-time inspection capability. The implementation represents a significant step toward intelligent manufacturing under the Industry 4.0 paradigm, providing a scalable, efficient, and accurate approach to automated inspection and dimensional verification.
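For illustration, the sketch below shows how such a pipeline might be wired together in Python: a YOLOv11 segmentation model running on an Intel RealSense color stream, with socket dimensions estimated from the predicted masks via contour fitting (standing in for the paper's edge-detection step). It assumes the ultralytics and pyrealsense2 packages; the weights file toolkit_yolo11_seg.pt and the MM_PER_PIXEL calibration constant are hypothetical placeholders, not values from the paper.

```python
# Minimal sketch: real-time tool segmentation on a RealSense color stream
# and socket dimension estimation from the resulting masks.
# Assumes the `ultralytics` and `pyrealsense2` packages; the weights file,
# and the MM_PER_PIXEL calibration constant are illustrative only.
import cv2
import numpy as np
import pyrealsense2 as rs
from ultralytics import YOLO

MM_PER_PIXEL = 0.20  # hypothetical pixel-to-mm scale from camera calibration

model = YOLO("toolkit_yolo11_seg.pt")  # hypothetical fine-tuned weights

# Configure one Intel RealSense D455f color stream at 30 FPS.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
pipeline.start(config)

try:
    while True:
        frames = pipeline.wait_for_frames()
        color_frame = frames.get_color_frame()
        if not color_frame:
            continue
        frame = np.asanyarray(color_frame.get_data())

        # Run instance segmentation on the live frame.
        result = model(frame, verbose=False)[0]
        if result.masks is None:
            continue
        for mask, box in zip(result.masks.data, result.boxes):
            # Binarize the predicted mask and trace its outline.
            m = (mask.cpu().numpy() * 255).astype(np.uint8)
            m = cv2.resize(m, (frame.shape[1], frame.shape[0]))
            contours, _ = cv2.findContours(
                m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            if not contours:
                continue
            # Fit a rotated rectangle to the largest contour and convert
            # its side lengths from pixels to millimetres.
            rect = cv2.minAreaRect(max(contours, key=cv2.contourArea))
            w_px, h_px = rect[1]
            print(f"class={int(box.cls)} "
                  f"size={w_px * MM_PER_PIXEL:.1f} x "
                  f"{h_px * MM_PER_PIXEL:.1f} mm")
finally:
    pipeline.stop()
```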
Supplementary Material
Supplementary Material File
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

