Open Access

ARTICLE

VitSeg-Det & TransTra-Count: Networks for Robust Crack Detection and Measurement in Dynamic Video Scenes

Langyue Zhao1,2, Yubin Yuan3,*, Yiquan Wu2,*

1 College of Computer Science, Weinan Normal University, Weinan, 714000, China
2 College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 210000, China
3 College of Information Engineering, Yangzhou University, Yangzhou, 225127, China

* Corresponding Authors: Yubin Yuan; Yiquan Wu

Computers, Materials & Continua 2026, 87(1), 82 https://doi.org/10.32604/cmc.2025.070563

Abstract

Regular detection of pavement cracks is essential for infrastructure maintenance. However, existing methods often overlook challenges such as the continuous evolution of crack features across video frames and the difficulty of defect quantification. To this end, this paper proposes an integrated Transformer-based framework for pavement crack detection, segmentation, tracking, and counting. First, we design VitSeg-Det, an integrated detection and segmentation network that can accurately locate and segment tiny cracks in complex scenes. Second, the TransTra-Count system is developed to automatically count defects by combining defect tracking with width estimation. Finally, we conduct experimental verification on three datasets. The results show that the proposed method outperforms existing deep learning methods in detection accuracy. In addition, tests on real-scene videos show that the framework can accurately label defect locations and output defect counts in real time.

Keywords

Crack detection; multi-object tracking; semantic segmentation; counting; transformer

1  Introduction

Cracks are a critical form of damage in civil engineering and are commonly found in various infrastructures, including bridges, pavements, buildings, dams, and tunnels. These cracks not only impair the functionality of structural components but also pose significant safety hazards, potentially leading to catastrophic consequences. Thus, early crack detection is vital for pavement safety. However, identifying cracks, particularly microcracks, is challenging due to complex backgrounds, uneven illumination, and obstructions. While conventional manual inspection is tedious and error-prone, computer vision advancements now enable automated crack detection.

Current mainstream detection algorithms, such as Faster R-CNN [1], the YOLO series [2], and DETR [3], have achieved efficient localization and classification on single-frame images. However, they still face several limitations: (1) Lack of dynamic continuity modeling: these models process individual frames independently, ignoring the continuous evolution of cracks across video sequences; as a result, the same crack may be detected repeatedly in real-world scenarios, leading to duplicate counting or trajectory drift. (2) Inability to track crack identity: most existing detectors lack an object tracking mechanism, making it difficult to determine whether cracks detected in consecutive frames belong to the same entity, which hinders consistent identity maintenance over time. (3) Unpredictable object counts: while detection models can localize objects with bounding boxes, they cannot directly output the total number of defects; in video sequences, defect counting often relies on additional post-processing logic, increasing system complexity. (4) Limited performance on small objects: traditional convolution-based detection networks struggle with small or texture-blurred cracks due to restricted receptive fields and resolution reduction. (5) Absence of robust counting mechanisms: most industrial inspection systems either lack integrated counting functionality or rely on heuristic threshold-matching rules, missing end-to-end learnable and robust counting solutions.

In practical engineering applications, such as road condition assessment [4], pipeline crack monitoring [5], weld defect inspection [6] and other scenarios, defect detection systems must process continuous video streams rather than performing single frame image analysis. Under such dynamic conditions, defect manifestations across temporal frames are susceptible to multiple interference factors, such as camera perspectives, illumination changes, sensor vibration, or occlusion. Defect morphology may dynamically evolve (e.g., crack propagation/lateral widening), while appearance features may suffer from blurring or distortion. Consequently, static detectors alone are incapable of achieving accurate defect identification and quantification, let alone fulfilling the demands for automated diagnosis and early warning systems in engineering practice.

In view of the above problems, we propose a Transformer-based integrated measurement framework for dynamic crack defects from the perspective of visual measurement system design. The framework structurally integrates detection and segmentation with tracking and counting, balancing recognition accuracy and quantitative analysis capability. By introducing a fine feature sampling mechanism and cross-frame identity matching, the system achieves continuous identification and quantitative statistics of cracks in the video stream, with high automation, strong robustness, and low latency. The main contributions of this paper are as follows:

(1)   A multi-task integrated visual inspection system for dynamic scenes is proposed, integrating detection, segmentation, tracking, and counting in the same architecture to achieve continuous recognition, status tracking, and quantity statistics of defect objects in the video stream. Targeting the salient characteristics of crack-like defects in dynamic scenes, mechanisms such as multi-scale dilated convolution, channel-spatial attention fusion, and dynamic feature sampling are adopted, enabling the system to maintain stable recognition in complex environments involving motion, occlusion, and non-rigid changes.

(2)   The integrated VitSeg-Det detection and segmentation network is designed, with a multi-scale feature representation built on an EfficientNet-b5 backbone. A micro-scale feature scoring module and a macro-scale perception module are integrated to achieve high-precision localization and mask generation for microcracks. The scoring module combines channel and spatial attention mechanisms, selects high-information-entropy regions through dynamic sampling, and feeds them to the Transformer encoder, improving the model's response to fine-grained defects; the macro-scale branch uses dilated convolution to model the global topological structure of cracks, enhancing the system's perception of long-range irregular cracks.

(3)   The TransTra-Count object tracking and counting module is proposed. Based on the Transformer self-attention mechanism, a Spatial-Feature Dual-Modal Data Association and Long-Term Memory Update Trade-off Model (CrackDSF-LMe) is constructed to maintain stable crack identities under occlusion, blur, and illumination disturbance. The system integrates an unsupervised mask skeleton width estimation algorithm and combines a width smoothing mechanism with a change-rate constraint, effectively solving the problem of sudden width changes for the same crack and realizing dynamic quantitative assessment and robust statistics of crack defects.

(4)   A high-resolution, sequential, self-built crack video dataset, RoadDefect-MT, was constructed, covering a variety of complex scenes involving different pavement disease types, occlusion, and illumination changes, to comprehensively verify the system's measurement accuracy, tracking consistency, and statistical stability. This dataset fills the gap in existing public datasets, which lack dynamic characteristics and measurement annotations, and provides basic resources and an experimental platform for subsequent crack identification and visual measurement research.

In addition, the system fully considers engineering deployment requirements in its design, supports operation on edge devices, and is suitable for industrial measurement scenarios such as road inspection, weld quality assessment, and pipeline crack monitoring, giving it good generalizability and practical value. This work not only improves crack detection accuracy but also elevates the task from defect recognition to quantifiable measurement, providing a design paradigm for future video-based intelligent measurement systems.

2  Related Work

Recent advances in computer vision have revolutionized infrastructure monitoring, enabling automated, precise measurement for structural health assessment. Key applications include pavement condition evaluation [7], bridge structural analysis [8], and concrete crack detection [9], where vision-based systems now play a pivotal role in operational decision-making. Particularly in metrology, the integration of computer vision with quantitative measurement techniques has emerged as a critical research focus, addressing the fundamental challenge of achieving automated defect identification with parametric precision and repeatability.

2.1 CNN-Based Crack Detection Method

The rise of deep learning has provided new technological pathways for automatic crack detection and measurement [10], particularly within the CNN framework. Through end-to-end training methods, deep learning enables automatic learning of multi-level semantic features, significantly improving detection accuracy and scene adaptability. Research in this field can be broadly categorized into two main types: image segmentation and object detection.

Segmentation Methods. The goal of image segmentation tasks is to achieve pixel-level localization of crack areas. Common approaches include architectures such as FCN [11], DeepLab [12], and U-Net [13]. In the field of crack segmentation, Sun et al. developed a multi-scale attention module based on Deeplabv3+ to guide the decoder in extracting more fine-grained crack edge information [14]. Kang et al. integrated Faster-RCNN with tubular flow field and distance transformation modules to achieve segmentation and parametric measurement under complex backgrounds [15]. Ali and Cha introduced adversarial training mechanisms to alleviate the issue of scarce annotations while enhancing segmentation performance on concrete structures [16]. Kang and Cha designed a semantic transformation network combining multi-head attention mechanisms and compression modules, significantly improving segmentation accuracy and computational efficiency [17].

Detection Methods. Object detection methods focus on rapidly localizing crack targets through bounding box regression and classification, and can be divided into two-stage and single-stage approaches. Two-stage detectors such as the R-CNN series [18] excel in accuracy, with innovations including illumination-robust Gaussian mixture integration [19] and morphology-enhanced bounding box optimization [20]. Meanwhile, single-stage models (YOLO, SSD [21]) prioritize speed, with recent improvements including deformable SSD for complex cracks [22], lightweight MobileNet variants [23], and attention-augmented YOLOv3 [24]. Hybrid approaches like YOLO-MF [25] further bridge speed and functionality by incorporating flow-based defect counting.

Although CNN-based methods have demonstrated strong capabilities in static image analysis, they still face three major limitations: (1) the restricted receptive field hinders the modeling of long-range structural dependencies, limiting the accurate identification of elongated or discontinuous cracks; (2) the inability to maintain inter-frame consistency complicates temporally coherent structural measurement; (3) detection results often require additional modules for tasks such as object counting, which undermines system integration and operational automation.

2.2 Transformer-Based Crack Detection Method

The Transformer architecture has demonstrated remarkable advantages in recent visual tasks, owing to its global modeling capacity and powerful self attention mechanism. Since the initial application of Vision Transformer (ViT) in image classification, related approaches have rapidly expanded to domains such as semantic segmentation, object detection, and video modeling.

Segmentation Methods. The application of Vision Transformer (ViT) to segmentation tasks primarily revolves around two types of architectures: pure Transformer models (e.g., SETR [26], SegFormer [27]) and hybrid CNN-Transformer models (e.g., Swin-UNet [28]). In the context of crack segmentation, notable examples include Wang et al.'s efficient depthwise separable convolution network, which reaches 88.08% mIoU with only 20% of the training data [29], and Zhou et al.'s SCDeepLab, which combines inverted residuals with Swin Transformer and outperforms CNN baselines [30]. These advances demonstrate Transformers' superiority in accuracy and robustness for crack analysis.

Detection Methods. Transformers have revolutionized crack detection through their self-attention mechanisms, with several key advancements: Swin Transformers [31] enhance noise robustness via window-based attention, Linformer [32] reduces complexity to O(n) using low-rank approximations [33], and Crack-DETR [34] combines high- and low-frequency features for noise-resistant detection. Additional innovations include attention-fused encoder-decoders [34] for improved accuracy and NMS-free contrastive learning [35] for pavement defects. These developments demonstrate Transformers' superiority in handling complex crack detection scenarios while addressing computational challenges.

The Transformer architecture effectively addresses three critical challenges in crack measurement: global context modeling, cross frame identity consistency, and morphological parameter quantification. Recent advances demonstrate its dual capability in both enhancing structural recognition accuracy and enabling automated measurement of defect characteristics (location, width, evolution trends) through superior sequence modeling. While successful in static image analysis, current Transformer applications largely neglect dynamic video requirements, particularly in modeling temporal patterns and quantifiable metrics like width variation trends and target consistency across frames. This highlights a crucial research gap in developing video oriented Transformer architectures for structural measurement tasks.

3  Methodology

This paper proposes an integrated framework for the detection, segmentation, tracking, and counting of pavement crack defects in video sequences. As illustrated in Fig. 1, the system is designed to achieve continuous identification of crack objects across spatial temporal dimensions and automate the quantification of structural parameters. Built upon a Transformer based architecture, the method integrates spatial structure modeling and temporal state maintenance mechanisms to overcome limitations of conventional image level approaches, such as poor cross frame correlation and repeated counting. The system consists of a front end crack perception module and a back end structural measurement module. The former employs a VitSeg-Det network to perform high precision object detection and pixel level mask segmentation, while the latter fuses segmentation results with multi frame features and incorporates a TransTra-Count module to establish object level tracking chains for identity preservation and quantity statistics. Additionally, the system integrates mask skeleton extraction and width estimation algorithms, enabling the extraction of crack geometric parameters without manual annotation. Ultimately, the system can stably output structured measurement results under continuous video input, offering an efficient and reliable vision based solution for intelligent assessment of surface defects in infrastructure.


Figure 1: Framework of the Transformer-based detection, segmentation, tracking and counting algorithm

3.1 Detection and Segmentation Model: VitSeg-Det

VitSeg-Det is our proposed integrated visual modeling architecture for pavement crack detection and segmentation, which aims to achieve high-precision crack region recognition and structural parameter extraction, providing accurate and detailed input for the subsequent tracking and measurement modules. The network takes EfficientNet-b5 as the backbone feature extractor, using its compound scaling of width, depth, and resolution to achieve efficient multi-scale receptive field modeling. On this basis, VitSeg-Det introduces the fine feature refinement module Sampled-ViT and a macro-scale structure modeling module. The former combines channel attention and spatial attention mechanisms to build a scoring network, dynamically samples regions rich in crack morphology information, and guides the Transformer encoder to capture fine-grained crack details through a lightweight embedding mechanism; the latter uses dilated convolution to expand the receptive field and extract the topological continuity and irregular distribution patterns of cracks in global space. These two types of features jointly construct the segmentation and detection branches through abstract fusion, achieving unified modeling of pixel-level crack mask generation and bounding box localization. While remaining lightweight and deployable, the architecture significantly improves the model's ability to identify tiny cracks, branching structures, and defects against complex backgrounds, laying an accurate perceptual foundation for highly robust defect measurement and dynamic structure modeling in video sequences. Its structure is shown in Fig. 2.


Figure 2: The detailed network of the Transformer-based pavement crack segmentation and detection (VitSeg-Det)

3.1.1 Abstract Feature Fusion

After feature extraction, we abstract the feature F into the micro-scale feature $F_{micro}$ and the macro-scale feature $F_{macro}$. For the pavement crack detection task, $F_{micro}$ mainly describes the local details of cracks, such as edge sharpness, microcracks, and surface texture; these features are crucial for distinguishing real cracks from noise (such as stains and shadows). $F_{macro}$ mainly models the continuity, branch structure, and overall trend of cracks, overcoming the fragmentation and false detection problems that local features (such as edges and textures) suffer in complex scenes. $F_{micro}$ and $F_{macro}$ complement each other, and both reduce the computational load of the model. They are fused as follows to generate the final abstract feature $F_{fuse}^t$, as shown in Formula (1), where $W_{g1}$ is a learnable parameter optimized through standard backpropagation.

$F_{fuse}^t = \mathrm{Sigmoid}(W_{g1} F_{micro}^t) \odot F_{macro}^t$  (1)
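For clarity, the following minimal PyTorch sketch illustrates the gated fusion of Formula (1). Realizing $W_{g1}$ as a 1 × 1 convolution and assuming the two features have been brought to the same shape are our own assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of Formula (1): F_fuse = Sigmoid(W_g1 * F_micro) ⊙ F_macro."""
    def __init__(self, channels: int):
        super().__init__()
        # W_g1 realized as a learnable 1x1 convolution (an assumption)
        self.w_g1 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, f_micro: torch.Tensor, f_macro: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.w_g1(f_micro))  # gate values in (0, 1)
        return gate * f_macro                     # element-wise modulation of F_macro
```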

3.1.2 Fine Feature Refinement Module: Sampled-ViT

In VitSeg-Det, the Sampled-ViT module is mainly used to refine features, and $F_{micro}$ is its output embedding. First, the refined feature $F_{fine}^t$ is obtained through a lightweight scoring network; it is then flattened into patch embeddings using a 16 × 16 convolution, combined with learnable position embeddings, and sent to the ViT encoder, which uses six multi-head attention layers.

The lightweight scoring network combines channel attention and spatial attention, where the spatial attention branch uses depthwise separable convolution to further reduce the parameter count. In the implementation, channel attention generates channel weights through global average pooling and a fully connected layer, while spatial attention computes a spatial importance score through a depthwise separable convolution (a channel-wise 3 × 3 convolution followed by a 1 × 1 convolution for dimension reduction); the two are finally multiplied to obtain the fused feature score. While maintaining sensitivity to crack regions, this design significantly reduces computational complexity and is well suited to edge-device deployment. The information score S at each spatial feature position is:

$S = \mathrm{LightScoreNet}(F) \in \mathbb{R}^{B \times 1 \times H \times W}$  (2)

$S = \mathrm{Sigmoid}(\mathrm{MLP}(\mathrm{GAP}(F))) \odot \mathrm{Sigmoid}(\mathrm{DepthwiseSepConv}(F))$  (3)

$S(F) = (s_1, s_2, \ldots, s_{ij})$  (4)

$F_{micro} = \mathrm{Flatten}(F_{fine}^t) + PE$  (5)

where PE denotes the position embeddings, and $F_{fine}^t$ is the set of Top-N features selected after sorting by the scores produced by the scoring network S.

Refined feature $F_{fine}^t$. As shown in Formula (4), S is flattened and sorted to obtain a vector set S(F) of length H·W. In information-theoretic terms, $s_{ij}$ can be considered approximately positively correlated with the information entropy at that feature location, so high-S regions carry more task-relevant information (such as crack shape and direction). We dynamically select the Top-N features for sampling and use the predicted information score S as a modulation factor for the fine feature set. Note that N ≤ H·W and N varies with image content: N = ⌊ω·H·W⌋, with the Top-N feature locations selected according to the dynamic ω, whose value is examined in the later ablation experiments:

$\omega = \mathrm{DynamicOmega}(S) \in [\omega_{min}, \omega_{max}]$  (6)
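The following minimal PyTorch sketch illustrates the scoring-and-sampling pipeline of Formulas (2), (3), and (6). The module names, layer widths, and the reduction of the channel weights to a single spatial score map are our own illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LightScoreNet(nn.Module):
    """Lightweight scoring network: channel attention (GAP + MLP) fused with
    spatial attention (depthwise separable conv), after Formulas (2)-(3)."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # GAP
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise 3x3
            nn.Conv2d(channels, 1, 1),               # pointwise reduction to one channel
            nn.Sigmoid(),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        ch = self.channel_mlp(f)                     # (B, C, 1, 1) channel weights
        sp = self.spatial(f)                         # (B, 1, H, W) spatial weights
        # collapsing channel weights to one scalar gate per image is an assumption
        return ch.mean(dim=1, keepdim=True) * sp     # (B, 1, H, W) information score S

def topn_sample(f: torch.Tensor, s: torch.Tensor, omega: float) -> torch.Tensor:
    """Select the Top-N = floor(omega * H * W) highest-scoring positions (Formula 6)."""
    b, c, h, w = f.shape
    n = max(1, int(omega * h * w))
    scores = s.flatten(2).squeeze(1)                 # (B, H*W)
    idx = scores.topk(n, dim=1).indices              # indices of high-entropy positions
    feats = f.flatten(2).transpose(1, 2)             # (B, H*W, C)
    fine = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, c))
    fine = fine * torch.gather(scores, 1, idx).unsqueeze(-1)  # score-modulated features
    return fine                                      # (B, N, C) tokens for the ViT encoder
```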

Macro-scale feature $F_{macro}$. Given the long-range and irregular distribution of pavement cracks at the macro scale, we add a macro-scale feature extraction module that expands the receptive field through a 7 × 7 dilated convolution (dilation rate = 2), capturing the global topological characteristics of cracks (such as continuity and branch structure) without sacrificing resolution, and outputs the macro-scale feature map $F_{macro}$. Compared with stacking multiple small convolution kernels, a single large dilated kernel effectively reduces the parameter count while maintaining the same receptive field. This module and the fine feature module form a hierarchical complement.
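A minimal sketch of such a macro-scale branch follows; the choice of channel width and activation is ours.

```python
import torch
import torch.nn as nn

class MacroScaleModule(nn.Module):
    """Macro-scale branch: a single 7x7 dilated convolution (dilation = 2) enlarges
    the receptive field without reducing spatial resolution."""
    def __init__(self, channels: int):
        super().__init__()
        # effective kernel extent: 1 + (7 - 1) * 2 = 13, so padding = 6 keeps the size
        self.dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                 padding=6, dilation=2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.act(self.dilated(f))  # F_macro, same spatial size as the input
```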

As shown in Fig. 2, the fused feature $F_{fuse}^t$ obtained from $F_{micro}$ and $F_{macro}$ is split into two branches: after the final convolution and upsampling operations, one branch generates a binary segmentation map, while the other is sent to the Transformer decoder.

The detection network is also shown in Fig. 2. After passing through the 6-layer encoder, the features are fed into the decoder together with the positional encoding and the learned queries. The output of each decoder layer predicts the type and location of defects, similar to a feature pyramid network.

3.2 Tracking and Counting Model: TransTra-Count

Detecting defects with traditional CNN-based networks is a mature technology, but for defect counting these detectors cannot determine whether detections in different frames correspond to the same defect, which leads to repeated counting. We therefore developed a Transformer-based defect tracking and counting method, TransTra-Count.

The tracking network is shown in Fig. 3. In this paper, we propose a pavement crack tracking method, CrackDSF-LMe, built on DETR. Our core contribution is a data association model that combines the appearance feature similarity and spatial similarity of cracks according to their characteristics. It can handle the small displacement between consecutive frames, adapt to the feature fluctuations caused by illumination changes, and cope with occlusion and blur. In the long-term memory update module, we introduce the illumination change as a control signal to suppress shadow interference, and propose an adaptive aggregation algorithm that fuses the outputs of two adjacent frames to alleviate occlusion. At the same time, the crack width is monitored in an unsupervised manner from the segmentation mask, and a width loss is added to the loss function to constrain the width change rate, avoiding the unreasonable situation where "the same crack shows a sudden width change between adjacent frames". The trajectory management module automatically initializes and terminates trajectories according to matching confidence and historical activity, ensuring that the life cycle of each crack ID is consistent with physical reality. Through the closed-loop process of feature enhancement → detection → association → long-term memory update → track management, the system achieves fully structured automatic measurement, outputting the number, location, size, and damage level of crack objects, as shown in Fig. 4, providing a spatio-temporally continuous quantitative basis for pavement disease diagnosis.


Figure 3: The detailed network of transformer-based spatial-feature dual-modal joint enhancement long-term memory method for pavement crack tracking (CrackDSF-LMe)


Figure 4: The closed-loop flowchart of CrackDSF-LMe

We feed the results of VitSeg-Det into the tracking model because pavement cracks in a video may look different due to changes in lighting, rain occlusion, or shooting angle, so the same crack can appear as different objects in different frames. We therefore use VitSeg-Det to obtain multi-scale crack features, together with the segmentation mask, to improve the robustness of the network.

3.2.1 Spatial-Feature Dual-Modal Data Association: DSFM

In the data association phase, the algorithm combines a spatial IoU measure with attention-based feature similarity, dynamically balancing their contributions through a learnable weight parameter. The matching score matrix is given in Eq. (7):

$C_t[i,k] = \lambda_1\,\mathrm{IoU}(B_t, \hat{B}_{t-1}) + (1-\lambda_1)\,\mathrm{DecoderAttention}(Q, K) \in \mathbb{R}^{N \times K}$  (7)

$\mathrm{DecoderAttention}(Q, K) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$  (8)

$C_t[i,k]$ is the matching score of the i-th detection and the k-th track; the matrix stores the scores of all detection-track pairs, and a larger value indicates a higher matching probability. $\lambda_1$ is the spatial weight coefficient (0 ≤ $\lambda_1$ ≤ 1), controlling the relative importance of location and feature information. $B_t$ is the coordinate matrix of the detection boxes in the current frame, and $\hat{B}_{t-1}$ is the coordinate matrix of the boxes predicted from the previous frame's trajectories; the current-frame positions are predicted from the historical trajectory motion model. Q and K are introduced below; $d_k$ = 512 is the dimension of K.

The association decision $a_t^{i,k}$ determines whether a detection box should update an existing track or create a new one, as shown in Formula (9):

$a_t^{i,k} = \begin{cases} 1, & \text{if } k = \arg\max_j C_t[i,j] \text{ and } C_t[i,j] > \theta_{match} \\ 0, & \text{otherwise} \end{cases}$  (9)

where $\arg\max_j C_t[i,j]$ selects the track j with the largest $C_t[i,j]$, i.e., the most likely historical track for the current detection; $\theta_{match}$ is the matching threshold, generally between 0.5 and 0.7 (0.6 in this paper), used to filter out low-quality matches and avoid incorrect associations.
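A minimal sketch of this dual-modal association step, assuming the IoU and attention similarity matrices have already been computed and normalized to [0, 1]:

```python
import torch

def associate(iou: torch.Tensor, attn: torch.Tensor,
              lam1: float = 0.6, theta_match: float = 0.6):
    """Dual-modal association (Eqs. 7-9): blend spatial IoU with attention-based
    feature similarity, then assign each detection to its best-scoring track.
    `iou` and `attn` are (N, K) matrices for N detections and K tracks."""
    c = lam1 * iou + (1.0 - lam1) * attn   # matching score matrix C_t
    best_score, best_track = c.max(dim=1)  # argmax over tracks (Eq. 9)
    assigned = best_score > theta_match    # True: update track; False: candidate new track
    return best_track, assigned
```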

When $a_t^{i,k} = 1$, the system considers detection i and track k to be the same continuing crack, and detection i updates the state of track k (e.g., its position, width, and memory features); when $a_t^{i,k} = 0$, the system treats the detection as a new crack or a false detection, and detection i may initialize a new track (if the birth condition is met).

If the matching scores $C_t[i,k]$ between a detected object $f_{fuse\_t}^i$ (the feature vector of the i-th detection in the current frame) and all existing tracks are below $\theta_{new}$ (the new-track threshold, $\theta_{new}$ = 0.5 in this paper), the detection is regarded as a new target and its feature is added to the track feature memory set $\dot{M}_t$ of the current frame. If the proportion of unmatched frames of a track $M_t^k$ within the latest T frames exceeds $\theta_{term}$ (the track termination threshold, $\theta_{term}$ = 0.85 in this paper), the target is considered to have left the scene and the track is removed from $\dot{M}_t$. T is the size of the time window used to judge termination, T = 20 in this paper. The track update rule is:

$\dot{M}_t = \begin{cases} \dot{M}_t \cup \{f_{fuse\_t}^i\}, & \text{if } \max_k C_t[i,k] < \theta_{new} \\ \dot{M}_t \setminus \{M_t^k\}, & \text{if } \frac{1}{T}\sum_{s=t-T}^{t} g_s^k > \theta_{term} \end{cases}$  (10)

where $g_s^k$ = 1 if track k is unmatched in frame s and 0 otherwise.
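The birth/termination logic can be sketched as follows; the data structures and helper names are illustrative, not from the paper.

```python
def update_tracks(memory: dict, det_feats: list, best_scores: list,
                  unmatched_history: dict, theta_new: float = 0.5,
                  theta_term: float = 0.85, T: int = 20) -> dict:
    """Track birth and termination after Eq. (10). `memory` maps track id -> feature;
    `unmatched_history` maps track id -> 0/1 flags (1 = unmatched) per recent frame."""
    next_id = max(memory, default=-1) + 1
    # birth: a detection whose best score is below theta_new starts a new track
    for feat, score in zip(det_feats, best_scores):
        if score < theta_new:
            memory[next_id] = feat
            unmatched_history[next_id] = []
            next_id += 1
    # termination: tracks unmatched for more than theta_term of the last T frames
    for tid in list(memory):
        flags = unmatched_history.get(tid, [])[-T:]
        if flags and sum(flags) / T > theta_term:
            del memory[tid], unmatched_history[tid]
    return memory
```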

3.2.2 Adaptive Appearance Aggregation-Long-Term Memory Update Trade Off Model: ULMeM

Different from most existing methods, the core contribution of CrackDSF-LMe is to establish a long-term memory that maintains the long-term temporal characteristics of each crack and effectively injects temporal information into the subsequent tracking process, so that extremely long cracks are not identified repeatedly.

Generally, in a video stream an object changes and moves very little between consecutive frames, so we designed a Long-term Memory Retention-Update Trade-off Module (ULMeM), as shown in Fig. 5. Initially, we send the encoder output $\hat{M}_t^k$ into the decoder and then update it according to Formula (11), where the decoder output is $\hat{O}_t^k$; the value of $\lambda_2$ is discussed in the experiments.

$M_{t+1}^k = (1-\lambda_2)\,\hat{M}_t^k + \lambda_2\,\hat{O}_t^k$  (11)

Problems such as blur or occlusion often occur in video streams. An intuitive remedy is to use multi-frame features to enhance the single-frame representation, so we use an adaptive aggregation algorithm in ULMeM to fuse the outputs of two adjacent frames. Because of occlusion and blur, the output embedding of the current frame may be unreliable. Therefore, as shown in Fig. 5, we generate a channel weight $W_{g2}$ for each tracked instance to determine the proportion of track k's history retained in the memory of the current frame:

$\hat{O}_t^k = W_{g2}\,O_{t-1}^k + (1 - W_{g2})\,O_t^k$  (12)

$W_{g2} = \mathrm{Sigmoid}(\mathrm{MLP}([\hat{O}_{t-1}^k \,\|\, \Delta_t \,\|\, O_t^k]))$  (13)

$\Delta_t = \|\mathrm{hist}(I_t) - \mathrm{hist}(I_{t-1})\|_2$  (14)

where $\hat{O}_t^k$ is the updated decoder output embedding, which serves as Q, while $O_t^k$ serves as V and $\hat{M}_t^k$ as K. $\Delta_t$ represents the inter-frame illumination change, computed from gray-level histogram moments, and $I_t$ is the image of the t-th frame. $[\cdot\|\cdot\|\cdot]$ denotes vector concatenation: concatenating $\hat{O}_{t-1}^k$ preserves the continuity of the crack ID, concatenating $O_t^k$ captures the latest state of the crack, and concatenating $\Delta_t$ resists interference from sudden brightness changes. When $\Delta_t$ becomes large, illumination is changing sharply, so $W_{g2}$ tends to 1 and historical memory is preferred. When $\|O_t^k - O_{t-1}^k\|$ increases, the current detection differs significantly from the historical track; the object may be deformed, occluded, or changing state (such as crack widening), so $W_{g2}$ tends to 0, the memory is updated, and the module adapts to the object change.
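A minimal PyTorch sketch of this gating, under the assumptions that embeddings are 512-dimensional, frames are grayscale tensors normalized to [0, 1], and the MLP width is our choice:

```python
import torch
import torch.nn as nn

class ULMeM(nn.Module):
    """Adaptive appearance aggregation (Eqs. 12-14): a gate W_g2 decides how much
    of the previous embedding to retain for each tracked instance."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # input: [prev updated embedding || illumination change || current embedding]
        self.mlp = nn.Sequential(nn.Linear(2 * dim + 1, dim),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(dim, 1))

    @staticmethod
    def illum_change(img_t: torch.Tensor, img_prev: torch.Tensor,
                     bins: int = 64) -> torch.Tensor:
        """Eq. (14): L2 distance between gray-level histograms of adjacent frames."""
        h_t = torch.histc(img_t, bins=bins, min=0.0, max=1.0)
        h_p = torch.histc(img_prev, bins=bins, min=0.0, max=1.0)
        return torch.norm(h_t - h_p, p=2)

    def forward(self, o_prev_upd, o_prev, o_cur, delta_t):
        d = delta_t.view(1, 1).expand(o_cur.shape[0], 1)   # broadcast per instance
        w_g2 = torch.sigmoid(self.mlp(torch.cat([o_prev_upd, d, o_cur], dim=-1)))
        # Eq. (12): large w_g2 keeps history (robust to flicker); small w_g2 adapts
        return w_g2 * o_prev + (1.0 - w_g2) * o_cur
```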


Figure 5: The structure diagram of Memory Retention-Update Trade-off Module (ULMeM)

3.2.3 Loss Function

Finally, we design a combined loss of detection, tracking, and width terms. Training is dominated by the detection loss in the initial stage to localize cracks quickly, and by the tracking and width losses in the later stage for fine tracking. Before defining the loss, we first design an unsupervised crack width calculation method that interacts with the width loss.

According to the standard for rating the technical condition of highways, the damage degree of a crack is judged by its average width, so we use an unsupervised method to compute the average crack width, as summarized in Algorithm 1. First, the input binary mask is skeletonized to extract the central skeleton of the object. Then the Euclidean distance transform of the mask is computed (the distance from each foreground pixel to the nearest background pixel); the distance value at each skeleton pixel then equals the maximum inscribed circle radius at that position, i.e., the half-width. Finally, the average of twice the distance over all skeleton pixels (i.e., the full width) is taken as the average width of the object.
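This procedure maps directly onto standard image-processing primitives; a minimal sketch using scikit-image and SciPy (the library choices are ours):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def average_crack_width(mask: np.ndarray) -> float:
    """Unsupervised average-width estimate from a binary crack mask, following
    the skeleton + distance-transform procedure of Algorithm 1."""
    fg = mask.astype(bool)
    skeleton = skeletonize(fg)           # one-pixel-wide centerline
    dist = distance_transform_edt(fg)    # distance to the nearest background pixel
    radii = dist[skeleton]               # half-widths sampled along the skeleton
    return float(2.0 * radii.mean()) if radii.size else 0.0
```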

To make the width curve smoother and suppress measurement noise, the width is updated incrementally, as shown in the following formulas:

$w_t^k = \gamma_t\,w_{t-1}^k + (1-\gamma_t)\,w_t^i$  (15)

$\gamma_t = \gamma_{base}\cdot\exp(-\lambda_3\,|w_t^i - w_{t-1}^k|)$  (16)

where $w_t^k$ is the smoothed width of the current frame, $w_{t-1}^k$ is the smoothed width of the previous frame, $w_t^i$ is the measured width of the current detection, and $\gamma_t$ is the adaptive memory decay coefficient. Adaptive decay lets the filter cope with dynamically changing scenes: when the width changes sharply (such as rapid object deformation), an overly large $\gamma_t$ would cause lag, while under suddenly increased measurement noise (such as sensor anomalies) an overly small $\gamma_t$ would admit noise. In this paper, $\gamma_{base}$ (base decay coefficient) = 0.9 and $\lambda_3$ (adjustment sensitivity) = 0.1. Hence, when the width changes dramatically, i.e., when $L_{width}$ (the width smoothing loss) is large, $\gamma_t$ adaptively decreases to respond to new data faster.
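A one-function sketch of this adaptive smoothing (Eqs. 15-16):

```python
import math

def smooth_width(w_prev: float, w_meas: float,
                 gamma_base: float = 0.9, lam3: float = 0.1) -> float:
    """Adaptive exponential smoothing of crack width: the decay coefficient
    shrinks when the measurement jumps, so the filter reacts faster."""
    gamma_t = gamma_base * math.exp(-lam3 * abs(w_meas - w_prev))  # Eq. (16)
    return gamma_t * w_prev + (1.0 - gamma_t) * w_meas             # Eq. (15)
```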

The loss function is given in Formula (17). $L_{det}$ is the detection branch loss, implemented with Focal Loss to optimize crack detection accuracy; $L_{track}$ is the track consistency loss that ensures the spatio-temporal continuity of crack IDs, with $\delta_1$ the track consistency weight balancing detection and tracking; $L_{width}$ suppresses jumps in the width estimate, with $\delta_2$ the width smoothing weight.

$L = L_{det} + \delta_1 L_{track} + \delta_2 L_{width}$  (17)

Algorithm 1: Unsupervised average crack width estimation

In this paper, δ1 = 0.8, δ2 = 0.4, and the calculation method of Ltrack is shown in Formula (18).

$L_{track} = -\frac{1}{|P|}\sum_{(i,k)\in P}\log\left(\frac{\exp(f_{fuse\_t}^i \cdot M_t^k/\tau)}{\sum_q \exp(f_{fuse\_t}^i \cdot M_t^q/\tau)}\right)$  (18)

where P is the set of successfully associated detection-track pairs and |P| is the number of valid matches in the current frame; τ is a temperature coefficient less than 1, which amplifies the similarity score of the current matching pair (detection i and track k) and controls the sharpness of the probability distribution; $f_{fuse\_t}^i \cdot M_t^k$ represents the matching degree between the current detection and the track. For high-quality video, a smaller τ such as 0.05 can be selected to distinguish similar cracks more strictly; for low-quality video streams, a larger τ such as 0.5 allows more relaxed matching. In this paper, τ = 0.1. The term $\sum_q \exp(f_{fuse\_t}^i \cdot M_t^q/\tau)$ is the competitive weight over all candidate tracks, preventing neighboring crack tracks from interfering with the current match. In conclusion, $L_{track}$ promotes the learning of discriminative features, so that the feature similarity of the same crack across frames is higher than that between different cracks.

$L_{width}$ is the key module for ensuring the continuity of crack size measurement and directly determines the reliability of width-trend analysis. It comprises two parts, an absolute error term and a width-change consistency term, computed as shown in Formula (19). The first term constrains the temporal smoothness of the width estimate, where $\|\cdot\|_2^2$ is the squared error used to amplify significant differences; the second term enforces consistency of the width change rate and prevents sudden width mutations. $w_{t-1}^i$ is the width of detection i in the previous frame, $\Delta t$ is the inter-frame interval in seconds ($\Delta t$ = 1/30 s in this paper), and $\gamma_2$ is the weight coefficient of the width change rate term, $\gamma_2$ = 0.4.

$L_{width} = \frac{1}{|P|}\sum_{(i,k)\in P}\left\|w_t^i - w_t^k\right\|_2^2 + \gamma_2\sum_{(i,k)\in P}\left\|\frac{w_t^i - w_{t-1}^i}{\Delta t} - \frac{w_t^k - w_{t-1}^k}{\Delta t}\right\|_2^2$  (19)
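For concreteness, the track consistency term of Eq. (18) is an InfoNCE-style contrastive loss and can be written compactly in PyTorch; the tensor shapes are our assumptions:

```python
import torch
import torch.nn.functional as F

def track_consistency_loss(det_feats: torch.Tensor, track_mem: torch.Tensor,
                           matched_track: torch.Tensor, tau: float = 0.1):
    """Eq. (18): det_feats is (N, D) for N matched detections, track_mem is (K, D)
    track memory features, matched_track is (N,) long with each detection's track index."""
    logits = det_feats @ track_mem.t() / tau       # similarity to every track
    return F.cross_entropy(logits, matched_track)  # -log softmax at the true track
```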

3.2.4 Count Evaluation Module: CoM

In the TransTra-Count system, the counting module plays a critical role in the statistical analysis of crack targets and the evaluation of damage levels. Its core function is to achieve real-time, unique, and non-repetitive counting of crack objects in video streams while outputting structured measurement results. Deeply integrated with the detection, segmentation, and tracking branches, the module forms a closed-loop automated measurement framework tailored for surface defect quantification. The overall workflow is illustrated in Fig. 6.


Figure 6: The closed-loop flowchart of TransTra-Count

First, the system determines whether structural crack targets exist in the current frame based on results from the front-end detection and segmentation subnetworks. If no valid defect region is detected, the system skips to the next frame to avoid unnecessary computation, forming a sparse, efficient, and low-redundancy frame-level processing pipeline. Once a potential crack region is identified, the counting module immediately invokes a multi-object tracking mechanism to perform real-time matching between the current detection and historical trajectories. This matching process relies on multimodal similarity measures incorporating spatial location, segmentation mask, and appearance features. A matching control strategy is then applied to determine the temporal continuity of the target: if it is a new crack instance, the system assigns a unique ID and updates the global defect counter; if it corresponds to an existing target tracked across frames, the original ID and counting state are retained to ensure temporal consistency and avoid duplicate counting.
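The ID-assignment rule that prevents duplicate counting can be sketched as follows; `match_fn` stands in for the TransTra-Count association step, and the data structures are illustrative:

```python
def update_count(tracks: dict, detections: list, match_fn, total_count: int) -> int:
    """Frame-level counting: a detection matched to an existing track keeps its ID;
    an unmatched detection receives a new ID and increments the global counter."""
    for det in detections:
        tid = match_fn(det, tracks)            # returns None if no track matches
        if tid is None:                        # new crack instance
            tid = max(tracks, default=-1) + 1
            total_count += 1                   # counted exactly once
        tracks[tid] = det                      # create or refresh the track state
    return total_count
```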

Ultimately, the counting module outputs structured statistical results, including crack quantities and identifiers, providing quantitative, traceable, and engineering-ready decision support for pavement condition management, maintenance prioritization, and lifecycle assessment. By establishing a closed-loop feedback mechanism integrating detection, tracking, and measurement, the module not only enhances counting accuracy and system responsiveness but also embodies an industrial intelligent inspection philosophy oriented toward structured measurement. The system demonstrates strong generalization capability and deployment feasibility in practical applications, particularly in use cases such as road crack maintenance, bridge structural diagnosis, and condition assessment of transportation infrastructure.

4  Dataset Construction

Our work relies on three public datasets, DeepCrack [36] (a general crack segmentation dataset), UVA-PDD2023 [37] (a pavement disease detection dataset), and UAPD [38], together with the independently collected RoadDefect-MT dataset, forming a multi-source, multi-scene pavement disease database. The specific composition is as follows:

4.1 Public Dataset

DeepCrack: provides high-resolution (up to 2592 × 1944) fine crack annotation, covering various scenes such as walls and floors, and is used for the generalized learning of slender crack characteristics by the model.

UVA-PDD2023: including the annotation data of common pavement diseases such as cracks, with low resolution (640 × 480), which is suitable for the training of small target detection ability of the model.

UAPD: comprises 3151 images with an original resolution of 7952 × 5304 pixels. It includes six types of road defects with diverse sizes and morphological characteristics: longitudinal cracks (LC), transverse cracks (TC), alligator cracks (AC), oblique cracks (OC), repair marks, and potholes.

4.2 Self-Built RoadDefect-MT Dataset

A DJI Mini 3 Pro UAV (equipped with a 4K/60 fps camera) was flown at a height of 2–5 m to ensure that pixel-level diseases remain visible. 4K video was captured over asphalt roads in Jiangning District of Nanjing and on the campus of Nanjing University of Aeronautics and Astronautics. Through systematic inspection, the capture covers different lighting conditions (sunny and cloudy) and shooting angles (overhead and oblique) to ensure data diversity.

The constructed RoadDefect-MT (Measurement & Tracking) dataset consists of 33 video clips (each lasting 2–18 s, with a resolution of 3840 × 2160 and a frame rate of 30 fps), totaling 3390 frames, each of which has been meticulously annotated. The dataset covers four typical types of road defects: transverse cracks, longitudinal cracks, mesh cracks, and crack patches. The distribution of these defects is not uniform, with crack patches being the most prevalent. The remaining three categories are roughly evenly distributed, with an overall ratio of approximately 1 (transverse): 1 (longitudinal): 1 (mesh): 2 (patches). This distribution reflects the real-world scenario where repaired areas are more commonly encountered in road networks. In order to adapt to the dynamic detection task, the annotation follows the MOT (Multi Object Tracking) format, and the annotation objects include defect categories, bounding boxes, and motion tracks. The video capture mode covers five UAV motion states: forward shooting, in situ rotation, forward rotation, and horizontal/vertical screen switching to better meet the needs of different scenes. Fig. 7 is an example of dataset annotation.


Figure 7: Some samples of RoadDefect-MT dataset

The RoadDefect-MT dataset offers high resolution, providing pixel-level defect details and supporting accurate localization of small targets (such as fine cracks); its multi-mode UAV motion simulates real patrol scenes, reducing the model's dependence on a fixed shooting angle.

5  Experiment

In this study, high-performance computing equipment was used to train the deep learning model. The hardware configuration was a computer equipped with a 12th-generation Intel® Core™ i9-12900K CPU, an NVIDIA GeForce RTX 3090 GPU, and 64 GB of memory. The software environment was based on Python 3.8, with the PyTorch framework used for model development and the CUDA 11.1 acceleration library used to optimize GPU computing performance.

A phased training strategy was adopted in this system to ensure stable and efficient learning. The entire process consists of two core stages:

In the first stage, the focus is on training the VitSeg-Det integrated detection and segmentation network. This network employs an end-to-end joint training approach, with a backbone shared by both detection and segmentation tasks. During training, the model simultaneously receives sample data annotated with bounding boxes and segmentation masks. In the forward pass, the network computes both detection loss and segmentation loss in parallel. These two losses are combined into a total loss function via weighted summation, and then the weights of the shared backbone and the two task-specific heads are updated synchronously through backpropagation. This design enables the detection and segmentation tasks to optimize collaboratively and mutually reinforce each other, allowing the shared backbone to learn more general and robust feature representations.

In the second stage, after the VitSeg-Det network is fully trained, all its weights are frozen, and the TransTra-Count tracking and counting network is trained on this basis. Specifically, consecutive video frames are fed into the frozen VitSeg-Det network to obtain precise bounding boxes, category confidence scores, and corresponding deep appearance features for all cracks in each frame. This information (boxes plus features), along with the tracking trajectories from the previous frame, forms training sample pairs that are input into the TransTra-Count network. The core task of this network is to learn data association, i.e., determining which target in the current frame corresponds to which track in the previous frame for the same crack entity, and to develop, through training, the ability to maintain consistent target identity IDs in complex scenarios.

For model performance evaluation, this research constructed a multi-dimensional evaluation system to comprehensively measure performance on the crack segmentation, detection, and tracking tasks. For crack segmentation, we use mean Intersection over Union (mIoU) as the core index, evaluating pixel-level localization accuracy from the overlap between predicted and ground-truth regions. For crack detection, we take mean Average Precision (mAP) as the main indicator, combined with detection performance under different IoU thresholds (such as AP50) and supplemented by Precision, Recall, F1 score, floating point operations (FLOPs), and frames per second (FPS), evaluating both detection accuracy and completeness. For the more challenging crack tracking task, we introduce higher-order indicators: HOTA (Higher Order Tracking Accuracy) comprehensively measures the overall performance of detection and association, DetA (Detection Accuracy) and AssA (Association Accuracy) quantify pure detection accuracy and pure association accuracy respectively, and the IDF1 score evaluates identity preservation. In addition, to analyze tracking quality comprehensively, we also introduce the area under the curve (AUC) to evaluate tracking stability, use the P-norm to compute the trajectory prediction error, and combine it with the accuracy indices to verify the reliability of the tracking results.

5.1 Sampled-ViT Ablation Experiment

In Section 3.1.2, we designed a lightweight scoring network to obtain $F_{fine}^t$. When extracting $F_{fine}^t$, ω adapts dynamically, controlling the number N of sampled feature vectors in S(F). The generation of the dynamic ω depends directly on the statistical characteristics of the scoring matrix S, as follows:

$\omega = f_\theta(S) \in [\omega_{min}, \omega_{max}]$  (20)

where $f_\theta$ is the dynamic ω generating function, which can be a statistical calculation or a lightweight network (such as an MLP). Let $\omega_1$ denote the statistics-based method and $\omega_2$ the MLP-based method:

$\omega_1 = \sigma(\alpha\,\mu(S) + \beta\,\mathrm{std}(S))\cdot(\omega_{max} - \omega_{min}) + \omega_{min}$  (21)

$\omega_2 = \mathrm{MLP}(\mathrm{GAP}(S))$  (22)

where, α and β are learnable parameters, μ(S) is the mean score, and std(S) is the standard deviation of the score.
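A minimal sketch of the statistics-based variant (Eq. 21); the [0.3, 0.7] range mirrors the fixed-ω values discussed below and is otherwise our assumption:

```python
import torch

def stat_omega(s: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor,
               omega_min: float = 0.3, omega_max: float = 0.7) -> torch.Tensor:
    """Statistics-based dynamic omega: squash a learnable combination of the
    score mean and standard deviation into [omega_min, omega_max]."""
    stat = alpha * s.mean() + beta * s.std()     # alpha, beta are learnable scalars
    return torch.sigmoid(stat) * (omega_max - omega_min) + omega_min
```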

Table 1 verifies the performance of the two dynamic ω generation methods on the pavement crack segmentation task. As the table shows, Stat-ω is slightly better than MLP-ω (+0.59% mIoU), because the statistics are more stable and avoid the MLP's tendency to overfit noise. The FPS of Stat-ω is close to that of fixed ω, because it requires only simple tensor operations without additional trainable parameters, whereas MLP-ω requires extra forward computation and parameter updates, increasing latency and memory occupation and reducing FPS by 23%. The ω range of MLP-ω is larger, but some extreme values (such as ω < 0.3) lead to missed detections.


In this study, the statistics-based dynamic ω generating function is used to achieve optimal performance. When deploying on edge devices, a fixed ω value can be selected according to scene characteristics to balance computational efficiency and accuracy: ω = 0.5 is recommended for conventional scenarios; ω = 0.3 can be selected for sparse-crack scenes (such as expressways) to improve efficiency; and ω = 0.7 is recommended for densely cracked areas (such as old pavement) to enhance robustness. The server retains the dynamic ω mechanism to ensure the highest detection accuracy. This hierarchical strategy achieves an adaptive balance between computing resources and detection accuracy.

As shown in Table 2, the statistical approach demonstrates greater stability and effectively prevents the MLP from overfitting to noise. Specifically, different levels of Gaussian noise (with noise variances σ2 = 0.01, 0.02, and 0.03) were added to the test set images, and the performance of three methods Fixed-ω(0.5), Stat-ω, and MLP-ω was evaluated accordingly.


The results in Table 2, under different noise levels, show that the decrease in mIoU (ΔmIoU) of Stat-ω is consistently smaller than that of MLP-ω. For example, at σ2 = 0.03, the performance of Stat-ω declines by 5.86%, while MLP-ω exhibits a decrease of nearly 10%. This quantitatively demonstrates that MLP-ω, due to its learnable parameters, is more prone to learning and amplifying noise present in the training data (i.e., overfitting), leading to a sharp performance degradation on noisy inputs. In contrast, Stat-ω relies on simple statistical measures and lacks learning capacity, making it insensitive to noise and thus exhibiting stronger generalization robustness.

The results in Table 3 show that μ(S) and std(S) have a significant synergistic effect in the Stat-ω method. When only μ(S) is used, mIoU is 43.1%, indicating that relying solely on global statistics leads to insufficient perception of local saliency; when only std(S) is used, mIoU is 42.8%, reflecting that focusing purely on local variation reduces sensitivity to the overall structure. When μ(S) and std(S) are used together, mIoU rises to 44.12%, demonstrating that their combination effectively balances the perception of global features and local details.


To systematically evaluate the contribution of key modules in the proposed VitSeg-Det framework to model performance, a series of ablation studies were conducted on the UAPD and UVA-PDD2023 datasets. The experiments were designed to incrementally incorporate each submodule for analysis, with results summarized in Table 4. All experiments consistently employed EfficientNet-b5 as the backbone network and maintained identical training parameter settings to ensure fairness and reliability in the comparisons. In the experimental design, the fusion of fine-grained and macro scale features was implemented using the joint μ(S) + std(S) method from the statistical approach.


The experimental results demonstrate that both the fine grained feature module and the macro scale feature module in the VitSeg-Det model contribute significantly to performance improvement. When used in combination, the model achieves its best overall performance, with the segmentation mIoU increasing to 47.0 and the detection mAP reaching 48.0 on the UAPD dataset. The fine grained feature module notably enhances the detection capability for small targets (reflected by a significant improvement in APs), while the macro scale feature module more substantially improves segmentation accuracy and detection performance for regular sized objects. The two modules exhibit complementary characteristics, and their synergistic effect leads to optimal results across all evaluation metrics. This consistent trend is observed on both the UAPD and UVA-PDD2023 datasets.

5.2 CrackDSF-LMe Ablation Experiment

In this section, we study several components of the model, including spatial-feature dual-modal data association, adaptive appearance aggregation, illumination gating, and long-term memory. Our main contribution is accurately identifying cracks in the video stream while avoiding repeated counting.

In this paper, we propose the data association method DSFM, which controls the fusion ratio between the IoU measure and attention-based feature similarity through the adjustable parameter $\lambda_1$. To verify its effectiveness, we designed four groups of comparative experiments, with results shown in Table 5. Type 1 ($\lambda_1$ = 1), the baseline, uses only the IoU matching strategy. Although its computation is the fastest (average processing time 28 ms/frame), making it suitable for high-frame-rate scenes, its AssA is only 58.7% due to the lack of feature information, with frequent ID-switching problems. Type 2 ($\lambda_1$ = 0) depends entirely on feature similarity matching and performs well against occlusion, with AssA significantly improved to 70.2%, but DetA decreases by 8.0% due to false associations in some blurred crack regions. Type 3 adopts a hybrid strategy ($\lambda_1$ = 0.6): while remaining real-time (45 ms/frame), it achieves an absolute increase of 15.9% in HOTA and 18.3% in IDF1, significantly improving identity preservation. Although type 4 ($\lambda_1$ = 0.8) raises DetA slightly by 3.1%, its overall performance is inferior to type 3, indicating that a moderate feature fusion ratio (about 0.6) achieves the best balance between detection accuracy and association accuracy. The experimental results show that DSFM effectively overcomes the limitations of a single association strategy by dynamically adjusting the fusion ratio.


In Section 3.2.2, we designed the ULMeM module, which dynamically fuses object features from adjacent frames and incorporates inter-frame lighting changes to avoid the impact of sharp illumination variation. We decompose this structure in Table 6. The first row uses only single-frame features as the comparison baseline, the second group uses fixed weights instead of adaptive weights, the third group removes illumination compensation, and the last group is the complete model. As shown in Table 6, compared with a fixed $W_{g2}$ = 0.5, the adaptive weight increases HOTA by about 12.3% (65.7 → 73.8), indicating that dynamically adjusting the memory retention ratio is necessary for occluded or blurred scenes; IDF1 increases significantly (71.8 → 78.7), verifying that adjusting the memory weight in response to appearance differences markedly reduces ID-switch errors. Compared with the fourth group, removing illumination compensation leads to a 2.9% decrease in DetA (71.4 → 68.5), indicating an increased false detection rate under sudden illumination changes, so histogram moment matching effectively improves tracking stability. In the fourth group, FPS drops from 36 to 28 relative to the baseline, because multi-frame computation adds overhead, but accuracy improves significantly (HOTA +11.5), which suits high-precision scenarios. All indicators of the complete model are optimal, showing that the joint modeling of illumination, adaptive appearance aggregation, and memory update comprehensively addresses blur, occlusion, and brightness changes.


The long-term memory proposed in Section 3.2.2 uses longer-horizon temporal information and injects it into subsequent track embeddings to enhance object features. We studied its effect on performance. As shown in Table 7, as $\lambda_2$ gradually increases from 0.01 to 0.4, our model achieves the highest HOTA score at $\lambda_2$ = 0.3, while the DetA score decreases slightly. This indicates that moderate memory updating balances feature stability and adaptability. Based on the experimental results, we suggest treating $\lambda_2$ as a tunable hyperparameter to be optimized for the needs of specific scenarios.


5.3 Detection and Segmentation Qualitative and Quantitative Analysis

Since our segmentation-detection model is multi-task and Sampled-ViT serves as a shared encoder for global features, we processed the datasets uniformly to suit multi-task training. First, the training and validation sets of the DeepCrack dataset were merged and bounding box annotations were generated; each image was expanded into a 30-frame video sequence through translation transformations, with per-frame segmentation labels and bounding box annotations generated synchronously. We also manually segmented and annotated 100 crack images from the UVA-PDD2023 dataset and generated 30 frames of video data with corresponding multi-task annotations using translation-scaling transformations. In the test phase, the UVA-PDD2023 test set is used to comprehensively evaluate the generalization performance of the model.

For the performance analysis of Sampled-ViT, we conducted systematic comparative experiments on all test images, as illustrated in Fig. 8. The benchmark models were selected according to the following principles: (1) Task relevance. The selected models (e.g., Mask2Former, DPT) are advanced representatives of semantic or general segmentation, and their design objectives are highly relevant to our crack segmentation task. (2) Architectural comparability. We specifically included Transformer-based models (e.g., SegGPT, DPT) to enable fair comparisons of efficiency (parameter count, FLOPs) and accuracy across similar architectures. (3) Industry influence and versatility. As a pioneering work in context-aware segmentation, SegGPT provides an important performance reference. The results show that Sampled-ViT with the EfficientNet-b5 backbone exhibits the best fine-grained segmentation capability in the road crack detection task. It generates finer and more continuous crack segments, significantly improving the detection rate of microcracks.


Figure 8: Segmentation results. From left to right: original image, SegGPT, Mask2Former, DPT, and ours

As shown in Table 8, the proposed method achieves an excellent performance balance in video segmentation tasks. Compared with mainstream models, our method reaches 45.12% mIoU with only 91 M parameters (ViT-Base based), exceeding SegGPT (41.2%) in accuracy, while its computation (157 GFLOPs) is significantly lower than that of DPT and SegGPT with similar ViT-Base architectures. Although Mask2Former (Swin Transformer) performs best with 52.6% mIoU, its 219 M parameters and 520 GFLOPs incur markedly higher computing costs. The results show that the proposed method retains a lightweight advantage while maintaining competitive segmentation accuracy, making it particularly suitable for real-time video analysis with limited computing resources.
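Parameter and FLOP figures of the kind reported in Table 8 can be measured with a profiler such as thop; below is a generic sketch in which a stock torchvision backbone stands in for the compared networks.

```python
import torch
import torchvision
from thop import profile  # third-party package: pip install thop

# Any torch.nn.Module can be profiled the same way; resnet50 is only a
# placeholder for the networks compared in Table 8.
model = torchvision.models.resnet50()
dummy = torch.randn(1, 3, 1024, 1024)
macs, params = profile(model, inputs=(dummy,))
# thop counts multiply-accumulate operations, commonly quoted as FLOPs.
print(f"params: {params / 1e6:.1f} M, compute: {macs / 1e9:.1f} G")
```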


For model validation, we conducted a comprehensive evaluation of the trained model on the standard test set, using accuracy, precision, recall, and F1 score to compare against CNN-based methods. The results show that the Transformer-optimized detection network achieves a significant improvement: as shown in Table 9, our method reaches an F1 score of 74.13 and an accuracy of 70.20%, which are 23.4% and 9.6% higher, respectively, than the best CNN model (YOLOv8). From the per-algorithm comparison in Table 10, under the same input size (3, 1333, 800), our method performs well in detection accuracy while maintaining a real-time processing speed of more than 16 fps. Although its parameter count (41.55 M) is slightly higher than some comparison methods, its computational cost (81.63 GFLOPs) is lower than that of DETR (86.93 GFLOPs) and far below that of the traditional CNN method Faster R-CNN (178 GFLOPs). As shown in Fig. 9, our method and DETR show stable detection capability across defect types; SSD and the YOLO series are faster but perform poorly on this dataset, and Faster R-CNN suffers from obvious false and missed detections. These results confirm the advantages of the proposed method in accuracy and efficiency. In the future, we will further improve real-time performance through code optimization to meet high-precision real-time detection requirements.
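For reference, the detection metrics used here follow their standard definitions; the sketch below computes them from matched-detection counts (the IoU threshold used to decide a match is an assumption, not the paper's stated value).

```python
def detection_prf(tp, fp, fn):
    """Precision, recall and F1 from matched detections; a prediction counts
    as a true positive when it matches a ground-truth box (e.g., IoU >= 0.5)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```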




Figure 9: Visual comparison of test results

5.4 Tracking Segmentation Qualitative and Quantitative Analysis

To verify the performance of the tracking model, we use the UAV-PDD2023 dataset to generate simulated video sequences, expanding each image into 30 frames through translation transformations to mimic continuous frame input in real scenes. In the test phase, the RoadDefect-MT dataset is used for generalization evaluation. The proposed tracking network achieves an average accuracy of 97.1% and an F1 score of 0.84 on the test set. As shown in Fig. 10, the precision-recall (PR) curve of pavement defect detection lies close to the upper-right corner of the coordinate system, indicating that the network tracks crack targets stably and accurately.
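A PR curve like that in Fig. 10 is traced by sweeping the detection confidence threshold. Below is a minimal sketch using scikit-learn with synthetic scores; note that a full detection evaluation must also fold never-detected ground truths into the recall, which this simplified sketch omits.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# scores: confidence of each predicted crack box; labels: 1 if the prediction
# matches a ground-truth defect (e.g., IoU >= 0.5), else 0. Synthetic values.
labels = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([0.95, 0.9, 0.85, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3])

precision, recall, thresholds = precision_recall_curve(labels, scores)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
print(f"best F1 = {f1.max():.2f}")
```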


Figure 10: Precision-recall (PR) curve and F1 score performance analysis

To verify the advancement of the proposed TransTra-Count method, we comprehensively compared it with current mainstream multi-object tracking algorithms on the RoadDefect-MT dataset. As a classic paradigm, DeepSORT integrates a motion model with appearance features and has served as the foundation for many subsequent studies; it is therefore adopted as the performance baseline. ByteTrack centers on the motion model and significantly enhances association robustness by exploiting low-score detection boxes, representing the state of the art of this technical route. BoT-SORT extends ByteTrack with camera motion compensation, while OC-SORT introduces observation-centric trajectory smoothing; together they form a set of advanced online tracking strategies. All of these trackers have stable open-source implementations, enabling reproducible and fair comparisons. The experimental results are presented in Table 11.
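Among the reported metrics, MOTA and IDF1 follow their standard definitions, sketched below; HOTA combines detection and association scores over multiple localization thresholds and is omitted here for brevity.

```python
def mota(fn, fp, idsw, num_gt):
    """CLEAR-MOT accuracy: 1 - (misses + false positives + ID switches) / GT,
    with all counts accumulated over every frame of the sequence."""
    return 1.0 - (fn + fp + idsw) / max(num_gt, 1)

def idf1(idtp, idfp, idfn):
    """Identity F1: fraction of detections whose identity is kept correct,
    computed from the optimal trajectory-level matching."""
    return 2.0 * idtp / max(2 * idtp + idfp + idfn, 1)
```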


A controlled-variable approach was adopted: two detectors, the proposed VitSeg-Det and the widely used YOLOv8, were each fixed and paired with each of the five tracking methods to evaluate the performance of the resulting combinations in multi-object tracking.

First, in terms of detector performance, the proposed VitSeg-Det demonstrates significant advantages. When paired with the same tracker, VitSeg-Det consistently outperforms YOLOv8 in all core tracking metrics (HOTA, MOTA, IDF1). Taking the combination with ByteTrack as an example, VitSeg-Det brings a 1.5-point improvement in HOTA (64.5 vs. 63.0), MOTA (79.5 vs. 78.0), and IDF1 (78.5 vs. 77.0), fully demonstrating its higher accuracy and robustness as a detection module.

Second, in the horizontal comparison of trackers with the same detector, ByteTrack and BoT-SORT delivered the best and closely matched performance, followed by OC-SORT, while the classic DeepSORT algorithm performed relatively weakly. This trend reflects continuous progress in the robustness of data association for multi-object tracking. Notably, the proposed TransTra-Count tracker achieved the best performance among all combinations: paired with VitSeg-Det, it reached the highest tracking performance (HOTA: 67.0, MOTA: 81.1, IDF1: 80.2), significantly outperforming the other algorithms and strongly validating the effectiveness of TransTra-Count's design.

Finally, in terms of computational efficiency, combinations using YOLOv8 achieved the highest frame rates (ranging from 38 to 42 FPS), benefiting from the computational efficiency of its CNN architecture. Although the Transformer-based VitSeg-Det combinations resulted in slightly lower frame rates (ranging from 34 to 39 FPS), they offered a favorable trade-off between accuracy and efficiency due to their superior precision. It is particularly noteworthy that the proposed TransTra-Count method maintained a real-time processing speed of 34 FPS while achieving leading accuracy metrics, successfully balancing precision and efficiency.

In conclusion, the experimental results demonstrate that both VitSeg-Det as a detection module and TransTra-Count as a tracking module deliver outstanding performance. Their combination forms a powerful and efficient solution for road defect detection and tracking.

To comprehensively evaluate the overall performance of the proposed method, we conducted a comparative analysis from a more macroscopic method paradigm perspective. The experimental results are shown in Table 12.


The experimental results show that the joint optimization framework proposed in this paper achieves the best performance across all key indicators. Specifically, it reaches an mAP of 47.9% in detection and an mIoU of 50.1% in segmentation, and it significantly outperforms the other frameworks in multi-object tracking, with a HOTA of 67.0, MOTA of 81.1, and IDF1 of 80.2. This indicates that the joint optimization mechanism effectively promotes collaboration and information complementation between the detection and tracking tasks, thereby achieving globally optimal performance.

Comparing paradigms, the end-to-end paradigm (TrackFormer) outperforms the decoupled methods in tracking metrics, demonstrating the potential of end-to-end learning. However, its detection accuracy (mAP 45.5%) is relatively low and its inference is the slowest (18 FPS), reflecting the paradigm's limitations in efficiency and flexibility. The decoupled paradigm, leveraging its architectural advantages, is the fastest (40–42 FPS) and has become a widely adopted, efficient solution in practice, but its accuracy ceiling still falls short of the joint optimization method.

The reference pure segmentation model Mask2Former achieved an mIoU of 52.6% in semantic segmentation tasks, but it lacks tracking capabilities and cannot output multi-object tracking metrics. Moreover, its inference speed (22 FPS) is lower than that of most detection and tracking models.

5.5 TransTra-Count Analysis

To verify the effectiveness of the Transformer-based pavement defect detection and counting model, we conducted comprehensive tests on multiple datasets, including: (1) publicly available benchmark datasets enhanced with translation augmentation; and (2) our self-built RoadDefect-MT videos recorded at 3840 × 2160 @ 30 fps. During the experiments, we strictly maintained the original resolution of each dataset to ensure the authenticity of the test conditions. The experimental results are shown in Figs. 11 and 12.


Figure 11: Defect tracking count (sequence 1)


Figure 12: Defect tracking count (sequence 2)

As shown in Figs. 11 and 12, our system achieved accurate defect detection and stable tracking on the aforementioned datasets. The method performs well on test data with different resolutions and frame rates: it not only accurately detects various types of pavement defects but also achieves stable target tracking and precise counting. Notably, in the 4K high-definition video tests, despite the significant increase in input data volume, the model maintained stable detection accuracy and counting precision, preliminarily demonstrating the algorithm's adaptability to high-resolution scenarios.

At the same time, we fully recognize a limitation of the current research: the self-collected dataset used for primary validation is limited in scale (only 33 videos) and scene diversity (a single collection location), which may affect the model's generalization in broader and more complex real-world environments. Although supplementary experiments on public datasets such as DeepCrack and UAV-PDD2023 further support the effectiveness of the method, these datasets, constructed from static images, cannot fully replicate the challenges of real dynamic video. The conclusions of this paper should therefore be regarded as a strong validation of the proposed TransTra-Count model within the scope of our existing datasets. These results provide a valuable reference and foundation for engineering applications, but the model's universality still needs to be verified on larger and more diverse real-world video datasets.

5.6 Embedded Device Verification

To comprehensively evaluate the deployment feasibility and resource adaptability of the proposed method in actual industrial scenarios, this section focuses on the edge inference platform and systematically analyzes the performance of this method under lightweight deployment conditions.

The experiments were conducted on the embedded edge device Jetson AGX Orin (32 GB). Before deployment, FP16 precision optimization was performed with the TensorRT toolchain to match the acceleration scheme used in real embedded scenarios; the input image size was uniformly set to 1024 × 1024 and the batch size to 1 to simulate a continuous single-frame processing flow, ensuring consistent and reproducible evaluation.
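A minimal sketch of such an FP16 engine build with the TensorRT Python API is shown below; the ONNX file name is hypothetical, and the exact builder options may differ from our deployment scripts.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a hypothetical export of the trained network.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 precision, as in our deployment

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```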

The performance tests cover four key indicators: model size (MB), maximum frame rate (FPS), peak memory usage (MB), and single-frame latency (ms). As shown in Table 13, the proposed method, while integrating detection and segmentation, still maintains low resource usage and high inference efficiency: the maximum frame rate reaches 20 FPS with a single-frame latency of 49.8 ms. This demonstrates the method's engineering applicability in high-frequency industrial defect detection and quality monitoring tasks, providing a feasible lightweight solution for practical industrial applications.
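Latency and frame-rate figures of this kind are typically obtained by timing warmed-up, synchronized inference at batch size 1. The following PyTorch-side sketch illustrates the procedure; our reported numbers were measured on the TensorRT engine itself.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, shape=(1, 3, 1024, 1024), n_warmup=10, n_iters=100):
    """Average single-frame latency (ms) and FPS for a model assumed to be
    already on the GPU in eval mode."""
    x = torch.randn(*shape, device="cuda")
    for _ in range(n_warmup):           # warm-up: allocator and cuDNN tuning
        model(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()            # wait for all GPU kernels to finish
    latency_ms = (time.perf_counter() - t0) / n_iters * 1e3
    return latency_ms, 1000.0 / latency_ms
```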


6  Conclusion and Limitations Discussion

Focusing on the requirements of automatic road crack identification and structural parameter quantification, this research constructs a multi-task integrated crack measurement system based on the Transformer architecture, aiming to overcome the repeated-counting errors, structural information loss, and uninterpretable measurement outputs that existing methods face in complex dynamic scenes. The system realizes end-to-end closed-loop modeling from detection and segmentation to tracking and counting, advancing the crack recognition task from “image-level classification” to “structured measurement”. Its innovative contributions in visual perception and quantitative analysis are mainly reflected in the following four aspects:

First, to address the difficulty of segmenting microcracks against low-contrast backgrounds, the VitSeg-Det integrated detection and segmentation network is designed. By combining a channel-spatial attention fusion scoring mechanism with macro-scale dilated convolution feature modeling, it accurately extracts the edge structure and global topology of cracks, effectively supporting subsequent structural parameter measurement.

Second, to address recognition drift and repeated counting of the same crack object in dynamic video caused by illumination and viewpoint changes, the TransTra-Count module is proposed. Built on spatial-feature bimodal association and a long-term memory update mechanism, it maintains crack identity and manages object life cycles. Across many real videos, the system produces accurate, non-repetitive counts and preserves measurement stability under interference such as occlusion and blur.

Third, for crack structure measurement, this paper proposes an unsupervised width estimation method based on the mask skeleton and designs a smooth update mechanism together with a width-change-rate constraint loss to ensure the temporal continuity and physical plausibility of the estimates.
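As an illustration of this measurement principle, the sketch below derives per-pixel widths from the mask skeleton via a distance transform and applies a rate-capped smoothing step; the smoothing constants are hypothetical, not the paper's tuned values.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def crack_width(mask):
    """Average crack width from a binary mask: twice the distance from each
    skeleton pixel to the nearest background pixel (medial-axis diameter,
    in pixels)."""
    mask = mask.astype(bool)
    skeleton = skeletonize(mask)
    dist = distance_transform_edt(mask)   # distance to nearest background
    widths = 2.0 * dist[skeleton]
    return float(widths.mean()) if widths.size else 0.0

def smooth_width(prev_w, obs_w, alpha=0.2, max_rate=0.1):
    """EMA update with a cap on the relative change per frame, mirroring the
    width-change-rate constraint (alpha and max_rate are hypothetical)."""
    target = (1.0 - alpha) * prev_w + alpha * obs_w
    delta = np.clip(target - prev_w, -max_rate * prev_w, max_rate * prev_w)
    return prev_w + delta
```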

Fourth, to provide more comprehensive experimental support for system verification and measurement evaluation, this paper constructs a self-built high-resolution dynamic crack dataset, RoadDefect-MT, which covers a variety of typical pavement distress patterns under different illumination and occlusion conditions and is supplemented with average-width labels and track IDs, filling the gaps in temporal consistency and measurement annotation left by existing public datasets. The experiments show that our method significantly outperforms existing mainstream methods on this dataset, exhibiting stronger measurement stability, statistical accuracy, and engineering adaptability.

Overall, this paper has made several key breakthroughs in visual measurement modeling, dynamic target statistics, and structural parameter extraction, verifying the effectiveness and scalability of the Transformer architecture in multi-task visual measurement systems. However, there are still several limitations that need to be further explored:

First, the current validation mainly covers road-surface scenarios; the method's applicability to vertical or elevated surfaces such as bridges and tunnels remains an open question, owing to differences in crack morphology under different viewpoints and gravity, as well as the shadow and perspective distortion encountered when imaging vertical surfaces. Second, the research is limited in target type and measurement granularity: the current system focuses on linear cracks, its generalization to other common distress types such as potholes and network (alligator) cracks has not been verified, and parameter measurement is still concentrated on macroscopic indicators such as average width, so quantification of three-dimensional attributes such as depth and volume remains to be developed. Additionally, the system faces challenges in extreme environments such as heavy rain, severe staining, and intense vibration during deployment; although the method is robust under common interference, its generalization to unseen extreme scenarios still needs verification on broader datasets.

In response to these limitations, future work will focus on the following aspects. First, extend the detection range from road surfaces to bridge facades, tunnel walls, and other vertical surfaces to verify and enhance the model's generalization across scenarios and viewpoints. Second, develop a universal measurement framework applicable to potholes, network cracks, and other structural distresses. Third, construct a new generation of three-dimensional bridge and road distress datasets that integrate multi-view imaging, depth information, and calibration parameters to support fine-grained structural measurement in space, and explore Transformer-based segment-level structural modeling to achieve differentiated measurement and evaluation of cracks and other distresses, providing more precise data support for road and bridge maintenance decisions. Finally, research robustness enhancement methods that integrate physical priors with adaptive learning mechanisms to improve the system's stability under extreme conditions, promoting the effective translation of these research results into engineering practice.

Acknowledgement: We would like to express our gratitude to all those who contributed to the completion of this research. Their insights, discussions, and support greatly enhanced the quality and depth of this work.

Funding Statement: This work was supported in part by the Natural Science Foundation of Shaanxi Province of China under Grant 2024JC-YBQN-0695.

Author Contributions: The authors confirm contribution to the paper as follows: methodology, Langyue Zhao; software, Langyue Zhao and Yubin Yuan; validation, Langyue Zhao, Yubin Yuan and Yiquan Wu; formal analysis, Langyue Zhao and Yubin Yuan; investigation, Langyue Zhao; resources, Langyue Zhao; data curation, Langyue Zhao; writing—original draft preparation, Langyue Zhao and Yubin Yuan; writing—review and editing, Langyue Zhao and Yubin Yuan; visualization, Langyue Zhao and Yubin Yuan; project administration, Langyue Zhao and Yubin Yuan. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: Data sharing not applicable.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137–49. doi:10.1109/TPAMI.2016.2577031. [Google Scholar] [PubMed] [CrossRef]

2. Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv:1804.02767. 2018. [Google Scholar]

3. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: Proceedings of the European Conference on Computer Vision; 2020 Aug 23–28; Glasgow, UK. p. 213–29. doi:10.1007/978-3-030-58452-8_13. [Google Scholar] [CrossRef]

4. Fan L, Wang D, Wang J, Li Y, Cao Y, Liu Y, et al. Pavement defect detection with deep learning: a comprehensive survey. IEEE Trans Intell Veh. 2024;9(3):4292–311. doi:10.1109/tiv.2023.3326136. [Google Scholar] [CrossRef]

5. Ma D, Fang H, Wang N, Lu H, Matthews J, Zhang C. Transformer-optimized generation, detection, and tracking network for images with drainage pipeline defects. Comput Aided Civil Eng. 2023;38(15):2109–27. doi:10.1111/mice.12970. [Google Scholar] [CrossRef]

6. Vasan V, Sridharan NV, Balasundaram RJ, Vaithiyanathan S. Ensemble-based deep learning model for welding defect detection and classification. Eng Appl Artif Intell. 2024;136:108961. doi:10.1016/j.engappai.2024.108961. [Google Scholar] [CrossRef]

7. Zhao K, Xu S, Loney J, Visentin A, Li Z. Road pavement health monitoring system using smartphone sensing with a two-stage machine learning model. Autom Constr. 2024;167:105664. doi:10.1016/j.autcon.2024.105664. [Google Scholar] [CrossRef]

8. Peng X, Wang P, Zhou K, Yan Z, Zhong X, Zhao C. Bridge defect detection using small sample data with deep learning and Hyperspectral imaging. Autom Constr. 2025;170:105900. doi:10.1016/j.autcon.2024.105900. [Google Scholar] [CrossRef]

9. Dung CV, Anh LD. Autonomous concrete crack detection using deep fully convolutional neural network. Autom Constr. 2019;99(4):52–8. doi:10.1016/j.autcon.2018.11.028. [Google Scholar] [CrossRef]

10. Li S, Zhao X. Automatic crack detection and measurement of concrete structure using convolutional encoder-decoder network. IEEE Access. 2020;8:134602–18. doi:10.1109/ACCESS.2020.3011106. [Google Scholar] [CrossRef]

11. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 Jun 7–12; Boston, MA, USA. p. 3431–40. doi:10.1109/CVPR.2015.7298965. [Google Scholar] [CrossRef]

12. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062. 2014. [Google Scholar]

13. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; 2015 Oct 5–9; Munich, Germany. p. 234–41. doi:10.1007/978-3-319-24574-4_28. [Google Scholar] [CrossRef]

14. Sun X, Xie Y, Jiang L, Cao Y, Liu B. DMA-net: deepLab with multi-scale attention for pavement crack segmentation. IEEE Trans Intell Transp Syst. 2022;23(10):18392–403. doi:10.1109/TITS.2022.3158670. [Google Scholar] [CrossRef]

15. Kang D, Benipal SS, Gopal DL, Cha YJ. Hybrid pixel-level concrete crack segmentation and quantification across complex backgrounds using deep learning. Autom Constr. 2020;118(4):103291. doi:10.1016/j.autcon.2020.103291. [Google Scholar] [CrossRef]

16. Ali R, Cha YJ. Attention-based generative adversarial network with internal damage segmentation using thermography. Autom Constr. 2022;141:104412. doi:10.1016/j.autcon.2022.104412. [Google Scholar] [CrossRef]

17. Kang DH, Cha YJ. Efficient attention-based deep encoder and decoder for automatic crack segmentation. Struct Health Monit. 2022;21(5):2190–205. doi:10.1177/14759217211053776. [Google Scholar] [PubMed] [CrossRef]

18. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. p. 2980–8. doi:10.1109/ICCV.2017.322. [Google Scholar] [CrossRef]

19. Chen DR, Chiu WM. Road crack detection using Gaussian mixture model for diverse illumination images. In: Proceedings of the 2020 30th International Telecommunication Networks and Applications Conference (ITNAC); 2020 Nov 25–27; Melbourne, VIC, Australia. p. 1–6. doi:10.1109/ITNAC50341.2020.9315113. [Google Scholar] [CrossRef]

20. Yan BF, Xu GY, Luan J, Lin D, Deng L. Pavement distress detection based on faster R-CNN and morphological operations. China J Highw Transp. 2021;34(9):181–93. doi:10.19721/j.cnki.1001-7372.2021.09.015. [Google Scholar] [CrossRef]

21. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: single shot multiBox detector. In: Proceedings of the Computer Vision—ECCV 2016; 2016 Oct 11–14; Amsterdam, The Netherlands. p. 21–37. doi:10.1007/978-3-319-46448-0_2. [Google Scholar] [CrossRef]

22. Yan K, Zhang Z. Automated asphalt highway pavement crack detection based on deformable single shot multi-box detector under a complex environment. IEEE Access. 2021;9:150925–38. doi:10.1109/ACCESS.2021.3125703. [Google Scholar] [CrossRef]

23. Hou Y, Shi H, Chen N, Liu Z, Wei H, Han Q. Vision image monitoring on transportation infrastructures: a lightweight transfer learning approach. IEEE Trans Intell Transport Syst. 2023;24(11):12888–99. doi:10.1109/tits.2022.3150536. [Google Scholar] [CrossRef]

24. Zhang Y, Zuo Z, Xu X, Wu J, Zhu J, Zhang H, et al. Road damage detection using UAV images based on multi-level attention mechanism. Autom Constr. 2022;144:104613. doi:10.1016/j.autcon.2022.104613. [Google Scholar] [CrossRef]

25. Ma D, Fang H, Wang N, Zhang C, Dong J, Hu H. Automatic detection and counting system for pavement cracks based on PCGAN and YOLO-MF. IEEE Trans Intell Transport Syst. 2022;23(11):22166–78. doi:10.1109/tits.2022.3161960. [Google Scholar] [CrossRef]

26. Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20–25; Nashville, TN, USA. p. 6877–86. doi:10.1109/CVPR46437.2021.00681. [Google Scholar] [CrossRef]

27. Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst. 2021;34:12077–90. [Google Scholar]

28. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: Proceedings of the Computer Vision—ECCV 2022; 2022 Oct 23–27; Tel Aviv, Israel. p. 205–18. doi:10.1007/978-3-031-25066-8_9. [Google Scholar] [CrossRef]

29. Wang W, Su C. Automatic concrete crack segmentation model based on transformer. Autom Constr. 2022;139:104275. doi:10.1016/j.autcon.2022.104275. [Google Scholar] [CrossRef]

30. Zhou Z, Zhang J, Gong C. Hybrid semantic segmentation for tunnel lining cracks based on Swin Transformer and convolutional neural network. Comput Aided Civ Infrastruct Eng. 2023;38(17):2491–510. doi:10.1111/mice.13003. [Google Scholar] [CrossRef]

31. Guo F, Qian Y, Liu J, Yu H. Pavement crack detection based on transformer network. Autom Constr. 2023;145:104646. doi:10.1016/j.autcon.2022.104646. [Google Scholar] [CrossRef]

32. Wang S, Li BZ, Khabsa M, Fang H, Ma H. Linformer: self-attention with linear complexity. arXiv:2006.04768. 2020. [Google Scholar]

33. Wan H, Gao L, Su M, Sun Q, Huang L. Attention-based convolutional neural network for pavement crack detection. Adv Mater Sci Eng. 2021;2021:5520515. doi:10.1155/2021/5520515. [Google Scholar] [CrossRef]

34. Liu G, Liu G, Chai X, Li L, Dai F, Huang B. Crack-DETR: complex pavement crack detection by multifrequency feature extraction and fusion. IEEE Sens J. 2025;25(9):16349–60. doi:10.1109/jsen.2025.3549121. [Google Scholar] [CrossRef]

35. Dong R, Xia J, Zhao J, Hong L. CL-PSDD: contrastive learning for adaptive generalized pavement surface distress detection. IEEE Trans Intell Transp Syst. 2025;26(4):5211–24. doi:10.1109/TITS.2024.3525193. [Google Scholar] [CrossRef]

36. Zou Q, Zhang Z, Li Q, Qi X, Wang Q, Wang S. DeepCrack: learning hierarchical convolutional features for crack detection. IEEE Trans Image Process. 2019;28(3):1498–512. doi:10.1109/TIP.2018.2878966. [Google Scholar] [PubMed] [CrossRef]

37. Yan H, Zhang J. UAV-PDD2023: a benchmark dataset for pavement distress detection based on UAV images. Data Brief. 2023;51:109692. doi:10.1016/j.dib.2023.109692. [Google Scholar] [PubMed] [CrossRef]

38. Zhu J, Zhong J, Ma T, Huang X, Zhang W, Zhou Y. Pavement distress detection using convolutional neural networks with images captured via UAV. Autom Constr. 2022;133:103991. doi:10.1016/j.autcon.2021.103991. [Google Scholar] [CrossRef]

39. Pujara A, Bhamare M. DeepSORT: real time & multi-object detection and tracking with YOLO and TensorFlow. In: Proceedings of the 2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS); 2022 Nov 24–26; Trichy, India. p. 456–60. doi:10.1109/ICAISS55157.2022.10011018. [Google Scholar] [CrossRef]

40. Zhang Y, Sun P, Jiang Y, Yu D, Weng F, Yuan Z, et al. ByteTrack: multi-object tracking by associating every detection box. In: Proceedings of the Computer Vision—ECCV 2022; 2022 Oct 23–27; Tel Aviv, Israel. p. 1–21. doi:10.1007/978-3-031-20047-2_1. [Google Scholar] [CrossRef]

41. Aharon N, Orfaig R, Bobrovsky BZ. BoT-SORT: robust associations multi-pedestrian tracking. arXiv:2206.14651. 2022. [Google Scholar]

42. Cao J, Pang J, Weng X, Khirodkar R, Kitani K. Observation-centric SORT: rethinking SORT for robust multi-object tracking. arXiv:2203.14360. 2022. [Google Scholar]




Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.