A Causal-Transformer Based Meta-Learning Method for Few-Shot Fault Diagnosis in CNC Machine Tool Bearings

Youlong Lyu; Ying Chu; Qingpeng Qiu; Jie Zhang; Jutao Guo

doi:10.32604/cmc.2025.068157

Open Access

ARTICLE

Youlong Lyu^{1,2,*}, Ying Chu^{3}, Qingpeng Qiu^{3}, Jie Zhang^{1,2}, Jutao Guo^{4}

1 Institute of Artificial Intelligence, Donghua University, Shanghai, 201620, China
2 Shanghai Engineering Research Center of Industrial Big Data and Intelligent System, Shanghai, 201620, China
3 College of Information Science and Technology, Donghua University, Shanghai, 201620, China
4 Shanghai Spaceflight Precision Machinery Institute, No. 388 Chuanda Road, Minhang District, Shanghai, 201109, China

* Corresponding Author: Youlong Lyu. Email: email

(This article belongs to the Special Issue: Advancements in Machine Fault Diagnosis and Prognosis: Data-Driven Approaches and Autonomous Systems)

Computers, Materials & Continua 2025, 85(2), 3393-3418. https://doi.org/10.32604/cmc.2025.068157

Received 22 May 2025; Accepted 09 July 2025; Issue published 23 September 2025

Abstract

In intelligent manufacturing processes such as aerospace production, computer numerical control (CNC) machine tools require real-time optimization of process parameters to meet precision machining demands. These dynamic operating conditions increase the risk of fatigue damage in CNC machine tool bearings, creating an urgent need for rapid and accurate fault diagnosis methods that maintain production efficiency and extend equipment uptime. However, varying conditions induce feature distribution shifts, and scarce fault samples limit model generalization. Therefore, this paper proposes a causal-Transformer-based meta-learning (CTML) method for bearing fault diagnosis in CNC machine tools, comprising three core modules: (1) the original bearing signal is transformed into a multi-scale time-frequency feature space using the continuous wavelet transform; (2) a causal-Transformer architecture is designed to achieve feature extraction and fault classification based on the physical causal law of fault propagation; (3) the above mechanisms are integrated into a model-agnostic meta-learning (MAML) framework to achieve rapid cross-condition adaptation through an adaptive gradient clipping strategy. Experimental results on multiple bearing datasets show that under few-shot cross-condition scenarios (3-way 1-shot and 3-way 5-shot), the proposed CTML outperforms benchmark models (e.g., Transformer, domain adversarial neural networks (DANN), and MAML) in terms of classification accuracy and sensitivity to operating conditions, while maintaining a moderate level of model complexity.

Keywords

Fault diagnosis; meta-learning; CNC machine tools; aerospace

1 Introduction

As a core component of advanced manufacturing equipment, the health status of high-precision computer numerical control (CNC) systems directly influences machining accuracy and production line stability [1]. Among mechanical failures in CNC machine tools, bearing faults account for a significant proportion, often leading to unplanned downtime that results in substantial economic losses and potential safety risks [2,3]. Specifically in aerospace manufacturing, where CNC systems machine high-value alloy components (e.g., turbine blades, structural parts) to micron-level tolerances, bearing defects are often highly concealed and propagate rapidly. Undetected failures may cause catastrophic scrapping of parts during costly machining processes and disrupt production schedules [4,5]. Therefore, accurate bearing fault diagnosis has become an indispensable capability for intelligent manufacturing systems.

Fundamentally, vibration signals provide the most direct physical manifestation of bearing degradation through contact-induced harmonic responses, making them the primary diagnostic data source in industry benchmarks. Traditional manual diagnostic approaches suffer from inherent limitations, including low efficiency and prolonged inspection cycles, due to the structural complexity and precision requirements of CNC systems [6,7]. Although deep learning-based intelligent fault diagnosis methods have achieved remarkable progress with the advancement of industrial Internet of Things (IoT) and smart sensing technologies [8,9], their generalization capability is severely constrained by the scarcity of labeled fault samples in high-reliability applications such as aerospace [10,11]. In aerospace production, collecting sufficient vibration-based fault data is prohibitively expensive and risky, as inducing bearing failures on operational equipment during the machining of critical components is unacceptable. Furthermore, the dynamic adjustment of process parameters in flexible aerospace manufacturing introduces time-varying nonlinear characteristics in bearing contact fatigue damage [12,13], making conventional data-driven models unreliable under varying operational conditions.

To address the above challenges, existing methods have shown significant effectiveness but suffer from clear limitations. Classic transfer learning methods [14] mitigate data distribution differences through cross-domain knowledge transfer, yet their limited feature alignment capabilities impair adaptation to complex operational variations. Domain adversarial neural networks (DANN) [15] achieve finer domain adaptation, but their complex adversarial training mechanisms make model convergence difficult and incur substantial computational overhead, severely limiting their practical application in industrial scenarios. Conversely, meta-learning methods adopt a “learning-to-learn” optimization paradigm and demonstrate unique advantages: model-agnostic meta-learning (MAML) [16] achieves rapid cross-condition adaptation with minimal samples by constructing a two-layer optimization framework, while embedding-based meta-learning methods [17] further validate the feasibility of this technology in industrial settings. However, existing meta-learning approaches remain constrained by the sequential dependencies of gradient propagation when processing temporally rich vibration signals, impeding full exploitation of the time-frequency correlations of fault features. Additionally, they lack effective gradient optimization mechanisms to balance model convergence speed and stability.

Therefore, this paper proposes a causal-Transformer meta-learning method (CTML), with three primary contributions: (1) constructing multi-scale time-frequency maps based on wavelet transforms to enhance the recognizability of weak fault features, (2) developing causal-Transformers to ensure physically consistent temporal modeling, and (3) designing a dynamic meta-optimization strategy for enhanced cross-condition adaptation. Experimental results demonstrate CTML’s superiority over existing bearing fault diagnosis methods under few-shot and variable operating conditions.

The remainder of this paper is organized as follows: Section 2 reviews the current bearing fault diagnosis methods of CNC machine tools and related deep meta-learning theories; Section 3 details the proposed CTML methodology; Section 4 presents experimental validation on the CWRU dataset, evaluating CTML performance against benchmarks; Section 5 discusses CTML’s results and implications; Section 6 concludes the study and outlines future research directions.

2 Related Work

2.1 Bearing Fault Diagnosis

CNC machine tools, serving as core equipment in aerospace and other manufacturing sectors, contain critical moving components (e.g., spindles, screws, and cutting tools) that operate continuously under extreme conditions of high speed, precision, and heavy loads. These demands directly impact the machining accuracy and productivity of aerospace components [18,19]. In particular, the machine tool-bearing system has become a research focus in fault diagnosis due to its complex dynamic characteristics and high failure rate [20,21]. With advances in the Industrial Internet of Things, signal processing techniques (e.g., time-domain analysis and spectral diagnosis) combined with deep learning can automatically extract fault-sensitive features from bearing monitoring signals (vibration, temperature, acoustic emission, etc.), significantly enhancing diagnostic efficiency [22,23]. For example, Gao et al. [24] proposed an adaptive generalized empirical wavelet transform to effectively suppress the noise of bearing fault signals; Alam et al. [25] proposed a multimodal bearing fault classification method based on a one-dimensional convolutional neural network (CNN) framework that fuses the internal vibration and motor phase current signals. Fault diagnosis models based on the Transformer architecture show significant advantages in handling high-dimensional time-series data generated by industrial equipment owing to their self-attention mechanism. Lv et al. [26] proposed an adaptive feature modal decomposition and Transformer fusion method to realize high-precision diagnosis of rolling bearing faults; Han et al. [27] constructed a CNN-Transformer hybrid framework, which maintains excellent diagnostic performance under noise interference and data distribution imbalance conditions; Jang et al. [28] effectively improved the classification accuracy of mechanical faults through vision Transformer (ViT) technology; and Ren and Lou [29] proposed a synchronized wavelet transform coupled with an enhanced ViT for accurate rolling bearing diagnosis across operating conditions.

However, most existing methods rely on laboratory steady-state data, making it difficult for them to adapt to dynamic operational changes caused by process iterations (e.g., material changes and parameter adjustments) in real aerospace manufacturing [30,31]. Additionally, CNC machine tool bearing failure characteristics are affected by material replacement, clamping parameter adjustments, and sudden changes in cutting loads; their evolution strictly follows the law of temporal causality and depends only on the characteristics of the current and historical moments [32,33]. This contradiction exposes traditional data-driven models to feature drift. Therefore, this paper proposes to construct a physically interpretable time-frequency feature space based on the wavelet transform, and then to introduce a mask matrix into the Transformer architecture to improve the multi-head attention mechanism so that it attends only to the temporal features of the current moment and its historical moments, preventing the model from incorrectly using the vibration features of future moments to infer historical faults.

2.2 Deep Meta-Learning

Compared with conventional data-driven feature mining approaches, meta-learning demonstrates distinctive paradigm superiority through its capacity to extract highly generalizable meta-knowledge from the task distribution perspective [34–36]. Fundamentally, this paradigm shift, combined with deep neural networks’ intrinsic representational learning capabilities, enables robust fault diagnosis performance even under conditions of severe data scarcity (as limited as single or few samples).

The rapid development of deep meta-learning has further accelerated the integration of neural network representation learning and meta-learning frameworks, forming two primary research branches: (1) Optimization-based approaches: Pourghoraba et al. [37] employed MAML for reliable mechanical fault detection using limited data. Mallick et al. [38] integrated MAML with majority voting to enhance few-shot learning capabilities and generalization while ensuring output stability. Qiao et al. [39] proposed a MAML-based fault diagnosis model with a convolutional normalized Transformer encoder, which has a significant advantage in the diagnosis of faults of wavelet transform generators. Li et al. [40] proposed a meta-learning method based on meta-feature augmentation, which is applied to bearing fault recognition under different working conditions. Liu and Peng [41] proposed a semi-supervised meta-learning method based on a simplified graph convolutional neural network to classify bearing faults under variable operating conditions. (2) Metric-based approaches: Zhong et al. [42] proposed an improved metric-based meta-learning approach for accurate few-shot cross-domain fault diagnosis of bearings.

Notably, optimization-based approaches achieve rapid model adaptation via gradient-based updates, making them particularly suited for variable-condition fault diagnosis. Building on this foundation, this paper proposes a deep meta-learning framework integrating MAML with Transformer architecture. This approach effectively addresses bearing fault diagnosis under few-shot variable conditions by leveraging MAML’s fast adaptation capability and Transformer’s powerful feature extraction capacity.

3 Methods

This study proposes a causal-Transformer meta-learning (CTML) method for CNC machine tool bearing fault diagnosis. As shown in Fig. 1, first, the original bearing vibration signals are converted into time-frequency maps by the continuous wavelet transform (CWT) and used as inputs; second, based on the temporal causality of fault propagation, the causal-Transformer layer is designed to extract features and perform classification; and finally, a two-stage meta-learning strategy combining meta-training and meta-testing is adopted to enable fast cross-condition parameter adaptation, enhancing the performance of small-sample fault diagnosis under variable working conditions. In addition, the main mathematical symbols and descriptions involved in this paper are summarized in Table 1.

Figure 1: Technical route of the CTML method

3.1 Data Acquisition and Pre-Processing

For bearing failures (e.g., inner ring spalling, ball wear) of CNC machine tools, periodic shock vibrations are usually triggered, which need to be analyzed by time-frequency analysis to ensure the interpretability of the features. Therefore, in this section, the original vibration signal is segmented by the sliding window (window size 1024, step size 512) operation, and the data magnitude is unified after maximum absolute value normalization.

Bearing failures (such as inner ring peeling and rolling element wear) generate periodic impact vibrations that manifest as damped oscillations in the time domain. These oscillations closely match the Morlet wavelet basis function in waveform characteristics. Compared to traditional Fourier transforms, the wavelet transform provides superior noise resistance when processing non-stationary signals. Therefore, the complex Morlet wavelet with the specification cmor1-3 is selected as the basis function to balance the time-frequency resolution. The “cmor” denotes the complex Morlet wavelet family, and the suffix “1–3” explicitly defines its bandwidth parameter (B = 1) and center frequency (C = 3) according to the standard wavelet notation. The B value controls the balance between time resolution and frequency resolution, while the C value determines the oscillation frequency of the wavelet. This configuration (B = 1, C = 3) provides optimal transient impact sensitivity while maintaining sufficient frequency resolution for bearing fault features. In this section, vibration signals acquired from CNC machine tool bearings are transformed into two-dimensional time-frequency representations using the complex Morlet wavelet transform with the cmor1-3 parameters. This representation preserves both the time-domain and frequency-domain information of the signal, thereby delivering more informative input features for subsequent fault diagnosis models. The formula for this process is given below:

$$\mathrm{CWT}(a,b)=\frac{1}{\sqrt{a}}\int_{-\infty}^{\infty}x(t)\,\overline{\varphi\!\left(\frac{t-b}{a}\right)}\,dt, \tag{1}$$

where x(t) is the input vibration signal, φ is the wavelet basis function (cmor1-3), a is the scale parameter (ranging from 1 to 64), and b is the time translation parameter (with step size of 512 samples).
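The preprocessing chain described above (sliding-window segmentation, maximum-absolute-value normalization, and Eq. (1) with a complex Morlet basis) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names are hypothetical, the integral is discretized by direct summation for readability, and the wavelet is assumed to take the common form (1/√(πB))·exp(−t²/B)·exp(j2πCt) with B = 1 and C = 3.

```python
import cmath
import math


def segment(signal, window=1024, step=512):
    """Split a 1-D signal into overlapping windows (sliding window)."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, step)]


def max_abs_normalize(values):
    """Scale a segment into [-1, 1] by its maximum absolute value."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak > 0 else list(values)


def cmor(t, bandwidth=1.0, center=3.0):
    """Complex Morlet wavelet with bandwidth B and center frequency C (assumed form)."""
    norm = 1.0 / math.sqrt(math.pi * bandwidth)
    return norm * cmath.exp(2j * math.pi * center * t) * math.exp(-t * t / bandwidth)


def cwt(x, scales, dt=1.0):
    """Direct (slow) discretization of Eq. (1): rows are scales a, columns are shifts b."""
    n = len(x)
    out = []
    for a in scales:
        row = []
        for b in range(n):
            acc = 0j
            for t in range(n):
                acc += x[t] * cmor((t - b) * dt / a).conjugate()
            row.append(acc * dt / math.sqrt(a))
        out.append(row)
    return out
```

With this convention a scale a responds most strongly to a component at C/a cycles per sample, so a tone at 0.125 cycles per sample peaks near a = 24. The direct summation costs O(n²) per scale; a practical pipeline would more likely rely on an FFT-based routine from a wavelet library.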

3.2 Feature Extraction and Classification

The bearing fault diagnosis method (CTML) proposed in this paper contains a key module: feature extraction and classification. The feature extraction layer first segments the time-frequency map obtained from the CWT transformation into 8 × 8 image blocks and linearly projects them into 256-dimensional embedding vectors, which are enhanced by positional encoding and fed into a 4-layer stacked causal-Transformer encoder. The encoder adopts the typical structure of a multi-head attention mechanism (8 heads, 256-dimensional hidden layers) and a feed-forward neural network (FFN), with regularization achieved by layer normalization and a 0.1 dropout ratio. In particular, the attention module introduces a strict lower triangular mask matrix to ensure that each time step can only focus on the time-frequency characteristics of the current and historical moments, thus maintaining the physical causality of the fault evolution process. Ultimately, the extracted deep features are used to realize end-to-end fault diagnosis through a 3-layer Multilayer Perceptron (MLP).
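As a small illustration of the patch step described above, the sketch below splits a square time-frequency map into non-overlapping 8 × 8 blocks and flattens each block into a vector, ready for a linear projection. The function name `to_patches` and the row-major patch ordering are assumptions made for this example; the learned 256-dimensional projection is omitted.

```python
def to_patches(tf_map, patch=8):
    """Split a square 2-D map (list of rows) into flattened patch x patch blocks.

    Patches are emitted in row-major order (an assumption of this sketch).
    """
    size = len(tf_map)
    patches = []
    for r in range(0, size, patch):
        for c in range(0, size, patch):
            block = [tf_map[r + i][c + j] for i in range(patch) for j in range(patch)]
            patches.append(block)
    return patches
```

For a 64 × 64 map this yields 64 patches of 64 values each, which is the token sequence the encoder would then embed.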

3.2.1 Causal-Transformer Layer

Given the CWT output X=CWT(a,b), the feature extraction process of causal-Transformer is as follows:

• Linear projection:

$$Q=XW_Q, \tag{2}$$

$$K=XW_K, \tag{3}$$

$$V=XW_V, \tag{4}$$

where $W_Q, W_K, W_V\in\mathbb{R}^{d\times d}$ are learnable projection matrices.

• Multi-head attention computation:

In the standard Transformer, each position can attend to all other positions in the sequence, which means that information can flow in any direction, including from the future to the past. However, when processing time-series data (such as bearing failure signals), this unrestricted flow of information violates the law of causality in the physical world: future events should not affect past events. To maintain physical causality, this section introduces the lower triangular mask matrix M and designs a causal self-attention mechanism Attention(Q,K,V):

$$\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_K}}\odot M\right)V, \tag{5}$$

where Q, K, and V denote the query, key, and value matrices, respectively; $d_K=d/h$, where h is the number of attention heads; and $\sqrt{d_K}$ scales the dot product to prevent the gradient from vanishing or exploding. The symbol ⊙ denotes the Hadamard product. M is the lower triangular mask matrix given in Eq. (6):

$$M_{ij}=\begin{cases}1, & i\ge j\\ -\infty, & i<j\end{cases} \tag{6}$$

where $M_{ij}$ is the element in the i-th row and j-th column of the mask matrix, with i indexing the target position (current time) and j the source position (the time being attended to). When i ≥ j, the current time attends to the current or a previous time, and the element is assigned a value of 1. When i < j, the current time would attend to a future time, and the element is assigned a value of −∞ (in actual implementation, a very large negative number is typically used, such as −1 × 10^9).

The causal self-attention mechanism strictly restricts the direction of information flow through a lower triangular mask matrix, ensuring that the current moment can attend only to current and historical information. As shown in Fig. 2, the masking matrix sets the attention weights for future time steps to zero (red area) and retains only the valid weights for historical time steps (blue area). The middle matrix displays the specific values of the attention weights ($A=QK^{T}/\sqrt{d_K}$). The causal attention weights ($A\odot M$) obtained through the Hadamard product are masked to 0 (red area) in the upper triangular region, while the lower triangular region retains the original attention values. This design enables the model to capture the temporal evolution characteristics of bearing failures while adhering to the physical laws of fault propagation, improving diagnostic accuracy and interpretability.

Figure 2: Lower triangular mask matrix and physical causality

• Position encoding:

$$PE(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{7}$$

$$PE(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right), \tag{8}$$

where pos denotes the time step index, i denotes the dimension index, and $d_{model}=256$ is the hidden dimension. Position encoding is generated by sine and cosine functions, embedding temporal order information.

• Multi-attention output:

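$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W_o, \tag{9}$$

where $W_o\in\mathbb{R}^{d\times d}$ is the output projection matrix and $\mathrm{head}_i=\mathrm{Attention}(Q,K,V)_i$.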

• FFN output:

$$\mathrm{Output}=\mathrm{LayerNorm}(X+\mathrm{MultiHead}(Q,K,V)). \tag{10}$$
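The core steps of this subsection, namely the sinusoidal position encoding of Eqs. (7)-(8) and the causal masking of Eqs. (5)-(6), can be sketched in plain Python as follows. This is a single-head illustration under stated assumptions rather than the paper's implementation: the helper names are hypothetical, the learned projections, multi-head concatenation, and layer normalization are omitted, and the mask is applied by replacing future scores with a large negative number before the softmax, following the practical note given with Eq. (6).

```python
import math


def positional_encoding(length, d_model):
    """Eqs. (7)-(8): sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(length)]
    for pos in range(length):
        for even in range(0, d_model, 2):  # `even` plays the role of 2i
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(q, k, neg=-1e9):
    """Row i holds the attention of position i over positions j; j > i is masked."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = []
    for i, row in enumerate(scores):
        masked = [s / math.sqrt(d_k) if j <= i else neg for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return weights


def causal_attention(q, k, v):
    """Single-head causal attention output: masked weights times the values."""
    return matmul(causal_attention_weights(q, k), v)
```

Because every weight above the diagonal underflows to zero, the output at time step i depends only on steps 0 through i; in particular, the first output row equals the first value row.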

3.2.2 Classification Layer

Ultimately, the deep fault features extracted by Eq. (10) are used to achieve fault classification through 3-layer MLP. The computational process is as follows:

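$$\mathrm{logits}=\mathrm{FC}_3(\mathrm{ReLU}(\mathrm{FC}_2(\mathrm{ReLU}(\mathrm{FC}_1(\mathrm{Output}))))), \tag{11}$$

where $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are linear transformation layers using ReLU as the activation function. The output of the $\mathrm{FC}_3$ layer is passed through the softmax function to transform the model outputs into a probability distribution P(y|x) over fault categories:

$$P(y=i\mid x)=\mathrm{softmax}(\mathrm{logits})_i=\frac{\exp(\mathrm{logits}_i)}{\sum_j\exp(\mathrm{logits}_j)}. \tag{12}$$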

3.3 Meta-Learning Optimization Strategy

Meta-learning aims to enable models to learn how to adapt to new conditions quickly. Referring to the classic MAML strategy, the meta-learning process of CTML is divided into two phases: meta-training and meta-testing. The goal of the meta-training phase is to learn initial model parameters with strong generalization ability; the meta-testing phase directly loads the pre-trained parameters learned in the meta-training phase and verifies the model’s ability to quickly adapt to new operating conditions with unknown fault types.

To facilitate description, the variable working conditions (e.g., rotational speed, temperature) generated by different product processing must first be simulated to construct the cross-condition classification task. Specifically, assume there exist N groups of working conditions ℋ={hp1,hp2,…,hpN}, where each working condition hpi corresponds to a specific operating condition (e.g., rotational speed) that varies according to processing requirements of different products. The set of conditions ℋ is divided into training conditions ℋtrain (to improve generalization) and testing conditions ℋtest (to evaluate adaptability), with the latter containing new conditions to validate the model performance.

3.3.1 Meta-Training: Meta-Parameter Optimization

In the meta-training phase, a working condition $hp_i$ is selected from $\mathcal{H}_{train}$, and subsequently, n categories are randomly chosen from the set of fault categories $\mathcal{C}=\{C_1,C_2,\dots,C_n\}$ contained in that working condition to form the subtask $\mathcal{C}'$.

For each selected fault category $c_j\in\mathcal{C}'$, its corresponding vibration signal dataset $D_{c_j}$ is partitioned into a support set $S_{c_j}$ and a query set $Q_{c_j}$. The support set is composed of s samples selected from $D_{c_j}$ by sampling without replacement and is formally represented as:

$$S_{c_j}=\{(x_{j,1},y_{j,1}),(x_{j,2},y_{j,2}),\dots,(x_{j,s},y_{j,s})\}, \tag{13}$$

where $x_{j,l}$ is the feature (time-frequency map) after the wavelet transform and $y_{j,l}$ is the category label. The query set is then composed of q samples taken from the remaining samples:

$$Q_{c_j}=\{(x'_{j,1},y'_{j,1}),(x'_{j,2},y'_{j,2}),\dots,(x'_{j,q},y'_{j,q})\}. \tag{14}$$

Ultimately, the task $\mathcal{C}'$ consists of the support sets of all selected categories combined with their query sets:

$$\mathcal{C}'=\{(S_{c_1},Q_{c_1}),(S_{c_2},Q_{c_2}),\dots,(S_{c_n},Q_{c_n})\}. \tag{15}$$
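The task construction of Eqs. (13)-(15) can be sketched as below. This is an illustrative sampler, not the authors' code: `sample_task`, its arguments, and the dictionary layout (class label mapped to a list of samples for one working condition) are assumptions introduced for the example.

```python
import random


def sample_task(data_by_class, n_way, s_shot, q_query, rng=random):
    """Build one n-way task: per class, s support and q query samples with no overlap."""
    classes = rng.sample(sorted(data_by_class), n_way)
    task = {}
    for label in classes:
        # Drawing s + q samples without replacement keeps support and query disjoint.
        picked = rng.sample(data_by_class[label], s_shot + q_query)
        task[label] = (picked[:s_shot], picked[s_shot:])
    return task
```

Passing a seeded `random.Random` instance as `rng` makes the sampled episodes reproducible, which helps when comparing methods on identical tasks.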

In the meta-training phase, a two-stage training strategy concerning the inner and outer loops is applied through the support set Scj and query set Qcj to optimize the global parameters.

In actual production scenarios, CNC machine tools generate bearing vibration signals with different speeds, loads, and fault types depending on the product type, resulting in significant differences between operating conditions. The differences in data distribution between different operating conditions may lead to excessively large gradient values, causing the model to overfit specific operating conditions and lose its generalization ability. Traditional fixed thresholds are difficult to adapt to dynamically changing gradient distributions. Therefore, this paper proposes a two-stage adaptive gradient clipping strategy. In the outer loop stage, the gradient clipping threshold is set to 5.0, and clipping is achieved by scaling the L2-norm of the gradient vector to meet the stability requirements of meta-learning across tasks. In the inner loop stage, 90% of the original direction of the normal gradient is retained, and only abnormally large gradients are scaled to account for the dynamic nature of task-internal data (local adaptation). This achieves a balance between “countering non-stationarity” and “maintaining generalization” in industrial few-shot problems. The specific training process is as follows:

• Inner loop:

Assuming that the initial model parameters (meta-parameters) are θ, for the support set ($X_{support}$, $Y_{support}$) in each meta-task $\mathcal{C}'$, the parameters are adjusted by multiple gradient descent iterations.

A clipping threshold τt is computed via exponential moving average of historical gradient norms:

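$$\tau_t=\mu\cdot\tau_{t-1}+(1-\mu)\cdot P_{90}\left(\left\|\nabla_{\theta^{(t-1)}}\mathcal{L}_{support}\right\|_2\right)\quad(\mu=0.9), \tag{16}$$

where $P_{90}$ denotes the 90th percentile (90% of the historical gradient norms are smaller).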

The raw gradient is constrained to prevent explosion:

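$$g_{clipped}^{(t)}=\begin{cases}\tau_t\cdot\dfrac{\nabla_{\theta^{(t-1)}}\mathcal{L}_{support}}{\left\|\nabla_{\theta^{(t-1)}}\mathcal{L}_{support}\right\|_2}, & \text{if } \left\|\nabla_{\theta^{(t-1)}}\mathcal{L}_{support}\right\|_2>\tau_t\\[2ex] \nabla_{\theta^{(t-1)}}\mathcal{L}_{support}, & \text{otherwise}\end{cases} \tag{17}$$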

The parameter update is then performed:

$$\theta^{(t)}=\theta^{(t-1)}-\alpha\cdot g_{clipped}^{(t)}, \tag{18}$$

where $\theta^{(t)}$ denotes the parameters after t iterations, α is the inner learning rate, $X_{support}$ is the feature set of $S_{c_j}$, and $Y_{support}$ is the label set of $S_{c_j}$. $\mathcal{L}_{support}$ is the cross-entropy loss function on $S_{c_j}$, i.e.:

$$\mathcal{L}_{support}=-\sum_{l=1}^{k}Y_{support,l}\log\left(\mathrm{softmax}(\mathrm{logits}_{j,l})\right), \tag{19}$$

where $\mathrm{logits}_{j,l}$ is the output of the model classification layer for input $x_{j,l}$, and $Y_{support,l}$ is the label of the l-th sample. After k iterations, the temporary parameters θ′ after fast adaptation are obtained:

$$\theta'=\theta^{(k)}. \tag{20}$$
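A minimal sketch of the inner loop of Eqs. (16)-(18) is given below, with parameters and gradients represented as flat lists of floats. Several details are assumptions of this sketch rather than statements from the paper: the helper names, the initial threshold `tau0`, the linear-interpolation percentile, and the choice to include the current gradient norm in the history used for P90. The `grad_fn` callable is a hypothetical stand-in for the gradient of the support loss.

```python
import math


def l2_norm(g):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(v * v for v in g))


def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list, p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lo = int(math.floor(rank))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def clip_to_threshold(g, tau):
    """Eq. (17): rescale g to norm tau if it is larger, otherwise keep it unchanged."""
    norm = l2_norm(g)
    if norm > tau:
        return [tau * v / norm for v in g]
    return list(g)


def inner_loop(theta, grad_fn, steps=5, alpha=0.01, mu=0.9, tau0=1.0):
    """Run `steps` clipped gradient-descent updates; returns (theta', list of tau_t)."""
    theta = list(theta)
    tau, history, taus = tau0, [], []
    for _ in range(steps):
        g = grad_fn(theta)
        history.append(l2_norm(g))
        tau = mu * tau + (1 - mu) * percentile(history, 90)          # Eq. (16)
        g_clipped = clip_to_threshold(g, tau)                         # Eq. (17)
        theta = [p - alpha * gc for p, gc in zip(theta, g_clipped)]   # Eq. (18)
        taus.append(tau)
    return theta, taus
```

For example, with the toy loss 0.5·‖θ‖² (gradient equal to θ) and θ = [30, 40], the first threshold is 0.9·1 + 0.1·50 = 5.9, so the first update moves θ by only 0.01·5.9 in norm instead of 0.01·50.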

• Outer loop:

The query set ($X_{query}$, $Y_{query}$) is utilized to compute the meta-loss $\mathcal{L}_{query}$, and the gradient is back-propagated to update the initial parameters θ. The mathematical expression is as follows:

$$\mathcal{L}_{query}=\mathcal{L}_{CE}\left(f_{\theta'}(X_{query}),Y_{query}\right), \tag{21}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss on the query set:

$$\mathcal{L}_{CE}=-\sum_{m=1}^{q}Y_{query,m}\log\left(\mathrm{softmax}(\mathrm{logits}'_{j,m})\right), \tag{22}$$

and $Y_{query,m}$ is the label of the m-th query sample.

The outer loop gradient needs to back propagate the query set loss to the initial parameter θ via the chain rule:

$$\nabla_{\theta}\mathcal{L}_{query}=\nabla_{\theta'}\mathcal{L}_{query}\cdot\nabla_{\theta}\theta'. \tag{23}$$

To ensure cross-task stability, a fixed-threshold gradient clipping is applied:

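$$\nabla_{\theta}\mathcal{L}_{query}\leftarrow\mathrm{clip}\left(\nabla_{\theta}\mathcal{L}_{query},\lambda\right)\quad(\lambda=5.0), \tag{24}$$

where clip(g, λ) scales the gradient vector to a maximum L2-norm of λ if its norm exceeds λ. Finally, the global parameter θ is updated:

$$\theta\leftarrow\theta-\beta\nabla_{\theta}\mathcal{L}_{query}, \tag{25}$$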

where β is the learning rate of the outer loop.
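To make the chain rule of Eq. (23) and the clipped update of Eqs. (24)-(25) concrete, the sketch below uses a toy task loss 0.5·‖θ − c‖² with a single unclipped inner step, for which the meta-gradient has a closed form. Everything here is an illustrative assumption: the quadratic loss, the task "centers", the function names, and the averaging of meta-gradients over tasks are not taken from the paper, and a real model would obtain the same quantities through automatic differentiation.

```python
import math


def clip_by_norm(g, lam=5.0):
    """Eq. (24): scale g so that its L2 norm is at most lam."""
    norm = math.sqrt(sum(v * v for v in g))
    return [v * lam / norm for v in g] if norm > lam else list(g)


def meta_gradient(theta, center, alpha):
    """Meta-gradient for the toy loss 0.5 * ||theta - center||^2 with one inner step.

    theta' = theta - alpha * (theta - center), so d(theta')/d(theta) = (1 - alpha) * I
    and Eq. (23) gives grad = (theta' - center) * (1 - alpha).
    """
    adapted = [t - alpha * (t - c) for t, c in zip(theta, center)]
    return [(a - c) * (1 - alpha) for a, c in zip(adapted, center)]


def outer_step(theta, centers, alpha=0.1, beta=0.05, lam=5.0):
    """One meta-update: average task meta-gradients, clip, then apply Eq. (25)."""
    total = [0.0] * len(theta)
    for center in centers:
        g = meta_gradient(theta, center, alpha)
        total = [s + v for s, v in zip(total, g)]
    mean_grad = [s / len(centers) for s in total]
    clipped = clip_by_norm(mean_grad, lam)
    return [t - beta * g for t, g in zip(theta, clipped)]
```

Starting far from the tasks, the clipped update moves θ by at most β·λ per step, which is the stabilizing effect the fixed threshold is meant to provide; once the gradient norm falls below λ the update becomes ordinary gradient descent and, in this toy setting, converges to the mean of the task centers.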

3.3.2 Meta-Testing: Few Shot Diagnosis

In the meta-learning framework, the meta-testing phase and the meta-training phase form a complete validation closed loop. For tasks under working conditions with known fault types, the meta-training phase learns meta-parameters θ with strong generalization ability through two-layer optimization: the inner loop performs 5-step task-specific adaptation using the Adam optimizer, while the outer loop updates the meta-parameters θ through the Adam optimizer. The meta-testing process is as follows:

The meta-parameters θ obtained from pre-training (containing all parameters of the causal Transformer and classification layers) are used to construct n-way k-shot diagnostic tasks. From the target working condition test set ℋtest, n classes of fault samples (k samples per class) are randomly selected to form the support set for fast model adaptation. The adaptation leverages the mean dynamic threshold from training:

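$$\bar{\tau}=\frac{1}{N_{train}}\sum_{t}\tau_t\quad(\text{mean over all training iterations}). \tag{26}$$

The meta-parameters θ are fine-tuned through single-step SGD optimization with adaptive clipping:

$$\theta''=\theta-\alpha\cdot\mathrm{clip}\left(\nabla_{\theta}\mathcal{L}_{support},\ \tau_{inner}=\bar{\tau}\right)\quad(\alpha=0.01). \tag{27}$$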

This process maintains the same network architecture and input specifications (64 × 64 wavelet time-frequency maps) as meta-training, but requires only a single-step gradient update to achieve model adaptation.
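The single-step adaptation of Eqs. (26)-(27) reduces to a few lines once the thresholds recorded during meta-training are available. The sketch below assumes flat lists for the parameters and the support-set gradient; the names `mean_threshold` and `adapt_once` are hypothetical, and computing the gradient itself is left to the surrounding training code.

```python
import math


def mean_threshold(taus):
    """Eq. (26): average of the dynamic thresholds recorded during meta-training."""
    return sum(taus) / len(taus)


def adapt_once(theta, support_grad, taus, alpha=0.01):
    """Eq. (27): one SGD step with the gradient clipped to the mean training threshold."""
    tau_bar = mean_threshold(taus)
    norm = math.sqrt(sum(v * v for v in support_grad))
    scale = tau_bar / norm if norm > tau_bar else 1.0
    return [t - alpha * scale * g for t, g in zip(theta, support_grad)]
```

For instance, with recorded thresholds [4, 6] the mean threshold is 5, so a support gradient [30, 40] (norm 50) is rescaled to [3, 4] before the step.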

4 Results

4.1 Experimental Dataset

4.1.1 CWRU Bearing Dataset

This section validates the model based on the Case Western Reserve University (CWRU) bearing dataset [43,44]. The experimental platform consists of a 1.5 kW three-phase induction motor, torque transducer/encoder, and dynamometer, with acceleration sensors rigidly mounted on the bearing housing top to collect vibration signals under different failure modes. The experiments were conducted under four motor loading conditions (0–3 hp) with a 12 kHz sampling frequency, focusing on drive-end vibration signal analysis. The data were categorized according to fault diameters (0.007″, 0.014″, 0.021″, 0.028″) and fault types (inner race fault, outer race fault, and ball fault). The experiment includes 12 bearing condition categories, as detailed in Table 2.

4.1.2 Paderborn University Dataset

The Paderborn University bearing (PUB) dataset [45] originates from the bearing experimental platform at Paderborn University in Germany, which is specifically designed for rolling bearing condition monitoring and fault diagnosis research. The experimental setup consists of an electric motor that drives the rolling bearing and a flywheel via a shaft. A piezoelectric accelerometer is installed on the SKF6203 bearing to collect vibration signals at a sampling rate of 64 kHz. The operating condition classification of this dataset is primarily based on the combined changes in rotational speed and load torque. As shown in Table 3, based on the typical operating condition change patterns observed in actual industrial applications, this study reorganizes the dataset into two main operating condition categories: Condition A (high-speed, high-load: 1500 rpm) and Condition B (low-speed, high-load: 900 rpm). This classification method better reflects the impact of speed and load changes on bearing fault characteristics. The tested bearings include three states: normal (N), inner ring failure (IR), and outer ring failure (OR). Each operating condition includes a complete set of fault types, forming a systematic cross-operating-condition validation dataset.

4.1.3 Shanghai Aerospace Manufacturing Enterprise Dataset

To further validate the model’s performance in actual industrial environments, this study utilized a dataset of CNC machine tool bearings provided by a Shanghai aerospace manufacturing enterprise (SAME). The dataset comprises vibration data collected from a 1.6-m CNC vertical lathe (model number: GTC16090) at the company’s facility in Shenyang. This large precision machine tool is widely used for machining spacecraft components.

The operating conditions in this dataset reflect typical machining process requirements for CNC machine tools in actual production. Based on spindle speed and precision requirements across different processing stages, operating conditions are categorized into three main types: Condition A (rough machining: 60 rpm, high material removal rate, high-load), Condition B (semi-finishing: 120 rpm, moderate material removal rate, moderate-load), and Condition C (finishing: 200 rpm, low material removal rate, low-load with high-precision requirements).

High-precision vibration sensors were installed on the spindle bearings, with data sampled at a rate of 25.6 kHz. The bearing health conditions included four typical states, among them ball fault, inner race fault, and outer race fault. These faults represent naturally occurring failures identified during the machine tool’s operational lifespan within the production environment. Due to confidentiality agreements, Table 4 presents only a subset of the available data.

4.2 Signal Processing

In this study, a sliding window (1024-point window, 512-point step) is used to segment the original bearing vibration signals from the bearing dataset, and the time-frequency maps are generated by the cmor1-3 wavelet transform (64-scale) after normalization. Taking the CWRU dataset as an example, as shown in Fig. 3, the horizontal axis (samples) corresponds to the time series points of the vibration signals, characterizing the time-domain evolution of fault characteristics; the vertical axis (scale) represents the scale parameter of the wavelet transform, which is inversely proportional to the frequency components—smaller scales correspond to higher frequency components, while larger scales correspond to lower frequency components. The observed results show that the ball fault under 2 hp operating conditions presents significant energy aggregation features (brightness enhancement) in the time-frequency domain, whereas the energy distribution remains uniform under normal operating conditions. This multi-scale time-frequency feature provides highly discriminative input characteristics for the causal-Transformer.

Figure 3: Time-frequency characterization comparison of bearing fault signals: (a) Original vibration signal and its wavelet transform under 0 hp condition; (b) Original vibration signal and its wavelet transform under 2 hp condition

4.3 Experimental Setup and Benchmarks

The experiments are conducted on a computer equipped with a 13th Gen Intel(R) Core(TM) i9-13900HX 2.20 GHz processor, 32 GB RAM, and an NVIDIA RTX 4060 GPU (CUDA 11.8). The implementation is based on PyTorch 1.10.0 (Python 3.10) under Windows 11. To systematically evaluate the generalization performance of the proposed CTML model under variable operating conditions, eight sets of cross-condition fault classification experiments covering four load conditions are constructed from the bearing dataset and validated using the transfer learning paradigm “training condition→test condition” (source→target). For example, “0→1” indicates that data from the 0 hp condition are used as the training set and data from the 1 hp condition as the test set.

Two types of representative benchmark methods are selected for comparison: (1) Transformer variants, including the standard ViT [28] and the Token-to-Token Transformer (T2T) [46]; and (2) small-sample, variable-condition learning methods, including the domain adaptation methods DANN [15] and Joint Adaptation Network (JAN) [47], and MAML [16] as a meta-learning baseline. All models take as input the 64 × 64 time-frequency maps generated from 1024-point signal segments by the cmor1-3 wavelet transform and use the same training/testing split. To accommodate this unified input size and distribution, the benchmark algorithms are adjusted accordingly. The network architectures and parameter settings of the compared methods are as follows:

1. CTML (Ours): The input 64 × 64 wavelet time-frequency map is combined with 256-dimensional learnable positional encodings and passed through a 4-layer causal Transformer (each layer has 8-head self-attention and a 256-dimensional hidden layer, with 0.1 dropout and layer normalization). The attention module enforces causality with a strict lower triangular mask. The classification head is a 3-layer MLP (256→128→12 dimensions). In the meta-learning phase, fast adaptation uses 5 inner-loop SGD steps (lr = 0.01), the outer loop uses Adam (lr = 0.0001) to optimize the global parameters, and gradient clipping (threshold 5.0) is applied throughout to stabilize training.

2. ViT: The 64 × 64 time-frequency map is split into 8 × 8 image patches, which are linearly projected into a sequence of 768-dimensional tokens and processed by a 12-layer standard Transformer (8-head attention, 768-dimensional hidden layer, and 3072-dimensional FFN in each layer). The final classification is performed by a 3-layer MLP (768→384→12). A standard sinusoidal positional encoding is used.

3. T2T: A two-stage token reorganization is performed on the 64 × 64 time-frequency map: the first stage aggregates the 4 × 4 local blocks into 32 dynamic tokens, and the second stage further compresses them into 16 global tokens. A 3-layer MLP (512→256→12) is attached to the end of a 9-layer improved Transformer (4-head attention, 512-dimensional hidden layer, 2048-dimensional FFN).

4. DANN: A 5-layer 2D-CNN processes the time-frequency map (channel counts 64-128-256-512-1024; each layer contains a 3 × 3 convolution, BN, and LeakyReLU). The resulting 1024-dimensional feature vector is fed in parallel to the domain discriminator (a 3-layer MLP: 512-256-1) and the classifier (a 3-layer MLP: 512-256-12). Adversarial training is implemented through a gradient reversal layer.

5. JAN: A modified ResNet-18 architecture is used: the first layer is a 7 × 7 convolution (stride = 2) followed by 3 × 3 max pooling; the subsequent 4 residual blocks keep the original structure with adjusted channel counts (64-128-256-512); and a multi-kernel MMD module (a combination of linear and Gaussian kernels) is inserted after the last residual block. The final output is produced by global average pooling and a two-layer MLP (512→12).

6. MAML: The 64 × 64 time-frequency map is split into 4 × 4 image patches and processed by a 4-layer 2D-TCN (dilation rates 1-2-4-8), which outputs the classification results directly. Meta-learning uses 5 inner-loop SGD update steps, and the outer loop uses Adam to update the initialization parameters.
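The lower triangular attention mask used by CTML can be illustrated with a single-head sketch. The code below is a hedged NumPy example rather than the paper's implementation: the function name `causal_attention` is hypothetical, learned projections and multiple heads are omitted, and the mask is assumed to keep the diagonal (each position may attend to itself), since a mask that excluded it would leave the first position with nothing to attend to.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention where position i attends only to j <= i."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    allowed = np.tril(np.ones((n, n), dtype=bool))     # lower triangle incl. diagonal
    scores = np.where(allowed, scores, -np.inf)        # block future positions
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                           # exp(-inf) evaluates to 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
output, weights = causal_attention(q, k, v)
```

With this mask, `n * (n + 1) / 2` of the `n * n` query-key pairs remain active, so roughly half of the attention matrix is unused for long sequences. Changing a later row of `v` leaves all earlier output rows unchanged, which is the causality property the mask is meant to enforce.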

4.4 Evaluation Metrics

To evaluate model performance comprehensively, the conventional classification accuracy (ACC) is used first. It is defined as:

$$\mathrm{ACC} = \frac{1}{q}\sum_{m=1}^{q} \Pi\left(\arg\max_{y} P\left(y \mid x_m^{\mathrm{query}}\right) = Y_{\mathrm{query},m}\right), \tag{28}$$

where $P(y \mid x_m^{\mathrm{query}})$ is the softmax probability output defined in Eq. (11), so that its argmax is the predicted class; $x_m^{\mathrm{query}}$ is the $m$-th query sample and $Y_{\mathrm{query},m}$ is its label; $q$ is the number of samples in the query set; and $\Pi(\cdot)$ is the indicator function, which outputs 1 for a correct prediction and 0 otherwise. ACC directly reflects the model’s prediction accuracy under new working conditions. It takes values in [0, 1], and larger values indicate better performance.
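As a small worked example of Eq. (28), the sketch below computes ACC from per-class probabilities on a query set. The function name `accuracy` and the four-sample query set are illustrative assumptions, not values from the paper.

```python
def accuracy(probabilities, labels):
    """Fraction of query samples whose most probable class equals the label."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("need one non-empty probability row per label")
    correct = 0
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
        correct += int(predicted == label)                         # indicator function
    return correct / len(labels)

# Hypothetical 3-way query set with q = 4 samples.
query_probs = [
    [0.7, 0.2, 0.1],   # predicted class 0
    [0.1, 0.6, 0.3],   # predicted class 1
    [0.2, 0.5, 0.3],   # predicted class 1 (label is 2, so this one is wrong)
    [0.3, 0.3, 0.4],   # predicted class 2
]
query_labels = [0, 1, 2, 2]
acc = accuracy(query_probs, query_labels)   # 3 correct out of 4
```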

Special metrics for meta-learning are also introduced in this study, namely:

• Rate of convergence across tasks (RCT):

$$\mathrm{RCT} = 1 - \frac{1}{|\complement'|}\sum_{j=1}^{|\complement'|}\frac{t_c^{(j)}}{K}, \tag{29}$$

where $|\complement'|$ is the total number of meta-tasks, $t_c^{(j)}$ is the epoch at which the $j$-th task reaches convergence (convergence criterion: the validation loss decreases by less than 1% for 3 consecutive epochs), and $K$ is the total number of epochs in the inner loop. RCT measures the ability of the model to adapt quickly to new tasks, with values closer to 1 indicating faster convergence.

• Operating condition sensitivity (OCS):

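$$\mathrm{OCS} = \frac{1}{|\mathcal{H}_{\mathrm{test}}|}\sum_{k'=1}^{|\mathcal{H}_{\mathrm{test}}|}\frac{\left|A_{k'} - \bar{A}\right|}{\bar{A}}, \qquad \bar{A} = \frac{1}{|\mathcal{H}_{\mathrm{test}}|}\sum_{k'=1}^{|\mathcal{H}_{\mathrm{test}}|} A_{k'}, \tag{30}$$

where $\mathcal{H}_{\mathrm{test}}$ is the set of test conditions and $A_{k'}$ is the accuracy under the $k'$-th condition. OCS evaluates the model’s robustness to condition changes, with smaller values indicating greater stability.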

• Number of parameters (NP):

$$\mathrm{NP} = \sum_{i} |\theta_i|, \qquad \theta_i \in \theta, \tag{31}$$

where $\theta$ is the set of all trainable parameters of the model. In CTML, $\theta$ represents the meta-parameters, which include the causal-Transformer layer parameters, the classification layer parameters, and all other trainable parameters. NP measures the model complexity, with fewer parameters indicating better suitability for edge device deployment.

• Time taken per task (TPT):

$$\mathrm{TPT} = \frac{1}{M}\sum_{m=1}^{M} t_m^{\mathrm{task}}, \tag{32}$$

where $M$ is the number of test tasks and $t_m^{\mathrm{task}}$ is the average inference time (in seconds) for the $m$-th task. TPT measures the computational efficiency of the algorithm, which directly affects its engineering utility.
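The four meta-learning metrics in Eqs. (29) to (32) can be sketched in plain Python as below. All function names are hypothetical, and two reading choices are assumptions: the convergence epoch is taken as the third consecutive epoch whose relative loss decrease is under 1% (falling back to the last epoch if that never happens), and the parameter count is shown for a chain of dense layers with biases rather than for the full CTML model.

```python
import time

def convergence_epoch(losses, tol=0.01, patience=3):
    """1-indexed epoch closing `patience` consecutive relative drops below `tol`."""
    streak = 0
    for epoch in range(2, len(losses) + 1):
        prev, cur = losses[epoch - 2], losses[epoch - 1]
        rel_drop = (prev - cur) / prev if prev > 0 else 0.0
        streak = streak + 1 if rel_drop < tol else 0   # a loss increase also counts
        if streak == patience:
            return epoch
    return len(losses)                                 # never converged within K epochs

def rct(convergence_epochs, total_epochs):
    """Eq. (29): one minus the mean fraction of epochs needed to converge."""
    ratios = [t / total_epochs for t in convergence_epochs]
    return 1.0 - sum(ratios) / len(ratios)

def ocs(accuracies):
    """Eq. (30): mean absolute deviation of accuracy, relative to the mean."""
    mean_acc = sum(accuracies) / len(accuracies)
    return sum(abs(a - mean_acc) for a in accuracies) / (len(accuracies) * mean_acc)

def count_parameters(layer_sizes, bias=True):
    """Eq. (31) for a chain of dense layers given as [in, hidden, ..., out]."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + (fan_out if bias else 0)
    return total

def tpt(task_fn, tasks):
    """Eq. (32): mean wall-clock seconds spent running task_fn on each task."""
    durations = []
    for task in tasks:
        start = time.perf_counter()
        task_fn(task)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Illustrative values only (not results from the paper).
losses = [1.0, 0.5, 0.25, 0.249, 0.2485, 0.2483, 0.2482, 0.2482, 0.2482, 0.2482]
t_c = convergence_epoch(losses)                   # epoch 6 of K = 10
rct_value = rct([t_c, 10], total_epochs=10)
ocs_value = ocs([0.90, 0.95, 1.00])
head_params = count_parameters([256, 128, 12])    # dense head with sizes 256, 128, 12
mean_seconds = tpt(lambda n: sum(range(n)), [1000, 2000])
```

Reading the 256→128→12 classification head as two weight matrices with biases gives 256·128 + 128 + 128·12 + 12 = 34,444 parameters, which shows how NP is accumulated layer by layer.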

4.5 Experimental Results

This study evaluates the performance of the meta-learning CTML model based on a small-sample cross-condition experimental paradigm (source condition→target condition). The experimental setup includes two typical small-sample paradigms: 1-shot (single sample per class) and 5-shot (five samples per class) tasks. A 3-way classification task is constructed for typical bearing fault types (rolling element damage/Ball, outer race failure/OR, inner race failure/IR). The experimental results are analyzed as follows:

1. Small-sample cross-condition generalization analysis: As shown in Table 5 and Fig. 4, CTML outperforms the comparison models in diagnostic accuracy on the 1-shot and 5-shot tasks across the CWRU dataset, the Paderborn University simulation dataset, and the Shanghai aerospace manufacturing enterprise dataset. Its average accuracy is higher than that of the domain adaptation methods (DANN/JAN), the meta-learning baseline (MAML), and the Transformer variants (ViT/T2T). The confusion matrices further show that CTML reduces misdiagnosis rates in operating-condition transition scenarios (e.g., the 0 hp→3 hp load transition in CWRU). This indicates that the CTML model may effectively suppress feature distribution shifts through the coupling of causal modeling and meta-learning, giving it strong robustness under complex operating conditions. In particular, in the 5-shot task on the real industrial data from the Shanghai aerospace enterprise, CTML maintains an accuracy of around 96%, which supports its ability to generalize from laboratory data to industrial application.

2. Training convergence analysis: Fig. 5 shows that under the “0→1” condition, CTML reaches approximately 90% accuracy on the 3-way 1-shot task after 2000 iterations, and accuracy improves further as the number of samples per class increases from one to five. The loss function exhibits a clear convergence trend, with the 3-way 5-shot task converging faster and reaching better final performance. These results confirm CTML’s robustness and optimization stability under sample-sparse conditions.

3. Computational efficiency and industrial deployment value: Compared with the other benchmarks, CTML shows clear advantages in inference speed and computational cost for real-time industrial applications. Its efficiency stems from three architectural designs: (1) Causal attention mechanism: By strictly limiting information flow with a lower triangular mask (only from history to future), it reduces redundant computations by about 50% compared with the global attention of standard Transformers and avoids the memory bottlenecks caused by dense attention matrices in models such as ViT; (2) Meta-learning decoupling strategy: When adapting across operating conditions, only the classification head needs to be fine-tuned (5 steps of SGD), whereas DANN requires online domain classification and gradient reversal layers, and MAML’s sequence-dependent structures (such as dilated convolutions) delay training because of historical state caching; (3) Lightweight hierarchical compression: A 4-layer Transformer (8 heads per layer, 256-dimensional hidden layer) is combined with a streamlined 3-layer MLP, and batch processing capability is enhanced through gradient clipping and layer normalization. As shown in Table 6, CTML achieves the best RCT and OCS values on the three datasets while maintaining the lowest TPT, with moderate parameter complexity. This computational efficiency makes CTML particularly suitable for deployment in industrial environments with limited computational resources.
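The few-shot protocol and the head-only adaptation described in this section can be sketched together. The NumPy example below is an assumption-laden illustration, not the CTML code: it samples one 3-way 5-shot episode from synthetic features that stand in for frozen backbone outputs, then adapts a single linear softmax head (a simplification of the 3-layer MLP) with 5 SGD steps at lr = 0.01 and gradient-norm clipping at 5.0. The names `sample_episode` and `adapt_head` are hypothetical.

```python
import numpy as np

def sample_episode(features, labels, n_way, k_shot, n_query, rng):
    """Draw an N-way K-shot support set and a disjoint query set."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    sx, sy, qx, qy = [], [], [], []
    for new_label, cls in enumerate(classes):          # relabel classes as 0..N-1
        idx = rng.permutation(np.flatnonzero(labels == cls))
        s_idx, q_idx = idx[:k_shot], idx[k_shot:k_shot + n_query]
        sx.append(features[s_idx]); sy += [new_label] * len(s_idx)
        qx.append(features[q_idx]); qy += [new_label] * len(q_idx)
    return np.concatenate(sx), np.array(sy), np.concatenate(qx), np.array(qy)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(w, b, x, y):
    p = softmax(x @ w + b)
    return float(-np.log(p[np.arange(len(y)), y] + 1e-12).mean())

def adapt_head(w, b, x, y, steps=5, lr=0.01, clip=5.0):
    """Inner-loop SGD on the head only, with global gradient-norm clipping."""
    w, b = w.copy(), b.copy()
    onehot = np.eye(w.shape[1])[y]
    for _ in range(steps):
        grad_logits = (softmax(x @ w + b) - onehot) / len(y)
        gw, gb = x.T @ grad_logits, grad_logits.sum(axis=0)
        norm = np.sqrt((gw ** 2).sum() + (gb ** 2).sum())
        scale = min(1.0, clip / (norm + 1e-12))        # shrink only if norm > clip
        w -= lr * scale * gw
        b -= lr * scale * gb
    return w, b

# Synthetic features: 4 classes, 20 samples each, 16 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 20)
centers = 2.0 * rng.normal(size=(4, 16))
features = rng.normal(size=(80, 16)) + centers[labels]

sx, sy, qx, qy = sample_episode(features, labels, n_way=3, k_shot=5, n_query=10, rng=rng)
w0, b0 = np.zeros((16, 3)), np.zeros(3)
w1, b1 = adapt_head(w0, b0, sx, sy)
loss_before = cross_entropy(w0, b0, sx, sy)
loss_after = cross_entropy(w1, b1, sx, sy)
```

Because only the head parameters are updated, each adaptation step costs one matrix product over the support set, which is the source of the low per-task cost that the decoupling strategy relies on.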

Figure 4: Confusion matrices of CTML model for cross-domain few-shot fault diagnosis: (a) 3-way 1-shot diagnosis on CWRU dataset; (b) 3-way 5-shot diagnosis on CWRU dataset; (c) 3-way 1-shot diagnosis on PUB dataset; (d) 3-way 5-shot diagnosis on PUB dataset; (e) 3-way 1-shot diagnosis on SAME dataset; (f) 3-way 5-shot diagnosis on SAME dataset

Figure 5: Training dynamics of CTML model under 0→1 working condition transfer: (a) Accuracy evolution on CWRU dataset; (b) Cross-entropy loss convergence on CWRU dataset; (c) Accuracy evolution on PUB dataset; (d) Cross-entropy loss convergence on PUB dataset; (e) Accuracy evolution on SAME dataset; (f) Cross-entropy loss convergence on SAME dataset

5 Discussion

5.1 Ablation Experiment

To validate the contribution of each module in CTML, a 3-way 5-shot ablation experiment is conducted, taking the CWRU dataset as an example. The following variants are compared:

1. CTML-CWT: The cmor1-3 wavelet transform is removed and the model input dimensions are modified accordingly; the rest is consistent with CTML;

2. CTML-Transformer: The causal constraint of the attention mask is removed and replaced with standard Transformer self-attention; the rest is consistent with CTML;

3. CTML-CNN: The causal Transformer layer is replaced with a 5-layer 2D-CNN (kernel = 3 × 3, channels = 64-128-256-512-1024) as the feature extraction layer; the rest is consistent with CTML;

4. CTML-AGP: The adaptive gradient pruning mechanism is removed; the rest is consistent with CTML;

5. CTML-Attention: All causal constraints are removed from the entire architecture, including both the attention mask and the temporal ordering constraints; the rest is consistent with CTML;

6. CTML (Ours): The complete model architecture proposed in this paper.

The ablation study uses the same experimental setup as the main experiments to ensure a fair comparison. All model variants were trained on the 0 hp→1 hp cross-condition diagnosis task. Each experiment was repeated 10 times with different random seeds to improve the statistical reliability of the results. The evaluation metrics are accuracy, precision, recall, and F1-score. As shown in Fig. 6: (1) Replacing the complete causal-Transformer architecture (CTML-CNN) causes the most severe performance degradation, indicating that the causal-Transformer is fundamental for modeling physical fault progression; (2) The adaptive gradient pruning mechanism contributes significantly to both performance and training stability; (3) Removing the cmor1-3 wavelet transform reduces model performance, indicating that the wavelet transform plays a key role in enhancing weak fault features; (4) The causal constraints of the attention mask are essential for accurate spatiotemporal modeling, which confirms the need for temporal causal constraints when modeling the physical fault process. Therefore, the performance gain of CTML stems from the combination of four elements: the high-quality multi-scale feature space constructed by the wavelet transform, the accurate spatiotemporal modeling achieved by the causal attention mechanism, the enforcement of temporal causality throughout the architecture, and the training stabilization provided by adaptive gradient pruning.
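The four ablation metrics and the aggregation over repeated seeds can be computed as in the sketch below. Macro averaging over classes is an assumption (the averaging mode is not stated in this section), the confusion matrix is invented for illustration, and the helper names `macro_scores` and `summarize` are hypothetical.

```python
from statistics import mean, stdev

def macro_scores(confusion):
    """Accuracy and macro precision/recall/F1 from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))   # column sum
        actual = sum(confusion[c])                           # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f1)
    total = sum(sum(row) for row in confusion)
    return {
        "accuracy": sum(confusion[c][c] for c in range(n)) / total,
        "precision": mean(precisions),
        "recall": mean(recalls),
        "f1": mean(f1s),
    }

def summarize(values):
    """Mean and sample standard deviation across repeated runs."""
    return mean(values), stdev(values)

# Invented 3-way confusion matrix (rows: Ball, IR, OR), 10 query samples per class.
confusion = [[8, 1, 1],
             [2, 7, 1],
             [0, 2, 8]]
scores = macro_scores(confusion)
run_mean, run_std = summarize([0.95, 0.96, 0.97])   # e.g. accuracies from 3 seeds
```

Reporting the mean together with the standard deviation over the 10 seeds makes it easier to judge whether the gap between two ablation variants exceeds the run-to-run variation.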

Figure 6: Training variation curves of CTML model in 0→1 cross-operating-condition diagnosis: (a) Accuracy; (b) Precision; (c) Recall; (d) F1-Score

5.2 Analysis of Attentional Head Counts

Because hyperparameter choices affect model performance, the impact of the number of attention heads in the causal-Transformer layer on fault diagnosis performance was examined experimentally. In a controlled-variable design, cross-condition small-sample diagnostic tests were conducted on the CWRU dataset (30 samples per class) with five attention-head configurations: 1, 2, 4, 8, and 16 heads. As shown in Fig. 7, the number of attention heads has a significant effect on diagnostic performance. For example, in the 2→3 cross-condition diagnosis task, the 8-head configuration achieves a 1-shot accuracy of 91.5%, higher than the 4-head and 16-head configurations. Notably, performance decreases when the number of heads is increased to 16, suggesting that too many attention heads can lead to feature redundancy. Overall, the 8-head configuration achieves the best balance between model performance and generalization ability across the different working-condition tasks.
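One way to read the head-count trade-off is through the per-head width. The short sketch below assumes the standard multi-head convention, in which the 256-dimensional hidden layer is divided evenly among the heads; this section does not state the exact head layout used in CTML, so the helper `head_dimension` is only illustrative.

```python
def head_dimension(d_model, num_heads):
    """Width of each attention head when d_model is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

# Per-head width for the five configurations tested, with a 256-d hidden layer.
head_dims = {h: head_dimension(256, h) for h in (1, 2, 4, 8, 16)}
for heads, dim in head_dims.items():
    print(f"{heads:>2} heads -> {dim} dimensions per head")
```

Under this convention each doubling of the head count halves the subspace available to a single head, so 16 heads would each operate on only 16 dimensions, compared with 32 dimensions for the 8-head setting.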

Figure 7: Effect of variation in the number of attention heads on model performance under few-shot scenarios: (a) 1-shot Performance vs. Attention Heads; (b) 5-shot Performance vs. Attention Heads

5.3 Visualization of Feature Space

To further analyze the feature learning mechanism of the proposed CTML model, this section visualizes the feature distributions of the CWRU dataset under the 2 hp operating condition in a 0→2 cross-domain diagnostic scenario. As shown in Fig. 8, the original vibration signals (Fig. 8a) are highly overlapping and disordered. After the cmor1-3 wavelet transform (Fig. 8b), the separability of the time-frequency feature space improves. The causal-Transformer layer (Fig. 8c) further improves the feature representation, enlarging the distances between different fault categories. Finally, the classification layer (Fig. 8d), optimized by meta-learning, maps the features of the 12 fault categories into clearly separated clusters. These results support the effectiveness of the proposed model in fault feature learning.

Figure 8: Multi-stage feature space visualization of CTML model through t-SNE dimensionality reduction: (a) Original time-domain vibration signal; (b) Time-frequency representation via cmor1-3 wavelet transform; (c) High-level features extracted by causal-Transformer with attention masking; (d) Final discriminative features optimized by meta-learning classification layer

6 Conclusions

In this study, a collaborative diagnosis framework integrating time-frequency analysis, a causal-Transformer, and meta-learning is proposed to address fault diagnosis of CNC machine tool bearings in few-shot scenarios (1-shot/5-shot) with scarce samples and variable working conditions. The method constructs the time-frequency feature space with the wavelet transform to make weak fault features easier to recognize, and adopts a causal-Transformer with a strict lower triangular attention mask for time-constrained feature extraction and diagnosis. A dynamic meta-learning strategy designed for few-shot tasks is implemented through a closed-loop meta-training and meta-testing mechanism with adaptive gradient pruning, and it performs well in data-scarce environments. Experiments on multiple datasets verify that the method outperforms existing data-driven methods in few-shot diagnostic accuracy and adaptability to working conditions. In aerospace production, high-cost alloy components are machined to exacting tolerances under varying conditions, so unplanned downtime and scrapped parts carry severe financial and scheduling consequences. In this setting, the proposed CTML maintains high diagnostic accuracy and adapts rapidly from very few new fault examples, which improves spindle bearing reliability, reduces costly unplanned downtime and part scrap, and improves production efficiency.

Future research will explore the deep integration of large model technology with industrial diagnostics, focusing on the development of an intelligent diagnostic framework based on multimodal large models. First, a pre-training-fine-tuning paradigm with engineering interpretability will be constructed by integrating multi-source sensor data such as vibration and acoustic emission. Second, lightweight incremental learning algorithms will be further developed so that the large model can adapt to the continuous evolution of production line equipment to meet real-time requirements. Finally, virtual-reality fusion validation will be carried out through the digital twin platform. This approach will promote the large model from laboratory to production line, addressing core industrial challenges such as data scarcity and frequent working condition changes.

Acknowledgement: The authors are grateful to all the editors and anonymous reviewers for their comments and suggestions.

Funding Statement: This study received financial support from the National Key Research and Development Program of China (Grant No. 2022YFB3302700), the National Natural Science Foundation of China (Grant No. 52375486), and the Shanghai Rising-Star Program (Grant No. 22QB1404200).

Author Contributions: The authors confirm their contribution to the paper as follows: Study conception and design: Youlong Lyu and Jutao Guo; data collection: Ying Chu and Qingpeng Qiu; analysis and interpretation of results: Ying Chu, Youlong Lyu and Jie Zhang; draft manuscript preparation: Youlong Lyu and Ying Chu. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: The authors confirm that the data supporting the findings of this study are available within the article.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest to report regarding the present study.

References

1. Ahmed YS, Amorim FL. Advances in computer numerical control geometric error compensation: integrating AI and on-machine technologies for ultra-precision manufacturing. Machines. 2025;13(2):140. doi:10.3390/machines13020140.

2. Iqbal M, Madan AK. Bearing fault diagnosis in CNC machine using hybrid signal decomposition and gentle AdaBoost learning. J Vib Eng Technol. 2024;12(2):1621–34. doi:10.1007/s42417-023-00930-8.

3. Xue R, Zhang P, Huang Z, Wang JJ. Digital twin-driven fault diagnosis for CNC machine tool. Int J Adv Manuf Technol. 2024;131(11):5457–70. doi:10.1007/s00170-022-09978-4.

4. Li X, Chen J, Wang J, Wang J, Li X, Kan Y. Research on fault diagnosis method of bearings in the spindle system for CNC machine tools based on DRSN-Transformer. IEEE Access. 2024;12(18):74586–95. doi:10.1109/ACCESS.2024.3404968.

5. Siemens AS. Synthetic vibration data generation and fault classification in CNC machines using transformer GANs and ConvLSTM Networks. In: Proceedings of the 2024 9th International Conference on Computer Science and Engineering (UBMK); 2024 Oct 26–28; Antalya, Türkiye. doi:10.1109/UBMK63289.2024.10773462.

6. Liu L, Zhao Y, Hu Y, Ma Y, Guo Z. Lightweight mechanical equipment fault diagnosis framework based on GCGAN-MDSCNN-ICA model. Sci Rep. 2025;15(1):4911. doi:10.1038/s41598-025-89576-y.

7. Çekik R, Turan A. Deep learning for anomaly detection in CNC machine vibration data: a rough LSTM-based approach. Appl Sci. 2025;15(6):3179. doi:10.3390/app15063179.

8. Chen Y, Shi J, Hu J, Shen C, Huang W, Zhu Z. Simulation data driven time-frequency Fusion 1D convolutional neural network with multiscale attention for bearing fault diagnosis. Meas Sci Technol. 2025;36(3):035109. doi:10.1088/1361-6501/adb329.

9. Huang H, Sun S, Wang D, Xu W. Data privacy protection diagnostic algorithm for industrial robot joint harmonic reducers based on swarm learning. IEEE/ASME Trans Mechatron. 2025:1–10. doi:10.1109/TMECH.2025.3528212.

10. Wang Y, Shen J, Yang S, Han Q, Zhao C, Zhao P. Knowledge and data dual-driven fault diagnosis in industrial scenarios: a survey. IEEE Internet Things J. 2024;11(1):19256–77. doi:10.1109/JIOT.2024.3387538.

11. Wang S, Yu Z, Huo F, Lu C, Wang J. The intelligent operation and maintenance method and experimental research of CNC machine tool bearings. Proc Inst Mech Eng Part E J Process Mech Eng. 2024;33(10):09544089241284483. doi:10.1177/09544089241284483.

12. Song B, Liu Y, Fang J, Liu W, Zhong M, Liu X. An optimized CNN-BiLSTM network for bearing fault diagnosis under multiple working conditions with limited training samples. Neurocomputing. 2024;574(1):127284. doi:10.1016/j.neucom.2024.127284.

13. Liang Y, Wang Y, Li W, Pham DT, Lu J. Adaptive fault diagnosis of machining processes enabled by hybrid deep learning and incremental transfer learning. Comput Ind. 2025;167(59):104262. doi:10.1016/j.compind.2025.104262.

14. Pan X, Chen H, Wang W, Su X. Adversarial domain adaptation based on contrastive learning for bearings fault diagnosis. Simul Model Pract Theory. 2025;139:103058. doi:10.1016/j.simpat.2025.103058.

15. Li J, Shen C, Shi J, Li C, Wang D, Zhu Z. Bi-generator cooperative domain adversarial neural network for bearing fault diagnosis. IEEE Sens J. 2024;24(7):10584–93. doi:10.1109/JSEN.2024.3361013.

16. Wang D, Zhang X, Li S, Jia C. Model-agnostic meta-learning-based fault classification for industrial processes with small sample. In: Proceedings of the 2024 39th Youth Academic Annual Conference of Chinese Association of Automation (YAC); 2024 Jun 7–9; Dalian, China. doi:10.1109/YAC63405.2024.10598657.

17. Wang S, Shuai H, Hu J, Zhang J, Liu S, Yuan X, et al. Few-shot fault diagnosis of axial piston pump based on prior knowledge-embedded meta learning vision transformer under variable operating conditions. Expert Syst Appl. 2025;269(3):126452. doi:10.1016/j.eswa.2025.126452.

18. Liu H, Meng C, Li C, Zhang Y. Dynamic reliability analysis of CNC lathe spindle-bearing system with thermal effect in radial runout. Proc Inst Mech Eng Part C J Mech Eng Sci. 2024;238(13):6377–90. doi:10.1177/09544062241227308.

19. Zhang Z, Jiang F, Lou M, Wu B, Zhang D, Tang K. Geometric error measuring, modeling, and compensation for CNC machine tools: a review. Chin J Aeronaut. 2024;37(2):163–98. doi:10.1016/j.cja.2023.02.035.

20. Tang J, Hu Y, Zhou X, Xu M, Wang D, Zheng B, et al. Self-powered sensor for online monitoring of eccentricity faults in rotating machinery and its application in spindle eccentricity monitoring of machine tools. Nano Energy. 2024;130:110084. doi:10.1016/j.nanoen.2024.110084.

21. Umamaheswara Raju RS, Ravi Kumar K, Vargish K, Bharath Kumar M. Machine learning based surface roughness assessment via CNC spindle bearing vibration. Int J Interact Des Manuf. 2025;19(1):477–94. doi:10.1007/s12008-024-01963-3.

22. Sharma G, Kaur T, Mangal SK, Dhiman NK, Jat GL. MEMS approach for rolling bearing fault diagnosis using vibration signal analysis. J Vib Eng Technol. 2025;13(1):1–27. doi:10.1007/s42417-024-01730-4.

23. Manikandan R, Mutra RR. Fault classification in rotor-bearing system using advanced signal processing and machine learning techniques. Results Eng. 2025;25(6):103892. doi:10.1016/j.rineng.2024.103892.

24. Gao Z, Zheng J, Pan H, Cheng J, Tong J. Adaptive generalized empirical wavelet transform and its application to fault diagnosis of rolling bearing. Measurement. 2025;249(4):116958. doi:10.1016/j.measurement.2025.116958.

25. Alam TE, Ahsan MM, Raman S. Multimodal bearing fault classification under variable conditions: a 1D CNN with transfer learning. arXiv:2502.17524. 2025. doi:10.48550/arxiv.2502.17524.

26. Lv J, Xiao Q, Zhai X, Shi W. A high-performance rolling bearing fault diagnosis method based on adaptive feature mode decomposition and Transformer. Appl Acoust. 2024;224(4):110156. doi:10.1016/j.apacoust.2024.110156.

27. Han Y, Zhang F, Li Z, Wang Q, Li C, Lai P. MT-ConvFormer: a multi-task bearing fault diagnosis method using a combination of CNN and transformer. IEEE Trans Instrum Meas. 2024;74(11):3501816. doi:10.1109/TIM.2024.3502821.

28. Jang J, Lee S, Hwang S, Lee J. Noise reduction in CWRU data using DAE and classification with ViT. Appl Sci. 2024;14(24):11771. doi:10.3390/app142411771.

29. Ren S, Lou X. Rolling bearing fault diagnosis method based on SWT and improved vision transformer. Sensors. 2025;25(7):2090. doi:10.3390/s25072090.

30. Nguyen CHC, Liem RP. Multi-aircraft attention-based model for perceptive arrival transit time prediction. Adv Eng Inform. 2025;64(1):103067. doi:10.1016/j.aei.2024.103067.

31. Friederich J, Lazarova-Molnar S. Data-driven reliability assessment of manufacturing systems using process mining. Simulation. 2025;25:00375497241302866. doi:10.1177/00375497241302866.

32. Liu X, Zhang Z, Li Z, Wang J, Zhu Y, Ma H. Advancements in bearing health monitoring and remaining useful life prediction: techniques, challenges, and future directions. Meas Sci Technol. 2025;36(3):032003. doi:10.1088/1361-6501/adafc8.

33. Zhang H, Jiang S, Gao D, Sun Y, Bai W. A review of physics-based, data-driven, and hybrid models for tool wear monitoring. Machines. 2024;12(12):833. doi:10.3390/machines12120833.

34. Yan R, Zhou Z, Shang Z, Wang Z, Hu C, Li Y, et al. Knowledge driven machine learning towards interpretable intelligent prognostics and health management: review and case study. Chin J Mech Eng. 2025;38(1):5. doi:10.1186/s10033-024-01173-8.

35. Wang S, Han W, Jian J, Chang X, Zeng L. A domain adaptation meta learning method with multilayer convolutional attention for cross-domain bearing fault diagnosis. IEEE Sens J. 2025;25(8):14440–52. doi:10.1109/JSEN.2025.3546955.

36. Jiang J, Chen C, Lackinger A, Li H, Li W, Pei Q, et al. MetaTrans-FSTSF: a transformer-based meta-learning framework for few-shot time series forecasting in flood prediction. Remote Sens. 2025;17(1):77. doi:10.3390/rs17010077.

37. Pourghoraba A, KhajueeZadeh MS, Amini A, Vahedi A, Agah GR, Rahideh A. Model-agnostic meta-learning for fault diagnosis of induction motors in data-scarce environments with varying operating conditions and electric drive noise. IEEE Trans Energy Convers. 2025:1–10. doi:10.1109/TEC.2025.3556100.

38. Mallick M, Shim YD, Won HI, Choi SK. Ensemble-based model-agnostic meta-learning with operational grouping for intelligent sensory systems. Sensors. 2025;25(6):1745. doi:10.3390/s25061745.

39. Qiao L, Zhang Y, Wang Q, Li D, Peng S. Fault diagnosis for wind turbine generators based on Model-Agnostic Meta-Learning: a few-shot learning method. Expert Syst Appl. 2025;267(6):126171. doi:10.1016/j.eswa.2024.126171.

40. Li X, Zhu G, Hu A, Xing L, Xiang L. A meta-learning method based on meta-feature enhancement for bearing fault identification under few-sample conditions. Mech Syst Signal Process. 2025;226(4):112370. doi:10.1016/j.ymssp.2025.112370.

41. Liu Z, Peng Z. Few-shot bearing fault diagnosis by semi-supervised meta-learning with graph convolutional neural network under variable working conditions. Measurement. 2025;240:115402. doi:10.1016/j.measurement.2024.115402.

42. Zhong H, He D, Lao Z, Shen G, Chen Y. Improved metric-based meta learning with attention mechanism for few-shot cross-domain train bearing fault diagnosis. Meas Sci Technol. 2024;35(7):075101. doi:10.1088/1361-6501/ad30b6.

43. Meng L, Xie J, Zhou Z, Chen Y. Fault diagnosis model for bearings under multiple operating conditions based on feature parameterization weighting. Electronics. 2024;13(11):2153. doi:10.3390/electronics13112153.

44. Lehmann M, Möckel A. Intelligent bearing condition monitoring for electrical machines through vibration signal based on autoencoder and almost-parameter-free classifiers. In: Proceedings of the 2024 International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM); 2024 Jun 19–21; Napoli, Italy. doi:10.1109/SPEEDAM61530.2024.10609143.

45. Lessmeier C, Kimotho JK, Zimmer D, Sextro W. Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: a benchmark data set for data-driven classification. PHM Soc Eur Conf. 2016;3(1):1–17. doi:10.36001/phme.2016.v3i1.1577.

46. Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang Z, et al. Tokens-to-token ViT: training vision transformers from scratch on ImageNet. arXiv:2101.11986. 2021. doi:10.48550/arxiv.2101.11986.

47. Long M, Zhu H, Wang J, Jordan MI. Deep transfer learning with joint adaptation networks. arXiv:1605.06636. 2017. doi:10.48550/arxiv.1605.06636.

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
