LWCNet: A Physics-Guided Multimodal Few-Shot Learning Framework for Intelligent Fault Diagnosis

Yong Hu; Weifan Xu; Xiangtong Du

doi:10.32604/cmc.2026.074437

Open Access

ARTICLE


Yong Hu¹, Weifan Xu², Xiangtong Du³,*

1 College of Electrical and Electronic Engineering, Anhui Institute of Information Technology, Wuhu, China
2 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
3 College of Medical Imaging, Xuzhou Medical University, Xuzhou, China

* Corresponding Author: Xiangtong Du.

Computers, Materials & Continua 2026, 87(2), 67 https://doi.org/10.32604/cmc.2026.074437

Received 11 October 2025; Accepted 04 January 2026; Issue published 12 March 2026

Abstract

Deep learning-based methods have shown great potential in intelligent bearing fault diagnosis. However, most existing approaches suffer from the scarcity of labeled data, which often results in insufficient robustness under complex working conditions and a general lack of interpretability. To address these challenges, we propose a physics-informed multimodal fault diagnosis framework based on few-shot learning, which integrates a 2D time-frequency image encoder and a 1D vibration signal encoder. Specifically, we embed prior knowledge of multi-resolution analysis from signal processing into the model by designing a Laplace Wavelet Convolution (LWC) module, which enhances interpretability since wavelet coefficients naturally correspond to specific frequency and temporal structures. To further balance the guidance of physical priors with the flexibility of learnable representations, we introduce a parametric multi-kernel wavelet that employs channel-wise dynamic attention to adaptively select relevant wavelet bases, thereby improving the feature expressiveness. Moreover, we develop a Mahalanobis-Prototype Joint Metric, which constructs more accurate and distribution-consistent decision boundaries under few-shot conditions. Comprehensive experiments on the Case Western Reserve University (CWRU) and Paderborn University (PU) bearing datasets demonstrate the superior effectiveness, robustness, and interpretability of the proposed approach compared with state-of-the-art baselines.

Keywords

Few-shot fault diagnosis; multimodal feature fusion; Laplace wavelet convolution; interpretability

1 Introduction

With the rapid advancement of deep learning, an increasing number of data-driven methods have been applied to the manufacturing industry, particularly in critical sectors such as aerospace, wind power, and railway transportation. In these domains, mechanical systems often operate under harsh conditions, with the engine serving as a core component. The reliability of the engine is crucial to system safety, efficiency, and cost-effectiveness. Among its internal parts, bearings are highly susceptible to degradation, as they operate under extreme environments such as high temperature, noise, and heavy loads [1,2]. Consequently, accurate health monitoring of engine bearings has become a research priority for enabling predictive maintenance and preventing catastrophic failures.

Currently, intelligent fault diagnosis approaches for bearings can be broadly divided into traditional machine learning-based methods and deep learning-based methods. Traditional machine learning methods, as early intelligent diagnostic approaches, mainly rely on handcrafted feature extraction followed by classifier-based diagnosis. Specifically, expert knowledge and signal processing techniques such as the Fast Fourier Transform (FFT) and the Short-Time Fourier Transform (STFT) are employed to design discriminative time-domain and frequency-domain features. Subsequently, dimensionality reduction techniques such as principal component analysis (PCA) are applied to select effective feature representations, which are then fed into classifiers such as support vector machines [3,4], random forests [5,6], or k-nearest neighbors [7] for training and prediction.

In recent years, deep learning-based approaches, including convolutional neural networks (CNNs) [8–11], denoising autoencoders [12,13], deep belief networks (DBNs) [14,15], and recurrent neural networks (RNNs) [16], have emerged as mainstream methods for fault diagnosis. Their key advantage lies in performing end-to-end diagnosis by automatically learning hierarchical feature representations from large-scale data, thus avoiding complex and subjective manual feature engineering. These methods typically achieve higher diagnostic accuracy and better generalization than traditional approaches. However, they require large-scale labeled data for training and are sensitive to label noise.

To overcome the challenge of data scarcity in fault diagnosis, researchers have proposed several strategies, including data augmentation (DA) [17–19], transfer learning (TL) [20–22], and few-shot learning (FSL) [23–25]. The core idea of DA is to generate new and diverse training samples from limited labeled data through transformations that preserve label semantics. This increases dataset scale and diversity, thereby improving model generalization and reducing overfitting. DA is simple to implement and requires no modification to the model structure, but it is domain-dependent and demands carefully designed augmentation strategies.

The principle of TL is to transfer knowledge (e.g., model parameters or feature representations) learned in a source domain to a related target domain with limited data, under the assumption that transferable features exist across domains. Liu et al. [26] proposed a deep domain adaptation framework grounded in optimal transport theory, which demonstrated effectiveness in rotating machinery fault diagnosis by aligning feature distributions across domains. Building upon this, Yang et al. [27] developed a targeted transfer learning approach that leverages the distribution barycenter to optimize domain alignment, thereby improving diagnostic performance in machines. Li et al. [28] designed a gated recurrent generative transfer learning network, explicitly addressing challenges arising from data imbalance and varying operating conditions. To further mitigate domain-related issues, Ding et al. [29] introduced a deep imbalanced domain adaptation model for bearing fault diagnosis, focusing on alleviating the label shift problem in transfer learning. In addition, Li et al. [30] proposed a feature-based transfer learning strategy that minimizes cross-dataset discrepancies, offering a practical solution for cross-domain fault diagnosis tasks. Since the source domain usually contains abundant labels, TL can significantly reduce reliance on target-domain annotations. However, when the distribution gap between source and target domains is large (domain shift), naive transfer may cause negative transfer and degrade performance. Thus, domain discrepancy and source-domain dependency remain critical challenges.

FSL, which is commonly realized through meta-learning, aims to teach models “how to learn”. By meta-training on a large number of tasks, the model can quickly adapt to new and unseen diagnostic tasks using only a few support samples. Fan et al. [31] developed a model-agnostic meta-learning (MAML) framework that exploits prior knowledge by optimizing initialization parameters, thereby enabling fault diagnosis under previously unseen operating conditions. In parallel, Liu et al. [25] introduced a semi-supervised meta-learning approach that incorporates a simplified graph convolutional network, allowing accurate node classification with only a limited number of training samples across variable working conditions.

In addition to single-modal vibration-based models, recent studies have also explored multi-modal and multi-scale fusion schemes for mechanical fault diagnosis.
For instance, a self-calibrated coordinate-attention multi-scale CNN (SC-MSCNN) [32] converts raw vibration signals into Markov transition field images and uses coordinate attention with multi-scale convolutions and skip connections to improve small-sample bearing diagnosis while reducing parameters and alleviating overfitting. In noisy Industrial IoT settings, a lightweight multi-modal fusion framework [33] combines multi-resolution decomposition and adaptive enhancement to transform 1D signals into multiple 2D representations, which are fused by a tri-branch attention network to achieve highly robust bearing diagnosis under low SNR with clear gains over state-of-the-art baselines. These works further demonstrate the effectiveness of multi-modal and multi-scale feature modeling for bearing fault diagnosis, but they are still designed for relatively data-rich regimes and do not explicitly target extreme few-shot scenarios with physics-guided, interpretable wavelet representations. Despite their success, data-driven methods still face the challenge of limited labeled data under complex operating conditions and often lack interpretability, which restricts their broader deployment in real-world scenarios.

To address the aforementioned challenges, we propose a physics-guided multimodal framework based on few-shot learning, termed LWCNet, designed to improve the accuracy, robustness, and interpretability of fault classification under limited-data scenarios. LWCNet incorporates a learnable parametric Laplace Wavelet Convolution (LWC) module to enhance both robustness and interpretability of the extracted features. First, the raw vibration signals are preprocessed to generate training samples consisting of both 1D temporal sequences and 2D time-frequency representations. Subsequently, for both the support set and the query set, the 1D sequences are fed into a 1D temporal feature extractor and the 2D representations into a 2D time-frequency feature extractor to obtain discriminative feature embeddings. Specifically, in the 1D feature extractor, we introduce the Laplace Wavelet Convolution, where physically meaningful Laplace wavelet kernels replace traditional convolution kernels. This design ensures that the extracted features are inherently interpretable, as their responses explicitly correspond to specific frequency components and temporal positions. Moreover, to balance the guidance of physical priors with the adaptability of neural networks, we propose an Attention-Driven Adaptive Laplace Wavelet Convolution (AALWC) structure. In this design, each output channel is equipped with multiple parameterized wavelet kernels (with learnable scale and shift parameters), enabling the network to adaptively adjust the wavelet basis functions according to the data characteristics. Furthermore, a dynamic attention mechanism is incorporated to adaptively select and fuse the most relevant feature responses for each input sample, thereby enhancing the representational capacity of the model. In the 2D time-frequency feature extractor, three convolutional kernels with different receptive field sizes are employed to perform multi-scale feature extraction.
The extracted 1D temporal features are reshaped and concatenated with the 2D time-frequency features along the channel dimension for multimodal fusion. Additionally, to construct more accurate and robust decision boundaries under few-shot conditions, we propose a Mahalanobis-Prototype Joint Metric to model the correlation between support and query sets. This method not only computes the Euclidean distance between query samples and class prototypes but also incorporates the Mahalanobis distance to account for the intra-class covariance structure of the feature distributions.

By integrating physics-guided architectural design, parametric adaptive mechanisms, and a hybrid similarity metric, our method effectively addresses the core challenges of few-shot fault diagnosis, including poor interpretability of extracted features, insensitivity to weak fault signatures, and limited metric diversity. The proposed method is evaluated on benchmark bearing fault diagnosis datasets, and the experimental results demonstrate its superior effectiveness and robustness under complex operating conditions and few-shot scenarios. Our main contributions can be summarized as follows:

• We propose a novel few-shot fault diagnosis framework that jointly exploits 1D vibration signals and 2D time-frequency images. By embedding physical constraints into the learning process, the framework ensures both discriminability and consistency with domain knowledge, thus enhancing reliability under data-scarce conditions.

• We design an Interpretable Attention-Driven Adaptive Laplace Wavelet Convolution (AALWC) module that integrates multi-resolution wavelet analysis into deep networks. This not only improves feature discriminability but also enhances interpretability, as the learned coefficients correspond to explicit frequency–time patterns that align with physical principles.

• To reconcile physical priors with the flexibility of deep learning, we introduce a parametric multi-kernel wavelet model. By employing channel-wise dynamic attention, the model adaptively selects relevant wavelet bases, leading to more expressive and robust feature extraction across varying conditions.

• We develop a Mahalanobis–Prototype joint metric that combines prototype-based representation with Mahalanobis distance. This hybrid design captures intra-class variability and constructs more distribution-consistent decision boundaries, significantly improving classification accuracy in few-shot scenarios.

2 Preliminaries

2.1 Fourier Transform and Short-Time Fourier Transform (STFT)

The Fourier Transform (FT) serves as a fundamental tool for analyzing stationary or quasi-stationary vibration signals characterized by stable harmonic structures, such as those induced by imbalance or misalignment. It produces a high-resolution global spectrum that facilitates accurate identification of characteristic frequencies and their associated sidebands. Mathematically, the discrete-time Fourier Transform (DTFT) is expressed as:

$$X(\omega)=\sum_{n=-\infty}^{+\infty}x[n]\,e^{-j\omega n} \tag{1}$$

which yields a comprehensive description of the signal in the frequency domain. However, it does not provide any information about the temporal localization of spectral components. To overcome this limitation, the Short-Time Fourier Transform (STFT) introduces time–frequency localization by applying a sliding window to the signal. The STFT of x[n] with a sliding window g[n] is defined as:

$$X(\tau,\omega)=\sum_{n=-\infty}^{+\infty}x[n]\,g[n-\tau]\,e^{-j\omega n} \tag{2}$$

where g[n] is a real, nonnegative window function with finite support, τ denotes the time shift, and ω represents the angular frequency. This formulation provides a fixed-resolution time–frequency representation, enabling the detection of transient spectral structures. The window length determines the inherent trade-off between time and frequency resolution: a longer window improves frequency resolution at the expense of time localization, whereas a shorter window achieves the opposite. Because this resolution remains fixed once the window length is chosen, the STFT yields a constant-bandwidth decomposition that may be suboptimal for signals with multi-scale characteristics.
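The window-length trade-off described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the Hann window, the 12 kHz sampling rate, the synthetic tone-plus-impulse signal, and the helper name `stft_mag` are not taken from the paper, and only magnitudes are computed (the per-frame FFT uses a frame-relative phase reference rather than the absolute index in Eq. (2)).

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude of a framed, Hann-windowed FFT (an STFT in the sense of Eq. (2))."""
    g = np.hanning(win_len)                               # window g[n]
    starts = range(0, len(x) - win_len + 1, hop)          # time shifts tau
    frames = np.stack([x[s:s + win_len] * g for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))            # (num_frames, win_len // 2 + 1)

fs = 12_000                          # illustrative sampling rate in Hz
t = np.arange(fs) / fs               # one second of samples
x = np.sin(2 * np.pi * 1000 * t)     # stationary 1 kHz tone
x[6000] += 5.0                       # one impulse at t = 0.5 s

for win_len in (64, 1024):
    S = stft_mag(x, win_len, hop=win_len // 2)
    print(f"window={win_len:5d}  bin width={fs / win_len:8.2f} Hz  "
          f"window duration={1000 * win_len / fs:6.2f} ms  frames={S.shape[0]}")
```

The short window localizes the impulse to a few milliseconds but resolves frequency only coarsely, while the long window does the reverse; neither setting suits both components at once.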

2.2 Wavelet Transform

The Wavelet Transform (WT) is a powerful tool for the analysis of non-stationary signals. Unlike the Fourier Transform (FT), which provides only global frequency information without temporal localization, the WT enables joint localization in both the time and frequency domains. Compared with the Short-Time Fourier Transform (STFT), which relies on a fixed window length, the WT employs short windows for high-frequency components and long windows for low-frequency components. This adaptive resolution makes it particularly suitable for analyzing non-stationary signals with abrupt changes, impulses, or modulations, such as bearing vibration signals. Unlike the FT, which utilizes global sinusoidal basis functions, the WT represents signals using a set of localized basis functions derived from a single mother wavelet function. The mother wavelet $\psi(t)$ must satisfy the admissibility condition, which implies the zero-mean property $\int_{-\infty}^{+\infty}\psi(t)\,dt=0$ and hence band-pass characteristics. The continuous wavelet transform of a signal $x(t)$ is defined as:

$$W_x(a,b)=\frac{1}{\sqrt{|a|}}\int_{-\infty}^{+\infty}x(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right)dt \tag{3}$$

where $a$ is the scale parameter that controls the frequency resolution, $b$ is the translation parameter determining the temporal position, and the normalization factor $1/\sqrt{|a|}$ ensures energy consistency across different scales.

In mechanical fault diagnosis, the WT is particularly effective due to its sensitivity to abrupt changes, transients, and impact components. Compared with traditional Fourier analysis, wavelet-based methods offer significant advantages for non-stationary and locally structured signals.

Typical examples of wavelets include the Morlet wavelet and the Laplace wavelet. The Morlet wavelet is defined as a complex exponential sinusoid modulated by a Gaussian envelope:

$$\psi_M(t)=\pi^{-1/4}\,e^{-j\omega_0 t}\,e^{-t^2/2} \tag{4}$$

where $\omega_0$ is the central frequency and $\pi^{-1/4}$ is the normalization factor ensuring unit energy. Its spectrum is Gaussian-shaped, with concentrated energy and excellent frequency localization properties. As a quasi-analytic wavelet, the Morlet wavelet provides both amplitude and phase information, making it well-suited for analyzing harmonic components, frequency/amplitude modulations, and phase synchronization phenomena.

The Laplace wavelet, also known as the Mexican Hat or Ricker wavelet, is derived by applying the Laplacian operator (second derivative) to a Gaussian function. Its analytical form is given by:

$$\psi_L(t)=\frac{2}{\sqrt{3}\,\pi^{1/4}}\left(1-t^2\right)e^{-t^2/2} \tag{5}$$

which is highly sensitive to impulses, discontinuities, and transients. It is particularly effective for detecting impact-type features such as bearing faults and gear meshing defects. However, the selection of the mother wavelet and scale parameters is typically hand-crafted, which may be suboptimal under complex, data-rich, and highly variable operating conditions.
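As a quick numerical sanity check of Eq. (5), the following sketch evaluates the wavelet on a dense grid and approximates its mean and energy by Riemann sums. The function name `mexican_hat` and the grid limits are illustrative choices, not part of the paper.

```python
import numpy as np

def mexican_hat(t):
    """Eq. (5): normalized second-derivative-of-Gaussian wavelet."""
    c = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return c * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

t = np.linspace(-8.0, 8.0, 4001)     # the wavelet is negligible beyond |t| = 8
dt = t[1] - t[0]
psi = mexican_hat(t)

mean = np.sum(psi) * dt              # zero-mean (admissibility) check
energy = np.sum(psi ** 2) * dt       # unit-energy check
print(f"mean ~ {mean:.2e}, energy ~ {energy:.6f}")
```

The mean is numerically zero and the energy is numerically one, which matches the zero-mean and normalization properties stated for the mother wavelet.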

2.3 1D CNN

A typical one-dimensional convolutional neural network (1D CNN) generally consists of a 1D convolutional layer, a batch normalization (BN) layer, and an activation function layer. The BN layer normalizes the input features to stabilize and accelerate training, while the activation function introduces nonlinearity, thereby enhancing the network’s representation capability. The 1D convolutional layer performs convolution operations on the input sequence, which can be mathematically expressed as:

$$y[n]=\sum_{k=0}^{K-1}h[k]\,x[n-k]=(x*h)[n] \tag{6}$$

where x[n] denotes the input signal and h[k] is the convolutional kernel. Consequently, each convolutional kernel can be interpreted as a learnable band-pass filter, and a collection of kernels forms a trainable filter bank. Although 1D CNNs are capable of learning task-specific filters directly from labeled vibration data, their performance often relies on large amounts of annotated data and they generally suffer from limited interpretability.

3 Proposed Method

3.1 Overall Architecture

The overall pipeline of the proposed LWCNet is illustrated in Fig. 1, which consists of four major stages: data acquisition, data preprocessing, feature extraction via dual-branch encoders, and classification through a distribution-aware metric module.


Figure 1: Overall architecture of the proposed LWCNet

Data Acquisition and Preprocessing. The vibration signals are collected from a rotating machinery test bench comprising an electric motor, torque shaft, bearing test module, flywheel, and load motor. The acquired raw signals are subsequently divided into the support set and query set, following the standard few-shot learning paradigm. Each vibration sample undergoes preprocessing to generate two complementary modalities: the 1D raw temporal signal, which preserves the oscillatory and transient structures in the time domain, and the 2D time–frequency spectrogram, obtained via Short-Time Fourier Transform (STFT), which encodes the localized spectral distribution.

Laplace Wavelet Convolution Net (LWCNet). The core of our framework is the proposed LWCNet, which is designed to jointly exploit temporal and time-frequency information through a dual-branch architecture. The processing pipeline can be summarized as follows:

(1) Temporal Branch (1D Feature Extractor). The raw vibration signal $x\in\mathbb{R}^{B\times 1\times L}$ is first processed by a 1D convolutional layer to capture local oscillatory structures. This is followed by the Attention-Driven Adaptive Laplace Wavelet Convolution (AALWC) module, where the convolutional kernels are replaced with learnable parametric Laplace wavelet kernels. Each kernel is characterized by scale, shift, and amplitude parameters, enabling the network to adaptively align with task-relevant frequency bands. To further enhance discriminability, multiple Laplace kernels are assigned to each output channel, and a dynamic attention mechanism computes soft weights over these kernels, thereby adaptively selecting and fusing the most relevant responses for each input sample. The output tensor from the temporal branch is denoted as $F_{1D}\in\mathbb{R}^{B\times C_t\times L'}$.

(2) Time-Frequency Branch (2D Feature Extractor). The time-frequency spectrograms obtained via STFT are fed into a 2D convolutional encoder. To capture spectral patterns at different resolutions, we employ multi-scale convolutional blocks with kernel sizes $3\times 3$, $5\times 5$, and $7\times 7$. These convolutional layers are followed by Batch Normalization and ReLU activations to stabilize training and introduce nonlinearity. This branch outputs a 2D feature map $F_{2D}\in\mathbb{R}^{B\times C_f\times H\times W}$.

(3) Multimodal Fusion. To unify the information from both modalities, the temporal features $F_{1D}$ are first reshaped into a pseudo-2D representation in $\mathbb{R}^{B\times C_t\times H\times W}$ so that they share the same spatial dimensionality as the time-frequency features. The reshaped tensor is then concatenated with $F_{2D}$ along the channel dimension, yielding a fused multimodal feature map $F_{\mathrm{fused}}\in\mathbb{R}^{B\times (C_t+C_f)\times H\times W}$. This step enables the model to combine transient temporal cues with stable spectral representations.

(4) Downstream Processing. The fused feature map is passed through additional convolutional and pooling layers to reduce dimensionality while preserving discriminative information. The resulting compact representation serves as the input to the Mahalanobis-Prototype Joint Metric module for classification.

Through this dual-branch design, LWCNet is able to leverage the complementary strengths of both modalities: the temporal branch excels at capturing transient impulses and non-stationary oscillations, while the time-frequency branch captures stable harmonic and modulation patterns. Their joint fusion provides a comprehensive and robust feature representation, well-suited for few-shot fault diagnosis.
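The reshape-and-concatenate fusion step can be sketched at the level of tensor shapes. All sizes below are illustrative, and the sketch assumes that the temporal length satisfies L' = H × W so that the reshape is exact; the paper does not state how L' is matched to H × W.

```python
import numpy as np

B, Ct, Cf, H, W = 4, 8, 16, 16, 16        # illustrative sizes, not from the paper
L_prime = H * W                            # assumption: L' = H * W

F1d = np.random.randn(B, Ct, L_prime)      # temporal-branch output F_1D
F2d = np.random.randn(B, Cf, H, W)         # time-frequency-branch output F_2D

F1d_2d = F1d.reshape(B, Ct, H, W)          # pseudo-2D view of the temporal features
F_fused = np.concatenate([F1d_2d, F2d], axis=1)   # concatenate along the channel axis
print(F_fused.shape)                       # (4, 24, 16, 16)
```

If L' differs from H × W in practice, some interpolation or adaptive pooling step would be needed before the reshape.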

Mahalanobis-Prototype Joint Metric. To achieve robust classification under few-shot conditions, we introduce a Mahalanobis-Prototype Joint Metric. For each class, prototype vectors are computed as empirical means of support embeddings, while covariance matrices capture intra-class variability. The query features are then compared against class prototypes using both Euclidean distances and Mahalanobis distances, where the latter accounts for the covariance structure of feature distributions. These complementary similarity scores are fused via a lightweight classifier, yielding decision boundaries that closely approximate the Bayes optimal rule.

Fault Diagnosis and Interpretability. The final predictions are obtained through the joint metric module, and the diagnostic performance is visualized via confusion matrices and t-SNE feature distributions. Notably, the learned wavelet kernels provide interpretable insights into the spectral bands emphasized by the network, while the Mahalanobis–Prototype integration ensures distribution-aware robustness in small-sample regimes.

3.2 An Attention-Driven Adaptive Laplace Wavelet Convolution Module

We propose an Attention-Driven Adaptive Laplace Wavelet Convolution (AALWC) module, a dual-branch residual structure that incorporates both physics-informed priors and learnable adaptability. The AALWC is specifically designed to enhance the feature representation capability of deep neural networks when analyzing non-stationary and oscillatory signals. The module integrates three complementary components: (i) a learnable parametric Laplace wavelet kernel, which embeds multi-resolution prior knowledge; (ii) a dual-branch LoRA-inspired structure, which ensures parameter efficiency while maintaining representational richness; and (iii) an attention-based dynamic fusion mechanism, which adaptively selects the most informative features under varying signal conditions. This tripartite design targets the challenge of capturing multi-scale and transient signal characteristics in the presence of strong background noise. The AALWC module adopts a dual-branch architecture: the first branch employs a baseline 1D convolution, serving as a robust band-pass representation; the second branch incorporates a Laplace-of-Gaussian (LoG) wavelet enhanced with attention and LoRA mechanisms, where LoRA ensures parameter efficiency and the attention module enables conditional adaptivity. This design strengthens the representation of impulsive and transient features. For an input $x\in\mathbb{R}^{B\times C_{\mathrm{in}}\times L}$, the data processing workflow of the AALWC can be formulated as follows:

$$y=\mathrm{Conv}_{\mathrm{base}}(x)+\gamma\,A\big(\mathrm{LWC}(Bx)\big) \tag{7}$$

Here, $A$ and $B$ denote the low-rank adapters implemented by 1×1 convolutions, while LWC represents the Laplace wavelet convolution operation. The residual scaling factor is defined as $\gamma=\alpha/r$, where $\alpha$ is a scalar hyper-parameter that controls the overall strength of the low-rank residual branch and $r$ denotes the rank. In all experiments, $\alpha$ is kept constant across layers so that the magnitude of the added low-rank residual is normalized by $r$ and remains numerically stable. Notably, $A$ is initialized to zero, ensuring that the module starts from a “baseline-equivalent” state and progressively learns effective residual representations. First, the 1×1 convolution $B$ performs the low-rank projection, reducing the feature dimension to $r$ and yielding $z\in\mathbb{R}^{B\times r\times L}$. Second, the reduced features are processed by the LWC operation for further extraction. The core of the proposed module is a parametric learnable Laplace wavelet kernel, which functions as a versatile and interpretable feature extractor. The kernel function is parameterized as:

$$\psi(t;s,u,A)=\frac{A}{\sqrt{s}}\left(1-\left(\frac{t-u}{s}\right)^{2}\right)e^{-\frac{1}{2}\left(\frac{t-u}{s}\right)^{2}} \tag{8}$$

where $t$ denotes the time index within the kernel window and $A$ is the amplitude parameter that controls the energy of the kernel. $s$ is the scale parameter, governing the width of the wavelet and inversely related to the frequency of interest: a smaller $s$ captures high-frequency, transient components, whereas a larger $s$ emphasizes low-frequency, sustained trends. $u$ is the shift parameter, which centers the kernel within the window, enabling precise alignment with salient signal features. To improve both the representation capability and interpretability, we adopt learnable kernel parameters $\theta=\{s_{c,m},u_{c,m},A_{c,m}\}$ for $c=1,\dots,C_{\mathrm{out}}$ and $m=1,\dots,M$. These parameters are adaptively optimized during training, enabling the network to adjust the wavelet shape. Analyzing the learned kernels further reveals the frequency bands that the model emphasizes, thereby offering interpretability. Specifically, instead of using a single static kernel per output channel, we employ multiple Laplace wavelet kernels per channel, which endows the network with multi-scale feature extraction capability within a single layer. The scale parameter $s$ is initialized according to the kernel size, while $u$ and $A$ are initialized to zero. All parameters are optimized via gradient descent, and the update process can be expressed as:

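$$\begin{cases}
\dfrac{\partial\psi}{\partial A}=\dfrac{1}{\sqrt{s}}\left(1-\left(\dfrac{t-u}{s}\right)^{2}\right)e^{-\frac{1}{2}\left(\frac{t-u}{s}\right)^{2}}\\[2ex]
\dfrac{\partial\psi}{\partial u}=\dfrac{A}{s^{3/2}}\left(3\,\dfrac{t-u}{s}-\left(\dfrac{t-u}{s}\right)^{3}\right)e^{-\frac{1}{2}\left(\frac{t-u}{s}\right)^{2}}\\[2ex]
\dfrac{\partial\psi}{\partial s}=\dfrac{A}{2\,s^{3/2}}\left(-1+7\left(\dfrac{t-u}{s}\right)^{2}-2\left(\dfrac{t-u}{s}\right)^{4}\right)e^{-\frac{1}{2}\left(\frac{t-u}{s}\right)^{2}}
\end{cases} \tag{9}$$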

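$$\begin{cases}
A\leftarrow A-\eta\displaystyle\sum_{t}\dfrac{\partial\mathcal{L}}{\partial\psi(t)}\dfrac{\partial\psi(t)}{\partial A}\\[2ex]
u\leftarrow u-\eta\displaystyle\sum_{t}\dfrac{\partial\mathcal{L}}{\partial\psi(t)}\dfrac{\partial\psi(t)}{\partial u}\\[2ex]
s\leftarrow s-\eta\displaystyle\sum_{t}\dfrac{\partial\mathcal{L}}{\partial\psi(t)}\dfrac{\partial\psi(t)}{\partial s}
\end{cases} \tag{10}$$

where $\mathcal{L}$ denotes the training loss, $\eta$ is the learning rate, and the loss gradient reaches each kernel parameter through the chain rule.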
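The partial derivatives of the parametric kernel can be checked numerically. The sketch below assumes the kernel of Eq. (8) with a 1/√s normalization, differentiates it analytically with respect to A, u, and s, and compares the result with central finite differences. The function names and the parameter values are illustrative only.

```python
import numpy as np

def lwc_kernel(t, s, u, A):
    """Eq. (8) under the stated assumption: A / sqrt(s) * (1 - z^2) * exp(-z^2 / 2), z = (t - u) / s."""
    z = (t - u) / s
    return A / np.sqrt(s) * (1.0 - z ** 2) * np.exp(-0.5 * z ** 2)

def lwc_grads(t, s, u, A):
    """Analytical partial derivatives of lwc_kernel with respect to A, u and s."""
    z = (t - u) / s
    e = np.exp(-0.5 * z ** 2)
    dA = (1.0 - z ** 2) * e / np.sqrt(s)
    du = A / s ** 1.5 * (3.0 * z - z ** 3) * e
    ds = A / (2.0 * s ** 1.5) * (-1.0 + 7.0 * z ** 2 - 2.0 * z ** 4) * e
    return dA, du, ds

t = np.arange(-16, 17, dtype=float)        # time index inside a 33-tap kernel window
s, u, A, h = 3.0, 0.5, 1.2, 1e-6           # illustrative parameter values and step size

dA, du, ds = lwc_grads(t, s, u, A)
num_dA = (lwc_kernel(t, s, u, A + h) - lwc_kernel(t, s, u, A - h)) / (2 * h)
num_du = (lwc_kernel(t, s, u + h, A) - lwc_kernel(t, s, u - h, A)) / (2 * h)
num_ds = (lwc_kernel(t, s + h, u, A) - lwc_kernel(t, s - h, u, A)) / (2 * h)

for name, ana, num in (("A", dA, num_dA), ("u", du, num_du), ("s", ds, num_ds)):
    print(name, "max abs error:", np.max(np.abs(ana - num)))
```

In a deep learning framework these derivatives would normally be produced by automatic differentiation; the closed forms are mainly useful for inspecting how each parameter reshapes the kernel.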

We argue that not all wavelet kernels contribute equally to every input sample, and their relative importance should be adaptively determined according to the spectral characteristics of the input signal. To this end, we compute a set of attention weights $W_m\in\mathbb{R}^{B\times C_{\mathrm{out}}\times M}$, which implement a soft selection mechanism over the $M$ candidate kernels. First, we derive a channel-wise descriptor $g\in\mathbb{R}^{B\times C_{\mathrm{out}}\times M}$ by applying Global Average Pooling (GAP) along the temporal dimension of the reshaped tensor $Y\in\mathbb{R}^{B\times C_{\mathrm{out}}\times M\times T}$:

$$g_{b,c,m}=\frac{1}{T}\sum_{t=1}^{T}Y_{b,c,m,t} \tag{11}$$

where g compactly represents the activation strength of each wavelet kernel for the given input batch. The descriptor vector g is then flattened and passed through a lightweight shared attention network, typically implemented as a two-layer MLP with ReLU activation and dropout regularization. This attention network learns to capture the complex interdependencies among different wavelet kernels:

$$W_m=\mathrm{Softmax}_{\mathrm{dim}=-1}\big(\mathrm{Attn}(\mathrm{Flatten}(g))\big) \tag{12}$$

where $\mathrm{Flatten}(\cdot)$ denotes vector reshaping, and $\mathrm{Attn}(\cdot)$ computes attention scores over the $M$ kernels per channel. The $\mathrm{Softmax}_{\mathrm{dim}=-1}$ operation guarantees that, for each output channel and batch sample, the weights across the $M$ kernels sum to one, so that they can be interpreted as normalized importance scores. The final fused feature map $Y_{\mathrm{fuse}}\in\mathbb{R}^{B\times C_{\mathrm{out}}\times T}$ is obtained by computing a weighted sum of the $M$ parallel feature maps, guided by the attention weights:

$$Y_{\mathrm{fuse}}(b,c,t)=\sum_{m=1}^{M}W_m(b,c,m)\cdot Y(b,c,m,t) \tag{13}$$

The attention-based fusion mechanism acts as a non-linear gating function. It enables the network to emphasize the most relevant kernels (e.g., those tuned to fault-related frequencies) while suppressing irrelevant or noisy ones. This context-aware selection enhances both the robustness and representational power of the extracted features. Finally, the extracted features are projected back to the higher-dimensional space via the adapter $A$, scaled by the residual factor $\gamma$, and subsequently fused with the features extracted from the main convolutional branch, as in Eq. (7).
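The pooling, attention, and weighted-sum steps of Eqs. (11) to (13) can be sketched in NumPy as follows. The tensor sizes, the hidden width, and the randomly initialized two-layer MLP are placeholders chosen for illustration; in the actual model the MLP weights are learned, and dropout is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C_out, M, T, hidden = 2, 4, 3, 128, 16          # illustrative sizes only

Y = rng.standard_normal((B, C_out, M, T))          # responses of the M kernels per channel

g = Y.mean(axis=-1)                                # Eq. (11): GAP over time -> (B, C_out, M)

# Hypothetical two-layer MLP with ReLU; the weights would be learned in practice
W1 = rng.standard_normal((C_out * M, hidden)) * 0.1
W2 = rng.standard_normal((hidden, C_out * M)) * 0.1
scores = np.maximum(g.reshape(B, -1) @ W1, 0.0) @ W2
scores = scores.reshape(B, C_out, M)

scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
W_m = np.exp(scores)
W_m = W_m / W_m.sum(axis=-1, keepdims=True)        # Eq. (12): weights over M sum to one

Y_fuse = np.einsum("bcm,bcmt->bct", W_m, Y)        # Eq. (13): attention-weighted sum
print(Y_fuse.shape, W_m.sum(axis=-1).round(6).min())
```

Because the softmax is taken over the last axis, each (sample, channel) pair receives its own convex combination of the M kernel responses.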

3.3 Mahalanobis-Prototype Joint Metric

To achieve robust classification and construct more accurate, distribution-consistent decision boundaries under few-shot conditions, we integrate prototype-based representation with a Mahalanobis distance similarity metric. This approach combines the “central tendency” of prototype learning with the “distributional shape” captured by the Mahalanobis metric. The design is motivated by the statistical optimality of the sample mean and covariance under Gaussian assumptions, which guarantees minimum estimation variance and strong discriminative power. Prototype Estimation. Given a support set $S_k=\{s_i^{(k)}\}_{i=1}^{n_k}$ belonging to class $k$, the prototype vector is defined as the empirical mean of its feature embeddings:

$$\hat{u}_k=\frac{1}{n_k}\sum_{i=1}^{n_k}f_\theta\big(s_i^{(k)}\big) \tag{14}$$

where $f_\theta(\cdot)$ denotes the shared feature extractor. Under a Gaussian class-conditional model, this estimator attains the Cramér–Rao Lower Bound (CRLB), i.e., the minimum variance among all unbiased estimators of the class mean, thus providing an efficient and theoretically grounded representation for each category. Covariance Estimation and Mahalanobis Similarity. To capture intra-class variability, we compute the covariance matrix of the support features:

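$$\hat{\Sigma}_k=\frac{1}{n_k-1}\sum_{i=1}^{n_k}\Big(f_\theta\big(s_i^{(k)}\big)-\hat{u}_k\Big)\Big(f_\theta\big(s_i^{(k)}\big)-\hat{u}_k\Big)^{T} \tag{15}$$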

In practice, a low-rank approximation with shrinkage regularization is applied to enhance numerical stability under small-sample conditions. The inverse covariance $\hat{\Sigma}_k^{-1}$ then defines the Mahalanobis geometry of class $k$. For a query feature $q=f_\theta(x_q)$, the (squared) Mahalanobis distance to class $k$ is defined as:

$$D_M(q,\hat{u}_k)=(q-\hat{u}_k)^{T}\,\hat{\Sigma}_k^{-1}\,(q-\hat{u}_k) \tag{16}$$

This formulation penalizes deviations more heavily along directions of low variance, thereby aligning the distance metric adaptively with the underlying feature distribution. Integration of Prototype and Mahalanobis Distance. In practice, the network computes both prototype-based similarity scores and Mahalanobis-based quadratic forms. These complementary representations are concatenated and processed by a lightweight classifier head:

$$z=\mathrm{Conv1D}\left(\left[-\lVert q-\hat{u}_k\rVert_2,\; -D_M(q,\hat{u}_k)\right]\right) \tag{17}$$

where the convolutional layer learns to balance the contributions of Euclidean prototype similarity and covariance-aware Mahalanobis similarity. This hybrid strategy effectively combines the efficiency of prototype learning with the distribution-aware discriminability of the Mahalanobis distance, resulting in a classifier that closely approximates the Bayes optimal decision rule under Gaussian class-conditional models.
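
As a rough illustration of Eqs. (14)–(17), the following NumPy sketch computes a class prototype, a regularized inverse covariance, and the two similarity terms that enter the classifier head. It is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, the `shrinkage` weight, and the shrinkage target (a scaled identity) are hypothetical, the low-rank approximation is omitted, and the learned Conv1D head is left out so that only its two inputs are shown.

```python
import numpy as np

def class_statistics(support, shrinkage=0.1):
    """Prototype (Eq. 14) and regularized inverse covariance (Eq. 15).

    support: array of shape (n_k, d) with the embedded support samples.
    shrinkage: hypothetical blending weight towards a scaled identity.
    """
    n_k, d = support.shape
    prototype = support.mean(axis=0)
    centered = support - prototype
    # Unbiased sample covariance; it is all zeros in the 1-shot case.
    cov = centered.T @ centered / max(n_k - 1, 1)
    # Assumed regularizer: shrink towards a scaled identity so that the
    # matrix stays invertible even with very few support samples.
    scale = np.trace(cov) / d if np.trace(cov) > 0 else 1.0
    cov = (1.0 - shrinkage) * cov + shrinkage * scale * np.eye(d)
    return prototype, np.linalg.inv(cov)

def joint_scores(query, prototype, cov_inv):
    """Return the two inputs of Eq. (17) for one query and one class."""
    diff = query - prototype
    euclidean = np.linalg.norm(diff)
    mahalanobis = diff @ cov_inv @ diff  # quadratic form of Eq. (16)
    return np.array([-euclidean, -mahalanobis])

# Toy class: large spread along the first axis, small along the second.
support = np.array([[-2.0, 0.0], [-1.0, 0.1], [0.0, -0.1],
                    [1.0, 0.1], [2.0, -0.1]])
proto, cov_inv = class_statistics(support)

# Two queries at the same Euclidean distance from the prototype.
along = joint_scores(np.array([1.5, 0.0]), proto, cov_inv)   # high-variance axis
across = joint_scores(np.array([0.0, 1.5]), proto, cov_inv)  # low-variance axis
```

In this toy example both queries lie at the same Euclidean distance from the prototype, yet the query displaced along the low-variance axis receives a much larger Mahalanobis penalty, which is the behaviour described for Eq. (16). Note that under Gaussian class-conditional models the exact log-likelihood also contains a log-determinant term of the covariance, which this sketch does not model.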

4 Experiments

4.1 Datasets

To rigorously evaluate the effectiveness and generalization of the proposed approach, we conduct experiments on two widely used bearing-fault benchmarks: CWRU and PU. These datasets are well established in the fault diagnosis community, owing to their diverse fault types, controlled test conditions, and high-quality vibration signals. Using these two complementary datasets enables a comprehensive performance assessment, ensuring that the proposed framework is validated on both controlled laboratory measurements (CWRU) and complex, mixed-condition scenarios (PU).

CWRU dataset. The CWRU data were collected by the Bearing Data Center of Case Western Reserve University on a 2-hp Reliance motor test rig consisting of a drive motor, fan end, torque transducer/encoder, flywheel, and load motor (see Fig. 2). Faults were introduced into the inner race, outer race, and rolling element using electro-discharge machining (EDM) with diameters of 0.007, 0.014, and 0.021 inches. Vibration signals were measured by accelerometers at the drive end (DE) and fan end (FE) with sampling rates of 12 and 48 kHz (all FE data at 12 kHz). The dataset spans multiple loads (0, 1, 2, 3 hp) and fault locations, making it suitable for testing diagnostic algorithms across varied conditions. A detailed mapping of fault types, severities, and labels is given in Table 1. This dataset is particularly valuable for validating the robustness of diagnostic models under load variations and different fault severities. For a fair comparison with state-of-the-art (SOTA) methods, we adopt EnsembleNet [34] as the baseline and follow the same partitioning protocol across all datasets, while reproducing the performance of several representative SOTA approaches on this dataset.


Figure 2: Test rig of the CWRU dataset


For data preparation, each vibration sequence is first divided into two disjoint halves based on time: the first half is used exclusively for training, and the second half for testing. Both sets are segmented using a fixed-length window of 2048 consecutive points. To enhance the diversity of the training data while avoiding information leakage, a sliding window with stride 80 is applied within the training partition only, whereas non-overlapping segments are used for the test partition. This ensures strict temporal separation between training and testing samples, preventing overlapping windows from the same continuous signal from straddling both sets. To further assess robustness under varying data regimes, we fix the test set to 750 randomly selected non-overlapping samples and vary the training sample size from 30 to 19,800. The dataset covers multiple fault types across the inner race, outer race, and rolling element, ensuring consistent class complexity.
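
The split-then-segment procedure described above can be sketched in a few lines of plain Python. This is a minimal sketch, assuming the temporal split is taken at the midpoint of each record; the function names and the synthetic 120,000-point `record` are hypothetical and stand in for one real vibration sequence.

```python
def segment_starts(length, window, stride):
    """Start indices of all full windows of `window` points inside `length`."""
    if length < window:
        return []
    return list(range(0, length - window + 1, stride))

def split_and_segment(signal, window=2048, train_stride=80):
    """Temporal half split, overlapping train windows, disjoint test windows."""
    half = len(signal) // 2
    train_part, test_part = signal[:half], signal[half:]
    # Overlapping windows (stride 80) are used in the training half only.
    train = [train_part[s:s + window]
             for s in segment_starts(len(train_part), window, train_stride)]
    # Non-overlapping windows (stride equal to the window) in the test half.
    test = [test_part[s:s + window]
            for s in segment_starts(len(test_part), window, window)]
    return train, test

# Hypothetical record: sample indices stand in for the vibration values.
record = list(range(120_000))
train_segments, test_segments = split_and_segment(record)
```

For this synthetic record the sketch yields 725 overlapping training segments and 29 non-overlapping test segments, and every training sample precedes every test sample in time, so no window can straddle the two partitions.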

PU dataset. The Paderborn University (PU) dataset, released by the Chair of Mechatronics at the Heinz Nixdorf Institute (Germany), is designed to approximate realistic industrial operating conditions. It contains vibration and motor current signals from both healthy and faulty bearings, collected under laboratory as well as accelerated life test conditions. As illustrated in Fig. 3, the test bench comprises an electric motor, torque-measurement shaft, bearing module, flywheel, and load motor. Bearing faults were induced by three approaches: electrical discharge machining (EDM), manual damage, and natural degradation. Data acquisition was performed under three rotational speeds (900, 1500, 2700 rpm) and two radial loads (0.7, 1.4 kN), with a sampling rate of 64 kHz. By including both artificially induced and naturally developed faults, the PU dataset is particularly suitable for domain adaptation studies and cross-condition generalization in fault diagnosis. In total, the dataset provides 32 labels, each recorded under four different operating conditions (Table 2), ensuring diverse and challenging fault scenarios for thorough evaluation. Experiments on PU further validate the effectiveness and generalization of our method under complex operating environments. Compared to CWRU, PU presents more intricate fault patterns and diverse operating scenarios, thereby providing a more challenging benchmark. In this study, as summarized in Table 3, we select 13 fault labels: one healthy bearing, six artificially induced faults, and six naturally degraded faults. Following the same preparation strategy as for CWRU, we fix the test set to 750 randomly selected samples, while varying the number of training samples between 1040 and 19,800 across different settings. 
To further assess robustness and generalization, we also conduct experiments under single-condition and multi-condition scenarios, highlighting the proposed method’s superior performance in complex industrial environments.


Figure 3: Test rig of PU dataset


4.2 Implementation Details

For all experiments, the vibration signals are segmented using a 2048-point window consistent with the data preparation described in Section 4.1. A non-overlapping sliding window is applied to the testing data to ensure strict temporal independence, while the training data are augmented using an overlapping sliding window with stride 80 within the training partition only. This segmentation strategy enriches the diversity of training samples without introducing temporal leakage between training and test sets, as segments from the same continuous bearing record are never split across the two subsets.

For a fair and consistent comparison, all baseline models (CovaMNet, MF-Net, and EnsembleNet) and our proposed method were trained under the same optimization and scheduling settings. Specifically, we employed the Adam optimizer with an initial learning rate of 0.001, a five-epoch warm-up phase, and a cosine annealing learning rate schedule to ensure smooth convergence. Each model was trained for 100 epochs with the same batch size of 32 and the same input preprocessing pipeline, and without any additional data augmentation. To ensure fairness, no additional tuning was applied to individual baselines beyond reproducing their official configurations within the unified training framework.

For our proposed model, a layer-specific learning rate was introduced in the Attention-Driven Adaptive Laplace Wavelet Convolution (AALWC) module. The learnable kernel parameters $u$, $s$, and $A$, which control the translation, scale, and amplitude of the Laplace wavelet, were optimized with a smaller learning rate of 0.0001, ensuring stable convergence of the physically meaningful parameters and encouraging interpretable wavelet shapes. The AALWC module is deployed in the first layer of the temporal branch. Its input is a single-channel vibration segment, and it produces 64 output channels. The Laplace wavelet convolution within AALWC uses a kernel size of 3, and the number of learnable wavelet kernels per channel is set to 3; the influence of this kernel-count hyper-parameter on performance is analyzed in the ablation study. The attention component of AALWC is implemented as a lightweight 1D convolution along the temporal dimension rather than a separate MLP block or multi-head self-attention module, and therefore it introduces no additional hidden-layer sizes or dropout hyper-parameters.
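
The learning-rate schedule described above (five warm-up epochs followed by cosine annealing over 100 epochs) can be sketched as a per-epoch function. The text does not specify the warm-up shape or the final learning rate, so this sketch assumes a linear warm-up and annealing towards zero; the function name is hypothetical, and applying the same schedule shape to the wavelet parameters with a base rate of 0.0001 is likewise an assumption.

```python
import math

def learning_rate(epoch, base_lr=1e-3, warmup_epochs=5, total_epochs=100):
    """Linear warm-up followed by cosine annealing, evaluated per epoch."""
    if epoch < warmup_epochs:
        # Assumed linear ramp; the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the annealing phase completed so far, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Schedule for the main parameters and, with a ten times smaller base rate,
# for the wavelet parameters u, s and A.
schedule = [learning_rate(e) for e in range(100)]
wavelet_schedule = [learning_rate(e, base_lr=1e-4) for e in range(100)]
```

In a typical training framework the two schedules would correspond to two optimizer parameter groups that share the scheduler but differ in their base learning rate.
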
All experiments were conducted on an NVIDIA H20 GPU (96 GB memory) under the same hardware and software environment to guarantee reproducibility and a fair comparison of computational budgets across methods. Following standard practice in few-shot learning, all models are trained and evaluated under an N-way K-shot episodic protocol. On the CWRU dataset we adopt a 10-way setting, and on the PU dataset we use a 13-way setting, consistent with the number of fault categories in each benchmark. During both training and testing, each episode is constructed by first sampling N classes from the corresponding split and then drawing K support samples and Q=1 query sample per class.
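
The episodic protocol can be illustrated with a short sampler. This is a minimal sketch, assuming the training pool is stored as a dictionary from class label to a list of segments; the function name, the pool layout, and the balanced three-samples-per-class toy pool are hypothetical.

```python
import random

def sample_episode(pool, n_way, k_shot, n_query=1, rng=random):
    """Draw one N-way K-shot episode from `pool` (dict: label -> samples)."""
    # A class is usable only if it can supply disjoint support and query items.
    eligible = [c for c, items in pool.items() if len(items) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query samples")
    support, query = [], []
    for label in rng.sample(eligible, n_way):
        drawn = rng.sample(pool[label], k_shot + n_query)
        support += [(x, label) for x in drawn[:k_shot]]
        query += [(x, label) for x in drawn[k_shot:]]
    return support, query

# Toy pool: 10 classes with 3 segments each, i.e. 30 training samples.
rng = random.Random(0)
pool = {c: [f"class{c}_seg{i}" for i in range(3)] for c in range(10)}
support, query = sample_episode(pool, n_way=10, k_shot=1, rng=rng)
```

With only three segments per class, a 10-way 1-shot episode can be built, whereas a 5-shot request raises `ValueError` because support and query samples must be disjoint.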

4.3 Main Results

Comparison Methods. To ensure a fair and comprehensive evaluation, we compare our method with a range of representative baselines, including classical metric-learning approaches such as ProtoNet [35] and Cosine Classifier [36], enhanced variants such as CovaMNet [37], and MF-Net [38], as well as more recent attention- and transformer-based architectures including QSFormer [39] and EnsembleNet [34]. These methods cover different modeling philosophies ranging from prototype-based learning to attention-augmented feature extractors and multi-branch fusion, thereby providing a solid benchmark for validating the effectiveness of our proposed approach.

Results on CWRU. Table 4 reports the results on the CWRU dataset under the 10-way k-shot setting. Our method achieves the best performance across all training sample sizes and shot configurations. In the most challenging scenario with 30 training samples and 1-shot support, the proposed method attains 97.53% accuracy, surpassing EnsembleNet (84.87%) by 12.7 percentage points. With 5-shot support, the accuracy further improves to 98.40%. In medium- and high-sample regimes (90–19,800), our method consistently achieves accuracies above 99.5%, slightly outperforming strong baselines such as MF-Net and EnsembleNet. These results demonstrate that the proposed approach not only addresses the extreme few-shot challenge but also remains competitive when abundant labeled data are available. As shown in Fig. 4, the testing accuracy of all methods increases as the number of training samples grows, while our proposed method consistently achieves the highest accuracy across all settings. Note that the 10-way 5-shot setting at 30 samples (CWRU) and the 13-way 5-shot setting at 65 samples (PU) are not reported because the training pool is insufficient to construct valid 5-shot episodes with separate query samples.

Table 5 reports the cross-sensor results. Although accuracies drop due to the increased difficulty of cross-sensor generalization, our method still outperforms CovaMNet, MF-Net, and EnsembleNet by a clear margin across all sample sizes (30–600). For example, under the 30-sample 1-shot setting, our model achieves 90.8%, compared with 80.6% for EnsembleNet. These findings indicate that the proposed method generalizes well even when the training and testing data originate from different measurement channels and bearing domains, a setting that is closer to realistic industrial deployment.


Figure 4: Comparison of testing accuracy among different few-shot learning methods on the CWRU dataset under 10-way k-shot fault diagnosis tasks with varying numbers of shots


Results on PU (Condition 0). Table 6 summarizes the results on the PU dataset under Condition 0 with the 13-way k-shot protocol. Our method again delivers superior performance across all sample scales. In the low-data regime (65 samples, 1-shot), our model reaches 87.79%, which is more than 11 percentage points higher than EnsembleNet (76.36%). At larger sample sizes (1300–25,844), our method achieves 97.85%–99.69%, consistently surpassing the strongest baseline by noticeable margins. These findings further confirm the scalability and robustness of the proposed design. As illustrated in Fig. 5, we further evaluate the proposed method on the PU dataset under 13-way k-shot tasks. Similar to the results on the CWRU dataset, the performance of all methods improves with increasing training samples.


Figure 5: Comparison of testing accuracy among different few-shot learning methods on the PU dataset under 13-way k-shot fault diagnosis tasks

Results on PU (Multiple Conditions). To evaluate robustness under varying operating environments, Table 7 reports the average accuracy over four working conditions (0–3) under the 13-way, 5-shot setting. The average was computed by training on all four conditions combined and testing across all the conditions. However, to ensure the generalization capability of the proposed method across different working conditions, we further perform leave-condition-out evaluations: for example, training on conditions 0, 1, 2 and testing on condition 3. Our method achieves the best results across all training sample sizes, reaching 95.18% at 4160 samples compared to 87.20% for EnsembleNet and 83.44% for MF-Net. The relative improvements are especially significant in medium-scale regimes (260–1300), where our method improves accuracy by 7–10 percentage points over the strongest competitor. This demonstrates that the proposed framework not only excels in controlled scenarios but also generalizes effectively across diverse and complex working conditions. As presented in Fig. 6, we visualize the feature distributions using t-SNE. It can be observed that CovaMNet and MF-Net exhibit notable overlaps between different classes, while EnsembleNet improves the separation to some extent. In contrast, our proposed method produces more compact intra-class clusters and clearer inter-class boundaries. This indicates that the proposed approach can extract more discriminative and robust feature representations even under complex and variable operating conditions.


Figure 6: t-SNE visualization of feature embeddings under multiple working conditions on the PU dataset. Compared with baseline methods (a–c), the proposed method (d) achieves more compact intra-class clustering and clearer inter-class separation, highlighting its superior feature discriminability

Results on PU (Leave-Condition-Out Evaluation). To further investigate generalization under unseen operating conditions, we additionally perform leave-condition-out experiments. Specifically, the model is trained on Conditions 0, 1, and 2 and evaluated on the unseen Condition 3. Results are summarized in Table 8. This evaluation imposes a much stricter distribution shift, as the model encounters a completely unseen load/speed condition at test time. Our proposed model achieves the highest accuracy across all sample sizes (65–25,844). For instance, under the 65-sample 1-shot setting, our method reaches 84.6%, compared to 73.5% for EnsembleNet. The performance gap becomes even more pronounced in medium- and high-sample scenarios, highlighting the model’s resilience to operating-condition mismatch. These findings indicate that the proposed method maintains robust discriminative ability even when deployed in previously unseen working conditions.


Interpretability Visualization Analysis. To further demonstrate the interpretability of the proposed learnable Laplace wavelet module, Fig. 7 provides a visualization of both the initial and the learned wavelet kernels. As shown in Fig. 7a, the randomly initialized kernel exhibits a wide time-domain envelope with no preference for any particular frequency component. After training, the kernels automatically adjust their scale parameters such that the time-domain shapes become narrower or wider depending on the target frequency. The frequency-domain responses presented in Fig. 7b reveal a clear alignment between the learned spectral peaks and the theoretical bearing fault characteristic frequencies. Specifically, the kernel associated with the BPFI fault converges to a dominant peak at approximately 161.7 Hz, while another kernel aligns with the BPFO frequency at approximately 107.36 Hz. This demonstrates that the learnable wavelet module not only differentiates between fault types but also emphasizes fault-relevant spectral bands that are physically consistent with the mechanical structure of the bearing. These observations validate the claim that the proposed method possesses inherent physical interpretability: the network adapts its wavelet parameters to selectively enhance characteristic fault frequencies, effectively bridging data-driven learning with domain-specific vibration knowledge.
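
The link between a wavelet's parameters and its spectral peak can be checked numerically. The sketch below assumes the commonly used real Laplace wavelet, a one-sided exponentially damped sinusoid, parameterized directly by an oscillation frequency and a damping ratio. The exact parameterization of the AALWC kernels ($u$, $s$, $A$) is defined earlier in the paper and may differ, and the `damping` value, the one-second support, and the 12 kHz sampling rate are illustrative assumptions, not the trained kernels.

```python
import numpy as np

def laplace_wavelet(t, freq_hz, damping=0.03, shift=0.0, amplitude=1.0):
    """Real Laplace wavelet: a one-sided, exponentially damped sinusoid."""
    omega = 2.0 * np.pi * freq_hz
    tau = np.clip(t - shift, 0.0, None)            # time elapsed since onset
    decay = damping / np.sqrt(1.0 - damping ** 2) * omega
    wave = amplitude * np.exp(-decay * tau) * np.sin(omega * tau)
    return np.where(t >= shift, wave, 0.0)          # zero before the onset

fs = 12_000                                         # assumed sampling rate (Hz)
t = np.arange(fs) / fs                              # one second of samples
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)         # 1 Hz frequency bins

# Wavelets tuned to the characteristic frequencies quoted in the text.
peaks = {}
for name, f0 in {"BPFO": 107.36, "BPFI": 161.7}.items():
    spectrum = np.abs(np.fft.rfft(laplace_wavelet(t, f0)))
    peaks[name] = freqs[np.argmax(spectrum)]
```

For lightly damped wavelets the magnitude spectrum peaks very close to the oscillation frequency, so a kernel whose parameters converge to a fault characteristic frequency shows up as a spectral peak near that frequency, which is the kind of alignment Fig. 7b visualizes.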


Figure 7: Visualization of the learnable Laplace wavelet kernels. (a) Time-domain kernels with random initialization (black) and learned kernels tuned to the theoretical inner-race (BPFI = 161.7 Hz, blue) and outer-race (BPFO = 107.36 Hz, red) characteristic frequencies. (b) Corresponding magnitude spectra, where the learned kernels exhibit clear peaks around BPFI and BPFO, indicating that the proposed module adaptively emphasizes fault-related frequency bands

Across both datasets and multiple working conditions, two clear trends emerge. First, the performance gains are particularly striking in the extreme few-shot regime, which aligns with the practical challenges of industrial applications where annotated data are scarce. Second, the superiority persists as the number of samples increases and as operating conditions vary, highlighting both scalability and robustness. We attribute these improvements to the synergy of (i) the physics-guided Laplace wavelet convolution, which extracts fine-grained spectral features, and (ii) the covariance-aware Mahalanobis metric, which enhances prototype discrimination by modeling intra-class variability. Together, these innovations enable the model to construct stable and generalizable representations under diverse settings. These strong results provide solid empirical evidence of the effectiveness of our framework and motivate a deeper analysis of its individual components via ablation studies in the following subsection.

4.4 Ablation Study

We conduct ablation experiments on the PU dataset to validate the contribution of each component in the proposed framework, as summarized in Table 9 and visualized in Fig. 8. First, replacing the proposed AALWC module with a standard 1D convolution (E1 vs. E3) results in a significant performance drop, confirming the effectiveness of AALWC in feature extraction. Second, comparing E2–E4 verifies the benefit of the Mahalanobis–Prototype joint metric, which yields more accurate and consistent decision boundaries. Finally, experiments with varying numbers of wavelet kernels (E4–E7) show that moderate kernel numbers (e.g., 3 or 6) achieve optimal performance, while excessive kernel numbers (E7) lead to performance degradation. The t-SNE visualizations further validate these findings, demonstrating that the proposed configuration enables clearer clustering and better inter-class separability.


Figure 8: Ablation study on the PU dataset under 13-way 1-shot tasks. (Top) Quantitative results across different experimental settings (E1–E7). (Bottom) t-SNE visualizations of feature embeddings corresponding to each experiment, showing the impact of different components and wavelet kernel numbers on class separability

(i) Effect of AALWC vs. standard convolution. To quantify the gain brought by the proposed AALWC module, we compare E1 (standard 1D convolution + Mahalanobis) with E3 (AALWC + Mahalanobis). As shown in Table 9, introducing AALWC improves macro-F1 from 82.47% to 84.13%, together with consistent gains in precision and recall. This demonstrates that the AALWC is able to capture more discriminative and physically meaningful spectral patterns than conventional convolutions: the learnable Laplace wavelet kernels adapt their scales and centers to emphasize fault-related frequency bands, leading to tighter clusters and fewer misclassifications in the few-shot regime.

(ii) Contribution of the Mahalanobis-prototype joint metric. We next examine the effect of the proposed Mahalanobis-prototype joint metric. E2 (AALWC + prototype only) yields an F1 of 65.24%, which highlights the limitation of relying solely on prototype matching without modeling intra-class variability. Replacing the prototype metric with the Mahalanobis distance (E3: AALWC + Mahalanobis) raises F1 to 84.13%, a gain of nearly 19 percentage points. Further combining both metrics in E4 (AALWC + prototype + Mahalanobis) boosts F1 to 85.91%. These results indicate that the Mahalanobis component plays a crucial role in enhancing feature discrimination by explicitly encoding class-dependent covariance, while the prototype term provides a stable class center; their joint use yields more accurate and consistent decision boundaries than either metric alone.

(iii) Influence of kernel numbers in AALWC. We further vary the number of wavelet kernels per channel from 1 to 9 (E4–E7). The results show that using three kernels (E5) achieves the best overall performance, with an F1 of 86.49%, while increasing to six or nine kernels slightly decreases accuracy. This suggests that a moderate multi-kernel design provides sufficient frequency diversity without introducing excessive redundancy. When too many kernels are used, several of them tend to collapse into overlapping or highly correlated frequency bands, which effectively over-parameterizes the model under the few-shot regime and makes optimization more difficult. As a result, the additional capacity is not translated into useful discriminative features, but instead increases the risk of overfitting and noise amplification, whereas three kernels per channel strike a better balance between expressiveness and generalization.

(iv) Effect of 1D/2D modality fusion. To quantify the benefit of multi-modal fusion, we additionally evaluate two single-branch variants on the PU dataset: $\mathrm{E1}_{\mathrm{1D}}$, which uses only the 1D temporal branch, and $\mathrm{E1}_{\mathrm{2D}}$, which uses only the 2D STFT-based branch. Both variants keep the same metric head and training protocol as the full model. As reported in Table 9, $\mathrm{E1}_{\mathrm{1D}}$ and $\mathrm{E1}_{\mathrm{2D}}$ obtain F1-scores of 77.17% and 79.09%, respectively, whereas the fused model E1 (1D+2D) achieves an F1 of 82.47%. Thus, fusion brings absolute improvements of 5.30 and 3.38 percentage points over the 1D-only and 2D-only settings, respectively. This indicates that raw 1D vibration features and 2D time–frequency representations provide complementary information: each single branch already yields competitive performance, but their fusion further enhances discrimination and stabilizes the results across different k-shot configurations.
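
The precision, recall, and macro-F1 values compared in this ablation can be computed from query predictions as in the following sketch. It assumes that precision and recall are macro-averaged in the same way as F1 (an unweighted mean of per-class scores); the function name and the toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over the classes in y_true."""
    labels = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy three-class example with two misclassified queries.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = macro_metrics(y_true, y_pred)
```

Because every class contributes equally to the average, a single poorly separated fault category lowers macro-F1 noticeably even when overall accuracy stays high, which makes it a sensitive indicator in ablations of this kind.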

Overall, these ablation studies clearly demonstrate that each component of our framework is indispensable: AALWC effectively captures discriminative spectral patterns, the Mahalanobis-prototype joint metric significantly enhances decision boundaries, and the multi-kernel design offers the best trade-off between feature diversity and generalization. The results provide strong empirical evidence that our architectural choices are both well-motivated and synergistic.

5 Conclusion

In this paper, we proposed a physics-guided multimodal few-shot learning framework that addresses the challenges of data scarcity, robustness, and interpretability in intelligent bearing diagnosis. Unlike most existing purely data-driven methods, our framework jointly leverages a 1D vibration encoder and a 2D time–frequency encoder to capture complementary temporal–spectral features. First, we designed an Attention-Driven Adaptive Laplace Wavelet Convolution (AALWC) module, which embeds prior knowledge from signal processing into the network, enabling the extraction of discriminative and physically interpretable features. By incorporating parametric multi-kernel wavelets with channel-wise attention, the model adaptively selects relevant bases and achieves a balance between physical priors and data-driven flexibility. Second, we introduced a Mahalanobis–Prototype Joint Metric, which explicitly models intra-class variability and constructs more consistent and accurate decision boundaries under few-shot conditions. Comprehensive experiments on the CWRU and PU datasets, covering both single-condition and multi-condition scenarios, demonstrate that the proposed approach consistently outperforms state-of-the-art baselines. The gains are particularly significant in extreme few-shot regimes and under complex operating conditions, highlighting the framework’s effectiveness, robustness, and interpretability. In future work, we plan to further enhance adaptability by developing dynamic wavelet kernel selection strategies, extend the framework to multi-sensor fusion and cross-domain adaptation, and explore its deployment in more diverse and highly dynamic industrial environments.

Acknowledgement: This work was supported by the Youth Scientific Research Fund Project Plan of Anhui Institute of Information Technology.

Funding Statement: This research was funded by Yong Hu, grant number 23QNJJKJ010.

Author Contributions: The authors confirm contributions to the paper as follows: Conceptualization and methodology, Yong Hu, Weifan Xu; data curation and investigation, Yong Hu, Xiangtong Du; writing-original draft preparation, Yong Hu, Weifan Xu; writing-review and editing, Yong Hu, Weifan Xu, Xiangtong Du. All authors reviewed and approved the final version of the manuscript.

Availability of Data and Materials: Data available on request from the authors.

Ethics Approval: Not applicable.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Lu N, Yin T. Transferable common feature space mining for fault diagnosis with imbalanced data. Mech Syst Signal Process. 2021;156:107645. doi:10.1016/j.ymssp.2021.107645. [Google Scholar] [CrossRef]

2. Huo J, Qi C, Li C, Wang N. Data augmentation fault diagnosis method based on residual mixed self-attention for rolling bearings under imbalanced samples. IEEE Trans Instrum Meas. 2023;72:1–14. doi:10.1109/tim.2023.3311062. [Google Scholar] [CrossRef]

3. Ding J, Wang Z, Yao L, Cai Y. Rolling bearing fault diagnosis based on gcmwpe and parameter optimization svm. China Mech Eng. 2021;32(2):147. [Google Scholar]

4. Rudsari FN, Razi-Kazemi AA, Shoorehdeli MA. Fault analysis of high-voltage circuit breakers based on coil current and contact travel waveforms through modified svm classifier. IEEE Trans Power Deliv. 2019;34(4):1608–18. doi:10.1109/tpwrd.2019.2915110. [Google Scholar] [CrossRef]

5. Yang Q, Liao Y. A novel mechanical fault diagnosis for high-voltage circuit breakers with zero-shot learning. Expert Syst Appl. 2024;245(8):123133. doi:10.1016/j.eswa.2023.123133. [Google Scholar] [CrossRef]

6. Ma S, Chen M, Wu J, Wang Y, Jia B, Jiang Y. High-voltage circuit breaker fault diagnosis using a hybrid feature transformation approach based on random forest and stacked autoencoder. IEEE Trans Ind Electron. 2018;66(12):9777–88. doi:10.1109/tie.2018.2879308. [Google Scholar] [CrossRef]

7. Zhao X, Jia M. Fault diagnosis of rolling bearing based on feature reduction with global-local margin fisher analysis. Neurocomputing. 2018;315:447–64. doi:10.1016/j.neucom.2018.07.038. [Google Scholar] [CrossRef]

8. Zhao K, Xiao J, Li C, Xu Z, Yue M. Fault diagnosis of rolling bearing using CNN and PCA fractal based feature extraction. Measurement. 2023;223:113754. doi:10.2139/ssrn.4400906. [Google Scholar] [CrossRef]

9. Ma J, Cai W, Shan Y, Xia Y, Zhang R. An integrated framework for bearing fault diagnosis: convolutional neural network model compression through knowledge distillation. IEEE Sens J. 2024;24(23):40083–95. doi:10.1109/jsen.2024.3481298. [Google Scholar] [CrossRef]

10. Li C, Xu J, Xing J. A frequency feature extraction method based on convolutional neural network for recognition of incipient fault. IEEE Sens J. 2023;24(1):564–72. doi:10.1109/jsen.2023.3334037. [Google Scholar] [CrossRef]

11. Han S, Sun S, Zhao Z, Luan Z, Niu P. Deep residual multiscale convolutional neural network with attention mechanism for bearing fault diagnosis under strong noise environment. IEEE Sens J. 2024;24(6):9073–81. doi:10.1109/jsen.2023.3345400. [Google Scholar] [CrossRef]

12. Chen L, Ma Y, Hu H, Khan US. An effective fault diagnosis approach for bearing using stacked de-noising auto-encoder with structure adaptive adjustment. Measurement. 2023;214:112774. doi:10.1016/j.measurement.2023.112774. [Google Scholar] [CrossRef]

13. Tian J, Jiang Y, Zhang J, Luo H, Yin S. A novel data augmentation approach to fault diagnosis with class-imbalance problem. Reliab Eng Syst Saf. 2024;243(4):109832. doi:10.1016/j.ress.2023.109832. [Google Scholar] [CrossRef]

14. Gao S, Xu L, Zhang Y, Pei Z. Rolling bearing fault diagnosis based on ssa optimized self-adaptive dbn. ISA Trans. 2022;128(1):485–502. doi:10.1016/j.isatra.2021.11.024. [Google Scholar] [PubMed] [CrossRef]

15. Zhao H, Yang X, Chen B, Chen H, Deng W. Bearing fault diagnosis using transfer learning and optimized deep belief network. Meas Sci Technol. 2022;33(6):065009. doi:10.1088/1361-6501/ac543a. [Google Scholar] [CrossRef]

16. An Z, Li S, Wang J, Jiang X. A novel bearing intelligent fault diagnosis framework under time-varying working conditions using recurrent neural network. ISA Trans. 2020;100:155–70. doi:10.1016/j.isatra.2019.11.010. [Google Scholar] [PubMed] [CrossRef]

17. Shi Y, Deng A, Deng M, Xu M, Liu Y, Ding X, et al. Domain augmentation generalization network for real-time fault diagnosis under unseen working conditions. Reliab Eng Syst Saf. 2023;235(4):109188. doi:10.1016/j.ress.2023.109188. [Google Scholar] [CrossRef]

18. Fan Z, Xu Q, Jiang C, Ding SX. Deep mixed domain generalization network for intelligent fault diagnosis under unseen conditions. IEEE Trans Ind Electron. 2023;71(1):965–74. doi:10.1109/tie.2023.3243293. [Google Scholar] [CrossRef]

19. Li Q, Chen L, Kong L, Wang D, Xia M, Shen C. Cross-domain augmentation diagnosis: an adversarial domain-augmented generalization method for fault diagnosis under unseen working conditions. Reliab Eng Syst Saf. 2023;234:109171. doi:10.1016/j.ress.2023.109171.

20. Chen Z, Xia J, Li J, Chen J, Huang R, Jin G, et al. Generalized open-set domain adaptation in mechanical fault diagnosis using multiple metric weighting learning network. Adv Eng Inform. 2023;57(3):102033. doi:10.1016/j.aei.2023.102033.

21. Li J, Huang R, Chen Z, He G, Gryllias KC, Li W. Deep continual transfer learning with dynamic weight aggregation for fault diagnosis of industrial streaming data under varying working conditions. Adv Eng Inform. 2023;55(10):101883. doi:10.1016/j.aei.2023.101883.

22. Jiang F, Lin W, Wu Z, Zhang S, Chen Z, Li W. Fault diagnosis of gearbox driven by vibration response mechanism and enhanced unsupervised domain adaptation. Adv Eng Inform. 2024;61(11):102460. doi:10.1016/j.aei.2024.102460.

23. Wang H, Li C, Ding P, Li S, Li T, Liu C, et al. A novel transformer-based few-shot learning method for intelligent fault diagnosis with noisy labels under varying working conditions. Reliab Eng Syst Saf. 2024;251:110400. doi:10.1016/j.ress.2024.110400.

24. Li J, Yue K, Wu Z, Jiang F, Zhong Z, Zhang S, et al. Mtsnn: a few-shot fine-grained diagnosis framework for cross-machine fault diagnosis with heterogeneous fault categories. IEEE Trans Instrum Meas. 2025;74(6):1–11. doi:10.1109/tim.2025.3551907.

25. Liu Z, Peng Z. Few-shot bearing fault diagnosis by semi-supervised meta-learning with graph convolutional neural network under variable working conditions. Measurement. 2025;240:115402. doi:10.1016/j.measurement.2024.115402.

26. Liu Z-H, Jiang L-B, Wei H-L, Chen L, Li X-H. Optimal transport-based deep domain adaptation approach for fault diagnosis of rotating machine. IEEE Trans Instrum Meas. 2021;70:1–12. doi:10.1109/tim.2021.3050173.

27. Yang B, Lei Y, Li X, Li N. Targeted transfer learning through distribution barycenter medium for intelligent fault diagnosis of machines with data decentralization. Expert Syst Appl. 2024;244:122997. doi:10.1016/j.eswa.2023.122997.

28. Li Z, Ma J, Wu J, Wong PK, Wang X, Li X. A gated recurrent generative transfer learning network for fault diagnostics considering imbalanced data and variable working conditions. IEEE Trans Neural Netw Learn Syst. 2025;36(8):13782–93. doi:10.1109/tnnls.2024.3362687.

29. Ding Y, Jia M, Zhuang J, Cao Y, Zhao X, Lee C-G. Deep imbalanced domain adaptation for transfer learning fault diagnosis of bearings under multiple working conditions. Reliab Eng Syst Saf. 2023;230:108890. doi:10.1016/j.ress.2022.108890.

30. Li Y, Ren Y, Zheng H, Deng Z, Wang S. A novel cross-domain intelligent fault diagnosis method based on entropy features and transfer learning. IEEE Trans Instrum Meas. 2021;70:1–14. doi:10.1109/tim.2021.3122742.

31. Fan C, Zhang Y, Ma H, Ma Z, Yu K, Zhao S, et al. A novel metric-based model with the ability of zero-shot learning for intelligent fault diagnosis. Eng Appl Artif Intell. 2024;129:107605. doi:10.1016/j.engappai.2023.107605.

32. Xue L, Lei C, Jiao M, Shi J, Li J. Rolling bearing fault diagnosis method based on self-calibrated coordinate attention mechanism and multi-scale convolutional neural network under small samples. IEEE Sens J. 2023;23(9):10206–14. doi:10.1109/jsen.2023.3260208.

33. Liu J, Zhu J, Bai W, Zhang H, Wu L, Zhou T, et al. A multi-modal lightweight transformer for bearing fault diagnosis under high-noise industrial IoT environments. IEEE Internet Things J. 2026;13(2):3552–67. doi:10.1109/jiot.2025.3634730.

34. Vu M-H, Nguyen V-Q, Tran T-T, Pham V-T, Lo M-T. Few-shot bearing fault diagnosis via ensembling transformer-based model with Mahalanobis distance metric learning from multiscale features. IEEE Trans Instrum Meas. 2024;73:1–18. doi:10.1109/tim.2024.3381270.

35. Shen H, Zhao D, Wang L, Liu Q. Bearing fault diagnosis based on prototypical network. In: International Conference on Mechatronics Engineering and Artificial Intelligence (MEAI 2022); 2022 Nov 11–13; Changsha, China. p. 125960D.

36. Yan S, Shao H, Long Z, Liu B. Incremental few-shot fault diagnosis with cosine-represent real-time evolved network of transmission systems. IEEE Trans Transp Electrif. 2025;11(4):9544–54. doi:10.1109/tte.2025.3563630.

37. Li W, Xu J, Huo J, Wang L, Gao Y, Luo J. Distribution consistency based covariance metric networks for few-shot learning. Proc AAAI Conf Artif Intell. 2019;33(1):8642–9. doi:10.1609/aaai.v33i01.33018642.

38. Vu M-H, Pham V-T. Mixerformer-covariance metric neural network: a new few-shot learning model for bearing fault diagnosis. In: 2023 12th International Conference on Control, Automation and Information Sciences (ICCAIS); 2023 Nov 27–29; Hanoi, Vietnam. p. 639–44.

39. Wang X, Wang X, Jiang B, Luo B. Few-shot learning meets transformer: unified query-support transformers for few-shot classification. IEEE Trans Circuits Syst Video Technol. 2023;33(12):7789–802. doi:10.1109/tcsvt.2023.3282777.

Cite This Article

APA Style

Hu, Y., Xu, W., & Du, X. (2026). LWCNet: A Physics-Guided Multimodal Few-Shot Learning Framework for Intelligent Fault Diagnosis. Computers, Materials & Continua, 87(2), 67. https://doi.org/10.32604/cmc.2026.074437

Vancouver Style

Hu Y, Xu W, Du X. LWCNet: A Physics-Guided Multimodal Few-Shot Learning Framework for Intelligent Fault Diagnosis. Comput Mater Contin. 2026;87(2):67. https://doi.org/10.32604/cmc.2026.074437

IEEE Style

Y. Hu, W. Xu, and X. Du, “LWCNet: A Physics-Guided Multimodal Few-Shot Learning Framework for Intelligent Fault Diagnosis,” Comput. Mater. Contin., vol. 87, no. 2, p. 67, 2026. https://doi.org/10.32604/cmc.2026.074437

Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
