Open Access
ARTICLE
Hierarchical Cyber–Physical Symbiosis with Bidirectional State Space Modeling for IIoT Anomaly Diagnosis
1 School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
2 College of Electronic and Optical Engineering and College of Flexible Electronics (Future Technology), Nanjing University of Posts and Telecommunications, Nanjing, China
* Corresponding Author: Jianfei Chen. Email:
(This article belongs to the Special Issue: Attention Mechanism-based Complex System Pattern Intelligent Recognition and Accurate Prediction)
Computers, Materials & Continua 2026, 88(1), 27 https://doi.org/10.32604/cmc.2026.079644
Received 25 January 2026; Accepted 03 March 2026; Issue published 08 May 2026
Abstract
As 6G-enabled Industrial Internet of Things (IIoT) evolves, green and sustainable industrial monitoring increasingly relies on edge AI to deliver low-latency diagnosis under tight resource constraints. Industrial cyber–physical systems depend on heterogeneous sensing and communication infrastructures, where network-side attacks can propagate into physical processes and appear as coupled anomalies. Reliable diagnosis therefore requires joint learning from time-synchronized cyber and physical telemetry rather than modeling them as independent signals. This paper develops Cyber–Physical Symbiosis Network (CPSNet), a model designed for edge-AI deployment with a dual-stream architecture for fixed-window multiclass cross-domain anomaly diagnosis in IIoT. CPSNet encodes each modality into hierarchical multi-resolution features and refines them with a Multi-Scale Bidirectional-Recursive (MSBR) block. MSBR couples multi-kernel temporal convolutions with a gated bidirectional state space pathway, capturing transient irregularities while retaining long-range context within the window. Cross-modal dependency is injected at every scale by a symbiosis module that performs bidirectional channel-wise gating and holistic state space fusion to learn unified cross-modal dynamics efficiently. A compact multi-scale pooling head with auxiliary modality supervision preserves discriminative evidence in both streams. On the DataSense benchmark, CPSNet achieves 97.18% Accuracy and 99.04% AUC on Multiclass-8, and 89.07% Accuracy and 94.28% AUC on Multiclass-50, showing consistent improvements over single-modality and multi-modal baselines. Ablation and efficiency analyses further suggest complementary gains from multi-scale refinement and explicit coupling with a favorable accuracy–runtime trade-off. These results suggest that hierarchical cross-modal coupling with state space temporal modeling can improve robust, fine-grained IIoT diagnosis for 6G edge-AI monitoring.
Green and sustainable 6G-enabled Industrial Internet of Things (IIoT) is expected to support massive connectivity, ultra-reliable low-latency communication, and pervasive sensing for industrial automation. At the same time, it must meet stringent energy and carbon constraints through edge intelligence and efficient data-plane operation [1]. IIoT infrastructures have become a core substrate for industrial automation, enabling fine-grained monitoring and closed-loop control via heterogeneous sensing, networking, and edge/cloud computing. In such 6G IIoT settings, anomaly diagnosis is increasingly deployed as an edge AI service to reduce backhaul traffic and operational energy while preserving real-time responsiveness. However, increased connectivity and protocol diversity also enlarge the attack surface and tighten the coupling between cyber incidents and physical processes [2]. As a result, industrial monitoring is increasingly confronted with cyber–physical anomaly diagnosis: abnormal behaviors may be manifested in network traffic such as reconnaissance, exploitation, and malware propagation, while their consequences and correlates appear in physical telemetry like process deviations and abnormal device dynamics [3,4]. Accurate diagnosis therefore requires learning from time-synchronized dual-stream observations and reasoning about their coupled temporal evolution rather than treating cyber and physical evidence as independent signals [5].
A growing body of work has explored deep learning for anomaly detection in IIoT and cyber–physical systems using either cyber traffic or physical telemetry. Cyber-only approaches typically rely on packet/flow statistics and protocol indicators to model anomalous traffic patterns, whereas physical-only approaches focus on sensor time series to identify deviations from nominal dynamics [6]. Nevertheless, cyber–physical anomalies are often subtle and temporally structured across modalities: a traffic-side deviation may induce delayed or weak physical responses, and a physical deviation may be preceded by low-intensity cyber probing. This motivates dual-stream models that preserve modality-specific evidence while enabling cross-modal interaction under strict temporal co-registration, which has begun to be supported by aligned benchmarks such as DataSense [2,7].
Despite this progress, current window-based diagnosis pipelines still exhibit practical limitations that constrain reliable and efficient deployment in green 6G IIoT monitoring. Temporal encoding is frequently implemented as either convolution-dominated extraction or global sequence mixing, which makes it difficult to represent transient irregularities and window-level evolution within a compact fixed-window encoder. This gap limits discrimination when anomaly evidence is distributed across multiple time scales and needs to be captured without deep stacks or heavy attention. Besides temporal modeling, cross-modal dependency is commonly introduced by late fusion or coarse interaction, which provides limited capability for one modality to recalibrate the other as context under heterogeneous noise and missingness. Such weak coupling can degrade robustness when discriminative evidence is dominant in a single stream and requires cross-stream confirmation. Moreover, cross-modal coupling is often applied at a single resolution, while cyber–physical correlations can be scale-dependent and may emerge differently across shallow and deep representations. Without injecting coupling along the hierarchy, complementary evidence can be diluted before aggregation, reducing the benefit of dual-stream observations under fine-grained label settings.
To address these gaps, this paper proposes Cyber–Physical Symbiosis Network (CPSNet), a dual-stream, multi-scale framework for cyber–physical anomaly diagnosis over synchronized windowed sequences. CPSNet strengthens per-stream temporal modeling under fixed windows by integrating multi-kernel local extraction with window-level state space contextualization and explicit gating, enabling compact representation of mixed time-scale signatures. It further enforces cyber–physical dependency through bidirectional cross-context alignment and holistic state space fusion, so that each modality can be calibrated by the other under a shared window-level context. In addition, CPSNet injects cyber–physical coupling across multiple encoder scales before scale-wise pooling, which preserves complementary evidence across resolutions and stabilizes fine-grained diagnosis. A three-term objective then combines task supervision with modality-specific auxiliary losses to preserve discriminative evidence in both branches and stabilize cross-modal learning. Importantly, CPSNet exploits bidirectional contextual dependencies within each fixed window to strengthen representation learning, without claiming strict online causal constraints. The contributions of this paper are summarized as follows:
• A Multi-Scale Bidirectional-Recursive (MSBR) block is proposed for single-modality temporal modeling, which couples a multi-scale convolution bank with a state space context via an explicit gating mechanism inside a residual structure, enabling efficient sequence modeling over fixed windows while capturing both transient irregularities and window-level long-range evolution.
• A Cyber–Physical Symbiosis (CPS) module is developed to strengthen cross-modal dependency modeling, consisting of symmetric Cyber–Physical Alignment (CPA) blocks for bidirectional cross-context gating and a Holistic State Space Fusion (HSSF) block with Bi-Mamba-based interaction and modality-wise gated injections for controlled fusion.
• A Multi-Scale Feature Aggregation Head is designed to inject CPS at each encoder scale before global pooling, followed by a compact multi-layer perceptron (MLP) classifier, and the training objective integrates task loss with cyber/physical auxiliary supervision to encourage modality-wise discriminativeness alongside fused diagnosis.
The remainder of this paper is organized as follows. Section 2 reviews recent deep learning approaches for IIoT intrusion/anomaly detection and cyber–physical representation learning, with an emphasis on temporal modeling and state space-based sequence encoders. Section 3 formulates the cyber–physical anomaly diagnosis problem and describes the DataSense benchmark. Section 4 details the proposed CPSNet, including MSBR, CPS (CPA and HSSF), the multi-scale aggregation head, and the loss function. Section 5 presents experimental settings and quantitative results, followed by ablation and analysis. Section 6 discusses the advantages of CPSNet in light of the experimental results. Section 7 concludes the paper and outlines future directions.
2.1 Deep Learning for Cyber-Physical Anomaly Diagnosis
Cyber–physical anomaly diagnosis in IIoT systems aims to identify and categorize abnormal behaviors by jointly exploiting time-synchronized cyber observations such as packet/flow statistics and protocol indicators and physical observations like sensor/Message Queuing Telemetry Transport (MQTT) telemetry and device-state logs [8]. Compared with single-stream settings, the key difficulty is that anomaly evidence can be distributed across modalities and manifest with different temporal characteristics, which motivates dual-stream learning frameworks with explicit cross-modal interaction under strict window-level co-registration, as supported by aligned benchmarks such as DataSense [2].
Existing studies span cyber-only intrusion/anomaly detection using Convolutional Neural Network (CNN)/Recurrent Neural Network (RNN)/Transformer variants on traffic-derived features [5,6,9] and physical-side fault/anomaly diagnosis using deep feature extractors for sensor time series under non-stationary regimes [10–12]. Recent cyber–physical approaches further integrate both streams through hybrid pipelines or structured representations, including hybrid monitoring that combines system-state and traffic cues [7] and graph-based CPS anomaly/intrusion analysis [13]. Overall, these works suggest that strong within-modality temporal modeling and explicit cross-modal coupling are both essential. Simple late concatenation or decision-level fusion can be insufficient when discriminative evidence is unevenly distributed across modalities.
2.2 State Space Models and Mamba
State space models (SSMs) parameterize sequence dynamics through latent state evolution and readout, offering linear-time recurrence and strong inductive bias for long-range dependency modeling. Structured SSMs enable efficient implementations that scale favorably with sequence length compared with attention-based Transformers, making them attractive for industrial temporal data with long-horizon dependencies.
Representative advances include HiPPO-based history compression [14], S4 structured state spaces [15], and simplified variants such as S5 [16]. Mamba further introduces selective state spaces with input-dependent parameterization, enabling content-adaptive sequence modeling while retaining linear-time scaling [17]. Related engineering efforts on IO efficiency, such as FlashAttention, contextualize practical trade-offs between attention and alternative long-sequence backbones [18,19]. Recent extensions apply Mamba-style designs to multivariate time series and robust classification settings [20], indicating that SSM/Mamba backbones are well-suited for window-based cyber–physical diagnosis where efficient long-context modeling is beneficial.
The related models reviewed in Section 2 are summarized in Table 1.

DataSense is collected from a realistic IIoT testbed comprising diverse industrial sensors and common IoT devices interconnected through a dual-band Wi-Fi access point, a managed switch, and a centralized MQTT broker hosted on a Raspberry Pi [2]. DataSense provides two time-synchronized data sources collected from the same IIoT testbed: packet traces captured via continuous monitoring (cyber stream), and IIoT sensor logs collected from the MQTT broker and indexed in the logging backend (physical stream) [2]. Let
For each
Benign data are recorded under normal device operation without interference, producing 12 h of benign traffic. For evaluation, a one-hour benign subset is selected, with an initial 5-min profiling segment reserved for device profiling and excluded from the evaluation set. Anomalous data are produced through controlled execution of 50 realistic attacks spanning seven major categories: reconnaissance (Recon), denial of service (DoS), distributed denial of service (DDoS), web exploitation (Web), man-in-the-middle (MITM) and spoofing, Bruteforce, and malware (Mirai).
This work studies cyber–physical anomaly diagnosis as supervised classification over aligned cyber–physical sequences under a fixed-window setting. Given a history length L, the input to CPSNet at time
The learning objective is to estimate a parametric mapping
where
Table 2 summarizes the key notations used in this paper.

This work targets cyber–physical anomaly diagnosis from synchronized cyber–physical observations, where network-traffic streams and physical-sensor streams exhibit both modality-specific temporal patterns and strong cross-modal dependency. The proposed CPSNet shown in Fig. 1 follows a dual-stream design: each modality is first encoded into a hierarchical set of multi-resolution features, and the resulting multi-scale representations are then fused and aggregated into a compact representation for anomaly-label prediction. Two principles guide the design. First, temporal modeling should be efficient and lightweight for fixed-window diagnosis and practical deployment. Second, cyber and physical cues should be coupled explicitly rather than being merged only at the end, so that cross-modal context can calibrate modality-specific evidence at multiple scales. In particular, CPSNet operates on a fixed-length window and is allowed to exploit bidirectional contextual dependencies within the window to strengthen representation learning.

Figure 1: Architecture of the proposed CPSNet for cyber–physical anomaly diagnosis.
Given aligned windowed inputs from the cyber stream and the physical stream, the CPSNet encoder produces three scale-specific feature pairs
Formally, the end-to-end inference can be summarized as
where
4.2 Multi-Scale Bidirectional-Recursive Block
The MSBR block is a single-modality temporal modeling unit applied to each stream. The term "recursive" refers to the latent state update of the bidirectional state space pathway, which is computed as a discrete-time recurrence. The coupling between the local and contextual paths, by contrast, is realized by gated residual fusion rather than by an additional recursion across branches. The design targets the coexistence of transient local irregularities and long-range system evolution in cyber–physical anomaly diagnosis sequences. As shown in Fig. 2, MSBR adopts a two-branch structure: a parallel multi-scale convolution path for local pattern extraction and a state space context path for window-level contextualization. The two paths are coupled through a subtractive context interaction followed by a sigmoid gate inside a residual structure.

Figure 2: The proposed MSBR block.
Temporal Convolutional Networks (TCNs) and other purely convolutional temporal encoders are efficient for short-lived irregularities, but they often require deep stacks or carefully tuned dilations to cover long-range evolution within a window. This increases complexity and may still under-represent global context. Lightweight Transformer variants offer flexible global mixing, but attention typically scales quadratically with window length and can be sensitive to noisy or missing industrial telemetry. MSBR therefore combines a shallow multi-kernel convolution bank to capture transient local cues with a linear-time bidirectional SSM pathway to model window-level evolution. A subtractive interaction and a sigmoid gate then emphasize deviations from contextual trends in a compact form.
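The subtract-then-gate coupling described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `gated_context_fusion`, the per-channel scalar weights (a diagonal stand-in for the learnable linear mappings), and the assumption that the local and contextual features already share the same shape are all hypothetical simplifications made for illustration.

```python
import math


def sigmoid(v):
    """Logistic function used for the gate."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_context_fusion(x, local, context, w_gate, b_gate, w_out):
    """Gated residual fusion of a local and a contextual feature sequence.

    x, local, context: lists of T time steps, each a list of C floats.
    w_gate, b_gate, w_out: per-channel scalars (hypothetical diagonal
    stand-ins for the learnable linear mappings in the text).
    """
    out = []
    for x_t, l_t, g_t in zip(x, local, context):
        row = []
        for c in range(len(x_t)):
            # Gate is generated from the contextual sequence.
            gate = sigmoid(w_gate[c] * g_t[c] + b_gate[c])
            # Subtract context from the local feature, then modulate.
            fused = (l_t[c] - g_t[c]) * gate
            # Project back (here: a scalar) and add the residual input.
            row.append(x_t[c] + w_out[c] * fused)
        out.append(row)
    return out
```

Under this sketch, a time step whose local response equals its contextual trend contributes nothing beyond the residual input, which is one way to read the claim that the block emphasizes deviations from context.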
Let the input feature tensor be
where
Three 1D convolutions with different kernel sizes are applied to
and the outputs are concatenated along the channel dimension and activated to form the local representation:
The global branch applies a bidirectional state space operator [21] to
As illustrated in Fig. 3, Bi-Mamba instantiates two Mamba-style selective state space branches that scan the fused sequence in opposite temporal directions. The module first produces a content stream and a multiplicative gate through two linear projections, where the gate is activated by a sigmoid. The content stream is then locally mixed by a lightweight 1D convolution and passed to a selective SSM, yielding a direction-specific latent sequence. The reverse-direction branch operates on a temporally flipped copy of the input and flips the resulting output back to align with the original order.

Figure 3: Bi-Mamba architecture.
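The flip, scan, flip-back pattern of the bidirectional pathway can be sketched as follows. This is a simplified illustration only: a fixed-coefficient linear recurrence stands in for the selective (input-dependent) SSM, the content/gate projections and the 1D convolution are omitted, and combining the two directions by summation is an assumption, since the text does not state how the directional outputs are merged. The names `linear_scan` and `bidirectional_scan` and the `decay`/`gain` parameters are hypothetical.

```python
def linear_scan(seq, decay=0.9, gain=0.1):
    """Discrete-time recurrence h_t = decay * h_{t-1} + gain * x_t.

    A fixed-coefficient stand-in for a selective state space scan;
    runs in time linear in the sequence length.
    """
    h, out = 0.0, []
    for x_t in seq:
        h = decay * h + gain * x_t
        out.append(h)
    return out


def bidirectional_scan(seq, decay=0.9, gain=0.1):
    """Scan forward and backward, then merge the aligned outputs."""
    forward = linear_scan(seq, decay, gain)
    # Reverse branch: flip the input, scan, flip the output back
    # so that it aligns with the original temporal order.
    backward = linear_scan(seq[::-1], decay, gain)[::-1]
    # Summation is an assumption made for this sketch.
    return [f + b for f, b in zip(forward, backward)]
```

With this construction, every position in the window receives context from both earlier and later time steps, which is why the operator suits fixed-window diagnosis but not strictly causal online inference.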
A sigmoid gate is generated from the contextual sequence via a learnable linear mapping:
Following the interaction shown in Fig. 2, MSBR first forms a context-compensated feature by subtracting the contextual sequence from the local representation, and then applies the gate by element-wise modulation:
The final output is obtained by projecting back to the input width and adding a residual connection:
For reproducibility, the key tensor shapes are:
4.3 Cyber–Physical Symbiosis Module
The CPS module shown in Fig. 4 operates on synchronized dual-stream inputs, including the cyber stream

Figure 4: The proposed CPS module (lower part) and the CPA block (upper part, blue background).
Let
4.3.1 Cyber–Physical Alignment Block
The CPA block shown in the upper part of Fig. 4 enhances a main modality using the other modality as cross-modal context via channel-wise gating. For a main input
In CPS, the same definition is instantiated twice with swapped roles to obtain
4.3.2 Holistic State Space Fusion Block
The HSSF block shown in Fig. 5 fuses

Figure 5: The proposed HSSF block.
Each modality is first normalized and locally encoded by a linear projection and a 1D convolution with kernel size
where
The additive term aligns correlated trends across modalities. The element-wise interaction emphasizes co-occurring deviations and suppresses uncorrelated fluctuations. Constructing
The joint state is modeled by a bidirectional state space operator Bi-Mamba to produce a shared latent sequence:
Bi-Mamba follows the bidirectional selective state space construction described in Fig. 3, providing linear-time window contextualization in both temporal directions and yielding a single shared state that summarizes coupled evolution.
Conditioned on M, each modality contributes a gated residual injection. Specifically, a residual branch is obtained from the corresponding normalized input, and a modality-specific gate is generated from M:
The shared-state gates regulate how each modality updates the fused representation under the same contextual reference. This mitigates dominance from a single stream and preserves modality-specific cues.
To validate the necessity of the fusion operator and shared-state gated injections, a parameter-matched alternative replaces Bi-Mamba with a two-layer gated temporal convolution using the same width D and matching pointwise projections, and replaces
4.4 Multi-Scale Feature Aggregation Head
The Multi-Scale Feature Aggregation Head maps the hierarchical encoder outputs into a compact representation for cyber–physical anomaly diagnosis. As shown in the right part of Fig. 1, three multi-resolution fused features
For each scale, GAP is applied along the temporal axis:
The pooled vectors are concatenated to form the aggregated representation:
Finally, a two-layer projection with SiLU and dropout produces the anomaly-diagnosis logits and the predicted distribution:
By aggregating multi-scale representations via scale-wise pooling and concatenation, the head preserves complementary evidence across resolutions, enabling the classifier to jointly leverage transient signatures from shallow features and longer-horizon dynamics from deeper features within a single diagnostic representation.
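The scale-wise pooling and concatenation steps can be sketched in a few lines of pure Python. The function names `global_average_pool` and `aggregate_scales` are hypothetical, and the subsequent two-layer projection with SiLU and dropout is omitted; the sketch only shows how features with different temporal lengths collapse into one fixed-size vector.

```python
def global_average_pool(feat):
    """Average a feature map over the temporal axis.

    feat: list of T time steps, each a list of C floats.
    Returns a list of C channel means.
    """
    steps = len(feat)
    return [sum(channel) / steps for channel in zip(*feat)]


def aggregate_scales(fused_scales):
    """Pool each scale over time and concatenate the pooled vectors.

    fused_scales: list of per-scale feature maps; the temporal length
    and channel width may differ from scale to scale.
    """
    pooled = []
    for feat in fused_scales:
        pooled.extend(global_average_pool(feat))
    return pooled
```

Because pooling removes the temporal axis before concatenation, the classifier input width depends only on the channel widths of the scales, not on the window length at each resolution.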
The training objective consists of three components, targeting cyber–physical anomaly diagnosis and the two modality-specific cyber and physical branches. Let
The task loss supervises the final prediction produced from the fused representation:
To explicitly enforce modality-wise discriminativeness, two lightweight auxiliary classifiers are attached to the cyber and physical branches using the corresponding modality-specific features from the fusion module. Let
The overall objective is a weighted sum:
This formulation ensures that the fused head is directly optimized for the target diagnosis, while each modality branch is simultaneously encouraged to preserve task-relevant evidence instead of relying solely on cross-modal fusion.
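A minimal sketch of such a three-term objective is given below. It assumes that all three terms are cross-entropy losses against the same label and that the task term carries unit weight; the weight names `lam_cyber` and `lam_phys` and their default values are hypothetical placeholders, not values reported in the paper.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def total_loss(fused_logits, cyber_logits, phys_logits, label,
               lam_cyber=0.5, lam_phys=0.5):
    """Task loss plus weighted cyber and physical auxiliary losses.

    lam_cyber and lam_phys are hypothetical weights for illustration.
    """
    task = cross_entropy(fused_logits, label)
    aux_cyber = cross_entropy(cyber_logits, label)
    aux_phys = cross_entropy(phys_logits, label)
    return task + lam_cyber * aux_cyber + lam_phys * aux_phys
```

Setting both auxiliary weights to zero recovers plain fused-head training, which makes the effect of the modality-specific supervision easy to isolate in an ablation.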
5.1 Data Preprocessing and Alignment
This subsection instantiates the window-level features
On the cyber side, each packet in
On the physical side, each sensor log record in
If a window contains no cyber events or no physical events, the corresponding vector is set to a valid neutral placeholder by filling the numerical fields with zeros and using a special missing-modality index in the categorical fields. A binary modality-availability indicator is appended so that CPSNet can distinguish true low-activity patterns from windows where an entire stream is absent. All numerical fields are normalized using statistics computed on the training split only to avoid leakage. For a numerical feature
where
Finally, model-consistent sequence tensors are assembled by stacking L consecutive windows to obtain
which enforces that all cross-stream fusion modules operate on time-registered cyber and physical representations at every resolution. The resulting per-window vectors have dimensions
Unless otherwise stated, sequences are constructed with
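The leakage-free normalization and the missing-modality placeholder described in this subsection can be sketched as follows. The sketch assumes z-score normalization (the text only states that training-split statistics are used), and the names `fit_stats`, `normalize`, `window_vector`, and `MISSING_INDEX` are hypothetical.

```python
MISSING_INDEX = 0  # hypothetical reserved categorical index for an absent stream


def fit_stats(train_values):
    """Mean and standard deviation computed on the training split only."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n
    return mean, var ** 0.5


def normalize(value, mean, std, eps=1e-8):
    """Z-score with a small epsilon to avoid division by zero (assumed form)."""
    return (value - mean) / (std + eps)


def window_vector(numeric, categorical, n_numeric, n_categorical):
    """Build one per-window vector with a trailing availability flag.

    Passing numeric=None marks a window in which the stream is absent:
    numerical fields become zeros, categorical fields take MISSING_INDEX,
    and the availability flag is 0 instead of 1.
    """
    if numeric is None:
        return [0.0] * n_numeric + [MISSING_INDEX] * n_categorical + [0]
    return list(numeric) + list(categorical) + [1]
```

The statistics returned by `fit_stats` would be computed once on training windows and then reused unchanged for validation and test windows, which is what prevents leakage.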
All models are implemented in PyTorch and trained on a single NVIDIA RTX 3090 GPU. The AdamW optimizer is adopted with an initial learning rate of
Performance is evaluated using Accuracy (Acc), Precision (Prec), and Recall (Rec). For multiclass scenarios, area under the receiver operating characteristic curve (AUC) is also computed using the macro-average one-vs-rest strategy, which averages the AUC of each class against all others so that minority and majority classes are treated equally. Unless otherwise stated, all reported test results correspond to the average over five independent runs with different random seeds.
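As an illustration of the macro-average one-vs-rest AUC, the following pure-Python sketch computes each class's AUC from pairwise score comparisons (the Mann-Whitney formulation, with ties counted as one half) and then takes the unweighted mean. The function names are hypothetical, the pairwise loop is quadratic and intended only for small examples, and the sketch assumes every class appears at least once in the evaluated labels.

```python
def binary_auc(scores, positives):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(prob_rows, labels):
    """Unweighted mean of per-class one-vs-rest AUC values.

    prob_rows: one list of class probabilities per sample.
    labels: integer class index per sample.
    """
    n_classes = len(prob_rows[0])
    aucs = []
    for k in range(n_classes):
        scores = [row[k] for row in prob_rows]
        aucs.append(binary_auc(scores, [y == k for y in labels]))
    return sum(aucs) / n_classes
```

Because each class contributes equally to the mean regardless of its sample count, a rare attack class with poor ranking lowers the score as much as a frequent one would.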
Table 3 and Fig. 6 report the comparison results on the DataSense benchmark under two label granularities, namely Multiclass-8 and Multiclass-50. The evaluation includes representative baselines that cover single-modality modeling and multi-modal fusion. CNN-BiLSTM is a cyber-stream temporal classifier that models network-side sequential patterns. Deep Convolutional Neural Networks with Wide First-layer Kernel (WDCNN) is a physical-stream baseline that extracts discriminative features from telemetry signals via deep convolutional stacking. CNN-BiLSTM EarlyFusion performs feature-level fusion by concatenating cyber and physical features before temporal encoding. MBConv-ViT combines efficient convolutional feature extraction with global token mixing for improved sequence representation. Hybrid-CPSys adopts a hybrid cyber–physical monitoring design to integrate traffic-side and physical-side cues. IoTGRAF represents graph-based modeling for CPS anomaly diagnosis and captures relational dependencies beyond pure sequence encoders. CPSNet is evaluated under the same dataset partition and metric definitions as the baselines to ensure comparability.


Figure 6: Comparison of CPSNet with representative baselines on the DataSense benchmark under Multiclass-8 and Multiclass-50 settings.
CPSNet achieves the best overall performance across both label granularities. The gains are consistent on Accuracy and AUC, and become larger under Multiclass-50, indicating that the proposed cross-modal coupling and hierarchical temporal modeling better support fine-grained diagnosis where anomaly evidence is weak and distributed across cyber and physical streams.
Fig. 6 shows that the improvement trends are aligned across all reported metrics, suggesting that CPSNet strengthens both classification reliability and ranking quality under different label granularities.
To complement diagnostic accuracy, CPSNet is also compared with representative baselines in terms of computational efficiency in Table 4, including parameter count, Floating Point Operations (FLOPs), end-to-end inference latency, and throughput measured as processed windows per second. This evaluation characterizes the practical deployability of each method under resource and real-time constraints. All efficiency numbers are measured on the same RTX 3090 using batch size 64 under FP32, reporting averaged GPU timing over repeated runs with host-to-device transfer and model forward included while excluding offline preprocessing.

Table 4 indicates that CPSNet maintains moderate parameter count and FLOPs while achieving low latency and high throughput compared with heavier transformer- and graph-based baselines, which facilitates near-real-time deployment without sacrificing accuracy.
Fig. 7 provides complementary evidence for the effectiveness of CPSNet from two perspectives: the confusion matrix in Fig. 7a reveals how errors are distributed across classes, and the t-SNE visualization in Fig. 7b illustrates the geometric structure of the learned representations. Together, they connect classification outcomes with representation quality, clarifying whether the model not only achieves strong accuracy but also learns discriminative, well-structured fused features that align with cyber–physical anomaly semantics.

Figure 7: Confusion matrix (a) and t-SNE result (b) of CPSNet.
The confusion matrix is dominated by diagonal entries, while the t-SNE embedding forms compact clusters with clear separation between normal behavior and attack categories. The remaining confusions are concentrated among a small number of closely related classes, which is consistent with a fused representation that retains class-discriminative cues while reducing spurious cross-modal correlations.
To quantify the contribution of each architectural component in CPSNet, a component-wise ablation is conducted on the DataSense benchmark under both Multiclass-8 and Multiclass-50 settings in Table 5 and Fig. 8. Starting from a single-scale CNN baseline, we selectively enable MSBR, CPA, HSSF, and Bi-Mamba, and evaluate how each module and their combinations affect discrimination performance and ranking capability in terms of Accuracy, Macro-Precision, Macro-Recall, and Macro-AUC.


Figure 8: Ablation study of CPSNet on Multiclass-8 (a) and Multiclass-50 (b).
Table 5 and Fig. 8 show consistent improvements when enabling stronger temporal modeling and cross-modal coupling modules, and the full configuration achieves the best results. The improvements are more pronounced on Multiclass-50, indicating that hierarchical coupling and fusion contribute more under fine-grained label fragmentation.
To further isolate the role of modality usage and fusion strategy, a second ablation evaluates single-modality variants and a dual-stream late-fusion baseline against the proposed CPSNet fusion mechanism in Table 6. This study distinguishes whether the observed improvements primarily come from having access to both modalities, or from how the modalities are coupled and fused.

Table 6 shows that using both modalities is beneficial and that structured coupling further improves over late fusion, with larger AUC gains under Multiclass-50, indicating more reliable ranking under fine-grained diagnosis.
To further investigate the role of multi-scale cyber–physical symbiosis, we conduct an ablation on the CPS outputs from different encoder scales, as summarized in Table 7 and Fig. 9. Specifically,


Figure 9: Ablation on multi-scale CPS outputs on the DataSense benchmark.
Table 7 and Fig. 9 indicate that each scale is effective and multi-scale aggregation yields the best performance, confirming that complementary information exists across temporal resolutions and becomes more valuable under Multiclass-50.
Across comparison and ablation results, the dominant accuracy gains are consistently associated with explicit cyber–physical coupling and hierarchical fusion rather than solely increasing per-stream temporal encoder strength. The margins are larger under Multiclass-50, suggesting that fine-grained diagnosis benefits from multi-scale interaction states that stabilize class separation when inter-class similarity and long-tail effects are amplified. The representation evidence in Fig. 7 is consistent with this trend, where fused features form compact and separable structures and the remaining confusions are concentrated near semantically adjacent categories, indicating that the model learns discriminative coupled dynamics instead of superficial concatenation.
The efficiency evaluation complements the accuracy results by showing that the proposed design achieves competitive latency and throughput with moderate parameter and FLOP budgets compared with heavier fusion backbones. This indicates that the hierarchical coupling and state space based temporal modeling can be deployed under resource constraints while retaining strong diagnostic performance, which is aligned with real-time industrial monitoring requirements where fixed-window inference must meet runtime budgets without compromising fine-grained recognition.
This work proposes CPSNet, an efficient window-based dual-stream framework for cyber–physical anomaly diagnosis that couples synchronized network-traffic and physical-sensor evidence via explicit multi-scale interaction. On the DataSense benchmark, CPSNet achieves strong performance under both label granularities, reaching 97.18% Accuracy and 99.04% AUC for Multiclass-8, and 89.07% Accuracy and 94.28% AUC for Multiclass-50, outperforming representative single-modality and multi-modal baselines. The gains are particularly notable for Multiclass-50, where improved AUC suggests more reliable ranking under higher inter-class similarity and long-tail effects. Component-wise ablations substantiate the design: enhanced per-stream temporal modeling yields consistent improvements, while the largest gains come from explicit cyber–physical coupling and hierarchical fusion. The full configuration performs best, indicating that MSBR-based multi-scale refinement and CPS-based interaction provide complementary benefits rather than redundant capacity. Efficiency evaluation further supports deployability, as CPSNet maintains competitive latency and throughput with a substantially smaller compute footprint than heavier graph- or transformer-based alternatives, offering a favorable accuracy–cost trade-off for window-based monitoring. These properties align with 6G IIoT requirements for sustainable edge-AI monitoring under strict latency and resource budgets.
Future work will center on digital-twin-driven industrial security and intelligence, where cyber–physical diagnosis acts as a core perception component of continuously updated industrial twins. A key challenge is to maintain scalable synchronization and strict timing alignment across heterogeneous sensing and wireless links, while supporting low-latency and energy-aware edge inference under noisy or partially missing telemetry. Another open issue is uncertainty-aware fusion under changing modality availability and evolving device populations, so that twin state updates remain reliable when streams become intermittent or distribution shifts occur. From the security perspective, digital twins introduce new attack surfaces and demand consistent cyber–physical reasoning to reduce false alarms under benign operational changes and to prevent stealthy attacks from being masked by natural process variation. Addressing these challenges requires attack-aware twin dynamics modeling, continual adaptation to evolving workloads and network configurations, and interpretable cyber–physical evidence that can be mapped to actionable mitigation for safety-critical deployments.
Acknowledgement: Not applicable.
Funding Statement: The authors received no specific funding for this study.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Kelan Wang and Jianfei Chen; methodology, Kelan Wang and Jianfei Chen; software, Kelan Wang; validation, Kelan Wang; formal analysis, Kelan Wang and Jianfei Chen; resources, Jianfei Chen; data curation, Jianfei Chen; writing—original draft preparation, Kelan Wang and Jianfei Chen; visualization, Kelan Wang and Jianfei Chen; supervision, Jianfei Chen; project administration, Jianfei Chen. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: This study used publicly available data from the Canadian Institute for Cybersecurity DataSense: CIC IIoT dataset 2025 repository at https://www.unb.ca/cic/datasets/iiot-dataset-2025.html. The dataset is described in [2]. The designed architecture of CPSNet is available at https://github.com/KelanWang2002/CMC2026CPSNet.
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Wisanwanichthan T, Thammawichai M. A lightweight intrusion detection system for IoT and UAV using deep neural networks with knowledge distillation. Computers. 2025;14(7):291. doi:10.3390/computers14070291.
2. Firouzi A, Dadkhah S, Maret SA, Ghorbani AA. DataSense: a real-time sensor-based benchmark dataset for attack analysis in IIoT with multi-objective feature selection. Electronics. 2025;14(20):4095.
3. Zhang C, Li J, Wang N, Zhang D. Research on intrusion detection method based on transformer and CNN-BiLSTM in internet of things. Sensors. 2025;25(9):2725. doi:10.3390/s25092725.
4. Hu X, Zhang H, Cao J, Huang Y, Zhang X, Wang H, et al. PSRONet: a deep reinforcement learning-based sensor configuration framework in railway point machines fault diagnosis. IEEE Trans Instrum Meas. 2026;75:2500513.
5. Wang J, Si C, Wang Z, Fu Q. A new industrial intrusion detection method based on CNN-BiLSTM. Comput Mater Contin. 2024;79(3):4297–318. doi:10.32604/cmc.2024.050223.
6. Odeh A, Taleb A. Robust network security: a deep learning approach to intrusion detection in IoT. Comput Mater Contin. 2024;81(3):4149–69. doi:10.32604/cmc.2024.058052.
7. He J, Zhang W, Liu X, Liu J, Yang G. Toward intrusion detection of industrial cyber-physical system: a hybrid approach based on system state and network traffic abnormality monitoring. Comput Mater Contin. 2025;84(1):1227–52. doi:10.32604/cmc.2025.064402.
8. Hu X, Jiang C, Huang Y, Peng D, Su H, He Y, et al. SMNet: a novel compositional generalization model for industrial robot multi-joint fault diagnosis. IEEE Internet Things J. 2026:1. doi:10.1109/JIOT.2026.3652582.
9. Du C, Guo Y, Zhang Y. A deep learning-based intrusion detection model integrating convolutional neural network and vision transformer for network traffic attack in the internet of things. Electronics. 2024;13(14):2685. doi:10.3390/electronics13142685.
10. Zhang W, Peng G, Li C, Chen Y, Zhang Z. A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals. Sensors. 2017;17(2):425. doi:10.20944/preprints201701.0132.v1.
11. Chen F, Zhao Z, Hu X, Liu D, Yin X, Yang J. Intelligent transformation in the operational maintenance of pumped storage units: hydraulic-mechanical multi-scenario fault diagnosis based on tensor feature extraction indicators. Adv Eng Inform. 2026;69(2):103894. doi:10.1016/j.aei.2025.103894.
12. Yang Z, Mao R, Ye L, Liu Y, Hu X, Li Y. VSC-ACGAN: bearing fault diagnosis model applied to imbalanced samples. Meas Sci Technol. 2025;36(3):036212. doi:10.1088/1361-6501/adb872.
13. Yasaei R, Moghaddas Y, Al Faruque MA. IoT-GRAF: IoT graph learning-based anomaly and intrusion detection through multi-modal data fusion. In: 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). Piscataway, NJ, USA: IEEE; 2024. p. 1–6.
14. Gu A, Dao T, Ermon S, Rudra A, Ré C. Hippo: recurrent memory with optimal polynomial projections. Adv Neural Inf Process Syst. 2020;33:1474–87.
15. Gu A, Goel K, Ré C. Efficiently modeling long sequences with structured state spaces. arXiv:2111.00396. 2021.
16. Smith JT, Warrington A, Linderman SW. Simplified state space layers for sequence modeling. arXiv:2208.04933. 2022.
17. Gu A, Dao T. Mamba: linear-time sequence modeling with selective state spaces. arXiv:2312.00752. 2024.
18. Dao T, Fu D, Ermon S, Rudra A, Ré C. FlashAttention: fast and memory-efficient exact attention with IO-awareness. Adv Neural Inf Process Syst. 2022;35:16344–59.
19. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017); 2017 Dec 4–9; Long Beach, CA, USA. p. 1–11.
20. Feng J, Zhang J, Cao G, Liu Z, Ding Y. DecMamba: mamba utilizing series decomposition for multivariate time series forecasting. Comput Mater Contin. 2025;82(1):1049–68.
21. Liang A, Jiang X, Sun Y, Shi X, Li K. Bi-mamba+: bidirectional mamba for time series forecasting. arXiv:2404.15772. 2024.
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

