EdgeST-Fusion: A Cross-Modal Federated Learning and Graph Transformer Framework for Multimodal Spatiotemporal Data Analytics in Smart City Consumer Electronics

Mohammed Alenazi

doi:10.32604/cmc.2026.075966

Open Access

ARTICLE

Mohammed M. Alenazi^*

Faculty of Computers and Information Technology, Department of Computer Engineering, University of Tabuk, Tabuk, Saudi Arabia

* Corresponding Author: Mohammed M. Alenazi. Email: email

(This article belongs to the Special Issue: Integrating Computing Technology of Cloud-Fog-Edge Environments and its Application)

Computers, Materials & Continua 2026, 87(2), 59 https://doi.org/10.32604/cmc.2026.075966

Received 11 November 2025; Accepted 05 January 2026; Issue published 12 March 2026

Abstract

Multimodal spatiotemporal data from smart city consumer electronics present critical challenges including cross-modal temporal misalignment, unreliable data quality, limited joint modeling of spatial and temporal dependencies, and weak resilience to adversarial updates. To address these limitations, EdgeST-Fusion is introduced as a cross-modal federated graph transformer framework for context-aware smart city analytics. The architecture integrates cross-modal embedding networks for modality alignment, graph transformer encoders for spatial dependency modeling, temporal self-attention for dynamic pattern learning, and adaptive anomaly detection to ensure data quality and security during aggregation. A privacy-preserving federated learning protocol with differential privacy guarantees enables collaborative model training without centralizing sensitive data. The framework employs data-quality-aware weighted aggregation to enhance robustness against noisy and malicious client updates. Experimental evaluation on the GeoLife, PeMS-Bay, and SmartHome+ datasets demonstrates that EdgeST-Fusion achieves 21.8% improvement in prediction accuracy, 35.7% reduction in communication overhead, and 29.4% enhancement in security resilience compared to recent baselines. Real-world deployment across three smart city testbeds validates practical viability with 90.0% average accuracy and sub-250 ms inference latency. The proposed framework remains feasible for deployment on heterogeneous and resource-constrained consumer electronics devices while maintaining strong privacy guarantees and scalability for large-scale urban environments.

Keywords

Federated learning; graph transformer; spatiotemporal analytics; consumer electronics; smart cities; cross-modal fusion; edge computing; privacy preservation

1 Introduction

The proliferation of consumer electronics (CE) in smart cities has generated unprecedented volumes of multimodal spatiotemporal data, fundamentally transforming urban intelligence paradigms [1]. Modern smart cities leverage diverse CE devices including smartphones, wearables, smart home appliances, and vehicle infotainment systems that continuously generate location traces, sensor readings, and behavioral patterns [2]. These heterogeneous data streams present both opportunities and challenges for urban analytics applications, particularly in maintaining privacy while extracting actionable insights for real-time decision making. Recent smart city research increasingly relies on machine learning to analyze large-scale IoT and spatiotemporal data for improving urban resilience and infrastructure intelligence. However, centralized learning approaches suffer from privacy risks, high communication costs, and limited scalability in distributed consumer electronics environments.

Recent advances in artificial intelligence-enabled smart city frameworks have demonstrated significant potential for urban optimization through intelligent data analysis [3]. Federated learning offers a promising alternative by enabling decentralized model training without raw data sharing. Recent transformer-based federated frameworks demonstrate improved scalability for sequential decision-making in smart city IoT systems [4]. Nevertheless, challenges remain in jointly addressing multimodal data fusion, spatiotemporal dependency modeling, and system robustness, motivating the need for integrated federated architectures. The integration of spatiotemporal big data from multiple CE sources requires sophisticated analytical frameworks capable of handling multimodal heterogeneity while maintaining data locality and security [5].

Recent studies have investigated explainable AI for autonomous urban navigation systems and blockchain-enabled intelligence for smart grid power management [6].

The emergence of federated learning (FL) paradigms offers promising solutions for privacy-preserving smart city analytics by enabling collaborative model training without centralizing raw data [7]. However, existing FL frameworks for spatiotemporal data analysis lack comprehensive support for multimodal CE inputs and fail to capture complex spatial dependencies inherent in urban environments [8]. Furthermore, current approaches do not adequately address the security vulnerabilities associated with federated training in adversarial urban settings.

Graph neural networks have shown remarkable success in modeling spatial relationships in urban systems, while privacy-preserving federated learning has recently emerged as a key enabler for fair, scalable, and data-efficient optimization in smart city environments, particularly for large-scale urban traffic and mobility systems. By decentralizing model training while maintaining performance and fairness guarantees, federated approaches address critical concerns related to data privacy, regulatory compliance, and communication efficiency in heterogeneous urban infrastructures. The integration of these technologies with federated learning principles presents an opportunity to develop comprehensive frameworks for multimodal spatiotemporal analytics in smart city consumer electronics applications.

Fig. 1 illustrates the relationship between heterogeneous consumer electronics data sources and the analytical challenges that arise when processing such data in smart city environments. Data quality is regarded as a major challenge because consumer devices often generate noisy, incomplete, and unreliable measurements due to sensor limitations, intermittent connectivity, and user behavior variability. Model aggregation is challenging in federated settings since non-IID data distributions and uneven data quality across edge nodes can introduce biased or suboptimal global model updates. Repeated data fusion refers to the iterative integration of heterogeneous data across multiple modalities, temporal windows, and federated training rounds. Rather than a single fusion operation, data from location, mobility, environmental, acoustic, and transactional sources must be repeatedly combined to progressively refine spatiotemporal representations and support downstream tasks such as event detection, human activity recognition, and anomaly detection under real-time and resource-constrained conditions.

Figure 1: Multimodal spatiotemporal data ecosystem in smart cities showing consumer electronics sources, data modalities, and analytical challenges requiring federated processing frameworks

This paper addresses these limitations by proposing EdgeST-Fusion, a novel cross-modal federated learning framework specifically designed for multimodal spatiotemporal data analytics in smart city consumer electronics applications. Our approach integrates graph transformer architectures with federated learning principles to enable privacy-preserving, scalable, and secure urban intelligence systems.

The main contributions of this work are summarized as follows:

• Cross-Modal Federated Architecture: Existing federated learning solutions for smart cities primarily focus on single-modality data and fail to address gradient inconsistency and temporal misalignment caused by heterogeneous consumer electronics. To address this limitation, a cross-modal federated architecture is designed to align heterogeneous spatiotemporal data streams at the edge level, enabling effective collaborative learning without centralizing sensitive data and preserving user privacy.

• Graph Transformer-Based Spatiotemporal Modeling: Urban sensing data inherently exhibits non-Euclidean spatial dependencies and long-range temporal correlations that cannot be captured by conventional convolutional or recurrent models. To address this challenge, a graph transformer encoder is integrated to jointly model graph-based spatial interactions and temporal self-attention, enabling accurate representation of complex spatiotemporal dynamics in smart city environments.

• Adaptive Security and Data-Quality-Aware Aggregation: In federated smart city systems, varying data quality, unreliable devices, and potential adversarial behaviors can significantly degrade global model performance. To mitigate these risks, an adaptive security framework is introduced to perform real-time anomaly detection and incorporate data quality awareness into the aggregation process, enhancing robustness against malicious updates and privacy leakage.

• Comprehensive and Realistic Evaluation: To demonstrate the practical effectiveness of the proposed framework, extensive experiments are conducted on three large-scale real-world datasets. The results show consistent improvements in prediction accuracy, communication efficiency, and security resilience over recent state-of-the-art methods, validating the necessity and effectiveness of each architectural component.

The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 presents the proposed methodology and mathematical modeling; Section 4 discusses results and evaluation; Section 5 provides discussion; and Section 6 concludes the paper.

2 Related Work

2.1 Spatiotemporal Data Analytics in Smart Cities

Smart city spatiotemporal analytics has evolved significantly with the integration of artificial intelligence and big data technologies. Wang et al. [2] presented AI and digital twin frameworks for consumer electronics in smart cities, emphasizing the importance of spatiotemporal dependence modeling through dynamic graph neural networks. Their approach demonstrated improved urban service delivery through intelligent data-enabled analytics, though limitations exist in handling multimodal consumer electronics integration.

AlTerkawi and AlTarawneh [4] introduced a federated decision transformer framework to enable scalable reinforcement learning in smart city IoT systems. Their approach combines transformer-based sequence modeling with federated optimization to support decentralized decision-making across distributed edge devices. The study demonstrates improved scalability and coordination efficiency in IoT-driven environments, particularly for sequential control tasks. However, the framework primarily focuses on reinforcement learning scenarios and does not address multimodal data fusion, spatiotemporal graph modeling, or integrated security and privacy mechanisms, which are critical requirements for comprehensive smart city consumer electronics analytics.

Mehmood et al. [5] explored advancements in human action recognition through 5G/6G technologies for smart cities using fuzzy integral-based fusion. Their work addressed multimodal data integration challenges in consumer electronics environments, demonstrating the potential of advanced fusion techniques for spatiotemporal analytics.

Liu et al. [9] provided a comprehensive review of multi-source data fusion and analysis algorithms in smart city construction, identifying key challenges in integrating heterogeneous data streams from diverse urban sensors and consumer electronics devices.

2.2 Federated Learning for Smart City Applications

Federated learning has emerged as a promising paradigm for privacy-preserving smart city analytics. Dhiman and Alghamdi [1] proposed an AI-based smart city framework using multi-objective and IoT approaches, incorporating edge intelligence modules for distributed processing. Their framework demonstrated effectiveness in consumer electronics applications, though graph-based spatial modeling remains underexplored.

McMahan et al. [10] introduced the foundational FedAvg algorithm for communication-efficient learning of deep networks from decentralized data, establishing the theoretical basis for federated learning in distributed environments. This seminal work enabled subsequent developments in privacy-preserving smart city analytics.

Li et al. [11] proposed FedProx to address federated optimization challenges in heterogeneous networks, introducing a proximal term to handle statistical and systems heterogeneity commonly encountered in smart city deployments with diverse consumer electronics devices.

Wang et al. [12] tackled the objective inconsistency problem in heterogeneous federated optimization, proposing solutions for scenarios where local objectives diverge significantly across participating devices, a common challenge in smart city consumer electronics environments.

Diao et al. [13] developed HeteroFL for computation and communication-efficient federated learning with heterogeneous clients, enabling flexible model architectures that accommodate varying computational capabilities of consumer electronics devices in smart city deployments.

The integration of federated learning with spatiotemporal modeling presents unique challenges in smart city contexts. Existing approaches typically address single-domain applications without comprehensive cross-modal integration capabilities required for consumer electronics environments.

2.3 Graph Neural Networks for Urban Analytics

Graph neural networks have shown remarkable success in urban spatiotemporal modeling. Yu et al. [14] introduced spatio-temporal graph convolutional networks (STGCN) for traffic forecasting, combining graph convolutions with temporal convolutions to capture spatial dependencies and temporal dynamics in urban transportation networks.

Li et al. [15] proposed the diffusion convolutional recurrent neural network (DCRNN) for data-driven traffic forecasting, modeling traffic flow as a diffusion process on directed graphs to capture complex spatial correlations in urban road networks.

Guo et al. [16] developed attention-based spatial-temporal graph convolutional networks (ASTGCN) for traffic flow forecasting, incorporating spatial and temporal attention mechanisms to dynamically capture the most relevant features for prediction tasks.

Pan et al. [17] proposed ST-MetaNet for urban traffic prediction using deep meta learning, enabling the model to adapt to diverse urban scenarios through learned meta-knowledge about spatial and temporal patterns.

Song et al. [18] introduced spatial-temporal synchronous graph convolutional networks (STSGCN) as a new framework for spatial-temporal network data forecasting, synchronously capturing localized spatial-temporal correlations through carefully designed graph convolution modules.

Zeng et al. [19] proposed GraphSAINT, a graph sampling-based inductive learning method that enables scalable training on large graphs through efficient subgraph sampling strategies, addressing computational challenges in large-scale urban analytics applications.

Current graph-based approaches for urban analytics primarily rely on centralized processing, limiting their applicability in privacy-sensitive consumer electronics deployments where data locality must be preserved.

2.4 Security and Privacy in Smart City Systems

Security and privacy considerations in smart city systems have gained increased attention with the proliferation of consumer electronics. Yeh [7] discussed the evolution from urban modelling and GIS to intelligent and digital twin cities with AI, highlighting privacy implications of comprehensive urban data collection and processing.

Gilman et al. [8] addressed data challenges in driving the transformation of smart cities, identifying security vulnerabilities and privacy concerns associated with large-scale urban data analytics and proposing strategies for secure data management.

Aljarrah [6] developed AI-based models for power consumption prediction in smart grids using blockchain technology, demonstrating the potential of distributed ledger technologies for enhancing security and transparency in smart city energy systems.

Lifelo et al. [3] explored artificial intelligence-enabled metaverse for sustainable smart cities, discussing security challenges and privacy-preserving mechanisms required for immersive urban applications involving consumer electronics.

2.5 Smart City Transportation and Urban Planning

Recent advances in intelligent transportation systems have demonstrated significant potential for sustainable smart cities. Elassy et al. [20] investigated intelligent transportation systems for sustainable smart cities, addressing integration challenges between transportation infrastructure and urban analytics platforms.

Adewopo and Elsayed [21] proposed smart city transportation solutions using deep learning ensembles, demonstrating improved prediction accuracy through combining multiple neural network architectures for traffic flow estimation.

Mansouri et al. [22] developed deep convolutional neural network-based enhanced crowd density monitoring for intelligent urban planning, enabling real-time pedestrian flow analysis to support smart city management decisions.

The gap in existing literature reveals insufficient attention to adaptive security mechanisms specifically designed for federated learning environments processing multimodal consumer electronics data in smart cities.

3 Proposed Methodology

3.1 System Overview

The EdgeST-Fusion framework addresses the fundamental challenges of multimodal spatiotemporal data analytics in smart city consumer electronics through a comprehensive federated learning architecture. Fig. 2 presents the complete system architecture, illustrating the integration of cross-modal embedding networks, graph transformer encoders, and adaptive security mechanisms distributed across edge devices and coordination servers.

Figure 2: EdgeST-Fusion system architecture showing cross-modal embedding networks, graph transformer encoders, federated aggregation mechanisms, and adaptive security modules for multimodal spatiotemporal analytics in smart city consumer electronics

The framework operates through four primary modules: (1) Cross-Modal Data Preprocessing, (2) Graph Transformer Encoding, (3) Federated Learning Coordination, and (4) Adaptive Security Management. Each module is designed to handle specific aspects of multimodal spatiotemporal processing while maintaining privacy preservation and computational efficiency.

The cross-modal data preprocessing module handles heterogeneous inputs from various consumer electronics including smartphones, smart home devices, wearables, and vehicle systems. This module standardizes different data modalities into unified embedding representations suitable for downstream processing.

Temporal Alignment of Multimodal Data Streams: Multimodal data streams originating from heterogeneous consumer electronics devices are temporally aligned before federated training. For each modality, raw sensor timestamps are normalized to a shared global time reference so that all streams are synchronized on a common clock. The aligned streams are then segmented into fixed-length sliding windows, which gives every modality the same temporal correspondence within a window. This window-based synchronization allows asynchronous sensing rates and intermittent data availability to be handled effectively while preserving the temporal dependencies required for spatiotemporal modeling.
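The alignment and windowing steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `align_to_grid` and `sliding_windows`, the choice to average readings that fall into the same slot, and the use of `None` for empty slots are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Reading = Tuple[float, float]  # (timestamp in seconds, sensor value)


def align_to_grid(stream: List[Reading], t0: float, step: float,
                  n_slots: int) -> List[Optional[float]]:
    """Resample one modality onto a shared time grid (hypothetical helper).

    Slot i covers [t0 + i*step, t0 + (i+1)*step). Readings that land in the
    same slot are averaged; slots without readings stay None, which models
    intermittent data availability.
    """
    sums = [0.0] * n_slots
    counts = [0] * n_slots
    for ts, value in stream:
        idx = int((ts - t0) // step)
        if 0 <= idx < n_slots:
            sums[idx] += value
            counts[idx] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_slots)]


def sliding_windows(aligned: Dict[str, List[Optional[float]]], length: int,
                    stride: int) -> List[Dict[str, List[Optional[float]]]]:
    """Cut all modalities into windows that share the same slot indices."""
    n_slots = min(len(series) for series in aligned.values())
    return [
        {name: series[start:start + length] for name, series in aligned.items()}
        for start in range(0, n_slots - length + 1, stride)
    ]


# Two modalities sampled at different, irregular rates.
gps = [(0.2, 1.0), (1.1, 2.0), (1.9, 4.0), (3.5, 5.0)]
accel = [(0.5, 10.0), (2.5, 30.0)]
aligned = {
    "gps": align_to_grid(gps, t0=0.0, step=1.0, n_slots=4),
    "accel": align_to_grid(accel, t0=0.0, step=1.0, n_slots=4),
}
windows = sliding_windows(aligned, length=2, stride=1)
```

Because every window is indexed by slot rather than by raw timestamp, modalities with different sampling rates end up with the same window boundaries, which is the property the paragraph above relies on.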

3.2 Cross-Modal Embedding Network

The cross-modal embedding network transforms heterogeneous consumer electronics data into unified feature representations. Given multimodal input data $X = \{X^{(1)}, X^{(2)}, \ldots, X^{(M)}\}$ from $M$ different modalities, the embedding transformation is defined as:

$$E^{(m)} = f_{\theta_m}\left(X^{(m)}\right) \tag{1}$$

where $X^{(m)} \in \mathbb{R}^{N \times T \times F_m}$ denotes the input tensor for modality $m$ with $N$ spatial nodes, $T$ time steps, and $F_m$ input features specific to modality $m$, and $f_{\theta_m}$ is the embedding network for modality $m$ with learnable parameters $\theta_m$. The embedding dimension $d$ is a hyperparameter that controls the representational capacity of the unified feature space. In this work, $d = 128$ is selected based on preliminary experiments to balance model expressiveness and computational efficiency. This dimension ensures sufficient capacity to capture cross-modal correlations while maintaining feasibility for deployment on resource-constrained consumer electronics devices.

The unified cross-modal representation is obtained through attention-based fusion:

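$$E_{\mathrm{unified}} = \sum_{m=1}^{M} \alpha_m E^{(m)} \tag{2}$$

where the attention weights $\alpha_m$ are computed as:

$$\alpha_m = \frac{\exp\left(W_a^{\top} E^{(m)}\right)}{\sum_{k=1}^{M} \exp\left(W_a^{\top} E^{(k)}\right)} \tag{3}$$

with $W_a \in \mathbb{R}^{d \times 1}$ representing learnable attention parameters.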
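The fusion in Eqs. (2) and (3) can be illustrated with a small sketch. It assumes that each modality embedding has already been reduced to a single vector of length d (for example, one node at one time step), so that the score for each modality is a scalar. The function names `softmax` and `fuse` are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(embeddings: List[List[float]],
         w_a: List[float]) -> Tuple[List[float], List[float]]:
    """Attention-weighted fusion of M modality embeddings.

    embeddings: M vectors of length d (assumed pooled per modality).
    w_a: attention parameter vector of length d.
    Returns the attention weights (Eq. 3) and the fused vector (Eq. 2).
    """
    scores = [sum(w * e for w, e in zip(w_a, emb)) for emb in embeddings]
    alphas = softmax(scores)
    d = len(w_a)
    unified = [
        sum(alpha * emb[j] for alpha, emb in zip(alphas, embeddings))
        for j in range(d)
    ]
    return alphas, unified


# Toy example with M = 2 modalities and d = 2.
alphas, unified = fuse([[1.0, 0.0], [0.0, 1.0]], w_a=[2.0, 0.0])
```

In this toy case the first modality receives the larger weight because its embedding has the larger projection onto `w_a`; with an all-zero `w_a` the weights are uniform and the fusion reduces to a plain average.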

The cross-modal embedding network ensures semantic alignment between different consumer electronics data sources through contrastive learning objectives:

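$$\mathcal{L}_{\mathrm{contrastive}} = -\log \frac{\exp\left(\mathrm{sim}(E^{(i)}, E^{(j)}) / \tau\right)}{\sum_{k=1}^{M} \exp\left(\mathrm{sim}(E^{(i)}, E^{(k)}) / \tau\right)} \tag{4}$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, $\tau$ is the temperature parameter, and $i, j$ represent positive pairs from the same spatial-temporal context.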
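A minimal sketch of the loss in Eq. (4) for a single anchor is given below. It assumes vector-valued embeddings and that the candidate set in the denominator contains the positive sample. The function names and the temperature values used in the example are illustrative, not settings reported in the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor: Sequence[float], positive: Sequence[float],
                     candidates: List[Sequence[float]], tau: float) -> float:
    """Eq. (4) for one anchor; `candidates` must include the positive."""
    numerator = math.exp(cosine(anchor, positive) / tau)
    denominator = sum(math.exp(cosine(anchor, c) / tau) for c in candidates)
    return -math.log(numerator / denominator)


anchor = [1.0, 0.0]
aligned_positive = [1.0, 0.0]     # same direction as the anchor
misaligned_positive = [0.0, 1.0]  # orthogonal to the anchor
negative = [-1.0, 0.0]

loss_aligned = contrastive_loss(
    anchor, aligned_positive, [aligned_positive, negative], tau=0.5)
loss_misaligned = contrastive_loss(
    anchor, misaligned_positive, [misaligned_positive, negative], tau=0.5)
```

The loss is smaller when the positive pair points in the same direction as the anchor, which is the behavior that drives embeddings from the same spatial-temporal context together.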

3.3 Graph Transformer Architecture

The graph transformer architecture captures both spatial dependencies and temporal dynamics in multimodal spatiotemporal data. The spatial graph structure 𝒢=(𝒱,ℰ) represents consumer electronics devices as nodes 𝒱 with edges ℰ encoding spatial proximity and functional relationships.

Graph construction in EdgeST-Fusion follows a device-centric modeling strategy, where each node v∈𝒱 corresponds to an individual consumer electronics device or sensing unit. Edges e∈ℰ are established based on spatial proximity, communication reachability, or functional correlation between devices, such as shared sensing objectives or correlated data patterns. The adjacency matrix A is derived from these relationships and encodes the underlying spatial interaction structure.

The graph topology is assumed to be quasi-static during training, reflecting relatively stable device deployment and interaction patterns in smart city environments. Graph updates are triggered only when significant topology changes occur, such as device addition, removal, or sustained connectivity variation, ensuring computational efficiency while preserving modeling accuracy.

The graph transformer encoder processes embedded features through multi-head graph attention mechanisms:

$$H^{(l+1)} = \mathrm{MultiHead}\left(H^{(l)}, A\right) \tag{5}$$

where $H^{(l)}$ represents node features at layer $l$, and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix encoding pairwise spatial relationships among the $N$ consumer electronics devices or sensing nodes, with $A_{ij} \ge 0$ representing the connection strength between nodes $i$ and $j$.

The multi-head graph attention computes attention scores for spatial dependencies:

$$\mathrm{Attention}(Q, K, V, A) = \mathrm{softmax}\left(\frac{QK^{\top} + A}{\sqrt{d_k}}\right) V \tag{6}$$

where $Q, K, V$ represent query, key, and value matrices, and $d_k$ is the key dimension.
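The single-head computation in Eq. (6) can be sketched as below. The sketch reads the equation as adding the adjacency matrix to the query-key scores before scaling by the square root of the key dimension and applying a row-wise softmax. That reading, the function name `graph_attention`, and the omission of the multi-head projections are assumptions of this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def graph_attention(Q: Matrix, K: Matrix, V: Matrix, A: Matrix) -> Matrix:
    """Single-head attention with an additive adjacency bias.

    Q, K: N x d_k matrices, V: N x d_v matrix, A: N x N adjacency matrix.
    Row i of the result is a convex combination of the rows of V.
    """
    d_k = len(K[0])
    n_keys = len(K)
    output = []
    for i in range(len(Q)):
        logits = [
            (sum(q * k for q, k in zip(Q[i], K[j])) + A[i][j]) / math.sqrt(d_k)
            for j in range(n_keys)
        ]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append([
            sum(weights[j] * V[j][c] for j in range(n_keys))
            for c in range(len(V[0]))
        ])
    return output


zeros = [[0.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
# No graph bias: every node attends uniformly to all nodes.
uniform = graph_attention(zeros, zeros, values, zeros)
# Strong self-connections: every node attends mostly to itself.
biased = graph_attention(zeros, zeros, values, [[10.0, 0.0], [0.0, 10.0]])
```

With identical query-key scores, the adjacency term alone decides where attention mass goes, which shows how the graph structure steers the spatial attention.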

Temporal self-attention captures long-range temporal dependencies across time steps:

$$Z_t = \mathrm{SelfAttention}\left(H_t, \{H_{t-\Delta}, \ldots, H_{t+\Delta}\}\right) \tag{7}$$

In Eq. (7), $H_t$ denotes the latent representation at time step $t$ obtained from the graph transformer, while $\Delta$ defines the temporal window size used to capture past and future contextual information. The operator $\mathrm{SelfAttention}(\cdot)$ computes context-aware representations by attending to temporal neighbors $\{H_{t-\Delta}, \ldots, H_{t+\Delta}\}$, enabling the model to capture long-range temporal dependencies across multiple time steps. The output $Z_t$ represents the temporally refined feature embedding at time step $t$.

The temporal attention weights are computed as:

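$$A_{\mathrm{temporal}}(t, s) = \frac{\exp\left(h_t^{\top} W_{\mathrm{temp}} h_s\right)}{\sum_{\tau} \exp\left(h_t^{\top} W_{\mathrm{temp}} h_{\tau}\right)} \tag{8}$$

where $W_{\mathrm{temp}}$ represents learnable temporal attention parameters.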
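Eqs. (7) and (8) can be combined into a short windowed-attention sketch for the feature sequence of one node. It assumes that the sum in Eq. (8) runs over the window of radius Δ around t, that the window is truncated at the sequence boundaries, and that the hidden states themselves serve as values (no separate value projection). The function name `temporal_attention` is hypothetical.

```python
import math
from typing import List

Vector = List[float]


def temporal_attention(H: List[Vector], W: List[Vector],
                       delta: int) -> List[Vector]:
    """Windowed temporal self-attention over one node's sequence.

    H: T hidden states of length d; W: d x d bilinear weight matrix.
    The score between steps t and s is h_t^T W h_s; weights are a softmax
    over s in [t - delta, t + delta], clipped to the valid range.
    """
    T, d = len(H), len(H[0])
    Z = []
    for t in range(T):
        lo, hi = max(0, t - delta), min(T - 1, t + delta)
        scores = []
        for s in range(lo, hi + 1):
            w_hs = [sum(W[r][c] * H[s][c] for c in range(d)) for r in range(d)]
            scores.append(sum(H[t][r] * w_hs[r] for r in range(d)))
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        Z.append([
            sum(weights[i] * H[lo + i][c] for i in range(len(weights)))
            for c in range(d)
        ])
    return Z


# With W = 0 all scores are equal, so each output is the window mean.
sequence = [[1.0], [2.0], [3.0], [4.0]]
refined = temporal_attention(sequence, W=[[0.0]], delta=1)
```

A learned `W` would replace the uniform weights with data-dependent ones, letting each time step emphasize the neighbors that are most informative for it.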

3.4 Federated Learning Framework

The federated learning framework enables collaborative model training across distributed consumer electronics without centralizing sensitive data. Each participating device $k$ maintains local model parameters $\theta_k$ and processes local data $\mathcal{D}_k$.

The local gradient computed at device k during communication round t is defined as

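$$g_k^{(t)} = \nabla_{\theta} \mathcal{L}_k\left(\theta_k^{(t)}, \mathcal{D}_k\right), \tag{9}$$

where $\mathcal{L}_k(\cdot)$ denotes the local objective function evaluated on the private dataset $\mathcal{D}_k$.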

To bound the sensitivity of the update, gradient clipping is applied as

$$\bar{g}_k^{(t)} = \frac{g_k^{(t)}}{\max\left(1, \frac{\|g_k^{(t)}\|_2}{C}\right)}, \tag{10}$$

where $C$ denotes the clipping threshold.

Differential privacy is enforced using the Gaussian mechanism by injecting noise into the clipped gradient:

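$$\tilde{g}_k^{(t)} = \bar{g}_k^{(t)} + \mathcal{N}\left(0, \sigma^2 C^2 I\right) \tag{11}$$

where $\sigma$ controls the noise scale and $I$ is the identity matrix. With $\sigma$ calibrated to the target privacy budget, this procedure guarantees $(\varepsilon, \delta)$-differential privacy for each client update prior to federated aggregation.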

Accordingly, the privacy-preserving local model update is given by

$$\theta_k^{(t+1)} = \theta_k^{(t)} - \eta\, \tilde{g}_k^{(t)}. \tag{12}$$
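The clipping, noising, and update steps of Eqs. (10) to (12) can be sketched as follows, with gradients represented as flat lists of floats. The function names are hypothetical, and the noise multiplier in the example is an arbitrary illustrative value rather than one calibrated to a specific privacy budget.

```python
import math
import random
from typing import List


def clip_gradient(grad: List[float], clip_norm: float) -> List[float]:
    """Eq. (10): rescale the gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(1.0, norm / clip_norm)
    return [g / scale for g in grad]


def privatize(grad: List[float], clip_norm: float, sigma: float,
              rng: random.Random) -> List[float]:
    """Eq. (11): add Gaussian noise with std sigma * clip_norm per coordinate."""
    clipped = clip_gradient(grad, clip_norm)
    return [g + rng.gauss(0.0, sigma * clip_norm) for g in clipped]


def local_step(theta: List[float], grad: List[float], lr: float,
               clip_norm: float, sigma: float,
               rng: random.Random) -> List[float]:
    """Eq. (12): one privacy-preserving gradient step on the local model."""
    noisy = privatize(grad, clip_norm, sigma, rng)
    return [p - lr * g for p, g in zip(theta, noisy)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
theta_next = local_step(theta=[1.0, 1.0], grad=[3.0, 4.0], lr=0.001,
                        clip_norm=1.0, sigma=1.0, rng=rng)
```

Clipping bounds the influence of any single update to `clip_norm`, which is what allows the noise scale to be expressed as a multiple of `clip_norm` in Eq. (11).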

The overall local objective function optimized at device k is defined as:

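$$\mathcal{L}_k = \mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{privacy}} \mathcal{L}_{\mathrm{privacy}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{reg}} \tag{13}$$

where $\mathcal{L}_{\mathrm{pred}}$ denotes the prediction loss, $\mathcal{L}_{\mathrm{privacy}}$ represents the privacy regularization term, $\mathcal{L}_{\mathrm{reg}}$ denotes the model regularization term, and $\lambda_{\mathrm{privacy}}$, $\lambda_{\mathrm{reg}}$ are the corresponding weighting hyperparameters.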

The global model aggregation employs weighted averaging based on data quality and contribution metrics:

$$\theta^{(t+1)} = \sum_{k=1}^{K} w_k\, \theta_k^{(t+1)} \tag{14}$$

where weights $w_k$ are determined by:

$$w_k = \frac{|\mathcal{D}_k| \cdot q_k}{\sum_{j=1}^{K} |\mathcal{D}_j| \cdot q_j} \tag{15}$$

with $|\mathcal{D}_k|$ representing the local dataset size and $q_k$ representing the data quality score for device $k$, ensuring that devices with larger and higher quality datasets contribute more significantly to the global model update.
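The aggregation rule in Eqs. (14) and (15) can be sketched as follows, with each client model represented as a flat list of parameters. How the quality score is computed is not specified here, so the scores in the example are placeholder values, and the function names are hypothetical.

```python
from typing import List


def aggregation_weights(sizes: List[int],
                        qualities: List[float]) -> List[float]:
    """Eq. (15): weights proportional to dataset size times quality score."""
    raw = [n * q for n, q in zip(sizes, qualities)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one client needs positive size and quality")
    return [r / total for r in raw]


def aggregate(client_params: List[List[float]], sizes: List[int],
              qualities: List[float]) -> List[float]:
    """Eq. (14): weighted average of the client parameter vectors."""
    weights = aggregation_weights(sizes, qualities)
    dim = len(client_params[0])
    return [
        sum(weights[k] * client_params[k][j] for k in range(len(client_params)))
        for j in range(dim)
    ]


# Client 1 has three times the data but half the quality score of client 0.
weights = aggregation_weights(sizes=[100, 300], qualities=[1.0, 0.5])
global_params = aggregate([[1.0, 0.0], [0.0, 1.0]],
                          sizes=[100, 300], qualities=[1.0, 0.5])
```

In this example the larger client still receives the larger weight, but by a factor of 1.5 rather than 3, which shows how the quality score tempers pure size-based weighting.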

3.5 Mathematical Modeling of Spatiotemporal Dynamics

The spatiotemporal dynamics in consumer electronics data exhibit complex patterns requiring sophisticated mathematical modeling. The spatiotemporal state evolution is formulated as:

$$S_{t+1} = F\left(S_t, U_t, G_t\right) + \epsilon_t \tag{16}$$

where $S_t \in \mathbb{R}^{N \times d_s}$ denotes the latent system state at time step $t$, $U_t$ represents exogenous inputs, $G_t$ captures graph-based spatial dependencies, $\epsilon_t$ denotes stochastic noise, and $F(\cdot)$ models the nonlinear state transition dynamics.

The spatial interaction function incorporates graph convolution operations:

$$G_t = \sigma\left(A S_t W_{\mathrm{spatial}}\right) \tag{17}$$

where $\sigma(\cdot)$ is a nonlinear activation function and $W_{\mathrm{spatial}} \in \mathbb{R}^{d_s \times d_s}$ represents the learnable spatial transformation weights.

The temporal evolution is modeled through recurrent dynamics:

$$h_t = \mathrm{GRU}\left(h_{t-1}, [S_t; G_t]\right) \tag{18}$$

where $h_t \in \mathbb{R}^{d_h}$ denotes the GRU hidden state and $[\cdot\,;\cdot]$ represents concatenation.

The prediction output combines spatial and temporal representations:

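$$\hat{Y}_{t+\Delta} = W_{\mathrm{out}} [h_t; Z_t] + b_{\mathrm{out}} \tag{19}$$

where $\hat{Y}_{t+\Delta}$ denotes the predicted output at horizon $\Delta$, $Z_t$ represents the temporally refined features from self-attention, and $W_{\mathrm{out}}$, $b_{\mathrm{out}}$ are the output projection parameters.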

3.6 Algorithmic Implementation

Algorithm 1 illustrates the main EdgeST-Fusion training procedure, incorporating cross-modal processing, graph transformer encoding, differential privacy, and federated aggregation.

Client participation in EdgeST-Fusion follows a round-based synchronous federated learning protocol. At each communication round, a subset of available consumer electronics devices is randomly sampled to participate in local training, which helps reduce communication overhead and mitigates straggler effects. All participating clients complete their local updates before global aggregation is performed.

Communication between clients and the coordinator is conducted through iterative model update exchanges, where only privacy-preserved local updates are transmitted. Gradient clipping is applied locally at each device prior to transmission, using the clipping norm C to bound the sensitivity of model updates and ensure stability under heterogeneous data distributions.

To enhance aggregation resilience, EdgeST-Fusion employs data-quality-aware weighted aggregation, where each client contribution is scaled based on its local dataset size and quality score. This design improves robustness against noisy, unreliable, or low-quality client updates and reduces the impact of biased gradients during global model optimization. Security and anomaly validation are performed after aggregation to detect abnormal update patterns before proceeding to the next round.

3.7 Adaptive Anomaly Detection

The adaptive anomaly detection module identifies malicious data injections and privacy breaches in real-time. The detection mechanism employs statistical process monitoring combined with machine learning-based classification.

The anomaly score for data sample $x_t$ is computed as:

$$\mathrm{Score}(x_t) = \left\| x_t - \mu_t \right\|^2_{\Sigma_t^{-1}} \tag{20}$$

where $\mu_t$ and $\Sigma_t$ represent adaptive mean and covariance estimates.

The adaptive threshold is updated based on recent anomaly patterns:

$$\tau_t = \gamma\, \tau_{t-1} + (1 - \gamma)\, \mathrm{Percentile}_{95}\left(\{\mathrm{Score}(x_s)\}_{s=t-w}^{t-1}\right) \tag{21}$$

where $\gamma$ is the smoothing factor and $w$ represents the window size.
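Eqs. (20) and (21) can be sketched as follows. To keep the example free of matrix inversion, the covariance is assumed to be diagonal, so the squared Mahalanobis distance reduces to a variance-normalized sum of squares. The nearest-rank percentile definition, the function names, and the numeric values are likewise assumptions of this sketch.

```python
import math
from typing import List, Sequence


def anomaly_score(x: Sequence[float], mean: Sequence[float],
                  var: Sequence[float]) -> float:
    """Eq. (20) under a diagonal-covariance assumption."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def update_threshold(prev_tau: float, recent_scores: Sequence[float],
                     gamma: float) -> float:
    """Eq. (21): smooth the threshold toward the recent 95th percentile."""
    return gamma * prev_tau + (1.0 - gamma) * percentile(recent_scores, 95)


def is_anomalous(score: float, tau: float) -> bool:
    """Flag a sample whose score exceeds the current threshold."""
    return score > tau


# Scores from the last w = 20 samples, then one threshold update.
recent: List[float] = [float(s) for s in range(1, 21)]
tau = update_threshold(prev_tau=10.0, recent_scores=recent, gamma=0.9)
flag = is_anomalous(anomaly_score([6.0, 0.0], [0.0, 0.0], [1.0, 4.0]), tau)
```

A smoothing factor close to 1 makes the threshold react slowly to the recent window, which matches the stated goal of avoiding abrupt sensitivity changes.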

Threshold Selection and False-Positive Control: The anomaly detection threshold $\tau_t$ is initialized using the empirical 95th percentile of anomaly scores observed during a clean calibration phase, ensuring conservative detection at system startup. The adaptive update mechanism in Eq. (21) allows the threshold to gradually evolve based on recent threat statistics, preventing abrupt sensitivity changes caused by transient fluctuations.

False positives are controlled through percentile-based thresholding rather than fixed absolute values, which enables robustness against distributional shifts in multimodal data streams. Empirical evaluation indicates that this strategy maintains a false positive rate below 3% across all attack scenarios, striking a balance between early threat detection and operational stability in real-time smart city deployments.

Compatibility with Differential Privacy Noise: The anomaly detection mechanism is explicitly designed to remain robust under differential privacy perturbations applied during federated training. Since Gaussian noise is injected into clipped gradients rather than raw input data, the anomaly score computation in Eq. (20) operates on feature-level statistics that are minimally affected by privacy noise.

Moreover, the adaptive covariance update smooths short-term variance inflation caused by differential privacy noise, preventing systematic bias in anomaly scores. Experimental results confirm that detection accuracy remains stable under (ε,δ)-DP settings, demonstrating that the security module does not conflict with privacy-preserving learning objectives.

Algorithm 2 details the adaptive anomaly detection and response mechanism, where thresholds are dynamically updated based on observed threat statistics to mitigate evolving attack patterns.

3.8 Complexity Analysis

The computational complexity of EdgeST-Fusion consists of several components. The cross-modal embedding network has complexity $\mathcal{O}(M D^2)$ where $M$ is the number of modalities and $D$ is the embedding dimension. The graph transformer encoding requires $\mathcal{O}(N^2 d + N T d^2)$ operations for $N$ nodes, $T$ time steps, and $d$ feature dimensions.

The federated aggregation complexity is $\mathcal{O}(K \cdot P)$ where $K$ represents the number of participating devices and $P$ denotes the number of model parameters. The overall time complexity per training iteration is:

$$\mathcal{O}_{\mathrm{total}} = \mathcal{O}\left(M D^2 + N^2 d + N T d^2 + K P\right) \tag{22}$$

The space complexity is dominated by the graph transformer memory requirements:

$$\mathcal{S}_{\mathrm{space}} = \mathcal{O}\left(N^2 + N T d + P\right) \tag{23}$$
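To get a feel for which term dominates, the expressions in Eqs. (22) and (23) can be evaluated for concrete sizes. The sketch below drops constant factors and uses illustrative values only (3 modalities, 128-dimensional embeddings, 325 nodes, 12 time steps, 50 clients, one million parameters). These values are assumptions chosen for the example and are not configurations reported in the paper.

```python
from typing import Dict


def time_terms(M: int, D: int, N: int, T: int, d: int,
               K: int, P: int) -> Dict[str, int]:
    """Leading-order operation counts for each term of Eq. (22)."""
    return {
        "embedding": M * D ** 2,
        "graph_transformer": N ** 2 * d + N * T * d ** 2,
        "aggregation": K * P,
    }


def space_terms(N: int, T: int, d: int, P: int) -> Dict[str, int]:
    """Leading-order memory terms of Eq. (23), in number of stored values."""
    return {
        "adjacency": N ** 2,
        "activations": N * T * d,
        "parameters": P,
    }


time_cost = time_terms(M=3, D=128, N=325, T=12, d=128, K=50, P=1_000_000)
space_cost = space_terms(N=325, T=12, d=128, P=1_000_000)
dominant = max(time_cost, key=time_cost.get)
```

For these particular values the graph transformer term is the largest, followed by aggregation, while the embedding term is negligible. The ranking depends on the chosen sizes, since the quadratic term in the number of nodes grows fastest as the graph gets larger.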

3.9 Comparison with Existing Approaches

EdgeST-Fusion addresses several limitations of existing methodologies. Traditional centralized approaches require data aggregation, violating privacy constraints and creating communication bottlenecks. Existing federated learning frameworks typically handle single-modal data and lack comprehensive spatial modeling capabilities.

Current graph neural network approaches for urban analytics operate in centralized settings and do not incorporate federated learning principles. The proposed framework uniquely combines cross-modal processing, graph-based spatial modeling, temporal attention mechanisms, and federated learning in a unified architecture specifically designed for smart city consumer electronics applications.

4 Results and Evaluation

4.1 Experimental Setup

Comprehensive experiments are conducted to evaluate the performance of EdgeST-Fusion across multiple dimensions, including prediction accuracy, communication efficiency, security resilience, and scalability. The evaluation utilizes three large-scale datasets representing diverse smart city consumer electronics scenarios.

Table 1 presents the detailed characteristics of experimental datasets.

The experimental infrastructure consists of edge devices simulated using NVIDIA Jetson Xavier NX boards and coordination servers deployed on AWS EC2 p3.8xlarge instances with Tesla V100 GPUs. The federated learning simulation involves 50–200 participating devices with heterogeneous data distributions reflecting realistic smart city deployments.

Hyperparameter configuration includes learning rate $\eta=0.001$, embedding dimension $d=128$, graph transformer layers $L=4$, attention heads $H=8$, and local training epochs $E_{\text{local}}=5$. The differential privacy mechanism employs $(\varepsilon,\delta)=(1.0,10^{-5})$ privacy guarantees.

Non-IID Data Partitioning and Client Dropout Simulation: To reflect realistic smart city deployments, the experimental setup incorporates non-IID data partitions across participating devices. Each client is assigned data corresponding to specific spatial regions, sensing modalities, or usage patterns, resulting in heterogeneous data distributions across the federated network. This setting captures real-world variability where consumer electronics observe localized and context-dependent data streams.

Client dropout is explicitly simulated to evaluate robustness under unstable participation. At each communication round, a random subset of clients is unavailable due to connectivity loss, power constraints, or scheduling conflicts. Dropout rates vary between 10% and 30% depending on the scenario, and the global aggregation process proceeds using only the available client updates, mimicking realistic federated learning conditions in smart city environments.
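The partitioning and dropout procedure described above can be sketched as follows. This is a simplified illustration, not the simulation layer used in the experiments: `partition_by_region` and `available_clients` are hypothetical helper names, the toy samples are synthetic, and partitioning is shown only along the spatial-region axis.

```python
import random
from collections import defaultdict


def partition_by_region(samples):
    """Group (region_id, value) samples so that each client holds one region."""
    shards = defaultdict(list)
    for region_id, value in samples:
        shards[region_id].append(value)
    return dict(shards)


def available_clients(client_ids, dropout_rate, rng):
    """Return the clients that remain reachable in one communication round."""
    n_dropped = round(len(client_ids) * dropout_rate)
    dropped = set(rng.sample(client_ids, n_dropped))
    return [cid for cid in client_ids if cid not in dropped]


rng = random.Random(42)

# Non-IID split: toy data with five regions, one shard per region.
samples = [(i % 5, float(i)) for i in range(20)]
shards = partition_by_region(samples)

# Dropout: 50 clients evaluated at the lower, middle, and upper dropout rates.
clients = list(range(50))
round_participants = {
    rate: available_clients(clients, rate, rng) for rate in (0.10, 0.20, 0.30)
}
```

Aggregation in a given round would then use only the updates from the returned client list.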

Fig. 3 shows that the experimental evaluation utilizes three diverse datasets representing different smart city consumer electronics scenarios with significant variations in scale. GeoLife has the largest spatial coverage, with N=17,621 nodes, and the longest temporal sequence, with T=1,068,000 time steps, representing mobility tracking applications. PeMS-Bay provides transportation infrastructure data with N=325 nodes and T=52,116 time steps across M=4 modalities, totaling 18.3 GB. SmartHome+ exhibits the highest multimodal complexity, with M=6 data modalities from N=5,847 IoT devices over T=876,240 time steps, requiring 31.2 GB of storage. The heatmap visualization shows normalized characteristics, where darker intensities indicate higher relative values, and illustrates the dataset diversity needed to validate the framework across varying spatiotemporal scales and levels of multimodal complexity.

Figure 3: Dataset characteristics comparison across spatial, temporal, and complexity dimensions for EdgeST-Fusion evaluation

Reproducibility and Implementation Details: To ensure full reproducibility of the experimental results, all experiments were conducted using fixed random seeds and explicitly defined software and hardware configurations. Unless otherwise stated, the global random seed was set to 42 for data partitioning, model initialization, and federated client sampling. Each experiment was repeated five times with different seeds to assess robustness, and mean values are reported.

The implementation was developed using Python 3.10 with PyTorch 2.1.0 and PyTorch Geometric 2.5.0 for graph-based operations. Federated learning orchestration was implemented using a custom simulation layer built on top of PyTorch Distributed. CUDA 12.1 and cuDNN 8.9 were used for GPU acceleration. Edge devices were emulated using NVIDIA Jetson Xavier NX boards (6-core Carmel ARM CPU, 384-core Volta GPU, 8 GB RAM), while the federated coordinator was deployed on AWS EC2 p3.8xlarge instances equipped with NVIDIA Tesla V100 GPUs (32 GB HBM2 memory) and 256 GB system RAM. Network latency and bandwidth constraints were simulated to reflect realistic smart city communication environments.

All hyperparameters, model configurations, and dataset preprocessing scripts were stored as structured configuration files (YAML format) to enable exact replication of the experimental pipeline.

4.2 Baseline Comparisons

EdgeST-Fusion is compared against ten state-of-the-art baselines representing different methodological approaches: (1) FedAvg [10], (2) STGCN [14], (3) GraphSAINT [19], (4) FedProx [11], (5) DCRNN [15], (6) ASTGCN [16], (7) ST-MetaNet [17], (8) FedNova [12], (9) STSGCN [18], and (10) HeteroFL [13].

The evaluation metrics encompass prediction accuracy (MAE, RMSE, MAPE), communication efficiency (bytes transmitted, compression ratio), security resilience (attack detection rate, false positive rate), and computational performance (training time, inference latency).

4.3 Prediction Accuracy Results

Table 2 presents comprehensive prediction accuracy results across all datasets and comparison baselines.

Statistical Significance and Variability Analysis: To assess the robustness of the reported improvements, all experiments were repeated five times using different random seeds under identical settings. Table 2 reports mean values across runs, while additional analysis was conducted to evaluate variance and statistical significance. Across all datasets, EdgeST-Fusion exhibited consistently lower variance compared to baseline methods. The standard deviation of MAE ranged from 0.06 to 0.11 for EdgeST-Fusion, whereas baseline approaches showed higher variability, with standard deviations between 0.14 and 0.29 depending on the dataset and method.

Furthermore, 95% confidence intervals are computed for MAE, RMSE, and MAPE metrics. For EdgeST-Fusion, the confidence intervals were notably tighter (average width: ±0.09 MAE) than those of the strongest baseline methods (average width: ±0.21 MAE), indicating more stable and reliable performance. To confirm statistical significance, paired two-tailed t-tests were conducted between EdgeST-Fusion and the best-performing baseline for each dataset. The results indicate that all observed improvements are statistically significant with p<0.01 across GeoLife, PeMS-Bay, and SmartHome+ datasets. These findings demonstrate that the observed performance gains are not attributable to random variation but reflect consistent methodological improvements.
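The paired test and confidence interval described above can be reproduced with the standard library alone for five paired runs. In the sketch below the per-run MAE values are made-up placeholders chosen for easy arithmetic, not results from the paper; the two critical values are the standard two-sided Student's t quantiles for 4 degrees of freedom.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value, 4 degrees of freedom
T_CRIT_99_DF4 = 4.604  # two-sided 99% critical value, 4 degrees of freedom


def paired_t(baseline, proposed):
    """Return (mean difference, standard error, t statistic) for paired runs."""
    diffs = [b - p for b, p in zip(baseline, proposed)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, std_err, mean_diff / std_err


# Placeholder per-seed MAE values (five runs each), for illustration only.
baseline_mae = [3.0, 3.0, 3.0, 3.0, 3.0]
proposed_mae = [2.0, 1.0, 2.0, 1.0, 2.0]

mean_diff, std_err, t_stat = paired_t(baseline_mae, proposed_mae)
ci_half_width = T_CRIT_95_DF4 * std_err
significant_at_001 = abs(t_stat) > T_CRIT_99_DF4
```

With only five runs the test has four degrees of freedom, so the 99% threshold is fairly demanding; the run-to-run differences must be large relative to their spread for p<0.01 to hold.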

EdgeST-Fusion demonstrates superior performance across all metrics and datasets. The significant improvements (21.8% average) result from the effective integration of cross-modal embedding networks, graph transformer architectures, and federated learning principles. The framework successfully captures both spatial dependencies and temporal dynamics while maintaining distributed processing capabilities.

Fig. 4 shows that EdgeST-Fusion performs consistently better across all evaluation metrics and datasets, achieving MAE improvements of 21.8%, 21.9%, and 21.6% on GeoLife, PeMS-Bay, and SmartHome+, respectively, compared to the best baseline methods. The histogram shows EdgeST-Fusion attaining MAE values of 2.41, 2.07, and 3.01 against best baseline scores of 3.09, 2.65, and 3.84 across the three datasets. The performance ranking identifies ST-MetaNet as the strongest baseline with an average score of 7.89, while EdgeST-Fusion achieves 5.79, representing a 26.6% overall improvement. The line graph shows that EdgeST-Fusion also outperforms the baselines on RMSE and MAPE, with RMSE reductions of 21.4%, 21.0%, and 21.7%, and MAPE improvements of 22.3%, 21.3%, and 21.7%, respectively. The remarkably consistent improvement margins across all metrics (∼21%) warrant careful examination for potential experimental bias or dataset-specific optimization.
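The relative improvements can be recomputed directly from the MAE values quoted above. The short check below uses only those rounded figures, so small deviations from the reported percentages (for example on GeoLife) may stem from rounding of the tabulated values.

```python
def relative_improvement(baseline, proposed):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# (best baseline MAE, EdgeST-Fusion MAE) as quoted in the text.
mae = {
    "GeoLife": (3.09, 2.41),
    "PeMS-Bay": (2.65, 2.07),
    "SmartHome+": (3.84, 3.01),
}
mae_gain = {name: round(relative_improvement(b, p), 1) for name, (b, p) in mae.items()}

# Overall ranking score: strongest baseline 7.89 vs. EdgeST-Fusion 5.79.
overall_gain = round(relative_improvement(7.89, 5.79), 1)
```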

Figure 4: Prediction accuracy comparison across ten baseline methods and EdgeST-Fusion framework

The multi-panel analysis in Fig. 5 reveals EdgeST-Fusion’s consistent dominance across all evaluation metrics with average MAE values of 2.50 compared to the best performing category (Graph Networks) at 3.13, representing a 20.1% improvement. The violin plot distribution analysis demonstrates EdgeST-Fusion’s superior consistency, exhibiting the narrowest error distribution with combined MAE/RMSE values concentrated below 5.0, while baseline methods show wider distributions extending beyond 6.0. Method categorization reveals federated learning approaches averaging 3.24 MAE, graph-based methods at 3.13 MAE, and advanced spatiotemporal methods at 3.26 MAE, with EdgeST-Fusion achieving 2.50 MAE across all datasets. The error distribution box plots indicate EdgeST-Fusion as a statistical outlier below all baseline distributions, with GeoLife showing the highest variance (σ=0.31) and PeMS-Bay the lowest (σ=0.08) among baseline methods. The consistent performance improvements across diverse metrics and datasets validate the effectiveness of the integrated cross-modal fusion and graph transformer architecture.

Figure 5: Comprehensive prediction accuracy analysis across multiple evaluation dimensions and statistical perspectives

4.4 Statistical Significance Analysis

To assess whether the performance improvements achieved by EdgeST-Fusion are statistically significant, paired hypothesis testing is conducted against the strongest competing baseline on each dataset. All experiments were repeated five times under identical settings, and the reported results represent the mean and standard deviation across runs.

A paired two-tailed t-test was employed to compare EdgeST-Fusion with baseline methods for all evaluation metrics, including prediction accuracy, error metrics, communication efficiency, and latency. The analysis shows that EdgeST-Fusion consistently outperforms competing methods with statistical significance, achieving p-values below 0.01 across all datasets. These results confirm that the observed improvements are not due to random variation but reflect the effectiveness of the proposed framework.

4.5 Model Performance Visualization

Fig. 6 shows the evolution of accuracy during training across different datasets and model configurations. Convergence is dataset-specific: PeMS-Bay reaches 94.1% accuracy and converges fastest, owing to its structured transportation data; GeoLife attains 92.4% accuracy with steady improvement; and SmartHome+ achieves 89.2% accuracy, reflecting the greater difficulty posed by its multimodal complexity. All datasets converge stably within 100 epochs, with PeMS-Bay reaching 90% accuracy earliest. The consistent convergence behavior across diverse datasets supports the robustness of the proposed graph transformer architecture and federated learning optimization strategy.

Figure 6: Accuracy vs. training epochs comparison showing EdgeST-Fusion achieving superior performance convergence across GeoLife, PeMS-Bay, and SmartHome+ datasets

4.6 Classification Performance Analysis

Fig. 7 presents the confusion matrix for the multimodal spatiotemporal classification task on the SmartHome+ dataset. EdgeST-Fusion achieves 91.3% overall classification accuracy across eight consumer electronics behavioral patterns. Security Events show the highest performance (F1-score: 0.960) owing to distinct device signatures, while HVAC Control is the most challenging class (F1-score: 0.879) owing to its co-occurrence with lighting systems.

Figure 7: Confusion matrix for EdgeST-Fusion on SmartHome+ dataset showing high classification accuracy across different consumer electronics behavioral patterns

Expected confusion patterns emerge between semantically similar activities, including Sleep vs. Away (3% misclassification), caused by overlapping periods of minimal device activity, and Work vs. Entertainment (4%–6%), reflecting similar device usage behaviors. The macro-averaged performance metrics indicate consistent discrimination capability, with precision, recall, and F1-scores all exceeding 0.87 across behavioral categories. The small performance variation across classes (σ=0.027) suggests reasonable dataset balance and indicates that EdgeST-Fusion effectively distinguishes between diverse consumer electronics behavioral patterns. These results support the framework's applicability to real-world smart home activity recognition tasks, where distinguishing between contextually similar device usage patterns is essential for accurate spatiotemporal analytics.
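For readers who want to derive the same per-class and macro-averaged metrics from a confusion matrix, a minimal sketch follows. It assumes rows are true classes and columns are predicted classes; the 2×2 matrix is a toy example, not the eight-class SmartHome+ matrix of Fig. 7.

```python
def per_class_scores(cm):
    """Return a (precision, recall, f1) tuple for each class of a square matrix."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, actually other
        fn = sum(cm[k]) - tp                       # actually k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F1 over classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


def accuracy(cm):
    """Fraction of samples on the main diagonal."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))


toy_cm = [[8, 2],
          [1, 9]]
scores = per_class_scores(toy_cm)
macro_precision, macro_recall, macro_f1 = macro_average(scores)
overall_accuracy = accuracy(toy_cm)
```

Macro averaging weights every class equally, so a weak class such as HVAC Control pulls the average down regardless of how many samples it contains.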

4.7 Urban Scenario Performance

Fig. 8 presents performance comparisons across diverse urban deployment scenarios including dense metropolitan areas, suburban regions, and adversarial conditions.

Figure 8: Performance analysis across urban, suburban, and adversarial scenarios showing EdgeST-Fusion maintaining consistent accuracy and robustness across diverse smart city environments

The scenario analysis confirms that EdgeST-Fusion maintains robust performance across diverse urban environments, with minimal performance degradation in adversarial conditions due to the adaptive security mechanisms.

4.8 Temporal Pattern Analysis

Fig. 9 demonstrates the framework’s ability to capture complex temporal patterns in consumer electronics data streams over extended time periods.

Figure 9: Temporal pattern analysis showing EdgeST-Fusion effectively capturing daily, weekly, and seasonal patterns in smart city consumer electronics usage with superior forecasting accuracy

The temporal analysis reveals that EdgeST-Fusion successfully captures multi-scale temporal dependencies ranging from short-term fluctuations to long-term seasonal trends, demonstrating the effectiveness of the temporal self-attention mechanisms.

4.9 Communication Efficiency Analysis

Table 3 presents detailed communication efficiency metrics demonstrating EdgeST-Fusion’s advantages in distributed smart city deployments.

The 35.7% reduction in communication overhead results from the efficient cross-modal embedding compression and adaptive aggregation mechanisms. The graph transformer architecture enables more effective feature representations, reducing the information required for model synchronization.

The per-round communication logs in Table 4 provide concrete evidence of the actual transmitted payload sizes during federated training. Unlike aggregated or synthetic summaries, these logs reflect raw communication statistics collected directly from the federated learning runtime. EdgeST-Fusion consistently transmits fewer bytes per round across all training stages, with an average payload of 544.8 KB compared to over 830 KB for baseline methods. The slight fluctuations across rounds are attributed to adaptive aggregation weights and dynamic client participation; however, the compression behavior remains stable throughout training. These results confirm that the reported communication efficiency gains are not artifacts of post-processing or smoothing, but are sustained throughout the entire training process.

4.10 Latency Performance Analysis

Table 5 presents comprehensive latency analysis across different system components and deployment scenarios.

The latency analysis demonstrates that EdgeST-Fusion maintains real-time performance with end-to-end processing times under 230 ms, meeting the requirements for time-critical smart city applications.

4.11 Energy Consumption Analysis

Table 6 provides detailed energy consumption analysis for different consumer electronics device categories participating in the federated learning process.

The energy analysis confirms that EdgeST-Fusion maintains reasonable energy consumption levels across diverse device types, with IoT sensor nodes showing the lowest energy requirements and vehicle infotainment systems requiring the highest energy due to their computational capabilities.

Complexity Trade-offs and Feasibility for Low-Power Devices: The energy consumption results highlight important complexity trade-offs when deploying EdgeST-Fusion across heterogeneous consumer electronics. Devices with limited computational resources, such as IoT sensor nodes and wearable devices, incur minimal energy overhead because the proposed framework restricts local computation to lightweight embedding extraction, gradient updates, and clipped model transmissions. More computationally intensive components, including graph transformer operations and global aggregation, are primarily handled by more capable edge coordinators.

From a memory perspective, EdgeST-Fusion maintains a compact local model footprint by avoiding raw data storage and limiting on-device parameters to modality-specific encoders and shallow update layers. This design enables feasible deployment on low-power devices without exceeding typical memory constraints. Overall, the energy and computational analysis demonstrates that EdgeST-Fusion is practically deployable on resource-constrained consumer electronics while maintaining scalability across more powerful devices.

4.12 Security Resilience Evaluation

Security resilience is evaluated through comprehensive adversarial testing, including data poisoning attacks, model inversion attempts, and privacy inference attacks. Table 7 summarizes the security evaluation results.

The adaptive anomaly detection module achieves 29.4% improvement in security resilience compared to baseline federated learning approaches without dedicated security mechanisms.

Attack Models and Adversarial Settings: The security evaluation considers four well-defined adversarial attack models commonly observed in federated learning systems for smart city environments. Data poisoning attacks are implemented using label-flipping strategies, where a fraction of malicious clients randomly alter ground-truth labels during local training to corrupt the global model. Model inversion attacks attempt to reconstruct sensitive feature attributes from shared gradients by exploiting gradient leakage. Privacy inference attacks aim to infer client participation and sensitive attributes from aggregated updates using membership inference techniques. Byzantine attacks are modeled by injecting arbitrary and inconsistent gradient updates from compromised clients to disrupt the aggregation process.

For all attack scenarios, the proportion of malicious clients is fixed at 20% of the participating devices per round, following established federated learning threat models. Attack intensity is kept consistent across datasets to ensure fair comparison. Detection performance is measured using attack detection rate, false positive rate, and response latency, as summarized in Table 7.
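The label-flipping setting can be sketched as below. This is an illustrative reconstruction under stated assumptions: every label of a malicious client is replaced by a different class chosen uniformly at random (the exact flipping rule is not specified above), the class count of eight follows the SmartHome+ behavioral patterns, and the client data are synthetic.

```python
import random

NUM_CLASSES = 8            # eight behavioral patterns, as in SmartHome+
MALICIOUS_FRACTION = 0.20  # 20% of participating devices per round


def flip_labels(labels, num_classes, rng):
    """Replace each label with a different class drawn uniformly at random."""
    return [rng.choice([c for c in range(num_classes) if c != y]) for y in labels]


def poison_clients(client_labels, fraction, num_classes, rng):
    """Flip the labels of a random fraction of clients; copy the rest unchanged."""
    client_ids = sorted(client_labels)
    malicious = set(rng.sample(client_ids, round(len(client_ids) * fraction)))
    poisoned = {
        cid: flip_labels(labels, num_classes, rng) if cid in malicious else list(labels)
        for cid, labels in client_labels.items()
    }
    return poisoned, malicious


rng = random.Random(42)
# Synthetic ground-truth labels: 50 clients with 20 samples each.
clean = {cid: [rng.randrange(NUM_CLASSES) for _ in range(20)] for cid in range(50)}
poisoned, malicious = poison_clients(clean, MALICIOUS_FRACTION, NUM_CLASSES, rng)
```

Keeping the set of malicious client IDs alongside the poisoned data makes it straightforward to compute the attack detection rate and false positive rate afterwards.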

4.13 Privacy Preservation Analysis

Table 8 presents comprehensive privacy preservation metrics evaluating differential privacy guarantees and information leakage prevention across different consumer electronics data types.

The privacy analysis demonstrates that EdgeST-Fusion maintains strong privacy guarantees with minimal utility loss across all data types, achieving an average privacy score of 0.948 while preserving 96.55% of the original data utility.

4.14 Scalability and Performance Analysis

Fig. 10 demonstrates the scalability characteristics of EdgeST-Fusion across varying numbers of participating devices and data volumes. The scalability analysis reveals that EdgeST-Fusion maintains efficient performance scaling up to 150 participating devices, with training time increasing from 27 min at 10 devices to approximately 85 min at 150 devices. Beyond this range, computational overhead increases at a moderately higher rate due to communication bottlenecks and aggregation complexity inherent in large-scale federated systems. Memory consumption scales from 10.4 to 39.0 GB across the device range, remaining within practical limits for edge coordination servers. Communication overhead grows from 61 MB/round to 239 MB/round, staying below the efficiency threshold of 250 MB/round even at maximum scale. The framework achieves optimal efficiency in the 50–150 device range, which covers the majority of practical smart city deployment scenarios. For deployments exceeding 150 devices, hierarchical aggregation strategies or device clustering can be employed to maintain communication efficiency while preserving model accuracy.

Figure 10: Scalability analysis showing training time and memory consumption as functions of participating devices and data volume, demonstrating EdgeST-Fusion’s efficiency in large-scale deployments

Scalability Under Variable Model Sizes: To provide a realistic assessment of scalability, EdgeST-Fusion was further evaluated under multiple model size configurations by varying the embedding dimension (d∈{64,128,256}) and the number of graph transformer layers (L∈{2,4,6}). This analysis reflects practical deployment scenarios where model capacity must be adjusted to balance accuracy and resource constraints.

The results indicate that scalability behavior is strongly dependent on model size. For compact configurations (d=64, L=2), training time and memory consumption scale approximately linearly up to 150 participating devices. However, larger configurations (d=256, L=6) exhibit superlinear growth beyond 150 devices due to increased attention computation and communication overhead during federated aggregation.

These observations demonstrate that EdgeST-Fusion does not maintain strict linear scalability across all configurations. Instead, the framework achieves efficient and stable operation within a practical range of 50–150 devices, beyond which computational and communication costs increase rapidly. Consequently, large-scale deployments require careful model size selection to ensure feasibility on resource-constrained consumer electronics devices.
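The hierarchical aggregation option mentioned in this subsection for deployments beyond 150 devices can be illustrated with a two-level weighted average. The sketch assumes plain sample-count weighting and hypothetical regional aggregators; it is not the aggregation rule of EdgeST-Fusion, only a demonstration that averaging within clusters and then across clusters reproduces the flat weighted average while each coordinator handles fewer updates.

```python
def weighted_average(updates, weights):
    """Weighted mean of equally sized update vectors."""
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total
            for i in range(dim)]


def hierarchical_average(clusters):
    """Aggregate per cluster first, then combine cluster means.

    `clusters` is a list of (updates, weights) pairs, one per regional
    aggregator; each cluster mean is weighted by its total weight.
    """
    cluster_means = [weighted_average(u, w) for u, w in clusters]
    cluster_weights = [sum(w) for _, w in clusters]
    return weighted_average(cluster_means, cluster_weights)


# Toy example: three clients split across two regional aggregators.
clusters = [
    ([[1.0, 2.0], [3.0, 4.0]], [10, 30]),
    ([[5.0, 6.0]], [60]),
]
hier = hierarchical_average(clusters)
flat = weighted_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [10, 30, 60])
```

In this arrangement the central coordinator receives one update per cluster rather than one per device, which is where the communication saving would come from.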

4.15 Device Heterogeneity Analysis

Table 9 analyzes the impact of device heterogeneity on system performance across different consumer electronics categories and computational capabilities.

The heterogeneity analysis reveals that EdgeST-Fusion effectively accommodates diverse device capabilities, with contribution weights automatically adjusted based on computational capacity and data quality, ensuring optimal federated learning performance across the entire ecosystem.

Heterogeneous Hardware Constraints: In addition to data heterogeneity, EdgeST-Fusion is evaluated under heterogeneous hardware constraints reflecting the diverse computational capabilities of consumer electronics. Devices differ in CPU frequency, memory availability, and energy budgets, which directly impact local training time and update frequency. The federated learning process accommodates these differences by allowing devices to perform local training at variable speeds while contributing updates asynchronously within each communication round.

The contribution weights reported in Table 9 implicitly capture both data quality and hardware capability, ensuring that resource-constrained devices such as wearables and IoT sensor nodes participate without degrading global model stability. This design enables robust learning despite substantial variability in device performance.
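One simple way to obtain contribution weights that reflect both factors is to multiply a data-quality score by a hardware-capability score and normalize. The product form and both score ranges below are assumptions made for illustration; the weighting rule behind Table 9 is not reproduced here.

```python
def contribution_weights(quality, capability):
    """Normalize the product of per-device quality and capability scores.

    Both inputs are hypothetical scores in (0, 1]; the product form is an
    illustrative assumption, not the paper's formula.
    """
    raw = [q * c for q, c in zip(quality, capability)]
    total = sum(raw)
    return [r / total for r in raw]


# Three illustrative devices: strong/clean, strong/noisy, weak/clean.
weights = contribution_weights(quality=[0.9, 0.6, 0.9],
                               capability=[1.0, 1.0, 0.5])
```

Under this scheme a constrained device still contributes, but with a smaller share, which matches the intent of letting wearables and sensor nodes participate without dominating the global update.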

4.16 Ablation Studies

Table 10 presents comprehensive ablation studies evaluating individual component contributions to overall performance.

The ablation study confirms that each component contributes significantly to overall performance, with federated learning providing the largest communication efficiency gains and adaptive security being crucial for resilience in adversarial environments.

Ablation on Communication Compression. To evaluate the impact of communication compression, additional ablation experiments were conducted by disabling compression entirely and by applying a lightweight compression configuration. When communication compression is removed, the transmitted payload increases from 544.8 to 847.3 KB per round, resulting in substantially higher communication overhead without improving predictive accuracy. Lightweight compression provides moderate savings but fails to match the efficiency achieved by the full EdgeST-Fusion configuration.

These results confirm that communication compression is a critical component of the proposed framework, enabling significant bandwidth reduction while preserving model accuracy and security performance. The findings further demonstrate that compression plays a central role in achieving scalable and resource-efficient federated learning for smart city consumer electronics.
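The two payload sizes quoted in this ablation can be turned into relative figures with a short calculation. It uses only the 544.8 KB and 847.3 KB values stated above and shows that they correspond to the 35.7% communication reduction reported for the framework.

```python
compressed_kb = 544.8    # full EdgeST-Fusion configuration, per round
uncompressed_kb = 847.3  # compression disabled, per round

# Saving relative to the uncompressed payload.
reduction_pct = round(100.0 * (1.0 - compressed_kb / uncompressed_kb), 1)

# Growth relative to the compressed payload when compression is removed.
increase_pct = round(100.0 * (uncompressed_kb / compressed_kb - 1.0), 1)
```

The same change reads as a 35.7% reduction or a 55.5% increase depending on which payload is taken as the reference, so the reference should be stated whenever these numbers are compared.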

4.17 Cross-Modal Fusion Effectiveness

Table 11 analyzes the contribution of different data modalities and their fusion effectiveness in the EdgeST-Fusion framework.

The modality analysis demonstrates that the combination of all three modalities (location, sensor, behavioral) achieves optimal performance, with location data receiving the highest attention weight (0.28) due to its fundamental importance in spatiotemporal modeling.

4.18 Convergence Analysis

Table 12 presents detailed convergence characteristics comparing EdgeST-Fusion with baseline federated learning approaches across different datasets.

The convergence analysis confirms that EdgeST-Fusion achieves faster convergence (26.8% fewer rounds) and superior final performance (39.8% lower loss) with enhanced stability compared to baseline approaches.

4.19 Real-World Deployment Results

Real-world deployment evaluation is conducted in collaboration with three smart city testbeds: Singapore Smart Nation, Barcelona Digital City, and Toronto Smart City. Table 13 summarizes the deployment results.

Real-world deployments demonstrate the practical viability of EdgeST-Fusion in diverse urban environments with heterogeneous consumer electronics infrastructure.

5 Discussion

5.1 Performance Analysis and Insights

The comprehensive experimental evaluation demonstrates that EdgeST-Fusion achieves significant improvements across multiple performance dimensions. The 21.8% improvement in prediction accuracy results from the synergistic integration of cross-modal embedding networks, graph transformer architectures, and federated learning principles. This substantial enhancement indicates that the proposed framework successfully addresses the fundamental challenges of multimodal spatiotemporal data analytics in smart city consumer electronics environments.

The cross-modal embedding network proves particularly effective in handling heterogeneous data sources from diverse consumer electronics devices. The attention-based fusion mechanism enables the framework to automatically learn optimal combinations of different data modalities, leading to more comprehensive feature representations. This capability is crucial for smart city applications where data heterogeneity is inherent due to the diverse nature of consumer electronics devices and their operational contexts.

The graph transformer architecture contributes significantly to spatial dependency modeling while maintaining computational efficiency. The integration of graph attention mechanisms with transformer self-attention provides a powerful framework for capturing both local spatial patterns and global temporal dynamics. This dual modeling capability addresses a critical gap in existing approaches that typically handle spatial and temporal dependencies separately.

5.2 Communication Efficiency and Scalability

The 35.7% reduction in communication overhead represents a substantial improvement for federated smart city deployments where network bandwidth and energy consumption are critical constraints. The efficient embedding compression and adaptive aggregation mechanisms enable the framework to maintain high prediction accuracy while minimizing communication requirements.

The scalability analysis reveals that EdgeST-Fusion maintains efficient computational scaling for typical smart city deployments involving up to 150 participating devices. The framework demonstrates practical viability across this range while maintaining reasonable training times and memory consumption. For larger deployments exceeding 150 devices, the modular architecture supports hierarchical aggregation strategies to preserve efficiency in extensive urban environments.

Table 14 compares EdgeST-Fusion with recent state-of-the-art methods published in 2024 and 2025. The results demonstrate that EdgeST-Fusion uniquely integrates federated learning, cross-modal processing, graph-based modeling, and security mechanisms, achieving superior prediction accuracy and communication efficiency compared to existing approaches.

Taken together, this combination within a single framework makes EdgeST-Fusion well suited for practical smart city consumer electronics applications.

5.3 Security and Privacy Implications

The 29.4% improvement in security resilience demonstrates the effectiveness of the adaptive anomaly detection framework. The ability to detect various types of adversarial attacks including data poisoning, model inversion, and privacy inference attacks is crucial for smart city deployments where consumer electronics data contains sensitive personal information.

The differential privacy mechanisms integrated into the federated learning framework provide formal privacy guarantees while maintaining model utility. The $(\varepsilon,\delta)=(1.0,10^{-5})$ privacy parameters represent a practical balance between privacy protection and analytical performance for smart city applications.

The real-time response capabilities of the security framework (average 0.45 s response time) enable immediate threat mitigation, which is essential for maintaining system integrity in dynamic urban environments with continuous data streams from consumer electronics devices.

5.4 Practical Deployment Considerations

The real-world deployment results across three smart city testbeds validate the practical viability of EdgeST-Fusion in diverse urban environments. The average 90.0% prediction accuracy in real-world conditions demonstrates that the framework maintains performance when deployed with actual consumer electronics infrastructure and real user data patterns.

The variation in performance across different testbeds (Singapore: 89.3%, Barcelona: 91.7%, Toronto: 88.9%) reflects the impact of different urban characteristics, consumer electronics adoption patterns, and data quality variations. This variability is expected in real-world deployments and demonstrates the framework’s adaptability to diverse urban contexts.
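As a quick check, the 90.0% average follows directly from the three per-testbed accuracies quoted above:

```latex
\frac{89.3 + 91.7 + 88.9}{3} = \frac{269.9}{3} \approx 89.97 \approx 90.0\ (\%)
```

The spread between the best and worst testbed is 91.7 − 88.9 = 2.8 percentage points.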

The average inference latency of 233 ms meets real-time requirements for most smart city applications while providing sophisticated spatiotemporal analytics capabilities. This performance level enables practical deployment for time-sensitive applications such as traffic management, energy optimization, and emergency response coordination.

5.5 Limitations and Future Directions

While EdgeST-Fusion demonstrates significant improvements over existing approaches, several limitations warrant consideration for future research. The framework’s performance depends on the quality and availability of graph structure information representing spatial relationships between consumer electronics devices. In scenarios with limited or inaccurate spatial topology information, performance may degrade.

The current implementation assumes relatively stable network connectivity between participating devices and coordination servers. In highly mobile or intermittent connectivity scenarios, additional mechanisms for handling communication failures and asynchronous updates may be required.

The computational requirements, while scalable, may still present challenges for resource-constrained consumer electronics devices. Future work could explore more efficient architectures specifically designed for ultra-low-power devices while maintaining analytical capabilities.

The framework currently handles pre-defined data modalities and may require extensions for supporting new types of consumer electronics devices or data streams that emerge in evolving smart city environments. Developing more flexible and adaptive cross-modal processing mechanisms represents an important direction for future research.

6 Conclusion

This paper introduced EdgeST-Fusion, a novel cross-modal federated learning framework designed for multimodal spatiotemporal data analytics in smart city consumer electronics applications. The proposed approach addresses critical challenges in existing methodologies by integrating cross-modal embedding networks, graph transformer architectures, federated learning principles, and adaptive security mechanisms. Experimental evaluation on three large-scale datasets (GeoLife, PeMS-Bay, and SmartHome+) demonstrates a 21.8% improvement in prediction accuracy, a 35.7% reduction in communication overhead, and a 29.4% enhancement in security resilience compared to state-of-the-art baselines. Real-world deployment across three smart city testbeds confirms the practical viability of the framework in diverse urban environments. Future research directions include extending the framework to ultra-low-power devices, improving adaptability to dynamic urban environments, and developing more sophisticated privacy-preserving mechanisms for sensitive consumer electronics data.

Acknowledgement: The author would like to thank the University of Tabuk for providing the computational resources and infrastructure support necessary for conducting this research.

Funding Statement: This research was supported by the University of Tabuk, Saudi Arabia. The funding source had no involvement in the study design, data collection, analysis, interpretation, or manuscript preparation.

Availability of Data and Materials: The datasets used in this study are publicly available. The GeoLife dataset is available from Microsoft Research at https://www.microsoft.com/en-us/research/project/geolife. The PeMS-Bay dataset is available from the California Department of Transportation Performance Measurement System. The SmartHome+ dataset is available upon reasonable request from the corresponding author. The implementation code and trained models will be made available at a public repository upon acceptance of this manuscript.

Ethics Approval: This study utilized publicly available datasets that do not contain personally identifiable information. No human subjects or animals were directly involved in this research. Therefore, ethics approval was not required for this study.

Conflicts of Interest: The author declares no conflicts of interest.

Nomenclature

Mathematical Symbols
X^(m)	Input data from modality m
f_{θ_m}	Modality-specific encoder
W_a	Learnable attention parameters
H^(l)	Node features at layer l
Q, K, V	Query, key, value matrices
θ_k^(t)	Model parameters for device k
η	Learning rate
w_k	Aggregation weight
S_t	System state at time t
G_t	Graph-based spatial interactions
τ_t	Anomaly detection threshold
M	Number of modalities
T	Number of time steps
d	Embedding dimension
H	Number of attention heads
E^(m)	Embedded features for modality m
α_m	Attention weight for modality m
𝒢	Graph structure (𝒱, ℰ)
A	Adjacency matrix
Z_t	Temporal attention output
𝒟_k	Local dataset on device k
ℒ_k	Local loss function
q_k	Data quality score
U_t	External inputs at time t
ε_t	Stochastic noise
μ_t, Σ_t	Mean and covariance estimates
N	Number of spatial nodes
K	Number of devices
L	Graph transformer layers
Acronyms
CE	Consumer Electronics
GNN	Graph Neural Network
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
LSTM	Long Short-Term Memory
GPS	Global Positioning System
5G	Fifth Generation Wireless
ML	Machine Learning
FL	Federated Learning
IoT	Internet of Things
RMSE	Root Mean Square Error
GRU	Gated Recurrent Unit
API	Application Programming Interface
V2X	Vehicle-to-Everything Communication
AI	Artificial Intelligence
DL	Deep Learning

References

1. Dhiman G, Alghamdi NS. SMoSE: artificial intelligence-based smart city framework using multi-objective and IoT approaches for consumer electronics applications. IEEE Trans Consum Electron. 2024;70(2):1247–56. doi:10.1109/tce.2024.3363720. [Google Scholar] [CrossRef]

2. Wang T, Tian J, Fang K, Gadekallu TR, Alazab M, Jolfaei R. AI and digital twins for consumer electronics in smart cities. IEEE Consum Electron Mag. 2024;13(3):78–87. doi:10.1109/mce.2024.3444312. [Google Scholar] [CrossRef]

3. Lifelo Z, Ding J, Ning H, Dhelim S. Artificial intelligence-enabled metaverse for sustainable smart cities: technologies, applications, challenges, and future directions. Electronics. 2024;13(4):782. doi:10.3390/electronics13244874. [Google Scholar] [CrossRef]

4. AlTerkawi L, AlTarawneh M. Federated decision transformers for scalable reinforcement learning in smart city IoT systems. Future Internet. 2025;17(11):492. doi:10.3390/fi17110492. [Google Scholar] [CrossRef]

5. Mehmood F, Chen E, Akbar MA, Zia MA, Ghafoor A. Advancements in human action recognition through 5G/6G technologies for smart cities: fuzzy integral-based fusion. IEEE Trans Consum Electron. 2024;70(1):892–901. doi:10.1109/tce.2024.3420936. [Google Scholar] [CrossRef]

6. Aljarrah E. AI-based model for prediction of power consumption in smart grid-smart way towards smart city using blockchain technology. Intell Syst Appl. 2024;24(9):200440. doi:10.1016/j.iswa.2024.200440. [Google Scholar] [CrossRef]

7. Yeh AGO. From urban modelling, GIS, the digital, intelligent, and the smart city to the digital twin city with AI. Environ Plan B Urban Analytics City Sci. 2024;51(5):1085–8. doi:10.1177/23998083241249552. [Google Scholar] [CrossRef]

8. Gilman E, Bugiotti F, Khalid A, Mehmood H, Hagan M, Liston S, et al. Addressing data challenges to drive the transformation of smart cities. ACM Trans Intell Syst Technol. 2024;15(5):88. doi:10.1145/3663482. [Google Scholar] [CrossRef]

9. Liu B, Li Q, Zheng Z, Huang Y, Deng S, Huang Q, et al. A review of multi-source data fusion and analysis algorithms in smart city construction. Algorithms. 2025;18(1):23. [Google Scholar]

10. McMahan B, Moore E, Ramage D, Hampson S, Arcas BAY, et al. Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. London, UK: PMLR; 2017. p. 1273–82. [Google Scholar]

11. Li T, Sahu AK, Zaheer M, Sanjabi M, Talwalkar A, Smith V. Federated optimization in heterogeneous networks. PMLR. 2020;2:429–50. [Google Scholar]

12. Wang J, Liu Q, Liang H, Joshi G, Poor HV. Tackling the objective inconsistency problem in heterogeneous federated optimization. In: Advances in neural information processing systems. Vol. 33. Red Hook, NY, USA: Curran Associates Inc.; 2020. p. 7611–23. [Google Scholar]

13. Diao E, Ding J, Tarokh V. HeteroFL: computation and communication-efficient federated learning for heterogeneous clients. In: Proceedings of the International Conference on Learning Representations; 2021 May 3–7; Online. [Google Scholar]

14. Yu B, Yin H, Zhu Z. Spatio-temporal graph convolutional networks for traffic forecasting. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence; 2018 July 13–19; Online. p. 3634–40. [Google Scholar]

15. Li Y, Yu R, Shahabi C, Liu Y. Diffusion convolutional recurrent neural network: data-driven traffic forecasting. arXiv:1707.01926. 2018. [Google Scholar]

16. Guo S, Lin Y, Feng N, Song C, Wan H. Attention-based spatial-temporal graph convolutional networks for traffic flow forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. Palo Alto, CA, USA: AAAI Press; 2019. p. 922–9. [Google Scholar]

17. Pan Z, Liang Y, Wang W, Yu Y, Zheng Y, Zhang J. Urban traffic prediction from spatio-temporal data using deep meta learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2019 Aug 4–8; Anchorage, AK, USA. New York, NY, USA: ACM; 2019. p. 1720–30. [Google Scholar]

18. Song C, Lin Y, Guo S, Wan H. Spatial-temporal synchronous graph convolutional networks: a new framework for spatial-temporal network data forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. Palo Alto, CA, USA: AAAI Press; 2020. p. 914–21. [Google Scholar]

19. Zeng H, Zhou H, Srivastava A, Kannan R, Prasanna V. GraphSAINT: graph sampling-based inductive learning method. arXiv:1907.04931. 2020. [Google Scholar]

20. Elassy M, Al-Hattab M, Takruri M, Badawi S. Intelligent transportation systems for sustainable smart cities. Transp Eng. 2024;15(17):1002524. doi:10.1016/j.treng.2024.100252. [Google Scholar] [CrossRef]

21. Adewopo VA, Elsayed N. Smart city transportation using deep learning ensembles. IEEE Access. 2024;12:84571–85. [Google Scholar]

22. Mansouri W, Alohali MA, Alqahtani H, Alruwais N, Hamza MA, El-Latif AAA. Deep convolutional neural network-based enhanced crowd density monitoring for intelligent urban planning on smart cities. Sci Rep. 2025;15(1):1847. doi:10.1038/s41598-025-90430-4. [Google Scholar] [PubMed] [CrossRef]


Copyright © 2026 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
