Open Access
ARTICLE
A Resilient BIRCH-Based Smart Framework for Real-Time IoT Data Clustering
1 Department of Information Technology, The Assam Kaziranga University, Koraikhowa, NH-37, Jorhat, Assam, India
2 Department of Computer Science and Engineering, Center of Excellence in AI, Adamas University, Barasat-Barrackpore Road, Kolkata, West Bengal, India
3 Department of Library and Information Science, Artificial Intelligence Development Center, Fu Jen Catholic University, New Taipei City, Taiwan
4 Department of Computer Science and Information Engineering, Fintech and Blockchain Research Center, Asia University, Taichung City, Taiwan
* Corresponding Author: Cheng-Chi Lee. Email:
(This article belongs to the Special Issue: Innovative Computational Models for Smart Cities)
Computer Modeling in Engineering & Sciences 2026, 147(1), 29 https://doi.org/10.32604/cmes.2026.079203
Received 16 January 2026; Accepted 23 March 2026; Issue published 27 April 2026
Abstract
Real-time data processing is essential in the evolving landscape of IoT applications, ensuring efficiency, reliability, and adaptability. However, conventional clustering algorithms often face difficulties in managing high-frequency, continuous IoT data streams due to limited adaptability and high computational overhead. To address these challenges, this study proposes a resilient adaptation of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm, tailored specifically for streaming IoT data. The enhanced approach dynamically recalculates clusters and determines the optimal number of clusters using the KneeLocator method. Unlike the original batch-oriented BIRCH, the modified version processes data incrementally, enabling continuous adaptation to changing data distributions. The proposed method was validated on benchmark IoT datasets and compared against K-Means, DBSCAN, standard BIRCH, and other state-of-the-art streaming-based clustering algorithms. Results consistently show that the modified BIRCH outperforms existing approaches in execution speed, memory efficiency, scalability, and clustering accuracy. In addition, the algorithm has been deployed within a web-based application featuring interactive visualization and anomaly detection, highlighting its practical relevance for smart city and industrial IoT scenarios. To promote reproducibility and future research, the complete framework and source code have been made publicly available.
Addressing the optimization of real-time IoT data processing through clustering presents a compelling challenge, essential for maintaining the efficiency, reliability, and scalability of modern IoT systems [1,2]. In many real-world applications, obtaining labelled data is often impractical due to the volume, velocity, and heterogeneity of sensor-generated data. This scarcity has led to the adoption of unsupervised learning techniques, which enable systems to automatically identify patterns and structures without explicit supervision. Prior studies have demonstrated how such methods can facilitate continuous and adaptive processing of streaming data in dynamic environments [3]. Furthermore, by leveraging unsupervised clustering techniques, IoT frameworks can adapt to changing conditions and uncover hidden patterns in data, thereby enhancing system resilience and minimizing dependence on manually labeled datasets [4,5].
Cluster validation plays a pivotal role in optimizing IoT data processing by grouping similar data points to capture latent patterns within large-scale, heterogeneous, and often unstructured datasets [6,7]. Classical clustering algorithms such as DBSCAN, K-means, and hierarchical clustering are widely adopted in IoT systems for organizing sensor streams, network states, and behavioral patterns into meaningful groups [8–10]. Ensuring cluster quality is critical, as well-formed clusters directly influence the interpretability, reliability, and downstream decision-making capabilities of IoT analytics frameworks. Accordingly, validation techniques including silhouette analysis, elbow-based methods, and cohesion–separation measures are commonly employed to assess intra-cluster compactness and inter-cluster distinctiveness, thereby guiding the selection of an appropriate number of clusters [11].
Beyond conventional validation metrics, recent studies have emphasized application-driven and context-aware validation strategies tailored to specific IoT domains. For example, clustering of building energy consumption time-series has been enhanced using correlation-aware distance measures and prediction-oriented cluster validity indices, where clustering quality is evaluated based on its contribution to forecasting accuracy rather than geometric compactness alone [12]. Such approaches demonstrate that validated clustering labels can substantially improve the performance of downstream predictive models.
Similarly, in wireless sensor network–enabled IoT environments, adaptive clustering combined with bio-inspired optimization and learning-based validation has been shown to improve energy efficiency, network lifetime, and fault resilience. These works highlight the necessity of validation mechanisms that explicitly account for node energy constraints, dynamic topology variations, and data reliability in hostile deployment conditions [13].
Real-time and large-scale IoT applications further require clustering and validation techniques that remain computationally efficient under streaming conditions. In this direction, clustering-based system state modeling has been explored for anomaly and intrusion detection, where validated clusters represent normal operational behavior and deviations indicate potential threats. Such frameworks emphasize validation not only in terms of clustering structure but also runtime efficiency, memory overhead, and responsiveness [14].
In addition, large-scale IoT-driven decision-making systems have adopted hashing-based clustering approaches with stability-focused validation criteria, enabling scalable grouping of evolving decision patterns while maintaining high cohesion and consensus among participants [15]. At a broader systems level, recent studies identify clustering and its validation as significant contributors to energy inefficiency in IoT networks, underscoring the need for adaptive and energy-aware clustering mechanisms within sustainable IoT architectures [16].
Concurrently, clustering-based intrusion detection has gained attention for addressing limited labeled data and unknown attack patterns in IoT environments. Anchor graph clustering methods unify graph construction, learning, and clustering within a single framework to achieve accurate and time-efficient intrusion detection under label-constrained conditions [17].
Collectively, these studies indicate a clear shift from static, geometry-centric validation toward adaptive, performance-aware, and domain-specific cluster validation frameworks. This evolution aligns with the increasing complexity of IoT ecosystems, where clustering outcomes must simultaneously satisfy scalability, energy efficiency, prediction accuracy, security robustness, and real-time responsiveness. Consequently, cluster validation is no longer a post-processing step but an integral component of intelligent IoT analytics pipelines, motivating continued research into unified and real-time validation strategies.
In an IoT framework, real-time clustering not only provides insights into normal data patterns but also facilitates the identification of anomalies by segmenting data into well-defined clusters. This approach is particularly beneficial for processing diverse IoT data, such as temperature, humidity, and pressure readings, which may exhibit high temporal variability [18]. By continuously updating clusters with new incoming data, real-time clustering models can adapt to shifts in data distribution and evolving patterns, a capability essential for robust IoT data processing [19,20]. Techniques such as modified BIRCH algorithms are well-suited to this task, as they dynamically recalculate clusters, reducing computational demands while enhancing clustering adaptability [21,22].
To ensure the effectiveness of these clustering techniques, especially in real-time scenarios, proper data preprocessing is critical. IoT data typically encompasses diverse metrics collected from various sensors in heterogeneous environments, including temperature, humidity, and motion readings [23]. Recent studies on indoor environmental quality monitoring further emphasize the importance of effective data acquisition, processing, and aggregation strategies to ensure reliable system performance, reduce data transfer rates, and enable efficient storage and visualization [24]. Preprocessing this data involves addressing missing values, applying feature scaling, and selecting relevant attributes, all of which contribute to improving clustering accuracy and computational efficiency [25]. Additionally, managing such diverse and high-frequency data streams requires an understanding of domain-specific challenges, particularly in complex environments like smart cities and industrial facilities. These challenges underscore the need for scalable and adaptable clustering techniques that can maintain accuracy under fluctuating data volumes and conditions [26–28]. To address these demands, Fig. 1 presents an overview of the proposed system pipeline, which illustrates the key stages involved—from raw data ingestion and preprocessing to real-time clustering, anomaly detection, and decision-making. This visual representation aligns with the need for efficient and responsive frameworks in practical IoT deployments.

Figure 1: An illustration of clustering techniques in real-time IoT data processing.
Unsupervised clustering methods are extensively utilized in various fields beyond IoT, underscoring their versatility and effectiveness. For instance, in customer segmentation, transaction records can be transformed into Recency, Frequency, and Monetary (RFM) values and analyzed with clustering algorithms to form well-separated groups that reflect spending behavior and enable targeted marketing strategies [29]. Similarly, clustering and anomaly detection are crucial in fraud detection, network security, and healthcare for identifying unusual trends and preventing security breaches [30,31]. In the fields of text mining and image segmentation, clustering assists in organizing documents and segmenting images, respectively, aiding in efficient information retrieval and visual data analysis [32].
In smart city implementations, IoT sensors continuously monitor environmental factors, generating extensive data that requires optimized clustering for real-time insights and anomaly detection [33]. The changing and varied nature of environmental data requires clustering methods that can quickly adapt to new conditions, helping smart city management with useful and reliable insights [10]. Additionally, incorporating edge devices, such as ESP32 sensors, can improve responsiveness by reducing latency, though it also introduces complexity in data processing and management [34–36].
Based on the above-mentioned points, this paper aims to address the following core challenges associated with real-time IoT data processing:
• Data analysis complexity: developing an efficient approach to process and analyze diverse, high-frequency data streams from multiple sensors in real time, extracting meaningful clusters for actionable insights.
• Clustering optimization: establishing an adaptive clustering mechanism that dynamically determines the optimal number of clusters, enhancing both scalability and interpretability of clustering results under varying data conditions.
• Scalability and efficiency: designing a system that can scale with increased data volumes and sensor counts without compromising clustering accuracy or efficiency, enabling continuous, high-quality data processing.
• Deployment challenges: addressing operational issues associated with deploying IoT sensors in mixed environments, ensuring stable and efficient data processing in both indoor and outdoor settings.
By tackling these challenges, this research aims to advance the development of optimized, real-time IoT clustering systems that can effectively process large-scale data, support adaptive cluster recalibration, and ensure scalability across diverse applications.
In this work, we present a real-time clustering framework that directly addresses these challenges through several key contributions. A modified BIRCH algorithm is introduced, capable of dynamically recalibrating clusters with reduced computational overhead, thereby improving adaptability in evolving data streams. An adaptive threshold adjustment mechanism is integrated to enhance clustering accuracy while maintaining scalability and memory efficiency for large-scale IoT deployments. The effectiveness of the framework is demonstrated through evaluation on multiple IoT datasets, highlighting improvements in clustering quality, execution time, and resource utilization. To support practical adoption, the framework is implemented within a web-based interface for real-time monitoring, anomaly detection, and decision-making.
This section presents the development of a real-time, web-based system for clustering IoT data and detecting anomalies, focusing on five key objectives:
• Real-time data processing: Efficiently manage continuous, high-frequency IoT data streams with dynamic cluster updates.
• Adaptive clustering: Implement a modified BIRCH algorithm that adjusts automatically to evolving data patterns without manual intervention.
• Timely anomaly detection: Identify deviations from established clusters promptly to enable rapid and informed responses.
• Scalability and memory efficiency: Support large-scale IoT networks while optimizing memory and resource utilization.
• User-friendly web interface: Provide accessible real-time monitoring and intuitive visualization of clustering results and anomalies.
The modified BIRCH algorithm incorporates the following improvements: the CF (Clustering Feature) tree insertion process has been optimized to reduce computational overhead; a dynamic threshold adjustment mechanism has been introduced to enhance clustering accuracy; and the algorithm can now handle streaming IoT data in real time, improving both responsiveness and adaptability.
2.1 Modified BIRCH Algorithm for Real-Time Clustering
The modified BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm was developed to address the specific demands of real-time IoT applications, where traditional clustering methods often fall short. While the original BIRCH algorithm is efficient for clustering large datasets, it is typically designed for static, batch-processed data and lacks adaptability for continuously generated IoT data. The modifications introduced in this algorithm allow it to recalibrate clusters dynamically, providing real-time updates that are crucial in fast-paced environments.
The overall architecture of the proposed BIRCH-based Smart Clustering Framework is shown in Fig. 2, which illustrates the complete pipeline: from raw IoT data ingestion and pre-processing to real-time clustering and decision making. This framework integrates data acquisition, processing through a Python-based server, and dynamic clustering using a modified BIRCH algorithm, facilitating anomaly detection and real-time insights. In IoT applications, data is generated continuously and is often high-frequency and high-volume. This nature of IoT data presents unique challenges for clustering algorithms:
• Scalability: IoT networks, such as smart cities or industrial sensor systems, often consist of numerous data sources generating large volumes of data. An efficient clustering algorithm must scale well to handle this data influx without excessive computational cost.
• Real-time adaptability: Since IoT data is dynamic, an effective clustering algorithm must be able to adapt to evolving data distributions over time. This is particularly important for applications like anomaly detection, where patterns can change as new data arrives.
• Memory efficiency: Due to the high volume and velocity of IoT data, memory usage needs to be optimized to prevent resource exhaustion and ensure continuous system performance.
To address these requirements, the modified BIRCH algorithm incorporates the following key enhancements:
• Real-time cluster updates: Unlike traditional BIRCH, which is batch-based, the modified algorithm recalculates clusters every 10 min to incorporate new data, ensuring clusters stay relevant to current distributions.
• Dynamic cluster determination: Instead of a fixed number of clusters, the modified version uses the KneeLocator method to dynamically determine the optimal cluster count at each recalibration. This flexibility removes the need for manual adjustments.
• Integration with a web-based application: The modified BIRCH is integrated into a web-based application that visualizes real-time clustering results and detects outliers as new data is processed, providing interactive monitoring capabilities.
By incorporating these features, the modified BIRCH algorithm delivers accurate, scalable, and memory-efficient clustering and anomaly detection for real-time IoT applications, supporting timely decision-making and response.

Figure 2: Architecture of the modified real-time BIRCH clustering framework.
2.2 Step-by-Step Explanation of the Algorithm
The proposed Smart Clustering Framework extends the classical BIRCH algorithm to operate under real-time IoT streaming conditions through parallel data ingestion, incremental CF-tree updates, and adaptive cluster evolution. Unlike static clustering approaches, the formulation explicitly models temporal arrival, batch-wise summarization, and continuous structural adaptation.
1. Data Pre-Processing
Incoming IoT sensor observations may contain missing or incomplete values due to packet loss, transmission delays, or asynchronous sensing. Therefore, each arriving record must first be represented in a structured mathematical form before preprocessing and clustering operations can be applied.
To formalize the streaming observation, each data instance arriving at time $t$ is represented as a feature vector
$$\mathbf{x}_t = \big(x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(d)}\big) \in \mathbb{R}^{d}, \tag{1}$$
where $d$ denotes the number of sensed attributes (e.g., temperature, humidity, pressure) and $x_t^{(j)}$ is the value of the $j$-th attribute at time $t$.
Since streaming measurements may arrive with missing entries, values are repaired using a running mean estimator:
$$\hat{x}_t^{(j)} = \begin{cases} x_t^{(j)}, & \text{if } x_t^{(j)} \text{ is observed}, \\ \mu_t^{(j)}, & \text{if } x_t^{(j)} \text{ is missing}, \end{cases} \tag{2}$$
where $\mu_t^{(j)}$ is the running mean of all values of attribute $j$ observed up to time $t$. Because the running mean is updated incrementally, this repair step requires only constant memory per attribute.
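The running-mean repair of Eq. (2) can be sketched as a constant-memory generator. This is an illustrative implementation, not the paper's released code; in particular, the function name and the zero default for attributes that have not yet been observed are assumptions.

```python
import numpy as np

def impute_running_mean(stream):
    """Repair missing entries (NaN) in a stream of feature vectors
    using a per-attribute running mean, as in Eq. (2).
    `stream` is an iterable of 1-D sequences; yields repaired vectors."""
    count = None   # number of observed values per attribute
    total = None   # running sum of observed values per attribute
    for x in stream:
        x = np.asarray(x, dtype=float)
        if count is None:
            count = np.zeros_like(x)
            total = np.zeros_like(x)
        observed = ~np.isnan(x)
        # Update running statistics from the observed entries only.
        count[observed] += 1
        total[observed] += x[observed]
        # Attributes with no observations yet default to 0 (assumption).
        mean = np.divide(total, count, out=np.zeros_like(total),
                         where=count > 0)
        # Replace missing entries with the current running mean.
        yield np.where(observed, x, mean)
```

Only the per-attribute count and sum are retained between observations, matching the constant-memory property noted above.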
After preprocessing, each repaired observation is incorporated into the evolving stream representation. The collection of observations received up to time $t$ is denoted
$$\mathcal{X}_t = \{\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \ldots, \hat{\mathbf{x}}_t\}, \tag{3}$$
as defined in Eq. (3). Here, $\mathcal{X}_t$ grows incrementally as new observations arrive; it is never materialized in full, since only summary statistics are retained.
At this stage, the algorithm has access only to the cleaned stream of observations represented by $\mathcal{X}_t$; no cluster structure exists yet, which motivates the initialization phase described next.
2. Cold-Start Initialization Phase
After preprocessing, the algorithm has access to a stream of cleaned observations through Eq. (3), but no cluster structure exists yet. An initial set of cluster statistics must therefore be established before incremental updates can begin.
To achieve this, an initial window of samples is collected from the stream, defined as
$$\mathcal{W} = \{\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \ldots, \hat{\mathbf{x}}_M\}, \tag{4}$$
where $M$ denotes the number of observations used for initialization. Eq. (4) specifies the finite sample used to estimate the initial structure before streaming updates begin.
From this window, the first cluster statistics are computed. The centroid and dispersion of each preliminary cluster $C_k$ are obtained as
$$\mathbf{c}_k = \frac{1}{|C_k|} \sum_{\hat{\mathbf{x}}_i \in C_k} \hat{\mathbf{x}}_i, \qquad \sigma_k^2 = \frac{1}{|C_k|} \sum_{\hat{\mathbf{x}}_i \in C_k} \left\| \hat{\mathbf{x}}_i - \mathbf{c}_k \right\|^2, \tag{5}$$
as expressed in Eq. (5), which provides the initial estimate of cluster location and dispersion.
These statistics form the first clustering features and effectively initialize the CF representation. Once this baseline structure has been created, newly arriving observations can be incorporated incrementally without revisiting the initialization data. The subsequent steps therefore focus on processing newly arriving observations efficiently and updating clustering summaries in real time.
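The cold-start phase can be sketched as follows. The use of scikit-learn's `Birch` to form the preliminary sub-clusters, the function name `cold_start`, and the threshold value are all illustrative assumptions; the paper does not specify this exact routine.

```python
import numpy as np
from sklearn.cluster import Birch

def cold_start(window, threshold=0.5):
    """Build initial cluster statistics (Eq. (5)) from a finite window
    of M cleaned observations (Eq. (4)).  A standard BIRCH pass forms
    the preliminary sub-clusters; the threshold here is illustrative."""
    window = np.asarray(window, dtype=float)
    model = Birch(threshold=threshold, n_clusters=None)
    labels = model.fit_predict(window)
    stats = {}
    for k in np.unique(labels):
        members = window[labels == k]
        centroid = members.mean(axis=0)                              # location
        dispersion = ((members - centroid) ** 2).sum(axis=1).mean()  # spread
        stats[int(k)] = (centroid, dispersion)
    return model, stats
```

The returned model carries the CF representation forward, so newly arriving observations can be folded in without revisiting the initialization window.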
3. Parallel Chunk-wise Processing
After the cold-start phase establishes the initial clustering features (Eqs. (4) and (5)), the algorithm enters steady-state operation. At each subsequent time step $t$, the newly arrived batch of observations $\mathcal{B}_t$ is partitioned into $P$ disjoint subsets that are processed in parallel:
$$\mathcal{B}_t = \bigcup_{p=1}^{P} \mathcal{B}_t^{(p)}, \qquad \mathcal{B}_t^{(p)} \cap \mathcal{B}_t^{(q)} = \varnothing \;\; \text{for } p \neq q. \tag{6}$$
Eq. (6) states that each subset $\mathcal{B}_t^{(p)}$ is handled by an independent processor, so the incoming batch can be summarized concurrently.
Each processor summarizes its assigned subset by constructing a local clustering feature (CF):
$$\mathrm{CF}_t^{(p)} = \Big( N_t^{(p)},\; \mathbf{LS}_t^{(p)},\; SS_t^{(p)} \Big), \tag{7}$$
which stores the sufficient statistics (number of samples $N$, linear sum $\mathbf{LS}$, and squared sum $SS$) required for incremental clustering.
4. Intra-Step CF Reduction (Parallel Merge)
The parallel partitions processed in Eq. (6) produce independent local clustering features $\mathrm{CF}_t^{(1)}, \ldots, \mathrm{CF}_t^{(P)}$, which must be consolidated before the global structure is updated.
Accordingly, the local CFs are merged through the additive property of clustering features:
$$\mathrm{CF}_t = \sum_{p=1}^{P} \mathrm{CF}_t^{(p)} = \left( \sum_{p=1}^{P} N_t^{(p)},\; \sum_{p=1}^{P} \mathbf{LS}_t^{(p)},\; \sum_{p=1}^{P} SS_t^{(p)} \right). \tag{8}$$
Eq. (8) forms a single batch-level clustering feature that summarizes the entire set of newly arrived observations, allowing the algorithm to maintain compact statistics without revisiting the raw data.
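The additive property of Eqs. (7) and (8) is easy to verify in a few lines; the helper names below are illustrative:

```python
import numpy as np

def make_cf(points):
    """Clustering feature of Eq. (7): (N, linear sum, squared sum)."""
    pts = np.asarray(points, dtype=float)
    return pts.shape[0], pts.sum(axis=0), float((pts ** 2).sum())

def merge_cf(*cfs):
    """Additive merge of local CFs (Eq. (8)): component-wise sums."""
    n = sum(cf[0] for cf in cfs)
    ls = sum(cf[1] for cf in cfs)
    ss = sum(cf[2] for cf in cfs)
    return n, ls, ss
```

Merging the local CFs of two partitions yields exactly the CF of their union, which is what allows each processor to work independently.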
5. Structural Compactness Constraint
The merged summary is then evaluated using the structural constraint defined in the original BIRCH formulation to regulate CF-tree growth:
$$R = \sqrt{ \frac{SS}{N} - \left\| \frac{\mathbf{LS}}{N} \right\|^2 } \;\le\; T. \tag{9}$$
Here $R$ denotes the radius of a CF entry, computed directly from its sufficient statistics, and $T$ is the structural threshold bounding the permissible radius of a node.
Eq. (9) therefore limits how much variance a CF node may absorb before a structural split is triggered, preventing uncontrolled tree expansion and ensuring bounded memory usage during streaming operation.
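The radius test of Eq. (9) needs only the CF statistics, never the raw points. A minimal sketch (function names are illustrative):

```python
import numpy as np

def cf_radius(n, ls, ss):
    """Radius of a CF entry from its sufficient statistics (Eq. (9)):
    R = sqrt(SS/N - ||LS/N||^2).  Clamped at 0 for numerical safety."""
    centroid = np.asarray(ls, dtype=float) / n
    return float(np.sqrt(max(ss / n - float(centroid @ centroid), 0.0)))

def admits(n, ls, ss, point, T):
    """Check whether absorbing `point` keeps the entry within the
    structural threshold T; if not, a node split would be triggered."""
    p = np.asarray(point, dtype=float)
    n2 = n + 1
    ls2 = np.asarray(ls, dtype=float) + p
    ss2 = ss + float(p @ p)
    return cf_radius(n2, ls2, ss2) <= T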
6. Dynamic Threshold Estimation
After the batch-level CF has been incorporated subject to the structural constraint in Eq. (9), the algorithm evaluates the current cluster dispersion in order to distinguish normal variation from emerging structural changes.
For each observation assigned to cluster $C_k$, the distance to its cluster centroid is computed as
$$d_i = \left\| \hat{\mathbf{x}}_i - \mathbf{c}_k \right\|_2, \tag{10}$$
where $\mathbf{c}_k$ is the centroid of the cluster to which $\hat{\mathbf{x}}_i$ belongs.
An adaptive tolerance level is then obtained from the empirical distribution of these distances:
$$\varepsilon_t = \mathrm{percentile}_{95}\big(\{ d_i \}\big), \tag{11}$$
which defines the allowable spread of observations around their cluster centers. The 95th percentile is used as a robust boundary capturing typical cluster dispersion while excluding extreme deviations, allowing the model to remain sensitive to structural changes without reacting to transient noise. Eq. (11) therefore introduces a data-driven threshold that adapts automatically to evolving variability in the stream, in contrast to the fixed structural parameter $T$ of Eq. (9).
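Computing the threshold of Eqs. (10) and (11) is a one-liner over the centroid distances; the function signature here is an illustrative assumption:

```python
import numpy as np

def adaptive_threshold(points, centroids, labels, q=95):
    """Data-driven tolerance of Eq. (11): the q-th percentile of each
    observation's distance to its assigned centroid (Eq. (10)).
    q=95 follows the paper's choice of a robust boundary."""
    pts = np.asarray(points, dtype=float)
    cents = np.asarray(centroids, dtype=float)
    d = np.linalg.norm(pts - cents[labels], axis=1)   # Eq. (10), vectorized
    return float(np.percentile(d, q))
```

Because the percentile is recomputed from the current batch, the tolerance widens or tightens automatically as stream variability evolves.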
7. Outlier Detection and Micro-Cluster Formation
Using the adaptive threshold $\varepsilon_t$ obtained in Eq. (11), each newly arriving observation is either assigned to an existing cluster or flagged as an outlier that seeds a new micro-cluster.
The distance between a sample and a cluster centroid is computed as
$$d(\hat{\mathbf{x}}, \mathbf{c}_k) = \left\| \hat{\mathbf{x}} - \mathbf{c}_k \right\|_2, \tag{12}$$
where $\mathbf{c}_k$ denotes the nearest centroid retrieved through the CF-tree search.
The assignment rule is then defined as
$$\hat{\mathbf{x}} \;\mapsto\; \begin{cases} C_{k^{*}}, & \text{if } d(\hat{\mathbf{x}}, \mathbf{c}_{k^{*}}) \le \varepsilon_t, \\ \text{new micro-cluster}, & \text{otherwise}, \end{cases} \tag{13}$$
which uses the learned tolerance to separate regular variation from outliers, triggering micro-cluster formation when deviations exceed the adaptive bound.
In this step, each incoming observation is evaluated only with respect to its nearest centroid obtained through the CF-tree search, without revisiting previously processed samples. The computation therefore relies on compact CF summaries rather than the full dataset. For a batch of size $B$, the cost of this step grows linearly with $B$ rather than with the total number of samples processed so far, keeping per-batch latency bounded during streaming operation.
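The decision rule of Eqs. (12) and (13) reduces to a nearest-centroid lookup against the learned tolerance. A minimal sketch, with an illustrative convention of returning -1 to request a new micro-cluster:

```python
import numpy as np

def assign_or_flag(x, centroids, eps):
    """Assignment rule of Eq. (13): return the index of the nearest
    centroid if its distance (Eq. (12)) is within the adaptive
    tolerance eps, or -1 to signal that a new micro-cluster is needed."""
    x = np.asarray(x, dtype=float)
    cents = np.asarray(centroids, dtype=float)
    dists = np.linalg.norm(cents - x, axis=1)   # Eq. (12) for all centroids
    k_star = int(np.argmin(dists))
    return k_star if dists[k_star] <= eps else -1
```

In the full framework the nearest centroid would come from the CF-tree search rather than this linear scan; the scan is used here only to keep the example self-contained.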
8. Streaming CF Evolution across Time
Once observations have been assigned to existing clusters or promoted to new micro-clusters (Eq. (13)), their sufficient statistics are incorporated into the global CF representation. The CF-tree is therefore updated incrementally as
$$\mathrm{CF}_{t+1} = \mathrm{CF}_{t} \,\oplus\, \mathrm{CF}(\mathcal{B}_{t+1}), \tag{14}$$
where $\oplus$ denotes the additive merge of clustering features introduced in Eq. (8).
Eq. (14) propagates previously learned summaries forward while integrating newly observed data, enabling true streaming operation without revisiting historical samples.
9. Periodic Cluster Re-Estimation
Although CF updates occur continuously, the intrinsic structure of the data may evolve over longer time horizons due to concept drift. To reassess cluster adequacy without interrupting streaming updates, clustering quality is periodically evaluated using the sum of squared errors (SSE):
$$\mathrm{SSE}(K) = \sum_{k=1}^{K} \sum_{\hat{\mathbf{x}}_i \in C_k} \left\| \hat{\mathbf{x}}_i - \mathbf{c}_k \right\|^2, \tag{15}$$
which measures within-cluster compactness for candidate values of $K$.
The appropriate number of clusters is then determined via knee-point detection on the SSE curve:
$$K^{*} = \underset{K}{\operatorname{knee}}\; \mathrm{SSE}(K), \tag{16}$$
allowing the model to adapt its granularity to long-term distributional changes while leaving the real-time CF update mechanism unaffected. In a streaming environment, this re-estimation is performed at discrete time intervals rather than continuously, so that its cost does not interfere with data ingestion.
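Eqs. (15) and (16) can be sketched as follows. The paper uses the KneeLocator package for Eq. (16); to keep this example dependency-free, a simple stand-in knee detector (maximum distance from the chord joining the curve's endpoints) is substituted, and a batch K-Means refit is used purely to generate the SSE curve.

```python
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, k_values):
    """SSE(K) of Eq. (15) for candidate cluster counts."""
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_values]

def knee_point(k_values, sse):
    """Stand-in for KneeLocator: the K whose (K, SSE) point lies
    farthest from the straight line through the curve's endpoints."""
    x = np.asarray(k_values, dtype=float)
    y = np.asarray(sse, dtype=float)
    # Normalize both axes so distances are comparable.
    xn = (x - x[0]) / (x[-1] - x[0])
    yn = (y - y[0]) / (y[-1] - y[0])
    # Deviation of each point from the endpoint-to-endpoint chord.
    dist = np.abs(yn - xn)
    return int(x[int(np.argmax(dist))])
```

On data with three well-separated groups, the detected knee of the SSE curve sits at K = 3, matching the intended behavior of Eq. (16).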
To determine an appropriate interval, a sensitivity analysis was conducted by varying the recalculation period from 30 to 700 s in 30 s increments. The evaluation considered clustering-quality metrics (Silhouette coefficient and Davies-Bouldin Index) together with memory utilization. Very short intervals (e.g., 30 s) introduced instability and unnecessary recomputation overhead, whereas very long intervals delayed adaptation to distributional changes.
A recalculation period of 600 s provided the best trade-off between clustering stability and resource demand, and was therefore adopted as the default re-estimation interval.
This compact representation enables the clustering algorithm to update cluster statistics efficiently in real time. Together with periodic cluster re-estimation and distance-based outlier handling, the proposed method provides a scalable and adaptive framework for analysing high-frequency, high-volume IoT data streams.
While the proposed framework retains the fundamental principles of the BIRCH algorithm, several modifications are introduced to support real-time IoT data processing. These enhancements address the challenges associated with continuous data arrival, computational constraints, and evolving data distributions, thereby ensuring that the method remains both effective and practical for deployment in dynamic operational environments. The key enhancements introduced for real-time operation are summarized as follows:
Chunk-Wise Data Processing: In contrast to conventional BIRCH implementations that assume a memory-resident dataset, the proposed framework processes observations in time-indexed chunks. This enables incremental updates to the CF-tree while limiting memory usage, thereby ensuring scalability under continuous IoT data streams.
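Chunk-wise ingestion with incremental CF-tree updates can be emulated with scikit-learn's `Birch.partial_fit`, which accepts one batch at a time. The threshold, chunk size, and synthetic two-regime stream below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import Birch

def stream_in_chunks(X, chunk_size):
    """Yield time-indexed chunks, mimicking batch arrival from sensors."""
    for start in range(0, len(X), chunk_size):
        yield X[start:start + chunk_size]

# Two synthetic sensor regimes stand in for a live IoT stream.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(200, 2)) for c in ([0, 0], [6, 6])])

# Incremental CF-tree updates: only one chunk is memory-resident at a time.
model = Birch(threshold=0.8, n_clusters=2)
for chunk in stream_in_chunks(X, chunk_size=50):
    model.partial_fit(chunk)
labels = model.predict(X)
```

Each `partial_fit` call folds one chunk into the CF-tree, so memory use is bounded by the chunk size and the tree itself rather than by the stream length.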
Adaptive Micro-Cluster Thresholding: Unlike classical BIRCH, which relies on a fixed user-defined threshold, the proposed framework estimates the micro-cluster tolerance dynamically using the data-driven statistic $\varepsilon_t$ of Eq. (11), so the admissible spread of each cluster tracks the evolving dispersion of the stream.
Dynamic Cluster Recalculation: The number of clusters is treated as a time-varying parameter rather than a fixed constant. At predefined intervals, the elbow point of the SSE curve is re-estimated using the KneeLocator method, yielding an updated optimal cluster count $K^{*}$ (Eq. (16)) that reflects the current structure of the data.
Dual Communication Mechanism: To support both low-latency ingestion and controlled analytical updates, the system adopts a hybrid communication model combining WebSocket and HTTP AJAX. WebSocket facilitates continuous transmission of streaming sensor data, whereas HTTP-based requests are used for periodic retrieval of clustering results and controlled triggering of recalibration procedures.
Modular Pipeline Architecture: As illustrated in Fig. 2, the system operates as a decoupled processing pipeline. Incoming sensor data are first preprocessed and stored in a stream-oriented data repository, from which the clustering engine consumes data for incremental analysis. This modular design enables independent scaling of data ingestion, storage, and clustering components while supporting flexible system updates.
While Fig. 2 presents the structural organization of the system, the sequence of operations executed during real-time processing must also be formally defined. Accordingly, the core clustering workflow integrating preprocessing, adaptive micro-cluster thresholding, and periodic cluster re-estimation is summarized in Algorithm 1. This procedure translates the previously described pipeline into an executable sequence for real-time streaming environments, where incoming data are processed using a fixed-size rolling window while continuously updating the clustering structure. In addition, Algorithm 2 outlines the visualization routine that periodically retrieves the updated clustering results and renders them for real-time monitoring and analysis.


Overall, the combination of chunk-wise data ingestion, real-time communication, dynamic re-clustering, and modular deployment constitutes the proposed Modified BIRCH algorithm tailored for real-time IoT anomaly detection and adaptive clustering. For visualization during live monitoring, clustering results are displayed in a three-dimensional view in which the first two axes represent the selected features, while the third axis is used solely to separate cluster identifiers visually. This axis encodes discrete labels and does not represent a physical variable or participate in the clustering computation. Although Algorithm 1 illustrates the workflow using temperature and humidity variables for clarity, the proposed RT-BIRCH framework operates on a general $d$-dimensional feature space and is not restricted to any particular pair of sensor modalities.
2.3 Web-Based System Architecture and Design
The web-based system enables real-time clustering and anomaly detection for IoT data through efficient data collection, processing, and visualization. Users can interact via a user-friendly interface to monitor devices, adjust data size, validate clusters, and visualize anomalies in real time.
The system workflow, illustrated in the flowchart, covers user authentication, real-time data analysis, and anomaly detection, ensuring continuous monitoring and dynamic clustering updates.
The architecture consists of key components: IoT sensors deployed in a smart city environment collect real-time data on temperature, humidity, pressure, and motion. This data is ingested into a time-series collection and stored in MongoDB Atlas. The processing layer, using Oracle Cloud Virtual Machine, handles pre-processing, clustering, and anomaly detection. A web interface, built with Python Flask and the ThingSpeak API, enables real-time visualization of data, clustering results, and anomalies.
The system integrates multiple clustering algorithms to evaluate their effectiveness in handling large IoT data streams as illustrated in the flowchart shown in Fig. 3. DBSCAN identifies clusters of arbitrary shapes and handles noise effectively. K-means partitions data into a predefined number of clusters, making it suitable for structured applications. BIRCH efficiently processes large datasets with incremental clustering and outlier handling. The modified BIRCH enhances real-time clustering by dynamically recalculating clusters as new data arrives, using the KneeLocator method to determine the optimal number of clusters, ensuring adaptive anomaly detection in IoT environments.
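The side-by-side evaluation described above can be reproduced in miniature with scikit-learn; the synthetic two-group data and all parameter values here are illustrative, not the deployed settings.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, Birch
from sklearn.metrics import silhouette_score

# Two well-separated synthetic sensor groups stand in for an IoT batch.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [5, 5])])

results = {}
for name, model in [
    ("KMeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
    ("DBSCAN", DBSCAN(eps=1.0, min_samples=5)),
    ("BIRCH", Birch(threshold=0.6, n_clusters=2)),
]:
    labels = model.fit_predict(X)
    # Silhouette is only defined when at least two clusters are found.
    score = silhouette_score(X, labels) if len(set(labels)) > 1 else float("nan")
    # Store (cluster count excluding DBSCAN noise, silhouette score).
    results[name] = (len(set(labels) - {-1}), score)
```

On such clearly separated data all three algorithms recover the two groups; the differences emphasized in Table 1 emerge under streaming, high-volume conditions rather than in a single static batch.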

Figure 3: Workflow for real-time IoT data clustering and anomaly detection in a web-based application.
Cluster validation is performed using several methods. Silhouette analysis is used to compute silhouette scores, which assess intra-cluster cohesion and inter-cluster separation, while the elbow method is employed to determine the optimal number of clusters by analyzing the within-cluster sum of squares.
Anomalies are identified directly through clustering by measuring data points’ deviation from their assigned cluster centroids. Outliers are detected as points that fall far from the majority of points within each cluster.
Real-time data handling is managed using MongoDB Atlas and Oracle Cloud Virtual Machine to facilitate real-time data ingestion and processing. This setup ensures that data is processed as soon as it arrives, enabling timely insights and actions. The system is designed to scale efficiently with the increasing number of sensors and data volume by leveraging the distributed computing capabilities of MongoDB Atlas and Oracle Cloud, allowing the system to handle growing data loads and maintain performance.
The system is tested in both a simulated smart city environment and a real-world deployment. In the simulated smart city environment, the system is deployed to test its robustness and effectiveness under various conditions. For real-world validation, the system is implemented with actual IoT sensor data to assess its performance and reliability.
Fig. 4 presents the sitemap of the web interface, which includes real-time data visualization through interactive dashboards displaying sensor data, clustering results, and detected anomalies. Additionally, user alerts are implemented to notify users of anomalies in real time, enabling prompt responses and actions.

Figure 4: Sitemap of the web application.
The comparative analysis shown in Table 1 highlights the effectiveness of the Modified BIRCH algorithm over the original. To demonstrate its practical applicability, we designed and evaluated a web-based application for monitoring real-time data from ESP32-based edge computing devices equipped with DHT11 sensors. The data flow architecture, shown in Fig. 5, integrates data propagation to the ThingSpeak cloud and MongoDB Atlas, while data processing and visualization are handled by an Oracle Cloud virtual machine (VM) instance running a Python Flask web application.


Figure 5: Data flow architecture for collection, processing, and visualization.
The system consists of several components working together. ESP32-based edge computing devices, six in total, are each equipped with a DHT11 sensor for measuring temperature and humidity as shown in Fig. 6. These devices send sensor readings to a localhost server, which collects the data and forwards it to both ThingSpeak cloud and MongoDB Atlas. ThingSpeak cloud serves as an intermediate platform for data storage and basic visualization, while MongoDB Atlas stores real-time sensor data for comprehensive analysis. The data processing scripts and the web-based application, developed with Python Flask, are hosted on an Oracle Cloud virtual machine (VM). The Python Flask web application serves as the interface for displaying processed real-time data.

Figure 6: Distributed edge devices for environmental data collection using ESP32 and DHT11 sensors.
Following deployment on the Oracle Cloud VM, the entire system was extensively tested to ensure reliable data flow and accurate information display. These tests covered the complete data journey, from sensor readings to their visualization on the web interface.
The first stage focused on validating end-to-end data flow. This included emulating sensor readings from the ESP32 devices, processing them within the VM, and confirming their correct visualization on the web interface. To further assess robustness, controlled network failures were introduced during transmission, allowing evaluation of the system’s ability to manage disruptions and maintain data integrity.
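One way to realize the disruption handling exercised in these tests is a retry-with-backoff wrapper around the transmission step; the function names and backoff parameters below are illustrative assumptions, not the deployed implementation.

```python
import time

def send_with_retry(payload, transmit, retries=3, backoff=0.01):
    """Attempt transmission, retrying with exponential backoff on failure.
    Returns True on success, False if all attempts fail (the payload could
    then be buffered locally to preserve data integrity)."""
    for attempt in range(retries):
        try:
            transmit(payload)
            return True
        except ConnectionError:
            time.sleep(backoff * (2 ** attempt))
    return False

# Simulated link that fails twice before recovering.
calls = {"n": 0}
def flaky(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("link down")

ok = send_with_retry({"temp": 24.5}, flaky)
```

With two injected failures, the third attempt succeeds and the reading is never lost.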
Subsequently, attention was directed toward ensuring data consistency throughout the flow process. Mechanisms such as data validation checks, error handling routines, and redundancy measures within the VM were implemented and tested. These steps ensured the accuracy and reliability of the information displayed on the web interface.
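A minimal sketch of such a validation check, assuming the DHT11's documented operating range (0–50 °C, 20–90 % relative humidity) and hypothetical field names:

```python
def validate_reading(reading):
    """Reject readings with missing fields or values outside the DHT11's
    specified operating range (0-50 degC, 20-90 % relative humidity)."""
    try:
        t = float(reading["temperature"])
        h = float(reading["humidity"])
    except (KeyError, TypeError, ValueError):
        return False
    return 0.0 <= t <= 50.0 and 20.0 <= h <= 90.0

good = validate_reading({"temperature": 24.5, "humidity": 61.0})
bad = validate_reading({"temperature": 120.0, "humidity": 61.0})
missing = validate_reading({"humidity": 61.0})
```

Readings that fail the check would be logged and excluded rather than displayed on the interface.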
Finally, controlled experiments were carried out to verify both accuracy and responsiveness of the displayed data. This involved comparing sensor readings shown on the web interface with expected values under ambient environmental conditions. In addition, the responsiveness of the web application was evaluated to confirm that updates were displayed in a timely manner for end users.
In addition to this real-world deployment, the proposed method was further validated using benchmark IoT datasets and a real-time IoT simulator. For comparative evaluation, the Modified BIRCH algorithm was tested against widely used clustering methods such as K-Means, DBSCAN, and standard BIRCH, along with recent state-of-the-art streaming-based clustering techniques. These comparisons form the basis of the detailed performance analysis presented in the Results and Discussion section.
In this section, we evaluate the performance of four clustering algorithms (K-Means, DBSCAN, traditional BIRCH, and the modified BIRCH algorithm), focusing on execution time, memory utilization, and anomaly detection capabilities. This analysis is crucial for determining the suitability of each algorithm in real-time IoT applications, where processing speed, memory efficiency, and accurate anomaly detection are key to managing continuous, large-scale data streams effectively. To replicate a real-time environment, experiments were conducted over 100 iterations, with each iteration introducing a new batch of 100 data points. This setup enables a comprehensive assessment of how efficiently the algorithms process sequential data batches, reflecting the demands of continuous, high-frequency IoT data streams.
This setup allowed us not only to observe the algorithms’ clustering performance but also to assess their effectiveness in detecting anomalies within incoming data. These insights provide a deeper understanding of how each algorithm performs under the dual demands of real-time data clustering and anomaly detection in IoT scenarios, crucial for applications where quick identification of abnormal patterns is essential for decision-making and responsiveness.
To better understand the performance of the clustering algorithms, Table 2 compares the average execution time (in both seconds and milliseconds) and total execution time across multiple iterations, with the following key findings. K-Means exhibits a relatively low average execution time of 5.85 s, making it a fast clustering algorithm. However, K-Means has limitations in handling noise and irregular cluster shapes, which affects its suitability for IoT data that often contains such anomalies. DBSCAN requires significantly more processing time, with an average execution time of 13.23 s. While DBSCAN excels in handling noise, its relatively high execution time and memory usage make it less suitable for real-time processing in high-velocity environments like IoT. Traditional BIRCH shows an average execution time of 7.31 s, slightly slower than K-Means but faster than DBSCAN. Traditional BIRCH offers efficient clustering for structured datasets but lacks adaptability, particularly in the face of real-time data variability. Modified BIRCH achieves the lowest average execution time at 5.24 s, outperforming the other algorithms. This result highlights the modified BIRCH’s effectiveness in handling real-time IoT data streams, where rapid cluster recalibration is essential. The modifications introduced, such as optimized CF-tree management and adaptive threshold settings, have significantly reduced processing time without compromising clustering quality.
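The per-batch execution times behind Table 2 can be collected with a harness of roughly this shape (a stdlib-only sketch; `sorted` stands in for the clustering call, and the batch contents are synthetic):

```python
import time

def time_batches(cluster_fn, batches):
    """Run cluster_fn on each batch and record per-batch wall time,
    returning (average, total) execution time in seconds."""
    timings = []
    for batch in batches:
        start = time.perf_counter()
        cluster_fn(batch)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings), sum(timings)

# Stand-in workload: 5 batches of 100 points each.
batches = [[float(i)] * 100 for i in range(5)]
avg, total = time_batches(sorted, batches)
```

Using `time.perf_counter` rather than `time.time` avoids clock-adjustment artifacts when timing short batches.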

The line graph in Fig. 7 depicts the execution time across various iterations for each algorithm, providing insight into consistency and stability over repeated runs. K-Means and Modified BIRCH maintain relatively stable execution times across iterations, with the modified BIRCH exhibiting slightly lower peaks, reinforcing its efficiency in real-time clustering. DBSCAN shows substantial variability, with execution times spiking at several points. This inconsistency is likely due to DBSCAN’s density-based approach, which becomes more computationally intensive with data density variations. Traditional BIRCH also displays some fluctuations but remains more stable than DBSCAN. However, it still lags behind the modified BIRCH in both speed and consistency.

Figure 7: Real-time execution time comparison of clustering algorithms across iterations.
The results indicate that the modified BIRCH algorithm not only surpasses the traditional BIRCH and DBSCAN in terms of speed but also provides comparable (if not superior) performance to K-Means in handling execution time. Given the unique characteristics of IoT data—where data streams are continuous, potentially noisy, and high-velocity—the modified BIRCH algorithm emerges as the most suitable choice among the tested algorithms. Importantly, the modified BIRCH algorithm is designed to perform clustering in real-time by continuously taking in new streams of IoT data from multiple sensors. Its ability to adaptively recalibrate clusters as fresh data arrives, without excessive computational overhead, makes it highly valuable for real-world IoT applications.
The modified BIRCH algorithm’s rapid processing capabilities make it ideal for time-sensitive IoT applications, such as predictive maintenance through digital twins in Industry 4.0, anomaly detection in smart cities, and healthcare monitoring. By minimizing execution time, this algorithm ensures that clustering and outlier detection can be performed promptly as new data streams are ingested, enabling faster response to data changes and potential anomalies in IoT environments. The capacity to handle continuous data influx from multiple sensors further underscores its applicability in scalable, real-time IoT frameworks.
In the context of real-time IoT data processing, memory efficiency is critical due to the continuous influx of data from multiple sources. The following analysis compares the memory utilization patterns of traditional clustering algorithms (K-Means, DBSCAN, and traditional BIRCH) with the modified BIRCH algorithm, highlighting the advantages of automated, adaptive clustering.
In the comparative evaluation, K-Means, DBSCAN, and conventional BIRCH were applied in a batch processing manner to successive data segments of the stream. Each segment was processed independently, with the algorithm reinitialized for every update cycle, reflecting their typical usage in non-incremental settings where model state is not preserved between executions. Consequently, memory allocation occurs repeatedly as data structures and intermediate computations are reconstructed for each processing interval. In contrast, the Modified BIRCH approach operates in a stateful streaming mode by maintaining a persistent CF-tree that is incrementally updated as new data arrive. This enables reuse of previously allocated structures and supports continuous clustering without repeated reinitialization, resulting in more stable memory utilization during prolonged operation.
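The contrast between batch reinitialization and stateful streaming can be illustrated with scikit-learn's Birch, whose `partial_fit` incrementally updates a persistent CF-tree across batches. This is a simplified stand-in for the Modified BIRCH described here, and the batch sizes, threshold, and blob locations are illustrative:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
model = Birch(n_clusters=None, threshold=0.5)  # keep subclusters, skip global step

# Stream batches into the same instance: the CF-tree persists between calls,
# so there is no per-batch reinitialization or repeated model construction.
for _ in range(10):
    batch = np.vstack([
        rng.normal(0.0, 0.1, size=(50, 2)),   # readings near the origin
        rng.normal(5.0, 0.1, size=(50, 2)),   # readings near (5, 5)
    ])
    model.partial_fit(batch)

labels = model.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))
```

After ten batches the two regions map to distinct subclusters, and new points are labeled without refitting from scratch.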
The line graph in Fig. 8 shows stable but elevated memory utilization across iterations for each traditional algorithm. This sustained high memory usage underscores the limitations of manual re-execution, as each instance contributes to the overall memory footprint. Such an approach is not only resource-intensive but also impractical in real-world IoT systems that demand scalability and continuous operation.

Figure 8: Memory utilization of single instances for K-means, DBSCAN, and BIRCH, and overall memory utilization of the modified BIRCH algorithm.
In contrast, the modified BIRCH algorithm is designed to automatically update clusters in real time as new IoT data streams in, eliminating the need for manual re-execution with each new data chunk. This automation is achieved within a single instance, allowing the algorithm to dynamically adjust clusters without requiring additional memory-intensive processes or instances. As shown in Table 3, the modified BIRCH algorithm consistently maintains significantly lower memory usage than the traditional methods. By operating within a single instance and recalibrating automatically, it minimizes memory overhead and provides a more sustainable, efficient solution for real-time IoT applications. This memory efficiency makes the modified BIRCH particularly well suited for deployments with limited hardware resources, as it avoids the steep memory growth observed in algorithms that rely on repeated manual execution.
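The memory pattern described above can be reproduced in miniature with Python's `tracemalloc`, contrasting a structure rebuilt every cycle with one persistent, incrementally updated structure (an illustrative sketch, not the paper's measurement script):

```python
import tracemalloc

def peak_kib(fn):
    """Return the peak allocation (KiB) observed while fn() runs."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1024

def rebuild_each_cycle():
    # Mimics batch re-execution: the working structure is
    # reallocated from scratch on every update cycle.
    for _ in range(20):
        buf = list(range(50_000))

persistent = []
def reuse_structure():
    # Mimics the stateful streaming mode: one structure is
    # updated incrementally, so allocations stay small.
    for _ in range(20):
        persistent.append(1)

rebuilt = peak_kib(rebuild_each_cycle)
reused = peak_kib(reuse_structure)
```

The peak for the rebuild-per-cycle pattern dominates that of the persistent structure, mirroring the gap between Fig. 8's traditional baselines and the modified BIRCH.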

The lower memory footprint of the modified BIRCH algorithm directly translates into enhanced scalability and operational efficiency in IoT environments. Its ability to handle continuous data flows without significant memory strain makes it ideal for applications requiring uninterrupted clustering and outlier detection, such as smart city monitoring, industrial automation, and remote health diagnostics. By reducing the dependency on multiple memory-intensive instances, the modified BIRCH ensures that real-time processing can be achieved even in resource-constrained environments, ultimately improving the system’s responsiveness and stability.
Table 4 compares the runtime complexity of various clustering algorithms, highlighting that the modified BIRCH algorithm maintains the same theoretical complexity as traditional BIRCH while introducing dynamic recalibration of clusters and integration with KneeLocator for optimal cluster number selection.
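The knee-point idea behind KneeLocator can be illustrated with the chord-distance heuristic that family of methods is based on; this is a simplified numpy sketch, not the kneed library itself, and the inertia values are synthetic:

```python
import numpy as np

def knee_point(x, y):
    """Return the x where the curve bends most sharply: the point with
    maximum perpendicular distance from the chord joining the endpoints."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    start, end = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    chord = (end - start) / np.linalg.norm(end - start)
    pts = np.column_stack([x, y]) - start
    proj = np.outer(pts @ chord, chord)          # projection onto the chord
    dists = np.linalg.norm(pts - proj, axis=1)   # distance to the chord line
    return x[int(np.argmax(dists))]

# Elbow-shaped inertia curve: steep drop until k = 3, then nearly flat.
ks = np.array([1, 2, 3, 4, 5, 6])
inertia = np.array([100.0, 40.0, 12.0, 10.0, 9.0, 8.5])
best_k = knee_point(ks, inertia)
```

On this curve the heuristic selects k = 3, the point where adding clusters stops paying off.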
In our experiments, we also evaluated outlier detection capabilities across four clustering algorithms: K-Means, DBSCAN, traditional BIRCH, and the modified BIRCH algorithm. Each of these methods was applied to detect anomalies in real-time IoT data by clustering incoming sensor readings based on temperature and humidity parameters. The four visualizations presented in Fig. 9 demonstrate the clustering performance of each algorithm, where outliers are identified by their distance from the main clusters.

Figure 9: Outlier detection in IoT data using various clustering algorithms (a) K-means, (b) DBSCAN, (c) BIRCH, and (d) modified BIRCH.
The visualizations highlight the capability of each algorithm in handling outliers. K-Means and traditional BIRCH, although efficient, lack flexibility in real-time settings and may misclassify some outliers. DBSCAN achieves high accuracy in outlier detection but suffers from slower processing, which limits its scalability in real-time IoT applications. In contrast, the modified BIRCH algorithm strikes an optimal balance, providing both efficient processing and accurate outlier detection. It operates seamlessly in a single instance, updating itself with each new data chunk, thereby maintaining low memory usage and execution time.
For real-time IoT systems, the modified BIRCH algorithm’s ability to automatically identify outliers without manual intervention makes it a valuable tool in applications such as predictive maintenance, environmental monitoring, and health diagnostics. By continuously adapting to incoming data, it provides timely insights, detects anomalies promptly, and reduces memory overhead, all key requirements for scalable, real-time IoT systems.
3.1 Generality of Results across Benchmark Datasets
While real-time deployment and streaming experiments establish the practical effectiveness of the Modified BIRCH algorithm, its generalizability across diverse IoT contexts remains a critical consideration. To examine this, we evaluated our Modified BIRCH on widely recognized benchmark datasets spanning multiple domains, including network intrusion detection, environmental monitoring, and large-scale adversarial IoT traffic, which are described in Table 5. The results from these evaluations provide rigorous evidence of the adaptability and robustness of the proposed method within heterogeneous IoT environments.
Using our IoT-based simulator, streaming batches from these datasets were fed into the clustering algorithms with varying batch sizes (50, 100, and 200). This experimental setup enabled us to evaluate the scalability of the Modified BIRCH algorithm and its ability to maintain clustering quality in domains characterized by high traffic variability, environmental noise, and large-scale adversarial data. Table 6 summarizes the results across representative datasets, focusing on execution time, memory consumption, and clustering quality metrics, including the Silhouette Coefficient and Davies–Bouldin Index (DBI).
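Both quality metrics used in Table 6 are available in scikit-learn; a minimal check on synthetic two-cluster data (the blob parameters are illustrative) might look like:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(42)
# Two well-separated blobs standing in for a streaming batch.
X = np.vstack([
    rng.normal(0.0, 0.2, size=(100, 2)),
    rng.normal(4.0, 0.2, size=(100, 2)),
])

labels = Birch(n_clusters=2).fit_predict(X)
sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
dbi = davies_bouldin_score(X, labels)    # lower is better, >= 0
```

For clearly separated clusters like these, the Silhouette score approaches 1 while the DBI stays close to 0, matching the direction of the comparisons reported in the tables.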

As observed from these evaluations, the Modified BIRCH algorithm maintains consistent execution time and clustering quality across all datasets, even under varying workloads. In the RT-IoT dataset, it achieved a high Silhouette score.
Overall, these findings confirm that the Modified BIRCH algorithm generalizes effectively across diverse IoT application domains, ranging from real-time network security to long-term environmental monitoring and adversarial intrusion detection.
3.2 Comparisons with State-of-the-Art Methods
In addition to comparisons with classical baselines, it is important to assess the performance of the Modified BIRCH algorithm relative to contemporary state-of-the-art (SOTA) streaming clustering methods. For this purpose, we considered three established micro-cluster-based algorithms: CluStream, DBStream, and ClusTree, which are commonly employed in streaming environments and serve as representative approaches for large-scale IoT data analysis.
To summarize clustering quality across all benchmark datasets, Fig. 10 presents the average Silhouette and Davies–Bouldin Index (DBI) scores obtained by the four streaming algorithms. While Table 6 provides detailed results for each dataset and batch size, the figure highlights the overall trends in clustering performance.

Figure 10: Average clustering quality comparison across benchmark datasets, showing Silhouette scores (higher is better) and Davies–Bouldin Index (DBI) scores (lower is better).
The results presented in Table 6 and summarized in Fig. 10 reveal several noteworthy patterns. With respect to execution time, Modified BIRCH demonstrates clear advantages, particularly on the RT-IoT datasets. For instance, in RT-IoT 2022 with a batch size of 50, Modified BIRCH records an average time of only 5.01 s, outperforming CluStream (6.04 s), DBStream (6.02 s), and ClusTree (6.41 s). This trend is consistent across larger batch sizes and diverse datasets, confirming that the modifications introduced into BIRCH enable faster recalibration of clusters without imposing additional computational burden. Such efficiency is critical in real-time IoT contexts where delays in processing can compromise timely detection of anomalies.
Memory utilization shows a more nuanced trade-off. CluStream and DBStream maintain relatively modest memory requirements in RT-IoT scenarios, typically in the range of 300–600 MB, but exhibit substantial growth in AirIoT, where peak RAM usage surpasses 1200 MB. ClusTree further amplifies this demand, with usage rising to nearly 1500 MB. Modified BIRCH, by contrast, displays a distinctive profile: while its memory footprint remains stable on RT-IoT and CIC IoT-IDAD datasets, it incurs higher overhead on AirIoT, reaching up to 1800 MB. This increase can be attributed to the dense temporal nature of AirIoT, where frequent updates to cluster structures intensify memory requirements. Nevertheless, unlike DBStream, which becomes slower under high workloads, Modified BIRCH manages to sustain low execution time even under these conditions, suggesting a favorable balance between speed and memory consumption.
The clustering quality obtained across the evaluated methods reveals distinct behavioral characteristics under varying IoT data conditions. ClusTree attains the highest silhouette values in the RT-IoT datasets, with averages around 0.84 and Davies–Bouldin Index (DBI) values near 0.27, indicating the formation of highly compact and well-separated clusters. Modified BIRCH produces closely comparable results, yielding silhouette scores in the range of 0.81–0.83 and DBI values as low as 0.23, demonstrating that incremental CF-tree refinement can preserve cluster structure while operating under continuous data ingestion. CluStream and DBStream exhibit comparatively lower clustering quality, with silhouette values around 0.72 and DBI values near 0.63, suggesting reduced cohesion and separation in dynamically evolving streams. In the AirIoT dataset, these differences become more pronounced, where Modified BIRCH maintains a silhouette score of approximately 0.61.
MiniBatch K-Means demonstrates dataset-dependent performance. For RT-IoT 2022, it achieves silhouette scores of approximately 0.77 with DBI values between 0.25 and 0.32, indicating effective partitioning when the data distribution remains relatively stable. In contrast, its performance decreases in more heterogeneous environments such as AirIoT, where silhouette scores fall to 0.22–0.29 and DBI rises beyond 1.0. This variation is consistent with the algorithm’s reliance on periodic centroid updates and its assumption of approximately spherical cluster geometry, which may not fully capture the irregular and temporally evolving patterns present in streaming IoT data.
A similar trend is observed in the CIC IoT-IDAD datasets representing large-scale and adversarial traffic scenarios, where MiniBatch K-Means yields moderate silhouette values.
In light of these observations, the Modified BIRCH algorithm emerges as highly competitive with established state-of-the-art methods. It combines the execution speed advantages of CluStream and MiniBatch K-Means, the robustness of DBStream, and the clustering quality of ClusTree, while retaining the adaptive recalibration capabilities required for real-time IoT analytics. This balance highlights its potential not only as an incremental refinement of the original BIRCH but also as a viable and effective alternative to modern streaming clustering algorithms in large-scale, dynamic IoT environments.
This study presents an optimized clustering solution for real-time IoT data environments through a Modified BIRCH algorithm, designed to overcome the limitations of conventional clustering techniques in dynamic and heterogeneous IoT applications. By incorporating periodic structural updates and adaptive estimation of the optimal number of clusters, the proposed algorithm enables continuous and stable processing of streaming data, thereby supporting timely insights, reliable anomaly identification, and resilient system monitoring.
Experimental evaluations first compared the Modified BIRCH with K-Means, DBSCAN, and classical BIRCH, demonstrating consistent improvements in execution time, memory consumption, and clustering accuracy. To establish general applicability, the evaluation was extended to large-scale benchmark datasets including RT-IoT, AirIoT, and CIC IoT-IDAD, confirming the adaptability of the algorithm across diverse IoT scenarios such as network security analysis, environmental sensing, and system-level monitoring. Further comparative studies with contemporary streaming clustering approaches including CluStream, DBStream, and ClusTree reveal that Modified BIRCH achieves a balanced trade-off between computational efficiency, memory utilization, and clustering quality, while remaining stable under both benign and adversarial data conditions.
Beyond algorithmic effectiveness, the integration of Modified BIRCH within a web-based analytical application demonstrates its practical relevance in modern IoT ecosystems. The platform enables interactive visualization, continuous clustering, and real-time anomaly awareness, aligning with current trends toward explainable, operator-centric, and edge-aware IoT analytics. These characteristics position Modified BIRCH not merely as an incremental refinement of existing methods but as a viable and competitive solution for next-generation streaming data analysis frameworks.
Despite the encouraging performance achieved, certain limitations should be acknowledged. The current framework employs a distance-to-centroid criterion for anomaly detection, which is computationally efficient and well aligned with the real-time, incremental design of the Modified BIRCH algorithm; however, such an approach may be less sensitive to complex local density variations than density-based or probabilistic outlier scoring methods, particularly in highly noisy environments. In addition, the proposed methodology has been developed and validated primarily on numerical data streams, and further investigation is required to extend and evaluate the framework for unstructured or heterogeneous data sources.
Addressing these limitations forms a key direction for future research. Incorporating lightweight density-aware or probabilistic scoring mechanisms that preserve the low-latency characteristics of the framework could improve robustness under noisy streaming conditions. Similarly, extending the methodology to support unstructured modalities, such as image or multimodal IoT data, will require the integration of feature extraction and representation learning stages compatible with incremental clustering. Further efforts may also explore self-tuning mechanisms, reduced memory footprints for high-density temporal streams, and seamless deployment within edge-cloud collaborative architectures, thereby strengthening the adaptability and applicability of Modified BIRCH across diverse real-time IoT environments.
Acknowledgement: The authors want to acknowledge their respective institutions for their support.
Funding Statement: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author Contributions: The authors confirm contribution to the paper as follows: Conceptualization, Prabhat Das, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee; methodology, Prabhat Das, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee; formal analysis, Prabhat Das, Dibya Jyoti Bora, Sajal Saha; investigation, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee, Hirak Mazumdar; data curation, Dibya Jyoti Bora, Sajal Saha; resources, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee, Hirak Mazumdar; validation, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee; visualization, Dibya Jyoti Bora, Sajal Saha; writing—original draft preparation, Prabhat Das, Dibya Jyoti Bora, Sajal Saha; writing—review and editing, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee, Hirak Mazumdar; supervision, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee, Hirak Mazumdar; project administration, Dibya Jyoti Bora, Sajal Saha, Cheng-Chi Lee, Hirak Mazumdar. All authors reviewed and approved the final version of the manuscript.
Availability of Data and Materials: The data supporting the findings of this study are available at: https://github.com/prabhatdash/smart_birch/blob/main/iot_data.sensor_data.json. Additionally, the benchmark datasets used in this study are available through references [43–45].
Ethics Approval: Not applicable.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Mutambik I. An entropy-based clustering algorithm for real-time high-dimensional IoT data streams. Sensors. 2024;24(22):7412. doi:10.3390/s24227412.
2. Bin Mofidul R, Alam M, Rahman M, Jang Y. Real-time energy data acquisition, anomaly detection, and monitoring system: implementation of a secured, robust, and integrated global IIoT infrastructure with edge and cloud AI. Sensors. 2022;22(22):8980. doi:10.3390/s22228980.
3. Rani S, Ahmed S, Rastogi R. Dynamic clustering approach based on wireless sensor networks genetic algorithm for IoT applications. Wirel Netw. 2020;26(4):2307–16. doi:10.1007/s11276-019-02083-7.
4. Chaudhry M, Shafi I, Mahnoor M, Vargas D, Thompson E, Ashraf I. A systematic literature review on identifying patterns using unsupervised clustering algorithms: a data mining perspective. Symmetry. 2023;15(9):1679. doi:10.3390/sym15091679.
5. Alfonso I, Garcés K, Castro H, Cabot J. Self-adaptive architectures in IoT systems: a systematic literature review. J Internet Serv Appl. 2021;12(1):1–28. doi:10.1186/s13174-021-00145-8.
6. Lenssen L, Schubert E. Medoid Silhouette clustering with automatic cluster number selection. Inf Syst. 2024;120(3):102290. doi:10.1016/j.is.2023.102290.
7. Batool F, Hennig C. Clustering with the average silhouette width. Comput Stat Data Anal. 2021;158:107190. doi:10.1016/j.csda.2021.107190.
8. Gnanasekaran T, Girinath S, Venkatesh K, Valarmathi N, Bandili S, Balasubramani S. Exploring K-Means meta-heuristic techniques for prediction of anomalies in IoT-enabled industrial systems. In: Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT); 2024 Jun 24–28; Kamand, India. p. 1–7. doi:10.1109/ICCCNT61001.2024.10725295.
9. Miraftabzadeh S, Colombo C, Longo M, Foiadelli F. K-means and alternative clustering methods in modern power systems. IEEE Access. 2023;11:119596–633. doi:10.1109/ACCESS.2023.3327640.
10. Hosseinzadeh M, Hemmati A, Rahmani A. Clustering for smart cities in the internet of things: a review. Clust Comput. 2022;25(6):4097–127. doi:10.1007/s10586-022-03646-8.
11. Punhani A, Faujdar N, Mishra K, Subramanian M. Binning-based silhouette approach to find the optimal cluster using K-means. IEEE Access. 2022;10:115025–32. doi:10.1109/ACCESS.2022.3215568.
12. Yang H, Ran M, Feng H, Hou D. K-PCD: a new clustering algorithm for building energy consumption time series analysis and predicting model accuracy improvement. Appl Energy. 2025;377(5):124584. doi:10.1016/j.apenergy.2024.124584.
13. Thakur S, Sarkar N, Yongchareon S. AI-driven energy-efficient routing in IoT-based wireless sensor networks: a comprehensive review. Sensors. 2025;25(24):7408. doi:10.3390/s25247408.
14. Weng Z, Zhang W, Zhu T, Dou Z, Sun H, Ye Z, et al. RT-APT: a real-time APT anomaly detection method for large-scale provenance graph. J Netw Comput Appl. 2025;233(2):104036. doi:10.1016/j.jnca.2024.104036.
15. Mu Z, Liu Y, Yang Y. A large-scale group decision making model with a clustering algorithm based on a locality sensitive hash function. Eng Appl Artif Intell. 2025;140(3):109697. doi:10.1016/j.engappai.2024.109697.
16. Almudayni Z, Soh B, Samra H, Li A. Energy inefficiency in IoT networks: causes, impact, and a strategic framework for sustainable optimisation. Electronics. 2025;14(1):159. doi:10.3390/electronics14010159.
17. Wu Y, Zhang L, Yang L, Yang F, Ma L, Lu Z, et al. Intrusion detection for Internet of Things: an anchor graph clustering approach. IEEE Trans Inf Forensics Secur. 2025;20(4):1965–80. doi:10.1109/TIFS.2025.3539100.
18. Krishnamurthi R, Kumar A, Gopinathan D, Nayyar A, Qureshi B. An overview of IoT sensor data processing, fusion, and analysis techniques. Sensors. 2020;20:6076. doi:10.3390/s20216076.
19. Putina A, Rossi D. Online anomaly detection leveraging stream-based clustering and real-time telemetry. IEEE Trans Netw Serv Manag. 2020;18(1):839–54. doi:10.1109/TNSM.2020.3037019.
20. Ariyaluran Habeeb R, Nasaruddin F, Gani A, Amanullah M, Abaker Targio Hashem I, Ahmed E, et al. Clustering-based real-time anomaly detection—a breakthrough in big data technologies. Trans Emerg Telecommun Technol. 2019;33(8):e3647. doi:10.1002/ett.3647.
21. Lang A, Schubert E. BETULA: fast clustering of large data with improved BIRCH CF-Trees. Inf Syst. 2022;108(2):101918. doi:10.1016/j.is.2021.101918.
22. Tomar R, Sharma A. K-Means and BIRCH: a comparative analysis study. In: Inventive communication and computational technologies. Berlin/Heidelberg, Germany: Springer; 2023. p. 281–94. doi:10.1007/978-981-19-4960-9_23.
23. Noaman M, Khan M, Abrar M, Ali S, Alvi A, Saleem M. Challenges in integration of heterogeneous internet of things. Sci Program. 2022;2022:8626882. doi:10.1155/2022/8626882.
24. Barbaro A, Chiavassa P, Fissore V, Servetti A, Raviola E, Ramírez-Espinosa G, et al. Data acquisition, processing, and aggregation in a low-cost IoT system for indoor environmental quality monitoring. Appl Sci. 2024;14:4021. doi:10.3390/app14104021.
25. Maharana K, Mondal S, Nemade B. A review: data pre-processing and data augmentation techniques. Global Transit Proc. 2022;3(1):91–9. doi:10.1016/j.gltp.2022.04.020.
26. Costa D, Peixoto J, Jesus T, Portugal P, Vasques F, Rangel E, et al. A survey of emergencies management systems in smart cities. IEEE Access. 2022;10(4):61843–72. doi:10.1109/ACCESS.2022.3180033.
27. Zhang H, Babar M, Tariq M, Jan M, Menon V, Li X. SafeCity: toward safe and secured data management design for IoT-enabled smart city planning. IEEE Access. 2020;8:145256–67. doi:10.1109/ACCESS.2020.3014622.
28. Balakrishna S, Thirumaran M. Semantics and clustering techniques for IoT sensor data analysis: a comprehensive survey. In: Principles of Internet of Things (IoT) ecosystem: insight paradigm. Berlin/Heidelberg, Germany: Springer. p. 103–25. doi:10.1007/978-3-030-33596-0_4.
29. Kumar N. Intelligent customer segmentation: unveiling consumer patterns with machine learning. J Umm Al-Qura Univ Eng Archit. 2025;16(3):774–83. doi:10.1007/s43995-025-00180-7.
30. Hilal W, Gadsden S, Yawney J. Financial fraud: a review of anomaly detection techniques and recent advances. Expert Syst Appl. 2022;193(8):116429. doi:10.1016/j.eswa.2021.116429.
31. Tabassum M, Mahmood S, Bukhari A, Alshemaimri B, Daud A, Khalique F. Anomaly-based threat detection in smart health using machine learning. BMC Med Inform Decis Mak. 2024;24(1):347. doi:10.1186/s12911-024-02760-4.
32. Mittal H, Pandey A, Saraswat M, Kumar S, Pal R, Modwel G. A comprehensive survey of image segmentation: clustering methods, performance parameters, and benchmark datasets. Multimed Tools Appl. 2022:1–26. doi:10.1007/s11042-021-10594-9.
33. Garrido-Momparler V, Peris M. Smart sensors in environmental/water quality monitoring using IoT and cloud services. Trends Environ Anal Chem. 2022;35(1):e00173. doi:10.1016/j.teac.2022.e00173.
34. Oliveira F, Costa D, Assis F, Silva I. Internet of Intelligent Things: a convergence of embedded systems, edge computing and machine learning. Internet Things. 2024;26(9):101153. doi:10.1016/j.iot.2024.101153.
35. Alatoun K, Matrouk K, Mohammed M, Nedoma J, Martinek R, Zmij P. A novel low-latency and energy-efficient task scheduling framework for internet of medical things in an edge fog cloud system. Sensors. 2022;22(14):5327. doi:10.3390/s22145327.
36. Samal L, Bute P. Wireless network for industrial application using ESP32 as Gateway. In: Proceedings of the 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT); 2023 Jul 6–8; Delhi, India. p. 1–5. doi:10.1109/ICCCNT56998.2023.10306864.
37. Hartigan J, Wong M. Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C. 1979;28(1):100–8. doi:10.2307/2346830.
38. Chong B. K-means clustering algorithm: a brief review. Acad J Comput Inf Sci. 2021;4(5):37–40. doi:10.25236/AJCIS.2021.040506.
39. Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; 1996 Aug 2–4; Portland, OR, USA. [Google Scholar]
40. Deng D. DBSCAN clustering algorithm based on density. In: Proceedings of the 2020 7th International Forum on Electrical Engineering and Automation (IFEEA); 2020 Sep 25–27; Hefei, China. p. 949–53. doi:10.1109/IFEEA51475.2020.00199. [Google Scholar] [CrossRef]
41. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. ACM Sigmod Rec. 1996;25(2):103–14. doi:10.1145/235968.233324. [Google Scholar] [CrossRef]
42. Xu D, Tian Y. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93. doi:10.1007/s40745-015-0040-1. [Google Scholar] [CrossRef]
43. Sharmila BS, Nagapadma R. RT-IoT. UCI Mach Learn Repos. 2023. doi:10.24432/C5P338. [Google Scholar] [CrossRef]
44. Dwivedi A, Reddy R, Parmar A, Chaudhari S. AirIoT: IoT-based air pollution monitoring. IEEE Dataport. 2024. doi:10.21227/b9g8-wc47. [Google Scholar] [CrossRef]
45. Rabbani M, Gui J, Nejati F, Zhou Z, Kaniyamattam A, Mirani M, et al. Device identification and anomaly detection in IoT environments. IEEE Internet Things J. 2024;12(10):13625–43. doi:10.1109/JIOT.2024.3522863. [Google Scholar] [CrossRef]
Copyright © 2026 The Author(s). Published by Tech Science Press. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.