EEG-Based Neonatal Sleep Stage Classification Using Ensemble Learning

Sleep stage classification can provide important information regarding neonatal brain development and maturation. Visual annotation of polysomnography (PSG) is considered the gold standard for neonatal sleep stage classification. However, visual annotation is time-consuming and requires professional neurologists. For this reason, an internet of things (IoT) and ensemble-based automatic sleep stage classification algorithm is proposed in this study. Twelve EEG features from each of 9 bipolar channels were used to train and test the base classifiers: a convolutional neural network, a support vector machine, and a multilayer perceptron. Bagging and stacking ensembles are then used to combine their outputs for final classification. The proposed algorithm reaches a mean kappa of 0.73 and 0.66 for 2-stage and 3-stage (wake, active sleep, and quiet sleep) classification, respectively. The proposed network works as a semi-real-time application because a smoothing filter is used to hold the sleep stage for 3 min. Its high performance and its ability to work in semi-real time make it a promising candidate for use in hospitalized newborn infants.


Introduction
Sleep is an important human phenomenon. In neonates, maturation and development of cortical pathways, structural development, and optimal physical growth occur during sleep. Therefore, it is important to monitor neonatal sleep in a neonatal intensive care unit (NICU). Visual assessment of the electroencephalogram (EEG) is considered the gold standard for neonatal sleep staging. Neonatal EEG has been classified into four stages: Awake, Intermediate Sleep (IS), Active Sleep (AS) and Quiet Sleep (QS). The time spent in these sleep states is directly associated with brain maturation [1][2][3]. The distribution changes from 80% AS and 18% QS at an early gestational age to 60% AS and 30% QS at full term. Since visual labelling is time-consuming and needs a professional neurologist, automatic sleep staging algorithms based on machine learning and deep neural networks are of great interest to clinicians.
Recently, internet of things and machine learning algorithms have received a great deal of attention for biomedical signal applications such as EEG seizure detection [4] and medical image analysis [5]. In the past two decades, a growing body of researchers has investigated automatic sleep staging algorithms using artificial neural networks. However, the existing algorithms amalgamate the AS II and Awake stages into a single low voltage irregular (LVI) stage. This results in the misclassification of almost 40% of the sleep EEG.
In this paper, an ensemble-based algorithm using CNN, MLP and SVM is proposed to classify neonatal EEG into three stages, i.e., AS, QS and Awake. The proposed method uses 8 time-domain and 4 frequency-domain features from multichannel EEG for training and testing the networks. Bagging and stacking ensemble methods are then used to combine the outputs of CNN, MLP and SVM. This study reaches an accuracy of 81.99% for three-stage classification. The main contributions of this paper are as follows:
• Traditional neonatal sleep stage classification algorithms amalgamate AS II and Awake into LVI. This study classifies sleep EEG into three separate sleep stages, i.e., Awake, AS and QS.
• A NicoletOne IoT device was deployed to record EEG from 19 healthy neonates. The proposed study used this dataset for further preprocessing and classification. No neonatal dataset is available online; thus, this can be considered a significant contribution to healthcare.
• This is the first time an ensemble-based machine learning algorithm has been used for neonatal sleep stage classification. Traditional algorithms classified sleep stages using a single machine or deep learning algorithm.
• From the literature, it is evident that a neonate spends at least 3 min in one sleep stage. Therefore, a delay filter is used to hold the sleep stage for 3 min. This helps to increase the accuracy by 4%-5%.
The rest of the paper is arranged as follows: Section 2 presents the background; Section 3 presents the preliminaries; Section 4 describes the methodology; results and discussion are presented in Sections 5 and 6, respectively. Finally, Section 7 concludes the paper.

Background
Over the past two decades, machine learning has been increasingly employed in healthcare, image encryption and neonatal sleep analysis. Automatic sleep stage classification using machine and deep learning has become an active research topic. Some maturational changes can only be seen during QS. For this purpose, Turnbull et al. [6] detected a distinctive EEG pattern, Trace Alternant (TA), in 2001. Their scheme can classify TA efficiently; still, it is challenging to detect the entire QS stage with this method. Some aspects of brain development, reflecting changes in brain function, can only be observed in QS [7][8][9]. For this purpose, Dereymaeker et al. [10] proposed an automatic QS detection algorithm using cluster-based adaptive sleep staging (CLASS). The main benefit of CLASS is that it can efficiently detect QS in preterm neonates. Quiet sleep detection using a radial basis function support vector machine (RBF-SVM) was proposed by Koolen et al. [11]. A total of 57 features were used for training and testing the network. The RBF-SVM-based algorithm can reach an accuracy of up to 85% for QS detection.
In 2018, researchers further classified neonatal EEG into four sleep states: anterior dysrhythmia or AS I, low voltage irregular (LVI) or AS II, high voltage slow (HVS), and Trace Alternant (TA)/Trace Discontinue (TD). To classify these sleep states, Pillay et al. [12] presented a generative modelling approach using Hidden Markov models (HMMs) and Gaussian mixture models (GMMs). The proposed models were trained using a large set of 112 features. This scheme achieved a Cohen's kappa of 0.62 for 4-stage classification. In 2020, a convolutional neural network (CNN) based algorithm outperformed all the existing algorithms with a kappa of 0.64 [13]. The convolutional neural network used raw EEG data from 113 EEG recordings. Some brain development and maturation can only be seen during AS, whereas the existing frameworks amalgamate AS II and Awake into LVI. Methods that classify these stages separately are limited. In 2017, Fraiwan et al. [14] proposed an algorithm based on deep autoencoders. The deep autoencoder can reach an accuracy of 80.2% for AS; still, it is limited to only 17% for Awake. To overcome this problem, an automatic algorithm using a multilayer perceptron (MLP) neural network was proposed by Abbasi et al. in 2020 [15]. The MLP achieved an accuracy of 82.53% for sleep-wake classification. However, the MLP-based network only works well for 2-stage classification.

EEG Visual Sleep Labelling
The NicoletOne IoT system was used by two professional neurologists from Fudan Children's Hospital, Shanghai, China. The primary rater (CL) labelled the sleep signals as one of three stages (Awake, AS or QS) or as artifacts. The secondary rater (LW) verified the sleep stages annotated by CL. Physiological signals, i.e., EEG, ECG and EMG, were used for visual sleep scoring. Videos were also considered where needed.

Pre-Processing
The EEG recordings are processed at their original sampling frequency, i.e., 500 Hz. During recording, the EEG signals were contaminated with noise and artifacts. These artifacts should be removed before training and testing the network. The pre-processing is divided into three steps:
• A Finite Impulse Response (FIR) filter with cutoff frequencies of 0.3-35 Hz was used to remove baseline noise, powerline noise and motion artifacts from the neonatal EEG.
• After removing this noise and the motion artifacts, the EEG recordings were segmented into 4560 30-s epochs, each with its respective label.
• Finally, epochs labelled "artifacts" were removed manually.
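The filtering and segmentation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the windowed-sinc design, the Hamming window, the 1001-tap length, and the function names (`lowpass_taps`, `bandpass_taps`, `gain`, `segment_epochs`) are all hypothetical choices. Only the 0.3-35 Hz pass band, the 500 Hz sampling rate, the 30-s epoch length and the "artifacts" label come from the text.

```python
import cmath
import math


def lowpass_taps(cutoff_hz, fs_hz, num_taps):
    """Hamming-windowed sinc low-pass taps, normalised to unit gain at 0 Hz."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def bandpass_taps(low_hz, high_hz, fs_hz, num_taps=1001):
    """Band-pass taps built as the difference of two low-pass designs.

    num_taps=1001 is an assumed value; the paper does not state the order.
    """
    high = lowpass_taps(high_hz, fs_hz, num_taps)
    low = lowpass_taps(low_hz, fs_hz, num_taps)
    return [h - l for h, l in zip(high, low)]


def gain(taps, freq_hz, fs_hz):
    """Magnitude response of the FIR filter at a single frequency."""
    w = 2 * math.pi * freq_hz / fs_hz
    return abs(sum(h * cmath.exp(-1j * w * n) for n, h in enumerate(taps)))


def segment_epochs(signal, labels, fs_hz, epoch_s=30, reject="artifacts"):
    """Cut one channel into fixed-length epochs and drop rejected labels."""
    size = int(fs_hz * epoch_s)
    n_epochs = min(len(signal) // size, len(labels))
    return [
        (signal[i * size:(i + 1) * size], labels[i])
        for i in range(n_epochs)
        if labels[i] != reject
    ]
```

With these assumed settings, the filter blocks the 0 Hz baseline component and attenuates content well above 35 Hz, while leaving the EEG bands of interest nearly unchanged. In the study the artifact epochs were removed manually; the label check here only mimics that step.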

Feature Extraction
Eight time-domain and four frequency-domain features were extracted from each of the 9 EEG channels to form an input vector of size 108 (12 features × 9 channels). The time-domain features were amplitude mean, amplitude median, skewness, kurtosis, standard deviation, variance, minimum and maximum, whereas the frequency-domain features were the mean amplitude values of the delta (0.5-3 Hz), theta (3-8 Hz), alpha (8-12 Hz) and beta (12-30 Hz) bands. Tab. 1 lists the features used in the proposed scheme.
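A minimal sketch of how such a 108-element vector could be assembled is given below. The function names are hypothetical. Two readings of the text are assumptions made here: "mean amplitude value" of a band is taken as the mean DFT magnitude over the bins in that band, and the statistics use population (not sample) normalisation. The naive DFT keeps the example dependency-free; an FFT routine would be the practical choice for 30-s epochs at 500 Hz.

```python
import cmath
import math
import statistics

# Band edges from the text: delta, theta, alpha, beta (Hz).
BANDS_HZ = [(0.5, 3.0), (3.0, 8.0), (8.0, 12.0), (12.0, 30.0)]


def time_features(x):
    """Mean, median, skewness, kurtosis, std, variance, min, max of one epoch."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # population variance (assumed)
    std = math.sqrt(var)
    if std > 0:
        skew = sum((v - mean) ** 3 for v in x) / (n * std ** 3)
        kurt = sum((v - mean) ** 4 for v in x) / (n * std ** 4)
    else:
        skew = kurt = 0.0  # flat epoch: higher moments are undefined
    return [mean, statistics.median(x), skew, kurt, std, var, min(x), max(x)]


def band_features(x, fs_hz):
    """Mean DFT magnitude inside each band (one assumed reading of the text)."""
    n = len(x)
    feats = []
    for lo, hi in BANDS_HZ:
        bins = [k for k in range(1, n // 2 + 1) if lo <= k * fs_hz / n < hi]
        mags = [
            abs(sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x)))
            for k in bins
        ]
        feats.append(sum(mags) / len(mags) if mags else 0.0)
    return feats


def epoch_feature_vector(channels, fs_hz):
    """Concatenate the 12 features of every channel into one input vector."""
    vector = []
    for x in channels:
        vector.extend(time_features(x) + band_features(x, fs_hz))
    return vector
```

For nine channels the returned list has 9 × 12 = 108 entries, ordered channel by channel with the eight time-domain features first.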

Ensemble Learning
In ensemble learning, multiple learners are applied to a dataset to produce predictions [28,29]. These predictions are then aggregated into a single combined prediction. Two steps are involved. First, a set of base learners is trained on the training data. Second, the base learners are combined to form the ensemble model. The combined model typically achieves higher accuracy than its constituent base learners.
The success of ensemble learning rests on three factors. The first is statistical: a learning algorithm searches the hypothesis space H for the most suitable hypothesis, but when the dataset is limited, several hypotheses in H can fit the training data equally well, which makes it difficult to identify the best one. Ensemble methods mitigate this issue by combining several models rather than committing to a single hypothesis. The second is computational: many learning algorithms rely on local search and can get stuck in local optima. An ensemble improves the approximation by starting the local search from multiple points. The third is representational: in many scenarios, the unknown true function may not lie within H. Combining several hypotheses drawn from H widens the space of representable functions, which may then contain the unknown true function.
This paper uses bagging and stacking ensembles for neonatal sleep stage classification. In bagging, several models are built and their results are then combined using a combiner, e.g., majority voting, uniform voting or weighted voting. In stacking, multiple models are built at level 1, and a combiner algorithm is then trained at level 2 using the predictions generated by the base models. This combiner can be any deep or machine learning algorithm. Fig. 2 shows the proposed block diagram for neonatal sleep stage classification. The proposed ensemble system is divided into two modules: a base-level learning module and a metadata combination module. The following subsections briefly explain these modules.
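The majority-voting combiner described above can be sketched in a few lines. The function name `majority_vote` and the tie-breaking rule are assumptions: with three base learners and three classes all votes can differ, and the text does not say how such ties are resolved. Here the tie goes to the earliest-listed model among the tied labels.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-epoch labels from several models by majority voting.

    predictions: one list of labels per model, all of equal length,
    ordered by priority (the first model wins ties; an assumed rule).
    """
    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        tied = {label for label, c in counts.items() if c == top}
        # Pick the first vote, in model order, that belongs to the tied set.
        fused.append(next(v for v in votes if v in tied))
    return fused
```

For example, votes of AS, AS and QS for one epoch yield AS, while three different votes fall back to the label of the first model in the list.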

Base-Level Learning Module
The main purpose of this module is to construct base learners from the training set and to produce metadata for the final classification. To achieve high neonatal sleep classification accuracy and high interpretability, we used CNN, MLP and SVM as base learners. It is widely recognized that diversity, which measures the difference between the base learners, is one of the pivotal ingredients of a good ensemble. To obtain highly discriminative metadata for classification, the base learners should be as diverse as possible. Metadata generation is the second task of the base-level learning module. For this purpose, we used a 4-fold cross-validation process in which the original dataset is divided into 4 disjoint sets of the same size.
The network hyperparameters of the proposed base learners are listed in Tab. 2. It is important to note that these parameters were set manually by trial and error. The parameters with which we achieved the highest accuracy are reported in this study.
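The 4-fold metadata generation can be illustrated with the sketch below, in which every epoch receives a prediction from a model that never saw it during training. The helper names and the `fit` callable (which takes training features and labels and returns a prediction function) are hypothetical, and the contiguous, unshuffled folds are an assumption; the text only states that the four sets are disjoint and of equal size.

```python
def k_fold_indices(n_samples, k=4):
    """Split range(n_samples) into k contiguous, near-equal, disjoint folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_samples, k=4):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def out_of_fold_predictions(features, labels, fit, k=4):
    """Metadata for one base learner: predictions made on held-out folds only."""
    preds = [None] * len(labels)
    for train, test in train_test_splits(len(labels), k):
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        for i in test:
            preds[i] = predict(features[i])
    return preds
```

Running `out_of_fold_predictions` once per base learner gives one metadata column per model, which can then be passed to the combination stage.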

Metadata Combination Module
Model combination is of great importance in ensemble learning. A proper choice of combiner can enhance the classification ability of the selected models. In this paper, we use bagging and stacked generalization schemes for constructing ensembles. For stacking, the outputs of the base learners serve as input to a second-level meta-learner, which learns the mapping between the outputs of the base learners and the final class. In this study, majority voting is used as the combiner for the bagging ensemble, whereas an SVM is used for stacked generalization. Fig. 3 shows the learning process of the proposed scheme, with the convolutional neural network (Fig. 4), multilayer perceptron (Fig. 5) and support vector machine as base learners, together with the combination methods for the stacking and bagging ensembles. CNN, SVM and MLP are used for metadata generation at the base level. Then, majority voting and SVM are used at the top level for pre-final classification. These results are then passed through a smoothing filter for final sleep stage classification. All the parameters were set after running preliminary experiments on the proposed neonatal data. Moreover, to overcome the possible representational limitation of SVM, we have used NNs, which can handle non-linear learning very well. The RMSprop [30] algorithm was used for training the neural networks.
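To make the stacking idea concrete, the sketch below trains a level-2 combiner on the label triples produced by the three base learners. Note the substitution: the study uses an SVM as meta-learner, whereas this dependency-free illustration swaps in a simple lookup table that returns the most frequent true label seen for each triple. The function name and the fallback for unseen triples are likewise assumptions.

```python
from collections import Counter, defaultdict


def fit_lookup_meta_learner(meta_features, true_labels):
    """Level-2 combiner (a stand-in for the SVM meta-learner of the study).

    meta_features: one (cnn_label, mlp_label, svm_label) tuple per epoch,
    ideally produced out-of-fold so the base learners did not see the epoch.
    """
    table = defaultdict(Counter)
    for combo, label in zip(meta_features, true_labels):
        table[tuple(combo)][label] += 1
    # Fallback for combinations never seen in training: overall majority class.
    overall = Counter(true_labels).most_common(1)[0][0]

    def predict(combo):
        seen = table.get(tuple(combo))
        return seen.most_common(1)[0][0] if seen else overall

    return predict
```

Unlike plain voting, a trained combiner can learn which base learner to trust for a given pattern of disagreement, for example siding with the CNN when all three base learners differ.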

Evaluation Parameters
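The performance metrics used to assess and compare the proposed network were accuracy, precision, recall, F1 score and kappa. Mathematically, these metrics are given as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{Kappa} = \frac{P_{Agree} - P_{Chance}}{1 - P_{Chance}}$$
where TP denotes true positives, TN true negatives, FP false positives and FN false negatives. P_Agree is the proportion of results in which the network and the annotation agree, whereas P_Chance is the proportion of agreement expected by chance. To validate the proposed scheme, 4-fold cross-validation was used. The neonatal dataset is divided into 4 folds. One fold is used for testing, whereas the other three are used for training.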
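As an illustration, these metrics can be computed from a multi-class confusion matrix as in the sketch below. The function names are hypothetical, and treating precision, recall and F1 as one-vs-rest quantities per class is an assumption about how the binary definitions are applied to three stages.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are annotated stages, columns are predicted stages."""
    index = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m


def accuracy(m):
    """Fraction of epochs on the diagonal of the confusion matrix."""
    total = sum(map(sum, m))
    return sum(m[i][i] for i in range(len(m))) / total


def precision_recall_f1(m, i):
    """One-vs-rest precision, recall and F1 for class index i."""
    tp = m[i][i]
    fp = sum(row[i] for row in m) - tp
    fn = sum(m[i]) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def cohen_kappa(m):
    """Kappa = (P_agree - P_chance) / (1 - P_chance)."""
    total = sum(map(sum, m))
    p_agree = accuracy(m)
    p_chance = sum(
        sum(m[i]) * sum(row[i] for row in m) for i in range(len(m))
    ) / total ** 2
    return (p_agree - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
```

Here P_chance is obtained from the row and column totals, i.e., the agreement expected if the network and the annotator assigned labels independently with their observed stage frequencies.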

Experimental Results and Comparison
All the networks were run on an Intel Core i5-8400 with 16 GB of RAM. TensorFlow and Keras were used to implement the proposed ensemble network. Features were extracted in MATLAB 2019b. The results of the proposed methods are divided into two parts: 1) two-stage classification and 2) three-stage classification. The following subsections present and discuss the classification results.

Two Stage Classification
The proposed network achieved an accuracy of 94.27% ± 3% for QS detection. The test performance of this study and the existing algorithms is compared in Tab. 3. Kappa, accuracy, sensitivity, and specificity are calculated for evaluation. The proposed ensemble-based algorithm outperforms the existing algorithms for neonatal QS detection. It is important to note that not all the algorithms work on the full range of neonates, i.e., 37 ± 5. The proposed scheme can work for the whole range of neonates.

To the best of our knowledge, the existing algorithms for neonatal sleep-wake classification are limited. Existing algorithms amalgamate Awake and AS II into the LVI stage. The proposed algorithm can also be used for neonatal sleep-wake classification. Tab. 4 shows the performance of the proposed algorithm and the comparison.

Fig. 6 shows the confusion matrix of the proposed ensemble schemes. The proposed network achieved an accuracy of 81.99% and 78.81% for the bagging and stacking ensembles, respectively. The standard error of the proposed scheme is 0.76. Fig. 7 shows the overall test performance of the proposed SVM, MLP, CNN and ensemble algorithms. From the results, it is evident that the ensemble algorithms can increase the accuracy by 7%-8%. Here, it is important to note that each network was tuned to its best and the highest accuracy is reported in this manuscript.

Not every NICU is equipped with multichannel EEG. Therefore, the test performance of the proposed scheme with different numbers of channels is shown in Tab. 5. From the table, it is evident that the accuracy decreases as the number of EEG channels is reduced. The most relevant EEG channels for sleep staging are F4, C4, C3 and F3.

The proposed ensemble network uses a smoothing filter to enhance the accuracy of the network. In the literature, it is believed that a neonate spends at least 3 min in one sleep stage, whether it be Awake, AS or QS. Thus, after ensembling the outputs of SVM, CNN and MLP, a smoothing filter is used which holds the sleep stage for 6 epochs (6 × 30 s = 3 min). Tab. 6 shows the performance of the proposed network with and without this post-processing step.
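One possible reading of this hold rule is sketched below. The exact filter is not specified in the text, so the logic here is an assumption: once a stage is output, it is kept for at least 6 consecutive epochs, and a differing ensemble prediction is only accepted after that minimum dwell time has elapsed. The function name `hold_stage` is hypothetical.

```python
def hold_stage(stages, hold=6):
    """Suppress stage changes that occur before `hold` epochs have elapsed.

    stages: per-epoch labels from the ensemble (30-s epochs, so hold=6
    corresponds to the 3-min minimum dwell time mentioned in the text).
    """
    smoothed = []
    current = None
    held = 0  # number of epochs the current stage has been output so far
    for stage in stages:
        if current is None or (stage != current and held >= hold):
            current = stage  # accept the first stage or a sufficiently late change
            held = 1
        else:
            held += 1  # keep holding, whether or not the prediction agrees
        smoothed.append(current)
    return smoothed
```

Under this rule an isolated one-epoch flip inside a longer bout is removed, at the cost of delaying genuine transitions that occur within the first 3 min of a bout, which is consistent with describing the system as semi-real-time.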

Discussion
To the best of our knowledge, this is the first ensemble-based study for neonatal sleep staging. We show that the proposed ensemble can increase the accuracy of neonatal sleep staging to 81.99%. The proposed study outperforms the existing networks for all sleep stage classification tasks, i.e., Awake detection, QS detection and three-stage classification. This network can be called a "semi" real-time application. We call it "semi" because it uses a smoothing filter which holds a sleep stage for 3 min. One major advantage of this scheme is that the algorithm works for the whole range of neonates, i.e., 37 ± 5. The high performance metrics of the proposed study, along with its semi-real-time operation, make it a promising candidate for use in the NICU. This study uses multichannel bipolar EEG for classification: 9 bipolar EEG channels were used for neonatal sleep stage classification. Not every NICU is equipped with multichannel EEG; therefore, we investigated the proposed scheme with fewer channels. Tab. 5 shows the performance comparison of the proposed scheme with different numbers of EEG channels. It is important to note that the accuracy decreases as the number of channels is reduced.
The proposed study uses the 12 most important features from 30-s epochs. These features are divided into two categories: time domain and frequency domain. The time-domain features are amplitude mean, amplitude median, skewness, kurtosis, standard deviation, variance, minimum and maximum, whereas the frequency-domain features are the delta, theta, alpha and beta bands. Of these 12 extracted features, the most prominent are the frequency-domain features. These features can improve the overall accuracy by 10%-15%.
To assess the performance of the proposed study, accurate visual annotation is required. For this reason, two neurologists performed the annotation. The primary rater (CL) labelled the sleep signals as one of three stages (Awake, AS or QS) or as artifacts. The secondary rater (LW) verified the sleep stages annotated by CL. Physiological signals, i.e., EEG, ECG and EMG, were used for visual sleep scoring. Videos were also considered where needed. The performance of the ensemble algorithm is described in section 4. The existing algorithms for three-stage classification, i.e., AS, QS and Awake, are limited. For this reason, the proposed study applied different machine learning and neural network models to the neonatal dataset and reported the performance metrics.
Ensembling has not previously been employed for neonatal sleep stage classification. This is the first time an ensemble-based algorithm has been proposed. In this study, two ensemble algorithms are used for neonatal sleep stage classification, i.e., bagging and stacking. In bagging, CNN, MLP and SVM are used as base classifiers. The final classification is then obtained by combining their outputs using majority voting. In stacking, CNN, MLP and SVM are used for base classification. The outputs of these classifiers are then used to train the level-2 classifier, i.e., an SVM. The proposed scheme reaches a mean accuracy of 81.99% and 78.81% for the bagging and stacking ensemble algorithms, respectively.
This study has some limitations that should be taken into consideration. First, the proposed study uses 19 neonates for sleep staging. A larger dataset could increase the generalizability of the study. Second, the existing algorithms for three-stage classification are limited; therefore, we applied different machine learning and neural network models to our dataset and compared the results in this study. Third, the proposed study removes artifacts manually using visual annotation. In real-time applications, these artifacts can contaminate the EEG recording and may decrease the accuracy of the proposed network. An automatic artifact removal technique would be needed to use this algorithm directly in the NICU.
As future work, we aim to classify one more stage, i.e., intermediate sleep (IS). In addition, both AS and Awake contain LVI and mixed EEG signals. Thus, we aim to use EOG, EMG and ECG along with EEG to increase the performance of the neonatal sleep stage classifier. Furthermore, a larger neonatal sleep dataset will be collected to obtain more accurate and generalizable results.

Conclusion
In this study, an IoT and ensemble-based machine learning algorithm is proposed for neonatal sleep stage classification. The NicoletOne IoT system is used for neonatal data extraction and annotation. This is the first time ensemble-based sleep staging has been proposed. Twelve EEG features were extracted from each of 9 bipolar EEG channels to train and test CNN, SVM and MLP as base classifiers. Then, two ensemble algorithms, i.e., bagging and stacking, were deployed for final classification. The performance results show that the proposed study outperforms existing algorithms for 2-class and 3-class sleep classification. To conclude, the high performance of the proposed scheme and its ability to work in semi-real time make it a promising candidate for use in the NICU.