The generation of massive data is increasing in big data industries due to the evolution of modern technologies. These industries draw on data sources such as sensors, the Internet of Things, and digital and social media. Big data systems typically consist of data extraction, preprocessing, integration, analysis, and visualization mechanisms. The data gathered from these sources are often redundant, incomplete, and conflicting. Moreover, in real-time applications, interpreting all the data from different sources is a tedious process. In this paper, the gathered data are preprocessed to handle redundancy, incompleteness, and conflicts. For that purpose, a generalized dimensionality reduction technique called Shrinkage Linear Discriminant Analysis (SLDA) is proposed. The SLDA improves the performance of the classifier through generalization. Even though dimensionality reduction improves classifier performance, irrelevant features can still degrade the performance of the system. Hence, the relevant and most important features are selected using a Pearson correlation-based feature selection technique, which selects a subset of correlated features to improve the performance of the classification system. The selected features are classified using the proposed Quadratic-Gaussian Discriminant Analysis (QGDA) classifier. The proposed techniques are tested with the localization and cover datasets from the University of California Irvine (UCI) machine learning repository. In addition, the proposed techniques are evaluated with standard metrics and compared with similar methods, which demonstrates the efficiency of the proposed classification system. The achieved accuracy is over 91% for all experiments on these datasets.
Based on the results evaluated in terms of training percentage and mapper size, it is reasonable to conclude that the proposed method can be used for big data classification.
Huge datasets, termed big data, are difficult to handle using classical database structures [
Big data is characterized by the volume, veracity, and variety associated with it. To address this issue, machine learning and data mining algorithms are used. However, the existing approaches do not cope well with the data size [
Discriminant analysis comprises many techniques used to solve classification problems. These are recognized as model-based machine learning methods [
These limitations motivate the present work to investigate the behavior of discriminant techniques in classification. The major contributions of the work are as follows:

- Input data are preprocessed with the dimensionality reduction method LDA, improved with shrinkage. This shrinkage LDA projects the data into a low-dimensional space for better execution of the classification approaches.
- The reduced data are processed using Pearson correlation-based feature selection to select the relevant and most important features.
- The reduced-dimensional data with selected features are classified using the proposed Quadratic Gaussian Discriminant classifier.
- Classification performance is evaluated with standard metrics and compared with existing algorithms.
The paper is organized into five sections: Section 2 reviews traditional techniques, Section 3 introduces the proposed approaches, namely dimensionality reduction, feature selection and classification, Section 4 discusses the experimental results, and Section 5 concludes the research work.
The following section reviews the literature related to big data classification and the implemented methods.
Mujeeb et al. [
Sleeman et al. [
Bejaoui et al. [
Ghojogh et al. [
Nanga et al. [
Some of the previous studies on algorithms used by various researchers are shown in
Authors | Methods applied | Description | Key findings |
---|---|---|---|
Fu [ | LDA and PCA | Two dimensionality reduction techniques, LDA and PCA, were analyzed with ML algorithms such as decision tree, support vector machine, naïve Bayes and random forest classifiers on the cardiotocography dataset. | Experimental results showed that PCA performed better for dimensionality reduction on the cardiotocography dataset |
Vogelstein et al. [ | Principal component analysis (PCA), singular value decomposition (SVD), linear discriminant analysis (LDA), locality preserving projections (LPP), latent semantic analysis (LSA), independent component analysis and projection pursuit analysis | The characteristics, strengths, weaknesses and applications of supervised, semi-supervised and unsupervised dimensionality reduction methods were reviewed. | The data types to which the different DR techniques had been applied were also explored. |
Ledoit et al. [ | Linear dimensionality reduction methods such as PCA and LDA; nonlinear methods such as local tangent space alignment (LTSA) | Analyzed the impact of high-dimensional data on discriminant analysis and discussed the necessity of dimensionality reduction for high-dimensional data | Demonstrated the development of a dimensionality reduction method. |
Mitja et al. [ | EMD and PCA for dimensionality reduction; LDA for feature selection | The techniques were applied in a deep neural network for medical diagnosis. The authors also discussed the importance of dimensionality reduction in deep learning. | It was found that feature selection and feature extraction methods with dimensionality reduction decrease the computation time. |
Duan et al. [ | Supervised dimensionality reduction method called linear optimal low-rank projection | Introduced a novel approach that incorporated PCA with class-conditional moments and estimated the low-dimensional projection. | Evaluated on brain imaging datasets and concluded that linear optimal low-rank projection and its generalization maintain computational efficiency. |
Fawzi et al. [ | LDA and QDA | Investigated the accuracy of LDA and QDA under two settings: single-subject cross-validation and cross-subject generalization. | The mean accuracy of single-subject cross-validation was 59% and of cross-subject generalization 51%. Neither classifier could reject the null hypothesis. |
Further, quality optimization in resource management using an optimization algorithm was presented in [
Numerous data mining and machine learning techniques are available for processing high-dimensional big data. Due to its complex nature, processing such big data is still a challenge. Preprocessing plays a vital role in reducing these huge datasets for further effective processing with high accuracy. In this regard, irrelevant and missing raw data are handled in the preprocessing stage, and the collected data are transformed to a low-dimensional space through dimensionality reduction techniques. Initially, the dataset is separated into test and training data based on k-fold validation. The overview of the proposed technique is given in
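As a sketch of this initial split step, k-fold validation can be set up as follows; the fold count, data shapes and random seed here are illustrative assumptions, not values taken from the paper:

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative data: 100 samples with 8 attributes, as in the localization dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)

# Separate the dataset into training and test data via k-fold validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # each fold would feed the preprocessing and classification pipeline
```

Each of the five folds holds out 20% of the samples for testing, so every instance is used for testing exactly once.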
Big datasets consume large amounts of storage. The storage of these big datasets is compressed and reduced using dimensionality reduction techniques for better execution. This low-dimensional representation speeds up computation, improves classification accuracy, and reduces information loss [
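A minimal sketch of shrinkage LDA for dimensionality reduction, using scikit-learn's `LinearDiscriminantAnalysis` with Ledoit-Wolf shrinkage as a stand-in for the proposed SLDA (the data, class count and component count below are illustrative assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))       # illustrative high-dimensional data
y = rng.integers(0, 3, size=200)     # 3 classes allow at most 2 LDA components

# The 'eigen' solver supports both shrinkage of the covariance estimate
# ('auto' selects the Ledoit-Wolf shrinkage intensity) and projection.
slda = LinearDiscriminantAnalysis(solver='eigen', shrinkage='auto',
                                  n_components=2)
X_low = slda.fit_transform(X, y)
print(X_low.shape)   # (200, 2): data projected to the low-dimensional space
```

The shrinkage term regularizes the within-class covariance estimate, which is what gives the generalization benefit described above when the sample-to-feature ratio is unfavorable.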
A Gaussian distribution is used for f(x). The Gaussian distribution with the discriminant function is declared as
The Pearson correlation measures the relationship between data in the range [−1, 1], where +1 indicates positive correlation, 0 indicates no correlation and −1 indicates negative correlation. In contrast to other feature selection models that remove features one step at a time, PCRFE removes the irrelevant data at once. Due to this hybrid approach, it is faster than filter, wrapper and embedded FS methods. The correlation coefficients of the features are calculated using the
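The correlation-based selection step can be sketched as follows; the threshold value and data are illustrative assumptions, and `pearson_select` is a hypothetical helper rather than the paper's exact PCRFE procedure:

```python
import numpy as np

def pearson_select(X, y, threshold):
    """Keep features whose absolute Pearson correlation with the
    target exceeds the threshold; drop the rest in one pass."""
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1]
                      for j in range(X.shape[1])])
    keep = np.abs(corrs) > threshold
    return X[:, keep], keep

rng = np.random.default_rng(1)
y = rng.normal(size=300)
relevant = y + 0.1 * rng.normal(size=300)   # strongly correlated feature
noise = rng.normal(size=(300, 3))           # irrelevant features
X = np.column_stack([relevant, noise])

X_sel, mask = pearson_select(X, y, threshold=0.5)
print(mask)   # the correlated feature is retained, the noise columns dropped
```

All coefficients are computed once and the low-correlation columns are removed together, matching the one-pass behavior described above.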
Quadratic Discriminant Analysis is a variant of LDA in which a separate covariance matrix is estimated for each class. QDA is used for individual classes with different covariances. Unlike LDA, QDA cannot be used for dimensionality reduction. In this proposed work, Quadratic Discriminant Analysis is combined with the Gaussian property to serve as a classifier. In QDA, the class measurements are assumed to be normally distributed, with no assumption that the covariance matrices are identical. The log posterior of the
where μ_C is the mean vector and Σ_C is the covariance matrix of class C. The class posterior is calculated using Bayes' theorem as
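As a hedged stand-in for the proposed QGDA, scikit-learn's standard `QuadraticDiscriminantAnalysis` fits one Gaussian per class (its own mean vector μ_C and covariance matrix Σ_C) and classifies by the quadratic log posterior described above; the paper's exact variant may differ, and the data here are synthetic:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(2)
# Two Gaussian classes with different covariances, which is exactly the
# setting QDA models (LDA would force a shared covariance matrix).
X0 = rng.normal(loc=0.0, scale=1.0, size=(150, 2))
X1 = rng.normal(loc=3.0, scale=2.0, size=(150, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 150 + [1] * 150)

qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)                     # estimates mu_C and Sigma_C per class
print(qda.score(X, y))            # training accuracy on separated classes
print(qda.predict_proba(X[:1]))   # Bayes-theorem class posterior
```

Because each class keeps its own covariance, the decision boundary is quadratic rather than linear, which is the property the proposed classifier relies on.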
The workflow of the proposed discriminant-based classification using proposed QGDA is shown in
The following section discusses the experimental results of the proposed discriminant algorithms using Python (scikit-learn).
In order to check the efficiency and robustness of the proposed algorithm, it was examined on the localization dataset from the UCI machine learning repository. For that dataset, the activities of five people wearing tags on the left ankle, right ankle, belt and chest were recorded and collected. It contains 164,860 instances and eight attributes. The tags of the localization data were formed for each instance and recognized using the attributes. For further evaluation, the cover dataset was also taken from the UCI machine learning repository; it contains 581,012 instances with 54 attributes.
Five metrics, namely sensitivity, accuracy, specificity, execution time and memory, were evaluated to assess the performance of the proposed algorithm. Accuracy measures the proportion of true results. The proportions of correctly classified true positives and true negatives are referred to as sensitivity and specificity, respectively.
where TP = true positive, TN = true negative, FP = false positive, FN = false negative. The performance of the proposed algorithm was compared with existing big data classification algorithms such as Naïve Bayes (NB), Correlated Naïve Bayes (CNB) and Fuzzy Naïve Bayes (FNB). The evaluation was analyzed with these metrics in terms of training percentage and mapper size, where the mapper size denotes the number of desktops used for execution.
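These metric definitions can be written directly from the confusion-matrix counts (the counts below are illustrative, not results from the paper):

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """True positive rate: correctly classified positives."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: correctly classified negatives."""
    return tn / (tn + fp)

tp, tn, fp, fn = 90, 85, 10, 15          # illustrative confusion counts
print(accuracy(tp, tn, fp, fn))          # 0.875
print(round(sensitivity(tp, fn), 3))     # 0.857
print(round(specificity(tn, fp), 3))     # 0.895
```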
The analysis of the proposed approach on the localization dataset was performed with training percentages of 80%, 85% and 90%. Mapper sizes of 3, 4 and 5 were considered for the evaluation.
Classifier | Training data (%) | Mappers (M) | Acc (%) | Sens (%) | Spec (%) | Memory (MB) | Execution time (s) |
---|---|---|---|---|---|---|---|
NB | 80 | 3 | 76.2 | 79.2 | 80.5 | 39.2 | 29.5 |
| | | 4 | 76.1 | 79.1 | 80.4 | 39 | 29.3 |
| | | 5 | 76.3 | 79.3 | 80.6 | 38.9 | 29.2 |
| | 85 | 3 | 76.5 | 79.5 | 80.8 | 38.2 | 28.5 |
| | | 4 | 76.4 | 79.4 | 80.7 | 38.3 | 28.6 |
| | | 5 | 76.8 | 79.8 | 81.1 | 38.1 | 28.4 |
| | 90 | 3 | 77.1 | 80.1 | 81.4 | 36.4 | 26.7 |
| | | 4 | 77.4 | 80.4 | 81.7 | 36 | 26.3 |
CNB | 80 | 3 | 79.3 | 82.3 | 83.6 | 36.7 | 27 |
| | | 4 | 79.4 | 82.4 | 83.7 | 39.7 | 30 |
| | | 5 | 79.6 | 82.6 | 83.9 | 36 | 26.3 |
| | 85 | 3 | 79.3 | 82.3 | 83.6 | 35.8 | 26.1 |
| | | 4 | 79.5 | 82.5 | 83.8 | 35.3 | 25.6 |
| | | 5 | 79.7 | 82.7 | 84 | 34.7 | 25 |
| | 90 | 3 | 79.6 | 82.6 | 83.9 | 32.1 | 22.4 |
| | | 4 | 79.7 | 82.7 | 84 | 30.9 | 21.2 |
FNB | 80 | 3 | 81.2 | 84.2 | 85.5 | 33.3 | 23.6 |
| | | 4 | 81.4 | 84.4 | 85.7 | 32.7 | 23 |
| | | 5 | 81.7 | 84.7 | 86 | 31.7 | 22 |
| | 85 | 3 | 81.8 | 84.8 | 86.1 | 32.2 | 22.5 |
| | | 4 | 81.3 | 84.3 | 85.6 | 30.5 | 20.8 |
| | | 5 | 81.9 | 84.9 | 86.2 | 27.3 | 17.6 |
| | 90 | 3 | 82.3 | 85.3 | 86.6 | 27.8 | 18.1 |
| | | 4 | 82.4 | 85.4 | 86.7 | 26.6 | 16.9 |
QGDA | 80 | 3 | 86.3 | 89.3 | 90.6 | 31.4 | 21.7 |
| | | 4 | 86.5 | 89.5 | 90.8 | 28.3 | 18.6 |
| | | 5 | 86.4 | 89.4 | 90.7 | 27.1 | 17.4 |
| | 85 | 3 | 87.4 | 90.4 | 91.7 | 27.4 | 17.7 |
| | | 4 | 87.5 | 90.5 | 91.8 | 27.3 | 17.6 |
| | | 5 | 87.3 | 90.3 | 91.6 | 26.4 | 16.7 |
| | 90 | 3 | 88.4 | 91.4 | 92.7 | 26.3 | 16.6 |
| | | 4 | 88.6 | 91.6 | 92.9 | 26.2 | 16.5 |
Based on the training percentage analysis from
The analysis of the proposed approach on the cover dataset was performed with training percentages of 80%, 85% and 90%. Mapper sizes of 3, 4 and 5 were considered for the evaluation.
Classifier | Training data (%) | Mappers (M) | Acc (%) | Sens (%) | Spec (%) | Memory (MB) | Execution time (s) |
---|---|---|---|---|---|---|---|
NB | 80 | 3 | 68 | 71 | 72.3 | 39.1 | 29.4 |
| | | 4 | 68.2 | 71.2 | 72.5 | 38 | 28.3 |
| | | 5 | 68.4 | 71.4 | 72.7 | 37.3 | 27.6 |
| | 85 | 3 | 69.2 | 72.2 | 73.5 | 36.2 | 26.5 |
| | | 4 | 69.4 | 72.4 | 73.7 | 35.4 | 25.7 |
| | | 5 | 70.2 | 73.2 | 74.5 | 35.2 | 25.5 |
| | 90 | 3 | 71.3 | 74.3 | 75.6 | 32.4 | 22.7 |
| | | 4 | 71.6 | 74.6 | 75.9 | 31.5 | 21.8 |
CNB | 80 | 3 | 72.5 | 75.5 | 76.8 | 36.7 | 27 |
| | | 4 | 73.7 | 76.7 | 78 | 34.2 | 24.5 |
| | | 5 | 73.9 | 76.9 | 78.2 | 33.3 | 23.6 |
| | 85 | 3 | 74.3 | 77.3 | 78.6 | 34.2 | 24.5 |
| | | 4 | 74.6 | 77.6 | 78.9 | 33.9 | 24.2 |
| | | 5 | 74.8 | 77.8 | 79.1 | 32.2 | 22.5 |
| | 90 | 3 | 74.5 | 77.5 | 78.8 | 34 | 24.3 |
| | | 4 | 75.1 | 78.1 | 79.4 | 33.8 | 24.1 |
FNB | 80 | 3 | 75.4 | 78.4 | 79.7 | 33.3 | 23.6 |
| | | 4 | 75.6 | 78.6 | 79.9 | 32.7 | 23 |
| | | 5 | 75.8 | 78.8 | 80.1 | 31 | 21.3 |
| | 85 | 3 | 76.1 | 79.1 | 80.4 | 32.2 | 22.5 |
| | | 4 | 76.2 | 79.2 | 80.5 | 30.5 | 20.8 |
| | | 5 | 76.4 | 79.4 | 80.7 | 27.4 | 17.7 |
| | 90 | 3 | 76.6 | 79.6 | 80.9 | 27.7 | 18 |
| | | 4 | 76.8 | 79.8 | 81.1 | 27.2 | 17.5 |
QGDA | 80 | 3 | 81.4 | 84.4 | 85.7 | 27.3 | 17.6 |
| | | 4 | 82.3 | 85.3 | 86.6 | 27.1 | 17.4 |
| | | 5 | 83.8 | 86.8 | 88.1 | 26.1 | 16.4 |
| | 85 | 3 | 83.4 | 86.4 | 87.7 | 25.3 | 15.6 |
| | | 4 | 83.8 | 86.8 | 88.1 | 24.1 | 14.4 |
| | | 5 | 84 | 87 | 88.3 | 23.4 | 13.7 |
| | 90 | 3 | 86.7 | 89.7 | 91 | 24.2 | 14.5 |
| | | 4 | 86.9 | 89.9 | 91.2 | 21.2 | 11.5 |
Based on the analysis of
The comparative analysis from
This paper, which focuses on big data classification based on discriminant techniques, was implemented using Python scikit-learn. The initial dataset was transformed to a low-dimensional space using Shrinkage LDA to improve classification accuracy. The proposed approach then used the Pearson correlation-based feature selection algorithm to select the relevant features for further processing. The proposed classification algorithm based on the Gaussian Bayes theorem, the Quadratic Gaussian Discriminant method, obtained high accuracy in classifying the localization and cover datasets. The algorithm was evaluated across various training percentages and mapper sizes. Similarly, the NB, CNB and FNB methods were evaluated and compared with the proposed algorithm. The simulation outcomes showed that the proposed algorithm performed best, with accuracy, sensitivity, specificity, memory and execution time of 88.8%, 91.8%, 93.1%, 26.1 MB and 15.4 s respectively on the localization dataset. Moreover, on the cover dataset the metrics were 89.5% accuracy, 92.5% sensitivity, 93.8% specificity, 20.3 MB memory and 10.9 s execution time. Hence, the proposed QGDA proved to be the best of the compared algorithms for big data classification.
We would like to give special thanks to Taif University Research supporting Project Number (TURSP-2020/211), Taif University, Taif, Saudi Arabia.