An Imbalanced Dataset and Class Overlapping Classification Model for Big Data

doi:10.32604/csse.2023.024277

ARTICLE

Mini Prince¹,*, P. M. Joe Prathap²

1 Department of Information Technology, St. Peter’s College of Engineering and Technology, Chennai, 600054, Tamilnadu, India
2 Department of Information Technology, R.M.D Engineering College, Chennai, 601206, Tamilnadu, India

* Corresponding Author: Mini Prince. Email: email

Computer Systems Science and Engineering 2023, 44(2), 1009-1024. https://doi.org/10.32604/csse.2023.024277

Received 12 October 2021; Accepted 17 January 2022; Issue published 15 June 2022

Abstract

Most modern technologies, such as social media, smart cities, and the Internet of Things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers get stuck in local optima, so new methods for handling large data collections are needed. Several solutions have been proposed for overcoming this issue, but the rapid growth of the available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling Technique (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. However, the practical applicability of these techniques is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method that combines SMOTE with a MapReduce strategy, distributing the operation of the algorithm among a group of computational nodes to address this limitation. The proposed solution is divided into three stages. The first stage splits the data into blocks using a mapping function, followed by a pre-processing step on each map block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. In the second stage, a decision tree model is constructed on each map block. Finally, the per-block decision trees are combined to create a classification model. We tested the proposed scheme’s capabilities on numerous datasets with up to 4 million instances. The results indicate that the hybrid SMOTE scales well within the proposed framework and also reduces the processing time.
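The three stages described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' implementation: the abstract does not specify the hybrid SMOTE variant, the tree learner, or the combination rule, so the sketch assumes plain SMOTE (interpolation between nearest minority neighbours), a depth-1 decision tree (a stump) as a stand-in for the per-block decision tree, and majority voting as the combination step. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def split_blocks(rows, n_blocks):
    """Map step (assumed): deal rows round-robin into n_blocks partitions."""
    return [rows[i::n_blocks] for i in range(n_blocks)]

def smote(minority, n_new, k, rng):
    """Plain SMOTE: interpolate between a minority point and one of its
    k nearest minority neighbours (squared Euclidean distance)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        if not neighbours:  # only one minority point: duplicate it
            synthetic.append(list(base))
            continue
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

def balance(block, rng, k=5):
    """Oversample the smaller class of one block until both classes match."""
    counts = Counter(y for _, y in block)
    if len(counts) < 2:
        return block
    (_, n_maj), (mino, n_min) = counts.most_common(2)
    points = [x for x, y in block if y == mino]
    return block + [(x, mino) for x in smote(points, n_maj - n_min, k, rng)]

def fit_stump(block):
    """Depth-1 tree (stand-in for a full decision tree): choose the
    feature, threshold, and leaf labels with the fewest training errors."""
    labels = sorted({y for _, y in block})
    best = None
    for f in range(len(block[0][0])):
        for t in sorted({x[f] for x, _ in block}):
            for left in labels:
                for right in labels:
                    err = sum((left if x[f] <= t else right) != y
                              for x, y in block)
                    if best is None or err < best[0]:
                        best = (err, f, t, left, right)
    return best[1:]

def predict(stumps, x):
    """Reduce step (assumed): majority vote over the per-block models."""
    votes = Counter(left if x[f] <= t else right
                    for f, t, left, right in stumps)
    return votes.most_common(1)[0][0]

# Toy imbalanced data: 200 majority points (class 0), 20 minority (class 1).
rng = random.Random(0)
rows = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(200)]
rows += [([rng.gauss(5, 1), rng.gauss(5, 1)], 1) for _ in range(20)]

blocks = split_blocks(rows, 4)                 # stage 1: split into blocks
balanced = [balance(b, rng) for b in blocks]   # stage 1: SMOTE per block
stumps = [fit_stump(b) for b in balanced]      # stage 2: one model per block
print(predict(stumps, [5.0, 5.0]), predict(stumps, [0.0, 0.0]))  # stage 3
```

In a real MapReduce deployment, each block would be processed on a separate node, and only the fitted models would be sent to the combining step; the sequential list comprehensions above stand in for that distribution.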

Keywords

Imbalanced dataset; class overlapping; SMOTE; MapReduce; parallel programming; oversampling

Cite This Article

APA Style

Prince, M., Joe Prathap, P.M. (2023). An Imbalanced Dataset and Class Overlapping Classification Model for Big Data. Computer Systems Science and Engineering, 44(2), 1009–1024. https://doi.org/10.32604/csse.2023.024277

Vancouver Style

Prince M, Joe Prathap PM. An Imbalanced Dataset and Class Overlapping Classification Model for Big Data. Comput Syst Sci Eng. 2023;44(2):1009–1024. https://doi.org/10.32604/csse.2023.024277

IEEE Style

M. Prince and P. M. Joe Prathap, “An Imbalanced Dataset and Class Overlapping Classification Model for Big Data,” Comput. Syst. Sci. Eng., vol. 44, no. 2, pp. 1009–1024, 2023. https://doi.org/10.32604/csse.2023.024277

Copyright © 2023 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
