Open Access

ARTICLE

A Comparative Study of Data Representation Techniques for Deep Learning-Based Classification of Promoter and Histone-Associated DNA Regions

Sarab Almuhaideb1,*, Najwa Altwaijry1, Isra Al-Turaiki1, Ahmad Raza Khan2, Hamza Ali Rizvi3

1 Computer Science Department, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh, 11543, Saudi Arabia
2 Chemical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
3 Computer Science and Engineering Department, Punjab Engineering College, Sector 12, Chandigarh, 160012, India

* Corresponding Author: Sarab Almuhaideb.

(This article belongs to the Special Issue: Emerging Machine Learning Methods and Applications)

Computers, Materials & Continua 2025, 85(2), 3095-3128. https://doi.org/10.32604/cmc.2025.067390

Abstract

Many bioinformatics applications require determining the class of a newly sequenced deoxyribonucleic acid (DNA) sequence, which makes DNA sequence classification an integral step in bioinformatics analysis, where large biomedical datasets are transformed into valuable knowledge. Existing methods rely on a feature extraction step and suffer from high computational time requirements. In contrast, newer approaches that leverage deep learning have shown significant promise in improving accuracy and efficiency. In this paper, we investigate the performance of several deep learning architectures for DNA sequence classification: Convolutional Neural Network (CNN), CNN-Long Short-Term Memory (CNN-LSTM), CNN-Bidirectional Long Short-Term Memory (CNN-BiLSTM), Residual Network (ResNet), and InceptionV3. The input datasets are represented using several numerical and visual data representation techniques: label encoding, k-mer sentence encoding, k-mer one-hot vectors, Frequency Chaos Game Representation (FCGR), and 5-Color Map (ColorSquare). Three datasets are used to train the models: H3, H4, and the DNA Sequence Dataset (Yeast, Human, Arabidopsis Thaliana). Experiments are performed to determine which combination of DNA representation and deep learning architecture yields the best performance on the classification task. Our results indicate that a hybrid CNN-LSTM neural network trained on DNA sequences represented as one-hot encoded k-mer sequences performs best, achieving an accuracy of 92.1%.
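To make the numerical representations named in the abstract concrete, the following is a minimal sketch of label encoding, k-mer sentence encoding, and k-mer one-hot vectors. It is not the authors' implementation: the function names (`label_encode`, `kmer_sentence`, `kmer_one_hot`), the nucleotide ordering A, C, G, T, the stride-1 sliding window, and the assumption that sequences contain only those four bases are illustrative choices made here.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"  # assumed nucleotide order; ambiguous bases such as N are not handled


def label_encode(seq):
    """Map each nucleotide to an integer label (A=0, C=1, G=2, T=3)."""
    lookup = {base: i for i, base in enumerate(ALPHABET)}
    return np.array([lookup[base] for base in seq], dtype=np.int64)


def kmers(seq, k):
    """Return the overlapping k-mers of seq (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_sentence(seq, k):
    """Join the k-mers with spaces so the sequence reads like a sentence of words."""
    return " ".join(kmers(seq, k))


def kmer_one_hot(seq, k):
    """One-hot encode each k-mer over the 4**k possible k-mers.

    Returns an array of shape (len(seq) - k + 1, 4**k).
    """
    # Vocabulary in lexicographic order over ALPHABET: AAA, AAC, AAG, ...
    vocab = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    words = kmers(seq, k)
    encoded = np.zeros((len(words), len(vocab)), dtype=np.float32)
    for row, word in enumerate(words):
        encoded[row, vocab[word]] = 1.0
    return encoded


if __name__ == "__main__":
    example = "ACGTA"
    print(label_encode(example))
    print(kmer_sentence(example, 3))
    print(kmer_one_hot(example, 3).shape)
```

Under these assumptions, the one-hot matrix grows as 4^k columns per k-mer, so the choice of k trades vocabulary size against context length; the k used in the study is not stated in the abstract.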

Keywords

DNA sequence classification; deep learning; data visualization

Cite This Article

APA Style
Almuhaideb, S., Altwaijry, N., Al-Turaiki, I., Khan, A.R., Rizvi, H.A. (2025). A Comparative Study of Data Representation Techniques for Deep Learning-Based Classification of Promoter and Histone-Associated DNA Regions. Computers, Materials & Continua, 85(2), 3095–3128. https://doi.org/10.32604/cmc.2025.067390
Vancouver Style
Almuhaideb S, Altwaijry N, Al-Turaiki I, Khan AR, Rizvi HA. A Comparative Study of Data Representation Techniques for Deep Learning-Based Classification of Promoter and Histone-Associated DNA Regions. Comput Mater Contin. 2025;85(2):3095–3128. https://doi.org/10.32604/cmc.2025.067390
IEEE Style
S. Almuhaideb, N. Altwaijry, I. Al-Turaiki, A. R. Khan, and H. A. Rizvi, “A Comparative Study of Data Representation Techniques for Deep Learning-Based Classification of Promoter and Histone-Associated DNA Regions,” Comput. Mater. Contin., vol. 85, no. 2, pp. 3095–3128, 2025. https://doi.org/10.32604/cmc.2025.067390



Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.