DNA Sequence Classification: An Advanced Machine Learning Framework for Accurate Splice Junction Detection

Authors
Sharif, Kazi Shaharair
Uddin, Ifrat Ikhtear
Abubakkar, Md
Khan, Md Munsur
Ahmad, Imran
Uddin, Mohammed Majbah
Issue Date
2025-10-17
Type
Conference paper
Keywords
Bioinformatics, DNA sequence classification, Machine Learning, Splice site prediction
Citation
K. S. Sharif, I. I. Uddin, M. Abubakkar, M. M. Khan, I. Ahmad and M. M. Uddin, "DNA Sequence Classification: An Advanced Machine Learning Framework For Accurate Splice Junction Detection," 2025 International Conference on Metaverse and Current Trends in Computing (ICMCTC), Subang Jaya, Malaysia, 2025, pp. 1-6, doi: 10.1109/ICMCTC62214.2025.11196541.
Abstract

In the context of genomic data analysis, DNA splice junction classification is a critical task for understanding gene expression, as these junctions are sites where introns are removed and exons are joined. Accurate identification of splice junctions is essential for deciphering gene functionality. Traditional methods, such as sequence alignment, are often slow and computationally intensive, especially when processing large-scale DNA datasets. To address this, we developed and evaluated multiple machine learning (ML) and deep learning (DL) models for the accurate classification of splice junctions. Our goal was to enhance classification accuracy, reduce computational costs, and provide a comparative analysis of different modeling approaches to advance research in genomic data analysis. We employed a methodological framework that included traditional ML algorithms, such as Random Forest, Gradient Boosting, Decision Tree, Support Vector Machine (SVM), and XGBoost, as well as contemporary DL architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The data preprocessing pipeline incorporated one-hot encoding for optimal feature representation. Empirical results demonstrated the superior performance of ensemble learning methods, with Gradient Boosting and XGBoost achieving exceptional classification accuracies of 97.34% and 97.02%, respectively. Among DL models, CNNs outperformed RNNs, achieving 94.51% accuracy compared to 93.89% for RNNs. The results underscore the exceptional performance of tree-based ensemble methods for splice junction classification, highlighting their superior discriminative power and effectiveness in genomic sequence analysis. © 2025 IEEE.
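
The one-hot encoding step mentioned in the abstract can be illustrated with a minimal sketch. The record does not give the authors' implementation, so everything here is an assumption made for illustration: the function name `one_hot_encode`, the `ACGT` channel order, and the choice to map any other character (such as the ambiguity code N) to an all-zero group.

```python
from typing import List

# Assumed channel order; the abstract does not say which ordering was used.
ALPHABET = "ACGT"


def one_hot_encode(sequence: str) -> List[int]:
    """Flatten a DNA string into a vector with four 0/1 entries per base.

    Characters outside A/C/G/T become an all-zero group. This is an
    illustrative choice, not the method reported in the paper.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded: List[int] = []
    for base in sequence.upper():
        group = [0] * len(ALPHABET)
        if base in index:
            group[index[base]] = 1
        encoded.extend(group)
    return encoded


# "G" -> 0010, "T" -> 0001, "N" -> 0000
print(one_hot_encode("GTN"))
```

For fixed-length sequence windows, vectors of this kind have a constant length (four times the window size), so they could be stacked into a feature matrix and passed to a tree-based ensemble classifier or reshaped for a CNN, which are the model families the abstract compares.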

Description
Click on the DOI link to access this article at the publisher's website (may not be free).
Publisher
Institute of Electrical and Electronics Engineers Inc.
Series
2025 International Conference on Metaverse and Current Trends in Computing, ICMCTC 2025
2025-04-10 through 2025-04-11
Hybrid, Subang Jaya
214064