Dimensionality reduction by machine learning for cost-effective data analysis

Authors
Asaduzzaman, Abu
Uddin, Md Raihan
Sibai, Fadi N.
Issue Date
2024-04-17
Type
Article
Keywords
Computing, Data analysis, Dimensionality reduction, Machine learning, Water quality prediction, Data feature pruning, Input dataset
Citation
Abu Asaduzzaman, Md R Uddin, Fadi N Sibai. Dimensionality Reduction by Machine Learning for Cost-Effective Data Analysis. TechRxiv. April 17, 2024. DOI: 10.36227/techrxiv.171332281.12206851/v1
Abstract

Processing large amounts of data with many input features is time-consuming and expensive. In machine learning (ML), the number of input features plays a crucial role in determining the performance of ML models. Studies show that ML has potential for dimensionality reduction. This work proposes a methodology that uses ML to reduce the number of input features and thereby facilitate cost-effective data analysis. Two different datasets for water quality prediction from Kaggle are used to run the ML models. First, we use Recursive Feature Elimination with Cross-Validation (RFECV), Permutation Importance (PI), and Random Forest (RF) models to find the impact of input features on predicting water quality. Second, we conduct experiments applying seven ML models, namely RF, Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), and Deep Neural Network (DNN), to explore water quality using the original and reduced datasets. Third, we evaluate the impact of the optimized data features on the computations and cost required to test water quality. Experimental results show that reducing the number of features from nine to five for Dataset 1 reduces computations by up to 59% and cost by up to 65%. Similarly, reducing the number of features from 20 to 16 for Dataset 2 reduces computations by up to 20% and cost by up to 14%. This study may help mitigate the curse of dimensionality by improving the performance of ML models through enhanced data generalization.
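
The permutation-importance step named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic (not the Kaggle water quality datasets), a simple nearest-centroid classifier stands in for the Random Forest, and every name here (`make_data`, `fit_centroids`, `permutation_importance`, `THRESHOLD`) is hypothetical and chosen only for illustration. The idea shown is that shuffling an informative feature lowers accuracy, while shuffling an uninformative one barely changes it, so low-importance features become candidates for pruning.

```python
import random
from statistics import mean


def make_data(n=400, seed=0):
    """Synthetic binary data (assumption): features 0 and 1 carry signal, 2 and 3 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 2.0 if label else -2.0
        X.append([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0),
                  rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    """Stand-in model (not a Random Forest): store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def accuracy(centroids, X, y):
    return mean(1.0 if predict(centroids, x) == t else 0.0 for x, t in zip(X, y))


def permutation_importance(centroids, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature column is shuffled, per feature."""
    rng = random.Random(seed)
    baseline = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(centroids, X_perm, y))
        importances.append(mean(drops))
    return importances


X, y = make_data()
X_train, y_train = X[:300], y[:300]   # simple hold-out split
X_test, y_test = X[300:], y[300:]

model = fit_centroids(X_train, y_train)
importances = permutation_importance(model, X_test, y_test)

THRESHOLD = 0.05  # arbitrary cut-off chosen for this sketch
selected = [j for j, imp in enumerate(importances) if imp > THRESHOLD]

print("importances:", [round(v, 3) for v in importances])
print("selected feature indices:", selected)
```

In practice one would more likely rely on a library such as scikit-learn, which provides `sklearn.inspection.permutation_importance` and `sklearn.feature_selection.RFECV`, and then retrain each model on the reduced feature set to compare accuracy and cost against the full set.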

Description
e-Prints posted on TechRxiv are preliminary reports that are not peer reviewed. They should not be regarded as conclusive, guide clinical practice/health-related behavior, or be reported in the media as established information.
Publisher
TechRxiv