Improving imbalanced machine learning with neighborhood-informed synthetic sample placement

Authors
Nasir, Murtaza
Dag, Ali
Simsek, Serhat
Ivanov, Anton
Oztekin, Asil
Issue Date
2022-10-02
Type
Article
Language
en_US
Keywords
Imbalanced data, Oversampling, Undersampling, Machine learning, Predictive analytics, Classification prediction performance, Algorithm training
Abstract

Machine learning is widely used in information systems design. Yet, training algorithms on imbalanced datasets may severely affect performance on unseen data. For example, in some cases in healthcare, fintech, or cybersecurity contexts, certain subclasses are difficult to learn because they are underrepresented in training data. Our study offers a flexible and efficient solution based on a new synthetic average neighborhood sampling algorithm (SANSA), which, in contrast to other solutions, introduces a novel "placement" parameter that can be tuned to adapt to each dataset's unique manifestation of the imbalance. This package can be downloaded for R. We tested SANSA against seven existing sampling methods used in conjunction with the four most frequently used machine learning models trained on 14 benchmark datasets. Our results provide suggestive evidence that SANSA offers a feasible solution to the imbalance problem for most datasets. Our findings provide practical recommendations for how SANSA can be effectively implemented while reducing the complexity level of an imbalanced learning pipeline.

Description
Click on the DOI to access this article (may not be free).
Citation
Murtaza Nasir, Ali Dag, Serhat Simsek, Anton Ivanov & Asil Oztekin (2022) Improving Imbalanced Machine Learning with Neighborhood-Informed Synthetic Sample Placement, Journal of Management Information Systems, 39:4, 1116-1145, DOI: 10.1080/07421222.2022.2127453
Publisher
Routledge
ISSN
0742-1222