A Comparative Study of PCA-Based Dimensionality Reduction and Best Subset Selection in Disease Classification

Andreas Rony Wijaya; Atika Ratna Dewi; Muhammad Bayu Nirwana; Respatiwulan Respatiwulan; Sri Sulistijowati Handajani

doi:10.31764/jtam.v10i3.38265

Abstract

Real-world datasets often contain many variables, some of which are irrelevant or redundant. Building an effective classification model therefore requires simplifying the data so that only the most influential features remain. Feature selection is a common way to do this, but when many variables are involved, discarding some of them can lose information. Dimensionality reduction addresses this issue by simplifying the model while retaining most of the information in the original variables. This study uses a comparative quantitative approach to evaluate principal component analysis (PCA), a dimensionality reduction method, and best subset selection, a feature selection method, for improving classification performance. The case study is a heart disease dataset from the UCI Machine Learning Repository with 303 observations and 13 predictor variables. Both methods are applied to reduce the number of predictors and make the model more interpretable. Three classification models (logistic regression, naïve Bayes, and linear discriminant analysis) are then trained and evaluated using accuracy, recall, precision, and F1-score, and the results are further illustrated with ROC curves. Best subset selection yields a combination of seven variables containing the most significant predictors, whereas PCA requires eight principal components to explain 80% of the total variation. The best classification performance is obtained with the feature-selected dataset, which achieves an accuracy of 87% and an AUC of 0.93 and outperforms the models built on both the original dataset and the PCA-reduced dataset. These results indicate that best subset selection provides a better balance between simplicity and classification performance. The models obtained after reduction, whether by best subset selection or by PCA, still maintain good predictive ability, as indicated by their relatively high AUC values.
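
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy eigenvalues, and the confusion-matrix counts are hypothetical and chosen only for illustration. The sketch shows how an 80% cumulative-variance rule picks the number of principal components, how many non-empty subsets an exhaustive best subset search over 13 predictors would consider, and how accuracy, precision, recall, and F1-score follow from a binary confusion matrix.

```python
from math import comb


def components_for_variance(eigenvalues, threshold=0.80):
    """Return the smallest number of leading components whose
    cumulative share of total variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from binary
    confusion-matrix counts (zero denominators map to 0.0)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical eigenvalues of a covariance matrix (NOT from the study).
toy_eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.8, 0.4, 0.3]
n_components = components_for_variance(toy_eigenvalues, threshold=0.80)

# Number of non-empty predictor subsets an exhaustive search over
# 13 predictors would have to compare.
n_candidate_subsets = sum(comb(13, k) for k in range(1, 14))

# Hypothetical confusion-matrix counts (NOT the study's results).
toy_metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)

print(n_components, n_candidate_subsets, toy_metrics)
```

With the toy eigenvalues, the first three components explain 75% of the variance and the first four explain 85%, so the rule keeps four. The subset count shows why exhaustive best subset selection is feasible for 13 predictors but grows quickly as predictors are added.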

Keywords

Classification; Dimensionality Reduction; Feature Selection; PCA; Subset Regression.


