MICE Implementation to Handle Missing Values in Rain Potential Prediction Using Support Vector Machine Algorithm

Aina Latifa Riyana Putri; Bayu Surarso; Titi Udjiani SRRM

doi:10.31764/jtam.v7i4.16699

Abstract

Support Vector Machine (SVM) is a machine learning algorithm used for classification. Its advantages include the ability to handle high-dimensional data, effective handling of nonlinear data through kernel functions, and resistance to overfitting through soft margins. However, SVM has weaknesses, especially when the data contain missing values, so any use of SVM must take the chosen missing-value strategy into account. Missing values are a serious problem in data mining because they cause loss of efficiency, complications in data handling and analysis, and bias arising from differences between missing and complete data. To address these problems, this research focuses on understanding the characteristics of missing values and handling them with the Multiple Imputation by Chained Equations (MICE) technique. The study uses secondary data containing missing values from the Meteorological, Climatological, and Geophysical Agency (BMKG) on rain potential prediction, specifically for DKI Jakarta. The work identifies the types or patterns of missing values, explores the relationship between missing values and other variables, applies MICE to handle the missing values, and uses the Support Vector Machine algorithm for classification, with the aim of producing a more reliable and accurate prediction model for rain potential. The results show that imputation with MICE performs better than the other techniques considered (Complete Case Analysis; mean, median, and mode imputation; and K-Nearest Neighbor), reaching 89% accuracy on the testing data when the Support Vector Machine algorithm is used for classification.
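The workflow summarized in the abstract (impute the missing values, then classify with SVM) can be illustrated with a minimal Python sketch. This is not the authors' code and does not use the BMKG data: the dataset is synthetic, the 15% missing-completely-at-random mask is an assumption, and scikit-learn's IterativeImputer is used as a chained-equations imputer that is inspired by MICE but returns a single imputation by default. The accuracies it prints are specific to this toy data and say nothing about the 89% reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
# This import is required to expose the (still experimental) IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the weather features (NOT the BMKG data).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, random_state=0
)

# Assumption: about 15% of the entries are missing completely at random.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.25, random_state=0, stratify=y
)


def evaluate(imputer):
    """Fit imputer -> scaler -> RBF SVM on the training split; return test accuracy."""
    model = make_pipeline(imputer, StandardScaler(), SVC(kernel="rbf", C=1.0))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


scores = {
    "mean": evaluate(SimpleImputer(strategy="mean")),
    "chained equations": evaluate(IterativeImputer(max_iter=10, random_state=0)),
}
for name, acc in scores.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```

Placing the imputer inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the imputed values. A closer analogue of multiple imputation would run IterativeImputer several times with sample_posterior=True and different seeds, then combine the resulting predictions.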

Keywords

Adversity quotient; Reflective thinking; Problem-solving; Pythagorean.
