Principal Component Regression Modelling with Variational Bayesian Approach to Overcome Multicollinearity at Various Levels of Missing Data Proportion

Nabila Azarin Balqis, Suci Astutik, Solimun Solimun


This study aims to model Principal Component Regression (PCR) using Variational Bayesian Principal Component Analysis (VBPCA), with Ordinary Least Squares (OLS) as the method of estimating the regression parameters, to overcome multicollinearity at various proportions of missing data. The data used are secondary data and simulation data contaminated with collinearity in the predictor variables, with missing data proportions of 1%, 5%, and 10%. The secondary data are the Human Depth Index in Java in 2021, a complete dataset without missing values. The results indicate that the multicollinearity in the secondary and original data can be optimally overcome, as shown by the smaller standard errors of the regression parameters for PCR using VBPCA and by a relative efficiency value of less than 1. VBPCA can handle missing data proportions of less than 10%. Missing data cause information from the original variables to decrease, as evidenced by the large MAPE value and the growing parameter estimation bias. The cross-validation value (Q^2) and the adjusted coefficient of determination (adjusted R^2) also get smaller as the proportion of missing data increases.
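The comparison of standard errors described above can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes complete simulated data and uses ordinary PCA (via SVD) in place of VBPCA. The number of retained components `k = 1`, the simulated design, and the definition of relative efficiency as the ratio of PCR to OLS coefficient variances are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors sharing one latent factor, so they are strongly collinear
# (hypothetical design, not the study's data).
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)


def ols(A, y):
    """OLS with intercept: returns coefficients, standard errors, covariance."""
    A1 = np.column_stack([np.ones(len(A)), A])
    beta, *_ = np.linalg.lstsq(A1, y, rcond=None)
    resid = y - A1 @ beta
    sigma2 = resid @ resid / (len(y) - A1.shape[1])
    cov = sigma2 * np.linalg.inv(A1.T @ A1)
    return beta, np.sqrt(np.diag(cov)), cov


# Standardize predictors, then fit OLS directly on them.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
beta_ols, se_ols, _ = ols(Xs, y)

# Ordinary PCA via SVD (stand-in for VBPCA, which also handles missing values).
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
k = 1                      # assumed number of retained components
V_k = Vt[:k].T             # loadings, shape (3, k)
scores = Xs @ V_k          # principal component scores

# Regress y on the scores, then map coefficients back to the predictors.
gamma, _, cov_gamma = ols(scores, y)
beta_pcr = V_k @ gamma[1:]
cov_pcr = V_k @ cov_gamma[1:, 1:] @ V_k.T
se_pcr = np.sqrt(np.diag(cov_pcr))

# Assumed definition: relative efficiency = Var(PCR) / Var(OLS) per coefficient.
rel_eff = (se_pcr / se_ols[1:]) ** 2

print("OLS standard errors:", se_ols[1:])
print("PCR standard errors:", se_pcr)
print("Relative efficiency:", rel_eff)
```

Under this definition, a relative efficiency below 1 means the PCR coefficient estimates have smaller variance than the OLS estimates on the collinear predictors.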



Keywords: Missing Data; Multicollinearity; Principal Component Analysis; Principal Component Regression; Variational Bayesian PCA.


Copyright (c) 2022 Nabila Azarin Balqis, Suci Astutik, Solimun

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.



JTAM (Jurnal Teori dan Aplikasi Matematika) 
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License



