Performance of LAD-LASSO and WLAD-LASSO on High Dimensional Regression in Handling Data Containing Outliers

Septa Dwi Cahya, Bagus Sartono, Indahwati Indahwati, Evita Purnaningrum

Abstract


In several research areas, it is common to have a dataset with more explanatory variables than observations, known as high-dimensional data. This condition can lead to multicollinearity problems. The least absolute shrinkage and selection operator (LASSO) addresses the problem by shrinking estimated coefficients toward zero, so that it performs variable selection and parameter estimation simultaneously. However, LASSO performs poorly when the data contain outliers in the response or explanatory variables. Robust methods based on the least-absolute-deviation approach, such as LAD-LASSO and WLAD-LASSO, have been proposed to address this problem. This research aims to evaluate the performance of the LAD-LASSO and WLAD-LASSO methods on high-dimensional and low-dimensional data containing outliers. To evaluate these methods, a simulation study was conducted using three scenarios: no outliers; outliers in the response variable (5%, 10%, 15%); and outliers in both the response and explanatory variables (5%, 10%, 15%). The Minimum Regularized Covariance Determinant (MRCD) estimator was used to calculate the weights for WLAD-LASSO. The best method from the simulation was then applied to sembung leaf extract data to identify antioxidant marker compounds. The simulation results show that LAD-LASSO tends to be very strict in selecting variables, while LASSO tends to be too loose. WLAD-LASSO falls between the two and performs best in correctly identifying the important variables. Moreover, the weights make WLAD-LASSO more robust than LAD-LASSO to outliers in the response and explanatory variables. Furthermore, the performance of these methods is lower on high-dimensional data than on low-dimensional data, and it also tends to decrease as the proportion of outliers increases.
WLAD-LASSO was then applied to actual data to identify antioxidant marker compounds in sembung leaf extract. The compounds/formulas obtained are Umbelliferone, 12-Hydroxyjasmonic Acid, C22H14N8O2, and Acetyleugenol (with a prediction error of 0.133050). These compounds/formulas can be developed as natural antioxidants and have the potential to be developed as medicinal ingredients.
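
To make the contrast between the three criteria concrete, the following minimal sketch evaluates a squared-error LASSO objective and a (weighted) absolute-error LASSO objective at a fixed coefficient vector. It is an illustration only and not the authors' implementation: the function names, the toy data, the single penalty parameter `lam`, the 0.5 scaling of the squared loss, and the hand-picked weight of 0.25 for the contaminated observation are all assumptions made here. In the study the weights come from the MRCD estimator, which is not reproduced in this sketch.

```python
from typing import List, Optional, Sequence


def residuals(X: Sequence[Sequence[float]], y: Sequence[float],
              beta: Sequence[float]) -> List[float]:
    """Return y_i - x_i . beta for every observation."""
    return [yi - sum(xij * bj for xij, bj in zip(xi, beta))
            for xi, yi in zip(X, y)]


def lasso_objective(X, y, beta, lam: float) -> float:
    """Squared-error loss plus an L1 penalty (one common scaling, assumed here)."""
    r = residuals(X, y, beta)
    return 0.5 * sum(ri ** 2 for ri in r) + lam * sum(abs(b) for b in beta)


def wlad_lasso_objective(X, y, beta, lam: float,
                         weights: Optional[Sequence[float]] = None) -> float:
    """Weighted absolute-error loss plus an L1 penalty.

    With weights=None every observation gets weight 1, which gives the
    unweighted LAD-LASSO criterion.
    """
    r = residuals(X, y, beta)
    w = [1.0] * len(r) if weights is None else list(weights)
    return (sum(wi * abs(ri) for wi, ri in zip(w, r))
            + lam * sum(abs(b) for b in beta))


# Toy data (hypothetical): y = 2 * x1 exactly, x2 is irrelevant.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
y_clean = [2.0, 4.0, 6.0, 8.0]
y_outlier = [2.0, 4.0, 6.0, 58.0]   # last response replaced by an outlier
beta_true = [2.0, 0.0]
lam = 0.5
down_weights = [1.0, 1.0, 1.0, 0.25]  # hand-picked, stands in for robust weights

print("LASSO      clean / outlier:",
      lasso_objective(X, y_clean, beta_true, lam),
      lasso_objective(X, y_outlier, beta_true, lam))
print("LAD-LASSO  clean / outlier:",
      wlad_lasso_objective(X, y_clean, beta_true, lam),
      wlad_lasso_objective(X, y_outlier, beta_true, lam))
print("WLAD-LASSO outlier, down-weighted:",
      wlad_lasso_objective(X, y_outlier, beta_true, lam, down_weights))
```

At the true coefficients, the single contaminated response raises the squared-error objective from 1.0 to 1251.0, the absolute-error objective from 1.0 to 51.0, and the down-weighted objective only to 13.5. This is the mechanism behind the robustness ordering described in the abstract, shown here on made-up numbers rather than on the simulation design of the paper.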

Keywords


High Dimensional Data; LAD-LASSO; Multicollinearity; Outliers; WLAD-LASSO.

DOI: https://doi.org/10.31764/jtam.v6i4.8968

Copyright (c) 2022 Septa Dwi Cahya, Bagus Sartono, Indahwati, Evita Purnaningrum

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

JTAM (Jurnal Teori dan Aplikasi Matematika) 
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License


JTAM (Jurnal Teori dan Aplikasi Matematika) Editorial Office: