Comparing Five Machine Learning-Based Regression Models for Predicting the Study Period of Mathematics Students at IPB University

ABSTRACT


A. INTRODUCTION
Machine learning is a branch of artificial intelligence that develops a computer algorithm to adapt and evolve based on empirical data (Jalal & Ezzedine, 2019). There are three types of machine learning based on human supervision in the learning process, i.e., supervised learning, unsupervised learning, and reinforcement learning (Dey, 2016). Supervised learning is a type of machine learning in which computer algorithms are trained on input data labelled for a specific output. Examples of supervised learning are regression and classification problems (Kotsiantis, 2007). Meanwhile, unsupervised learning does not require labels in the learning process. The unsupervised learning algorithm will find natural patterns from the data without human supervision. An example of this unsupervised learning is the clustering problem (Alzubi et al., 2018). On the other hand, reinforcement learning is a type of machine learning based on rewards and/or punishments for desired and/or unwanted behavior (Sutton, 1992).
Machine learning is one of the most popular and frequently used techniques today. According to data from scopus.com, articles about machine learning first appeared in the 1950s (Campaigne, 1959; Martens, 1959), and their number has grown rapidly from year to year. In 2021, research related to machine learning reached 80,722 articles, with the top four fields being computer science, engineering, medical science, and mathematics. Applications of machine learning include forecasting (Aggarwal & Toshniwal, 2021), prediction (Liu et al., 2021), anomaly detection (Zhou, 2021), and pattern recognition (Li, 2021).
One use of machine learning is to predict the length of study of undergraduate students. The accreditation of a study program is strongly influenced by the study period of its graduates, as set out in the Regulation of the National Accreditation Board for Higher Education (BAN-PT) Number 3 of 2019 concerning higher education accreditation instruments. Several studies have applied machine learning to predict a student's study period, using methods such as C4.5 and k-nearest neighbor (kNN) (Purwanto et al., 2019), decision trees and artificial neural networks (Rohmawan, 2018), fuzzy k-NN (Anugerah et al., 2017), perceptron (Masykuri et al., 2021), and many others. However, few of these studies apply machine learning-based regression methods, such as ridge regression (Marquardt, 1970), Huber regression (Huber, 1964), and quantile regression (Lejeune & Sarda, 1988).
Several factors affect the study period, such as the grade point average (GPA), the suitability of the final project topic with the area of interest, other main activities, and others. Among these, GPA is the initial information that supervisors have for characterizing their supervised students. A model that can predict a student's study period based on GPA is therefore needed for this characterization, so that supervisors can apply the right strategy for their students. One model that can be used is a machine learning-based regression model. This study aims to implement and select a machine learning-based regression model to predict a student's study period based on GPA. The models used are least-square regression, ridge regression, Huber regression, quantile regression, and quantile regression with $\ell_2$-regularization, as provided by Machine Learning in Julia (MLJ). The regression model is selected based on several statistics, i.e., the maximum error, the root mean squared error (RMSE), and the mean absolute proportional error (MAPE). The computational process was carried out using Julia version 1.6.5, which provides an environment for fast and easy implementation of various machine learning methods. The resulting model can be used by supervisors to predict the study period of their supervised students, so that they can characterize their students and design appropriate strategies.

B. METHODS
The data used in this study consists of the GPA from semesters 1 to 6 and the study period of students in the mathematics undergraduate program at IPB University who entered in 2013-2016. The normal length of study for students in the mathematics undergraduate program is 8 semesters. The data is divided into training data, for students who entered in 2013-2015, and testing data, for students who entered in 2016. In total, 203 records were obtained, with 178 records (87.68%) used as training data and the remaining 25 records as testing data.
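The cohort-based split described above can be sketched in Python as follows. This is only an illustration: the record layout and the field name `entry_year` are assumptions, and the records are placeholders rather than the study's data.

```python
def split_by_cohort(records, test_years=(2016,)):
    """Split records into (training, testing) lists by year of entry."""
    training = [r for r in records if r["entry_year"] not in test_years]
    testing = [r for r in records if r["entry_year"] in test_years]
    return training, testing


# Placeholder records matching the reported totals (178 + 25 = 203).
# How the 178 training records are spread over 2013-2015 is made up here.
records = (
    [{"entry_year": 2013}] * 60
    + [{"entry_year": 2014}] * 60
    + [{"entry_year": 2015}] * 58
    + [{"entry_year": 2016}] * 25
)
training, testing = split_by_cohort(records)
training_share = 100 * len(training) / len(records)  # percentage of training data
```

Splitting by cohort rather than at random means the testing data comes from a later intake than any training record, which is closer to how a supervisor would use the model on new students.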
This research begins with the collection and processing of the data described above. Based on the training data, the regression model coefficients are calculated using several approaches, i.e., least-square regression, regression with $\ell_2$-regularization (ridge regression), regression with Huber loss (Huber regression), quantile regression, and quantile regression with $\ell_2$-regularization. The models are then evaluated on the testing data based on the maximum error, RMSE, and MAPE, and the best model is selected according to these criteria. At the end of the research, some of the errors in the predicted study periods are analyzed. The research flow chart in Figure 1 summarizes these steps.

Multiple Linear Regression
This study uses six predictors, i.e., the GPA of semesters 1 to 6, and one response variable, i.e., the student's study period, so the regression model used is

$$\hat{y}(x) = \beta_0 + \sum_{i=1}^{6} \beta_i x_i, \tag{1}$$

where $y$ is the study period, $x_i$ is the student's GPA in the $i$-th semester, $\beta_i$ is the coefficient of $x_i$, $\beta_0$ is the intercept coefficient, and $\hat{y}(x)$ is the predicted value of $y$. This study uses five machine learning approaches to estimate the regression coefficients, as follows.

a. Least-Square Regression
Least-square regression is the most popular and commonly used approach. The "best" coefficient value in the least-square regression model is obtained by minimizing the average value of the squared loss, or mean squared error (MSE) (Heath, 2002; Johnson & Faunt, 1992), given by

$$\hat{\beta}_{ls} = \arg\min_{\beta} \frac{1}{n} \sum_{j=1}^{n} \left( \hat{y}(x^{(j)}) - y^{(j)} \right)^2, \tag{2}$$

where $n$ is the number of training data, $\hat{\beta}_{ls}$ is the approximate value of the coefficient $\beta$ based on least-square regression, and $y^{(j)}$ and $\hat{y}(x^{(j)})$ are the actual and predicted values of the $j$-th student's study period.

b. Ridge Regression
Ridge regression is a method for estimating the coefficients of a regression model in scenarios where the predictors are highly correlated (Jones, 1972). The "best" coefficient value for the ridge regression model is obtained by minimizing the mean squared error (MSE) plus an $\ell_2$-regularization term (Wang, 2019), given by

$$\hat{\beta}_{r} = \arg\min_{\beta} \frac{1}{n} \sum_{j=1}^{n} \left( \hat{y}(x^{(j)}) - y^{(j)} \right)^2 + \lambda \lVert \beta \rVert_2^2, \tag{3}$$

where $\hat{\beta}_{r}$ is the approximate value of the coefficient $\beta$ based on ridge regression, and $\lambda$ is the hyper-parameter or tuning parameter of the model. Hyper-parameters can be tuned during the training process using the cross-validation method.
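Both the least-square and the ridge objectives are quadratic in the coefficients, so they can be minimized by solving a small linear system (the normal equations). The sketch below shows this in plain Python with one predictor instead of six. It is not the MLJ implementation: the function names and the toy numbers are hypothetical, the penalty is assumed to apply to the intercept as well, and `lam` plays the role of the regularization weight up to a constant factor.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_ridge(features, targets, lam=0.0):
    """Return [b0, b1, ...] solving (X^T X + lam * I) b = X^T y.

    lam = 0 gives the least-square solution. As written, the penalty
    also shrinks the intercept b0 (an assumption of this sketch).
    """
    design = [[1.0] + list(row) for row in features]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data (made up): study period in months against a single GPA value.
toy_gpa = [[2.0], [3.0], [4.0]]
toy_months = [52.0, 48.0, 44.0]
beta_ls = fit_ridge(toy_gpa, toy_months)              # least squares
beta_ridge = fit_ridge(toy_gpa, toy_months, lam=1.0)  # ridge
```

On this toy data the least-square fit recovers the exact line, while the ridge fit returns coefficients with a smaller overall magnitude, which is the shrinkage effect the regularization term is meant to produce.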

c. Huber Regression
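Huber regression is a robust regression method, i.e., a type of regression model that is insensitive to data outliers (Wager et al., 2005). The "best" coefficient value for the Huber regression model is obtained by minimizing the average value of the Huber loss (Huber, 1964), given by

$$\hat{\beta}_{h} = \arg\min_{\beta} \frac{1}{n} \sum_{j=1}^{n} \ell_{\delta}\left( \hat{y}(x^{(j)}), y^{(j)} \right), \tag{4}$$

$$\ell_{\delta}\left( \hat{y}(x), y \right) = \begin{cases} \frac{1}{2} \left( \hat{y}(x) - y \right)^2, & \lvert \hat{y}(x) - y \rvert \le \delta, \\ \delta \left( \lvert \hat{y}(x) - y \rvert - \frac{1}{2}\delta \right), & \text{otherwise}, \end{cases} \tag{5}$$

where $\hat{\beta}_{h}$ is the approximate value of the coefficient $\beta$ based on Huber regression. The function $\ell_{\delta}(\hat{y}(x), y)$ is called the Huber loss or smooth absolute loss function, while $\delta$ is a hyper-parameter of this model.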
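To make the behaviour of the Huber loss concrete, the following Python sketch evaluates it for a small and a large residual. It assumes the standard textbook form of the loss (quadratic inside the bound, linear outside it); the function names are illustrative only.

```python
def huber_loss(predicted, actual, delta):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    residual = abs(predicted - actual)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


def mean_huber_loss(predictions, actuals, delta):
    """Average Huber loss over paired predictions and actual values."""
    losses = [huber_loss(p, a, delta) for p, a in zip(predictions, actuals)]
    return sum(losses) / len(losses)


small = huber_loss(50.0, 49.5, delta=1.0)  # residual 0.5, inside the bound
large = huber_loss(50.0, 47.0, delta=1.0)  # residual 3.0, outside the bound
squared_large = 0.5 * 3.0 ** 2             # what a pure squared loss would charge
```

The large residual is charged 2.5 instead of 4.5, so a single outlying student pulls the fitted coefficients less than under the squared loss.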

d. Quantile Regression
Quantile regression is an extension of least-square regression that is used when the least-square conditions or assumptions are not met (Koenker & Bassett, 1978). In contrast to least squares, which estimates the conditional mean, quantile regression estimates the conditional median or other quantile values of the response variable across values of the predictor variables (Davino et al., 2022). The "best" coefficient value for the quantile regression model is obtained by minimizing the average pinball loss $\rho_{\delta}(\cdot)$, also known as the linear loss, given by

$$\hat{\beta}_{q} = \arg\min_{\beta} \frac{1}{n} \sum_{j=1}^{n} \rho_{\delta}\left( \hat{y}(x^{(j)}) - y^{(j)} \right), \tag{6}$$

$$\rho_{\delta}(r) = r \left( \delta - \mathbb{1}\{ r < 0 \} \right), \tag{7}$$

where $\hat{\beta}_{q}$ is the approximate value of the coefficient $\beta$ based on quantile regression (Cahyani et al., 2016). The value of $\delta$ is the hyper-parameter of this model, referred to as the quantile. If $\delta$ is equal to 0.5, then quantile regression estimates the conditional median of $y$ across values of $x$.

e. Quantile Regression with $\ell_2$-Regularization
The last regression model used is the quantile regression model with $\ell_2$-regularization. The "best" coefficient value for this regression model is obtained by minimizing the linear loss plus an $\ell_2$-regularization term, given by

$$\hat{\beta}_{qr} = \arg\min_{\beta} \frac{1}{n} \sum_{j=1}^{n} \rho_{\delta}\left( \hat{y}(x^{(j)}) - y^{(j)} \right) + \lambda \lVert \beta \rVert_2^2, \tag{8}$$

where $\hat{\beta}_{qr}$ is the approximate value of the coefficient $\beta$ based on quantile regression with $\ell_2$-regularization.
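The link between the pinball loss and the median can be checked numerically. The Python sketch below assumes the usual check-function form of the loss and takes the residual as predicted minus actual, following the ordering in the text; sign conventions differ between references and software, so this is an illustration rather than the exact MLJ definition. All names and numbers are hypothetical.

```python
def pinball_loss(residual, delta):
    """Check function: residual * (delta - 1 if residual < 0 else delta)."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (delta - indicator)


def average_pinball(constant, values, delta):
    """Average loss when every value is predicted by the same constant."""
    return sum(pinball_loss(constant - v, delta) for v in values) / len(values)


# Made-up study periods in months, including one long outlier.
periods = [45.0, 47.0, 48.0, 50.0, 60.0]

# With delta = 0.5 the loss is half the absolute error, so the best
# constant prediction among the observed values is the median.
best_constant = min(periods, key=lambda c: average_pinball(c, periods, 0.5))

# With delta = 0.6 positive and negative residuals are weighted unequally.
weight_positive = pinball_loss(2.0, 0.6)
weight_negative = pinball_loss(-2.0, 0.6)
```

For `delta` other than 0.5 the two sides of the loss carry different weights, which is what moves the fitted line away from the median toward another quantile.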
k-Fold Cross-Validation
Four of the five models have hyper-parameters that must be tuned using cross-validation. Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample (Wainer & Cawley, 2017). One of the most popular procedures is $k$-fold cross-validation (Lyu et al., 2022). This procedure has a single parameter, $k$, which is the number of groups into which the data is divided. This study uses $k = 10$, so the procedure can be referred to as 10-fold cross-validation. An illustration of 10-fold cross-validation is shown in Figure 2 below.
Suppose a model has $p$ candidate hyper-parameter values to be tested, namely $\theta_1, \theta_2, \ldots, \theta_p$. The 10-fold cross-validation procedure divides the training data into 10 parts of approximately equal size (sub-data). For the first candidate, $\theta_1$, the regression coefficients are estimated ten times, each time holding out a different sub-data for evaluation. First, the coefficients are estimated using the 2nd to 10th sub-data, and the model is evaluated on the 1st sub-data. Second, the 2nd sub-data is used for evaluation, with the coefficients estimated using the other sub-data. The process is repeated 10 times so that each sub-data is used once as evaluation data. The evaluation result for $\theta_1$ is the average of the evaluations of the 10 meta-models. The process is then carried out for each of the other candidates, $\theta_2, \theta_3, \ldots, \theta_p$, and the hyper-parameter is selected based on the best evaluation value. In this study, the evaluation at the cross-validation stage is based on MAPE.
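The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the MLJ tuning machinery used in the study: the function names are hypothetical, `fit` and `score` are placeholders supplied by the caller, and the data is assumed to be already shuffled (or deliberately kept in order).

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, valid) index lists; fold sizes differ by at most one."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop


def cross_validate(candidates, fit, score, xs, ys, k=10):
    """Return (best_candidate, mean_scores); the lowest mean score wins."""
    mean_scores = {}
    for value in candidates:
        fold_scores = []
        for train, valid in k_fold_indices(len(xs), k):
            model = fit([xs[i] for i in train], [ys[i] for i in train], value)
            fold_scores.append(
                score(model, [xs[i] for i in valid], [ys[i] for i in valid])
            )
        mean_scores[value] = sum(fold_scores) / len(fold_scores)
    best = min(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

With the 178 training records of this study, `k_fold_indices(178, 10)` would give eight folds of 18 records and two folds of 17, since 178 is not divisible by 10.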

Evaluation of the Model
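After the hyper-parameter values are tuned and the regression model coefficients are estimated, the next step is to evaluate the models on the testing data. The testing data is not used at all during the learning process. The measures used for evaluation are the maximum error ($E_{\max}$), RMSE, and MAPE, given by

$$E_{\max} = \max_{1 \le j \le m} \left\lvert \hat{y}(x^{(j)}) - y^{(j)} \right\rvert, \tag{9}$$

$$\mathrm{RMSE} = \sqrt{ \frac{1}{m} \sum_{j=1}^{m} \left( \hat{y}(x^{(j)}) - y^{(j)} \right)^2 }, \tag{10}$$

$$\mathrm{MAPE} = \frac{1}{m} \sum_{j=1}^{m} \left\lvert \frac{ \hat{y}(x^{(j)}) - y^{(j)} }{ y^{(j)} } \right\rvert \times 100\%, \tag{11}$$

where $m$ is the number of testing data, and $y^{(j)}$ and $\hat{y}(x^{(j)})$ are the actual and predicted values of the study period based on the regression model with coefficient $\hat{\beta}$.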
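The three evaluation measures are straightforward to compute. The Python sketch below uses hypothetical function names and made-up numbers, and assumes that MAPE is reported as a percentage, as the results section does.

```python
import math


def max_error(actual, predicted):
    """Largest absolute difference between predicted and actual values."""
    return max(abs(p - a) for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute proportional error, in percent."""
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up study periods in months for three students.
actual_months = [48.0, 50.0, 52.0]
predicted_months = [47.0, 50.0, 56.0]
scores = (
    max_error(actual_months, predicted_months),
    rmse(actual_months, predicted_months),
    mape(actual_months, predicted_months),
)
```

The maximum error and RMSE are in months, while MAPE is relative to the actual study period, which is why the three measures can rank models differently.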

Julia Programming Language
All computations in this study were carried out using Julia version 1.6.5. Julia is a new programming language whose primary target is technical computing (Bezanson et al., 2017). Julia is claimed to combine the speed of C, the dynamism of Ruby, the feel of Lisp, mathematical notation as familiar as MATLAB's, and the ease of use of Python and R (Joshi & Lakhanpal, 2017). Being simple, fast, and open source, Julia has quickly become a competitive language in data science and scientific computing (Ardhana et al., 2022; Gao et al., 2020). Julia provides many packages that support its users' computations, and users can also contribute packages of their own and share them with others. In 2022, the Julia community had registered over 7,400 packages for community use.
This study uses the MLJ (Machine Learning in Julia) package version 0.16.1 (Blaom et al., 2020). MLJ is a toolbox that provides interfaces and meta-algorithms for selecting, tuning, evaluating, compiling, and comparing more than 180 machine learning models written in Julia and several other languages. MLJ integrates with various other machine learning packages, such as scikit-learn, Flux, GLM, and many more, making model selection easier.

C. RESULT AND DISCUSSION
This section explains the results of hyper-parameter tuning and coefficient estimation of five machine learning-based regression models. After that, the results of the evaluation and comparison of each regression model are explained using the maximum error, RMSE, and MAPE.

Data Summary
Before we discuss the training and testing of the models, this section shows the characteristics of the data used. A brief summary of the data can be seen in Figure 3 below. As previously mentioned, the predictors are the GPAs from semesters 1 to 6, and the response variable is the study period. For students who entered in 2015 and 2016, only the data of those who had graduated by 2020 is used. The 2021 graduate data is not used because student graduation at that time was influenced by the Covid-19 outbreak.

Estimation of the Regression Coefficients
Using the training data, the regression coefficients in Eq. (1) are estimated using five machine learning approaches. The results of each approach are as follows.
a. Least-Square Regression Model
The coefficients of the least-square regression model can be found easily using a matrix formulation. Suppose $X$ is a matrix containing the set of predictors and $\beta$ is a vector containing the coefficient of each corresponding predictor, i.e.,

$$X = \begin{bmatrix} \mathbf{1} & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 & \beta_1 & \beta_2 & \beta_3 & \beta_4 & \beta_5 & \beta_6 \end{bmatrix}^{T}, \tag{12}$$

where $\mathbf{1}$ is a column vector of ones and $x_i$ is a column vector containing the student GPA in the $i$-th semester on the training data. If $y$ is a column vector containing the response variable, i.e., the study period on the training data, then the coefficient value $\hat{\beta}_{ls}$ (the estimator for $\beta$) that satisfies Eq. (2) is given by

$$\hat{\beta}_{ls} = \left( X^{T} X \right)^{-1} \left( X^{T} y \right). \tag{13}$$

Based on the training data, the least-square regression model is obtained and is given by Eq. (14). Because least squares makes estimating the coefficients of the regression model so convenient, many packages in MLJ provide this regression model, such as ScikitLearn, GLM, MLJLinearModels, and MultivariateStats.

b. Ridge Regression Model
Similar to least squares, ridge regression can also be solved using a matrix formulation. However, ridge regression requires the hyper-parameter value $\lambda$ to be tuned. For any hyper-parameter value $\lambda$, the coefficient value $\hat{\beta}_{r}$ (the estimator for $\beta$) that satisfies Equation (3) is given by

$$\hat{\beta}_{r} = \left( X^{T} X + \lambda I \right)^{-1} \left( X^{T} y \right), \tag{15}$$

where $I$ is a $7 \times 7$ identity matrix, $\lambda > 0$ is small, and the constant factor $n$ from the averaged loss in Equation (3) is absorbed into $\lambda$. Using 10-fold cross-validation, the hyper-parameter value $\lambda$ is selected in the interval $[0, 0.2]$ with 51 points being tested, i.e., $\lambda_k = 0.004k$ for $k = 0, 1, \ldots, 50$. The cross-validation results for the ridge regression are shown in Figure 3.A. Based on the MAPE value, $\lambda = 0.048$ gives the best accuracy. Thus, this value is used as the hyper-parameter for the ridge regression. Based on the training data and the hyper-parameter value $\lambda = 0.048$, the ridge regression model is given by

$$\hat{y}_{r}(x) = 63.85 + 2.32 x_1 - 2.97 x_2 + 4.69 x_3 + 17.14 x_4 - 15.77 x_5 - 9.45 x_6. \tag{16}$$

Some packages in MLJ that provide ridge regression models are ScikitLearn, MLJLinearModels, and MultivariateStats.

c. Huber Regression Model
Huber regression, also called robust regression with Huber loss, is a regression type that is not sensitive to data outliers. Although it is not sensitive to outliers, Huber regression does not ignore their effect; it only assigns a lower weight to them. Based on Equation (5), Huber regression optimizes the squared loss when the absolute value of the residual between the actual and predicted values is less than a bound $\delta$, which is the hyper-parameter. Meanwhile, the absolute loss is optimized if the residual is greater than the hyper-parameter value. In contrast to least squares and ridge regression, the Huber regression coefficients cannot be estimated using a matrix formulation. One method used for Huber regression and other robust regressions is M-estimation (Huber, 1964). The letter M in M-estimation stands for "maximum likelihood type". Using MLJ packages, Huber regression coefficients can be estimated easily and quickly. Using 10-fold cross-validation, the hyper-parameter value $\delta$ is selected in the interval $[0, 1]$ with 51 points being tested, i.e., $\delta_k = 0.02k$ for $k = 0, 1, \ldots, 50$. The cross-validation results for the Huber regression are shown in Figure 3.B. Based on the MAPE value, $\delta = 0.84$ gives the best accuracy. Thus, this value is used as the hyper-parameter for the Huber regression. Based on the training data and the hyper-parameter value $\delta = 0.84$, the Huber regression model is given by

$$\hat{y}_{h}(x) = 66.55 + 1.24 x_1 - 4.36 x_2 + 8.09 x_3 - 1.75 x_4 - 2.83 x_5 - 5.64 x_6. \tag{17}$$

Some packages in MLJ that provide Huber regression are ScikitLearn and MLJLinearModels.
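The idea that Huber M-estimation "assigns a lower weight to the outlier" can be illustrated with iteratively reweighted averaging. The Python sketch below is a deliberate simplification: it estimates a single location value (an intercept-only model) rather than the six-predictor regression, it is not the solver used by MLJ, and all names and numbers are hypothetical.

```python
def huber_weight(residual, delta):
    """Weight 1 inside the bound, delta / |residual| outside it."""
    size = abs(residual)
    return 1.0 if size <= delta else delta / size


def huber_location(values, delta, tol=1e-10, max_iter=1000):
    """Estimate one typical value by iteratively reweighted averaging."""
    estimate = sum(values) / len(values)  # start from the ordinary mean
    for _ in range(max_iter):
        weights = [huber_weight(v - estimate, delta) for v in values]
        updated = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(updated - estimate) < tol:
            return updated
        estimate = updated
    return estimate


# Made-up values with one extreme outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
ordinary_mean = sum(values) / len(values)
robust_estimate = huber_location(values, delta=1.0)
```

The ordinary mean is dragged far toward the outlier, whereas the reweighted estimate stays close to the bulk of the data because the outlier's weight shrinks as its residual grows.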

d. Quantile Regression Model
Quantile regression is usually used when the conditions or assumptions of least-square regression are not met. Similar to Huber regression, quantile regression is not sensitive to data outliers. Quantile regression estimates the conditional median or another quantile value. This quantile value is the hyper-parameter of the quantile regression model and is denoted by $\delta$.
Using 10-fold cross-validation, the hyper-parameter value $\delta$ is selected in the interval $[0, 1]$ with 51 points being tested, i.e., $\delta_k = 0.02k$ for $k = 0, 1, \ldots, 50$. The cross-validation results for the quantile regression are shown in Figure 3.C. Based on the MAPE value, $\delta = 0.6$ gives the best accuracy. Thus, this value is used as the hyper-parameter for the quantile regression. Based on the training data and the hyper-parameter value $\delta = 0.6$, the quantile regression model is given by

$$\hat{y}_{q}(x) = 63.19 + 0.83 x_1 - 2.85 x_2 + 7.18 x_3 - 3.31 x_4 - 0.44 x_5 - 5.85 x_6. \tag{18}$$

The package in MLJ that provides quantile regression is MLJLinearModels.

e. Quantile Regression with $\ell_2$-Regularization Model
The last regression model used is quantile regression with $\ell_2$-regularization. The basis of this regression model is the same as that of quantile regression. The difference is that an $\ell_2$-norm regularization term is added to the loss function of quantile regression. Thus, this model has two hyper-parameters, i.e., $\delta$ as the quantile value and $\lambda$ as the regularization weight. The value of $\delta$ is taken from the quantile regression model, i.e., $\delta = 0.6$, so that only $\lambda$ is tuned using cross-validation.
Using 10-fold cross-validation, the hyper-parameter value $\lambda$ is selected in the interval $[0.5, 1]$ with 51 points being tested, i.e., $\lambda_k = 0.5 + 0.01k$ for $k = 0, 1, \ldots, 50$. The cross-validation results for this regression model are shown in Figure 3.D. Based on the MAPE value, $\lambda = 0.9$ gives the best accuracy. Thus, this value is used as the hyper-parameter for the quantile regression with $\ell_2$-regularization. Based on the training data and the hyper-parameter values $\delta = 0.6$ and $\lambda = 0.9$, the quantile regression model with $\ell_2$-regularization is obtained and is given by Eq. (19). The results of the tuning process for the hyper-parameter values of the ridge regression, Huber regression, quantile regression, and quantile regression with $\ell_2$-regularization models are shown in Figure 3.
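Once the coefficients are known, a prediction is a single weighted sum. The Python sketch below applies the fitted ridge, Huber, and quantile models of Eqs. (16)-(18) to one student; the function name and the GPA values are hypothetical and chosen only for illustration.

```python
# Coefficients [b0, b1, ..., b6] copied from Eqs. (16)-(18) in the text.
FITTED_MODELS = {
    "ridge": [63.85, 2.32, -2.97, 4.69, 17.14, -15.77, -9.45],
    "huber": [66.55, 1.24, -4.36, 8.09, -1.75, -2.83, -5.64],
    "quantile": [63.19, 0.83, -2.85, 7.18, -3.31, -0.44, -5.85],
}


def predict_study_period(coefficients, semester_gpas):
    """Predicted study period in months for GPAs of semesters 1 to 6."""
    if len(semester_gpas) != len(coefficients) - 1:
        raise ValueError("expected one GPA per non-intercept coefficient")
    intercept, *slopes = coefficients
    return intercept + sum(b * g for b, g in zip(slopes, semester_gpas))


# Hypothetical student with a GPA of 3.5 in every semester.
hypothetical_gpas = [3.5] * 6
predictions = {
    name: predict_study_period(coefficients, hypothetical_gpas)
    for name, coefficients in FITTED_MODELS.items()
}
```

Comparing the three predictions for the same student gives a feel for how much the choice of loss function changes the estimated study period.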

Evaluation of the Regression Model
After the five models are obtained, the next step is to evaluate them on the testing data. Three criteria are used in this evaluation step, i.e., the maximum error ($E_{\max}$), the root mean squared error (RMSE), and the mean absolute proportional error (MAPE). The evaluation results of the five models are shown in Table 1, where a value in bold indicates the best result among the models. Based on the evaluation results, the least-square regression model produces the worst accuracy of the five. From Table 1, the maximum error, RMSE, and MAPE of the least-square regression model are 9.2712, 3.9055, and 7.17%, respectively, all higher than those of the other four models. The maximum prediction error of this model is more than 9 months, meaning that a student's actual study period can be more than 9 months shorter or longer than the predicted value. Although the ridge regression model reduces the error of the least-square regression model, the improvement is small: based on Table 1, its $E_{\max}$, RMSE, and MAPE are not much different from those of the least-square regression model. Meanwhile, Huber regression substantially reduces the error of the least-square regression model. In this model, the maximum error obtained is 5.4 months, much better than that of the least-square regression model, and the RMSE and MAPE values also improve considerably. However, the best evaluation results are given by the quantile regression models. Based on RMSE and MAPE, the quantile regression model without regularization gives the best evaluation results, while the quantile regression model with $\ell_2$-regularization is better on the maximum error criterion. Thus, the best model for predicting the study period based on GPA is the quantile regression model, as shown in Table 2.
Table 2 shows the prediction results of the study period using quantile regression with $\ell_2$-regularization on the testing data for which the error is more than three months. Three of these predictions are overestimated and three are underestimated. The maximum error occurs when the prediction is less than the actual value (underestimated). In this case, the student has a consistently high GPA of around 3.94, so the predicted study period is very short, i.e., 46.10 months. In reality, however, the student took 51 months to complete the study. Meanwhile, the prediction is overestimated in the second case. Because the GPA is relatively modest (less than 3), the regression model predicts that the student will graduate within 51.48 months. In reality, however, the student graduated on time, i.e., within 47 months.
This study provides an alternative prediction model for student graduation based on GPA. While many studies provide a predictive model in the form of a classification of whether students graduate on time or not (Purwanto et al., 2019; Risnawati, 2018; Thaniket et al., 2020), the model in this study takes the form of a regression model that estimates the number of months required for students to complete their undergraduate studies. However, the results show that GPA is not the only factor that affects the study period. Other factors, such as the suitability of the field of interest, the ease of finding references, the regularity of the guidance process, and the presence or absence of other main activities, can affect the student's study period (Masykuri et al., 2021). This provides an opportunity for further research to construct a regression-based predictive model for the study period of undergraduate students with better accuracy.
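As a quick check on the two testing cases discussed above (predicted 46.10 against an actual 51 months, and predicted 51.48 against an actual 47 months), the absolute and proportional errors can be worked out directly from the reported values. The rounding to two decimals is ours.

```latex
\begin{align*}
\lvert 46.10 - 51 \rvert &= 4.90 \text{ months}, &
\frac{4.90}{51} \times 100\% &\approx 9.61\%, \\
\lvert 51.48 - 47 \rvert &= 4.48 \text{ months}, &
\frac{4.48}{47} \times 100\% &\approx 9.53\%.
\end{align*}
```

Both cases miss by a little under five months, one in each direction, which is consistent with the remark that GPA alone does not fully determine the study period.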
The model in this study can be used to characterize newly supervised students, so that the supervisor can determine the right strategy for the supervision process. By knowing the estimated study period, the supervisor can better determine an appropriate topic for the student. For students who are estimated to have a short study period, the final project topic can be broader and more challenging, and the supervision process can run normally without special treatment. Meanwhile, for students who are estimated to have a long study period, the final project topic must be adjusted to the student's ability without compromising on quality. In addition, the mentoring process can be carried out more rigorously. In this way, it is hoped that students can complete their studies quickly and with a final project of good quality.

D. CONCLUSION AND SUGGESTIONS
This study models the length of the student's study period based on GPA using machine learning-based regression models. Among the regression models considered, the least-square regression model gives the worst evaluation results, although its calculation method is easy. Meanwhile, the quantile regression models give the best results. Based on RMSE and MAPE, the quantile regression model without regularization gives the best evaluation result, while the quantile regression model with $\ell_2$-regularization has a better evaluation result on the maximum error criterion.
This study provides an alternative prediction model for student graduation based on GPA in the form of a regression model that estimates the number of months it takes students to complete their undergraduate studies, whereas other studies provide models that predict whether students graduate on time or not. This study can be developed further by adding other supporting predictor variables that can be obtained at the beginning of the supervision process, such as the suitability of field interests between students and supervisors, and activities other than completing the final project. Comparison with more modern machine learning methods can also be carried out to obtain better results.