Parametric Survival Model on IPB University’s Graduation Data

ABSTRACT


A. INTRODUCTION
Law of the Republic of Indonesia, Number 12 of 2012 concerning Higher Education article, states that a tertiary institution is an academic unit that organizes higher education.The goal of higher education is to produce graduates who are proficient in the fields of science and technology to serve the national interest and enhance the country's competitiveness (Pemerintah Indonesia, 2012).Graduation is one of the assessment criteria in the higher education accreditation process (BAN-PT, 2019).In the undergraduate program, students are expected to graduate within four years (Mendikbud, 2020).However, some students do not graduate on time, so it is necessary to know the distribution of student graduation and the factors that influence it.Therefore, there needs to be a method to explain the time of student graduation.The method can use to explain the time of student graduation is survival analysis (Acton, 2015).
Survival analysis was first developed by the English astronomer Edmund Halley (1656-1742) by describing the longevity of Breslau residents in the 17th century (Kovacheva, 2017).This analysis is essential in demographic, actuarial, and other sciences (Allison, 2014).For example, demographers use this analysis to analyze mortality rates Zoeller et al. (2021), actuary use this analysis to calculate insurance coverage premiums Staudt & Wagner (2022), and medical professionals use this analysis to estimate the length of life of a patient (Lu et al., 2021).Theories that have discussed resilience analysis include the Kaplan-Meier and Nelson-Altschuler estimation models.Kaplan-Meier and Nelson-Altschuler estimation models are included in the non-parametric study (Moore, 2016).The non-parametric research is relatively easy to understand and apply, but it is less efficient than the parametric analysis when the survival time follows a specific distribution (Lee & Wang, 2013).
Some examples of research that discusses the survival analysis in education is survival models for predicting student dropout at university across time Masci et al. (2022) and survival analysis and its application in education (Syarifuddin, 2012).The model used in this study is still classified as a non-parametric model.Therefore, researchers used a parametric model approach to explain students' study length at the IPB University.The parametric models assume survival times follow a known distribution.Methods of parametric model are such as Tobit, Buckley-James, Penalized regression and Accelerated Failure Time (Atlam et al., 2021).
Parametric regression models focus on the relationship between hazard rate or survival random variables and explanatory variables.There are two common methods for parametric regression model of survival data: proportional hazards (PH) model and accelerated failure time (AFT) model (Wang et al., 2018).The purpose of this research is to determine the best parametric model between PH and AFT model for graduation data IPB University students, as well as to analyze the factors that determine the graduation of IPB University students.

B. METHODS
The following Research Flowchart, as shown in Figure 1.

Descriptive Analysis
This stage is carried out descriptive statistics of data to obtain characteristics from the data used.Descriptive statistics consist of numerical variables and categorical variables.

Distribution Determination
This stage is to determine the distribution that matches and corresponds to the data using criteria smallest sum squared error (SSE) and Bayesian Information Criterion (BIC).SSE and BIC are among the standard criteria for measuring the model and selecting a good model from a set of potential models.The lower value of SSE and BIC indicates the better the model goodness-of-fit (Pham, 2019).The distribution was limited to five distribution which were Burr, Generalized Extreme Value (GEV), Burr XII, log-logistic, and log-normal distribution.The Burr distribution function is defined as follows: where  > 0 and  > 0 are distribution parameters (Almuhayfith et al., 2022).
The Generalized Extreme Value (GEV) cumulative distribution function is defined as follows: and  is a shape parameter with −∞ <  < ∞ (Provost et al., 2018).
The log-logistic survival function is defined as follows: where  > 0 and  > 0 are distribution parameters (Zheng et al., 2021).The log-normal probability density function is defined as follows: where  and  are distribution parameters (Awodutire et al., 2022).

Survival Regression Model
This stage is to perform a regression analysis of the distribution that matches and corresponds to the data that used.Parametric regression model of survival data is proportional hazards (PH) model and accelerated failure time (AFT) model.Proportional hazards (PH) models can be formulated with or without the assumption of a survival probability distribution (Khan & Khosa, 2015).The hazard function is defined as follows: ℎ(, ) = ℎ 0 () exp() (1) where ℎ 0 () is baseline hazard function and  =  0 +  1  1 + ⋯ +     .
The accelerated failure time (AFT) model describes the relationship between the response variable and the survival time.In this model, the survival time logarithms are considered a response variable and include an error term assumed to follow a specific probability distribution (Saikia & Barman, 2017).AFT model can be expressed as: (|) =  0 (), (2) where  =  0 +  1  1 + ⋯ +     and  0 () is the baseline survival function.

Selection of the best model
This stage is to select and determine the best model between AFT model and PH models by looking at the smallest AIC values.The lower value of AIC indicates the better the model goodness-of-fit (Pham, 2019).

Interpretation of result
This stage is Interpret the selected model then analyze the factors which affects graduation.Then compare the results with empirical models.The empirical model used is the Kaplan-Meier model.The Kaplan-Meier (KM) model is a non-parametric analysis to estimate the survival function from a set of survival data.The Kaplan-Meier method estimates conditional survival probabilities computed at specific times determined by events (D' Arrigo et al., 2021).The following equation gives the estimation of the survival function of the model: where   and   respectively represent the number of subjects at risk at time   and the number of individuals who died at time   .

Conclusion
This stage is to get conclusions and suggestions for the next research.

C. RESULT AND DISCUSSION 1. Data
The data used is the graduation time data for undergraduate students at IPB University from 2011 to 2016.The data was obtained from the Directorate of Education Administration and New Student Admissions (Dit AP PMB) IPB University.The information comprised the length of study at IPB University, gender, origin, faculty, grade point average (GPA), school status, and school origin.Analysis using r and python software.

Descriptive Analysis
This section presents the results of descriptive statistics from student graduation survival data.There are 6890 student graduation data used in this study.The descriptive analysis of the categorical data used is presented in Table 1.Table 1 explains that the most gender is female and the most faculties are G faculties.The most origin are from Java, the most school status comes from state schools, and the most school origin are from SMA.The characteristics of the sample used represent the student population at the IPB University, as shown in Table 2. Table 2 shows the average graduation time is 4,361 years and the average student GPA is 3,244 units.It represents the student population at the IPB University.

Distribution Determination
The next step is determining the best distribution of student graduation survival data.Graduation time can be made a pass frequency histogram, as shown in Figure 2. Figure 2 explains that the highest number of graduates occurs in 4 years because most undergraduate students graduate then.In addition, some are less than four years later, and some are more than four years.Then Figure 1 shows that the distribution is in the form of right skewness so that several distributions have these characteristics, such as the Burr, Burr XII, GEV, loglogistik, and lognormal distributions.Furthermore, the estimation of survival data using this distribution is presented in Figure 3.  3. Table 3 explains that the Burr XII distribution produces the smallest AIC and BIC values.Smallest SSE and BIC criteria are used to obtain the most appropriate distribution, so the selected distribution is the Burr XII distribution.Then it will be analyzed using the Burr XII regression model.

Survival Burr XII Regression Model
Let X be a random variable that has a Burr XII (, , ) distribution with the following survival function: where  > 0,  > 0 shape parameter and  > 0 is scale parameter (Thupeng, 2016).a. Accelerated Failure Time Burr XII Model Suppose given survival time T and an explanatory variable vector X with parameters according to the Accelerated Failure Time (AFT) model can be written as:  =  0 +  ′  +  (4) then  is expressed as () so that equation (4) becomes: () =  0 +  ′  +  (5) based on equation ( 5 ℎ | (|) = ℎ 0 () exp( 0 +  1  1 + ⋯ +     ) = ℎ 0 () exp( 0 +  ′ ) The difference between models is that the baseline hazard function ℎ 0 () of the PH model is assumed to follow a specific distribution (Muse et al., 2022).In this case, the distribution used is the Burr XII distribution so that the baseline hazard function follows the Burr XII distribution.Therefore, the model can be written as follows: where  > 0,  > 0,  > 0, dan  > 0. From equation ( 7), the survival function of the Burr XII distribution can be written as follows:

Selection of the best model
Comparison of AFT model and PH distribution of Burr XII is presented in Table 4.

Interpretation of result
Estimation with the PH Burr XII model to analyze the factors influencing student graduation.Based on Table 4, the partial likelihood test value is 1547,911 with a p-value less than 0,05, so the model is feasible to use.Furthermore, to find out the influencing factors are presented in Table 5. *Influence on a significant level of 0,05 and 2 table with a degree of freedom 1 is 3,841 Table 5 explains that the explanatory variables that have a significant effect on the length of graduation of IPB University students are gender ( 1 ), faculty ( 2 ), GPA ( 3 ), origin ( 4 ), and school status ( 5 ).Based on equation ( 8), the survival function of the PH Burr XII model when graduating students from the IPB University can be written as follows: ) −0,16119 (0,35901 exp(0,09706 1 +⋯−0,04613 5 )) .
For interpretation, because the coefficient of the explanatory variable is inversely proportional to the survival function, if the coefficient is positive, students will pass more quickly.For example, gender ( 1 ) has a coefficient of 0,09706, meaning that every female student will graduate faster than male students if other explanatory variables are considered the same, as shown in Figure 4.

D. CONCLUSION AND SUGGESTIONS
The AFT model and the PH model can be used to determine the factors that influence the graduation time of IPB University students.This study uses the parametric AFT and PH models with the distribution that best fits students' graduation time, the Burr XII distribution.The best model based on AIC criteria is the Burr XII distribution PH model, which produces an AIC value of 5809,303.Analysis of the factors that influence the time of graduation shows that the factors that have a significant effect are gender, faculty, GPA, origin, and school status.Future studies are suggested to use more explanatory variables such as scholarships, pathways of entry, and parents' income.

Figure
Figure 1.Research Flowchart

Figure 2 .
Figure 2. Distribution of data on student graduation time

Figure 3 .
Figure 3. Graph the results of the estimation of several distributions ), then obtained  = exp( 0 +  ′  + ) = exp( 0 ) exp( ′ ) + exp() the AFT model of the Burr XII distribution has the survival function as follows: Hazard Burr XII model The proportional hazard (PH) model is a parametric model of the Cox proportional hazards model.Therefore, the model has the same form as the Cox proportional hazards model.The hazard function at time T, which is affected by the explanatory variable , is given as follows:

Figure 4 .
Figure 4. Plot of the estimated survival function of the PH model and the KM model

Figure 4
Figure 4 describes the pattern of the survival function of the KM model and the PH Burr XII model for almost the same.The Kaplan-Meier model seems to be more in line with student graduation data, but this model can only display the survival function per explanatory variable.In contrast to the PH model, which can combine all explanatory variables.

Table 1 .
Descriptive analysis of the categorical variables

Table 2 .
Descriptive analysis of the numerical variables

Table 3 .
AIC Values for Each Model

Table 4 .
Comparison of the AFT and PH models of the Burr XII distribution

Table 4
explains that the PH Burr XII model and AFT Burr XII model has p-value less than 0,05, so the model is feasible to use.The PH Burr XII model has AIC value 5809,303 and AFT Burr XII model has AIC value 6177,027.So the best model is the PH Burr XII model because it has the smallest AIC value than the AFT model.

Table 5 .
Parameter estimators and -values of the PH Burr XII model