Support Vector Regression for Modeling Effect of Education Rate on Life Expectancy Rate in Indonesia

Abstract


INTRODUCTION
According to the Indonesian Central Bureau of Statistics (BPS), the Life Expectancy Rate is the average number of years that a person who has reached a certain age is expected to live under the mortality conditions prevailing in their community. The Life Expectancy Rate is a tool for evaluating government performance in improving public welfare, and health status in particular. If the life expectancy rate in a region is low, the government should improve it through health development programs and other social programs, including environmental health, nutritional adequacy, and poverty eradication programs.
This study uses data from the National Socio-Economic Survey (SUSENAS) organized by the Indonesian Central Bureau of Statistics (BPS). Previous studies using SUSENAS data include Damayanti and Ratnasari, who modeled variables affecting poverty in East Java using the Geographically Weighted Regression (GWR) method, and Anuraga and Otok, who used Structural Equation Modeling-Partial Least Squares to model poverty in East Java [1]. Nur and Widjanarko then used Meta-Analytic Structural Equation Modeling (MASEM) to model poverty in every district in Java.
Another study on poverty data is Ghazali  expectancy variables, so research is needed to find out how these two variables are related.
The Support Vector Regression (SVR) method is used in this study because, in several previous studies using SUSENAS data, modeling with SVR gave better accuracy: Ghazali and Fitriati in 2016, continued by Fitriati and Ghazali in 2018. SVR is a development of the Support Vector Machine (SVM), which was first introduced by Vapnik in 1992 as a coherent set of concepts in the field of pattern recognition. SVM is a classification method, and SVR extends it to regression. The SVR model has an advantage over the Ordinary Least Squares (OLS) regression model in that it can fit nonlinear models implicitly: kernel functions map the feature vectors x to a higher-dimensional space, where a linear model can be used as in the linearly separable case.

METHOD
In this study, we use support vector regression as the modeling method and the coefficient of determination as the criterion for selecting the best model.
Support Vector Regression (SVR) is probably most useful when the dimensionality of the input space and the order of the approximation create a feature space representation whose dimensionality is much larger than the number of examples. The classification problem can be restricted to the two-class problem without loss of generality. SVMs were developed to solve the classification problem, but they have since been extended to regression problems. SVR is the application of the SVM to regression: SVM was originally a classification method, in which the response variable is categorical, whereas in SVR the response variable is numerical, taking real, continuous values.
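Suppose there is a set of data

$$D = \{(x_1, y_1), \ldots, (x_l, y_l)\}, \quad x_i \in \mathbb{R}^n, \; y_i \in \mathbb{R} \tag{1}$$

with linear function

$$f(x) = \langle w, x \rangle + b \tag{2}$$

The optimal regression function is given by the minimum of the functional

$$\Phi(w, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i^- + \xi_i^+) \tag{3}$$

where $C$ is a predetermined value, and $\xi_i^-$, $\xi_i^+$ are slack variables indicating the upper and lower bounds of the output of the system.
The factor $\|w\|^2$ is called the regularization factor. Minimizing $\|w\|^2$ makes the function as flat as possible, which controls the function capacity. Using the idea of the ε-insensitive loss function introduced by Vapnik (1995), we minimize the norm of $w$ to get good generalization for the regression function $f$.
The ε-insensitive loss function used to solve the optimization problem (3) is

$$L_\varepsilon(y) = \begin{cases} 0 & \text{if } |f(x) - y| < \varepsilon \\ |f(x) - y| - \varepsilon & \text{otherwise} \end{cases} \tag{4}$$

so the solution is as follows

$$\max_{\alpha, \alpha^*} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle + \sum_{i=1}^{l} \left[ \alpha_i (y_i - \varepsilon) - \alpha_i^* (y_i + \varepsilon) \right] \tag{5}$$

with restrictions

$$0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \ldots, l, \qquad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0 \tag{6}$$

Solving equation (5) with boundary (6) using Lagrange multipliers, the optimal condition of the regression function is written as follows:

$$\bar{w} = \sum_{i=1}^{l} (\bar{\alpha}_i - \bar{\alpha}_i^*) x_i, \qquad \bar{b} = y_r - \langle \bar{w}, x_r \rangle - \varepsilon \tag{7}$$

where $x_r$ is any support vector with $0 < \bar{\alpha}_r < C$. If ε = 0 and the inner product is replaced by a kernel function, the optimization takes a simpler form:

$$\min_{\alpha, \alpha^*} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) - \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) y_i \tag{8}$$

with restrictions

$$0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \ldots, l, \qquad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0 \tag{9}$$

where $K(x_i, x_j)$ is the kernel function of $x_i$ and $x_j$. The optimal regression function of equation (2), with $\langle \bar{w}, x \rangle$ expressed through the kernel, is written as follows

$$f(x) = \sum_{i=1}^{l} (\bar{\alpha}_i - \bar{\alpha}_i^*) K(x_i, x) + \bar{b} \tag{10}$$

In this study, optimization is assisted with kernel functions including the polynomial kernel, the Gaussian Radial Basis Function (RBF), and the Exponential Radial Basis Function (eRBF).
The polynomial kernel function is

$$K(x, x') = (\langle x, x' \rangle + 1)^d$$

The RBF kernel function is

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$$

while the Exponential RBF kernel function is

$$K(x, x') = \exp\left( -\frac{\|x - x'\|}{2\sigma^2} \right)$$

where $d$ is the kernel degree and $\sigma$ is the width parameter of the radial kernels [8] [12].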
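The ε-insensitive loss and the three kernels named above can be sketched in a few lines of Python. This is only an illustration, not the Matlab toolbox code used in the study: it assumes the common forms K(x, x') = (<x, x'> + 1)^d for the polynomial kernel, exp(-||x - x'||^2 / (2σ^2)) for the RBF kernel, and exp(-||x - x'|| / (2σ^2)) for the eRBF kernel. The function names and the parameter name `sigma` are our own.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Euclidean norm ||u - v||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the eps-tube around the prediction, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def polynomial_kernel(u, v, d):
    """Assumed form: (<u, v> + 1) ** d, with d the polynomial degree."""
    return (dot(u, v) + 1.0) ** d


def rbf_kernel(u, v, sigma):
    """Assumed form: exp(-||u - v||^2 / (2 sigma^2))."""
    return math.exp(-euclidean_distance(u, v) ** 2 / (2.0 * sigma ** 2))


def erbf_kernel(u, v, sigma):
    """Assumed form: exp(-||u - v|| / (2 sigma^2)); the distance is not squared."""
    return math.exp(-euclidean_distance(u, v) / (2.0 * sigma ** 2))


# Toy one-dimensional inputs (hypothetical values, not SUSENAS data).
u, v = [1.0], [3.0]
poly_value = polynomial_kernel([1.0], [2.0], 3)   # (1*2 + 1)^3 = 27.0
rbf_value = rbf_kernel(u, v, 1.0)                 # exp(-4 / 2)
erbf_value = erbf_kernel(u, v, 1.0)               # exp(-2 / 2)
```

Because the eRBF kernel uses the unsquared distance, it decays more slowly than the RBF kernel for distant points (distance above 1) and is not smooth at zero distance, which tends to give more flexible, less smooth fits.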

Model Goodness measurement
To choose the best model, we validate each model using the coefficient of determination ($R^2$), which is the proportion of the variation in the response variable that is explained by the predictor variable. It is written as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $y_i$ denotes the i-th observation, $\hat{y}_i$ is the prediction for the i-th observation, and $\bar{y}$ is the mean of the observations. The best model is the one with the maximum $R^2$ value.
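As a minimal sketch of this criterion, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the toy values below are hypothetical and only illustrate the calculation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy observations and predictions (hypothetical, not SUSENAS data).
y_obs = [68.0, 70.0, 72.0, 74.0]
y_hat = [69.0, 70.0, 72.0, 73.0]

# SS_res = 1 + 0 + 0 + 1 = 2 and SS_tot = 9 + 1 + 1 + 9 = 20, so R^2 = 0.9.
score = r_squared(y_obs, y_hat)
```

A model that always predicts the mean of the observations scores 0, and a perfect model scores 1, so among candidate models the one with the largest value is preferred.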

RESULT
The data of this study are secondary data taken from the National Socio-Economic Survey (SUSENAS) for 2012, collected by the Central Statistics Agency (BPS). The data cover indicators of health, human resources, and economics. The observations consist of 497 districts and cities in Indonesia; the response variable is the Life Expectancy Rate (Y), and the predictor variable is the length of education in years (X) in each district and city in Indonesia in 2012.
The software used is the Matlab toolbox created by Steve R. Gunn. The data description is displayed in Table 1. The correlation between the predictor variable and the response variable is 0.4369, which indicates a positive relationship between the average length of schooling and life expectancy. This means that as public education improves, as characterized by a higher average length of schooling, life expectancy in Indonesia tends to increase.
The SVR kernels used in this study are the linear, polynomial, Gaussian Radial Basis Function (RBF), and Exponential Radial Basis Function (eRBF) kernels. The polynomial and RBF kernels are each used with three different degrees.
After the predictions $\hat{y}_i$ of the response variable are obtained, they are compared with the observed values $y_i$. The closer $\hat{y}_i$ is to $y_i$, the greater the accuracy, as indicated by a small MSE value and a large $R^2$ value.
A summary of the output of this experiment is shown in Table 2. The OLS regression method produces a coefficient of determination $R^2$ of only 19.09%, which means that the resulting model is not very good: the predictor variable explains only 19.09% of the variation in the response variable, while the rest is influenced by other factors. The regression plot in Figure 1 shows a scattered pattern, but the regression line shows an upward trend, which means that there is a positive relationship between the average length of schooling and life expectancy.
The SVR method with the polynomial kernel does not produce a more accurate model, but the RBF kernel gives a better coefficient of determination, $R^2$ = 23.77%. The highest coefficient of determination, 68.90%, is obtained with the eRBF kernel.
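The link between a small MSE and a large R2 can be made explicit: for a fixed set of observations, R2 equals one minus the MSE divided by the variance of the observations, so lowering the MSE always raises R2. The sketch below illustrates this with hypothetical toy values; the function names are our own.

```python
def mse(y_true, y_pred):
    """Mean squared error between observations and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r2_from_mse(y_true, y_pred):
    """R^2 written as 1 - MSE / Var(y), with the population variance of y."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    variance = sum((y - mean_y) ** 2 for y in y_true) / n
    return 1.0 - mse(y_true, y_pred) / variance


# Toy observations with two sets of predictions (hypothetical values).
y_obs = [68.0, 70.0, 72.0, 74.0]
close_fit = [69.0, 70.0, 72.0, 73.0]   # MSE = 0.5
loose_fit = [70.0, 70.0, 72.0, 72.0]   # MSE = 2.0

# The fit with the smaller MSE has the larger R^2.
close_scores = (mse(y_obs, close_fit), r2_from_mse(y_obs, close_fit))
loose_scores = (mse(y_obs, loose_fit), r2_from_mse(y_obs, loose_fit))
```

Since the variance of the observations does not depend on the model, ranking models by MSE or by R2 on the same data gives the same ordering.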
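The reported OLS result is consistent with the correlation given earlier: for a simple linear regression with an intercept, the coefficient of determination equals the squared Pearson correlation, and 0.4369 squared is approximately 0.1909, that is, 19.09%. The sketch below checks this identity on hypothetical toy data using the closed-form OLS estimates. It is not the study's data or code.

```python
import math


def ols_fit(x, y):
    """Closed-form slope and intercept of a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: years of schooling (x) and life expectancy (y), hypothetical values.
x = [6.0, 7.0, 8.0, 9.0, 10.0]
y = [66.0, 70.0, 68.0, 73.0, 71.0]

slope, intercept = ols_fit(x, y)
fitted = [intercept + slope * a for a in x]
r = pearson_r(x, y)
r2 = r_squared(y, fitted)   # equals r ** 2 up to rounding error
```

This identity holds only for the linear fit; the kernel-based SVR models are not bound by it, which is why they can reach a higher R2 from the same single predictor.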

Figure 2: Plot of X vs Y using the 3rd degree RBF kernel SVR
Figure 3: Plot of X vs Y using the 3rd degree eRBF kernel SVR
The best model for estimating the effect of education, as measured by the average length of schooling, on the life expectancy rate in Indonesia with the Support Vector Regression (SVR) method is obtained using the Exponential Radial Basis Function (eRBF) kernel, as indicated by an $R^2$ value better than that of the OLS regression method and of the other SVR kernels.

CONCLUSION
Suggestions for further research include comparing several other SVM kernels, such as spline, B-spline, and Fourier series kernels. Further research is also needed on the relationship between the two variables by quartile of the data, to see which range of years of education has a significant effect on life expectancy.