Generalized Linear Models in Determining Factors Affecting the Number of Community Visits to Health Service with Bayesian Inference Approach

ABSTRACT

used in this study are the number of confirmed Covid-19 cases per day and the number of deaths caused by Covid-19 per day. The purpose of this study is to observe how much impact the confirmation of Covid-19 cases has on the number of deaths that occur every day (Saidi et al., 2021).
Two methods are used to estimate parameters: the classical method, in which estimation is based only on information from the sample data and ignores initial (prior) information, and the Bayesian method (Hosack et al., 2017). In the Bayesian method, population parameters are viewed as random variables that have an initial (prior) distribution. The prior distribution is a subjective distribution based on one's beliefs about the parameters. In general, the main advantage of the Bayesian method is that it simplifies the classical method, which requires complex integrals to calculate the marginal model (Shimizu, 2023). The Bayesian method combines the information in the sample with the information in the prior distribution to produce a posterior distribution. If the posterior distribution cannot be calculated analytically, a Markov Chain Monte Carlo (MCMC) approach is used. This algorithm uses an acceptance and rejection mechanism to generate a sequence of random samples. In the MCMC simulation, iterations are carried out until the random samples reach convergence, which can be seen from a regular pattern in the dynamic trace plot (Shi et al., 2023). In this study, the response variable, the number of visits to healthcare facilities, has a Poisson distribution with a small sample size. The proposed model is a Bayesian GLM whose parameters are estimated with the MCMC algorithm using WinBUGS software.

B. METHODS
In many statistical models, the dependent variable is generally assumed to be normally distributed, but in practice it often is not (Houpt & Bittner, 2018). One way to overcome this problem is to extend the linear model to the GLM (Saputri & Devianto, 2020). The GLM is a widely used generalization of the linear model (Saidi et al., 2021); it is a more flexible form of ordinary linear regression that describes the linear relationship between several independent variables and one dependent variable through a link function (Rodriguez et al., 2022). The GLM extends the linear regression model by keeping the assumption that the predictor effects are linear while not requiring the response variable to be normally distributed. The model is usually used when the distribution of the response variable belongs to the exponential family (Bacha & Tadesse, 2019). The flow chart of the GLM process with Bayesian inference is shown in Figure 1.

… (contingency table) and Gamma (variance component) (Saputri & Devianto, 2020). The general form of the GLM is

$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip},$$

with link function $g(\mu_i)$, where $i = 1, 2, \dots, n$ indexes the individual observations, $p$ is the number of independent variables, $x_{ij}$ is the $i$-th observation of the $j$-th independent variable, and $\beta_j$ is the unknown coefficient of the $j$-th independent variable. The GLM unifies various statistical methods: regression, ANOVA, and models for categorical data are all special cases of one general model. The GLM has three components: the random component, the systematic component, and the link function (Diwidian et al., 2020). The random component identifies the dependent variable, expressed as mutually independent observations $(y_1, y_2, \dots, y_n)$, and selects a probability distribution for the study data.
In some cases each observation $y_i$ is a count with a distribution from the exponential family, which has the probability density function

$$f(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\},$$

where $\theta_i$ is the canonical parameter, $\phi$ is the dispersion parameter, and $a(\cdot)$, $b(\cdot)$, $c(\cdot)$ are known functions. The value of $\theta_i$, $i = 1, 2, \dots, n$, depends on the values of the independent variables. The systematic component of the GLM specifies the independent variables, which enter linearly as the predictor on the right-hand side of the model equation. For the vector $(\eta_1, \eta_2, \dots, \eta_n)$, let $x_{ij}$ be the value of the $j$-th independent variable for observation $i$, with $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, p$; then

$$\eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}.$$

This linear combination of the independent variables is referred to as the linear predictor. The third component is the link function, which connects the random and systematic components. The link function $g(\cdot)$ relates the mean $\mu_i = E(y_i)$ to the linear predictor, $g(\mu_i) = \eta_i$, so that

$$g(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}.$$

The link function $g(\mu_i) = \mu_i$ is called the identity link, for which $\eta_i = \mu_i$. This implies a linear model for the mean response:

$$\mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}.$$

If the link function is $g(\mu_i) = \log(\mu_i)$, it models the log of the mean and is called the log link. The link functions for the GLM are listed in Table 1.

In this study, the parameters of the GLM are estimated with the Bayesian approach because of the relatively small sample of community visits to healthcare facilities in West Sumatra. Unlike traditional statistical methods, the Bayesian approach integrates different types of information into the estimation process, such as expert judgment and statistical data (Jayaraman & Ramu, 2023). It combines the likelihood function with the prior distribution, and Bayesian inference then provides posterior probability estimates (Ota et al., 2023).
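As a worked illustration for the Poisson response considered in this study, the Poisson probability function can be rewritten in exponential-family form. The sketch assumes the common notation $\exp\{(y\theta - b(\theta))/a(\phi) + c(y, \phi)\}$ for the exponential family; the symbols $a$, $b$, and $c$ are notational assumptions, not taken from the original tables.

```latex
\begin{align*}
f(y_i; \mu_i) &= \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}
  = \exp\left\{ y_i \log \mu_i - \mu_i - \log(y_i!) \right\}, \\
\theta_i &= \log \mu_i, \qquad b(\theta_i) = e^{\theta_i} = \mu_i, \qquad
a(\phi) = 1, \qquad c(y_i, \phi) = -\log(y_i!), \\
E(y_i) &= b'(\theta_i) = \mu_i, \qquad
\operatorname{Var}(y_i) = a(\phi)\, b''(\theta_i) = \mu_i .
\end{align*}
```

Because the canonical parameter is $\theta_i = \log \mu_i$, the canonical link for a Poisson response is the log link, $\log(\mu_i) = \eta_i$, which corresponds to a mean of the form $\mu_i = \exp(\eta_i)$.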
The MCMC simulation is performed in WinBUGS, a powerful tool for Bayesian analysis of complex probabilistic Markov models (Youn et al., 2022). The data come from one hundred respondents in Padang City, West Sumatra Province, who filled out questionnaires in 2022. The response variable is the number of visits by respondents to healthcare facilities. The predictor variables are listed in Table 2.

Table 2. Predictor variables
No.  Description
1    Types of health insurance: government insurance, other insurance, or no health insurance.
2    Distance from home to health services: less than 2 km, 2 km to 5 km, or more than 5 km.
3    Consumption pattern: always eating healthy food, occasionally eating fast food, or no rule about always eating healthy food.
4    Medical history: never seriously ill, ever seriously ill, or often seriously ill.
5    Status of the house with clean and healthy living.
Some of the predictor variables in Table 2 are categorical and must be converted into dummy variables; the conversion is shown in Table 3. The following steps are taken in data processing to determine which predictor variables significantly affect the number of community visits to health services in West Sumatra.
1. Perform descriptive statistical analysis of the response variable, the number of community visits to health facilities.
2. Determine the factors that influence the number of visits made by the community to healthcare facilities using the GLM with parameter estimation by the Bayesian inference approach, in the following stages:
   a. Identify the prior distribution and the likelihood function of the data.
   b. Determine the posterior distribution of the GLM parameters.
   c. Estimate the parameters with MCMC using the Gibbs sampler for T iterations.
   d. Test the convergence of each model parameter. If the parameters have not converged, repeat the MCMC process with additional iterations until all GLM parameters have converged.
   e. Test the significance of the model parameters.
   f. Test the significance of the predictor variables, to check whether each predictor variable has a significant influence on the response variable.
   g. Test the convergence of each model parameter.
   h. Once the parameters with a significant effect on the response variable are known, test convergence again as in stage d.
   i. Select the best model.
   j. Once the convergence and significance conditions are met, obtain the parameter estimates that form the best model.
3. Draw conclusions from the data analysis based on the Bayesian inference approach to the GLM.
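Stages a to d above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WinBUGS analysis used in the study: the data are invented toy counts, the Normal(0, 10^2) priors are hypothetical, and a Metropolis-within-Gibbs update (one coefficient at a time, with an accept/reject step) stands in for whatever sampling scheme WinBUGS selects internally.

```python
import math
import random

# Toy data (invented for illustration): 20 respondents with dummy x = 0
# and 20 with x = 1; each row of X is [intercept, x].
X = [[1.0, 0.0]] * 20 + [[1.0, 1.0]] * 20
y = [1, 2] * 10 + [3, 5] * 10


def log_posterior(beta, X, y, prior_sd=10.0):
    """Log posterior (up to a constant) of a Poisson GLM with log link and
    independent Normal(0, prior_sd^2) priors (a hypothetical prior choice)."""
    lp = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))  # linear predictor
        lp += yi * eta - math.exp(eta)              # Poisson log-likelihood term
    lp -= sum(b * b for b in beta) / (2.0 * prior_sd ** 2)  # Normal log-prior
    return lp


def metropolis_within_gibbs(X, y, n_iter=3000, step=0.15, seed=1):
    """Update one coefficient at a time with a random-walk accept/reject step."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])                        # initial values
    current = log_posterior(beta, X, y)
    draws = []
    for _ in range(n_iter):
        for j in range(len(beta)):
            proposal = list(beta)
            proposal[j] += rng.gauss(0.0, step)
            candidate = log_posterior(proposal, X, y)
            # Accept with probability min(1, posterior ratio).
            if math.log(1.0 - rng.random()) < candidate - current:
                beta, current = proposal, candidate
        draws.append(list(beta))
    return draws


draws = metropolis_within_gibbs(X, y)
kept = draws[1000:]                                 # discard burn-in iterations
post_mean = [sum(d[j] for d in kept) / len(kept) for j in range(2)]
print(post_mean)
```

Plotting each column of `draws` against the iteration number gives a trace plot of the kind used in stage d; if the trace has not settled, `n_iter` can be increased and the run repeated.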

C. RESULT AND DISCUSSION
In this study, a GLM regression analysis is used to identify the factors that influence the number of respondents' visits to health service facilities. The log link function connects the mean response to the linear predictor ($\eta_i$) as follows:

$$\log(\mu_i) = \eta_i.$$

One of the methods used to estimate the GLM parameters is the Bayesian method, which estimates parameters by combining the information in the sample with other information that was previously available. In the Bayesian method, when a population follows a certain distribution with a parameter (say $\theta$), the parameter itself follows a probability distribution called the prior distribution. Bayesian methods are based on Bayes' theorem: the likelihood function and the prior distribution of the parameters are combined to obtain the posterior distribution, which then becomes the basis for estimating the parameters. Markov Chain Monte Carlo (MCMC) is a simulation technique commonly used in the Bayesian method; it obtains samples of a random variable with a sampling technique based on a Markov chain. One well-known MCMC technique is the Gibbs sampler, which uses conditional distributions to generate samples of the random variables. The MCMC method is effective for estimating the parameters of a posterior distribution that is very complex and difficult to handle with other methods. Monte Carlo simulation is an approach to estimating the distribution function of a random variable.
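The combination of likelihood and prior described above can be written out for a Poisson response. The following is a sketch under assumptions: $\beta$ denotes the vector of regression coefficients, $\pi(\beta)$ is a generic prior density (the specific prior used in the study is not stated here), and the mean is taken to be $\mu_i = \exp(\eta_i)$.

```latex
\begin{align*}
p(\beta \mid y) &= \frac{L(\beta; y)\,\pi(\beta)}{\int L(\beta; y)\,\pi(\beta)\,d\beta}
  \;\propto\; L(\beta; y)\,\pi(\beta), \\
L(\beta; y) &= \prod_{i=1}^{n} \frac{\mu_i^{y_i}\, e^{-\mu_i}}{y_i!},
  \qquad \mu_i = \exp(\eta_i).
\end{align*}
```

The integral in the denominator generally has no closed form for this model, which is why the posterior is explored by MCMC simulation rather than computed analytically.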
The simulation method is the Gibbs sampler, which uses the full conditional distributions associated with a stationary distribution. The Gibbs sampler selects initial values and then simulates the random variables from their full conditional distributions over $T$ iterations (Bacha & Tadesse, 2019).

First, we test whether the response variable, the number of visits made by respondents to health services, follows a Poisson distribution. The test is conducted with the Kolmogorov-Smirnov test, with the following hypotheses: $H_0$: the data follow a Poisson distribution; $H_1$: the data do not follow a Poisson distribution. SPSS software was used to obtain the Kolmogorov-Smirnov test statistic

$$D = \max \left| F_o(X) - F_r(X) \right|,$$

where $F_o(X)$ is the observed cumulative frequency distribution of a random sample of $n$ observations and $F_r(X)$ is the theoretical cumulative distribution, in this case the Poisson distribution. The results of the test are given in Table 4. Since the significance value (2-tailed) is greater than $\alpha$, $H_0$ is not rejected. Thus, it can be concluded that the number of visits made by respondents to health facilities follows the Poisson distribution.
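The Kolmogorov-Smirnov distance between the observed and the Poisson cumulative distributions can be computed directly, as in the following minimal sketch. It assumes that the Poisson mean is estimated by the sample mean and uses invented counts; SPSS may differ in details such as how the p-value is obtained, which is not reproduced here.

```python
import math


def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu), summing the probability terms directly."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def ks_statistic_poisson(counts):
    """Return (D, mu): the largest absolute gap between the empirical CDF of
    the counts and the CDF of a Poisson with mean equal to the sample mean."""
    n = len(counts)
    mu = sum(counts) / n
    d = 0.0
    # Both CDFs are step functions that jump only at integers, so it is
    # enough to compare them at 0, 1, ..., max(counts).
    for k in range(max(counts) + 1):
        f_obs = sum(1 for c in counts if c <= k) / n
        d = max(d, abs(f_obs - poisson_cdf(k, mu)))
    return d, mu


# Hypothetical visit counts, for illustration only.
visits = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
d_stat, mu_hat = ks_statistic_poisson(visits)
print(d_stat, mu_hat)
```

A small value of the statistic relative to its critical value is consistent with not rejecting the Poisson hypothesis.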

Multicollinearity Test
Multicollinearity among the predictor variables is assessed from the VIF and Tolerance values of each predictor variable, as shown in Table 5. The VIF values are $X_{1,1} = 1.120$, $X_{1,2} = 1.076$, $X_{2,1} = 1.046$, $X_{2,2} = 1.103$, $X_{3,1} = 1.232$, $X_{3,2} = 1.397$, $X_{4,1} = 1.378$, $X_{4,2} = 1.884$, and $X_5 = 2.124$. Because all VIF values are less than 10 and all Tolerance values are greater than 0.10, it can be concluded that there is no multicollinearity among the predictor variables, so the analysis can proceed to the modeling stage.
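The check can be reproduced from the reported VIF values, using the fact that Tolerance is the reciprocal of VIF. The sketch below uses the VIF values quoted in the text; the dictionary keys are illustrative labels for the dummy variables.

```python
# VIF values as reported in the text for each (dummy) predictor variable.
vif = {
    "X1,1": 1.120, "X1,2": 1.076,
    "X2,1": 1.046, "X2,2": 1.103,
    "X3,1": 1.232, "X3,2": 1.397,
    "X4,1": 1.378, "X4,2": 1.884,
    "X5": 2.124,
}

# Tolerance is defined as 1 / VIF.
tolerance = {name: 1.0 / value for name, value in vif.items()}

# Rule of thumb used in the text: VIF < 10 and Tolerance > 0.10.
no_multicollinearity = all(
    vif[name] < 10 and tolerance[name] > 0.10 for name in vif
)

for name in vif:
    print(f"{name}: VIF = {vif[name]:.3f}, Tolerance = {tolerance[name]:.3f}")
print("No multicollinearity:", no_multicollinearity)
```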

Data Analysis
In this study, regression analysis with the GLM was used to determine the factors that influence the number of respondents' visits to health services. Based on Equation 6, the log link function connects the mean response to the linear predictor as follows:

$$\log(\mu_i) = \alpha + \beta_{1,1}X_{1,1} + \beta_{1,2}X_{1,2} + \beta_{2,1}X_{2,1} + \beta_{2,2}X_{2,2} + \beta_{3,1}X_{3,1} + \beta_{3,2}X_{3,2} + \beta_{4,1}X_{4,1} + \beta_{4,2}X_{4,2} + \beta_{5}X_{5}. \qquad (8)$$

From this equation, the next step is to estimate the model parameters with the Bayesian method, which is carried out in WinBUGS. The simulation begins by building the model and then iterating with the Gibbs sampler. The posterior densities of the parameter estimates for the number of respondent visits to healthcare facilities are displayed in Figure 2. Figure 2 shows the density plots for the constant (alpha), the two dummy-variable parameters for the type of health insurance (b.AK1, b.AK2), the two dummy-variable parameters for the distance from the respondent's house to the health service (b.JRK1, b.JRK2), the two dummy-variable parameters for consumption pattern (b.PK1, b.PK2), the two dummy-variable parameters for health history (b.RK1, b.RK2), and the parameter for the respondent's home status (b.RM). The results are quite good: each density is smooth and bell-shaped, which indicates that the posterior distributions of the GLM parameters are approximately normal. Next, convergence was checked with trace plots, which show the sampled parameter values against the iteration number, as shown in Figure 3.
Based on Figure 3, the sampled values for the constant (alpha), the two dummy-variable parameters for the type of health insurance (b.AK1, b.AK2), the two dummy-variable parameters for the distance from the respondent's house to the health service (b.JRK1, b.JRK2), the two dummy-variable parameters for consumption pattern (b.PK1, b.PK2), the two dummy-variable parameters for health history (b.RK1, b.RK2), and the parameter for the respondent's home status (b.RM) are constant and stable within the estimation interval, so it can be concluded that convergence is achieved. The next step is to test the significance of the predictor variables, to check whether they have a significant influence on the response variable. In the Bayesian approach, the significance of a parameter can be evaluated from its credible interval. The results of the parameter significance test are shown in Table 6. According to Table 6, the credible intervals of the parameters $\beta_{3,1}$, $\beta_{3,2}$, $\beta_{4,1}$, $\beta_{4,2}$ do not contain zero, meaning that the variables $X_{3,1}$, $X_{3,2}$, $X_{4,1}$, $X_{4,2}$ have a significant influence on the response variable $Y$. The next step is to fit a new regression model that omits the variables that are not significant. This process involves 111,000 iterations. The significance of the retained parameters is then tested again, with the results given in Table 7. In Table 7, significance is assessed from the interval between the 5.0% and 95.0% posterior percentiles. Because none of these intervals contains zero, the retained predictor variables have a significant influence on the response variable.
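The interval check described above can be sketched as follows. The function names and the posterior draws are hypothetical, and the percentiles are computed by linear interpolation, which may differ slightly from the summary that WinBUGS reports.

```python
import math


def percentile(sorted_values, q):
    """Percentile of an already sorted list by linear interpolation (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def credible_interval(draws, lower=0.05, upper=0.95):
    """Interval between the 5.0% and 95.0% posterior percentiles by default."""
    ordered = sorted(draws)
    return percentile(ordered, lower), percentile(ordered, upper)


def excludes_zero(interval):
    """A parameter is flagged as significant when zero lies outside the interval."""
    low, high = interval
    return low > 0.0 or high < 0.0


# Hypothetical posterior draws for two coefficients, for illustration only.
draws_a = [0.5 + i / 100 for i in range(101)]    # values from 0.50 to 1.50
draws_b = [-0.5 + i / 100 for i in range(101)]   # values from -0.50 to 0.50

print(credible_interval(draws_a), excludes_zero(credible_interval(draws_a)))
print(credible_interval(draws_b), excludes_zero(credible_interval(draws_b)))
```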
Next, the convergence of the significant parameters is checked with the history trace plot shown in Figure 4. The history trace plot indicates that the parameters have converged, so parameter estimation is carried out with WinBUGS; the results are given in Table 8. Based on these estimates, the Generalized Linear Model equation for the number of community visits to healthcare facilities is

$$\hat{\mu}_i = \exp(0.5898\,X_{3,1} + 0.5338\,X_{3,2} + 0.9051\,X_{4,1} + 1.488\,X_{4,2}),$$

where b.PK1 ($\beta_{3,1}$) and b.PK2 ($\beta_{3,2}$) are the parameters of the dummy variables for consumption pattern, and b.RK1 ($\beta_{4,1}$) and b.RK2 ($\beta_{4,2}$) are the parameters of the dummy variables for the respondent's health history. Thus, the predictor variables with a significant influence on the number of community visits to health services are the respondents' consumption patterns and medical history.
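To illustrate how the fitted equation can be read, the sketch below evaluates it for given dummy values. It assumes that each dummy variable is coded 0 or 1 with the omitted category as the reference; the actual coding is defined in Table 3, which is not reproduced here, so the interpretation of each category is left open.

```python
import math

# Posterior estimates reported in the fitted equation.
coef = {"b.PK1": 0.5898, "b.PK2": 0.5338, "b.RK1": 0.9051, "b.RK2": 1.488}


def expected_visits(pk1, pk2, rk1, rk2):
    """Fitted mean number of visits for 0/1 dummy values (assumed coding)."""
    eta = (coef["b.PK1"] * pk1 + coef["b.PK2"] * pk2
           + coef["b.RK1"] * rk1 + coef["b.RK2"] * rk2)
    return math.exp(eta)


# With a log link, exp(coefficient) is the multiplicative change in the
# expected number of visits when the dummy switches from 0 to 1.
rate_ratio = {name: math.exp(value) for name, value in coef.items()}

# The fitted equation has no intercept, so all dummies at 0 give exp(0) = 1.
print(expected_visits(0, 0, 0, 0))
print(expected_visits(1, 0, 0, 1))
for name, ratio in rate_ratio.items():
    print(f"{name}: rate ratio = {ratio:.3f}")
```

Under this reading, each coefficient acts multiplicatively on the expected count, so the effect of two active dummies is the product of their rate ratios.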
In previous studies, Generalized Linear Models have been widely used to analyze various problems, including the number of tuberculosis patients and the factors that cause malnutrition in children in Ethiopia. The present study analyzes the factors that have a significant influence on the number of community visits to health services. Based on the results of the analysis, the respondents' consumption patterns and medical history have a dominant influence on community visits to health centers in Padang City. Consumption patterns matter because poor consumption patterns can increase the risk of developing various dangerous diseases: the more people follow poor consumption patterns, the more people suffer from diseases, and the higher the number of visits to health services. Respondents with a history of chronic or serious illness visit health services more often so that they can better monitor and maintain their health.

D. CONCLUSION AND SUGGESTIONS
Based on the analysis using the GLM with Bayesian parameter estimation, it can be concluded that only two of the five hypothesized variables have a significant influence on the number of respondents' visits to health service facilities in Padang City, namely the consumption patterns and the health history of the respondents. These two variables have a dominant effect on community visits to health services in Padang City, which indicates that the community should pay attention to its consumption patterns, since poorer consumption patterns increase the risk of dangerous diseases. Health history also affects the frequency of visits to healthcare centers: individuals with a more extensive medical history tend to have their health checked more often in order to keep their condition under control.