Percentile Bootstrap Interval on Univariate Local Polynomial Regression Prediction

ABSTRACT

This study proposes a technique for constructing percentile bootstrap intervals for predictions from univariate local polynomial regression. The bootstrap regression uses resampling based on the paired and residual bootstrap methods. The main objective is to compare the two resampling methods in terms of how close their coverage probabilities come to the nominal coverage. Resampling uses a nonparametric bootstrap with replacement, in which each sample point has an equal chance of being selected. The nonparametric bootstrap uses the original sample data as the source of variability, in contrast to the parametric bootstrap, where the variability comes from generating data from a particular distribution. The simulation results show that the coverage probabilities of the paired and residual bootstrap intervals are close to the nominal coverage, with no significant difference between the paired and residual percentile bootstrap intervals. A sufficiently large number of bootstrap samples gives smoother confidence bands in the scatterplot. With a smoothing parameter fixed by choice, the second-degree local polynomial regression gives a smoother curve than the first-degree local polynomial regression. The scatterplot shows that the second-degree polynomial regression captures the curvature of the data, whereas the first-degree polynomial does not, and the bands built from second-degree polynomials are narrower than those built from first-degree polynomials. In contrast, applying the optimal smoothing parameter leads to conclusions that differ from those obtained with the chosen smoothing parameter. Besides the differences visible in the scatterplot, the bootstrap estimates of the coverage probability also differ. With the smoothing parameter fixed at a chosen value, the coverage probability of the paired bootstrap method is 0.93 for the first-degree local polynomial regression and 0.96 for the second-degree local polynomial regression.
The coverage probability of the residual bootstrap method is 0.95 for the first-degree local polynomial regression and 0.96 for the second-degree local polynomial regression. With the optimal smoothing parameter, the coverage probability of the paired bootstrap method is 0.945 for the first-degree local polynomial regression and 0.93 for the second-degree local polynomial regression, while the residual bootstrap method gives 0.95 and 0.93, respectively. In general, both bootstrap methods work well for estimating prediction confidence intervals.

A. INTRODUCTION
Efron & Tibshirani (1994) were pioneers in introducing the bootstrap method as a resampling technique that is very useful for estimating statistics without requiring certain assumptions to hold. In the bootstrap method, bootstrapping refers to a resampling procedure that draws from the original data to produce many simulated samples (bootstrap samples). Simamora et al. (2015) suggested using a high-performance computer to perform bootstrapping in simulation. Computationally expensive simulation is an option when a statistic has no closed form, or when more complex statistics are to be analysed without imposing certain assumptions. Bootstrapping is a resampling technique that is useful for analysing difficult statistics without strict rules, or when the parametric assumptions of the applied model are not met (Solci et al., 2022). The working principle of bootstrapping relies on resampling from the empirical distribution, which allows parametric assumptions to be weakened. One such complex and sensitive statistic is the confidence interval for a nonparametric regression prediction. In practice, standard confidence intervals based on asymptotic distribution theory can be wildly inaccurate (Diciccio & Efron, 1996): the curve features of the local polynomial regression and the lower and upper limits of the interval can be far from the truth.
The failure of the normal approximation to provide valid confidence intervals has prompted several alternative methods. Eubank & Speckman (1993) proposed a bias-corrected confidence band for a nonparametric kernel regression model. The bands are constructed using only the kernel estimator of the regression curve and a bandwidth selected from the data. The results of their Monte Carlo simulation show that the confidence interval for the mean response has asymptotically correct coverage and behaves well in small-sample studies. They concluded that Bonferroni-type bands have conservative asymptotic coverage behaviour for large samples without bias correction. Xia (1998) then answered the open question on page 1298 of Eubank & Speckman (1993), using local polynomial regression fitting to construct confidence intervals for the mean response, with cross-validation and plug-in methods to select the bandwidth. Härdle & Bowman (1988) discuss the performance of bootstrapping and direct methods in a nonparametric regression model. They use the principle of a locally adaptive choice of smoothing parameters. This principle is applied to bootstrap sampling to estimate mean squared errors and percentile intervals from nonparametric estimates at test points. In these two applications, they compare bootstrap performance with a simple "plug-in" method based on direct estimation (asymptotic expansion). In general, the two methods perform very similarly. However, the bootstrap has the slight advantage of being less sensitive to second derivatives. Moreover, in confidence interval construction, the bootstrap can reflect features such as skewness, but it falls slightly short of the target confidence level because of inaccuracies in centring. Ringle et al. (2012) warned that correct settings are needed to obtain a reasonable bootstrap confidence interval estimate. A poor choice of options can lead to a significantly biased estimate of the standard error and cause the bootstrap estimate to become unstable. Özdemir (2013) showed that the percentile bootstrap interval performs better in terms of Type I error probability and is more efficient in computation time. Aguirre-Urreta & Rönkkö (2017) revealed that the percentile bootstrap confidence interval is the most straightforward approach, although the distribution of the statistic must be taken into account. This approach works best if the distribution of the statistic is symmetrical and centred on the original estimate. Jung et al. (2019) show that the coverage probability of the bootstrap percentile confidence interval is closer to the nominal coverage. They study generalized structured component analysis without the need for distributional assumptions. Gultom et al. (2022) applied the Gompertz Growth Model with Levenberg-Marquardt iteration to the soybean growth process. They conclude that the bootstrap resampling process in the growth model does not change the characteristics of the data (the information in the data) and aims to fulfil the assumption of residual normality.
This study applies the paired and residual nonparametric bootstrap methods to construct prediction confidence intervals for local polynomial regression. It then compares the two nonparametric bootstrap methods by considering the bootstrap estimate of the standard error of the pivot quantity and the closeness of the empirical bootstrap coverage probability to the nominal coverage. The basic idea builds on the results of Mansyur & Simamora (2022).
The organisation of this paper is as follows. The first section presents an introduction covering the background and proposals of this research. The second part describes the research method and summarises the concepts and theories of the local polynomial regression model and bootstrap. This section also provides a new algorithm (novelty) related to the bootstrap estimation of confidence intervals for local polynomial regression predictions with nested bootstrap using paired and residual bootstrap methods. The third section deals with the results and discussion of the simulation of the new algorithm. The last section includes research conclusions and suggestions for further development.

B. METHODS
This research combines a literature review with simulation. The literature review provides the framework for deriving a new algorithm. The sample data are generated following the design in Eubank & Speckman (1993), and the simulation samples are obtained by resampling the generated data with the paired and residual bootstraps. Figure 1 presents the research flow that follows from the literature review. It displays only the main stages of the simulation and of the analysis of the simulation output. The following sections explain parts of these stages. Section 1 summarises the literature review on the concepts and theories of local polynomial regression and of the paired and residual bootstraps. Section 2 presents in detail the proposed new algorithm of bootstrap percentile confidence intervals for local polynomial regression prediction, as shown in Figure 1.

Summary of Literature Review
This section summarises the concepts and theories of local polynomial regression. For brevity, LPR-1 denotes first-degree Local Polynomial Regression (LPR) and LPR-2 denotes second-degree Local Polynomial Regression. The section also summarises the paired and residual bootstrap methods in general. Combining the LPR concepts and theory with the bootstrap method leads to two new algorithms for bootstrap percentile interval estimation based on the empirical distribution of pivot quantities. We refer to the paired and residual bootstrap percentile intervals by the acronyms CI-Paired and CI-Residual, respectively.
a. Univariate Local Polynomial Regression
Nonparametric regression is an extension of the parametric regression model. The mean response does not follow a specified trend but is constructed from the information in the data. Local polynomial regression (LPR) is a nonparametric regression model that estimates the relationship between the independent and dependent variables without assuming any functional form. Cleveland (1979) presented a univariate LPR model in the following form,
$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$
where $f(x)$ is an unknown smooth function and the $\varepsilon_i$ are independent and identically distributed random variables. The function $f$ is the expectation of the response that needs to be estimated, while the random variable $\varepsilon_i$ has expectation $E(\varepsilon_i) = 0$ and constant variance $\mathrm{Var}(\varepsilon_i) = \sigma^2$. According to de Brabanter et al. (2013), as long as the $(p+1)$th derivative of $f$ at the point of interest $x_0$ exists, the function $f(x)$ can be approximated locally by a polynomial of degree $p$,
$$f(x) \approx \beta_0 + \beta_1 (x - x_0) + \cdots + \beta_p (x - x_0)^p. \tag{2}$$
Local polynomial fitting estimates the parameters on the right-hand side of equation (2) by the weighted least squares method, that is, by solving the minimisation problem
$$\min_{\beta_0, \ldots, \beta_p} \sum_{x_i \in \mathbb{N}(x_0)} W\!\left(\frac{|x_i - x_0|}{\Delta(x_0)}\right) \Big[ y_i - \sum_{j=0}^{p} \beta_j (x_i - x_0)^j \Big]^2, \tag{3}$$
where $k = \lfloor \gamma n \rfloor$ is the number of points $x_i \in \mathbb{N}(x_0)$, the neighbourhood of $x_0$. The parameter $\gamma$ is the curve smoothing parameter, and $n$ is the sample size.
The function $W$ is a selectable weight function and $\Delta(x_0) = \max_{x_i \in \mathbb{N}(x_0)} |x_i - x_0|$. Cleveland (1979) characterizes the function $W$ as follows: (i) $W(u) > 0$ for $|u| < 1$; (ii) $W(-u) = W(u)$; (iii) $W(u)$ is nonincreasing for $u \geq 0$; (iv) $W(u) = 0$ for $|u| \geq 1$. Researchers usually choose a value of $\gamma$ between zero and one. The magnitude of $\gamma$ needs care: a value close to zero gives an overfitted prediction with a wavy curve, while a value close to one gives a smooth but underfitted curve that omits features of the original data. Mansyur & Simamora (2022) offer a search algorithm for optimal smoothing parameters using cross-validation. The choice of the polynomial degree also determines the smoothness of the curve. Researchers usually use low-degree polynomials, where the first degree is a linear polynomial and the second degree is a quadratic polynomial. Fan & Gijbels (1960) provide a solution to equation (3),
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{W}_\gamma \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W}_\gamma \mathbf{y},$$
where $\mathbf{y} = (y_1, \ldots, y_k)^\top$ holds the responses of the $k$ points in $\mathbb{N}(x_0)$, $\mathbf{X}$ is the $k \times (p+1)$ matrix whose $i$th row is $(1, (x_i - x_0), \ldots, (x_i - x_0)^p)$, and $\mathbf{W}_\gamma$ is a diagonal matrix of size $k \times k$ whose diagonal elements are the sequence $W(|x_1 - x_0|/\Delta(x_0)), W(|x_2 - x_0|/\Delta(x_0)), \ldots, W(|x_k - x_0|/\Delta(x_0))$. By analogy with the linear regression model, the prediction of LPR at a point $x_0$ using the weighted least squares method yields
$$\hat{f}(x_0) = \hat{\beta}_0 = \mathbf{e}_1^\top (\mathbf{X}^\top \mathbf{W}_\gamma \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{W}_\gamma \mathbf{y},$$
where $\mathbf{e}_1 = (1, 0, \ldots, 0)^\top$. Readers interested in studying more about the weighted least squares method can consult Draper & Smith (1998).
b. Paired and Residual Bootstrapping
Efron & Tibshirani (1994) present two resampling methods, the paired and residual bootstrap, for a linear regression model. They pose an open problem on page 113: which is better, paired or residual bootstrapping? The answer depends on how far we trust the linear regression model.
We draw four interesting points from the analysis by Efron & Tibshirani (1994) of the percentile regression simulation on the cholestyramine data, namely:
1) Paired bootstrapping is slightly more sensitive than residual bootstrapping;
2) The error and mean of the response for paired bootstrapping do not depend on the covariates of the original data. The reason is that the covariates are random, unlike in residual bootstrapping, where the covariate structure is unchanged;
3) The residual bootstrap has the same suitability as the original data;
4) The bootstrap method does not provide a unique conclusion for the particular concept.
Efron & Tibshirani (1994) claim that the two methods are asymptotically equivalent, that is, as the sample size tends to infinity. The difference appears if the sample size is relatively small. Chernick & LaBudde (2014) also review these two methods. Unfortunately, that work offers no further insight into the bootstrapping process and focuses only on algorithms and coding in the R programming language. Both sources indicate that the difference lies only in the resampling scheme, which we summarise further. Suppose the linear regression model is $y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i$, where $(\mathbf{x}_i, y_i)$ is an ordered pair of a covariate vector of size $p \times 1$ and a response. The difference between the resampling schemes of the two methods is as follows.
1) Paired bootstrap takes a simulated sample (bootstrap sample) from the original data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ independently with replacement. Each original data point has an equal chance, $1/n$, of being taken as a sample point. The resampling process therefore allows an original data point to appear two or more times in a bootstrap sample.
2) Residual bootstrap first fits the model to the original data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ to get $\hat{y}_i = \mathbf{x}_i^\top \hat{\boldsymbol{\beta}}$. It then calculates each residual $\hat{\varepsilon}_i = y_i - \hat{y}_i$, which gives the residual vector $\hat{\boldsymbol{\varepsilon}} = (\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_n)$, and sets $y_i^* = \hat{y}_i + \hat{\varepsilon}_i^*$, where $\hat{\varepsilon}_i^*$ is drawn from the elements of $\hat{\boldsymbol{\varepsilon}}$ independently with replacement. The probability that each $\hat{\varepsilon}_i \in \hat{\boldsymbol{\varepsilon}}$ is drawn as a bootstrap sample point $\hat{\varepsilon}^*$ is the same, i.e. $1/n$.
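The two resampling schemes described above can be illustrated with a minimal Python sketch. It assumes NumPy and a simple straight-line working model; the function names `paired_resample` and `residual_resample`, the toy data, and the random seed are hypothetical choices made for illustration and are not taken from the cited literature.

```python
import numpy as np


def paired_resample(x, y, rng):
    """Draw n (x_i, y_i) pairs with replacement; each pair has probability 1/n."""
    n = len(y)
    idx = rng.integers(0, n, size=n)
    return x[idx], y[idx]


def residual_resample(x, y_fitted, residuals, rng):
    """Keep the covariates fixed and add residuals drawn with replacement to the fit."""
    n = len(y_fitted)
    idx = rng.integers(0, n, size=n)
    return x, y_fitted + residuals[idx]


rng = np.random.default_rng(0)

# Hypothetical toy data: a straight line plus normal noise.
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

# Ordinary least squares fit of the working model y = b0 + b1 * x.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fitted = X @ beta_hat
residuals = y - y_fitted

x_pair, y_pair = paired_resample(x, y, rng)                      # covariates vary
x_res, y_res = residual_resample(x, y_fitted, residuals, rng)    # covariates fixed
```

In the paired scheme the covariate values themselves change from one bootstrap sample to the next, whereas in the residual scheme `x_res` is always identical to `x` and only the responses change.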

C. RESULT AND DISCUSSION
The design of the independent variable x and the dependent variable y in the simulation is the same for both algorithms. The independent variable (covariate) x follows an equidistant design with $x_{\min} = 0$ and $x_{\max} = 2\pi$, so the sample size determines the spacing between adjacent points in the observation domain. The dependent variable (response) y comes from the trigonometric function $f(x) = \sin 2x$ plus an error that is normally distributed with mean μ = 0 and standard deviation σ = 0.2. In addition, we need to choose a weight function; researchers such as Cleveland (1979), Cleveland & Grosse (1991), and Cleveland et al. (1988) generally use the tricube weight function,
$$W(u) = \begin{cases} (1 - |u|^3)^3, & \text{for } |u| < 1, \\ 0, & \text{for } |u| \geq 1. \end{cases}$$
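As an illustration of the simulation design and of the local fit with tricube weights, the following Python sketch generates one sample and evaluates LPR-1 and LPR-2 at the design points. It is a sketch under stated assumptions rather than the code used for the reported results: the test function is taken to be sin(2x), the neighbourhood is taken as the k = floor(γn) nearest points, and the names `tricube` and `lpr_predict` as well as the random seed are hypothetical.

```python
import numpy as np


def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)


def lpr_predict(x0, x, y, gamma=0.5, degree=1):
    """Local polynomial prediction at x0 from the k = floor(gamma * n) nearest points."""
    k = max(int(np.floor(gamma * len(x))), degree + 2)
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]
    delta = dist[nearest].max()            # distance to the farthest neighbour
    if delta == 0.0:                       # all neighbours coincide with x0
        return float(np.mean(y[nearest]))
    sw = np.sqrt(tricube(dist[nearest] / delta))
    # Design matrix with columns 1, (x - x0), ..., (x - x0)^degree.
    X = np.vander(x[nearest] - x0, N=degree + 1, increasing=True)
    # Weighted least squares via ordinary least squares on sqrt-weighted rows.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[nearest] * sw, rcond=None)
    return float(beta[0])                  # the intercept is the prediction at x0


rng = np.random.default_rng(1)
n = 100
x = np.linspace(0.0, 2.0 * np.pi, n)                  # equidistant design on [0, 2*pi]
y = np.sin(2.0 * x) + rng.normal(0.0, 0.2, size=n)    # assumed test function plus N(0, 0.2^2) noise

fit1 = np.array([lpr_predict(x0, x, y, gamma=0.5, degree=1) for x0 in x])  # LPR-1
fit2 = np.array([lpr_predict(x0, x, y, gamma=0.5, degree=2) for x0 in x])  # LPR-2
```

Because the farthest of the k neighbours sits exactly at scaled distance one, it receives weight zero, which is why the sketch keeps at least `degree + 2` neighbours.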
The smoothing of the LPR curve uses two parameters: γ as the smoothing parameter and p as the degree of the LPR. Researchers may choose the magnitude of γ freely, provided that it is neither too close to zero nor too close to one, or may search for the optimal value using cross-validation. The simulation uses two alternatives to determine the smoothing parameter. The first is to select the value γ = 0.5, and the second is to use the optimal γ search algorithm of Mansyur & Simamora (2022). The goal is to analyse whether using the optimal γ makes a difference. The simulation considers low-degree polynomials, namely p = 1 and p = 2. The interval construction uses a 95% confidence level, that is, a significance level of α = 5%. Figure 2 shows the simulation result for the first algorithm with a sample size of n = 100, B = 1000 bootstrap samples, and a smoothing parameter γ = 0.5. Figure 2(a) is a scatterplot of CI-Paired and LPR-1, in which the upper boundary (green curve) and lower boundary (blue curve) of CI-Paired are jagged or wavy. The LPR-1 curve (black curve) is very far from the curvature of the trigonometric function $f(x) = \sin 2x$. The CI-Paired coverage probability of LPR-1 is 0.93, which is close to the nominal coverage. In contrast, Figure 2(b), which shows CI-Paired based on LPR-2, has smoother curves whose curvature follows the trigonometric function $f(x) = \sin 2x$. The region enclosed by the LPR-1 bounds is wider than that of LPR-2; as a result, the CI-Paired band of LPR-1 is broader than that of LPR-2. The CI-Paired coverage probability of LPR-2 is the same as the nominal coverage, as shown in Figure 2. Efron & Tibshirani (1994, p. 47) point out that obtaining an ideal estimator requires increasing the number of bootstrap samples. We therefore increase the number of bootstrap samples to, say, B = 10000, keeping in mind that a larger B makes the simulation more expensive.
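To make the roles of B, α, and the paired resampling concrete, the following Python sketch builds a plain pointwise percentile band from paired-bootstrap refits of the LPR curve. This is a simplified stand-in and not the nested, pivot-based procedure of the two proposed algorithms; the test function sin(2x), the evaluation grid, the helper names, and the seed are assumptions made for illustration, and B is kept small only to keep the run short.

```python
import numpy as np


def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)


def lpr_predict(x0, x, y, gamma, degree):
    """Local polynomial prediction at x0 from the k = floor(gamma * n) nearest points."""
    k = max(int(np.floor(gamma * len(x))), degree + 2)
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]
    delta = dist[nearest].max()
    if delta == 0.0:  # all neighbours coincide with x0 (possible after resampling)
        return float(np.mean(y[nearest]))
    sw = np.sqrt(tricube(dist[nearest] / delta))
    X = np.vander(x[nearest] - x0, N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[nearest] * sw, rcond=None)
    return float(beta[0])


def paired_percentile_band(x, y, grid, gamma, degree, B, alpha, rng):
    """Pointwise percentile band from B paired-bootstrap refits evaluated on `grid`."""
    n = len(x)
    boot = np.empty((B, len(grid)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample (x_i, y_i) pairs with replacement
        xb, yb = x[idx], y[idx]
        boot[b] = [lpr_predict(x0, xb, yb, gamma, degree) for x0 in grid]
    lower = np.quantile(boot, alpha / 2.0, axis=0)
    upper = np.quantile(boot, 1.0 - alpha / 2.0, axis=0)
    return lower, upper


rng = np.random.default_rng(2)
n = 100
x = np.linspace(0.0, 2.0 * np.pi, n)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.2, size=n)   # assumed test function
grid = np.linspace(0.0, 2.0 * np.pi, 25)             # hypothetical evaluation grid

# B = 200 for speed; the simulations in the text use B = 1000 and B = 10000.
lower, upper = paired_percentile_band(
    x, y, grid, gamma=0.5, degree=2, B=200, alpha=0.05, rng=rng
)
```

With a small B the empirical 2.5% and 97.5% quantiles are themselves noisy, which is one way to see why the band boundaries look jagged until B is increased.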
Figure 3 shows a simulation under the same conditions as Figure 2, differing only in the number of bootstrap samples. The simulation results show that Figure 3 gives smoother curves than Figure 2. The curve features do not change, but the coverage probability of CI-Paired from LPR-2 becomes 0.96, as shown in Figure 3. Using the same sample design and conditions, we apply the second algorithm to the simulation with B = 10000 bootstrap samples. Figure 4 shows that there is no significant difference from the conclusions of the first algorithm. The coverage probability of the CI-Residual of LPR-1 is the same as the nominal coverage, while that of LPR-2 is 0.96. The simulation next applies the search for the optimal smoothing parameter from the algorithm of Mansyur & Simamora (2022), with the same sample design as above. Mansyur & Simamora (2022) used the cross-validation function to get the optimal γ value. The cross-validation function uses the formula
$$\mathrm{CV}(\gamma) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{f}_{-i}(x_i) \big)^2,$$
where $\hat{f}_{-i}(x_i)$ is the LPR prediction at the point $x_i$ computed with the observation $y(x_i)$ removed from the original data. The simulation gives $\gamma_{\text{Optimal}} = 0.09$ for LPR-1 with $\mathrm{CV}(\gamma_{\text{Optimal}}) = 0.0499$ and $\gamma_{\text{Optimal}} = 0.25$ for LPR-2 with $\mathrm{CV}(\gamma_{\text{Optimal}}) = 0.049$. Because LPR-2 gives the smaller CV value, we use $\gamma_{\text{Optimal}} = 0.25$, as shown in Figure 5. Figure 6 shows the simulation result of the paired percentile bootstrap interval algorithm, which applies $\gamma_{\text{Optimal}} = 0.25$ to both LPR-1 and LPR-2 with B = 10000 bootstrap samples. The scatterplot shows smooth LPR-1 and LPR-2 curves that follow the curvature of the trigonometric function $f(x) = \sin 2x$. However, the CI-Paired band from LPR-2 is narrower than that from LPR-1. The coverage probability of CI-Paired is 0.945 for LPR-1 and 0.93 for LPR-2.
Figure 7 shows the simulation result for the residual bootstrap, which leads to the same conclusion as Figure 6 but with different coverage probabilities. The coverage probability of CI-Residual is 0.95 for LPR-1 and 0.93 for LPR-2, as shown in Figure 7.
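The cross-validation choice of the smoothing parameter used in this section can be sketched in Python as a leave-one-out search over a grid of candidate γ values. The search algorithm of Mansyur & Simamora (2022) is not reproduced here; a plain grid search is substituted, the score is taken as the mean squared leave-one-out prediction error, and the test function sin(2x), the candidate grid, the helper names, and the seed are assumptions made for illustration.

```python
import numpy as np


def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = np.abs(u)
    return np.where(u < 1.0, (1.0 - u**3) ** 3, 0.0)


def lpr_predict(x0, x, y, gamma, degree):
    """Local polynomial prediction at x0 from the k = floor(gamma * n) nearest points."""
    k = max(int(np.floor(gamma * len(x))), degree + 2)
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]
    delta = dist[nearest].max()
    if delta == 0.0:
        return float(np.mean(y[nearest]))
    sw = np.sqrt(tricube(dist[nearest] / delta))
    X = np.vander(x[nearest] - x0, N=degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[nearest] * sw, rcond=None)
    return float(beta[0])


def loo_cv(x, y, gamma, degree):
    """Mean squared error of predicting each y_i from the data without (x_i, y_i)."""
    n = len(x)
    sq_err = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i           # drop the i-th observation
        pred = lpr_predict(x[i], x[keep], y[keep], gamma, degree)
        sq_err[i] = (y[i] - pred) ** 2
    return float(np.mean(sq_err))


rng = np.random.default_rng(3)
n = 100
x = np.linspace(0.0, 2.0 * np.pi, n)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.2, size=n)   # assumed test function

gammas = np.linspace(0.1, 0.9, 9)                     # hypothetical candidate grid
cv1 = np.array([loo_cv(x, y, g, degree=1) for g in gammas])
cv2 = np.array([loo_cv(x, y, g, degree=2) for g in gammas])
gamma_opt1 = gammas[np.argmin(cv1)]                   # minimiser for LPR-1
gamma_opt2 = gammas[np.argmin(cv2)]                   # minimiser for LPR-2
```

The minimisers depend on the generated sample and on the candidate grid, so they need not match the values reported in the text.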

D. CONCLUSION AND SUGGESTIONS
Nonparametric regression models generally require a sufficiently large sample size to capture curve features. The two new algorithms work well at relatively small sample sizes. However, first-degree polynomial regression with the smoothing parameter fixed at γ = 0.5 cannot characterise a sample generated from a particular function. The band from first-degree polynomial regression is wider than that from second-degree polynomial regression, yet there is no guarantee that its coverage probability equals the nominal coverage. In contrast, second-degree polynomial regression can characterise the behaviour of data generated from a particular function, and its coverage probability is close to the nominal coverage probability.
The smoothness of the curves is also affected by the number of bootstrap samples. If the number of bootstrap samples is relatively small, the curves are more jagged and wavy, especially for first-degree polynomial regression, whereas the second-degree polynomial gives smoother curves even when the number of bootstrap samples is relatively small. Increasing the number of bootstrap samples only smooths the curves and does not change their behaviour, which is analogous to the conclusion of Gultom et al. (2022).
The scatterplot shows that applying the optimal smoothing parameter to the local polynomial regression model improves performance. Both local polynomial regressions can capture the curve features of a sample generated from a particular function. The band thickness of the first-degree polynomial is more proportional than that of the second-degree polynomial. The second-degree polynomial regression band tends to be narrower than the first-degree polynomial regression band for both algorithms. The coverage probabilities of the two algorithms are not significantly different. However, the coverage probability of the first-degree polynomial is closer to the nominal coverage than that of the second-degree polynomial.
The simulation results indicate that the bootstrap method can improve the performance of complex and sensitive statistics when certain assumptions are not met. With the optimal smoothing parameter, the two algorithms do not differ significantly, and the two local polynomial regressions do not differ much either. This counters the conclusion on the open question in Efron & Tibshirani (1994), which concludes that paired bootstrapping has few disadvantages compared to residual bootstrapping.
We provide some suggestions for readers who wish to continue this work. Readers may be interested in determining the BCa bootstrap confidence interval in a local polynomial regression model by adapting an existing procedure. Readers may also be interested in comparative studies, for example, between the bootstrap-t interval method and the bootstrap percentile method, to determine which interval is better. Another study that may be more interesting is the application of the wild bootstrap method to local polynomial regression prediction intervals where heteroscedasticity is present. In addition, readers can examine other topics that combine the local polynomial regression concept with the bootstrap concept, building on this article. The priority for further research on bootstrap methods is no longer about polynomial degrees or smoothing.