Best Architecture Recommendations of ANN Backpropagation Based on Combination of Learning Rate, Momentum, and Number of Hidden Layers

ABSTRACT


A. INTRODUCTION
Building an ANN Backpropagation architecture that yields good training, testing, and prediction results is not easy, because the architecture chosen affects the accuracy of the network. The selection of learning rate and momentum values helps accelerate network performance when learning from large amounts of data (Sutskever et al., 2013). This is in accordance with Moreira & Fiesler (1995) and Yu & Chen (1997), who found that learning rate and momentum can reduce training time. Hao et al. (2021) also explained that the selection of learning rate and momentum affects the speed of the training and testing processes. The learning rate is one of the important parameters in the training process; it scales the weight correction between layers in the architecture. It takes values in the interval from 0 to 1 (Smith & Topin, 2019). Meanwhile, the momentum parameter is used in the weight update so that the training process completes more quickly.
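The roles of the two parameters can be illustrated with a minimal sketch of the classical momentum update, in which the learning rate scales the gradient-based weight correction and the momentum term carries over part of the previous correction. The function names, the toy loss f(w) = w², and the default values are illustrative assumptions, not taken from any of the cited studies.

```python
def momentum_step(weight, gradient, velocity, learning_rate=0.1, momentum=0.9):
    """Return (new_weight, new_velocity) after one gradient step with momentum."""
    # The learning rate scales the raw weight correction from the gradient;
    # the momentum term re-applies a fraction of the previous correction.
    new_velocity = momentum * velocity - learning_rate * gradient
    new_weight = weight + new_velocity
    return new_weight, new_velocity


def minimize_quadratic(learning_rate, momentum, steps=50):
    """Minimize the toy loss f(w) = w**2 starting from w = 1.0 (illustrative only)."""
    weight, velocity = 1.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * weight  # derivative of w**2
        weight, velocity = momentum_step(
            weight, gradient, velocity, learning_rate, momentum
        )
    return weight
```

Calling `minimize_quadratic` with different `learning_rate` and `momentum` pairs gives a quick feel for how the two values interact on a simple problem; behaviour on real data depends on the data set and architecture.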
Researchers have recommended learning rate and momentum values that vary according to the type and pattern of the data. A small learning rate slows down the training process, whereas a large learning rate (close to 1) speeds it up. The combination of learning rate and momentum needs to be chosen carefully because it affects the speed of the training process and, in turn, the number of iterations carried out by the network (Ch & Mathur, 2012). The combination of learning rate and momentum has been studied extensively. Learning rates of more than 0.5 have been used, for example, for climate change prediction (Syaharuddin et al., 2021), oil production prediction (Aizenberg et al., 2016), prediction of diabetes mellitus (Jayalakshmi & Santhakumaran, 2011), and a card classification problem (Abdul Hamid et al., 2011). When making climate change predictions, Syaharuddin et al. (2021) used a learning rate of 0.9, so that the training process ran quickly, with an MSE value of 7.48539. This error value is relatively high, which indicates poor accuracy. A similar pattern appears in the research of Jayalakshmi & Santhakumaran (2011), who obtained a coefficient of determination of 0.726 when predicting diabetes mellitus.
On the other hand, a momentum value of 0.9 was used to predict river flow (Ghorbani et al., 2016) and to classify breast tumors in ultrasound imaging (Singh et al., 2015). A momentum value of 0.8 was used to predict graduation success (Lesinski et al., 2016), to classify thyroid disease (Rehman & Nawi, 2012), and to predict earthquakes (Moustra et al., 2011). The accuracy rate of these experiments with a momentum value of more than 0.6 reached 92.3%. In addition to the combination of learning rate and momentum, hidden layers are also necessary in the training process so that the architecture is able to recognize data patterns and reduce errors. Adding hidden layers affects the number of iterations or maximum epochs required. However, the more epochs there are, the better the data recognition rate becomes (Solanki & Jethva, 2013). Therefore, a broader study is needed on the combined use of learning rate, momentum, and number of neurons in the hidden layer, whether with one hidden layer or more. Architectures with one hidden layer have been widely used (Zhang et al., 2020; Nawi et al., 2017; Bai et al., 2016; Gowda & Mayya, 2014). Meanwhile, architectures with two hidden layers have been used by Irawan et al. (2013) to predict hydroclimatology data with a 744-100-10-1 architecture, and by Singh et al. (2015) to classify breast tumors in ultrasound imaging with a 50-20-1-1 architecture.
From this explanation, it is important to conduct an in-depth analysis of the results of studies that experiment with combinations of learning rate, momentum, and the number of neurons in the input layer and hidden layer, because every study claims that its architecture is the optimal one with the highest degree of accuracy. Hence, the purpose of this study is to determine the accuracy level of architectures based on the learning rate and momentum values used, and to compare the accuracy of architectures based on the number of neurons in the hidden layer. The results of this study are expected to provide recommendations on good parameters for training, testing, or predicting data.

B. METHODS
1. Architecture of ANN Backpropagation
Generally, the ANN Backpropagation architecture consists of three layers, namely the input layer, hidden layer, and output layer (Haviluddin & Alfred, 2016; Karsoliya, 2012). These layers work together to pass the input values through the activation function and the training function at each layer to produce an output close to the target value. In this study, the values of learning rate, momentum, and hidden layer are analyzed based on the results of research that has been done, as shown in Figure 1 (Fausett, 1994).
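How the three layers produce an output can be sketched as a plain forward pass. This is a minimal illustration only: the sigmoid activation, the 2-3-1 layer sizes, the all-zero parameters, and every function name are assumptions chosen for the example, since the activation and training functions differ between the studies analyzed.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute one layer: each neuron applies sigmoid(weighted sum + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Propagate inputs through a list of (weights, biases) layers in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical 2-3-1 network (2 inputs, 3 hidden neurons, 1 output)
# with all-zero parameters, used only to show the data flow.
hidden = ([[0.0, 0.0]] * 3, [0.0] * 3)
output = ([[0.0, 0.0, 0.0]], [0.0])
prediction = forward([0.5, -1.0], [hidden, output])
```

Backpropagation then compares `prediction` with the target value and adjusts the weights layer by layer; that step is omitted here to keep the sketch short.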

2. Research Procedures
This study is a meta-analysis that examines in more depth the various intervals of learning rate, momentum, and number of hidden layers in the ANN Backpropagation architectures recommended by researchers. The research procedure, kept simple and tailored to the needs of the study, is presented in Figure 2.

Figure 2. Research procedure
Selection & Tabulation Data: based on eligibility criteria, including year of issue, author name, architecture, number of data, learning rate, momentum, and accuracy level.
Data Analysis & Hypothesis Test: classification of data based on learning rate intervals (0-1), momentum intervals (0-1), and the number of hidden layers (input layer & hidden layer), followed by hypothesis tests for publication bias and for the level of accuracy of each case.
Interpretation & Conclusion: categorization of accuracy levels based on the intervals arranged, and conclusions drawn from the results of the data analysis.

a. Data Selection and Tabulation
Data sources are articles published in 2011-2021 from indexing databases such as Scopus, ScienceDirect, and Google Scholar on forecasting topics using ANN Backpropagation. The components searched and tabulated from the filtered articles include the year of issue, author name, country, type of prediction data, architecture (input-hidden-output layer), amount of data (N), coefficient of determination (R²), learning rate (LR), and momentum value. For the data that match the criteria, the effect size (ES) and summary effect (SE) are determined using an equation in which i indexes the data (1, 2, 3, …, N) and N is the amount of data in each case.
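As a rough illustration of how a reported R² and sample size can be turned into meta-analysis inputs, the sketch below uses the Fisher z-transform and its approximate standard error, a common convention for correlation-type effect sizes. This is an assumption made for illustration and may differ from the equation used in this study; the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient r in (-1, 1)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def fisher_z_standard_error(n):
    """Approximate standard error of Fisher z for a sample of size n."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    return 1.0 / math.sqrt(n - 3)


def effect_from_r_squared(r_squared, n):
    """Convert a reported R^2 and sample size into (effect size, standard error)."""
    r = math.sqrt(r_squared)  # assumes a non-negative correlation
    return fisher_z(r), fisher_z_standard_error(n)
```

Under this convention, larger samples give smaller standard errors and therefore more weight in the pooled estimate.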

b. Hypothesis Test
This study tests two things, namely (1) publication bias arising from the amount of data used in this study, and (2) differences in accuracy levels based on the combination of learning rate, momentum, and number of hidden layers. Publication bias was assessed by the criterion that if the p-value of the rank test is greater than 0.001 (p-value > 0.001), then the data used in this study do not indicate bias. In addition, it can also be assessed by the Rosenthal (1979) criterion, namely 5k + 10 < N_R, where k is the amount of data and N_R is the Fail-Safe N value. Furthermore, the coefficient of determination value that correlates with the random effect (RE) value was categorized according to intervals, namely very weak (0.00-0.199), weak (0.20-0.399), moderately high (0.40-0.599), high (0.60-0.799), and very high (0.80-1.00).
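Both decision rules in this subsection are simple enough to express directly. The sketch below checks Rosenthal's tolerance rule, commonly stated as Fail-Safe N > 5k + 10, and maps a coefficient to the five categories listed above. The function names are illustrative, and treating each category as a half-open interval (for example, 0.20 up to but not including 0.40) is an assumption about how boundary values are handled.

```python
def passes_fail_safe(fail_safe_n, k):
    """Rosenthal's tolerance rule: Fail-Safe N must exceed 5k + 10."""
    return fail_safe_n > 5 * k + 10


def categorize(value):
    """Map a coefficient in [0, 1] to the strength categories of this study."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    if value < 0.20:
        return "very weak"
    if value < 0.40:
        return "weak"
    if value < 0.60:
        return "moderately high"
    if value < 0.80:
        return "high"
    return "very high"
```

For example, with k = 45 studies the threshold is 5 × 45 + 10 = 235, so a Fail-Safe N of 236 or more would satisfy the rule.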

C. RESULT AND DISCUSSION
1. Data Selection Results
A search of the indexing databases Scopus, ScienceDirect, and Google Scholar found 79 data. These data were re-examined against the eligibility criteria (Figure 2), which yielded 45 data that met the criteria for the number of hidden layers, 44 data that met the learning rate criteria, and 30 data that met the momentum criteria. More complete details of the data can be seen in Figure 3. Data are incomplete when the amount of data (N) or the coefficient of determination (R²) is not reported. Incomplete data are not used in the next stage, namely the conversion of the accuracy rate into ES and SE values. Furthermore, the selected data are tabulated according to their respective test criteria. The complete learning rate (LR) data are divided into five intervals, namely LR < 0.1; 0.1-0.2; 0.3-0.4; 0.5-0.6; > 0.6. The complete momentum data are divided into three intervals, namely 0.1-0.3; 0.4-0.6; 0.7-0.9. Finally, the architectures are divided into two types, namely those in which the number of neurons in the input layer is greater than the number in the hidden layer (LI > LH) and those in which it is smaller (LI < LH).
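The tabulation described above can be sketched as three small classification helpers. The interval labels follow the text; returning `None` for values that fall between the stated intervals (such as a learning rate of 0.25) or for LI = LH, and comparing the input layer with the first hidden layer in multi-layer architectures, are assumptions made for this example. All function names are hypothetical.

```python
def learning_rate_interval(lr):
    """Return the learning rate interval label, or None if no interval applies."""
    if lr < 0.1:
        return "< 0.1"
    if lr <= 0.2:
        return "0.1-0.2"
    if 0.3 <= lr <= 0.4:
        return "0.3-0.4"
    if 0.5 <= lr <= 0.6:
        return "0.5-0.6"
    if lr > 0.6:
        return "> 0.6"
    return None  # value lies in a gap between the stated intervals


def momentum_interval(momentum):
    """Return the momentum interval label, or None if no interval applies."""
    for low, high in ((0.1, 0.3), (0.4, 0.6), (0.7, 0.9)):
        if low <= momentum <= high:
            return f"{low}-{high}"
    return None


def architecture_type(architecture):
    """Classify an 'input-hidden-...-output' string as 'LI > LH' or 'LI < LH'."""
    sizes = [int(part) for part in architecture.split("-")]
    if len(sizes) < 3:
        raise ValueError("expected at least input, hidden, and output sizes")
    n_input, n_hidden = sizes[0], sizes[1]  # first hidden layer (assumption)
    if n_input > n_hidden:
        return "LI > LH"
    if n_input < n_hidden:
        return "LI < LH"
    return None  # equal sizes are not covered by the two categories
```

Applied to architectures mentioned in this paper, `architecture_type("50-20-1")` and `architecture_type("744-100-10-1")` both fall in the LI > LH group.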

2. Publication Bias Test (H1)
In this study, the researchers used JASP software to perform the data analysis. The publication bias test was carried out to check whether the amount of data used is adequate for the results of the study to be generalized. The JASP output is shown in Table 1 and Figure 4. Table 1 shows that, in general, ANN Backpropagation architectures that use the learning rate parameter during training are able to increase the accuracy or performance of the architecture by 93%, those that use momentum by 91%, and those that use hidden layers by up to 92%. This shows that the use of learning rate, momentum, and hidden layers, whether one layer or two, improves the performance of the architecture. These results are in accordance with previous studies. Furthermore, the p-values of the rank test were 0.491 > 0.001 for learning rate, 0.198 > 0.001 for momentum, and 0.822 > 0.001 for the number of hidden layers. It can therefore be said that the data used in this study are sufficient according to minimum standards and can be generalized. Figure 4 also shows that the data are well distributed, with no evidence of publication bias.

3. Accuracy Level of Each Parameter (H2)
At this stage, the researchers divided the data according to the value intervals of each parameter, in order to see the accuracy level of the architecture for the learning rate, momentum, and number of hidden-layer neurons recommended by other researchers. The results of the data analysis are presented in Table 2. Table 2 shows that a learning rate in the interval 0.1-0.2 provides higher accuracy than a learning rate below 0.1 or above 0.2, with an estimated correlation coefficient of 0.938 (very high category). These results show that the larger the learning rate, the faster the training process runs, but the lower the accuracy of the network. Conversely, the smaller the learning rate, the higher the accuracy of the network, but the longer the training process takes. A momentum value in the interval 0.7-0.9 provides the highest accuracy compared with momentum values below 0.7, with an estimated correlation coefficient of 0.925 (very high category). This result is in accordance with the research of Mislan et al. (2015), who predicted rainfall with a 50-20-1 architecture and obtained an MSE of 0.00096341. Furthermore, for the number of neurons in the hidden layer, the estimated correlation coefficient was 0.932 for the category in which the number of neurons in the input layer is smaller than the number of neurons in the hidden layer. This is in accordance with the results of Hamid et al. (2011) on the recognition of glass fragments encountered in forensic work.
Thus, using a learning rate of 0.1, a momentum of 0.9, and a number of hidden-layer neurons (LH) greater than the number of input-layer neurons (LI) can improve architectural performance by 88%-99% (Sun & Huang, 2020; Baldi et al., 2018; Tarigan et al., 2017; Lesinski et al., 2016; Mislan et al., 2015).

D. CONCLUSION AND SUGGESTIONS
A good architecture cannot be separated from an optimal combination of parameters. Therefore, researchers continue to experiment with combinations of learning rate, momentum, and number of hidden layers to find a reliable architecture for various cases. The results of the data analysis show that a learning rate in the interval 0.1-0.2 and a momentum in the interval 0.7-0.9 provide a faster training process and high accuracy. Furthermore, consistent with the result for the LI < LH category, the authors recommend architectures that use a hidden layer in which the number of neurons in the input layer is smaller than the number of neurons in the hidden layer. The use of more than one hidden layer also has a good impact, because the network is able to recognize data patterns better during the training process.