Cluster Analysis of Environmental Pollution in Indonesia Using Complete Linkage Method with Elbow Optimization

ABSTRACT


A. INTRODUCTION
Humans interact closely with their surroundings, yet they often pay little attention to the effect of their activities on ecosystems. Pollution and environmental degradation are significant results of this interaction (Sipayung et al., 2020). According to Statistics Indonesia, 5,644 villages in Indonesia have air pollution, 1,499 have soil pollution, and 10,683 have water pollution (Statistik, n.d.). This pollution affects both short-term and long-term environmental sustainability. Short-term impacts include visible harm to the environment, while long-term consequences, such as ecosystem disturbance and global warming, are more significant. Controlling environmental pollution in Indonesia requires the attention of the community and especially of the government. To support these efforts, research that clusters provinces with similar pollution conditions is needed so that the results can inform the government's choice of policies. Cluster analysis is one approach that can be used for this purpose (Dinata & Syaputra, 2020). Clustering methods can generate more realistic categorization outputs for datasets in a wide range of fields, such as physical science, biological science, psychology, industry, and text documents, and they are a popular research topic in data mining (Chen et al., 2015).
Cluster analysis is a method for grouping items or data into smaller clusters based on their similarity. Cluster analysis methods fall into two types: non-hierarchical and hierarchical (Drews et al., 2019). Single linkage, complete linkage, average linkage, and Ward's method are examples of hierarchical methods (Widyawati et al., 2020), while the K-Means algorithm is a non-hierarchical method (Ramadhani et al., 2018). This study focuses on the hierarchical approach, specifically complete linkage, because hierarchical methods process data more quickly and produce output in the form of levels, or a hierarchy, which is easy to interpret (Hidayat, n.d.).
Previous studies have shown the advantages of complete linkage. In a study that grouped subdistricts in Sidoarjo Regency by livestock yield potential, complete linkage had the smallest standard deviation ratio, 0.222, which identified it as the best technique (Mu'afa & Ulinnuha, 2019). Another study, which sought the ideal clusters for the Human Development Index indicators in West Kalimantan using single linkage, complete linkage, and average linkage, also indicated the benefits of complete linkage: assessed with valley-tracing, complete linkage yielded the highest accuracy, 0.976, with five ideal clusters (Hendra Perdana, Nur Asiska, 2019). Similarly, a study on clustering news articles compared four linkage approaches, namely single linkage, complete linkage, average linkage, and average linkage-group, and found complete linkage to be the best approach because it had the highest average purity, 0.888 and 0.938 (Wibisono & Khodra, 2018).
These studies indicate that complete linkage performed better than the other methods compared. Complete linkage produces compact clusters with high precision (Nabiilah Ardini Fauziyyah & Sholikhah, 2021). Based on the preceding discussion, this research clusters the provinces in Indonesia that suffer from environmental pollution using the complete linkage method, so that the results can serve as reference material and recommendations for the government in reducing environmental pollution.

B. METHODS
The data in this research are the number of villages in Indonesia with environmental pollution, per province. The data were gathered from the official website of Statistics Indonesia. The variables are water pollution (X1), soil pollution (X2), and air pollution (X3). The analysis uses a hierarchical clustering method with complete linkage. The steps in this study are described in the flowchart in Figure 1 and are explained below.

Collecting Data
This research used secondary data gathered from Statistics Indonesia. The data are presented in Table 1.

Elbow Method
The Elbow method determines the number of clusters to use by comparing a measure of fit across candidate numbers of clusters. For each clustering result, it calculates the SSE (Sum of Squared Errors) (Muningsih & Kiswati, 2018). The SSE is defined as follows.
$$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \quad (1)$$
where $k$ is the number of clusters, $i$ is the data index, $j$ is the cluster index, $n$ is the number of data, $x_i^{(j)}$ is data $i$ in cluster $j$, and $c_j$ is the centroid, or average, of the data in cluster $j$. While k is below the optimal number of clusters, SSE drops sharply as k increases. Once k reaches the optimal number, the decrease slows abruptly and SSE stays nearly flat as k continues to grow. The curve of SSE against k therefore has the shape of an elbow, and the value of k at the elbow indicates the real number of clusters in the data (Liu & Deng, 2021).
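The SSE calculation can be illustrated with a minimal Python sketch. The data points and the three partitions below are invented for illustration and are not the Table 1 values; the function names `centroid` and `sse` are likewise hypothetical, and the partitions are supplied by hand rather than produced by a clustering algorithm.

```python
# Toy data: each row is a hypothetical object with (water, soil, air) values.
# These numbers are assumptions made for illustration only.
data = [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0), (10.0, 10.0, 10.0), (11.0, 10.0, 10.0)]


def centroid(points):
    """Average of each variable over the points of one cluster."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sse(clusters):
    """Sum, over all clusters, of squared Euclidean distances to the centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((p[d] - c[d]) ** 2 for d in range(len(p)))
    return total


# Hand-made partitions of the same data for k = 1, 2, 3 (assumed, not computed).
partitions = {
    1: [data],
    2: [data[:2], data[2:]],
    3: [data[:2], [data[2]], [data[3]]],
}

sse_by_k = {k: sse(clusters) for k, clusters in partitions.items()}
for k, value in sse_by_k.items():
    print(k, value)
```

In this toy case the SSE falls steeply from k = 1 to k = 2 and changes very little afterwards, which is the elbow pattern described in the text.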

Silhouette Coefficient
The Silhouette Coefficient measures how well each object fits its cluster by comparing the proximity of the object to the other objects in its own cluster with its proximity to the objects in other clusters. Its value lies between -1 and 1. A value of 1 means the object has been appropriately clustered, and a value of -1 means it has not (Ogbuabor & F. N, 2018). The Silhouette Coefficient of an object $i$ is calculated from two quantities, $a(i)$ and $b(i)$, which are defined as follows (Hidayati et al., 2021):
$$a(i) = \frac{1}{|P| - 1} \sum_{u \in P,\, u \neq i} d(i, u) \quad (2)$$
$$b(i) = \min_{Q \neq P} \frac{1}{|Q|} \sum_{v \in Q} d(i, v) \quad (3)$$
where $P$ is the cluster that contains object $i$, $|P|$ is the number of data in cluster $P$, $|Q|$ is the number of data in another cluster $Q$, $u$ and $v$ are data indices, and $d(i, u)$ is the distance between data $i$ and data $u$. The Silhouette Coefficient is then calculated with the equation below (Widyawati et al., 2020):
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \quad (4)$$
where $s(i)$ is the Silhouette Coefficient of object $i$, $a(i)$ is the average distance from object $i$ to all other objects within the same cluster, and $b(i)$ is the minimum of the mean distances between object $i$ and the objects of each other cluster. The interpretation of Silhouette Coefficient values presented in Table 2 (Swindiarto et al., 2018) is used to determine the quality of the final clustering.
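As an illustration of the Silhouette Coefficient, the following Python sketch computes the within-cluster average distance, the smallest average distance to another cluster, and the resulting coefficient for every object. The two toy clusters are invented, the function names are hypothetical, and assigning 0 to singleton clusters is a common convention assumed here rather than something stated in this paper.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two objects with the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette_values(clusters):
    """Silhouette Coefficient of every object; needs at least two clusters."""
    values = []
    for p_idx, own in enumerate(clusters):
        for i, obj in enumerate(own):
            if len(own) == 1:
                values.append(0.0)  # assumed convention for singleton clusters
                continue
            # Average distance to the other objects of the same cluster.
            a = sum(euclidean(obj, o) for j, o in enumerate(own) if j != i) / (len(own) - 1)
            # Smallest average distance to the objects of any other cluster.
            b = min(
                sum(euclidean(obj, o) for o in other) / len(other)
                for q_idx, other in enumerate(clusters)
                if q_idx != p_idx
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Two well-separated toy clusters (hypothetical values, not from Table 1).
clusters = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(10.0, 0.0, 0.0), (10.0, 1.0, 0.0)],
]
s = silhouette_values(clusters)
average_width = sum(s) / len(s)
print(s, average_width)
```

Because the two toy clusters are far apart relative to their internal spread, every coefficient is close to 1. The average of the individual coefficients is the average silhouette width used to compare candidate numbers of clusters.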

Complete Linkage
Complete linkage, in contrast to single linkage, is based on the greatest dissimilarity between objects: the proximity of two clusters is defined by the maximum distance between any two objects belonging to the separate clusters. This linkage method produces tight clusters and is less affected by outliers (Govender & Sivakumar, 2020). Measuring distance is therefore a crucial part of the clustering process (Cao et al., 2020). The first step of complete linkage is to determine the distance matrix between objects. The Euclidean distance, given below, is one way of measuring the closeness between objects (Cui, 2020):
$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \quad (5)$$
where $d(x, y)$ is the distance between objects $x$ and $y$, $x_k$ is the value of object $x$ on variable $k$, $y_k$ is the value of object $y$ on variable $k$, and $n$ is the number of variables. Second, the pair of objects separated by the shortest distance is identified from the distance matrix. Third, the distance between the newly combined cluster and each remaining cluster is obtained with the following formula (Ramadhani et al., 2018):
$$d_{(XY)A} = \max\{d_{XA}, d_{YA}\} \quad (6)$$
where $d_{XA}$ and $d_{YA}$ are the longest distances between clusters X and A and between clusters Y and A, respectively. Fourth, the distance matrix is updated based on these computations. The second through fourth steps are repeated until all objects have been collected into a single cluster, which takes one merge fewer than the number of objects.
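The four steps of complete linkage can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the five data points are invented, the function name `complete_linkage` and its parameter `k` (the number of clusters at which merging stops) are hypothetical, and ties between equally close pairs are broken by insertion order.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two objects with the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def complete_linkage(points, k):
    """Merge clusters until k remain; returns (clusters, merge history)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of objects")
    members = {i: [i] for i in range(len(points))}  # cluster id -> object indices
    # Step 1: distance matrix between all pairs of objects.
    d = {
        frozenset((a, b)): euclidean(points[a], points[b])
        for a in range(len(points))
        for b in range(a + 1, len(points))
    }
    next_id = len(points)
    merges = []
    while len(members) > k:
        # Step 2: find the pair of clusters with the smallest distance.
        pair = min(d, key=d.get)
        x, y = sorted(pair)
        merges.append((x, y, d[pair]))
        # Step 3: distance from the merged cluster to each remaining cluster
        # is the larger of the two previous distances.
        new = {
            a: max(d[frozenset((x, a))], d[frozenset((y, a))])
            for a in members
            if a not in pair
        }
        # Step 4: update the distance matrix.
        d = {key: val for key, val in d.items() if x not in key and y not in key}
        for a, val in new.items():
            d[frozenset((next_id, a))] = val
        members[next_id] = members.pop(x) + members.pop(y)
        next_id += 1
    return sorted(sorted(m) for m in members.values()), merges


# Hypothetical objects with (water, soil, air) values, not the Table 1 data.
points = [
    (1.0, 1.0, 1.0),
    (2.0, 1.0, 1.0),
    (10.0, 10.0, 10.0),
    (11.0, 10.0, 10.0),
    (12.0, 10.0, 10.0),
]
clusters, merges = complete_linkage(points, 2)
print(clusters)
```

Stopping at k = 2 corresponds to cutting the dendrogram into two clusters, while calling the function with k = 1 runs the procedure until every object is in a single cluster.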

Centroid Value
The centroid, or center point, of each cluster is calculated by averaging all data values in the cluster. The formula for the centroid is as follows (Rizal, 2013):
$$c = \frac{1}{n} \sum_{j=1}^{n} x_j \quad (7)$$
where $c$ is the centroid, $n$ is the number of data in the cluster, $j$ is the data index within the cluster, and $x_j$ is the value of data $j$ in the cluster.
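A short Python sketch of the centroid calculation is given below, applying the average separately to each variable. The two clusters and their values are invented for illustration and do not come from Table 1 or Table 4; the names `centroid`, `cluster_high`, and `cluster_low` are hypothetical.

```python
def centroid(cluster):
    """Average of each variable over all objects in one cluster."""
    n = len(cluster)
    return tuple(sum(obj[d] for obj in cluster) / n for d in range(len(cluster[0])))


# Hypothetical clusters; each object holds (water, soil, air) counts.
cluster_high = [(300.0, 40.0, 150.0), (500.0, 60.0, 250.0)]
cluster_low = [(20.0, 5.0, 10.0), (40.0, 15.0, 30.0), (30.0, 10.0, 20.0)]

centroids = {"high": centroid(cluster_high), "low": centroid(cluster_low)}
print(centroids)
```

Comparing the centroids variable by variable shows which cluster has the larger average on water, soil, and air pollution.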

C. RESULT AND DISCUSSION
Based on Table 1, Central Java is the province with the most cases of environmental pollution in Indonesia in 2021, followed by West Java and East Java. Meanwhile, the Riau Islands have relatively few cases of water and soil pollution, and Bali has relatively few cases of air pollution. This descriptive result is followed by a cluster analysis, carried out in the steps below, to see how the provinces group together.

Determination of the Optimal Number of Clusters Using the Elbow Method
The Elbow method was used in this study to identify candidate values for the number of clusters (k). The Elbow curve for environmental pollution in Indonesia is shown in Figure 2. The curve bends at k = 2, which suggests that two clusters is the ideal number. To consider the alternatives and confirm the optimal number of clusters, both k = 2 and k = 3 are taken as candidates, and these candidates are then tested using the Silhouette Coefficient.

Optimal Cluster Number Validation
After candidate numbers of clusters were identified with the Elbow method, they were validated using the Silhouette Coefficient. The silhouette plots are shown in Figure 3 and Figure 4. Each bar in these plots represents one data object, and objects of the same color belong to the same cluster. The average silhouette width, which is the Silhouette Coefficient of the clustering, is indicated by the dotted line in each plot. A bar close to 1 suggests that the data object is in the correct cluster, while a bar close to -1 suggests that it is not (Wijaya et al., 2021). Figure 4 has a higher Silhouette Coefficient than Figure 3, namely 0.75. According to the Kaufman criteria in Table 2, this value indicates a strong cluster structure. No data object is near -1, which implies that two clusters (k = 2) is the ideal number of clusters in this research.

Complete Linkage Cluster Analysis
The first step in forming the clusters is to generate the proximity matrix by calculating the Euclidean distance with Equation (5). The results are presented in Table 3. The proximity matrix is used to find the two objects separated by the shortest distance, which are then merged using the complete linkage formula in Equation (6). The distances are then updated to reflect the proximity between the new cluster and the remaining clusters. The procedure is repeated until only one cluster remains. The Elbow and Silhouette Coefficient methods had previously identified two as the optimal number of clusters. The cluster analysis results are therefore divided into two clusters, as displayed in Figure 5.
The dendrogram displays the hierarchy formed by the objects and the two clusters. Objects in the same cluster join to form progressively larger groups until all objects are gathered in a single hierarchy. Blue represents the first cluster and red the second. The status of each cluster can then be determined from its cluster center (centroid): the cluster whose centroid has the highest values contains the provinces with high pollution status, while the cluster whose centroid has the lowest values contains the low-pollution provinces. The centroid is obtained by averaging the objects in a cluster (Ais et al., 2022). The centroid of every cluster for each variable is displayed in Table 4. The centroid values for water, soil, and air pollution in cluster 1 are greater than those in cluster 2, indicating that the provinces in cluster 1 have a higher pollution status than those in cluster 2. Cluster 1 is therefore the group of provinces with high pollution status in 2021, and cluster 2 is the group of provinces with low pollution levels. The high-pollution cluster includes Central Java, West Java, and East Java. The members of each cluster are shown in Table 5.