Clustering the Distribution of COVID-19 in Aceh Province Using the Fuzzy C-Means Algorithm

ABSTRACT


A. INTRODUCTION
At the end of 2019, the world was shocked when the emergence of a novel coronavirus was first reported in Wuhan, China (Rothan & Byrareddy, 2020). In December 2019, five patients in Wuhan were treated for Acute Respiratory Distress Syndrome (ARDS) (Ren et al., 2020). Coronavirus Disease 2019 (COVID-19) is a disease that attacks the human respiratory system, with symptoms including cough, fever, runny nose, shortness of breath, and sore throat. The disease must be watched closely because it spreads relatively quickly (Susilo et al., 2020). On January 30, 2020, the World Health Organization (WHO) declared a Public Health Emergency of International Concern (PHEIC) as a warning that the virus poses a very high risk (Suni, 2020; World Health Organization (WHO), 2020).
The Indonesian government provides a COVID-19 information center website, accessible at http://covid19.go.id, that reports the number of confirmed positive cases, recoveries, deaths, and so on in real time. The website also groups areas into high, medium, and low risk, as well as areas with no COVID-19 cases. However, this grouping is considered less accurate because it is not clear whether the calculation used to obtain it includes a process for clustering the regions of Indonesia (Farizi & Harmawan, 2020).
The website of the COVID-19 information center for Aceh Province does not group regions from the highest to the lowest number of COVID-19 cases. The purpose of this study is to cluster areas into those with high, medium, and low numbers of COVID-19 cases. The community needs this zoning in order to stay alert, practice social distancing, and comply with health protocols when carrying out activities or traveling to areas that fall in the group with the highest number of cases. Given the amount of data involved, the grouping is performed with a data mining clustering technique, the Fuzzy C-Means Algorithm. Fuzzy C-Means is a technique for clustering data in which the membership of each data point in a cluster is determined by a degree of membership.
Several related studies have been carried out. Noviyanto (2020) used a single parameter, the number of deaths due to COVID-19 in countries of the Asian continent, with the K-Means algorithm and the RapidMiner software. The resulting clustering of deaths due to COVID-19 consisted of 4 countries in the high cluster, 4 countries in the moderate cluster, and 41 countries in the low cluster. Solichin & Khairunnisa (2020) clustered on the number of people under surveillance, patients in care, positive cases, recovered patients, and deaths, using the K-Means method with the Euclidean distance measure.
Doroshenko (2020) discusses clustering the spread of COVID-19 in Italy during February-April 2020 using the K-Means clustering algorithm. The study groups 1113 data rows describing the epidemiological situation in Italy from 20 February 2020 to 16 April 2020 based on 17 parameters, including date, country, region code, region name, latitude, longitude, hospitalized patients, intensive care patients, total hospitalizations, current positive cases, new positive cases, recoveries, deaths, total positive cases, and tests performed. The algorithm divides the area into clusters that reflect the geographic division of the regions. The study reports a high accuracy of 97%, and the clusters formed separate areas with high incidence (blue cluster, C1) from areas with low incidence (red cluster, C2). That study uses the K-Means algorithm, whereas the present research uses the Fuzzy C-Means algorithm.
Crnogorac et al. (2021) grouped European countries based on the cumulative relative number of COVID-19 patients in Europe in 2020. The study aims to group European countries and regions into clusters such that countries in the same cluster have similar cumulative numbers of COVID-19 cases. It uses a dataset containing the cumulative number of COVID-19 cases per 100,000 population, recorded every day for 14 days for each European country and region. The grouping uses 3 clustering methods, namely K-Means, agglomerative, and BIRCH clustering. The study grouped European countries and regions into 5 clusters, evaluated performance using the Silhouette coefficient, and reported good results.
S et al. (2021) grouped the districts of Tamil Nadu based on mobility to six categories of places, namely retail and recreation, grocery stores and pharmacies, parks, transit stations, workplaces, and residential areas, obtaining groups of districts with high, medium, and low mobility. That study applies fuzzy clustering with a hybrid CSO-PSO search to the movement of groups of people during the COVID-19 lockdown. By contrast, the present research clusters a dataset based on the number of people under surveillance, patients in treatment, positive cases, recovered patients, and deaths.
Kurniawan et al. (2021) used clustering and correlation methods to predict and analyze the risk of COVID-19 in countries exposed to the pandemic. The clustering method is K-Means, applied to 200 countries with 10 attributes, namely country, total cases, new cases, total deaths, new deaths, total recovered, active cases, serious or critical cases, probability of death (%), and probability of recovery (%). The study grouped the 200 countries into 5 clusters. The correlation analysis showed a strong positive linear correlation between the total number of COVID-19 cases and the number of deaths (0.78) and between the number of deaths and the number of critical patients (0.85). This indicates that when one of these variables increases, the other tends to increase as well. The K-Means clustering method is widely applied to data clustering problems, such as grouping 142 countries into 4 clusters to identify the best strategy for fighting COVID-19 (Darapaneni et al., 2021) and clustering air quality data around Uttarakhand, India during the COVID-19 pandemic lockdown (Sunori et al., 2021). The present research clusters a dataset based on the number of people under surveillance, patients in treatment, positive cases, recovered patients, and deaths, using the K-Means method and the Euclidean distance measure.
The COVID-19 pandemic has had a major impact on various sectors of life, such as the economy (Fernandes, 2020; Ozili & Arun, 2020), the environment (Zambrano-Monserrate et al., 2020), and education (Akat & Karatas, 2020), where online learning systems were implemented during the pandemic (König et al., 2020; Nurdin et al., 2022). To help prevent an increase in COVID-19 cases, a system that clusters COVID-19 distribution zones using the Fuzzy C-Means Algorithm is needed.
Based on the background described above and the previous related research, no study has yet addressed clustering the spread of COVID-19 in Aceh Province using the Fuzzy C-Means Algorithm. Fuzzy C-Means clustering is considered more accurate because each data point is reallocated among the clusters during the iteration process using fuzzy set theory, so that each data point can belong to every cluster according to its degree of membership (Muslimatin, 2011).

B. METHODS
1. Fuzzy C-Means Algorithm
Two fuzzy techniques can be used for grouping, namely Fuzzy Hashing and Fuzzy C-Means (Naik et al., 2018). Fuzzy C-Means is a data grouping method proposed by Bezdek that uses concepts from fuzzy set theory (Bezdek et al., 1984). It applies fuzzy grouping, in which each data point can be a member of several clusters with a different degree of membership in each. Fuzzy C-Means is an iterative algorithm: the clustering step is repeated until a stopping condition is met. Its purpose is to find the cluster centers, which are then used to determine which data points belong to each cluster. According to Agustini (2017), the advantage of Fuzzy C-Means over other algorithms is that it places the cluster centers more precisely, by repeatedly correcting them so that they move toward the right locations. In addition, Fuzzy C-Means has a smaller probability of failure than the K-Means algorithm because it allows each data point to belong to two or more clusters.
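For reference, the standard Fuzzy C-Means formulation from the literature can be written as follows. The symbols are assumptions introduced here for illustration and are not taken from the paper's own step-by-step calculation: $x_i$ is the $i$-th data point, $v_k$ the center of cluster $k$, $\mu_{ik}$ the degree of membership of point $i$ in cluster $k$, $w > 1$ the fuzziness exponent, $n$ the number of data points, and $c$ the number of clusters.

```latex
% Objective function minimized by Fuzzy C-Means
J_w(U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} \mu_{ik}^{\,w} \, \lVert x_i - v_k \rVert^{2},
\qquad \sum_{k=1}^{c} \mu_{ik} = 1 \ \text{for every } i

% Cluster center update: membership-weighted mean of the data
v_k = \frac{\sum_{i=1}^{n} \mu_{ik}^{\,w} \, x_i}{\sum_{i=1}^{n} \mu_{ik}^{\,w}}

% Membership update
\mu_{ik} = \left[ \sum_{j=1}^{c}
  \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{\frac{2}{w-1}}
\right]^{-1}

% Stopping condition at iteration t, with tolerance \varepsilon
\lvert J_w^{(t)} - J_w^{(t-1)} \rvert < \varepsilon
\quad \text{or} \quad t > \text{MaxIter}
```

The two updates are alternated: the centers are recomputed from the current memberships, and the memberships are then recomputed from the new centers, which is the repeated correction of the cluster centers described above.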
To apply Fuzzy C-Means clustering, the first step is to set the number of classes (clusters) that will serve as the basis for grouping, and then to iterate to obtain the membership of each group. For data grouping, the Fuzzy C-Means Algorithm can also be combined with the Fuzzy Support Vector Machine (Shan & Zhi, 2016).
The flowchart of the Fuzzy C-Means Algorithm is shown in Figure 1.
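The iteration described in this section can be sketched in Python with NumPy. This is a minimal sketch of the standard algorithm, not the system built in this study: the function name `fuzzy_c_means`, the default fuzziness exponent `w=2.0`, the tolerance `eps`, the random seed, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np


def fuzzy_c_means(X, c, w=2.0, max_iter=100, eps=1e-5, seed=0):
    """Standard Fuzzy C-Means sketch; returns (centers, U, iterations)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Random initial membership matrix; each row is normalized to sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)

    prev_obj = np.inf
    for it in range(1, max_iter + 1):
        Uw = U ** w
        # Cluster centers: membership-weighted means of the data.
        centers = (Uw.T @ X) / Uw.sum(axis=0)[:, None]
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Objective value for the current memberships and centers.
        obj = float((Uw * d2).sum())
        # Membership update: inverse-distance weights, normalized per row.
        inv = d2 ** (-1.0 / (w - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Stop when the objective changes by less than the tolerance.
        if abs(prev_obj - obj) < eps:
            break
        prev_obj = obj
    return centers, U, it


# Toy data (hypothetical): two well-separated groups of three points.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers, U, n_iter = fuzzy_c_means(X_toy, c=2)
labels = U.argmax(axis=1)  # cluster with the highest membership degree
```

Because the initial membership matrix is random, the cluster indices are arbitrary: which group ends up as cluster 0 or 1 depends on the seed, so only the grouping itself should be interpreted.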

2. Data Collection and Variable Type
The data were taken from the website of the COVID-19 information center in Aceh, http://covid19.acehprov.go.id, and cover the 23 districts/cities of Aceh Province. The following variables are used:
a. Confirmed: the number of people confirmed positive for COVID-19 by a laboratory test, including those who are asymptomatic.
b. Patients in care: the number of patients hospitalized due to COVID-19.
c. Recovered: the number of patients who were hospitalized and have recovered.
d. Deaths: the number of people who died from COVID-19.
e. Suspect: the number of people who have symptoms of COVID-19 but have not been tested.
f. Probable: the number of people who have symptoms and a positive rapid test result but have not yet had a PCR laboratory test.

C. RESULTS AND DISCUSSION
1. Dataset and Variables
Clustering, or zoning, is carried out according to the level of cases in Aceh Province, based on data obtained from the Aceh COVID-19 information website, https://covid19.acehprov.go.id/. The dataset covers the 23 districts/cities of Aceh Province: South Aceh, Southeast Aceh, East Aceh, Central Aceh, West Aceh, Big Aceh, Pidie, North Aceh, Simelue, Aceh Singkil, Bireun, Southwest Aceh, Gayo Lues, Aceh Jaya, Nagan Raya, Aceh Tamiang, Bener Meriah, Pidie Jaya, Banda Aceh, Sabang, Lhokseumawe, Langsa, and Subulussalam. The variables used for clustering are the number of confirmed cases, patients in treatment, recovered patients, deaths, suspects, and probable cases. These 6 variables are given initials in Table 2, and the initials are used in the manual calculation with the Fuzzy C-Means algorithm. The COVID-19 dataset used in the calculation is shown in Table 3.

Table 3. COVID-19 dataset for Aceh Province (rows 16 to 23)
No  District/City  Confirmed  In care  Recovered  Deaths  Suspect  Probable
16  Aceh Tamiang         380       88        274      18      232         9
17  Bener Meriah         163       37        120       6       99        32
18  Pidie Jaya           218        0        202      16      322         0
19  Banda Aceh          2836      548       2213      75      984        16
20  Sabang                92        1         82       9       62        52
21  Lhokseumawe          428       98        314      16     1608         1
22  Langsa               329       62        256      11      384        11
23  Subulussalam          85        2         76       7       16         7
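As a quick consistency check on the rows of Table 3 that are legible in the text (rows 16 to 23), the sketch below verifies that the confirmed count equals the sum of patients in care, recovered patients, and deaths. The column order is an assumption based on the order in which the variables are listed, and the variable names are hypothetical.

```python
import numpy as np

# Rows 16 to 23 of Table 3 as they appear in the text.
# Assumed column order: confirmed, in care, recovered, deaths, suspect, probable.
districts = ["Aceh Tamiang", "Bener Meriah", "Pidie Jaya", "Banda Aceh",
             "Sabang", "Lhokseumawe", "Langsa", "Subulussalam"]
data = np.array([
    [380,  88,  274, 18,  232,  9],
    [163,  37,  120,  6,   99, 32],
    [218,   0,  202, 16,  322,  0],
    [2836, 548, 2213, 75,  984, 16],
    [92,    1,   82,  9,   62, 52],
    [428,  98,  314, 16, 1608,  1],
    [329,  62,  256, 11,  384, 11],
    [85,    2,   76,  7,   16,  7],
])

confirmed, in_care, recovered, deaths = (data[:, j] for j in range(4))

# Each confirmed case should be in care, recovered, or deceased.
consistent = confirmed == in_care + recovered + deaths

# District with the largest confirmed count among these rows.
highest = districts[int(confirmed.argmax())]
```

For these eight rows the identity holds in every case, which supports the assumed column order, and the district with the most confirmed cases is Banda Aceh.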

2. Calculation Result with Fuzzy C-Means Algorithm
After manual calculation following steps 1 to 7 of the Fuzzy C-Means Algorithm, the final result of clustering the spread of COVID-19 in Aceh Province for each data point, reached after 43 iterations, is shown in Table 4. The degree of membership of each data point after 43 iterations is shown in Figure 2.

Figure 2. Last Iteration Membership Degree Chart
Figure 2 plots the membership degrees obtained from the manual calculation after 43 iterations. Based on Table 4, the clustering results can be presented as a diagram, as shown in Figure 3, which plots the result of clustering the spread of COVID-19 in Aceh.
In the final result, cluster 1, the red zone, consists of 1 district/city, namely Kota Banda Aceh. Cluster 2, the yellow zone, consists of 4 districts/cities, namely Big Aceh, Pidie, Bireun, and Kota Lhokseumawe. Cluster 3, the green zone, consists of 18 districts/cities, namely South Aceh, Southeast Aceh, East Aceh, Central Aceh, West Aceh, North Aceh, Simelue, Aceh Singkil, Southwest Aceh, Gayo Lues, Aceh Jaya, Nagan Raya, Aceh Tamiang, Bener Meriah, Pidie Jaya, Sabang, Langsa, and Subulussalam. Table 5 lists the resulting COVID-19 distribution zones in Aceh based on each calculation variable. Figure 4 shows the zoning graph of the spread of COVID-19 in Aceh Province based on these results.
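The step from membership degrees to zones can be illustrated as follows: each district is assigned to the cluster in which its membership degree is highest, and the cluster index is then mapped to a zone. This is a sketch under stated assumptions; the function `assign_zones`, the `ZONES` mapping, and the membership values are hypothetical and are not the values of Table 4.

```python
import numpy as np

# Assumed mapping from cluster index to zone (cluster 1, 2, 3 in the text).
ZONES = {0: "red", 1: "yellow", 2: "green"}


def assign_zones(U):
    """Assign each row to the zone of its highest-membership cluster."""
    U = np.asarray(U, dtype=float)
    labels = U.argmax(axis=1)
    return [ZONES[int(k)] for k in labels]


# Hypothetical membership degrees for three districts; each row sums to 1.
U_example = np.array([
    [0.91, 0.06, 0.03],
    [0.10, 0.75, 0.15],
    [0.02, 0.08, 0.90],
])
zones = assign_zones(U_example)
```

In practice the cluster indices produced by Fuzzy C-Means are arbitrary, so the clusters would first need to be ordered by severity (for example by the confirmed count of each cluster center) before the zone names are attached.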

3. Implementation of System Interface
The dashboard page of the system presents the data and the results of clustering the distribution of COVID-19 in Aceh Province using the Fuzzy C-Means Algorithm, as shown in Figure 5. The report page presents the results of the clustering analysis with the degree of membership of each data point, covering all data processed with the Fuzzy C-Means Algorithm, as shown in Figure 6.