Biclustering Performance Evaluation of Cheng and Church Algorithm and Iterative Signature Algorithm

ABSTRACT


A. INTRODUCTION
Biclustering is an analytical tool that groups data from two directions simultaneously and is an extension of clustering. Whereas clustering groups data in one direction only, by rows or by columns separately, biclustering groups rows and columns simultaneously (Castanho et al., 2022; Divina et al., 2019; Patowary et al., 2020). Clustering the rows produces groups of rows (objects) that must each contain all columns (variables), and vice versa (Flores et al., 2019; Kamranrad et al., 2021). Biclustering, in contrast, can produce a group of rows that contains only some of the columns, that is, a submatrix (Brizuela et al., 2013). In terms of how its algorithms are developed, biclustering resembles the two-way classification approach.
A recent application of biclustering with the Cheng and Church (CC) algorithm is the study by Ningsih et al. (2022a), which identified cases of economic and Covid-19 pandemic vulnerability. The CC algorithm is popular in applications because it has several advantages, one of which is that it avoids overlap between the resulting groups (biclusters) (Di Iorio et al., 2020; Pontes et al., 2015). Ningsih et al. (2022b) also applied biclustering to identify patterns of economic and Covid-19 pandemic vulnerability, but with a different algorithm, the Iterative Signature Algorithm (ISA). ISA has several advantages as well, one of which is its use of two thresholds, which are said to be its most potent tools for clearly distinguishing structures at different levels (Khalili et al., 2019; Zhang et al., 2021).
The biclustering results for economic and Covid-19 pandemic vulnerability patterns obtained with the CC algorithm and ISA by Ningsih et al. (2022a) and Ningsih et al. (2022b) lead to the same general conclusion: most regions in Indonesia tend to have low economic and Covid-19 pandemic vulnerability in their respective spatial pattern characteristic variables (Ningsih et al., 2022a, 2022b). The results differ, however, when the focus is on Java Island. With the CC algorithm, most regions on Java Island tend to have low economic and Covid-19 pandemic vulnerability in the spatial pattern characteristic variables (Ningsih et al., 2022a). With ISA, the opposite holds: most regions on Java Island tend to have high economic and Covid-19 pandemic vulnerability in the spatial pattern characteristic variables (Ningsih et al., 2022b).
A closer look at the studies by Ningsih et al. (2022a) and Ningsih et al. (2022b) therefore reveals substantial differences, which makes them an interesting basis for evaluating the performance of the biclustering results of the two algorithms. This research evaluates the performance of two biclustering algorithms, the Cheng and Church algorithm (CC algorithm) and the Iterative Signature Algorithm (ISA). The evaluation results are expected to provide information on the characteristics of the biclustering results produced by each algorithm, especially in the case of economic and Covid-19 pandemic vulnerability in Indonesia.

Data Sources
This study uses secondary data from BPS-Statistics; the Ministry of Health; the Ministry of Villages, Development of Disadvantaged Regions and Transmigration; and the Ministry of Environment and Forestry. The units of observation are the regions (34 provinces) of Indonesia, with variable data for 2020, as presented in Table 1. Because of limited data availability, variable X3 is based on 2018 and variable X16 on 2019. The variables were selected from the indicators that make up the Economic Vulnerability Index (EVI) and the Pandemic Vulnerability Index (PVI), according to the United Nations (UN) and the National Institute of Environmental Health Sciences (NIEHS) (United Nations, 2011; National Institute of Environmental Health Sciences, 2020), as shown in Table 1. Inverse values are used so that the direction of a variable agrees with the concept of vulnerability used for the other variables. A brief explanation of some of the variables in Table 1 is as follows:
a. Remoteness and underdeveloped areas use the inverse value of the Developing Village Index.
b. The population in the coastal area is estimated as the proportion of villages located by the sea multiplied by the total population.
c. Export concentration uses the percentage of category A commodity exports in total exports.
d. Instability of goods and services exports uses the inverse value of the ratio of the contribution of total exports in 2020 to that of the previous year.
e. Instability of agricultural production uses the ratio of the contribution of Category A in 2020 to that of the previous year.
f. The percentage of the daytime population is estimated by multiplying the total population by the proportion of passenger cars and motorcycles.
g. The social distance score uses the ratio of domestic tourists (the number of residents travelling other than for work or school) per resident.

Procedure of Analysis
a. Data Exploration
The data matrix of 34 regions × 23 variables was standardized to standard normal scores; the standardized matrix is referred to as the scaling data matrix. It was explored with a heatmap to obtain an overview of the initial characteristics of each region according to the constituent variables of the EVI and PVI indicators.
b. Biclustering with the CC Algorithm and ISA
Biclustering is a methodology for finding hidden, locally coherent patterns in a data matrix by classifying patterns simultaneously in both directions, the rows and the columns of the data matrix (Alzahrani et al., 2017; Henriques & Madeira, 2014; Huang et al., 2020). According to Ferraro et al. (2021), biclustering consists of simultaneously partitioning a set of rows and columns into classes or biclusters. Given a matrix $A_{|X| \times |Y|} = (X, Y)$ with a row set $X$ consisting of $|X|$ rows and a column set $Y$ consisting of $|Y|$ columns, a bicluster is a submatrix $B_{n \times m} = (I, J)$ with a row subset $I \subseteq X$ consisting of $n$ sampled rows and a column subset $J \subseteq Y$ consisting of $m$ sampled columns. Then $a_{ij}$ is the value in the matrix corresponding to the $i$-th row and the $j$-th column (Siswantining et al., 2021). Noise in a bicluster is the residual, calculated as the difference between the element value $a_{ij}$ and its predicted value. The predicted value ($\hat{a}_{ij}$) is calculated from the corresponding row and column averages and the bicluster average (Pang, 2022). With these residues, the element values $a_{ij}$ follow equation (1), and the bicluster residue is denoted by $r_{ij}$ as in equation (2). Furthermore, the average of the $i$-th row in the bicluster (row average) is denoted by $a_{iJ}$ and follows equation (3), the average of the $j$-th column in the bicluster (column average) is denoted by $a_{Ij}$ and follows equation (4), and the average of all elements in the bicluster (bicluster average) is denoted by $a_{IJ}$ and follows equation (5) (Ramkumar et al., 2022).
$a_{ij} = \hat{a}_{ij} + r_{ij} = a_{iJ} + a_{Ij} - a_{IJ} + r_{ij}$  (1)
$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$  (2)
$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$  (3)
$a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$  (4)
$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} a_{ij}$  (5)
The Cheng and Church algorithm (CC algorithm) is a biclustering algorithm that looks for biclusters with constant values, constant rows, or constant columns (Ardaneswari et al., 2017). It searches for biclusters by considering row and column coherence simultaneously, scoring a submatrix by the average of its squared residues (Oghabian et al., 2014). Pontes et al. (2015) stated that Cheng and Church were the first to apply biclustering to gene expression data, adopting a sequential covering strategy to return a list of n biclusters from an expression data matrix. Bicluster quality is measured by the mean squared residue (MSR), which evaluates the coherence of the genes (rows, or objects) and conditions (columns, or variables) of a bicluster from the values it contains. Given a data matrix $A$ and a threshold $\delta > 0$, the goal of the CC algorithm is to find δ-biclusters, i.e., row subsets and column subsets with a score not greater than $\delta$ (Di Iorio et al., 2020). The score is this coherence measure, and the algorithm aims for the smallest MSR. The MSR of a bicluster with row subset $I$ and column subset $J$ is denoted by $H(I, J)$ and defined by equation (6), with $r_{ij}$ defined as in equation (2) (Ramkumar et al., 2022).
$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} r_{ij}^2$  (6)
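In addition, the row mean squared residue ($d(i)$) and the column mean squared residue ($e(j)$) of a matrix are defined by equations (7) and (8), respectively, where $r_{ij}$ is the residue of equation (2).
$d(i) = \frac{1}{|J|} \sum_{j \in J} r_{ij}^2$  (7)
$e(j) = \frac{1}{|I|} \sum_{i \in I} r_{ij}^2$  (8)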
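The residue, the MSR, and the row and column scores can be made concrete with a minimal NumPy sketch. It assumes the standard Cheng and Church definitions; the function names and the toy matrices are hypothetical and are not taken from the study's data.

```python
import numpy as np


def residues(sub):
    """Residue r_ij = a_ij - a_iJ - a_Ij + a_IJ for every cell of a bicluster."""
    sub = np.asarray(sub, dtype=float)
    row_mean = sub.mean(axis=1, keepdims=True)  # a_iJ, one value per row
    col_mean = sub.mean(axis=0, keepdims=True)  # a_Ij, one value per column
    return sub - row_mean - col_mean + sub.mean()  # sub.mean() is a_IJ


def msr(sub):
    """Mean squared residue H(I, J) of the bicluster."""
    return float((residues(sub) ** 2).mean())


def row_scores(sub):
    """d(i): mean squared residue of each row."""
    return (residues(sub) ** 2).mean(axis=1)


def col_scores(sub):
    """e(j): mean squared residue of each column."""
    return (residues(sub) ** 2).mean(axis=0)


# Hypothetical toy data: a perfectly additive block (row effect + column effect)
# has residues that vanish up to floating-point error.
additive = np.add.outer([0.0, 1.0, 3.0], [0.0, 2.0, 5.0, 7.0])
noisy = additive.copy()
noisy[0, 0] += 1.0  # disturb a single cell

print(msr(additive), msr(noisy))
print(row_scores(noisy).argmax(), col_scores(noisy).argmax())
```

In this toy case the disturbed cell raises the MSR above zero, and the largest row and column scores point at the row and column containing it, which is the information the deletion phases of the CC algorithm rely on.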
Figure 1 shows the flowchart of the Cheng and Church algorithm (modified from Pontes et al., 2015). According to Pontes et al. (2015), the CC algorithm takes as input a matrix $A$ and a threshold $\delta$ used to reject non-biclusters, and returns a list of δ-biclusters as output. The stages of CC biclustering are as follows (Pontes et al., 2015).
1) Bicluster initialization: set the initial matrix to the input data ($A$) and fix the delta threshold ($\delta$).
2) Multiple node deletion: delete the rows and columns whose row mean squared residue ($d(i)$) or column mean squared residue ($e(j)$) is greater than 1.5 × the mean squared residue of the entire matrix ($H(I, J)$), as long as MSR > δ.
3) Single node deletion: delete the row or column with the largest $d(i)$ or $e(j)$, as long as MSR > δ.
4) Node addition: add the rows and columns that satisfy $d(i) \le H(I, J)$ and $e(j) \le H(I, J)$, provided that the MSR after the addition does not exceed $H(I, J)$.
5) Substitution: replace the elements of the resulting bicluster with random numbers to prevent overlap between biclusters.
6) Repeat stages 1 to 5 n times, where n is the number of biclusters to be found.
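The six stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: rows are processed before columns in stages 2 and 4, masked cells are drawn uniformly between the data minimum and maximum, and all function names and the planted-block data are hypothetical.

```python
import numpy as np


def scores(a, rows, cols):
    """Return H(I, J), row scores d(i) and column scores e(j) of a[rows, cols]."""
    sub = a[np.ix_(rows, cols)]
    res = sub - sub.mean(1, keepdims=True) - sub.mean(0, keepdims=True) + sub.mean()
    sq = res ** 2
    return sq.mean(), sq.mean(axis=1), sq.mean(axis=0)


def one_bicluster(a, delta, alpha=1.5):
    """Stages 2 to 4 for a single delta-bicluster (row and column index arrays)."""
    rows, cols = np.arange(a.shape[0]), np.arange(a.shape[1])
    # Stage 2: multiple node deletion, rows first and then columns.
    while True:
        h, d, _ = scores(a, rows, cols)
        if h <= delta:
            break
        size = len(rows) + len(cols)
        rows = rows[d <= alpha * h]
        h, _, e = scores(a, rows, cols)
        if h > delta:
            cols = cols[e <= alpha * h]
        if len(rows) + len(cols) == size:  # nothing removed: go to stage 3
            break
    # Stage 3: single node deletion of the worst row or column.
    while True:
        h, d, e = scores(a, rows, cols)
        if h <= delta:
            break
        if d.max() >= e.max():
            rows = np.delete(rows, d.argmax())
        else:
            cols = np.delete(cols, e.argmax())
    # Stage 4: node addition, columns first and then rows.
    sub = a[np.ix_(rows, cols)]
    cand = np.setdiff1d(np.arange(a.shape[1]), cols)
    blk = a[np.ix_(rows, cand)]
    e_new = ((blk - sub.mean(1, keepdims=True) - blk.mean(0, keepdims=True)
              + sub.mean()) ** 2).mean(0)
    cols = np.sort(np.concatenate([cols, cand[e_new <= h]]))
    sub = a[np.ix_(rows, cols)]
    h = scores(a, rows, cols)[0]
    cand = np.setdiff1d(np.arange(a.shape[0]), rows)
    blk = a[np.ix_(cand, cols)]
    d_new = ((blk - blk.mean(1, keepdims=True) - sub.mean(0, keepdims=True)
              + sub.mean()) ** 2).mean(1)
    rows = np.sort(np.concatenate([rows, cand[d_new <= h]]))
    return rows, cols


def cheng_church(a, delta, n_biclusters, seed=0):
    """Stages 1 to 6: find n_biclusters, masking each one before the next search."""
    rng = np.random.default_rng(seed)
    work = np.array(a, dtype=float)  # Stage 1: working copy of the input matrix
    lo, hi = work.min(), work.max()
    found = []
    for _ in range(n_biclusters):  # Stage 6: repeat n times
        rows, cols = one_bicluster(work, delta)
        found.append((rows, cols))
        # Stage 5: substitute random numbers so the same cells are not reused.
        work[np.ix_(rows, cols)] = rng.uniform(lo, hi, (len(rows), len(cols)))
    return found


# Hypothetical data: Gaussian noise with one planted additive block.
rng = np.random.default_rng(1)
data = rng.normal(size=(12, 8))
data[:5, :4] = np.add.outer(np.linspace(-1, 1, 5), np.linspace(0, 2, 4))
biclusters = cheng_church(data, delta=0.01, n_biclusters=2)
first_rows, first_cols = biclusters[0]
print(first_rows, first_cols, scores(data, first_rows, first_cols)[0])
```

Because each deletion loop stops only when the MSR is at most δ, and the addition step admits only rows and columns whose score does not exceed the current MSR, the first bicluster returned satisfies the δ condition on the original data. Later biclusters are searched on the masked working copy, so they may include substituted cells.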
The Iterative Signature Algorithm (ISA) is a biclustering algorithm that takes a matrix as input and returns a set of biclusters, referred to as transcription modules (TM) (Balamurugan et al., 2015). A TM contains a subset of rows and a subset of columns that depend on a pair of thresholds; the row and column thresholds determine the degree of similarity within the TM (Pontes et al., 2015). Pontes et al. (2015) classified ISA among the non-metric-based, linear algebra algorithms. It does not use a specific evaluation measure during the bicluster search; instead, it uses vector spaces and linear mappings between these spaces to describe and find the most correlated submatrices (TMs) (Pontes et al., 2015). Let $A_{|X| \times |Y|} = (X, Y)$ be a matrix with $|X|$ rows and $|Y|$ columns. The row score in ISA is the row average over the column sample ($Y'$) that meets the condition in equation (9), while the column score is the column average over the row sample ($X'$) that meets the condition in equation (10) (Ningsih et al., 2022b). The elements entering equations (9) and (10) are taken from the column-normalized matrix and the row-normalized matrix, respectively.
The stages of ISA biclustering are as follows and are illustrated in Figure 2.
5) Select the sample of rows whose average exceeds the row threshold; the average value becomes the "row score".
6) Compute each column average over the row sample ($X'$).
7) Select the sample of columns whose average exceeds the column threshold; the average value becomes the "column score".
8) Repeat stages 4 to 7 up to the number of seeds (n) while the convergence condition is unmet.
9) When the convergence condition is met, i.e., $\frac{|S' \setminus S''|}{|S' \cup S''|} < \varepsilon$ for the samples $S'$ and $S''$ of two successive iterations, the rows and columns (bicluster) are selected.
10) Repeat stages 3 to 9 for as many biclusters as may be formed.
c. Performance Evaluation of Biclustering Algorithm
According to Kavitha Sri & Porkodi (2019), the performance of a biclustering algorithm is evaluated with two categories of evaluation functions: the intra-bicluster evaluation function and the inter-bicluster evaluation function. The intra-bicluster evaluation function measures the quality of a bicluster by its level of coherence (Ben Saber & Elloumi, 2014). The intra-bicluster evaluation measure used in this study is the mean squared residue (MSR), defined by equation (11) (Kavitha Sri & Porkodi, 2019).
$H(I, J) = \frac{1}{|I| \times |J|} \sum_{i \in I} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$  (11)
where $a_{ij}$ is the bicluster element in the $i$-th row and the $j$-th column, $a_{IJ}$ is the average over all elements of the bicluster, $a_{Ij}$ is the average of the $j$-th column, $a_{iJ}$ is the average of the $i$-th row, and $|I| \times |J|$ is the bicluster dimension (volume), i.e., the number of bicluster rows ($|I|$) multiplied by the number of bicluster columns ($|J|$). The value of $H(I, J)$ represents the variation associated with the interaction between the rows and columns of the bicluster (Kavitha Sri & Porkodi, 2019; Saber & Elloumi, 2015). According to Putri et al. (2021), bicluster quality improves as the residue decreases and/or the volume of the bicluster increases. The quality of a bicluster group can therefore be measured by the average of the MSR divided by the volume (the MSR average per volume), defined by equation (12) (Putri et al., 2021), where $b$ is the number of biclusters generated by a particular algorithm.
$\frac{1}{b} \sum_{t=1}^{b} \frac{H(I_t, J_t)}{|I_t| \times |J_t|}$  (12)
Apart from the MSR, the Akaike information criterion (AIC) of a bicluster can be calculated with equation (13) (Brewer et al., 2016), where $k$ is the number of independently adjusted parameters needed to obtain a bicluster of size $|I| \times |J|$, i.e., $k = |I| + |J| + 1$, $n$ is the volume or dimension of the bicluster, i.e., $n = |I| \times |J|$, and $r_{ij}$ is the bicluster residue of equation (2). The AIC value measures the goodness of the biclustering results.
Meanwhile, the inter-bicluster evaluation function measures the quality of a bicluster group by assessing how accurately an algorithm recovers the actual biclusters in a data matrix (Ben Saber & Elloumi, 2014; Henriques & Madeira, 2018). The inter-bicluster measure used is the Liu and Wang index, defined by equation (14) (Saber & Elloumi, 2015), where $M_{opt}$ is the bicluster group with the smallest MSR average per volume and $M$ is the other bicluster group.
$I(M_{opt}, M) = \frac{1}{k_{opt}} \sum_{i=1}^{k_{opt}} \max_{j} \frac{|R_i^{opt} \cap R_j| + |C_i^{opt} \cap C_j|}{|R_i^{opt} \cup R_j| + |C_i^{opt} \cup C_j|}$  (14)
Here $k_{opt}$ is the number of biclusters in $M_{opt}$, $|R_i^{opt} \cap R_j|$ is the number of rows ($R$) of a bicluster in $M_{opt}$ that also occur in a bicluster of $M$, and $|C_i^{opt} \cap C_j|$ is the number of columns ($C$) of a bicluster in $M_{opt}$ that also occur in a bicluster of $M$. $|R_i^{opt} \cup R_j|$ is the number of combined rows of the two biclusters, and $|C_i^{opt} \cup C_j|$ is the number of their combined columns. The Liu and Wang index compares two solutions (biclustering results) by considering the rows and columns of each bicluster (Kavitha Sri & Porkodi, 2019). The index value indicates how similar the optimal bicluster group ($M_{opt}$) is to the other bicluster group ($M$). When $M_{opt} = M$, the Liu and Wang index equals 1.00 (Saber & Elloumi, 2015).
The flowchart of the performance evaluation process in this study is illustrated in Figure 3. In general, each algorithm is first evaluated separately to obtain its biclustering results at the optimal threshold. The biclustering results at the optimal threshold are then evaluated separately using the inter-bicluster evaluation function. In addition, the results of the two algorithms are evaluated jointly with the intra-bicluster and inter-bicluster evaluation functions. This joint evaluation is a comparability study of the biclustering results, which are compared in terms of membership, characteristics, and distribution.
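The intra-bicluster measures can be illustrated with a short sketch. The MSR average per volume follows the verbal definition above. For the AIC, the sketch assumes the common least-squares form $n \ln(\sum r_{ij}^2 / n) + 2k$; this form is an assumption and may differ from the exact expression of equation (13). Function names and toy biclusters are hypothetical.

```python
import numpy as np


def squared_residues(sub):
    """Squared residues r_ij^2 of a bicluster given as a 2-D array."""
    sub = np.asarray(sub, dtype=float)
    res = sub - sub.mean(1, keepdims=True) - sub.mean(0, keepdims=True) + sub.mean()
    return res ** 2


def msr(sub):
    """Mean squared residue of one bicluster."""
    return float(squared_residues(sub).mean())


def msr_per_volume(biclusters):
    """Average over biclusters of MSR divided by volume (rows x columns)."""
    return float(np.mean([msr(b) / np.asarray(b).size for b in biclusters]))


def aic(sub):
    """Assumed least-squares AIC: n * ln(RSS / n) + 2k with k = |I| + |J| + 1."""
    sq = squared_residues(sub)
    n = sq.size
    k = sq.shape[0] + sq.shape[1] + 1
    return float(n * np.log(sq.sum() / n) + 2 * k)


# Hypothetical biclusters: a 2 x 2 checkerboard and a nearly additive 3 x 4 block.
b1 = np.array([[1.0, 0.0], [0.0, 1.0]])
b2 = np.add.outer([0.0, 1.0, 3.0], [0.0, 2.0, 5.0, 7.0])
b2[0, 0] += 1.0

print(msr(b1), msr(b2))
print(msr_per_volume([b1, b2]))
print(aic(b1), aic(b2))
```

Dividing by the volume rewards larger biclusters with the same residue, which is why the measure can rank two algorithms differently from the raw MSR. Note that the assumed AIC form is undefined for a bicluster with exactly zero residue.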

Data Exploration
The initial characteristics of each region according to the constituent variables of the EVI and PVI indicators are illustrated by the heatmap of the scaling data matrix in Figure 4. The heatmap shows several extreme values of the EVI (X1 to X8) and PVI (X9 to X23) indicator variables in particular provinces. Some values are highly positive (dark orange), and some are highly negative (light yellow) (Guo et al., 2020).
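The scaling step behind the heatmap can be sketched as a column-wise z-score. The sketch assumes the sample standard deviation (`ddof=1`); the study does not state which divisor was used, and the toy values below are hypothetical.

```python
import numpy as np


def standardize(x, ddof=1):
    """Column-wise z-scores: each variable gets mean 0 and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=ddof)


# Hypothetical raw data: 4 regions (rows) by 2 variables (columns).
raw = np.array([[10.0, 0.2],
                [20.0, 0.4],
                [30.0, 0.9],
                [80.0, 0.5]])
z = standardize(raw)

# Positive scores lie above the variable's mean (the vulnerable side of the
# heatmap); negative scores lie below it (the invulnerable side).
above_mean = z > 0
print(z.round(2))
print(above_mean[:, 0])
```

After scaling, variables measured in different units become comparable, so a single colour scale can be used across all 23 columns of the heatmap.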
A province with values that tend to be highly positive tends to be vulnerable; conversely, a province with highly negative values tends to be invulnerable. One example is the PVI indicator variable X9, Covid-19 infectious cases. DKI Jakarta Province has a highly positive value (dark orange), which indicates that it tends to be highly vulnerable to the Covid-19 pandemic, especially with respect to Covid-19 infectious cases, as shown in Figure 4. Another example is the EVI indicator variable X2, remoteness and underdeveloped areas. DKI Jakarta Province has a highly negative value (light yellow), which indicates that it tends to have low economic vulnerability, especially with respect to remoteness and underdeveloped areas.
The research by Ningsih et al. (2022a) on "biclustering applications in Indonesian economic and pandemic vulnerability" shows that the CC algorithm produces the optimal bicluster group at a delta threshold of 0.01. Figure 5 shows the six biclusters of this optimal group. The optimal bicluster group of the CC algorithm leads to the conclusion that the dominant areas in Indonesia belong to the first type of spatial pattern, which has the most invulnerable characteristics. It indicates that most regions in Indonesia tend to have low economic and Covid-19 pandemic vulnerability in the characteristic variables of the first spatial pattern (Bicluster 1).
Meanwhile, the research by Ningsih et al. (2022b) on "pattern detection of economic and pandemic vulnerability index in Indonesia using bi-cluster analysis" shows that biclustering with ISA produces the optimal bicluster group at a row threshold of -1.0 and a column threshold of -1.0. The optimal bicluster group consists of three biclusters. However, because the three biclusters overlap, five different spatial patterns are formed, as shown in Figure 6.
The ISA optimal bicluster group leads to the conclusion that the dominant areas in Indonesia belong to the fifth type of spatial pattern, which has invulnerable characteristics. It indicates that most regions in Indonesia tend to have low economic and Covid-19 pandemic vulnerability in the characteristic variables of the fifth spatial pattern (the overlap of Biclusters 1, 2, and 3).
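To make the role of the row and column thresholds concrete, the following is a minimal sketch of the ISA iteration from a single seed. It is a simplification under stated assumptions: the thresholds are compared directly with the average scores, as the stage descriptions read, rather than with the standard-deviation-scaled scores used in some ISA implementations; the convergence test is applied to the column sample; and the function name, the seed, and the random data are hypothetical.

```python
import numpy as np


def isa_sketch(a, row_thr, col_thr, seed_cols, eps=1e-6, max_iter=100):
    """Alternate row and column selection from one column seed until stable."""
    a = np.asarray(a, dtype=float)
    # Column-normalised copy (used for row scores) and row-normalised copy
    # (used for column scores).
    e_c = (a - a.mean(axis=0)) / a.std(axis=0)
    e_r = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
    cols = np.asarray(seed_cols, dtype=int)
    rows = np.arange(0)
    for _ in range(max_iter):
        # Row score: average of each row over the current column sample.
        row_score = e_c[:, cols].mean(axis=1)
        rows = np.flatnonzero(row_score > row_thr)
        if rows.size == 0:
            return rows, cols[:0]  # nothing passes the row threshold
        # Column score: average of each column over the current row sample.
        col_score = e_r[rows, :].mean(axis=0)
        new_cols = np.flatnonzero(col_score > col_thr)
        if new_cols.size == 0:
            return rows[:0], new_cols  # nothing passes the column threshold
        # One-sided convergence ratio |S' \ S''| / |S' union S''| on columns.
        lost = np.setdiff1d(cols, new_cols).size
        union = np.union1d(cols, new_cols).size
        cols = new_cols
        if lost / union < eps:
            break
    return rows, cols


# Hypothetical data with the same shape as the study's matrix (34 x 23).
rng = np.random.default_rng(0)
data = rng.normal(size=(34, 23))
rows, cols = isa_sketch(data, row_thr=-1.0, col_thr=-1.0, seed_cols=[0, 1, 2])
print(rows.size, cols.size)
```

Under this reading, a negative threshold such as -1.0 admits every row or column whose average score is only moderately below zero, so the resulting biclusters tend to be large. That is in line with the overlapping memberships reported for the ISA results, although the sketch does not reproduce them.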

Biclustering Algorithm Performance Evaluation
This study evaluated the biclustering applications of the CC algorithm and ISA reported by Ningsih et al. (2022a) and Ningsih et al. (2022b). The performance evaluation is a comparability study of the biclustering results at each optimal threshold. The optimal threshold of the CC algorithm is a delta of 0.01 (Ningsih et al., 2022a), and that of ISA is a row threshold of -1.0 and a column threshold of -1.0 (Ningsih et al., 2022b). The comparability study covers the membership of the objects (regions) and variables in the biclusters formed, the mean and median characteristics of the shared identifying variables, the distribution of the values of the regional identifying variables, and the results of the intra-bicluster and inter-bicluster evaluation.
Figure 7 compares regional membership between the CC algorithm and ISA. It shows that both algorithms can group all objects (regions) into the biclusters (BC) formed. However, ISA produces overlapping regional memberships, which form five types of spatial patterns or bicluster groups. The five spatial pattern types arise from the overlap of combinations of the three BCs produced by ISA.
Figure 8 compares variable membership between the CC algorithm and ISA. It shows that the CC algorithm has fewer member variables, only eleven identifying variables, whereas ISA has 23. The identifying variables that describe the characteristics of the CC biclustering results are therefore few, so the CC results have a local character. The ISA results, in contrast, tend to have a global character, because ISA can turn all research variables into identifying variables for each bicluster formed.
Table 2 compares the mean and median values of the identifying variables shared by the two biclustering results. It shows six identifying variables common to the results of the CC algorithm and ISA: three EVI indicator variables (X1, X3, and X8) and three PVI indicator variables (X11, X12, and X15). Overall, the mean and median values of these identifying variables are classified as invulnerable. However, the mean and median values of each identifying variable differ within individual biclusters.
Table 2 shows that the median value of each identifying variable in each bicluster of the CC algorithm and ISA has the same invulnerable characteristic, but the mean values differ. For the CC algorithm, only one mean value is classified as vulnerable, namely variable X12 in Bicluster 4; the mean values of the other variables are invulnerable. For ISA, ten mean values of identifying variables are classified as vulnerable: X1 (Biclusters 2 and 3), X3 (Biclusters 2 and 3), X8 (Bicluster 1), X11 (Biclusters 2 and 3), X12 (Biclusters 2 and 3), and X15 (Bicluster 2). The marked difference between the mean and median characteristics indicates that the ISA biclustering results contain outliers. The outliers are seen more clearly in the distribution of the values of the regional identifying variables for the CC algorithm and ISA biclustering results, depicted in Figure 9.
Figure 9(a) shows that almost all regions have values below zero (low values) in each identifying variable of the CC biclustering results. Of the 156 observation points of the CC results, 89.74% have low values (invulnerable characteristics). It indicates that almost all regions of Indonesia tend to have low vulnerability in the identifying variables of the CC results. Meanwhile, Figure 9(b) shows several areas with outlier values, most of which are above zero, in each identifying variable of the ISA biclustering results. On closer examination, however, most areas in each identifying variable of the ISA results also have low values, about 63.28% of the 1,525 observation points. It indicates that most regions in Indonesia also tend to have low vulnerability in the identifying variables of the ISA results.
Figure 9 shows that the results of both the CC algorithm and ISA are dominated by areas with low values of the identifying variables (low vulnerability). However, the values in the CC biclustering results tend to be homogeneous, as indicated by the relatively small mean variance of the identifying variables, which ranges from 0.011 to 0.023. In addition, the CC algorithm tends to be sensitive to outliers, because its biclustering results contain almost no outlier values. Meanwhile, the ISA biclustering results tend to be scattered or heterogeneous, as indicated by a mean variance of the identifying variables that is relatively high compared with the CC algorithm, ranging from 0.634 to 1.045.
ISA is not sensitive to outliers, because its biclustering results contain many outlier values, so ISA tends to be robust.
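The two summaries used in this comparison, the share of low values and the mean variance of the identifying variables, can be sketched as follows. The counts 140 and 965 are back-calculated here from the reported percentages and are not stated in the study; the function names and toy columns are hypothetical, and the sample variance (`ddof=1`) is an assumption.

```python
import numpy as np


def low_value_share(values):
    """Share of standardized values below zero (the low-vulnerability side)."""
    return float((np.asarray(values, dtype=float) < 0).mean())


def mean_variance(columns):
    """Average of the per-variable sample variances within one result."""
    return float(np.mean([np.var(c, ddof=1) for c in columns]))


# Back-calculated counts of low-value observation points.
cc_low = round(0.8974 * 156)     # 140 of 156 points for the CC algorithm
isa_low = round(0.6328 * 1525)   # 965 of 1,525 points for ISA
print(cc_low, isa_low)

# Hypothetical columns: a homogeneous result versus one with high outliers.
tight = [[-0.9, -1.0, -1.1], [-0.5, -0.6, -0.4]]
spread = [[-1.0, 0.2, 2.5], [-0.8, -0.1, 1.9]]
print(mean_variance(tight), mean_variance(spread))
print(low_value_share(np.ravel(spread)))
```

A result can be dominated by low values and still be heterogeneous: a few large positive outliers raise the variance and pull the mean above the median, which is the pattern described for the ISA biclusters.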
Based on the intra-bicluster evaluation measure, the CC algorithm produces a smaller MSR average per volume (0.00041) than ISA (0.00141). It indicates that the bicluster quality of the CC results tends to be better than that of the ISA results. The MSR average per volume is in line with the MSR and AIC values of each bicluster of the CC algorithm and ISA presented in the corresponding table, in which the MSR and AIC values of the CC biclusters tend to be smaller than those of the ISA biclusters.
Based on the inter-bicluster evaluation measure, the Liu and Wang index, the biclustering results of the CC algorithm and ISA have a very low level of similarity, about 20 to 31 percent. When the CC biclustering results are taken as the optimal bicluster group, the Liu and Wang index is 0.20. When the ISA biclustering results are taken as the optimal bicluster group, the index is 0.31. It indicates that the two algorithms produce biclusters with different memberships and characteristics. The 69 to 80 percent difference is supported by the distribution of the values of the regional identifying variables in Figure 9 and by the comparison of region and variable membership in Figure 7 and Figure 8.
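The two different index values (0.20 and 0.31) reflect the fact that the Liu and Wang index is not symmetric: it averages over the biclusters of whichever group is treated as optimal. A minimal sketch, assuming the standard form of the index with the best match taken over the other group, and with hypothetical bicluster groups given as pairs of row and column index sets:

```python
def liu_wang_index(optimal, other):
    """Average, over the optimal group's biclusters, of the best overlap ratio."""
    total = 0.0
    for rows_o, cols_o in optimal:
        total += max(
            (len(rows_o & rows_m) + len(cols_o & cols_m))
            / (len(rows_o | rows_m) + len(cols_o | cols_m))
            for rows_m, cols_m in other
        )
    return total / len(optimal)


# Hypothetical groups: group_b contains group_a's bicluster plus a disjoint one.
group_a = [({0, 1}, {0, 1})]
group_b = [({0, 1}, {0, 1}), ({5, 6}, {5, 6})]

print(liu_wang_index(group_a, group_a))  # identical groups
print(liu_wang_index(group_a, group_b))  # group_a treated as optimal
print(liu_wang_index(group_b, group_a))  # group_b treated as optimal
```

In this toy case the index is 1.0 when `group_a` is treated as optimal and 0.5 when `group_b` is, because the extra bicluster in `group_b` has no counterpart. The same mechanism allows the index to differ depending on whether the CC or the ISA results are taken as the optimal group.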

D. CONCLUSION AND SUGGESTIONS
The comparative study of the biclustering results of the CC algorithm and ISA at their optimal thresholds shows that the bicluster quality of the CC algorithm tends to be better, as indicated by its lower MSR average per volume. In addition, the two biclustering results have a very low level of similarity (20 to 31 percent), which is supported by the differences in their membership and characteristics. The CC biclustering results tend to be homogeneous, with a small number of identifying variables (local character), and are dominated by areas with low values (low vulnerability). The ISA biclustering results tend to be heterogeneous, with identifying variables that cover all research variables (global character), and are also dominated by areas with low vulnerability. Furthermore, ISA tends to be robust (not sensitive to outliers), because its biclustering results contain many outlier values.