Robust Optimization Model for Twitter Sentiment Analysis of PeduliLindungi Application

ABSTRACT


A. INTRODUCTION
The PeduliLindungi application was developed by the government in April 2020 for digital tracking in order to prevent the spread of COVID-19 in Indonesia (Kurniawati et al., 2020). Every application that is still under development has advantages and disadvantages, including PeduliLindungi. These can be seen from the responses and opinions of its users. People can express their opinions through various media, including social media. Social media user activity has been subjected to data collection and is considered a meaningful data source in industry and academia (Firdaniza et al., 2021). One popular social media platform that Indonesian people actively use is Twitter. As of January 2022, Indonesia had 18.45 million Twitter users, making it the country with the fifth-largest number of Twitter users in the world (Source: Statista.com). On Twitter, anyone can see information or opinions in the tweets shared by its users. Tweets about the PeduliLindungi application can contain positive or negative opinions. Sentiment analysis, or opinion mining, is a field of study that analyzes people's opinions, evaluations, attitudes, and emotions from written language (Liu, 2012). Sentiment analysis is a computational study in the field of Natural Language Processing (NLP). It groups the sentiment in text data into three classes: positive, negative, or neutral. The approaches used in sentiment analysis are divided into three types, namely the Machine Learning, Lexicon-Based, and Hybrid approaches (Elsaid Moussa et al., 2021). This research uses the Machine Learning approach, following Kumar et al. (2020), because Twitter contains much non-standard language and the Lexicon-Based method cannot handle different dialects and informal words.
This research classifies opinions about the PeduliLindungi application into positive, negative, or neutral categories using the Machine Learning approach with the Support Vector Machine (SVM) algorithm. Information is retrieved by crawling Twitter data using the Twitter API Key with the Python programming language. The optimization in this research maximizes the performance metrics, namely the values of Accuracy, Precision, Recall and F1-Score, of the sentiment analysis that uses the SVM algorithm. The state of the art for this research is summarized in Table 1. Table 1 shows that no existing research addresses the weighting problem of sentiment analysis with Robust Optimization. Therefore, this study solves the multi-objective model of Kumar et al. (2020) to determine the weighting of the sentiment analysis problem using Robust Optimization. The Robust Optimization model is used because it can handle uncertainty factors by representing the worst case that can happen.
The main discussion in this research is the implementation of Robust Optimization for sentiment analysis on Twitter, with the PeduliLindungi application as a case study. The multi-objective optimization model of Kumar et al. (2020) is reformulated to take the uncertainty of the data into account. It is assumed that the uncertainty factors of this problem are the values of Accuracy, Precision, Recall and F1-Score. The government can use the negative sentiment to evaluate the performance of the PeduliLindungi application, while the positive sentiment can serve as appreciation for the government's ongoing development of the application. For researchers, the result of the Robust Optimization model for sentiment analysis in this research can be used to evaluate the performance of a classification model.

B. METHODS
In this section, the methods used in this research are discussed. This section provides the theories that are relevant to the research problem, including the theory of Robust Optimization, Multi-objective Optimization and Sentiment Analysis.

Research Stages
In this research, several methods are used to solve the problem. Figure 1 shows the research flowchart with the methods used in this research.

Robust Optimization
A problem involving uncertainty can be solved by Robust Optimization. According to Ben-Tal & Nemirovski (2002), Robust Optimization is a mathematical methodology, combined with computational tools, for processing optimization problems with uncertain data, where the uncertain data are only known to belong to an uncertainty set. Robust Optimization aims to find a solution that is robust to the uncertainty in the parameters (Hertog, 2013). According to Ben-Tal and Nemirovski (2002) and as discussed by Gorissen et al. (2015), the general model with uncertain data can be formulated as problem (1)
$$\min_{x} \{ c^{\top} x : Ax \le d \}_{(c, A, d) \in \mathcal{U}} \tag{1}$$
where $c$, $A$ and $d$ are the uncertain coefficients and $\mathcal{U}$ is the uncertainty set. There are some basic assumptions of Robust Optimization (Ben-Tal et al., 2009), which are as follows:
a. All decision variables $x \in \mathbb{R}^n$ represent "here and now" decisions, meaning that specific numerical values must be obtained as a result of solving the problem before the actual data "reveals itself".
b. The decision-maker is fully responsible for the consequences of the decision to be made if and only if the actual data lies in the uncertainty set $\mathcal{U}$.
c. The constraints of an uncertain linear programming problem are "hard", meaning that the decision-maker cannot tolerate a constraint violation when the data is in $\mathcal{U}$.
Besides the basic assumptions, the following can also be assumed (Gorissen et al., 2015):
a. The objective function is certain.
b. The constraint right-hand side is certain.
c. Robustness with respect to $\mathcal{U}$ can be formulated in constraint-wise form, and the uncertainty set $\mathcal{U}$ is compact and convex.
Based on these assumptions, the Robust Optimization approach turns the uncertain problem into a single deterministic problem called the Robust Counterpart (RC). If $c \in \mathbb{R}^n$ and $d \in \mathbb{R}^m$ are assumed to be certain, then the Robust Counterpart formulation of (1) is written as problem (2) (Gorissen et al., 2015)
$$\min_{x} \{ c^{\top} x : A(\zeta) x \le d, \ \forall \zeta \in \mathcal{Z} \} \tag{2}$$
Substituting the parameterization of the uncertainty into a single constraint of (2) gives equation (3)
$$(\bar{a} + P\zeta)^{\top} x \le d, \quad \forall \zeta \in \mathcal{Z} \tag{3}$$
where $(\bar{a} + P\zeta)$ is an affine function of the primitive uncertain parameter $\zeta \in \mathcal{Z}$, $\bar{a} \in \mathbb{R}^n$, and $P \in M_{n,L}(\mathbb{R})$. The robust optimal solution is the optimal solution of the Robust Counterpart, and the robust optimal value of an uncertain linear programming problem is the optimal value of the Robust Counterpart.
According to Ben-Tal & Nemirovski (2002) and Chaerani & Roos (2013), the computational tractability of a Robust Counterpart can be analyzed by representing it in the form of Linear Programming (LP), Conic Quadratic Programming (CQP) or Semidefinite Programming (SDP). Table 2 shows the tractable formulations for constraints with different uncertainty sets.
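As an illustration of what a robust constraint of the form (3) means, the following minimal sketch checks one constraint against every realization in a simple box, which is one special case of a polyhedral set. The box $-1 \le \zeta_l \le 1$, the helper names and all numbers are assumptions made for this example and are not taken from the study.

```python
from itertools import product


def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def worst_case_lhs(a_bar, P, x):
    """Closed-form worst case of (a_bar + P zeta)^T x over the box -1 <= zeta_l <= 1.

    The maximum of (P^T x)^T zeta over the box is the 1-norm of P^T x.
    """
    n_zeta = len(P[0])
    pt_x = [sum(P[i][l] * x[i] for i in range(len(x))) for l in range(n_zeta)]
    return dot(a_bar, x) + sum(abs(v) for v in pt_x)


def brute_force_lhs(a_bar, P, x):
    """Same worst case, found by enumerating the corners of the box."""
    n_zeta = len(P[0])
    best = float("-inf")
    for zeta in product((-1.0, 1.0), repeat=n_zeta):
        a = [a_bar[i] + sum(P[i][l] * zeta[l] for l in range(n_zeta))
             for i in range(len(a_bar))]
        best = max(best, dot(a, x))
    return best


# Hypothetical data: nominal coefficients, perturbation matrix, candidate x.
a_bar = [1.0, 2.0]
P = [[0.1, 0.0], [0.0, 0.2]]
x = [3.0, 1.0]
d = 6.0

wc = worst_case_lhs(a_bar, P, x)
is_robust_feasible = wc <= d
print(wc, is_robust_feasible)
```

Because the left-hand side is linear in $\zeta$, its maximum over a polytope is attained at a vertex, which is why the corner enumeration and the closed form agree here.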

Text Mining
Text mining is a technique that extracts information from structured and unstructured data and then finds patterns in it (Kannan et al., 2015). Text preprocessing is an essential process in text mining. In this study, tweet data preprocessing is carried out in the following stages:
a. Case folding is the process of transforming data into a uniform letter case. This process converts all letters to lowercase and removes numbers and punctuation marks.
b. Tokenizing is splitting sentences into chunks of words called tokens.
c. Normalization is the process of transforming non-standard text and slang into standard text.
d. Stopword removal is the removal of words that carry little meaning (and usually appear in large numbers).
e. Stemming is the process of transforming a word into its root word by removing all affixes.
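The five stages above can be sketched as a small pipeline. This is only an illustration: the slang dictionary, the stopword list and the affix rules are tiny hypothetical stand-ins, and the stemmer is a toy rule rather than a real Indonesian stemming algorithm. The study does not state which resources or libraries it used.

```python
import re

# Hypothetical toy resources (assumptions for this sketch only).
SLANG = {"gak": "tidak", "apk": "aplikasi"}
STOPWORDS = {"yang", "di", "dan", "ini", "itu"}


def case_fold(text):
    """Lowercase the text and replace digits and punctuation with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split a sentence into word tokens on whitespace."""
    return text.split()


def normalize(tokens):
    """Map slang or non-standard tokens to their standard form."""
    return [SLANG.get(t, t) for t in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Toy affix stripping: prefix 'di' and suffixes 'nya'/'kan' only."""
    if token.startswith("di") and len(token) - 2 >= 4:
        token = token[2:]
    for suffix in ("nya", "kan"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the stages in the order listed in the text."""
    tokens = tokenize(case_fold(text))
    tokens = normalize(tokens)
    tokens = remove_stopwords(tokens)
    return [naive_stem(t) for t in tokens]


tokens = preprocess("Apk PeduliLindungi ini GAK bisa dibuka!! 123")
print(tokens)
```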

TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting method that combines the term frequency (TF) and the inverse document frequency (IDF), and it illustrates how important a term is in a corpus (Saadah et al., 2013). The equation for TF-IDF is written in equations (6) and (7), where $N$ is the total number of documents and $df(t)$ is the number of documents that contain the term $t$. The TF-IDF vector is then normalized using the Euclidean norm as in equation (8)
$$v_{norm} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}} \tag{8}$$
where $v_i$ is the TF-IDF weight of each term.
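A minimal sketch of TF-IDF weighting followed by Euclidean normalization is given below. The smoothed IDF variant `ln((1 + N) / (1 + df(t))) + 1` is an assumption chosen for the example. The exact IDF form used in the study is not recoverable from the text, and the two toy documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # df[t]: number of documents that contain term t.
    df = Counter(t for doc in docs for t in set(doc))
    # Assumed smoothed IDF variant (not confirmed by the source).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm if norm > 0 else 0.0 for x in v])
    return vocab, vectors


# Hypothetical preprocessed tweets.
docs = [["aplikasi", "bagus"], ["aplikasi", "error", "error"]]
vocab, vectors = tfidf_vectors(docs)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

After normalization every document vector has unit length, so longer tweets do not dominate shorter ones purely because of their length.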

Support Vector Machine (SVM)
Support Vector Machine (SVM) is one of the machine learning techniques used for classification (Suthaharan, 2016). SVM aims to find the best hyperplane that separates two classes in the input space (Vapnik, 2000). The hyperplane of the SVM is given by equation (9)
$$(w \cdot x) + b = 0 \tag{9}$$
The data $x_i$ that belong to class $-1$ satisfy equation (10)
$$(w \cdot x_i + b) \le -1, \quad y_i = -1 \tag{10}$$
while the data $x_i$ that belong to class $+1$ satisfy equation (11)
$$(w \cdot x_i + b) \ge 1, \quad y_i = 1 \tag{11}$$
The optimal hyperplane is found by solving the minimization problem (12)
$$\min_{w} \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (x_i \cdot w + b) - 1 \ge 0 \tag{12}$$
where $x_i$ is the input data, $y_i$ is the output value, $w$ is a vector normal to the hyperplane and $b$ is an offset. Problem (12) can be solved with Lagrange multipliers. The optimal hyperplane search problem can be written as equation (13)
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i (x_i \cdot w + b) - 1 \right), \quad i = 1, 2, \dots, n \tag{13}$$
where $\alpha_i \ge 0$ are the Lagrange multipliers. The optimal value of equation (13) can be calculated by minimizing $L$ with respect to $w$ and $b$ while maximizing $L$ with respect to $\alpha_i$. Since the gradient of $L$ is zero at the optimal point, equation (13) can be rewritten as the maximization problem (14)
$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \tag{14}$$
subject to
$$\alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{15}$$
The maximization of (14) subject to (15) produces several $\alpha_i$ with positive values. The data associated with positive $\alpha_i$ are called Support Vectors (SV) and are used to determine the hyperplane.
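To make the margin conditions concrete, the sketch below evaluates $y_i (w \cdot x_i + b)$ for a hand-picked hyperplane on four made-up points and reports which points lie exactly on the margin. The data, `w` and `b` are hypothetical. No training is performed here; the sketch only checks the constraints of the minimization problem for a given hyperplane.

```python
def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def functional_margins(w, b, X, y):
    """Return y_i * (w . x_i + b) for every sample."""
    return [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]


# Hypothetical, linearly separable toy data with labels +1 / -1.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

# Hand-picked hyperplane w . x + b = 0 (an assumption for this example).
w = [0.5, 0.5]
b = -1.0

margins = functional_margins(w, b, X, y)
# Every sample must satisfy y_i (w . x_i + b) >= 1.
all_satisfied = all(m >= 1 - 1e-9 for m in margins)
# Samples that meet the constraint with equality lie on the margin;
# these are the candidates for support vectors.
on_margin = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]
# Width of the margin band is 2 / ||w||.
margin_width = 2.0 / dot(w, w) ** 0.5

print(margins, all_satisfied, on_margin, round(margin_width, 3))
```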

Performance Metrics
Performance metrics can be described as measuring tools for the performance of classifiers (M & M.N, 2015). The formulas used to calculate the performance metrics are shown in Table 3.
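As a sketch of how such metrics can be computed from a multi-class confusion matrix, the function below returns Accuracy together with macro-averaged Precision, Recall and F1-Score. Macro averaging and the row/column orientation (rows are actual classes, columns are predicted classes) are assumptions for this example, and the matrix is made up rather than taken from the study.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of samples of actual class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        predicted_c = sum(cm[i][c] for i in range(k))  # column sum
        actual_c = sum(cm[c])                          # row sum
        p = tp / predicted_c if predicted_c else 0.0
        r = tp / actual_c if actual_c else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,  # macro average (assumed)
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical 3x3 confusion matrix (positive, negative, neutral).
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 2, 8]]
scores = metrics_from_confusion(cm)
print({name: round(value, 4) for name, value in scores.items()})
```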

Multi-objective Optimization
Multi-objective optimization involves several objective functions that are subject to some constraints. One method for solving a multi-objective optimization problem is the Utility Function Method. A multi-objective decision-making problem can be written as equation (16) (Rao, 2009)
$$\max [f_1(x), f_2(x), f_3(x), \dots, f_k(x)] \quad \text{s.t.} \quad g_j(x) \le 0, \ \forall j = 1, 2, 3, \dots, m \tag{16}$$
The problem in equation (16) has $k$ objectives that must be maximized. In the Utility Function Method, the objectives are converted into a single utility function $U$. $U$ can be defined in many ways; one of the simplest is the Weighted Sum Method, which combines all objective functions into one scalar function through a weighted sum (Yang, 2014). Problem (16) then turns into problem (17) (Aliakbari & Seifbarghy, 2011)
$$\max U = \sum_{i=1}^{k} w_i f_i(x) \quad \text{s.t.} \quad g_j(x) \le 0, \ \forall j = 1, 2, 3, \dots, m \tag{17}$$
where $w_i$ is the scalar weight of the $i$-th objective function.

Optimization Model for Sentiment Analysis
Sentiment analysis as a classification problem uses performance metrics to evaluate performance (Vakili et al., 2020). Performance metrics can be determined by calculating the Accuracy, Precision, Recall and F1-Score values. The problem of sentiment analysis discussed in this study is to determine the weight of each performance metric that can maximize the performance metrics. This problem can be solved by using an optimization model to get optimal results.
The sentiment analysis weighting optimization problem is explained as follows.
a. $w_i$ represents the i-th scalar weighting factor, which is chosen to maximize the values of Accuracy, Precision, Recall and F1-Score.
b. $w_i$ is a weighting factor that shows the relative importance of each performance metric, so the weighting factors sum to 1.
c. $w_i$ has a minimum value and a maximum value in the closed interval [0,1].
The objective function of Kumar et al. (2020) is constructed to maximize the values of Accuracy, Precision, Recall and F1-Score. The several objective functions are combined into a single objective function by summing them with scalar weights. This problem involves several parameters and variables as follows.

C. RESULT AND DISCUSSION
This section presents the results. The optimization model formulation for the sentiment analysis weighting problem is discussed, together with its robust version. A numerical experiment validates the model for the case of PeduliLindungi.

Optimization Model Formulation for Sentiment Analysis Weighting Problem
The parameters used in this problem are the performance metrics of sentiment analysis, indexed by $i = 1, \dots, 4$ as follows.
1 : Accuracy
2 : Precision
3 : Recall
4 : F1-Score
A multi-objective function to maximize the values of the performance metrics is formulated as equation (20). With the Utility Function Method as stated in Rao (2009), the multi-objective function in (20) is reformulated into a single objective function. The optimization model formulation in this study is then obtained as in equation (21) (Kumar et al., 2020),
where $w_i$ represents the i-th scalar weighting factor of each performance metric, which indicates its level of importance.
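A minimal sketch of the weighted-sum objective follows. It assumes that the only constraints are that the weights lie in [0, 1] and sum to 1, as described for the weighting factors. The model in the study may impose tighter minimum and maximum values per weight, which would change the optimum. The metric values are hypothetical.

```python
def weighted_score(weights, metrics):
    """Weighted sum of the performance metrics for a given weight vector."""
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return sum(w * m for w, m in zip(weights, metrics))


# Hypothetical metric values: Accuracy, Precision, Recall, F1-Score.
metrics = [0.70, 0.68, 0.65, 0.66]

# Score with equal importance for all four metrics.
equal_score = weighted_score([0.25] * 4, metrics)

# Under the assumed constraints the feasible set is a simplex, so a linear
# objective attains its maximum at a vertex (all weight on one metric).
vertices = [[1.0 if i == j else 0.0 for i in range(4)] for j in range(4)]
best_weights = max(vertices, key=lambda w: weighted_score(w, metrics))
best_value = weighted_score(best_weights, metrics)

print(round(equal_score, 4), best_weights, best_value)
```

The vertex solution illustrates why bounds on individual weights matter: without them, a linear weighted sum simply puts all weight on the largest metric.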

Uncertainty Optimization Model Formulation for Sentiment Analysis Weighting Problem
In this study, it is assumed that the uncertainty factors of this problem are the performance metric values, which are obtained from the classification process. As stated by Ben-Tal & Nemirovski (2002), the procedure for implementing Robust Optimization starts by assuming that the objective function is certain and bounded below by a single variable $t$. This implies that problem (21) can be formulated such that the uncertainty only appears in the constraint function: the single variable $t \in \mathbb{R}$ is maximized in place of the original objective function, as in equation (22). Thus, it can be seen in equation (22) that the uncertainty only appears in the constraint function. The uncertain parameter is defined as in equation (23), where $\bar{a} \in \mathbb{R}^n$ is a nominal value and $P \in \mathbb{R}^{n \times L}$ is a perturbation matrix. The uncertainty set $\mathcal{Z}$ is defined as in equation (24). Substituting the uncertain parameter into the constraint of (22) gives equation (25), and the formulation of the Robust Optimization model for the weighting problem of sentiment analysis is obtained as equation (26).

Robust Counterpart Formulation for Sentiment Analysis Weighting Problem with Polyhedral Uncertainty Sets
The Robust Counterpart is formulated by following the steps for a polyhedral uncertainty set introduced in Gorissen et al. (2015). The formulation of the Robust Counterpart for the sentiment analysis weighting problem with a polyhedral uncertainty set is obtained in three steps:
Step 1 The constraint in equation (25) is equivalent to the worst-case formulation in equation (27).
Step 2 The dual of the maximization problem in equation (27) is formulated, which gives the reformulated form of equation (27).
Step 3 The formulation of the Robust Counterpart model with a polyhedral uncertainty set for the weighting problem of sentiment analysis is obtained as equation (29).
The Robust Counterpart of an uncertain Linear Programming (LP) problem is computationally tractable (Ben-Tal et al., 2009). Equation (29) is in the form of an LP, so the Robust Counterpart model with a polyhedral uncertainty set is a computationally tractable problem, meaning that it can be solved computationally in polynomial time.
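The three steps can be illustrated with a generic derivation in the style of Gorissen et al. (2015). This is a sketch under stated assumptions and not a reproduction of the study's equations (27) to (29): the uncertain constraint is assumed to have the form $(\bar{a} + P\zeta)^{\top} w \ge t$, the polyhedral set is assumed to be $\mathcal{Z} = \{\zeta : D\zeta + q \ge 0\}$, and the symbols $D$, $q$ and the dual variable $y$ are introduced here for illustration only.

```latex
% Assumed setting: w = weight vector, t = epigraph variable,
% Z = { zeta : D zeta + q >= 0 } (polyhedral uncertainty set).
\begin{align*}
\text{Uncertain constraint: } & (\bar{a} + P\zeta)^{\top} w \ge t
    \quad \forall \zeta \in \mathcal{Z} \\
\text{Step 1 (worst case): } & \bar{a}^{\top} w
    + \min_{\zeta :\, D\zeta + q \ge 0} (P^{\top} w)^{\top} \zeta \ge t \\
\text{Step 2 (LP duality): } & \bar{a}^{\top} w
    + \max_{y} \{ -q^{\top} y : D^{\top} y = P^{\top} w,\ y \ge 0 \} \ge t \\
\text{Step 3 (Robust Counterpart): } & \bar{a}^{\top} w - q^{\top} y \ge t,
    \quad D^{\top} y = P^{\top} w, \quad y \ge 0
\end{align*}
```

Multiplying the constraint by $-1$ turns the inner minimization of Step 1 into a maximization, which is the form usually stated for "less than or equal" constraints. In Step 3 the inner maximum can be dropped because it is enough that some feasible $y$ satisfies the inequality, and the resulting system is linear in $(w, t, y)$.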

Support Vector Machine Classification
In this section, classification with the Support Vector Machine is carried out. First, the tweet data are labeled manually. In this study, three classes of sentiment are used: positive, negative and neutral. The number of tweets per label can be seen in Table 4. Table 4 shows that neutral sentiment has the largest number of tweets, followed by positive and negative sentiment. The criteria for each class are as follows:
a. Positive class: contains tweet responses that support or praise the PeduliLindungi application.
b. Negative class: contains tweet responses that oppose the PeduliLindungi application.
c. Neutral class: contains tweet responses that express neither positive nor negative sentiment.
The data classification process begins by dividing the data into training data and testing data. The commonly used ratio is 80:20, which means that 80% of the data is used for training and 20% for testing (Joseph, 2022). Table 5 shows the resulting 3x3 confusion matrix of predicted and actual classes. From Table 5, the values of the performance metrics of the SVM classification are calculated, as shown in Table 6.
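A minimal sketch of an 80:20 split is shown below. The shuffling, the fixed seed and the function name are assumptions for this example. The study does not state whether the split was random or stratified by class.

```python
import random


def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train, test) parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for labeled tweets: (tweet_id, label) pairs.
data = [(i, "neutral") for i in range(100)]
train, test = train_test_split(data)
print(len(train), len(test))
```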

Numerical Experiment
Using the Python programming language, the optimal solution of the deterministic model for the weighting problem of sentiment analysis is obtained, as shown in Table 7. The optimal solution of the Robust Counterpart model with polyhedral uncertainty is then computed, as shown in Table 8. The deterministic model has a larger objective function value than the Robust Counterpart model with polyhedral uncertainty. The difference arises because the Robust Counterpart model with polyhedral uncertainty takes the uncertainty of the performance metric values into account.
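The relation between the two objective values can be illustrated with a small sketch. It assumes a box, which is one special case of a polyhedral uncertainty set, in which each metric may deviate from its nominal value by at most a given amount. It also assumes weights that lie in [0, 1] and sum to 1. The nominal values and deviations are hypothetical and do not reproduce Tables 7 and 8.

```python
from itertools import product


def worst_case_score(weights, a_bar, delta):
    """Smallest weighted score when metric i lies in [a_bar_i - delta_i, a_bar_i + delta_i]."""
    n = len(a_bar)
    return min(
        sum(w * (a + d * z) for w, a, d, z in zip(weights, a_bar, delta, zeta))
        for zeta in product((-1.0, 1.0), repeat=n)
    )


def best_vertex(score, n):
    """Maximize a score that is linear in the weights over the simplex vertices."""
    vertices = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    return max(vertices, key=score)


# Hypothetical nominal metrics and maximum deviations (assumptions).
a_bar = [0.70, 0.68, 0.65, 0.66]
delta = [0.05, 0.01, 0.02, 0.02]
zero = [0.0] * len(a_bar)

# Deterministic model: no deviation is considered.
det_w = best_vertex(lambda w: worst_case_score(w, a_bar, zero), len(a_bar))
det_value = worst_case_score(det_w, a_bar, zero)

# Robust model: the score is evaluated at the worst case in the box.
rob_w = best_vertex(lambda w: worst_case_score(w, a_bar, delta), len(a_bar))
rob_value = worst_case_score(rob_w, a_bar, delta)

print(det_w, det_value)
print(rob_w, rob_value)
```

In this toy setting the robust value cannot exceed the deterministic one, since the nominal data is itself one of the realizations in the uncertainty set. The robust weights may also differ from the deterministic ones when the metric with the best nominal value is also the most uncertain.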

D. CONCLUSION AND SUGGESTIONS
The Robust Counterpart formulation with a polyhedral uncertainty set for the weighting problem of sentiment analysis takes the form of a Linear Programming (LP) problem. It can therefore be concluded that this formulation is computationally tractable, i.e. the model can be solved computationally in polynomial time. With the Robust Counterpart model with polyhedral uncertainty, the performance of the SVM model has a value of 0.6433. This objective function value is smaller than that of the deterministic model, which is 0.66, because the Robust Counterpart model maximizes the performance metrics while handling uncertainty factors that represent the worst case that can happen. Further research can apply the Robust Counterpart optimization model with other uncertainty sets, namely the box uncertainty set and the ellipsoidal uncertainty set (Gorissen et al., 2015). Different classification methods can also be used for sentiment analysis in order to compare the performance of other algorithms.