Chi-Square Feature Selection with Pseudo-Labelling in Natural Language Processing

Sintia Afriyani, Sugiyarto Surono, Iwan Mahmud Solihin

Abstract


This study evaluates how effectively Chi-Square feature selection improves the classification accuracy of linear Support Vector Machine, K-Nearest Neighbors, and Random Forest classifiers in natural language processing (NLP), and introduces a Pseudo-Labelling technique to improve semi-supervised classification performance. The work matters for NLP because accurate feature selection can significantly improve model performance by reducing noise in the data and focusing on the most relevant information, while Pseudo-Labelling makes use of unlabelled data, which is particularly useful when labelled data is scarce. The methodology involves collecting a relevant dataset, applying the Chi-Square method to select significant features, and then applying Pseudo-Labelling to train semi-supervised models. The dataset consists of public comments on the 2024 Presidential General Election, obtained by scraping Twitter. It includes a variety of comments and opinions about the presidential candidates, including political views, support, and criticism. The experimental results show a significant improvement in classification accuracy, to 0.9200, with a precision of 0.8893, a recall of 0.9200, and an F1-score of 0.8828. Integrating Pseudo-Labelling markedly improves semi-supervised classification performance, suggesting that combining Chi-Square and Pseudo-Labelling can improve classification systems in a range of NLP applications. These results open up opportunities to develop more efficient methods for improving classification accuracy and effectiveness in NLP tasks, especially those using linear Support Vector Machine, K-Nearest Neighbors, and Random Forest classifiers as well as semi-supervised learning.
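
The two techniques named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation, which is not given on this page. The function names (`chi_square_scores`, `select_top_k`, `pseudo_label`), the whitespace tokenisation, the toy documents, and the 0.9 confidence threshold are all assumptions made for illustration. The Chi-Square score is the usual 2x2 contingency statistic for term presence against class membership, and the pseudo-labelling step simply keeps unlabelled items whose most probable predicted class meets the threshold.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by its largest chi-square statistic over all classes.

    Each document counts a term at most once (presence/absence).
    """
    n = len(docs)
    classes = sorted(set(labels))
    class_count = Counter(labels)
    doc_terms = [set(d.split()) for d in docs]  # assumed whitespace tokeniser
    term_count = Counter(t for terms in doc_terms for t in terms)
    term_class = Counter(
        (t, y) for terms, y in zip(doc_terms, labels) for t in terms
    )

    scores = {}
    for term, n_term in term_count.items():
        best = 0.0
        for cls in classes:
            a = term_class[(term, cls)]       # term present, class cls
            b = n_term - a                    # term present, other classes
            c = class_count[cls] - a          # term absent, class cls
            d = n - a - b - c                 # term absent, other classes
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def pseudo_label(predict_proba, unlabeled, threshold=0.9):
    """Keep unlabeled items whose top class probability is >= threshold.

    `predict_proba` is any callable mapping an item to {class: probability};
    the threshold value is an assumption, not taken from the paper.
    """
    accepted = []
    for item in unlabeled:
        probs = predict_proba(item)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            accepted.append((item, label))
    return accepted


# Toy data, invented for illustration only.
docs = ["good great candidate", "great support",
        "bad poor candidate", "poor criticism"]
labels = ["pos", "pos", "neg", "neg"]
scores = chi_square_scores(docs, labels)
print(select_top_k(scores, 2))
```

In a full semi-supervised pipeline, the pairs returned by `pseudo_label` would be added to the labelled set and the classifier retrained, typically over several rounds.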

Keywords


Chi-Square Feature Selection; Natural Language Processing; Pseudo-Labelling; Semi-supervised.

DOI: https://doi.org/10.31764/jtam.v8i3.22751

Copyright (c) 2024 Sintia Afriyani, Sugiyarto Surono, Mahmud Iwan Solihin

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
