Identification of Demographic Factors Affecting Student Performance using Tree-Based Machine Learning Models
Abstract
This study identifies the key academic and demographic factors that influence student performance in the Logic and Set Theory course, particularly across the different learning modes used during and after the COVID-19 pandemic. It adopts a quantitative exploratory design covering students from the 2020 to 2023 cohorts at Sanata Dharma University. Academic data (exam and assignment scores, course outcomes) and demographic data (e.g., parental education and income, region of origin, gender, and high school major) were collected from the academic system and supplemented with questionnaires. The dataset was cleaned, encoded, and scaled with RobustScaler, and class imbalance was addressed with SMOTE. Descriptive statistics were used to explore initial data characteristics. Five tree-based machine learning models (Decision Tree, Random Forest, XGBoost, LightGBM, and CatBoost) were implemented within a pipeline that combined preprocessing with hyperparameter optimization using GridSearchCV and 5-fold cross-validation. Models were evaluated on accuracy, precision, recall, F1-score, AUC, and Average Precision. XGBoost and CatBoost achieved the best performance (accuracy 92%, AUC 0.99), with balanced precision and recall across all four performance categories. Feature importance analysis indicated that exam and assignment scores were the strongest predictors, while demographic factors such as enrollment year, parental education, and income contributed moderately. Gender, region, and high school major had minimal influence. The research demonstrates how machine learning can integrate academic and demographic data, rather than analyzing them in isolation, to uncover nuanced patterns in student achievement.
The findings support the development of data-driven educational interventions, such as preparatory learning modules, peer mentoring for underperforming groups, targeted academic advising for students from low-income or less-educated families, and flexible instructional strategies for cohorts affected by pandemic-related disruptions.
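To make the two preprocessing steps named in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' code. It assumes the default behaviour of RobustScaler (centre on the median, divide by the interquartile range) and shows only the interpolation step of SMOTE; the nearest-neighbour search that SMOTE performs first is omitted. The function names, the exam scores, and the two minority-class points are hypothetical illustrations.

```python
import random
import statistics


def robust_scale(values):
    """Scale one feature as (x - median) / IQR, the idea behind RobustScaler."""
    # "inclusive" quartiles use linear interpolation over the observed data.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    median = statistics.median(values)
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # avoid division by zero for constant features
    return [(x - median) / iqr for x in values]


def smote_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


# Hypothetical exam scores containing one extreme value (5).
exam_scores = [55, 60, 65, 70, 75, 80, 5]
scaled = robust_scale(exam_scores)

# Hypothetical minority-class samples that are already scaled.
rng = random.Random(0)
minority_a = [0.2, -0.5]
minority_b = [0.6, 0.1]
synthetic = smote_sample(minority_a, minority_b, rng)

print(scaled)
print(synthetic)
```

Because the median and interquartile range are barely moved by the extreme score, the remaining scores keep a sensible spread after scaling, which is the usual motivation for choosing a robust scaler over mean and variance scaling. In practice, oversampling would be applied only to the training folds inside cross-validation so that synthetic samples do not leak into the validation data.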
DOI: https://doi.org/10.31764/jtam.v9i2.28815
Copyright (c) 2025 Chatarina Enny Murwaningtyas

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
JTAM (Jurnal Teori dan Aplikasi Matematika)