<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "https://jats.nlm.nih.gov/publishing/1.3/JATS-journalpublishing1-3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.3" article-type="research-article"><front><journal-meta><journal-id journal-id-type="issn">2541-2590</journal-id><journal-title-group><journal-title>JRAMathEdu (Journal of Research and Advances in Mathematics Education)</journal-title><abbrev-journal-title>J.Res.Adv.Math.Educ</abbrev-journal-title></journal-title-group><issn pub-type="epub">2541-2590</issn><issn pub-type="ppub">2503-3697</issn><publisher><publisher-name>Lembaga Pengembangan Publikasi Ilmiah dan Buku Ajar, Universitas Muhammadiyah Surakarta</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.23917/jramathedu.v9i4.4643</article-id><article-categories/><title-group><article-title>Predictive analytics of student performance: Multi-method and code</article-title></title-group><contrib-group><contrib contrib-type="author"><name><surname>Vladova</surname><given-names>Alla Yu.</given-names></name><address><country>Russian Federation</country><email>alla.vladova@gmail.com</email></address><xref ref-type="aff" rid="AFF-1"/><xref ref-type="corresp" rid="cor-0"/></contrib><contrib contrib-type="author"><name><surname>Borchyk</surname><given-names>Katsiaryna M.</given-names></name><address><country>Belarus</country></address><xref ref-type="aff" rid="AFF-2"/></contrib><aff id="AFF-1">Financial University under the Government of Russian Federation</aff><aff id="AFF-2">Belarusian-Russian University</aff></contrib-group><author-notes><corresp id="cor-0"><bold>Corresponding author: Alla Yu. 
Vladova</bold>, Financial University under the Government of Russian Federation .Email:<email>alla.vladova@gmail.com</email></corresp></author-notes><pub-date date-type="pub" iso-8601-date="2024-10-30" publication-format="electronic"><day>30</day><month>10</month><year>2024</year></pub-date><pub-date date-type="collection" iso-8601-date="2024-10-31" publication-format="electronic"><day>31</day><month>10</month><year>2024</year></pub-date><volume>9</volume><issue>4</issue><issue-title>Volume 9 Issue 4 October 2024</issue-title><fpage>190</fpage><lpage>204</lpage><history><date date-type="received" iso-8601-date="2024-3-28"><day>28</day><month>3</month><year>2024</year></date><date date-type="rev-recd" iso-8601-date="2024-10-16"><day>16</day><month>10</month><year>2024</year></date><date date-type="accepted" iso-8601-date="2024-10-23"><day>23</day><month>10</month><year>2024</year></date></history><permissions><copyright-statement>Copyright (c) 2024 Alla Vladova, Katsiaryna M. Borchyk</copyright-statement><copyright-year>2024</copyright-year><copyright-holder>Alla Vladova, Katsiaryna M. Borchyk</copyright-holder><license><ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://creativecommons.org/licenses/by-nc/4.0</ali:license_ref><license-p>This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.</license-p></license></permissions><self-uri xlink:href="https://journals2.ums.ac.id/index.php/jramathedu/article/view/4643" xlink:title="Predictive analytics of student performance: Multi-method and code">Predictive analytics of student performance: Multi-method and code</self-uri><abstract><p>The maintenance of a high level of education in universities can be a challenging task due to low academic performance. Despite the significant amount of collected diagnostic data, education managers underutilize machine learning methods to improve the accuracy of predicting academic performance. 
The authors apply a multi-method approach to data analysis, combining logistic regression, linear regression, and k-means clustering, which together produce a synergistic effect. The proposed approach differs from known analogs in that, firstly, the dimensionality of the feature space increases due to the normalization of scores onto a single scale and the creation of new features: the index and rank of students, as well as the changes in performance across various activities for each student. Secondly, students at academic risk are forecasted, and the statistical significance of the features included in the model is evaluated. Thirdly, for each student, the final score for the semester is forecasted using a linear regression model of academic performance. Fourthly, groups of students with similar learning trajectories are identified for customization of consultations. The authors achieved high predictive ability with models based on historical training data: exam passing is predicted correctly in 90% of cases, and individual scores in 70% of cases.</p></abstract><kwd-group><kwd>Customized Learning</kwd><kwd>Academic Performance</kwd><kwd>Educational Data Analysis</kwd></kwd-group><funding-group><funding-statement>This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors</funding-statement></funding-group><custom-meta-group><custom-meta><meta-name>File created by JATS Editor</meta-name><meta-value><ext-link ext-link-type="uri" xlink:href="https://jatseditor.com" xlink:title="JATS Editor">JATS Editor</ext-link></meta-value></custom-meta><custom-meta><meta-name>issue-created-year</meta-name><meta-value>2024</meta-value></custom-meta></custom-meta-group></article-meta></front><body><sec><title>INTRODUCTION</title><p>The academic performance of students is one of the most important characteristics of the educational activities of an educational institution, by which professors and education managers
can judge the results achieved or the problems that exist. Each university has its own system for assessing academic performance, including various indicators of academic activities <xref ref-type="bibr" rid="BIBR-10">(Elisabeta &amp; Alexandru, 2018)</xref>. The academic performance of students in mathematical disciplines is usually assessed through computer tests, expert evaluation of semester projects, the level of preparation for seminars, and attendance <xref ref-type="bibr" rid="BIBR-39">(Zafar et al., 2020)</xref>. The quality of students' work is then used for effective educational process management, for decisions about awarding state academic and named scholarships, issuing diplomas with honors, and other tasks. Thus, the research covers the following tasks: (1) how to use historical data effectively to predict student performance; (2) which predictive models are the most understandable to education managers; (3) how to reduce the subjective influence of experts on students' final grades.</p></sec><sec><title>Literature review</title><p>The researchers <xref ref-type="bibr" rid="BIBR-41">(Zhang et al., 2021)</xref> state that predicting student performance helps all stakeholders in the educational process. For example, students can choose appropriate courses or exercises and make plans for the semester accordingly <xref ref-type="bibr" rid="BIBR-14">(Ibrahim &amp; Rusli, 2007)</xref>, discovering the relationships between courses. Professors can adjust educational materials and curricula depending on students' abilities and identify students at risk within the group <xref ref-type="bibr" rid="BIBR-16">(Kloft et al., 2014)</xref>. Managers in the education sector can review the curriculum and optimize the set of disciplines. Prediction can guide course selection and provide early warnings about student learning, but identifying the key factors that affect educational behavior is an even more important task.
This is because (1) key factors can point to educational interventions; (2) the reasons for success or failure can reveal patterns of student learning; and (3) understanding these factors can inform study plans, course assignments, and learning sequences. As <xref ref-type="bibr" rid="BIBR-15">(Kahramanoğlu, 2018)</xref> notes, the same characteristics help to indirectly analyze the hard and soft skills of prospective teachers.</p><p>The article <xref ref-type="bibr" rid="BIBR-36">(Yağcı, 2022)</xref> proposes a machine learning model for predicting the final scores of undergraduate students, using their midterm exam scores as the input data. To forecast the exam scores, the performance metrics of random forest, k-nearest neighbor, support vector machine, logistic regression, and naive Bayes algorithms are calculated and compared. The dataset consisted of the academic performance scores of 1854 students at a state university in Turkey during the autumn semester of 2019-2020. Predictions are made using three types of features: midterm exam scores, as well as department and faculty names. The proposed model achieved a classification accuracy of 70–75%. The insufficient accuracy of the model can be explained by the presence of low-variance features.</p><p>Authors of the study <xref ref-type="bibr" rid="BIBR-20">(Oluwadele et al., 2023)</xref> assessed academic performance in the field of medical education through indicators of students' acquisition and perception of knowledge, level of confidence, ease of use of the e-learning platform and willingness to recommend e-learning.
The flaw of this approach lies in the qualitative nature of the analyzed features and their strong dependence on the opinions of different experts.</p><p>Researchers <xref ref-type="bibr" rid="BIBR-18">(Liu &amp; Yu, 2023)</xref> use the online student actions that an e-learning platform collects, namely: the time it takes to answer a question or submit an assignment, the number of missed questions, excessive tardiness, cheating on tests, and derogatory comments in online discussions. The disadvantage of this approach is that the data were used without additional transformations, which affects the performance of the model.</p><p>The study <xref ref-type="bibr" rid="BIBR-38">(Ye et al., 2022)</xref> offers a model for predicting the effectiveness of online learning based on the selection and merging of features. The model uses the relationship between behaviors and examines whether combinations of behaviors are better predictors of academic performance.</p><p>The researchers <xref ref-type="bibr" rid="BIBR-35">(Yadav &amp; Deshmukh, 2023)</xref> emphasize that the most significant factors influencing students' academic performance are low initial scores, family support, living arrangements, gender, previous performance, students' internal assessment, average academic performance, and students' activity in e-learning. They also note that a plan to improve students' academic performance should include additional consultations for students with low performance. This helps both students and teachers to overcome the challenges faced during education.
The idea of selecting students for additional consultations forms the basis of the fourth stage of the proposed method.</p><p>Thus, it is possible to identify the following features that are used to predict academic performance: the attendance coefficient; the ratio of scores for work or campus activities to the total possible certification score; and performance dynamics, the change in scores between the first and second certification. This change in scores is the basis of the proposed method.</p><p>The authors <xref ref-type="bibr" rid="BIBR-37">(Yang et al., 2018)</xref> designed several weekly learning activities, including homework, quizzes, and video-based learning, and applied a multiple linear regression model to predict students’ academic scores. They also reworked the well-known metrics for assessing the accuracy of the models, using data obtained during the cross-validation stage. They believe, however, that the models are applicable only to the courses, learning activities, and data attributes for which they were developed.</p><fig id="figure-roi7bp" ignoredToc=""><label>Figure 1</label><caption><p>Tag cloud of scores</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45012" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><p>Researchers <xref ref-type="bibr" rid="BIBR-28">(Troussas et al., 2013)</xref> note that clustering users into groups with common interests is very useful when learning multiple languages. They used the k-means algorithm because of its simplicity. Authors <xref ref-type="bibr" rid="BIBR-6">(Bayazit et al., 2022)</xref> use the same clustering algorithm to identify students with low engagement. A known drawback of the algorithm is that the number of clusters is set a priori and does not sufficiently reflect students with a satisfactory level of interaction. Two clustering quality indexes are tested and compared in <xref ref-type="bibr" rid="BIBR-21">(Petrovic, 2006)</xref>.
Experimental results comparing the effectiveness of a multiple classifier with the two indexes implemented show that the system using the Silhouette index produces slightly more accurate results than the system that uses the Davies-Bouldin index.</p><p>Based on the literature review, it can be concluded that there is significant interest in predicting students' academic performance using machine learning methods. It has been established that academic performance prediction is carried out either through binary classification, whether a student will pass the exam or not, or through regression to predict the potential score for the exam. It has also been identified that methods grouping students based on similarities in learning trajectories are not commonly applied.</p><sec><title>Quick overview of the initial dataset</title><p>The quality of the data in an e-learning platform has a direct impact on the accuracy of predictive models <xref ref-type="bibr" rid="BIBR-22">(Qiu et al., 2022)</xref>. The initial dataset with intermediate and final scores of students for the first semester in the discipline Data Analysis contains 20 features: two text features with the surnames and first names of students; the group number; eight numerical features with students' scores for homework, self-study, term papers and tests posted in the e-learning platform; and the remaining numerical features contain the scores given by the teacher for programming skills, activity in the classroom, as well as final scores.</p><p>To visualize the numerical features, we applied the WordCloud library of the Python language <xref ref-type="bibr" rid="BIBR-30">(Vladova, 2024)</xref> and built the tag cloud shown in <xref ref-type="fig" rid="figure-roi7bp">Figure 1</xref>.
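</p><p>As a minimal sketch of this step (not the authors' exact code), the tag cloud can be driven by the frequency of each distinct score value; the score list below is a toy illustration.</p>

```python
# Sketch: count how often each score value occurs; the resulting
# frequency dictionary feeds WordCloud.generate_from_frequencies.
from collections import Counter

scores = [4.0, 5.0, 5.0, 3.5, 55.0, 48.0, 60.0, 5.0]  # toy midterm and exam scores
freqs = {str(v): c for v, c in Counter(scores).items()}
assert freqs["5.0"] == 3  # the value 5.0 occurs three times

# Assumed final step (requires the wordcloud package):
# from wordcloud import WordCloud
# WordCloud(width=800, height=400).generate_from_frequencies(freqs).to_file("cloud.png")
```
<p>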
Based on <xref ref-type="fig" rid="figure-roi7bp">Figure 1</xref>, it is evident that the educational institution employs two grading scales: midterm performance is evaluated on a scale from 0 to 5.0, while the final scores are converted to a scale from 0 to 100.0. The majority of students demonstrate good and excellent knowledge.</p><fig id="figure-1" ignoredToc=""><label>Figure 2</label><caption><p>The map of the number of scores for different activities: a) taking into account the students' gender; b) taking into account the students' group</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45013" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig></sec><sec><title>Purpose and objectives of the study</title><p>The aim of the study is to enhance students' preparedness by developing models for assessing and predicting students' academic performance based on their current scores.
The objectives of this research are:</p><list list-type="order"><list-item><p>Conduct a statistical analysis of student scores for a particular discipline: assess data imbalance, identify gaps, construct score distribution densities, and explore the correlation matrix of scores.</p></list-item><list-item><p>Increase the dimensionality of the feature space by normalizing scores to a common scale and creating new features such as student indexes, ranks, and the differences between the scores obtained by each student at different time points.</p></list-item><list-item><p>Predict students at academic risk and evaluate the statistical significance of the features included in the model.</p></list-item><list-item><p>Customize consultations for student groups with similar learning trajectories.</p></list-item><list-item><p>Forecast semester final scores for each student. This approach involves comprehensive analysis, modeling, and customization of consultations to effectively improve students' academic performance in universities.</p></list-item></list></sec><sec><title>Statistical analysis of the initial dataset</title><p>It is essential to address the data imbalance highlighted in the primary statistical analysis:</p><list list-type="order"><list-item><p>Gender imbalance, with 30% more women than men</p></list-item><list-item><p>Imbalance in the number and level of scores received by students for various activities. The histograms depicting the number of students assessed for each activity, segmented by gender and group, are presented in <xref ref-type="fig" rid="figure-1">Figure 2</xref>.</p></list-item></list><p>The analysis revealed that male students are more involved in coursework and significantly more active in classes, while a significantly larger number of female students completed the second test. Bonus assignments are challenging for both male and female students.
Not all students completed the full amount of work, so there are missing values in the data, shown in <xref ref-type="fig" rid="figure-2">Figure 3</xref>.</p><fig id="figure-2" ignoredToc=""><label>Figure 3</label><caption><p>Map of data omissions</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45014" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><fig id="figure-3" ignoredToc=""><label>Figure 4</label><caption><p>Correlation matrix</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45015" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><p>These missing values are logically replaced with zero scores. From <xref ref-type="fig" rid="figure-2">Figure 3</xref>, it is evident that there is a positive trend in programming skills but a negative trend in class activity.</p><p>Understanding the relationships between features allows for better preparation for the clustering process, eliminating redundant or highly correlated features (Hafsa et al., 2023).
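</p><p>A minimal sketch of this preparation step (toy data, pure Python; the feature names are illustrative) shows the zero-filling of omissions and a pairwise Pearson correlation of the kind collected in the matrix:</p>

```python
# Sketch: replace missing scores (None) with zeros, then compute the
# Pearson correlation between two score columns (formula-level, toy data).
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

attestation2 = [None, 18, 20, 15]        # None: student skipped the activity
final_score = [40, 80, 95, 70]
attestation2 = [0 if v is None else v for v in attestation2]
r = pearson(attestation2, final_score)   # strongly positive on this toy data
```
<p>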
The lower triangular correlation matrix in <xref ref-type="fig" rid="figure-3">Figure 4</xref> shows that the highest coefficients of linear correlation are observed between the second module certification and the final score (0.86), as well as between exam scores and the final score (0.93).</p><fig id="figure-4" ignoredToc=""><label>Figure 5</label><caption><p>Score distribution densities</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45016" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><fig id="figure-5" ignoredToc=""><label>Figure 6</label><caption><p>Stages of the methodology</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45017" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><p>Using the kernel density estimation method (<xref ref-type="bibr" rid="BIBR-13">(Humbert et al., 2022)</xref>; <xref ref-type="bibr" rid="BIBR-34">(Węglarczyk, 2018)</xref>), we obtained score distribution densities across the main control points (tests, exam, final score) that are close to normal in shape, as shown in <xref ref-type="fig" rid="figure-4">Figure 5</xref>. Moreover, the grading scales for tests vary from 0 to 10, for the exam from 0 to 60, and for the final score from 0 to 100. The average score for men for the second test, exam and final score is slightly shifted to the left relative to the average score for women.</p></sec></sec><sec><title>METHODS</title><p>The proposed methodology for predicting the final score includes five stages shown in <xref ref-type="fig" rid="figure-5">Figure 6</xref>. The first stage involves performing a statistical analysis to identify imbalances, data omissions, and high and low correlations. In the second stage, new features are formed, and the dataset is aggregated by student index and group number, forming a performance profile for each student.
In the third stage, with a sufficient number of scores in the aggregate, a binary classification predictive model separating students who passed the exam from those who did not is built, along with an error assessment. Several methods implement binary classification; among the most effective are those derived from linear discriminant analysis, such as Quadratic Discriminant Analysis (QDA) and Regularized Discriminant Analysis (RDA), as well as logistic regression <xref ref-type="bibr" rid="BIBR-4">(Araveeporn, 2023)</xref>. The choice of logistic regression is dictated by the fact that it assumes neither a normal distribution of the independent variables nor homogeneity of the variance-covariance matrices. The quality of separation is evaluated using the F1 metric, the harmonic mean of precision and recall:</p><p><inline-formula><tex-math id="math-1"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle F1 = \frac{\text{TP}}{TP + \frac{FP + FN}{2}} \end{document} ]]></tex-math></inline-formula> , (1)</p><p>where TP, FP, and FN are the numbers of true positive, false positive, and false negative forecasts.</p><p>In the fourth stage, performance is forecasted using linear regression. As a result, a trend and a predicted performance score are determined for each student.
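</p><p>A minimal sketch of this per-student trend fit (ordinary least squares on toy normalized scores; the week numbers and score values are illustrative):</p>

```python
# Sketch: fit y = slope * x + intercept over one student's successive
# normalized scores and extrapolate one step ahead.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

weeks = [1, 2, 3, 4]
scores = [0.50, 0.55, 0.65, 0.70]     # one student's normalized scores
slope, intercept = fit_line(weeks, scores)
forecast = slope * 5 + intercept      # predicted score for the next point
```
<p>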
The quality of prediction is evaluated through mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) <xref ref-type="bibr" rid="BIBR-2">(Aissaoui et al., 2020)</xref>:</p><p><inline-formula><tex-math id="math-2"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle MAE = \frac{\sum_{i}^{n}{|y_{i} - x_{i}|}}{n} \end{document} ]]></tex-math></inline-formula> , <inline-formula><tex-math id="math-3"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle \text{MSE} = \frac{1}{n}\sum_{i}^{n}{(y_{i} - x_{i})}^{2} \end{document} ]]></tex-math></inline-formula> , <inline-formula><tex-math id="math-4"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle \text{RMSE} = \sqrt{\text{MSE}} \end{document} ]]></tex-math></inline-formula> (2)</p><p>where yi is the predicted value, xi is the true value, and n is the number of observations.</p><p>Finally, the fifth stage entails clustering students based on the similarity of their score sets. The quality of clustering is determined by the silhouette coefficient. It measures how well an object matches its cluster compared to other clusters: a value close to 1 indicates that the object is well clustered; a value close to 0 indicates that the object is on the border between two clusters; a negative value indicates incorrect clustering.</p><p>The Silhouette Score can be calculated for each object in the cluster and then averaged for an overall assessment of the clustering quality <xref ref-type="bibr" rid="BIBR-8">(Bonaccorso, 2018)</xref>.
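</p><p>The silhouette computation can be sketched directly from these definitions (pure Python; a toy one-dimensional clustering of students' mean normalized scores, with absolute differences as distances):</p>

```python
# Sketch: mean silhouette coefficient s(i) = (b - a) / max(a, b)
# over a toy 1-D clustering; distances are absolute differences.
def mean_silhouette(points, labels):
    result = []
    for i, p in enumerate(points):
        same = [abs(p - q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same)                 # mean intra-cluster distance a(i)
        b = min(                                  # nearest-other-cluster distance b(i)
            sum(abs(p - q) for j, q in enumerate(points) if labels[j] == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) - {labels[i]}
        )
        result.append((b - a) / max(a, b))
    return sum(result) / len(result)

points = [0.10, 0.15, 0.80, 0.90]    # two tight score groups
labels = [0, 0, 1, 1]
S = mean_silhouette(points, labels)  # close to 1: well-separated clusters
```
<p>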
The formula for calculating the silhouette coefficient for the individual object i is as follows:</p><p><inline-formula><tex-math id="math-5"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle s\left( i \right) = \frac{b\left( i \right) - a(i)}{max(a\left( i \right),b\left( i \right))} \end{document} ]]></tex-math></inline-formula> (3)</p><p>where a(i) is the average distance from object i to all other objects in the same cluster; this value indicates how close the objects within the cluster are to each other. b(i) is the minimum average distance from object i to the objects in the nearest other cluster; this value indicates how close the object is to other clusters.</p><p>The average value of the silhouette coefficients of all objects is calculated according to the formula:</p><p><inline-formula><tex-math id="math-6"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle S = \frac{1}{n}\sum_{i = 1}^{n}{s(i)} \end{document} ]]></tex-math></inline-formula>, (4)</p><p>where n is the total number of objects.</p><p>The proposed method differs from known analogs in that, firstly, the dimensionality of the feature space increases due to the normalization of scores onto a single scale and the creation of new features: the index and rank of students, as well as the changes in performance across various activities for each student. Secondly, students at academic risk are forecasted, and the statistical significance of the features included in the model is evaluated. Thirdly, for each student, the final score for the semester is forecasted using a linear regression model of academic performance.
Fourthly, groups of students with similar learning trajectories are identified for customization of consultations.</p><p>The practical significance of this method lies in the possibility of obtaining new knowledge about the learning process.</p><sec><title>New features generation</title><p>Creating new features improves models in the following aspects: reducing computation time or the required data volume, enhancing model interpretability, and increasing predictive accuracy <xref ref-type="bibr" rid="BIBR-31">(Vladova &amp; Shek, 2021)</xref>. Based on the names and endings of Russian surnames <xref ref-type="bibr" rid="BIBR-40">(Zahoranský &amp; Polasek, 2015)</xref>, the binary attribute Sex has been added. To anonymize the data <xref ref-type="bibr" rid="BIBR-3">(Alier et al., 2021)</xref>, a feature called Index is introduced, composed of the first letters of the student's last name and first name, the gender, and the group number. It was found that test, project, and homework scores range from 0 to 5 points, midterm assessment scores vary from 0 to 22 points, exam scores range from 0 to 60 points, and final scores range from 0 to 100 points. Therefore, all scores are normalized to a range from 0 to 1 by dividing by their respective maximum values. This transformation results in new normalized features that comprehensively characterize a student's academic performance on a scale from 0 to 1.
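</p><p>A minimal sketch of this normalization step (toy values; the dictionary keys are illustrative, while the scale maxima are those reported above):</p>

```python
# Sketch: normalize each score by its scale maximum, then derive
# "progress" features as second-half minus first-half differences.
SCALE_MAX = {"test": 5, "attestation": 22, "exam": 60, "final": 100}

raw = {"test1": 4, "test2": 5, "attestation1": 15, "attestation2": 20}
# Strip the trailing time-point digit to look up the activity's scale maximum
norm = {k: v / SCALE_MAX[k.rstrip("12")] for k, v in raw.items()}

# Positive progress means improvement in the second half of the semester
test_progress = norm["test2"] - norm["test1"]
attestation_progress = norm["attestation2"] - norm["attestation1"]
```
<p>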
The distributions of these features are shown in <xref ref-type="fig" rid="figure-6">Figure 7</xref>.</p><p>The next feature contains the student's rank relative to other students in the group and is obtained by calculating the weighted sum of the scores and sorting the results from maximum to minimum, as shown in <xref ref-type="table" rid="table-1">Table 1</xref>.</p><fig id="figure-6" ignoredToc=""><label>Figure 7</label><caption><p>Distribution of normalized scores by main types of academic work</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45018" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><table-wrap id="table-1" ignoredToc=""><label>Table 1</label><caption><p>Student’s rank. Fragment</p></caption><table frame="box" rules="all"><thead><tr><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Index</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Rank</p></th></tr></thead><tbody><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>ВАf5</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>1</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>МАf5</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>2</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p/></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>…..</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>НАm4</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>49</p></td></tr></tbody></table></table-wrap><table-wrap id="table-2" ignoredToc=""><label>Table 2</label><caption><p>Academic performance.
Fragment</p></caption><table frame="box" rules="all"><thead><tr><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Index</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Campus work progress</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Programming skills progress</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Activity progress</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Test progress</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Attestation progress</p></th></tr></thead><tbody><tr><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>ВДmf5   </p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.11</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.6</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.2</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>-0.1</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.13</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>ВКf5</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.23</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.0</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.0</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.9</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.41</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>ГАf5</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>-0.06</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.0</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.0</p></td><td colspan="1" rowspan="1" style="" align="left" 
valign="top"><p>0.1</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.01</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>MАf5</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.12</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.0</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.4</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>0.0</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>-0.11</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="left" valign="top"><p/></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>….</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p/></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p>….</p></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p/></td><td colspan="1" rowspan="1" style="" align="left" valign="top"><p/></td></tr></tbody></table></table-wrap><p>To provide a more comprehensive view of a student's academic progress, we consider changes in academic performance across various activities, shown in <xref ref-type="table" rid="table-2">Table 2</xref>. Negative values indicate a decrease in score in the second part of the semester. This approach can be particularly useful for educators and academic institutions to gain a deeper understanding of student development.</p><p>By calculating the differences between normalized scores at different time points (e.g., Campus work 2 - Campus work 1, Attestation 2 - Attestation 1, etc.), these new features effectively capture the change in academic performance or achievement in specific activities over a period of time, such as a semester.</p><p>The article (Shahiri et al., 2015) reviews the top four methods for predicting academic performance.
The neural network has the highest prediction accuracy (98%), followed by the decision tree (91%). The support vector machine and the k-nearest neighbor method achieved the same accuracy (83%). Finally, the naive Bayes method has the lowest prediction accuracy (76%). Since the number of scores for each student in the existing dataset is small, it is inefficient to use a neural network. For the available data, it is rational to use one of the binary classification methods.</p></sec><sec><title>Binary classification of academic performance</title><p>At the first stage, we need to classify students into those who will pass or fail the exam. Logistic regression is a binary classification method applicable when the dependent variable is dichotomous. Let's assume that passing the exam is the target event (A. Yu. Vladova, 2024). There is a</p><fig id="figure-7" ignoredToc=""><label>Figure 8</label><caption><p>Results of classifying students into those who passed and those who did not pass the exam</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45019" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><table-wrap id="table-3" ignoredToc=""><label>Table 3</label><caption><p>Feature Importance</p></caption><table frame="box" rules="all"><thead><tr><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Number</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Feature</p></th><th colspan="1" rowspan="1" style="" align="center" valign="top"><p>Importance</p></th></tr></thead><tbody><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>Sex</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.502</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>5</p></td><td colspan="1" 
rowspan="1" style="" align="center" valign="top"><p>Test progress</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.475</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>3</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>Programming skills progress</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.254</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>2</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>Campus work progress</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.189</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>1</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>Group</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.169</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>4</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>Activity progress</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.125</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>7</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>Attestation progress</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.054</p></td></tr><tr><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>6</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>Bonus task progress</p></td><td colspan="1" rowspan="1" style="" align="center" valign="top"><p>0.009</p></td></tr></tbody></table></table-wrap><p>labeled dataset - students' scores on various tasks (progress in Campus works, Tests, and Attestations) and their score for the exam. 
The training (80%) and test (20%) datasets are separated from it. The logistic regression model is then trained and evaluated for accuracy as follows: Let X be the vector of input features (students' scores on various tasks), and Y be the binary output (pass/fail the exam). The probability that a student passes the exam is assumed to be:</p><p>P{y=1|x} = f(z), (5)</p><p>where z = θ0 + θ1x1 + … + θnxn, x is the column vector of normalized input feature values, θ is the vector of regression coefficients, and f(z) is the logistic function defined as <inline-formula><tex-math id="math-7"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle f\left( z \right) = \frac{1}{1 + e^{- z}} \end{document} ]]></tex-math></inline-formula>.</p><p>Visualization of classification results uses student indexes, where learning outcomes and predicted test results are indicated in different colors, as shown in <xref ref-type="fig" rid="figure-7">Figure 8</xref>. The output is a trained logistic regression model capable of predicting whether a student will pass an exam based on their input scores. After training the logistic regression model, the importance of features was estimated by the absolute values of their coefficients, as shown in <xref ref-type="table" rid="table-3">Table 3</xref>.</p><p>Features with higher coefficients are more important in predicting the target variable <xref ref-type="bibr" rid="BIBR-9">(Bruce &amp; Bruce, 2017)</xref>. Therefore, the lowest-ranked feature can be excluded from model training. In addition, we checked its statistical significance with Student's t-test (How to Do a T-Test in Python | Built In, n.d.). There is a negative trend, but the effect of project marks on the exam result does not demonstrate statistical significance (t-statistic = -1.99, p &gt; 0.05). 
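The classification stage described in the text — an 80/20 split, a logistic regression fit on normalized scores, a coefficient-based importance ranking, and a t-test significance check — can be sketched as follows. This is a minimal illustration with scikit-learn and SciPy: the score matrix is synthetic, and the column tested for significance is a hypothetical stand-in for the real feature.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 8))  # synthetic normalized scores for 8 features
# synthetic pass/fail label, loosely tied to one feature
y = (X[:, 5] + 0.3 * rng.random(200) > 0.6).astype(int)

# 80/20 train/test split, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))

# feature importance estimated by the absolute value of each coefficient
importance = np.abs(model.coef_[0])
for idx in np.argsort(importance)[::-1]:
    print(idx, round(importance[idx], 3))

# t-test: does the chosen feature differ between the pass and fail groups?
t_stat, p_val = stats.ttest_ind(X[y == 1, 5], X[y == 0, 5])
print(f"t = {t_stat:.2f}, p = {p_val:.3f}")
```

On real data, the score matrix and outcome vector would come from the e-learning platform export rather than a random generator.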
In this case, there is no reason to conclude that project scores have a significant impact on exam passing.</p></sec><sec><title>Academic performance prediction</title><p>The normalized set of input and output features is again broken down into training and test parts. An instance of the linear regression model learns from the training part and makes predictions for the test part. The data is visualized using a scatter plot: <xref ref-type="fig" rid="figure-18ek9q">Figure 9</xref> shows the exam results for several student indexes. To evaluate the performance of the linear regression model, error metrics are calculated on test data: MAE = 0.079, MSE = 0.078, RMSE = 0.088, R squared = 0.89.</p><fig id="figure-18ek9q" ignoredToc=""><label>Figure 9</label><caption><p>Academic performance prediction</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45020" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><fig id="figure-8" ignoredToc=""><label>Figure 10</label><caption><p>Results of classifying students by similarity of scores</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45010" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><p>For the available dataset, the model shows high accuracy because the known test and predicted values are close enough, and the errors do not exceed 9%. The t-statistic of -2.83 and a p-value of 0.01 suggest that there may be a significant difference in the Test progress feature between the groups of males and females being compared <xref ref-type="bibr" rid="BIBR-19">(Olatunde-Aiyedun, 2021)</xref>. 
The low p-value indicates evidence to reject the null hypothesis in favor of a significant difference.</p></sec><sec><title>Clustering students by similarity of scores</title><p>When creating personalized learning plans and identifying successful or unsuccessful learning strategies, it is useful to identify groups of students with similar sets of assessments <xref ref-type="bibr" rid="BIBR-23">(Reiser &amp; Joseph’s College, 2017)</xref>. For this purpose, students are clustered using the k-means method <xref ref-type="bibr" rid="BIBR-33">(Wati et al., 2021)</xref> on the same labeled dataset. Let X be the vector of input features (students' scores for various tasks) and Y be the output feature (cluster number). The initialization of the mass centers of the clusters is random. The algorithm seeks to minimize the total squared deviation of the cluster points from the centers of these clusters:</p><p><inline-formula><tex-math id="math-8"><![CDATA[ \documentclass{article} \usepackage{amsmath} \begin{document} \displaystyle V = \sum_{i = 1}^{k}{\sum_{x \in S_{i}}^{}{(x - \mu_{i})}^{2}} \end{document} ]]></tex-math></inline-formula>, (3)</p><p>where k is the number of clusters, Si are the resulting clusters, i = 1, 2, …, k, and μi are the centers of mass of all vectors x from the cluster Si.</p><p>The steps are repeated until convergence, that is, until the centroids stop changing significantly or until the maximum number of iterations is reached <xref ref-type="bibr" rid="BIBR-7">(Boehmke &amp; Greenwell, 2020)</xref>. 
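A minimal sketch of this clustering step with scikit-learn, assuming a synthetic matrix of normalized scores in place of the real dataset; the cluster count and data shape are illustrative, and the PCA projection yields the two-dimensional coordinates used for plotting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 8))  # synthetic normalized scores

# k-means minimizes the within-cluster sum of squared distances V;
# centroids are initialized randomly and refined until convergence
km = KMeans(n_clusters=17, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster number for each student

# PCA projects the 8-dimensional feature space onto 2 components
# with minimal information loss, giving scatter-plot coordinates
coords = PCA(n_components=2).fit_transform(X)
print(coords.shape)  # (200, 2)
```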
The results of clustering are presented in a two-dimensional plot in <xref ref-type="fig" rid="figure-8">Figure 10</xref> using principal component analysis <xref ref-type="bibr" rid="BIBR-1">(Ahmad et al., 2019)</xref>, which reduces the dimensionality of the data space by converting a large set of features into a smaller one with minimal information loss.</p><p>To find the optimal number of clusters, three characteristics are calculated: inertia, silhouette index, and Davies-Bouldin index. Inertia is computed as the sum of the squared distances from each data point to its nearest cluster centroid. It shows how grouped the points are within all the clusters in <xref ref-type="fig" rid="figure-afvjof">Figure 11</xref>. The lower the inertia, the better the model, because more compact and dense clusters usually imply a clearer structure in the data (Rykov et al., 2024). The silhouette index is computed for each data item in a cluster by measuring how close it is to the rest of its cluster compared to the elements of other clusters. The closer the silhouette index value is to one, the better the clusters are</p><fig id="figure-afvjof" ignoredToc=""><label>Figure 11</label><caption><p>Selection of the optimal number of clusters</p></caption><graphic xlink:href="https://journals2.ums.ac.id/jramathedu/article/download/4643/3825/45011" mimetype="image" mime-subtype="png"><alt-text>Image</alt-text></graphic></fig><p>separated. The Davies-Bouldin index is computed as the average ratio of the intra-cluster distances to the distances between cluster centroids. The smaller the Davies-Bouldin index, the better the clustering. As a result, 17 clusters of students with similar scores are created with the following characteristics: inertia 890.47, silhouette index 0.28, Davies-Bouldin index 0.83. Analysis of the inertia graph and the silhouette and Davies-Bouldin indices showed that the optimal number of clusters varies from 16 to 20. 
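The cluster-count search can be sketched as a loop over candidate values of k, computing all three characteristics with scikit-learn; synthetic data stands in for the real score matrix, and the range of k from 16 to 20 follows the interval discussed in the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.random((200, 8))  # synthetic normalized scores

for k in range(16, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)      # closer to 1 is better
    dbi = davies_bouldin_score(X, km.labels_)  # smaller is better
    # inertia: within-cluster sum of squared distances (lower is better)
    print(k, round(km.inertia_, 2), round(sil, 2), round(dbi, 2))
```

The educational manager would inspect these three curves and pick a single k inside the recommended interval.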
</p></sec></sec><sec><title>FINDINGS</title><p>To identify methods and key factors influencing academic performance, we performed a literature review. To analyze, customize, and predict academic performance based on the data from e-learning platforms, we proposed a multistage methodology. At the first and second stages, it applies statistical methods to form new features and improve the predictive ability of models. Thus, correlation analysis revealed a strong relationship between a number of features. Therefore, dynamic features that take into account the change in academic performance over time were introduced into the feature space. The problems of predicting exam grades, classifying students into those passing and not passing an exam, and clustering students by sets of grades are solved at the third, fourth, and fifth stages, respectively. The results of the classification of exam grades are as follows: the F1 score (the harmonic mean of precision and recall) for the initial data is 82%. The linear regression model demonstrated the following error values: MAE = 0.1, MSE = 0.02, RMSE = 0.1, R2 = 0.7.</p></sec><sec><title>DISCUSSION</title><p>Classifying, clustering, and predicting academic performance can be useful for multiple stakeholders such as teachers, students, and institutions. For teachers, these tools help identify at-risk students, adapt curricula, and design targeted interventions. Students benefit by gaining insights into their performance trends, enabling better planning of study strategies and group work. Institutions can use these models to identify program-wide trends, evaluate curriculum effectiveness, and allocate resources to address systemic issues.</p><p>This article explores the use of machine learning methods to predict student performance, emphasizing several critical steps in the process. To ensure that all features are on a comparable scale, missing data is addressed, and normalization is applied. 
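These preprocessing steps — filling gaps, min-max normalizing to a common scale, and deriving a dynamic feature as the difference between two time points — can be sketched as follows; the column names and toy values are hypothetical, assuming pandas and scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical fragment of the score table with a missing value
df = pd.DataFrame({
    "campus_work_1": [60, 80, None, 70],
    "campus_work_2": [65, 75, 50, 90],
})
df = df.fillna(df.mean())  # address missing data with column means

# min-max normalization brings all features to a comparable [0, 1] scale
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# dynamic feature: change in performance over the semester
scaled["campus_work_progress"] = scaled["campus_work_2"] - scaled["campus_work_1"]
print(scaled.round(2))
```

Negative values of the derived progress column mark students whose score dropped in the second part of the semester.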
Key academic components, such as homework, projects, midterm scores, and test scores, serve as predictors, while new features are created to enhance the models' predictive power. The study employs clustering to analyze student behavior, multiple linear regression for performance prediction, and logistic regression for binary classification, such as pass/fail outcomes. The models are evaluated using metrics like mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) to assess their accuracy. Additionally, Principal Component Analysis (PCA) and Kernel Density Estimation (KDE) are used to visualize the data, providing deeper insights into its structure and distribution. Together, these methods offer a robust framework for understanding and predicting academic performance.</p><p>At the same time, the proposed multi-method has a number of limitations. For example, when performing the first two stages, which include statistical analysis and generation of additional features, it is necessary to make some effort to normalize the data, set a threshold for excluding strongly correlated features, and correctly specify the pairs of same-named features from different time points that are converted into dynamic ones. In addition, when using the methodology in humanities-oriented universities, the Programming skills attribute can be replaced, for example, by attendance.</p><p>At the last stage, the k-means clustering method is used to divide students into groups. This is a simple and straightforward method, which, unlike more modern clustering methods (e.g., DBSCAN <xref ref-type="bibr" rid="BIBR-30">(Vladova, 2024)</xref>, DBCLASD <xref ref-type="bibr" rid="BIBR-26">(Sheikholeslami &amp; Zhang, 1998)</xref>, WaveCluster <xref ref-type="bibr" rid="BIBR-26">(Sheikholeslami &amp; Zhang, 1998)</xref>), involves a separate expert study on the number of clusters. 
This study includes an estimate of the inertia, silhouette index, and Davies-Bouldin index and results in a recommended cluster count interval. Within this interval, the educational manager must select one number – the exact number of clusters. Such a choice requires certain expertise from the decision-maker, but at the same time allows them to take into account the administrative restrictions on the number of groups of students studying in different programs.</p><p>In the subsequent study, it is proposed to exclude from further consideration students at academic risk identified at the first stage <xref ref-type="bibr" rid="BIBR-27">(Shou et al., 2024)</xref>, and also to investigate the impact on academic performance of indications of IP address coincidences when doing homework <xref ref-type="bibr" rid="BIBR-17">(Komosny &amp; Rehman, 2022)</xref>, duration of work, and start and end time of work. In addition, it is necessary to develop dashboards that greatly facilitate model settings and decision-making for managers and teachers of educational institutions.</p></sec><sec><title>CONCLUSION</title><p>The literature review highlights various approaches to predicting student academic performance using machine learning and statistical methods. Researchers emphasize the importance of identifying key factors influencing performance, such as prior academic scores and engagement in e-learning platforms. Various models, including regression analysis and classification techniques, have been utilized, demonstrating mixed success rates, while suggesting the need for further feature selection and data transformation to enhance predictive accuracy. The weaknesses of the approaches include insufficient accuracy of the models, the use of qualitative features, and the influence of experts.</p><p>The study carried out a comprehensive statistical analysis of students' scores for a math discipline. 
This involved assessing data imbalance, identifying gaps, constructing score distribution densities, and exploring the correlation matrix of scores. These analyses provide valuable insights into the distribution and relationships of student scores, laying a strong foundation for further modeling and predictions.</p><p>The study changed the dimensionality of the feature space by normalizing scores to a common scale and creating new features such as student indexes, ranks, and differences between scores at different time points. This expansion of the feature space enhances the richness of the dataset and can potentially lead to more robust and accurate predictive models.</p><p>By developing models to predict students at academic risk and evaluating the statistical significance of the included features, the study addresses the crucial issue of identifying and supporting students who may be at risk of underperforming. This proactive approach can help institutions tailor interventions and support to students who need it most.</p><p>The study's plan to customize consultations for student groups with similar learning trajectories reflects a student-centric approach to enhancing academic performance. By recognizing the diverse needs of student cohorts and tailoring support accordingly, the study aims to foster a more personalized and effective learning environment.</p><p>The approach of predicting exam scores for individual students demonstrates a commitment to providing comprehensive support beyond mere assessment. By leveraging analysis, modeling, and customization of consultations, the study aims to proactively improve students' academic performance levels within university settings.</p><p>The proposed multi-method for analyzing data from electronic platforms shows a picture of student engagement that is close to reality. 
Progress tracking, predictive assessment, and clustering allow educational managers and teachers to assign consultations to groups of students at academic risk and with deteriorating academic performance.</p></sec></body><back><sec sec-type="author-contributions"><title>Author Contributions</title><p>Vladova, A., &amp; Borchyk, K. M. (2024). Predictive analytics of student performance: Multi-method and code. JRAMathEdu (Journal of Research and Advances in Mathematics Education), 9(4), 190-204. https://doi.org/10.23917/jramathedu.v9i4.4643</p></sec><sec><title>Availability of data and materials</title><p>All data are available from https://github.com/avladova/Student-performance-prediction .</p></sec><sec><title>Competing interests</title><p>The authors declare that the publishing of this paper does not involve any conflicts of interest. This work has never been published or offered for publication elsewhere, and it is completely original.</p></sec><sec sec-type="how-to-cite"><title>How to Cite</title><p>Vladova, A., &amp; Borchyk, K. M. (2024). Predictive analytics of student performance: Multi-method and code. JRAMathEdu (Journal of Research and Advances in Mathematics Education), 9(4), 190-204. 
https://doi.org/10.23917/jramathedu.v9i4.4643</p></sec><ref-list><title>References</title><ref id="BIBR-1"><element-citation publication-type="article-journal"><article-title>Principal Component Analysis and Self-Organizing Map Clustering for Student Browsing Behaviour Analysis</article-title><source>Procedia Computer Science</source><volume>163</volume><person-group person-group-type="author"><name><surname>Ahmad</surname><given-names>N.B.</given-names></name><name><surname>Alias</surname><given-names>U.F.</given-names></name><name><surname>Mohamad</surname><given-names>N.</given-names></name><name><surname>Yusof</surname><given-names>N.</given-names></name></person-group><year>2019</year><fpage>550</fpage><lpage>559</lpage><page-range>550-559</page-range><pub-id pub-id-type="doi">10.1016/J.PROCS.2019.12.137</pub-id><ext-link xlink:href="10.1016/J.PROCS.2019.12.137" ext-link-type="doi" xlink:title="Principal Component Analysis and Self-Organizing Map Clustering for Student Browsing Behaviour Analysis">10.1016/J.PROCS.2019.12.137</ext-link></element-citation></ref><ref id="BIBR-2"><element-citation publication-type=""><article-title>A Multiple Linear Regression-Based Approach to Predict Student Performance</article-title><person-group person-group-type="author"><name><surname>Aissaoui</surname><given-names>O.</given-names></name><name><surname>Madani</surname><given-names>Y.</given-names></name><name><surname>Oughdir</surname><given-names>L.</given-names></name><name><surname>Dakkak</surname><given-names>A.</given-names></name><name><surname>ALLIOUI</surname><given-names>E.L.</given-names></name><name name-style="given-only"><given-names>Y.</given-names></name></person-group><year>2020</year><fpage>9</fpage><lpage>23</lpage><page-range>9-23</page-range><pub-id pub-id-type="doi">10.1007/978-3-030-36653-7_2</pub-id><ext-link xlink:href="10.1007/978-3-030-36653-7_2" ext-link-type="doi" xlink:title="A Multiple Linear Regression-Based Approach to Predict Student 
Performance">10.1007/978-3-030-36653-7_2</ext-link></element-citation></ref><ref id="BIBR-3"><element-citation publication-type="article-journal"><article-title>Privacy and e-learning: A pending task</article-title><source>Sustainability (Switzerland)</source><volume>13</volume><issue>16</issue><person-group person-group-type="author"><name><surname>Alier</surname><given-names>M.</given-names></name><name><surname>Casañ Guerrero</surname><given-names>M.J.</given-names></name><name><surname>Amo</surname><given-names>D.</given-names></name><name><surname>Severance</surname><given-names>C.</given-names></name><name><surname>Fonseca</surname><given-names>D.</given-names></name></person-group><year>2021</year><pub-id pub-id-type="doi">10.3390/SU13169206</pub-id><ext-link xlink:href="10.3390/SU13169206" ext-link-type="doi" xlink:title="Privacy and e-learning: A pending task">10.3390/SU13169206</ext-link></element-citation></ref><ref id="BIBR-4"><element-citation publication-type="article-journal"><article-title>Comparison of Logistic Regression and Discriminant Analysis for Classification of Multicollinearity Data</article-title><source>WSEAS TRANSACTIONS ON MATHEMATICS</source><volume>22</volume><person-group person-group-type="author"><name><surname>Araveeporn</surname><given-names>A.</given-names></name></person-group><year>2023</year><fpage>120</fpage><lpage>131</lpage><page-range>120-131</page-range><pub-id pub-id-type="doi">10.37394/23206.2023.22.15</pub-id><ext-link xlink:href="10.37394/23206.2023.22.15" ext-link-type="doi" xlink:title="Comparison of Logistic Regression and Discriminant Analysis for Classification of Multicollinearity Data">10.37394/23206.2023.22.15</ext-link></element-citation></ref><ref id="BIBR-5"><element-citation publication-type="article-journal"><article-title>Forecasting Subscriber Churn: Comparison of Machine Learning Methods</article-title><source>Computer Tools in Education</source><volume>5</volume><person-group 
person-group-type="author"><name><surname>Arzamastsev</surname><given-names>S.A.</given-names></name><name><surname>Bgatov</surname><given-names>M.V.</given-names></name><name><surname>Kartysheva</surname><given-names>E.N.</given-names></name><name><surname>Derkunskii</surname><given-names>V.A.</given-names></name><name><surname>Semenchikov</surname><given-names>D.N.</given-names></name></person-group><year>2018</year><fpage>5</fpage><lpage>23</lpage><page-range>5-23</page-range><ext-link xlink:href="http://cte.eltech.ru/ojs/index.php/kio/article/view/1542" ext-link-type="uri" xlink:title="Forecasting Subscriber Churn: Comparison of Machine Learning Methods">Forecasting Subscriber Churn: Comparison of Machine Learning Methods</ext-link></element-citation></ref><ref id="BIBR-6"><element-citation publication-type="article-journal"><article-title>Profiling students via clustering in a flipped clinical skills course using learning analytics</article-title><source>Medical Teacher</source><volume>45</volume><issue>7</issue><person-group person-group-type="author"><name><surname>Bayazit</surname><given-names>A.</given-names></name><name><surname>Ilgaz</surname><given-names>H.</given-names></name><name><surname>Gönüllü</surname><given-names>İ.</given-names></name><name><surname>Erden</surname><given-names>Ş.</given-names></name></person-group><year>2022</year><fpage>724</fpage><lpage>731</lpage><page-range>724-731</page-range><pub-id pub-id-type="doi">10.1080/0142159x.2022.2152663</pub-id><ext-link xlink:href="10.1080/0142159x.2022.2152663" ext-link-type="doi" xlink:title="Profiling students via clustering in a flipped clinical skills course using learning analytics">10.1080/0142159x.2022.2152663</ext-link></element-citation></ref><ref id="BIBR-7"><element-citation publication-type="chapter"><article-title>Hands-on Machine Learning with R</article-title><source>CRC Press</source><person-group 
person-group-type="author"><name><surname>Boehmke</surname><given-names>B.</given-names></name><name><surname>Greenwell</surname><given-names>B.</given-names></name></person-group><year>2020</year><ext-link xlink:href="https://www.routledge.com/Hands-On-Machine-Learning-with-R/Boehmke-Greenwell/p/book/9781138495685" ext-link-type="uri" xlink:title="Hands-on Machine Learning with R">Hands-on Machine Learning with R</ext-link></element-citation></ref><ref id="BIBR-8"><element-citation publication-type="chapter"><article-title>Machine Learning Algorithms</article-title><source>Packt Publishing</source><person-group person-group-type="author"><name><surname>Bonaccorso</surname><given-names>Giuseppe</given-names></name></person-group><year>2018</year><publisher-name>Packt Publishing Ltd</publisher-name><ext-link xlink:href="https://www.oreilly.com/library/view/machine-learning-algorithms/9781789347999/" ext-link-type="uri" xlink:title="Machine Learning Algorithms">Machine Learning Algorithms</ext-link></element-citation></ref><ref id="BIBR-9"><element-citation publication-type=""><article-title>Practical Statistics for Data Scientists</article-title><person-group person-group-type="author"><name><surname>Bruce</surname><given-names>P.</given-names></name><name><surname>Bruce</surname><given-names>A.</given-names></name></person-group><year>2017</year><ext-link xlink:href="https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/ch04.html" ext-link-type="uri" xlink:title="Practical Statistics for Data Scientists">Practical Statistics for Data Scientists</ext-link></element-citation></ref><ref id="BIBR-10"><element-citation publication-type="paper-conference"><article-title>Comparative Analysis of E-Learning Platforms on The Market</article-title><source>2018 10th International Conference on Electronics, Computers and Artificial Intelligence (ECAI)</source><person-group 
person-group-type="author"><name><surname>Elisabeta</surname><given-names>P.M.</given-names></name><name><surname>Alexandru</surname><given-names>M.R.</given-names></name></person-group><year>2018</year><fpage>1</fpage><lpage>4</lpage><page-range>1-4</page-range><pub-id pub-id-type="doi">10.1109/ECAI.2018.8679004</pub-id><ext-link xlink:href="10.1109/ECAI.2018.8679004" ext-link-type="doi" xlink:title="Comparative Analysis of E-Learning Platforms on The Market">10.1109/ECAI.2018.8679004</ext-link></element-citation></ref><ref id="BIBR-11"><element-citation publication-type="article-journal"><article-title>E-learning recommender system dataset</article-title><source>Data in Brief</source><volume>47</volume><person-group person-group-type="author"><name><surname>Hafsa</surname><given-names>M.</given-names></name><name><surname>Wattebled</surname><given-names>P.</given-names></name><name><surname>Jacques</surname><given-names>J.</given-names></name><name><surname>Jourdan</surname><given-names>L.</given-names></name></person-group><year>2023</year><page-range>108942</page-range><pub-id pub-id-type="doi">10.1016/j.dib.2023.108942</pub-id><ext-link xlink:href="10.1016/j.dib.2023.108942" ext-link-type="doi" xlink:title="E-learning recommender system dataset">10.1016/j.dib.2023.108942</ext-link></element-citation></ref><ref id="BIBR-12"><element-citation publication-type=""><article-title>How to Do a T-Test in Python | Built In</article-title><ext-link xlink:href="https://builtin.com/data-science/t-test-python" ext-link-type="uri" xlink:title="How to Do a T-Test in Python | Built In">How to Do a T-Test in Python | Built In</ext-link></element-citation></ref><ref id="BIBR-13"><element-citation publication-type="paper-conference"><article-title>Robust Kernel Density Estimation with Median-of-Means principle</article-title><source>Proceedings of the 39th International Conference on Machine Learning</source><volume>162</volume><person-group 
person-group-type="author"><name><surname>Humbert</surname><given-names>P.</given-names></name><name><surname>Bars</surname><given-names>B.Le</given-names></name><name><surname>Minvielle</surname><given-names>L.</given-names></name></person-group><year>2022</year><fpage>9444</fpage><lpage>9465</lpage><page-range>9444-9465</page-range><ext-link xlink:href="https://proceedings.mlr.press/v162/humbert22a.html" ext-link-type="uri" xlink:title="Robust Kernel Density Estimation with Median-of-Means principle">Robust Kernel Density Estimation with Median-of-Means principle</ext-link></element-citation></ref><ref id="BIBR-14"><element-citation publication-type="chapter"><article-title>Predicting Students</article-title><source>Academic Performance: Comparing Artificial Neural Network, Decision Tree and Linear Regression. 21st Annual SAS Malaysia Forum</source><person-group person-group-type="author"><name><surname>Ibrahim</surname><given-names>Z.</given-names></name><name><surname>Rusli</surname><given-names>D.</given-names></name></person-group><year>2007</year></element-citation></ref><ref id="BIBR-15"><element-citation publication-type="article-journal"><article-title>Analysis of Changes in the Affective Characteristics and Communicational Skills of Prospective Teachers: Longitudinal Study</article-title><source>International Journal of Progressive Education</source><volume>14</volume><issue>6</issue><person-group person-group-type="author"><name><surname>Kahramanoğlu</surname><given-names>R.</given-names></name></person-group><year>2018</year><fpage>177</fpage><lpage>199</lpage><page-range>177-199</page-range><pub-id pub-id-type="doi">10.29329/IJPE.2018.179.14</pub-id><ext-link xlink:href="10.29329/IJPE.2018.179.14" ext-link-type="doi" xlink:title="Analysis of Changes in the Affective Characteristics and Communicational Skills of Prospective Teachers: Longitudinal Study">10.29329/IJPE.2018.179.14</ext-link></element-citation></ref><ref id="BIBR-16"><element-citation 
publication-type="paper-conference"><article-title>Predicting MOOC Dropout over Weeks Using Machine Learning Methods</article-title><source>Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs</source><person-group person-group-type="author"><name><surname>Kloft</surname><given-names>M.</given-names></name><name><surname>Stiehler</surname><given-names>F.</given-names></name><name><surname>Zheng</surname><given-names>Z.</given-names></name><name><surname>Pinkwart</surname><given-names>N.</given-names></name></person-group><year>2014</year><fpage>60</fpage><lpage>65</lpage><page-range>60-65</page-range><pub-id pub-id-type="doi">10.3115/v1/W14-4111</pub-id><ext-link xlink:href="10.3115/v1/W14-4111" ext-link-type="doi" xlink:title="Predicting MOOC Dropout over Weeks Using Machine Learning Methods">10.3115/v1/W14-4111</ext-link></element-citation></ref><ref id="BIBR-17"><element-citation publication-type="article-journal"><article-title>A Method for Cheating Indication in Unproctored On-Line Exams</article-title><source>Sensors</source><person-group person-group-type="author"><name><surname>Komosny</surname><given-names>D.</given-names></name><name><surname>Rehman</surname><given-names>S.U.</given-names></name></person-group><year>2022</year><publisher-loc>Basel, Switzerland</publisher-loc><pub-id pub-id-type="doi">10.3390/S22020654</pub-id><ext-link xlink:href="10.3390/S22020654" ext-link-type="doi" xlink:title="A Method for Cheating Indication in Unproctored On-Line Exams">10.3390/S22020654</ext-link></element-citation></ref>
<ref id="BIBR-18"><element-citation publication-type="article-journal"><article-title>Towards intelligent E-learning systems</article-title><source>Education and Information Technologies</source><volume>28</volume><issue>7</issue><person-group person-group-type="author"><name><surname>Liu</surname><given-names>M.</given-names></name><name><surname>Yu</surname><given-names>D.</given-names></name></person-group><year>2023</year><fpage>7845</fpage><lpage>7876</lpage><page-range>7845-7876</page-range><pub-id pub-id-type="doi">10.1007/s10639-022-11479-6</pub-id><ext-link xlink:href="10.1007/s10639-022-11479-6" ext-link-type="doi" xlink:title="Towards intelligent E-learning systems">10.1007/s10639-022-11479-6</ext-link></element-citation></ref><ref id="BIBR-19"><element-citation publication-type="article-journal"><article-title>Student Teachers’ Attitude towards Teaching Practice</article-title><source>International Journal of Culture and Modernity</source><volume>8</volume><person-group person-group-type="author"><name><surname>Olatunde-Aiyedun</surname><given-names>T.</given-names></name></person-group><year>2021</year><fpage>6</fpage><lpage>17</lpage><page-range>6-17</page-range><ext-link xlink:href="http://ijcm.academicjournal.io/index.php/ijcm/article/download/59/58" ext-link-type="uri" xlink:title="Student Teachers’ Attitude towards Teaching Practice">Student Teachers’ Attitude towards Teaching Practice</ext-link></element-citation></ref><ref id="BIBR-20"><element-citation publication-type="article-journal"><article-title>E-Learning Performance Evaluation in Medical Education—A Bibliometric and Visualization Analysis</article-title><source>Healthcare</source><volume>11</volume><person-group 
person-group-type="author"><name><surname>Oluwadele</surname><given-names>D.</given-names></name><name><surname>Singh</surname><given-names>Y.</given-names></name><name><surname>Adeliyi</surname><given-names>T.</given-names></name></person-group><year>2023</year><page-range>232</page-range><pub-id pub-id-type="doi">10.3390/healthcare11020232</pub-id><ext-link xlink:href="10.3390/healthcare11020232" ext-link-type="doi" xlink:title="E-Learning Performance Evaluation in Medical Education—A Bibliometric and Visualization Analysis">10.3390/healthcare11020232</ext-link></element-citation></ref><ref id="BIBR-21"><element-citation publication-type=""><article-title>A Comparison Between the Silhouette Index and the Davies-Bouldin Index in Labelling IDS Clusters</article-title><person-group person-group-type="author"><name><surname>Petrovic</surname><given-names>S.V.</given-names></name></person-group><year>2006</year><ext-link xlink:href="https://citeseerx.ist.psu.edu/document?repid=rep1&amp;type=pdf&amp;doi=b2db00f73fc6b97ebe12e97cfdaefbb2fefc253b" ext-link-type="uri" xlink:title="A Comparison Between the Silhouette Index and the Davies-Bouldin Index in Labelling IDS Clusters">A Comparison Between the Silhouette Index and the Davies-Bouldin Index in Labelling IDS Clusters</ext-link></element-citation></ref><ref id="BIBR-22"><element-citation publication-type="article-journal"><article-title>Predicting students’ performance in e-learning using learning process and behaviour data</article-title><source>Scientific Reports</source><volume>12</volume><issue>1</issue><person-group 
person-group-type="author"><name><surname>Qiu</surname><given-names>F.</given-names></name><name><surname>Zhang</surname><given-names>G.</given-names></name><name><surname>Sheng</surname><given-names>X.</given-names></name><name><surname>Jiang</surname><given-names>L.</given-names></name><name><surname>Zhu</surname><given-names>L.</given-names></name><name><surname>Xiang</surname><given-names>Q.</given-names></name><name><surname>Jiang</surname><given-names>B.</given-names></name><name><surname>Chen</surname><given-names>P.</given-names></name></person-group><year>2022</year><page-range>453</page-range><pub-id pub-id-type="doi">10.1038/s41598-021-03867-8</pub-id><ext-link xlink:href="10.1038/s41598-021-03867-8" ext-link-type="doi" xlink:title="Predicting students’ performance in e-learning using learning process and behaviour data">10.1038/s41598-021-03867-8</ext-link></element-citation></ref><ref id="BIBR-23"><element-citation publication-type="article-journal"><article-title>Blending Individual and Group Assessment: A Model for Measuring Student Performance</article-title><source>Journal of the Scholarship of Teaching and Learning</source><volume>17</volume><issue>4</issue><person-group person-group-type="author"><name><surname>Reiser</surname><given-names>E.</given-names></name></person-group><year>2017</year><fpage>83</fpage><lpage>94</lpage><page-range>83-94</page-range><pub-id pub-id-type="doi">10.14434/JOSOTL.V17I4.21938</pub-id><ext-link xlink:href="10.14434/JOSOTL.V17I4.21938" ext-link-type="doi" xlink:title="Blending Individual and Group Assessment: A Model for Measuring Student Performance">10.14434/JOSOTL.V17I4.21938</ext-link></element-citation></ref><ref id="BIBR-24"><element-citation publication-type="article-journal"><article-title>Inertia-Based Indices to Determine the Number of Clusters in K-Means: An Experimental Evaluation</article-title><source>IEEE 
Access</source><volume>12</volume><person-group person-group-type="author"><name><surname>Rykov</surname><given-names>A.</given-names></name><name><surname>Amorim</surname><given-names>R.C.</given-names></name><name><surname>Makarenkov</surname><given-names>V.</given-names></name><name><surname>Mirkin</surname><given-names>B.</given-names></name></person-group><year>2024</year><fpage>11761</fpage><lpage>11773</lpage><page-range>11761-11773</page-range><pub-id pub-id-type="doi">10.1109/ACCESS.2024.3350791</pub-id><ext-link xlink:href="10.1109/ACCESS.2024.3350791" ext-link-type="doi" xlink:title="Inertia-Based Indices to Determine the Number of Clusters in K-Means: An Experimental Evaluation">10.1109/ACCESS.2024.3350791</ext-link></element-citation></ref><ref id="BIBR-25"><element-citation publication-type="article-journal"><article-title>A Review on Predicting Student’s Performance Using Data Mining Techniques</article-title><source>Procedia Computer Science</source><volume>72</volume><person-group person-group-type="author"><name><surname>Shahiri</surname><given-names>A.</given-names></name><name><surname>Husain</surname><given-names>W.</given-names></name><name><surname>Abdul Rashid</surname><given-names>N.</given-names></name></person-group><year>2015</year><fpage>414</fpage><lpage>422</lpage><page-range>414-422</page-range><pub-id pub-id-type="doi">10.1016/j.procs.2015.12.157</pub-id><ext-link xlink:href="10.1016/j.procs.2015.12.157" ext-link-type="doi" xlink:title="A Review on Predicting Student’s Performance Using Data Mining Techniques">10.1016/j.procs.2015.12.157</ext-link></element-citation></ref><ref id="BIBR-26"><element-citation publication-type="paper-conference"><article-title>A Multi-Resolution Clustering Approach for Very Large Spatial Databases</article-title><source>Proceedings of the 24th VLDB Conference</source><person-group 
person-group-type="author"><name><surname>Sheikholeslami</surname><given-names>G.</given-names></name><name><surname>Zhang</surname><given-names>A.</given-names></name></person-group><year>1998</year><ext-link xlink:href="https://www.vldb.org/conf/1998/p428.pdf" ext-link-type="uri" xlink:title="A Multi-Resolution Clustering Approach for Very Large Spatial Databases">A Multi-Resolution Clustering Approach for Very Large Spatial Databases</ext-link></element-citation></ref><ref id="BIBR-27"><element-citation publication-type="article-journal"><article-title>Predicting Student Performance in Online Learning: A Multidimensional Time-Series Data Analysis Approach</article-title><source>Applied Sciences</source><volume>14</volume><issue>6</issue><person-group person-group-type="author"><name><surname>Shou</surname><given-names>Z.</given-names></name><name><surname>Xie</surname><given-names>M.</given-names></name><name><surname>Mo</surname><given-names>J.</given-names></name><name><surname>Zhang</surname><given-names>H.</given-names></name></person-group><year>2024</year><pub-id pub-id-type="doi">10.3390/app14062522</pub-id><ext-link xlink:href="10.3390/app14062522" ext-link-type="doi" xlink:title="Predicting Student Performance in Online Learning: A Multidimensional Time-Series Data Analysis Approach">10.3390/app14062522</ext-link></element-citation></ref><ref id="BIBR-28"><element-citation publication-type="article-journal"><article-title>Comulang: towards a collaborative e-learning system that supports student group modeling</article-title><source>SpringerPlus</source><volume>2</volume><issue>1</issue><person-group person-group-type="author"><name><surname>Troussas</surname><given-names>C.</given-names></name><name><surname>Virvou</surname><given-names>M.</given-names></name><name><surname>Alepis</surname><given-names>E.</given-names></name></person-group><year>2013</year><page-range>387</page-range><pub-id 
pub-id-type="doi">10.1186/2193-1801-2-387</pub-id><ext-link xlink:href="10.1186/2193-1801-2-387" ext-link-type="doi" xlink:title="Comulang: towards a collaborative e-learning system that supports student group modeling">10.1186/2193-1801-2-387</ext-link></element-citation></ref><ref id="BIBR-29"><element-citation publication-type="article-journal"><article-title>Logistic Regression Model for the Academic Performance of First-Year Medical Students in the Biomedical Area</article-title><source>Creative Education</source><volume>07</volume><person-group person-group-type="author"><name><surname>Urrutia-Aguilar</surname><given-names>M.</given-names></name><name><surname>Fuentes-Garcia</surname><given-names>R.</given-names></name><name><surname>Martinez</surname><given-names>D.</given-names></name><name><surname>Beck</surname><given-names>E.</given-names></name><name><surname>Ortiz</surname><given-names>S.</given-names></name><name><surname>Guevara-Guzmán</surname><given-names>R.</given-names></name></person-group><year>2016</year><fpage>2202</fpage><lpage>2211</lpage><page-range>2202-2211</page-range><pub-id pub-id-type="doi">10.4236/ce.2016.715217</pub-id><ext-link xlink:href="10.4236/ce.2016.715217" ext-link-type="doi" xlink:title="Logistic Regression Model for the Academic Performance of First-Year Medical Students in the Biomedical Area">10.4236/ce.2016.715217</ext-link></element-citation></ref><ref id="BIBR-30"><element-citation publication-type="article-journal"><article-title>Developing group and individual performance paths based on e-learning platform data</article-title><source>Large-Scale Systems Control (UBS)</source><volume>111</volume><person-group person-group-type="author"><name><surname>Vladova</surname><given-names>A.Yu</given-names></name></person-group><year>2024</year><fpage>179</fpage><lpage>196</lpage><page-range>179-196</page-range><pub-id pub-id-type="doi">10.25728/ubs.2024.111.7</pub-id><ext-link xlink:href="10.25728/ubs.2024.111.7" 
ext-link-type="doi" xlink:title="Developing group and individual performance paths based on e-learning platform data">10.25728/ubs.2024.111.7</ext-link></element-citation></ref><ref id="BIBR-31"><element-citation publication-type="article-journal"><article-title>Data preprocessing for machine analysis of sales representatives’ key performance indicators</article-title><source>Business Informatics</source><volume>15</volume><issue>3</issue><person-group person-group-type="author"><name><surname>Vladova</surname><given-names>A.</given-names></name><name><surname>Shek</surname><given-names>E.</given-names></name></person-group><year>2021</year><fpage>48</fpage><lpage>59</lpage><page-range>48-59</page-range><pub-id pub-id-type="doi">10.17323/2587-814X.2021.3.48.59</pub-id><ext-link xlink:href="10.17323/2587-814X.2021.3.48.59" ext-link-type="doi" xlink:title="Data preprocessing for machine analysis of sales representatives’ key performance indicators">10.17323/2587-814X.2021.3.48.59</ext-link></element-citation></ref><ref id="BIBR-32"><element-citation publication-type="paper-conference"><article-title>Visualizing Results of Promoting Campaigns</article-title><source>14th International Conference Management of Large-Scale System Development (MLSD)</source><person-group person-group-type="author"><name><surname>Vladova</surname><given-names>A.Yu</given-names></name><name><surname>Vladov</surname><given-names>Yu R.</given-names></name><name><surname>Yakimov</surname><given-names>A.I.</given-names></name></person-group><year>2021</year><fpage>1</fpage><lpage>4</lpage><page-range>1-4</page-range><pub-id pub-id-type="doi">10.1109/MLSD52249.2021.9600205</pub-id><ext-link xlink:href="10.1109/MLSD52249.2021.9600205" ext-link-type="doi" xlink:title="Visualizing Results of Promoting Campaigns">10.1109/MLSD52249.2021.9600205</ext-link></element-citation></ref><ref id="BIBR-33"><element-citation publication-type="article-journal"><article-title>Analysis K-Means Clustering to 
Predicting Student Graduation</article-title><source>Journal of Physics: Conference Series</source><volume>1844</volume><issue>1</issue><person-group person-group-type="author"><name><surname>Wati</surname><given-names>M.</given-names></name><name><surname>Rahmah</surname><given-names>W.H.</given-names></name><name><surname>Novirasari</surname><given-names>N.</given-names></name><name name-style="given-only"><given-names>Haviluddin</given-names></name><name><surname>Budiman</surname><given-names>E.</given-names></name><name name-style="given-only"><given-names>Islamiyah</given-names></name></person-group><year>2021</year><page-range>012028</page-range><pub-id pub-id-type="doi">10.1088/1742-6596/1844/1/012028</pub-id><ext-link xlink:href="10.1088/1742-6596/1844/1/012028" ext-link-type="doi" xlink:title="Analysis K-Means Clustering to Predicting Student Graduation">10.1088/1742-6596/1844/1/012028</ext-link></element-citation></ref><ref id="BIBR-34"><element-citation publication-type="article-journal"><article-title>Kernel density estimation and its application</article-title><source>ITM Web of Conferences</source><volume>23</volume><person-group person-group-type="author"><name><surname>Węglarczyk</surname><given-names>S.</given-names></name></person-group><year>2018</year><page-range>00037</page-range><pub-id pub-id-type="doi">10.1051/ITMCONF/20182300037</pub-id><ext-link xlink:href="10.1051/ITMCONF/20182300037" ext-link-type="doi" xlink:title="Kernel density estimation and its application">10.1051/ITMCONF/20182300037</ext-link></element-citation></ref><ref id="BIBR-35"><element-citation publication-type=""><article-title>Prediction of Student Performance Using Machine Learning Techniques: A Review</article-title><person-group person-group-type="author"><name><surname>Yadav</surname><given-names>N.</given-names></name><name><surname>Deshmukh</surname><given-names>S.</given-names></name></person-group><year>2023</year><fpage>735</fpage><lpage>741</lpage><page-range>735-741</page-range><pub-id 
pub-id-type="doi">10.2991/978-94-6463-136-4_63</pub-id><ext-link xlink:href="10.2991/978-94-6463-136-4_63" ext-link-type="doi" xlink:title="Prediction of Student Performance Using Machine Learning Techniques: A Review">10.2991/978-94-6463-136-4_63</ext-link></element-citation></ref><ref id="BIBR-36"><element-citation publication-type="article-journal"><article-title>Educational data mining: prediction of students’ academic performance using machine learning algorithms</article-title><source>Smart Learning Environments</source><volume>9</volume><issue>1</issue><person-group person-group-type="author"><name><surname>Yağcı</surname><given-names>M.</given-names></name></person-group><year>2022</year><page-range>11</page-range><pub-id pub-id-type="doi">10.1186/s40561-022-00192-z</pub-id><ext-link xlink:href="10.1186/s40561-022-00192-z" ext-link-type="doi" xlink:title="Educational data mining: prediction of students’ academic performance using machine learning algorithms">10.1186/s40561-022-00192-z</ext-link></element-citation></ref><ref id="BIBR-37"><element-citation publication-type="article-journal"><article-title>Predicting Students’ Academic Performance Using Multiple Linear Regression and Principal Component Analysis</article-title><source>Journal of Information Processing</source><volume>26</volume><person-group person-group-type="author"><name><surname>Yang</surname><given-names>S.J.H.</given-names></name><name><surname>Lu</surname><given-names>O.H.T.</given-names></name><name><surname>Huang</surname><given-names>A.Y.Q.</given-names></name><name><surname>Huang</surname><given-names>J.C.H.</given-names></name><name><surname>Ogata</surname><given-names>H.</given-names></name><name><surname>Lin</surname><given-names>A.J.Q.</given-names></name></person-group><year>2018</year><fpage>170</fpage><lpage>176</lpage><page-range>170-176</page-range><pub-id pub-id-type="doi">10.2197/IPSJJIP.26.170</pub-id><ext-link xlink:href="10.2197/IPSJJIP.26.170" ext-link-type="doi" 
xlink:title="Predicting Students’ Academic Performance Using Multiple Linear Regression and Principal Component Analysis">10.2197/IPSJJIP.26.170</ext-link></element-citation></ref><ref id="BIBR-38"><element-citation publication-type="article-journal"><article-title>SA-FEM: Combined Feature Selection and Feature Fusion for Students’ Performance Prediction</article-title><source>Sensors</source><volume>22</volume><issue>22</issue><person-group person-group-type="author"><name><surname>Ye</surname><given-names>M.</given-names></name><name><surname>Sheng</surname><given-names>X.</given-names></name><name><surname>Lu</surname><given-names>Y.</given-names></name><name><surname>Zhang</surname><given-names>G.</given-names></name><name><surname>Chen</surname><given-names>H.</given-names></name><name><surname>Jiang</surname><given-names>B.</given-names></name><name><surname>Zou</surname><given-names>S.</given-names></name><name><surname>Dai</surname><given-names>L.</given-names></name></person-group><year>2022</year><page-range>8838</page-range><pub-id pub-id-type="doi">10.3390/s22228838</pub-id><ext-link xlink:href="10.3390/s22228838" ext-link-type="doi" xlink:title="SA-FEM: Combined Feature Selection and Feature Fusion for Students’ Performance Prediction">10.3390/s22228838</ext-link></element-citation></ref><ref id="BIBR-39"><element-citation publication-type="article-journal"><article-title>Predict Students’ Academic Performance based on their Assessment Grades and Online Activity Data</article-title><source>International Journal of Advanced Computer Science and Applications</source><volume>11</volume><person-group person-group-type="author"><name><surname>Zafar</surname><given-names>B.</given-names></name><name><surname>Alhassan</surname><given-names>A.</given-names></name><name><surname>Mueen</surname><given-names>A.</given-names></name></person-group><year>2020</year><pub-id pub-id-type="doi">10.14569/IJACSA.2020.0110425</pub-id><ext-link 
xlink:href="10.14569/IJACSA.2020.0110425" ext-link-type="doi" xlink:title="Predict Students’ Academic Performance based on their Assessment Grades and Online Activity Data">10.14569/IJACSA.2020.0110425</ext-link></element-citation></ref><ref id="BIBR-40"><element-citation publication-type="article-journal"><article-title>Text search of surnames in some Slavic and other morphologically rich languages using rule based phonetic algorithms</article-title><source>IEEE Transactions on Audio, Speech and Language Processing</source><volume>23</volume><issue>3</issue><person-group person-group-type="author"><name><surname>Zahoranský</surname><given-names>D.</given-names></name><name><surname>Polasek</surname><given-names>I.</given-names></name></person-group><year>2015</year><fpage>553</fpage><lpage>563</lpage><page-range>553-563</page-range><pub-id pub-id-type="doi">10.1109/TASLP.2015.2393393</pub-id><ext-link xlink:href="10.1109/TASLP.2015.2393393" ext-link-type="doi" xlink:title="Text search of surnames in some Slavic and other morphologically rich languages using rule based phonetic algorithms">10.1109/TASLP.2015.2393393</ext-link></element-citation></ref><ref id="BIBR-41"><element-citation publication-type="article-journal"><article-title>Educational Data Mining Techniques for Student Performance Prediction: Method Review and Comparison Analysis</article-title><source>Frontiers in Psychology</source><volume>12</volume><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>Y.</given-names></name><name><surname>Yun</surname><given-names>Y.</given-names></name><name><surname>An</surname><given-names>R.</given-names></name><name><surname>Cui</surname><given-names>J.</given-names></name><name><surname>Dai</surname><given-names>H.</given-names></name><name><surname>Shang</surname><given-names>X.</given-names></name></person-group><year>2021</year><pub-id pub-id-type="doi">10.3389/fpsyg.2021.698490</pub-id><ext-link xlink:href="10.3389/fpsyg.2021.698490" ext-link-type="doi" xlink:title="Educational Data Mining Techniques for Student Performance Prediction: Method Review and Comparison Analysis">10.3389/fpsyg.2021.698490</ext-link></element-citation></ref></ref-list></back></article>
