Predicting Student Dropout Using Machine Learning

Student dropout is a pressing challenge in modern education, affecting individuals and educational institutions alike. The consequences of high dropout rates extend beyond academic achievement into future career prospects and overall well-being. Detecting the problem early, however, can significantly mitigate these consequences. This is where machine learning, a field within artificial intelligence, can help. By learning from historical student data, machine learning models can estimate which students are more likely to drop out. This article looks at how machine learning is applied to forecasting student dropout, covering its advantages, its obstacles, and its potential implications for the education sector.

To comprehend the intricate landscape of student dropout, it is imperative to acknowledge the multitude of factors that contribute to this phenomenon. Student dropout is a multifaceted occurrence influenced by a diverse range of individual, societal, and institutional elements. Academic struggles, disengagement, socio-economic constraints, familial circumstances, and insufficient support mechanisms are among the common catalysts for dropout. By attaining a deep understanding of these underlying factors, educators and policymakers can develop targeted interventions and comprehensive strategies to tackle the dropout challenge head-on. In this context, machine learning emerges as a valuable ally, offering unique insights and innovative approaches to combat student dropout.

Machine learning algorithms can process large, heterogeneous datasets, discern patterns in them, and produce forecasts. For dropout prediction, the models can draw on many kinds of data: demographic details, academic achievements, attendance records, levels of engagement, socio-economic indicators, and other pertinent factors. By analyzing this data, the algorithms can reveal patterns and interdependencies that may elude human analysts. Trained on historical records, they learn to estimate the probability that a student will discontinue their education, using the student's characteristics and circumstances as inputs.

Benefits of Predicting Student Dropout

Using machine learning to predict student attrition offers several advantages for educational institutions, policymakers, and students themselves. Firstly, the early identification of students at risk of dropping out allows for prompt interventions and the establishment of support systems. Educators can deliver personalized assistance, provide supplementary resources, and implement targeted measures to enhance students' likelihood of success. This proactive approach can help reduce attrition rates and improve student retention.

Secondly, the predictive aspect facilitates the efficient allocation of institutional resources. By pinpointing students who are susceptible to dropout, educational institutions can concentrate their efforts and resources on furnishing the necessary support to these individuals. This targeted approach ensures that interventions are directed precisely where they will have the most impact, optimizing resource utilization.

Moreover, the integration of machine learning in dropout prediction fosters the development of evidence-based policies and strategies. By scrutinizing the underlying factors contributing to attrition, policymakers can devise interventions that address the core issues and foster a more supportive and conducive learning environment. This data-informed approach empowers informed decision-making and assists in formulating effective policies aimed at enhancing student outcomes.

Challenges of Predicting Student Dropout Using Machine Learning

While embracing machine learning for predicting student attrition shows immense potential, it is imperative to acknowledge and address the associated challenges and ethical implications. One key challenge revolves around the accessibility and quality of data. Building accurate prediction models necessitates sufficient and reliable data. Educational institutions must establish robust protocols for data collection, storage, and privacy to uphold the confidentiality and integrity of student information.

Another hurdle involves the potential bias within machine learning models. If the training data used to develop the models is biased or incomplete, the predictions may become skewed or unfair. Addressing bias requires a conscientious effort to train models on diverse and representative datasets, fostering reliable and impartial predictions.

Ethical considerations play a pivotal role when deploying machine learning to forecast student attrition. Responsible utilization of predictive models should prioritize student privacy, consent, and transparency. Students should be informed about the purpose of data collection and how it will be employed in predicting attrition. Furthermore, mechanisms must be established to address concerns pertaining to privacy and data protection, ensuring a responsible and accountable approach to safeguarding student rights.

About the Dataset

This dataset provides a comprehensive view of students enrolled in various undergraduate degrees offered at a higher education institution. It includes demographic data, socio-economic factors, and academic performance information that can be used to analyze the possible predictors of student dropout and academic success. The dataset was built from multiple disjoint databases and contains information available at the time of enrollment, such as application mode, marital status, course chosen, and more. Additionally, the data can be used to estimate overall student performance at the end of each semester from the curricular units credited/enrolled/evaluated/approved and their respective grades. Finally, it includes the unemployment rate, inflation rate, and GDP of the region, which can help us understand how economic factors relate to dropout or academic success. The students come from a wide range of disciplines, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.

Columns

Marital status: The marital status of the student. (Categorical)
Application mode: The method of application used by the student. (Categorical)
Application order: The order in which the student applied. (Numerical)
Course: The course taken by the student. (Categorical)
Daytime/evening attendance: Whether the student attends classes during the day or in the evening. (Categorical)
Previous qualification: The qualification obtained by the student before enrolling in higher education. (Categorical)
Nationality: The nationality of the student. (Categorical)
Mother's qualification: The qualification of the student's mother. (Categorical)
Father's qualification: The qualification of the student's father. (Categorical)
Mother's occupation: The occupation of the student's mother. (Categorical)
Father's occupation: The occupation of the student's father. (Categorical)
Displaced: Whether the student is a displaced person. (Categorical)
Educational special needs: Whether the student has any special educational needs. (Categorical)
Debtor: Whether the student is a debtor. (Categorical)
Tuition fees up to date: Whether the student's tuition fees are up to date. (Categorical)
Gender: The gender of the student. (Categorical)
Scholarship holder: Whether the student is a scholarship holder. (Categorical)
Age at enrollment: The age of the student at the time of enrollment. (Numerical)
International: Whether the student is an international student. (Categorical)
Curricular units 1st sem (credited): The number of curricular units credited by the student in the first semester. (Numerical)
Curricular units 1st sem (enrolled): The number of curricular units enrolled by the student in the first semester. (Numerical)
Curricular units 1st sem (evaluations): The number of curricular units evaluated by the student in the first semester. (Numerical)
Curricular units 1st sem (approved): The number of curricular units approved by the student in the first semester. (Numerical)

Now we will implement this in code and look for the model that predicts student dropout with the best accuracy.

Code:

Importing Libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

import seaborn as sns

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score

import pickle 

import warnings
warnings.filterwarnings('ignore')

Reading the Dataset
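
The loading step is not shown in the article, so the following is only a minimal sketch of how the `student` DataFrame could be created with pandas. The helper name `load_students` and the filename `dataset.csv` are hypothetical, and the separator of the real file may differ (in that case pass `sep=` to `pd.read_csv`). A tiny in-memory CSV stands in for the real file so that the sketch runs on its own.

```python
import io

import pandas as pd


def load_students(source):
    """Read student records from a CSV path or a file-like object."""
    return pd.read_csv(source)


# Tiny in-memory stand-in so the sketch runs without the real file.
# With the real data the call might look like load_students("dataset.csv"),
# where "dataset.csv" is a hypothetical filename.
sample_csv = io.StringIO(
    "Marital status,Age at enrollment,Target\n"
    "1,20,Dropout\n"
    "1,19,Graduate\n"
    "2,34,Enrolled\n"
)

student = load_students(sample_csv)
print(student.shape)
```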

Understanding the Dataset

# check the shape of the dataset in student DataFrame
student.shape

Output:

# See which are the 35 columns
student.columns

Output:

# How the data looks
student.sample(4)

Output:

# Check info about all the columns 
student.info()

Output:

Looks like there are no nulls or duplicates, but still, we can check and handle them if required.
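
The check itself is not shown, so here is one way it might be done. This sketch uses a small toy DataFrame in place of `student` (an assumption, so that the example is self-contained), with one missing value and one duplicated row planted on purpose.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `student`, with one missing value and one duplicated row.
student = pd.DataFrame({
    "Age at enrollment": [20, 19, np.nan, 19],
    "Debtor": [0, 1, 0, 1],
    "Target": ["Dropout", "Graduate", "Enrolled", "Graduate"],
})

null_counts = student.isnull().sum()        # missing values per column
n_duplicates = student.duplicated().sum()   # rows identical to an earlier row
print(null_counts)
print("Duplicate rows:", n_duplicates)

# Drop incomplete and repeated rows only if the checks above find any.
cleaned = student.dropna().drop_duplicates().reset_index(drop=True)
print(cleaned.shape)
```

On the real dataset both counts are expected to be zero, in which case no cleaning is needed.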

Output:

Only the Target column is non-numeric, which we can convert to numeric. The target column is an output column, so we need it in numeric form so that we can find its correlation with others.
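
A possible way to confirm which columns are non-numeric and to list the distinct labels in the output column is sketched below. The toy DataFrame is an assumption standing in for `student`.

```python
import pandas as pd

# Toy stand-in for `student`.
student = pd.DataFrame({
    "Age at enrollment": [20, 19, 34, 22],
    "Target": ["Dropout", "Graduate", "Enrolled", "Graduate"],
})

# Names of the columns whose dtype is not numeric
non_numeric = student.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Distinct labels in the output column
labels = student["Target"].unique()
print(labels)
```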

Output:

So there are 3 unique values in the Target column, which we can replace as follows:

Dropout -> 0
Enrolled -> 1
Graduate -> 2

student['Target'] = student['Target'].map({
    'Dropout':0,
    'Enrolled':1,
    'Graduate':2
})

# Check the Target column. It should now contain only 0, 1 & 2
student

Output:

student.dtypes
# Target column is an integer now

Output:

# Learn the data mathematically
student.describe()

Output:

Finally, find the correlation of Target with all other numeric columns.
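
The correlation call is not shown, so this is a sketch of how it might look once Target is numeric. The toy DataFrame and its values are assumptions made only so that the example is runnable.

```python
import pandas as pd

# Toy stand-in for `student` after Target has been mapped to 0/1/2.
student = pd.DataFrame({
    "Curricular units 1st sem (approved)": [0, 6, 5, 1, 7, 4],
    "Debtor": [1, 0, 0, 1, 0, 0],
    "Target": [0, 2, 1, 0, 2, 1],
})

# Pearson correlation of every column with Target, strongest positive first
target_corr = student.corr()["Target"].sort_values(ascending=False)
print(target_corr)
```

Note that `DataFrame.corr()` expects numeric columns, which is why the Target column is mapped to integers first.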

Output:

# Heatmap of the pairwise correlations between all columns
fig = px.imshow(student.corr())
fig.show()

Output:

We now build a new DataFrame that keeps only the relevant input columns and the output column.

# This is the new Df considering relevant input and output columns
student_df = student.iloc[:,[1,11,13,14,15,16,17,20,22,23,26,28,29,34]]
student_df.head()

Output:

EDA (Exploratory Data Analysis)

In our exploration of the Student Dropout dataset, we will engage in a process called Exploratory Data Analysis (EDA). Think of it as our way of investigating and getting to know the data better. It's like peeling back the layers of an onion to reveal its true nature. By using different tools and techniques, we will examine the dataset closely, looking for interesting patterns and insights. EDA helps us understand the factors behind student dropout and enables us to make informed decisions to address this issue.

# How many dropouts, enrolled & graduates are there in Target column
student_df['Target'].value_counts()

Output:

# Plot the above values
counts = student_df['Target'].value_counts()

df = pd.DataFrame({
    # Map the numeric codes back to names so each label matches its own count
    'Target': counts.index.map({0: 'Dropout', 1: 'Enrolled', 2: 'Graduate'}),
    'Count_T': counts.values
})

fig = px.pie(df,
             names='Target',
             values='Count_T',
             title='How many dropouts, enrolled & graduates are there in Target column')

fig.update_traces(hole=0.4, textinfo='value+label', pull=[0, 0.2, 0.1])
fig.show()

Output:

# Now see the correlation of Target with the rest
student_df.corr()['Target']

Output:

fig = px.scatter(student_df, 
             x = 'Curricular units 1st sem (approved)',
             y = 'Curricular units 2nd sem (approved)',
             color = 'Target')
fig.show()

Output:

Let's plot the column Curricular units 1st sem (grade) against Curricular units 2nd sem (grade) and differentiate Target by color.

fig = px.scatter(student_df, 
             x = 'Curricular units 1st sem (grade)',
             y = 'Curricular units 2nd sem (grade)',
             color = 'Target')
fig.show()

Output:

fig = px.box(student_df, y='Age at enrollment')
fig.show()

Output:

# Distribution of age of students at the time of enrollment
sns.histplot(data=student_df['Age at enrollment'], kde=True)

Output:

# Let's try a plotly histogram for interactive figure
px.histogram(student_df, x='Age at enrollment', color_discrete_sequence=['red'])

Output:

Extract Input & Output Columns

X = student_df.iloc[:,0:13]
y = student_df.iloc[:,-1]
X

Output:

Splitting the Dataset into Training and Testing Sets

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output:
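
The split above is random and unstratified, so the scores reported later will vary from run to run. As a hedged alternative, the sketch below fixes `random_state` and passes `stratify=y` so that the three classes keep the same proportions in both sets. The synthetic `X` and `y` are assumptions standing in for the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 students, 3 features, imbalanced 3-class target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series([0] * 30 + [1] * 20 + [2] * 50, name="Target")

# stratify=y keeps the class proportions equal in train and test;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```

Using this variant on the real data would change the exact accuracy figures quoted in this article, since those come from an unseeded split.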

Modeling

Modeling is a crucial step in the predictive analytics process. It involves training and testing various machine learning models to determine their accuracy and performance in predicting student dropout. During this stage, different algorithms are applied to the dataset, each with its own strengths and weaknesses.

Here, we will train various models and then look for their accuracy.

Logistic Regression

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

# Without Scaling 
clf.fit(X_train,y_train) 
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:
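
All runs in this article are labeled "Without Scaling", and `StandardScaler` is imported but never used. For comparison, a scaled variant might be sketched as below, assuming a scikit-learn pipeline so that the scaler is fitted only on the training folds during cross-validation. The synthetic data from `make_classification` is a stand-in for the student features, so the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the student features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
# on the training part only and no test information leaks in.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_train, y_train)

test_acc = scaled_clf.score(X_test, y_test)
cv_scores = cross_val_score(scaled_clf, X_train, y_train, cv=10)
print("With Scaling and without CV: ", test_acc)
print("With Scaling and With CV: ", cv_scores.mean())
```

Scaling mostly matters for distance-based and gradient-based models (logistic regression, SGD, SVM, KNN); tree-based models such as Random Forest are largely unaffected by it.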

Stochastic Gradient Descent Classifier

from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(max_iter=1000, tol=1e-3)

# Without Scaling
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

Perceptron

from sklearn.linear_model import Perceptron
# this is same as SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None)

clf = Perceptron(tol=1e-3, random_state=0)
# Without Scaling
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

Logistic Regression CV

from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(cv=5, random_state=0)

# Without Scaling
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

Decision Tree Classifier

# Using DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)

#without scaling
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=10, random_state=0)

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

Support Vector Machine

from sklearn.svm import SVC
#clf = SVC(gamma='auto')

svc = SVC()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = GridSearchCV(svc, parameters)

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

NuSVC

from sklearn.svm import NuSVC
clf = NuSVC()

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

Linear SVC

from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0, tol=1e-5)

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

Naive Bayes

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

#y_pred = gnb.fit(X_train, y_train).predict(X_test)
#print("Number of mislabeled points out of a total %d points : %d" % (X_test.shape[0], (y_test != y_pred).sum()))

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

from sklearn.naive_bayes import CategoricalNB
clf = CategoricalNB()

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))

Output:

K-Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without Scaling and without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Without Scaling and With CV: ",scores.mean())

Output:

After evaluating and comparing the models above, Random Forest emerged as the top performer, with an accuracy of 76.94% on the test set and 77.08% with cross-validation. A Random Forest is an ensemble of decision trees and tends to capture non-linear relationships between variables well, which may explain why it performs best here.
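
The repeated fit-and-score blocks above can also be written as a single loop that collects the results in one table, which makes the comparison easier to read. This is only a sketch under assumptions: it uses synthetic data and a subset of four of the models, so its numbers are not the article's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the student data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(max_depth=10, random_state=0),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        "model": name,
        # accuracy on the held-out test set
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
        # mean accuracy over 10 folds of the training set
        "cv_accuracy": cross_val_score(model, X_train, y_train, cv=10).mean(),
    })

# One row per model, best cross-validated accuracy first
results = (
    pd.DataFrame(rows)
    .sort_values("cv_accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```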

Model Selection

Select the model that gives the maximum accuracy. Here that is Random Forest, with an accuracy of 76.94% on the test set and 77.08% with cross-validation.

clf = RandomForestClassifier(max_depth=10, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)

print("With CV: ",scores.mean())
print("Precision Score: ", precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ", recall_score(y_test, y_pred,average='macro'))
print("F1 Score: ", f1_score(y_test, y_pred,average='macro'))

Output:

We will use GridSearchCV for hyperparameter tuning in a Random Forest classifier model.

param_grid = {
    'bootstrap': [False,True],
    'max_depth': [5,8,10, 20],
    'max_features': [3, 4, 5, None],
    'min_samples_split': [2, 10, 12],
    'n_estimators': [100, 200, 300]
}

rfc = RandomForestClassifier()

clf = GridSearchCV(estimator = rfc, param_grid = param_grid, cv = 5, n_jobs = -1, verbose = 1)

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy: ",accuracy_score(y_test,y_pred))
print(clf.best_params_)
print(clf.best_estimator_)

Output:

Here, the accuracy of the model on the test set has improved. We now refit the Random Forest with the best parameters reported by the grid search.
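
Besides `best_params_` and `best_estimator_`, a fitted `GridSearchCV` also exposes `best_score_` (the mean cross-validated score of the best combination) and `cv_results_` (scores for every combination). The sketch below shows how these might be inspected. The deliberately small grid and the synthetic data are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class stand-in for the student data.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Small illustrative grid: 2 x 2 = 4 combinations
param_grid = {"max_depth": [3, 5], "n_estimators": [20, 40]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print("Best mean CV accuracy:", search.best_score_)

# One row per parameter combination, best rank first
cv_table = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

Comparing `best_score_` with the test accuracy is a useful sanity check: a test score far above the cross-validated one may simply reflect a favorable split.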

clf = RandomForestClassifier(bootstrap=False, max_depth=10,max_features=3,
                             min_samples_split=12,
                             n_estimators=100, random_state=0)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Without CV: ",accuracy_score(y_test,y_pred))
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("With CV: ",scores.mean())

print("Precision Score: ", precision_score(y_test, y_pred,average='micro'))
print("Recall Score: ", recall_score(y_test, y_pred,average='micro'))
print("F1 Score: ", f1_score(y_test, y_pred,average='micro'))

Output:

The accuracy of the model on the test dataset (without cross-validation) is approximately 0.7898, or 78.98%. This means the model predicted the correct class for about 79% of the students in the test dataset.
The cross-validated accuracy of the model on the training dataset is approximately 0.7655, or 76.55%. This score is the average accuracy across the 10 folds of cross-validation.
The precision, recall, and F1 scores are all approximately 0.7898, the same value as the accuracy. This is expected: with average='micro' on a single-label multi-class problem, every misclassified student counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro-averaged precision, recall, and F1 all equal the accuracy.
These micro-averaged values therefore do not tell us how well the model identifies dropouts specifically. To measure that, compute precision and recall for the Dropout class alone (for example with average=None) or use average='macro', as was done for the untuned model above.
The F1 score is the harmonic mean of precision and recall. It provides a single metric that takes into account both false positives and false negatives.
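
The relationship between micro-averaged scores and accuracy can be checked on a small hand-made example. In the sketch below, the label vectors are invented for illustration (0 = Dropout, 1 = Enrolled, 2 = Graduate); `average=None` returns one score per class, which is what is needed to judge the Dropout class on its own.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels: 0 = Dropout, 1 = Enrolled, 2 = Graduate
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_hat = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

acc = accuracy_score(y_true, y_hat)
micro_p = precision_score(y_true, y_hat, average="micro")
micro_r = recall_score(y_true, y_hat, average="micro")
micro_f = f1_score(y_true, y_hat, average="micro")
print(acc, micro_p, micro_r, micro_f)  # four identical values

# Per-class scores: index 0 is the Dropout class
per_class_precision = precision_score(y_true, y_hat, average=None)
per_class_recall = recall_score(y_true, y_hat, average=None)
print("Dropout precision:", per_class_precision[0])
print("Dropout recall:", per_class_recall[0])
```

Here the Dropout precision and recall differ from the overall accuracy, which is exactly the information the micro average hides.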

Considering all of the above, the Random Forest Classifier can be used as the model for predicting student dropout.
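
The `pickle` module is imported at the top of the code but not used afterwards. One plausible use is saving the chosen model so that it can be reloaded later without retraining. The sketch below assumes that intent; the filename `dropout_rf.pkl` is hypothetical and the model is trained on synthetic data.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class stand-in for the student data.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)
clf = RandomForestClassifier(max_depth=10, n_estimators=20, random_state=0)
clf.fit(X, y)

# Hypothetical location for the saved model
model_path = os.path.join(tempfile.gettempdir(), "dropout_rf.pkl")

with open(model_path, "wb") as f:
    pickle.dump(clf, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = bool((restored.predict(X) == clf.predict(X)).all())
print("Predictions match:", same)
```

A pickled model should only be loaded from a trusted source and with a compatible scikit-learn version, since unpickling can execute arbitrary code.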

Conclusion

In conclusion, the application of machine learning in predicting student attrition presents a transformative opportunity for educational institutions to tackle this widespread issue effectively. By leveraging the capabilities of machine learning algorithms, educators, policymakers, and institutions can take proactive measures, provide targeted support, and foster an environment conducive to student success. Nevertheless, it is imperative to navigate challenges pertaining to data quality, bias, and ethical considerations to ensure the conscientious and equitable use of these predictive models. As machine learning and data analysis continue to advance, we hold the potential to make substantial strides in reducing student attrition, enhancing educational outcomes, and cultivating an inclusive and supportive education system that caters to the needs of all students.
