Credit Card Approval Using Machine Learning

Credit scorecards are widely used in the financial industry as a risk control measure. These cards utilize personal information and data provided by credit card applicants to assess the likelihood of potential defaults and credit card debts in the future. Based on this evaluation, the bank can make informed decisions regarding whether to approve the credit card application. Credit scores provide an objective way to measure and quantify the level of risk involved.

Credit card approval is a crucial process in the banking industry. Traditionally, banks rely on manual evaluation of creditworthiness, which can be time-consuming and prone to errors. However, with the advent of Machine Learning (ML) algorithms, the credit card approval process has been significantly streamlined.

Machine Learning algorithms have the ability to analyze large volumes of data and extract patterns, making them invaluable in credit card approval. By training ML models on historical data that includes information about applicants, their financial behavior, and credit history, banks can predict creditworthiness more accurately and efficiently.

Benefits of Credit Card Approval Using Machine Learning

Enhanced Accuracy: Machine learning algorithms have the ability to analyze vast amounts of data and identify patterns that may not be apparent to human analysts. By incorporating various data points, including credit history, income, employment, and spending patterns, machine learning models can make more accurate predictions regarding an individual's creditworthiness. This leads to better-informed credit card approval decisions, reducing the risk of defaults and improving overall portfolio performance.
Faster Processing: Traditional credit card approval processes can be time-consuming, involving manual reviews, paperwork, and extensive documentation. Machine learning streamlines this process by automating many of the tasks. By leveraging algorithms and predictive models, financial institutions can expedite credit card approvals, providing customers with faster access to credit facilities.
Personalized Offerings: Machine learning enables lenders to personalize credit card offerings based on individual profiles and preferences. By analyzing customer data and behavior, machine learning algorithms can identify specific needs, spending patterns, and risk profiles. This allows lenders to tailor credit card features, such as interest rates, credit limits, rewards programs, and promotional offers, to match the unique requirements of each customer.
Risk Mitigation: The use of machine learning algorithms in credit card approval helps mitigate risks associated with lending. By accurately assessing creditworthiness and identifying high-risk applicants, financial institutions can make informed decisions on interest rates, credit limits, and terms of repayment. This not only protects lenders from potential losses but also ensures responsible lending practices and safeguards the financial well-being of customers.

Challenges of Credit Card Approval Using Machine Learning

Data Privacy and Security: The use of machine learning in credit card approval requires access to vast amounts of sensitive customer data. It is crucial for financial institutions to implement robust data privacy and security measures to protect this information from unauthorized access or misuse. Strict compliance with data protection regulations and encryption techniques is essential to ensure the confidentiality and integrity of customer data.
Model Interpretability and Transparency: Machine learning algorithms can be complex, making it challenging to interpret and explain the decisions they make. This lack of interpretability can pose challenges in terms of regulatory compliance and consumer trust. Efforts must be made to develop transparent models that provide clear explanations for credit card approval decisions, ensuring fairness and accountability.
Bias and Fairness: Machine learning algorithms are susceptible to bias as they learn from historical data that may contain inherent biases. This can lead to discriminatory practices in credit card approval, impacting certain demographic groups unfairly. It is important to continuously monitor and evaluate machine learning models to ensure fairness and mitigate any bias that may arise.

For a better understanding, we will implement this in code and try to determine whether an applicant is a 'good' or 'bad' client.

Data Definition

There are two .csv files:

1. application_record.csv:

ID: A unique identifier for each client.
CODE_GENDER: Gender of the client.
FLAG_OWN_CAR: Indicates whether the client owns a car.
FLAG_OWN_REALTY: Indicates whether the client owns any property.
CNT_CHILDREN: Number of children the client has.
AMT_INCOME_TOTAL: Annual income of the client.
NAME_INCOME_TYPE: Category of the client's income.
NAME_EDUCATION_TYPE: Education level of the client.
NAME_FAMILY_STATUS: Marital status of the client.
NAME_HOUSING_TYPE: Way of living for the client.
DAYS_BIRTH: Birthday of the client, represented as the count of days backward from the current day. (0 indicates the current day, and -1 indicates yesterday)
DAYS_EMPLOYED: Start date of employment, represented as the count of days backward from the current day. If the value is positive, it means the person is currently unemployed.
FLAG_MOBIL: Indicates whether the client has a mobile phone.
FLAG_WORK_PHONE: Indicates whether the client has a work phone.
FLAG_PHONE: Indicates whether the client has a personal phone.
FLAG_EMAIL: Indicates whether the client has an email.
OCCUPATION_TYPE: Occupation of the client.
CNT_FAM_MEMBERS: The family size of the client.
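The two day-count fields are easier to read once converted to years. The following is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the dataset; only the column names come from the definition above), following the sign convention just described.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from application_record.csv
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "DAYS_BIRTH": [-12000, -20000, -9000],
    "DAYS_EMPLOYED": [-1500, 365243, -200],
})

# Negative counts run backward from today, so negate before converting to years
sample["age_years"] = (-sample["DAYS_BIRTH"]) // 365

# A positive DAYS_EMPLOYED marks a client who is currently unemployed
sample["unemployed"] = sample["DAYS_EMPLOYED"] > 0

# Years in the current job; left missing for unemployed clients
sample["years_employed"] = (-sample["DAYS_EMPLOYED"] // 365).where(~sample["unemployed"])

print(sample)
```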

2. credit_record.csv:

ID: A unique identifier for each client.
MONTHS_BALANCE: The record month, represented as a count backward from the current month. (0 indicates the current month, -1 indicates the previous month, and so on)
STATUS: The status of the client's credit for a particular month. The values range from 0 to 5, where 0 represents 1-29 days past due, 1 represents 30-59 days past due, 2 represents 60-89 days overdue, 3 represents 90-119 days overdue, 4 represents 120-149 days overdue, 5 represents overdue or bad debts for more than 150 days, C represents paid off that month, and X indicates no loan for the month.
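To make the STATUS codes concrete, the sketch below stores their meanings in a dictionary and flags the codes that correspond to 60 or more days overdue. The helper name is_seriously_overdue and the sample history are assumptions made for illustration.

```python
# Meaning of each STATUS code, as listed in the data definition above
STATUS_MEANING = {
    "0": "1-29 days past due",
    "1": "30-59 days past due",
    "2": "60-89 days overdue",
    "3": "90-119 days overdue",
    "4": "120-149 days overdue",
    "5": "overdue or bad debt for more than 150 days",
    "C": "paid off that month",
    "X": "no loan for the month",
}


def is_seriously_overdue(status: str) -> bool:
    """Hypothetical helper: True when the code means 60 or more days overdue."""
    return status in {"2", "3", "4", "5"}


# Made-up monthly record for one client
history = ["X", "C", "0", "1", "2", "C"]
flags = [is_seriously_overdue(s) for s in history]

for code, flag in zip(history, flags):
    print(code, STATUS_MEANING[code], "-> seriously overdue" if flag else "")
```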

Code:

Importing Libraries

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd   
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
import itertools

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

Reading the Dataset

data = pd.read_csv("application_record.csv", encoding = 'utf-8') 
record = pd.read_csv("credit_record.csv", encoding = 'utf-8')  

Feature Engineering

Here we will aim to extract the most relevant information from the available data and represent it in a way that the machine learning algorithm can effectively learn from it.

begin_month=pd.DataFrame(record.groupby(["ID"])["MONTHS_BALANCE"].agg(min))
begin_month=begin_month.rename(columns={'MONTHS_BALANCE':'begin_month'}) 
new_data=pd.merge(data,begin_month,how="left",on="ID") #merge to record data

Here, we will combine the information from two DataFrames, data and begin_month, based on the 'ID' column. It adds a new column, 'begin_month', to the data DataFrame, indicating the minimum value of 'MONTHS_BALANCE' for each unique 'ID' from the record DataFrame.
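The effect of the left merge can be seen on a miniature example. The two small frames below are made up; the point is that every applicant is kept, and applicants without any credit history receive a missing begin_month.

```python
import pandas as pd

# Made-up miniature versions of the two tables
data_demo = pd.DataFrame({"ID": [1, 2, 3],
                          "AMT_INCOME_TOTAL": [90000, 120000, 60000]})
record_demo = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -5],
})

# Earliest month on record per client (the most negative MONTHS_BALANCE)
begin_demo = (record_demo.groupby("ID")["MONTHS_BALANCE"].min()
              .rename("begin_month").reset_index())

# Left merge keeps every applicant; client 3 has no history, so it gets NaN
merged_demo = pd.merge(data_demo, begin_demo, how="left", on="ID")
print(merged_demo)
```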

Target Variable

Typically, the target risk users are expected to account for approximately 3% of all users. In this case, we identify users whose payments are overdue by 60 days or more (STATUS 2 to 5) as the target risk users. These samples are labeled '1', while the remaining samples are labeled '0'.

Now we will create the target variable.

# Creating a new column 'dep_value' in the record dataframe.
record['dep_value'] = None
# Mark months that are 60+ days overdue (STATUS 2-5); .loc avoids chained assignment
record.loc[record['STATUS'].isin(['2', '3', '4', '5']), 'dep_value'] = 'Yes'

# count() skips None, so dep_value becomes the number of overdue months per client
cpunt=record.groupby('ID').count()
cpunt['dep_value'] = np.where(cpunt['dep_value'] > 0, 'Yes', 'No')
cpunt = cpunt[['dep_value']]
new_data=pd.merge(new_data,cpunt,how='inner',on='ID')
# Encode the target: 1 for risky clients, 0 for the rest
new_data['target']=new_data['dep_value'].map({'Yes': 1, 'No': 0})

print(cpunt['dep_value'].value_counts())
cpunt['dep_value'].value_counts(normalize=True)

"No" appears 45,318 times which accounts for approximately 98.55% of the total values.

"Yes" appears 667 times which accounts for approximately 1.45% of the total values.
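The labelling rule and the resulting class shares can be reproduced on a toy example. The records below are made up; the last line recomputes the share of "Yes" clients from the two counts reported above.

```python
import pandas as pd

# Made-up monthly records for three clients
rec = pd.DataFrame({
    "ID":     [1, 1, 2, 2, 3, 3],
    "STATUS": ["0", "C", "1", "3", "X", "C"],
})

# A month counts as "bad" when the status is 2, 3, 4 or 5 (60+ days overdue)
rec["bad_month"] = rec["STATUS"].isin(["2", "3", "4", "5"])

# A client is labelled 1 if any month was bad, otherwise 0
target_demo = rec.groupby("ID")["bad_month"].any().astype(int)

# Share of each class, analogous to value_counts(normalize=True) above
share = target_demo.value_counts(normalize=True)
print(share)

# Share of risky clients implied by the counts reported above
reported_bad_share = 667 / (45318 + 667)
print(round(reported_bad_share, 4))
```

With so few positive cases, plain accuracy on the raw data would look high even for a model that never predicts a risky client, which is why the classes are balanced before modeling.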

Features

We will now proceed with the exploratory data analysis of the features, where we will examine, analyze and do various operations on the features.

# Renaming the columns
new_data.rename(columns={'CODE_GENDER':'Gender','FLAG_OWN_CAR':'Car','FLAG_OWN_REALTY':'Reality',
                         'CNT_CHILDREN':'ChldNo','AMT_INCOME_TOTAL':'inc',
                         'NAME_EDUCATION_TYPE':'edutp','NAME_FAMILY_STATUS':'famtp',
                        'NAME_HOUSING_TYPE':'houtp','FLAG_EMAIL':'email',
                         'NAME_INCOME_TYPE':'inctp','FLAG_WORK_PHONE':'wkphone',
                         'FLAG_PHONE':'phone','CNT_FAM_MEMBERS':'famsize',
                        'OCCUPATION_TYPE':'occyp'
                        },inplace=True)

# Dropping missing values (NaN as well as the literal string 'NULL')
new_data = new_data.mask(new_data == 'NULL').dropna()

ivtable=pd.DataFrame(new_data.columns,columns=['variable'])
ivtable['IV']=None
namelist = ['FLAG_MOBIL','begin_month','dep_value','target','ID']

for i in namelist:
    ivtable.drop(ivtable[ivtable['variable'] == i].index, inplace=True)

The ivtable DataFrame now lists the remaining columns of new_data, excluding the ones specified in namelist, together with an empty IV column that is filled in below.

Defining calc_iv function to calculate Information Value and WOE Value

# Calculate information value
def calc_iv(df, feature, target, pr=False):
    lst = []
    df[feature] = df[feature].fillna("NULL")

    for i in range(df[feature].nunique()):
        val = list(df[feature].unique())[i]
        lst.append([feature,                                                        # Variable
                    val,                                                            # Value
                    df[df[feature] == val].count()[feature],                        # All
                    df[(df[feature] == val) & (df[target] == 0)].count()[feature],  # Good (think: Fraud == 0)
                    df[(df[feature] == val) & (df[target] == 1)].count()[feature]]) # Bad (think: Fraud == 1)

    data = pd.DataFrame(lst, columns=['Variable', 'Value', 'All', 'Good', 'Bad'])
    data['Share'] = data['All'] / data['All'].sum()
    data['Bad Rate'] = data['Bad'] / data['All']
    data['Distribution Good'] = (data['All'] - data['Bad']) / (data['All'].sum() - data['Bad'].sum())
    data['Distribution Bad'] = data['Bad'] / data['Bad'].sum()
    data['WoE'] = np.log(data['Distribution Good'] / data['Distribution Bad'])
    
    data = data.replace({'WoE': {np.inf: 0, -np.inf: 0}})

    data['IV'] = data['WoE'] * (data['Distribution Good'] - data['Distribution Bad'])

    data = data.sort_values(by=['Variable', 'Value'], ascending=[True, True])
    data.index = range(len(data.index))

    if pr:
        print(data)
        print('IV = ', data['IV'].sum())

    iv = data['IV'].sum()
    print('This variable\'s IV is:',iv)
    print(df[feature].value_counts())
    return iv, data

def convert_dummy(df, feature,rank=0):
    pos = pd.get_dummies(df[feature], prefix=feature)
    mode = df[feature].value_counts().index[rank]
    biggest = feature + '_' + str(mode)
    pos.drop([biggest],axis=1,inplace=True)
    df.drop([feature],axis=1,inplace=True)
    df=df.join(pos)
    return df

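The convert_dummy function converts a categorical feature into dummy (one-hot) variables in a DataFrame. It drops the indicator of the most frequent level (or of the level at position rank in the frequency ordering), so that level serves as the baseline.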
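A standalone sketch of the same idea on a made-up column is shown here; it does not call convert_dummy itself but repeats its steps with pd.get_dummies so that the intermediate results are visible.

```python
import pandas as pd

# Made-up housing types for five clients
demo = pd.DataFrame({"houtp": ["House", "House", "Rented", "With parents", "House"]})

# One indicator column per level of the feature
dummies = pd.get_dummies(demo["houtp"], prefix="houtp")

# Drop the most frequent level so it serves as the baseline
# (this mirrors rank=0 in convert_dummy)
baseline = "houtp_" + str(demo["houtp"].value_counts().index[0])
dummies = dummies.drop(columns=[baseline])

# Replace the original column with the remaining indicators
encoded = demo.drop(columns=["houtp"]).join(dummies)
print(encoded)
```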

def get_category(df, col, binsnum, labels, qcut = False):
    if qcut:
        localdf = pd.qcut(df[col], q = binsnum, labels = labels) # quantile cut
    else:
        localdf = pd.cut(df[col], bins = binsnum, labels = labels) # equal-length cut
        
    localdf = pd.DataFrame(localdf)
    name = 'gp' + '_' + col
    localdf[name] = localdf[col]
    df = df.join(localdf[name])
    df[name] = df[name].astype(object)
    return df

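The get_category function creates categorical bins from a numerical column in a DataFrame. It uses equal-width bins by default, or quantile bins when qcut is True, and stores the result in a new column whose name starts with gp_.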
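The difference between the two binning modes is easiest to see side by side. The ages below are made up; pd.cut splits the value range into intervals of equal length, while pd.qcut tries to put a similar number of rows into each bin.

```python
import pandas as pd

ages = pd.Series([21, 25, 30, 35, 40, 45, 50, 70], name="Age")
labels = ["low", "medium", "high"]

# Equal-width bins: the range 21..70 is split into three intervals of equal length
equal_width = pd.cut(ages, bins=3, labels=labels)

# Quantile bins: each bin receives roughly the same number of rows
equal_count = pd.qcut(ages, q=3, labels=labels)

width_counts = equal_width.value_counts().to_dict()
count_counts = equal_count.value_counts().to_dict()
print(width_counts)
print(count_counts)
```

With a skewed variable such as income, quantile bins avoid nearly empty groups, which may be why qcut is used for income while age and working years use equal-width bins.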

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Binary Features

Binary features, also known as binary variables or binary indicators, are categorical variables that can take only two distinct values, typically represented as 0 and 1. These features are used to indicate the presence or absence of a particular characteristic or attribute within the dataset.

We will look for the various binary features and their various properties.

Gender

new_data['Gender'] = new_data['Gender'].replace(['F','M'],[0,1])
print(new_data['Gender'].value_counts())
iv, data = calc_iv(new_data,'Gender','target')
ivtable.loc[ivtable['variable']=='Gender','IV']=iv
data.head()

Output:

Having a Car or Not

new_data['Car'] = new_data['Car'].replace(['N','Y'],[0,1])
print(new_data['Car'].value_counts())
iv, data=calc_iv(new_data,'Car','target')
ivtable.loc[ivtable['variable']=='Car','IV']=iv
data.head()

Output:

Owning Realty or Not

new_data['Reality'] = new_data['Reality'].replace(['N','Y'],[0,1])
print(new_data['Reality'].value_counts())
iv, data=calc_iv(new_data,'Reality','target')
ivtable.loc[ivtable['variable']=='Reality','IV']=iv
data.head()

Output:

Having a Phone or Not

new_data['phone']=new_data['phone'].astype(str)
print(new_data['phone'].value_counts(normalize=True,sort=False))
new_data.drop(new_data[new_data['phone'] == 'nan' ].index, inplace=True)
iv, data=calc_iv(new_data,'phone','target')
ivtable.loc[ivtable['variable']=='phone','IV']=iv
data.head()

Output:

Having an Email or Not

print(new_data['email'].value_counts(normalize=True,sort=False))
new_data['email']=new_data['email'].astype(str)
iv, data=calc_iv(new_data,'email','target')
ivtable.loc[ivtable['variable']=='email','IV']=iv
data.head()

Output:

Having a Work Phone or Not

new_data['wkphone']=new_data['wkphone'].astype(str)
iv, data = calc_iv(new_data,'wkphone','target')
new_data.drop(new_data[new_data['wkphone'] == 'nan' ].index, inplace=True)
ivtable.loc[ivtable['variable']=='wkphone','IV']=iv
data.head()

Output:

Continuous Variables

Continuous variables, also known as quantitative or numerical variables, are measurements that can take any value within a specific range. Unlike binary features, which have only two possible values, continuous variables can have an infinite number of possible values within a given interval. Now we will look for the various continuous variables and their properties.

Children Numbers

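new_data['ChldNo'] = new_data['ChldNo'].astype(object) # allow mixing counts with the '2More' label
new_data.loc[new_data['ChldNo'] >= 2,'ChldNo']='2More'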
print(new_data['ChldNo'].value_counts(sort=False))

Output:

iv, data=calc_iv(new_data,'ChldNo','target')
ivtable.loc[ivtable['variable']=='ChldNo','IV']=iv
data.head()

Output:

Annual Income

new_data['inc'] = new_data['inc'].astype(float)/10000 # income in units of 10,000
print(new_data['inc'].value_counts(bins=10,sort=False))
new_data['inc'].plot(kind='hist',bins=50,density=True)

Output:

new_data = get_category(new_data,'inc', 3, ["low","medium", "high"], qcut = True)
iv, data = calc_iv(new_data,'gp_inc','target')
ivtable.loc[ivtable['variable']=='inc','IV']=iv
data.head()

Output:

Age

new_data['Age']=-(new_data['DAYS_BIRTH'])//365
print(new_data['Age'].value_counts(bins=10,normalize=True,sort=False))
new_data['Age'].plot(kind='hist',bins=20,density=True)

Output:

new_data = get_category(new_data,'Age',5, ["lowest","low","medium","high","highest"])
iv, data = calc_iv(new_data,'gp_Age','target')
ivtable.loc[ivtable['variable']=='DAYS_BIRTH','IV'] = iv
data.head()

Output:

Working Years

new_data['worktm']=-(new_data['DAYS_EMPLOYED'])//365
new_data['worktm'] = new_data['worktm'].mask(new_data['worktm']<0) # negative values become NaN in this column only
new_data['worktm'] = new_data['worktm'].fillna(new_data['worktm'].mean()) # replace NaN by the mean
new_data['worktm'].plot(kind='hist',bins=20,density=True)

Output:

new_data = get_category(new_data,'worktm',5, ["lowest","low","medium","high","highest"])
iv, data=calc_iv(new_data,'gp_worktm','target')
ivtable.loc[ivtable['variable']=='DAYS_EMPLOYED','IV']=iv
data.head()

Output:

Family Size

new_data['famsize']=new_data['famsize'].astype(int)
new_data['famsizegp']=new_data['famsize']
new_data['famsizegp']=new_data['famsizegp'].astype(object)
new_data.loc[new_data['famsizegp']>=3,'famsizegp']='3more'
iv, data=calc_iv(new_data,'famsizegp','target')
ivtable.loc[ivtable['variable']=='famsize','IV']=iv
data.head()

Output:

Categorical Features

Categorical features, also known as qualitative or nominal variables, represent characteristics or attributes that fall into distinct categories or groups. Unlike continuous variables, which have a range of numerical values, categorical features have a finite number of discrete values or labels. Now we will look at the various categorical features and their properties.

Income Types

print(new_data['inctp'].value_counts(sort=False))
print(new_data['inctp'].value_counts(normalize=True,sort=False))
new_data.loc[new_data['inctp']=='Pensioner','inctp']='State servant'
new_data.loc[new_data['inctp']=='Student','inctp']='State servant'
iv, data=calc_iv(new_data,'inctp','target')
ivtable.loc[ivtable['variable']=='inctp','IV']=iv
data.head()

Output:

Occupation Type

# Group the detailed occupations into three broad classes
labor = ['Cleaning staff', 'Cooking staff', 'Drivers', 'Laborers',
         'Low-skill Laborers', 'Security staff', 'Waiters/barmen staff']
office = ['Accountants', 'Core staff', 'HR staff', 'Medicine staff',
          'Private service staff', 'Realty agents', 'Sales staff', 'Secretaries']
hightech = ['Managers', 'High skill tech staff', 'IT staff']
new_data.loc[new_data['occyp'].isin(labor),'occyp']='Laborwk'
new_data.loc[new_data['occyp'].isin(office),'occyp']='officewk'
new_data.loc[new_data['occyp'].isin(hightech),'occyp']='hightecwk'
print(new_data['occyp'].value_counts())
iv, data=calc_iv(new_data,'occyp','target')
ivtable.loc[ivtable['variable']=='occyp','IV']=iv
data.head()         

Output:

House Type

iv, data=calc_iv(new_data,'houtp','target')
ivtable.loc[ivtable['variable']=='houtp','IV']=iv
data.head()

Output:

Education

new_data.loc[new_data['edutp']=='Academic degree','edutp']='Higher education'
iv, data=calc_iv(new_data,'edutp','target')
ivtable.loc[ivtable['variable']=='edutp','IV']=iv
data.head()

Output:

Marital Status

iv, data=calc_iv(new_data,'famtp','target')
ivtable.loc[ivtable['variable']=='famtp','IV']=iv
data.head()

Output:

IV and WOE

Weight of Evidence(WoE):

woe_i = ln(P(yi) / P(ni)) = ln((yi / ys) / (ni / ns))

Where:

woe_i is the WoE for a specific category i.
P(yi) is the share of all "Good" (non-default) observations that fall in category i, that is yi / ys.
P(ni) is the share of all "Bad" (default) observations that fall in category i, that is ni / ns.
yi is the number of "Good" observations in category i.
ys is the total number of "Good" observations.
ni is the number of "Bad" observations in category i.
ns is the total number of "Bad" observations.

Information Value (IV):

IV = Σ[(P(yi) - P(ni)) * ln(P(yi) / P(ni))]

Where:

P(yi) is the share of "Good" samples in category i (number of "Good" samples in category i divided by the total number of "Good" samples).
P(ni) is the share of "Bad" samples in category i (number of "Bad" samples in category i divided by the total number of "Bad" samples).
The sum runs over all categories i of the variable.

The IV value measures the variable's ability to predict.
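A small worked example may help to see how the two formulas fit together. The counts below are made up; the calculation follows the same steps as calc_iv, with "Good" taken as target 0 and "Bad" as target 1.

```python
import numpy as np
import pandas as pd

# Made-up counts of good and bad clients for a two-level feature
tbl = pd.DataFrame({"Value": ["A", "B"], "Good": [80, 20], "Bad": [10, 10]})

# Share of all good (bad) clients that fall in each category
tbl["dist_good"] = tbl["Good"] / tbl["Good"].sum()
tbl["dist_bad"] = tbl["Bad"] / tbl["Bad"].sum()

# Weight of evidence per category, then each category's IV contribution
tbl["WoE"] = np.log(tbl["dist_good"] / tbl["dist_bad"])
tbl["IV"] = (tbl["dist_good"] - tbl["dist_bad"]) * tbl["WoE"]

# The variable's IV is the sum over its categories
iv_total = tbl["IV"].sum()
print(tbl)
print("IV =", iv_total)
```

Category A holds 80% of the good clients but only 50% of the bad ones, so its WoE is positive; category B is the reverse. Both categories contribute a positive amount to the IV.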

Relationship between IV value and predictive power

IV	Ability to predict
<0.02	Almost no predictive power
0.02~0.1	weak predictive power
0.1~0.3	Moderate predictive power
0.3~0.5	Strong predictive power
>0.5	Predictive power is too strong, need to check variables
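The bands in this table can be turned into a small lookup function. The helper name iv_strength is hypothetical, and the treatment of values that fall exactly on a boundary is an assumption, since the table does not specify it.

```python
def iv_strength(iv: float) -> str:
    """Hypothetical helper mapping an IV value to the bands in the table above."""
    if iv < 0.02:
        return "Almost no predictive power"
    if iv < 0.1:
        return "Weak predictive power"
    if iv < 0.3:
        return "Moderate predictive power"
    if iv <= 0.5:
        return "Strong predictive power"
    return "Predictive power is too strong, need to check variables"


# Arbitrary example values, one per band
examples = {iv: iv_strength(iv) for iv in (0.01, 0.05, 0.2, 0.4, 0.8)}
for iv, label in examples.items():
    print(iv, "->", label)
```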

ivtable=ivtable.sort_values(by='IV',ascending=False)
ivtable.loc[ivtable['variable']=='DAYS_BIRTH','variable']='agegp'
ivtable.loc[ivtable['variable']=='DAYS_EMPLOYED','variable']='worktmgp'
ivtable.loc[ivtable['variable']=='inc','variable']='incgp'
ivtable

Output:

Age Group (agegp) has the highest IV of 0.0659351, which still falls in the weak range (0.02~0.1) of the table above. Other variables such as Work Phone (wkphone), Number of Children (ChldNo), Phone (phone), Income Type (inctp), Email (email), Car Ownership (Car), and Occupation Type (occyp) have very low IV values, suggesting they have little or no predictive power.

Splitting the Dataset

Now we will split the dataset into a training and testing set.

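# One-hot encode the grouped features; the column names selected below
# (e.g. 'ChldNo_1', 'gp_Age_high') rely on convert_dummy having been applied
for col in ['ChldNo','gp_Age','gp_worktm','occyp','famsizegp','houtp','edutp','famtp']:
    new_data = convert_dummy(new_data, col)

Y = new_data['target']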
X = new_data[['Gender','Reality','ChldNo_1', 'ChldNo_2More','wkphone',
              'gp_Age_high', 'gp_Age_highest', 'gp_Age_low',
       'gp_Age_lowest','gp_worktm_high', 'gp_worktm_highest',
       'gp_worktm_low', 'gp_worktm_medium','occyp_hightecwk', 
              'occyp_officewk','famsizegp_1', 'famsizegp_3more',
       'houtp_Co-op apartment', 'houtp_Municipal apartment',
       'houtp_Office apartment', 'houtp_Rented apartment',
       'houtp_With parents','edutp_Higher education',
       'edutp_Incomplete higher', 'edutp_Lower secondary','famtp_Civil marriage',
       'famtp_Separated','famtp_Single / not married','famtp_Widow']]

Y = Y.astype('int')
X_balance,Y_balance = SMOTE().fit_resample(X,Y) # fit_sample was renamed to fit_resample in imbalanced-learn
X_balance = pd.DataFrame(X_balance, columns = X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_balance,Y_balance, 
                                                    stratify=Y_balance, test_size=0.3,
                                                    random_state = 10086)
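One point to keep in mind about the code above is that oversampling before the split lets synthetic points derived from test rows influence training, which can inflate the test accuracy. A hedged alternative is to split first and balance only the training part. The sketch below uses plain random oversampling with NumPy as a simple stand-in for SMOTE (it duplicates minority rows rather than synthesising new ones); the data and the helper name oversample_minority are made up for illustration.

```python
import numpy as np


def oversample_minority(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Made-up data: 95 good clients (0) followed by 5 bad clients (1)
X_all = np.random.default_rng(42).normal(size=(100, 3))
y_all = np.array([0] * 95 + [1] * 5)

# Stratified split done by hand: 66 good + 4 bad for training, the rest for testing
train_idx = np.concatenate([np.arange(0, 66), np.arange(95, 99)])
test_idx = np.concatenate([np.arange(66, 95), np.arange(99, 100)])

# Balance the training part only; the test part keeps its original class ratio
X_tr, y_tr = oversample_minority(X_all[train_idx], y_all[train_idx])
X_te, y_te = X_all[test_idx], y_all[test_idx]

print(np.bincount(y_tr), np.bincount(y_te))
```

Scores measured on such an untouched test set are not directly comparable with the accuracies reported in this article, because the test set stays imbalanced.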

Modeling

We will then proceed to train and evaluate different machine learning algorithms, including logistic regression, decision trees, random forests, support vector machines (SVM), and gradient boosting methods. Each algorithm has its own strengths and characteristics, which makes it important to compare their performance and choose the one that best fits our credit card approval prediction task.

1. Logistic Regression

model = LogisticRegression(C=0.8,
                           random_state=0,
                           solver='lbfgs')
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))

sns.set_style('white') 
class_names = ['0','1']
plot_confusion_matrix(confusion_matrix(y_test,y_predict),
                      classes= class_names, normalize = True, 
                      title='Normalized Confusion Matrix: Logistic Regression')

Output:

Logistic Regression (LR) achieved an accuracy score of 0.61215. Because the classes were balanced with SMOTE before the split, a constant guess would score about 0.5, so this is only a modest improvement over chance.

2. Decision Tree

model = DecisionTreeClassifier(max_depth=12,
                               min_samples_split=8,
                               random_state=1024)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))

plot_confusion_matrix(confusion_matrix(y_test,y_predict),
                      classes=class_names, normalize = True, 
                      title='Normalized Confusion Matrix: CART')

Output:

Decision Tree Classifier (DTC) performed better with an accuracy score of 0.82897. This suggests that the model is more effective in capturing the patterns and relationships in the data for credit card approval prediction.

3. Random Forest

model = RandomForestClassifier(n_estimators=250,
                              max_depth=12,
                              min_samples_leaf=16
                              )
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))

plot_confusion_matrix(confusion_matrix(y_test,y_predict),
                      classes=class_names, normalize = True, 
                      title='Normalized Confusion Matrix: Random Forests')

Output:

Random Forest Classifier (RFC) demonstrated a higher accuracy score of 0.89459. This indicates that the ensemble of decision trees in the random forest model improved the predictive performance compared to the single decision tree.

4. SVM

model = svm.SVC(C = 0.8,
                kernel='linear')
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))

plot_confusion_matrix(confusion_matrix(y_test,y_predict),
                      classes=class_names, normalize = True, 
                      title='Normalized Confusion Matrix: SVM')

Output:

Support Vector Machines (SVM) had a lower accuracy score of 0.59367, indicating that they may not be as effective in capturing the complexities of the credit card approval prediction task in this case.

5. LightGBM

model = LGBMClassifier(num_leaves=31,
                       max_depth=8, 
                       learning_rate=0.02,
                       n_estimators=250,
                       subsample = 0.8,
                       colsample_bytree =0.8
                      )
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))

Output:

Light GBM achieved a high accuracy score of 0.90356, suggesting that the gradient boosting algorithm used in this model effectively improved the prediction accuracy compared to the other models.

# This function is used to plot the feature importance of a classifier model.
def plot_importance(classifier, x_train, point_size = 25):
    '''plot feature importance'''
    values = sorted(zip(x_train.columns, classifier.feature_importances_), key = lambda x: x[1] * -1)
    imp = pd.DataFrame(values,columns = ["Name", "Score"])
    imp.sort_values(by = 'Score',inplace = True)
    sns.scatterplot(x = 'Score',y='Name', linewidth = 0,
                data = imp,s = point_size, color='red').set(
    xlabel='importance', 
    ylabel='features')
    
plot_importance(model, X_train,20)   

Output:

model.booster_.feature_importance(importance_type='gain')
# It is a method used to obtain the feature importance scores of a LightGBM model.

Output:

6. XGBoost

model = XGBClassifier(max_depth=12,
                      n_estimators=250,
                      min_child_weight=8, 
                      subsample=0.8, 
                      learning_rate =0.02,    
                      seed=42)

model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print('Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))

Output:

XGBoost performed with a high accuracy score of 0.93789. This indicates that the extreme gradient boosting algorithm employed in XGBoost captured the intricate patterns in the data and made highly accurate predictions for credit card approval.

7. CatBoost

model = CatBoostClassifier(iterations=250,
                           learning_rate=0.2,
                           od_type='Iter',
                           verbose=25,
                           depth=16,
                           random_seed=42)

model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print('CatBoost Accuracy Score is {:.5}'.format(accuracy_score(y_test, y_predict)))
print(pd.DataFrame(confusion_matrix(y_test,y_predict)))

Output:

CatBoost, however, achieved an accuracy score of only 0.50081, which on the balanced test set is essentially chance level. This suggests that the model did not learn a useful decision rule with these settings and requires further investigation or parameter tuning.

The XGBoost model exhibited the highest accuracy among the models considered, followed by LightGBM and the Random Forest Classifier. These models appear to be more suitable for predicting credit card approval.
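The comparison can be summarised by sorting the accuracy scores reported in the sections above. This is only a bookkeeping sketch; the numbers are copied from the text and are not recomputed here.

```python
# Accuracy scores as reported in the sections above
reported = {
    "Logistic Regression": 0.61215,
    "Decision Tree": 0.82897,
    "Random Forest": 0.89459,
    "SVM": 0.59367,
    "LightGBM": 0.90356,
    "XGBoost": 0.93789,
    "CatBoost": 0.50081,
}

# Sort from best to worst
ranking = sorted(reported.items(), key=lambda kv: kv[1], reverse=True)
top_three = [name for name, _ in ranking[:3]]

for name, score in ranking:
    print(f"{name:<20} {score:.5f}")
```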

Conclusion

Credit card approval using machine learning offers numerous benefits, including enhanced accuracy, faster processing, personalized offerings, and risk mitigation. By leveraging machine learning algorithms, financial institutions can streamline the approval process, provide customized credit card solutions, and make informed lending decisions. However, it is crucial to address challenges related to data privacy, model interpretability, and fairness to ensure responsible and ethical implementation of machine learning in credit card approval. With proper consideration and oversight, machine learning has the potential to revolutionize the lending landscape, benefiting both consumers and lenders alike.
