Credit Card Fraud Detection Using Machine Learning

Credit card fraud occurs when someone uses another person's credit card to make financial transactions without the cardholder's knowledge. Credit cards were created to increase consumers' purchasing power: the card is an agreement with the bank that lets the holder spend money lent by the bank and either repay it by the due date or incur interest charges.

With the advent of e-commerce and the boom of OTT platforms during the coronavirus pandemic, credit card use has increased tremendously, along with other payment methods. The number of credit card scams has risen significantly along with it. These thefts cost the global economy more than $24 billion every year. As a result, solving this problem has become vital, and several firms have been launched within this $30 billion market. Building automated models for such a growing problem is therefore necessary, and machine learning (ML) is the key!

We will now classify credit card transactions as fraudulent or genuine, and handle an imbalanced dataset along the way.

Attributes of Dataset

  • V1 - V28: Numerical features produced by a PCA transformation.
  • Time: Seconds elapsed between each transaction and the first transaction in the dataset.
  • Amount: Transaction amount.
  • Class: Fraud (1) or no fraud (0).

Code:

Importing Libraries

Reading the Dataset
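
The original code for these two steps is not shown here, so the following is only a minimal sketch of what they might look like. It assumes the data ships as a CSV file with the columns listed under Attributes of Dataset. The file name `creditcard.csv`, the helper `load_transactions`, and the three sample rows (with only V1 and V2) are hypothetical and exist only so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file; the actual dataset has V1..V28.
SAMPLE_CSV = """Time,V1,V2,Amount,Class
0,-1.36,-0.07,149.62,0
1,1.19,0.27,2.69,0
406,-2.31,1.95,0.00,1
"""


def load_transactions(source):
    """Read transactions from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# Real use (assumed file name): load_transactions("creditcard.csv")
data = load_transactions(io.StringIO(SAMPLE_CSV))

print(data.shape)                 # (rows, columns)
print(data.dtypes)                # column types
print(data.isnull().sum().sum())  # total number of missing values
```

Checking the shape, the column types, and the number of missing values is a common first look at a dataset before any modeling.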


  • Comparing the feature mean values of fraud and no-fraud cases:
  • In no-fraud cases, the mean values of V1 - V28 are nearly zero. The mean transaction amount for no-fraud cases, 88.29, is lower than the mean amount for fraud cases, 122.21.
  • The mean Time value is higher for no-fraud transactions than for fraudulent ones.
  • These differences may be useful clues for identifying fraudulent transactions.
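
A class-wise comparison of means like the one described above can be sketched with a pandas `groupby`. The toy values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy transactions; values are invented for illustration.
data = pd.DataFrame(
    {
        "V1": [0.1, -0.1, -4.0, -6.0],
        "Amount": [80.0, 100.0, 150.0, 90.0],
        "Class": [0, 0, 1, 1],
    }
)

# One row per class, one column per feature, each cell the class-wise mean.
class_means = data.groupby("Class").mean()
print(class_means)
```

Features whose means differ clearly between the two rows are candidates for separating fraud from no-fraud transactions.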

Data Visualisation

We will now visualise our data.

Target Variable Visualisation (Class)

  • The data is clearly imbalanced: the vast majority of transactions are not fraudulent.
  • With such heavily imbalanced data, a classification model will bias its predictions towards the majority class, No Fraud.
  • As a result, balancing the data becomes an important step in building a robust model.
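
To put a number on the imbalance, the short calculation below uses the class counts reported for this dataset in the Calculation for Data Balancing section (284315 no-fraud and 492 fraud samples). It is plain arithmetic, not part of the original code.

```python
# Class counts as reported for this dataset in the data balancing section.
n_no_fraud = 284315
n_fraud = 492

total = n_no_fraud + n_fraud
fraud_share = n_fraud / total            # fraction of all transactions that are fraud
imbalance_ratio = n_no_fraud / n_fraud   # no-fraud samples per fraud sample

print(f"Total transactions: {total}")
print(f"Fraud share: {fraud_share:.4%}")
print(f"No-fraud samples per fraud sample: {imbalance_ratio:.0f}")
```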

Feature Selection

We need to select certain features from our dataset.

Correlation Matrix

  • The dataset has too many features, which makes the full correlation matrix hard to read.
  • We will therefore plot only the correlation of each feature with the target variable.

  • For feature selection, we will drop features whose correlation with Class lies between -0.1 and 0.1.
  • V4 and V11 are positively correlated with the Class feature, whereas V7, V3, V16, V10, V12, V14, and V17 are negatively correlated with it.
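
The correlation threshold described above can be sketched as follows. This is an illustration under assumptions: the toy columns `strong_pos`, `strong_neg`, and `noise` are hypothetical stand-ins for the real V1 - V28 features, and Pearson correlation (the pandas default) is assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Hypothetical toy data: two columns depend on the label, "noise" does not.
label = rng.integers(0, 2, size=n)
data = pd.DataFrame(
    {
        "strong_pos": label + rng.normal(0, 0.5, size=n),
        "strong_neg": -label + rng.normal(0, 0.5, size=n),
        "noise": rng.normal(0, 1, size=n),
        "Class": label,
    }
)

# Pearson correlation of every feature with the target variable.
corr_with_class = data.corr()["Class"].drop("Class")

# Keep only features whose correlation lies outside [-0.1, 0.1].
selected = corr_with_class[corr_with_class.abs() > 0.1].index.tolist()

print(corr_with_class.sort_values(ascending=False))
print(selected)
```

Taking the absolute value keeps both positively and negatively correlated features, which matches the rule of rejecting only the band around zero.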

ANOVA Test

  • The higher the ANOVA score, the more important the feature is with respect to the target variable.
  • Based on the ANOVA scores, we will discard features with a score below 50.
  • We will therefore build two sets of models: one using the features selected from the correlation plot and one using the features selected by ANOVA score.
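
One way to compute such scores is scikit-learn's `f_classif`, which returns an ANOVA F-value per feature. Whether the original code used this exact function is an assumption; the two toy features below are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
n = 5000

# Hypothetical toy data: one feature shifts with the label, one is pure noise.
y = rng.integers(0, 2, size=n)
X = np.column_stack(
    [
        y * 1.0 + rng.normal(0, 1, size=n),  # informative feature
        rng.normal(0, 1, size=n),            # noise feature
    ]
)
feature_names = ["informative", "noise"]

# ANOVA F-value (and p-value) of each feature against the class label.
f_scores, p_values = f_classif(X, y)

# Discard features with a score below 50, as described above.
kept = [name for name, score in zip(feature_names, f_scores) if score >= 50]

for name, score in zip(feature_names, f_scores):
    print(f"{name}: F = {score:.2f}")
print(kept)
```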


Data Balancing

There are two options for dealing with imbalanced data:

  • Undersampling: Reduce the number of majority-class samples of the target variable.
  • Oversampling: Increase the number of minority-class samples of the target variable towards the majority-class count.

For the best results, we will use a combination of undersampling and oversampling.

We will first undersample the majority class and then oversample the minority class.

To balance the data, we will use imblearn.



Calculation for Data Balancing:

  • Sampling Strategy: The ratio that serves as the common parameter for both oversampling and undersampling.
  • Sampling Strategy = (Samples of Minority Class) / (Samples of Majority Class)

In this case,

  • Majority Class: No Fraud Cases: 284315 samples
  • Minority Class: Fraud Cases: 492 samples

Undersampling: Trim down the majority class samples

  • Sampling Strategy = 0.1
  • 0.1 = 492 / (Majority Class Samples)
  • After undersampling,
    • Majority Class: No Fraud Cases: 4920 samples
    • Minority Class: Fraud Cases: 492 samples

Oversampling: Increase the minority class samples

  • Sampling Strategy = 0.5
  • 0.5 = (Minority Class Samples) / 4920

Following oversampling,

  • Majority Class: No Fraud Cases: 4920 samples.
  • Minority class: Fraud cases: 2460 samples.

Final Class Samples:

  • Majority Class: No Fraud Cases: 4920 samples.
  • Minority class: Fraud cases: 2460 samples.
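
The arithmetic above can be reproduced in a few lines. The helper names `undersampled_majority` and `oversampled_minority` are hypothetical and only mirror the Sampling Strategy formula; they are not imblearn functions.

```python
def undersampled_majority(n_minority, sampling_strategy):
    """Majority count for which minority / majority equals sampling_strategy."""
    return round(n_minority / sampling_strategy)


def oversampled_minority(n_majority, sampling_strategy):
    """Minority count for which minority / majority equals sampling_strategy."""
    return round(n_majority * sampling_strategy)


n_majority, n_minority = 284315, 492

# Undersampling with Sampling Strategy = 0.1: 0.1 = 492 / majority
n_majority = undersampled_majority(n_minority, 0.1)

# Oversampling with Sampling Strategy = 0.5: 0.5 = minority / 4920
n_minority = oversampled_minority(n_majority, 0.5)

print(n_majority, n_minority)  # 4920 2460
```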

To counter the bias that an imbalanced dataset introduces into predictions, we add minority-class samples through oversampling. Because these added samples are synthetic, the models are trained partly on synthetic data so that their predictions are not skewed towards the majority target class.

Ranking the models by accuracy would therefore be misleading. Instead, we will evaluate each model using the confusion matrix, the ROC-AUC plot, and the ROC-AUC score.
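
A small illustration of why accuracy alone can mislead on fraud data: the hypothetical "model" below scores every transaction as genuine, yet still reaches 99% accuracy on made-up labels with 1% fraud. The confusion matrix and the ROC-AUC score expose the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical labels: 990 genuine transactions and 10 frauds.
y_true = np.array([0] * 990 + [1] * 10)

# A useless "model" that gives every transaction a fraud score of 0.
scores = np.zeros(len(y_true))
y_pred = (scores >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)  # looks excellent
auc = roc_auc_score(y_true, scores)        # no better than chance
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted class

print(f"Accuracy: {accuracy:.2f}")
print(f"ROC-AUC:  {auc:.2f}")
print(cm)
```

The second row of the confusion matrix shows that all 10 frauds are missed, which the accuracy figure hides.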

Modeling

Now, we will work on various machine-learning models.
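
The sketch below shows one way to compare the five classifiers covered in this section. Everything specific in it is an assumption: the toy data stands in for the balanced, feature-selected training set, the hyperparameters are mostly scikit-learn defaults, and ROC-AUC is assumed as the cross-validation metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the balanced data (about 2:1 no-fraud to fraud).
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=4,
    weights=[2 / 3],
    random_state=0,
)

# Assumed, mostly default hyperparameters for each model.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: mean cross-validated ROC-AUC = {score:.3f}")
```

Using the same folds for every model keeps the comparison fair, since each classifier is scored on identical train and validation splits.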



1. Logistic Regression



2. Support Vector Machine (SVM)



3. Decision Tree Classifier (DTC)



4. Random Forest Classifier (RFC)



5. K-Nearest Neighbors (KNN)



Result Tables

Models Based on Correlation Plot

Sr. No. | ML Algorithm              | Cross Validation Score | ROC AUC Score | F1 Score (Fraud)
1       | Logistic Regression       | 98.01%                 | 92.35%        | 91%
2       | Support Vector Classifier | 97.94%                 | 92.10%        | 91%
3       | Decision Tree Classifier  | 96.67%                 | 91.36%        | 90%
4       | Random Forest Classifier  | 97.84%                 | 91.71%        | 91%
5       | K-Nearest Neighbors       | 99.34%                 | 97.63%        | 97%

Models Based on ANOVA Score

Sr. No. | ML Algorithm              | Cross Validation Score | ROC AUC Score | F1 Score (Fraud)
1       | Logistic Regression       | 98.45%                 | 94.69%        | 94%
2       | Support Vector Classifier | 98.32%                 | 94.40%        | 94%
3       | Decision Tree Classifier  | 97.13%                 | 93.69%        | 93%
4       | Random Forest Classifier  | 98.20%                 | 94.06%        | 94%
5       | K-Nearest Neighbors       | 99.54%                 | 98.47%        | 97%

The features are anonymised, so feature selection cannot be guided by domain knowledge of the problem. Statistical tests are therefore extremely important for selecting the features used in modeling.

Because the data was balanced using SMOTE, the models were trained on partly synthetic data, and accuracy is not a reliable measure of their performance. We therefore use the cross-validation score and the ROC-AUC score to evaluate the models.

