Credit Card Fraud Detection Using Machine Learning
Credit card fraud occurs when someone uses another person's credit card to conduct financial transactions without the card owner's knowledge. Credit cards were created to increase consumers' purchasing power: they are agreements with a bank that let the user spend money lent by the bank, in return for repaying it by the due date or incurring interest charges. With the rise of e-commerce and the boom of OTT platforms during the Coronavirus pandemic, credit card use has increased tremendously, along with other payment methods. As card use has grown, so has the number of credit card scams. These thefts cost the global economy more than $24 billion every year. As a result, solving this problem has become vital, and several firms have been launched within this $30 billion market. Automated models are therefore needed for this growing problem, and machine learning is a natural fit. We will now try to classify whether a credit card transaction is fraudulent or genuine, and handle an unbalanced dataset along the way.
Attributes of Dataset
Importing Libraries
Reading the Dataset
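A minimal sketch of this step, assuming pandas is used to read a CSV file. The inline sample below is a stand-in for the real file, which is not named here, and every column name except `Class` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file. Only the "Class" column name
# comes from the article; the other columns and all values are invented.
SAMPLE_CSV = io.StringIO(
    "feature_a,feature_b,Amount,Class\n"
    "0.12,-1.30,25.00,0\n"
    "-0.45,0.88,310.50,0\n"
    "1.97,-2.10,1.99,1\n"
    "0.03,0.41,89.90,0\n"
)

data = pd.read_csv(SAMPLE_CSV)

# Quick structural checks before any modeling.
print(data.head())          # first rows
print(data.shape)           # (rows, columns)
print(data.dtypes)          # column types
print(data.isnull().sum())  # missing values per column
```

With a real file, `SAMPLE_CSV` would be replaced by the path to the dataset.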
Data Visualisation
We will now visualise our data.
Target Variable Visualisation (Class)
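Before plotting the target variable, it can help to tabulate how unbalanced it is. The sketch below uses toy labels, not the article's data, and assumes that 0 marks a genuine transaction and 1 marks a fraudulent one.

```python
import pandas as pd

# Toy labels for illustration only (assumed coding: 0 = genuine, 1 = fraud).
labels = pd.Series([0] * 95 + [1] * 5, name="Class")

# Absolute count and percentage share of each class.
counts = labels.value_counts().sort_index()
share = (counts / counts.sum() * 100).round(2)

summary = pd.DataFrame({"count": counts, "percent": share})
print(summary)

# The same counts can be drawn as a bar chart if matplotlib is installed:
# counts.plot(kind="bar")
```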
Feature Selection
We need to select certain features from our dataset.
Correlation Matrix
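One way to use a correlation matrix for feature selection is to keep the features whose correlation with `Class` is strong enough. This is a sketch on synthetic data. The feature names and the threshold are illustrative assumptions, not values from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data; column names other than "Class" are hypothetical.
target = rng.integers(0, 2, size=n)
data = pd.DataFrame({
    "strong": target * 2.0 + rng.normal(0, 0.5, n),  # tracks the target closely
    "weak": target * 0.2 + rng.normal(0, 1.0, n),    # only loosely related
    "noise": rng.normal(0, 1.0, n),                  # unrelated to the target
    "Class": target,
})

# Pearson correlation of every feature with the target column.
corr_with_target = data.corr()["Class"].drop("Class")

# Keep features whose absolute correlation exceeds a chosen cut-off.
THRESHOLD = 0.3  # illustrative cut-off, not taken from the article
selected = corr_with_target[corr_with_target.abs() > THRESHOLD].index.tolist()

print(corr_with_target.sort_values(ascending=False))
print(selected)
```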
ANOVA Test
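An ANOVA F-test scores each numeric feature by how well it separates the classes. The sketch below uses scikit-learn's `f_classif` with `SelectKBest` on synthetic data. The feature names `f0` to `f7` and the choice of `k=3` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic, imbalanced stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# Score every feature with the ANOVA F-statistic and keep the top k.
selector = SelectKBest(score_func=f_classif, k=3).fit(features, y)

scores = pd.Series(selector.scores_, index=features.columns)
scores = scores.sort_values(ascending=False)
selected = features.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```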
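Data Balancing
There are two options for dealing with imbalanced data: undersampling, which trims down the majority class samples, and oversampling, which increases the minority class samples.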
We will use a mix of undersampling and oversampling: we first undersample the majority class and then oversample the minority class. For data balancing, we will use imblearn.
Calculation for Data Balancing:
In this case,
Undersampling: Trim down the majority class samples.
Oversampling: Increase the minority class samples.
Following oversampling,
Final Class Samples:
To reduce bias in predictions on an unbalanced dataset, we resample the data. Because of this resampling, the models are trained on partly synthetic data so that predictions are not skewed towards the majority class. Ranking models by accuracy would therefore be misleading. Instead, we will evaluate each model using the confusion matrix, the ROC-AUC curve, and the ROC-AUC score.
Modeling
Now, we will work on various machine-learning models.
1. Logistic Regression
2. SVM (Support Vector Machine)
3. DTC (Decision Tree Classifier)
4. RFC (Random Forest Classifier)
5. KNN (K-Nearest Neighbors)
Result Tables
Models Based on Correlation Plot
Models Based on ANOVA Score
The features are hidden, so feature selection cannot be guided by domain knowledge of the problem. Statistical tests therefore play an important role in selecting features for modeling. Because the data was balanced using SMOTE, accuracy is not a reliable measure for models trained on this synthetic data. As a result, we use the cross-validation score and the ROC-AUC score to evaluate our models.
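The evaluation approach described above can be sketched as follows for the five model families. This runs on synthetic data, and all hyperparameters are illustrative assumptions rather than the article's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced training data.
X, y = make_classification(
    n_samples=1500, n_features=8, n_informative=4,
    weights=[0.67, 0.33], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(random_state=0),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}


def positive_scores(model, features):
    """Return a ranking score for the positive class."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(features)[:, 1]
    return model.decision_function(features)  # e.g. SVC without probabilities


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Mean cross-validated ROC-AUC on the training split.
    cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    model.fit(X_train, y_train)
    # Hold-out ROC-AUC and confusion matrix on the test split.
    test_auc = roc_auc_score(y_test, positive_scores(model, X_test))
    matrix = confusion_matrix(y_test, model.predict(X_test))
    results[name] = {"cv_roc_auc": cv_auc, "test_roc_auc": test_auc, "confusion_matrix": matrix}
    print(f"{name}: CV ROC-AUC={cv_auc:.3f}, test ROC-AUC={test_auc:.3f}")
    print(matrix)
```

When resampling is part of the workflow, the test split should hold only original samples, so resampling is applied to the training split alone.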