Insurance Fraud Detection - Machine Learning

Insurance companies face a serious problem with insurance fraud, which costs them billions of dollars every year. There are several ways that insurance fraud might appear, including fabricating or exaggerating claims. Here is where machine learning may be used to detect insurance fraud.

Machine learning algorithms can analyze large amounts of data to find patterns that may indicate fraud. Because they can process data in near real time, insurance companies can spot and block bogus claims quickly.

Many machine learning methods, including decision trees, random forests, logistic regression, and neural networks, can be used to detect insurance fraud. Each of these algorithms has advantages and disadvantages, so the choice depends on the particular needs of the application.

Benefits of Machine Learning for Fraud Detection

Here are some of the benefits of using machine learning for insurance fraud detection:

Because machine learning algorithms can process vast volumes of data in real time, fraudulent claims can be identified and flagged considerably faster than with conventional techniques.
Machine learning algorithms can examine data from many different sources and spot trends that point to fraud. This can lead to fewer false positives and more accurate fraud detection.
Insurance companies can save a lot of money when fraudulent claims are caught early. With machine learning, insurers can identify and stop fraudulent claims before they are paid out.
Identifying and blocking false claims can also improve the overall customer experience. Valid claims are less likely to be delayed or refused because of fraud, which can increase customer satisfaction.
Machine learning systems can be scaled up or down to meet the insurer's needs. As data volume grows, the same models can handle the increased load without redesigning the detection process.

Class imbalance is a major challenge in insurance fraud detection. Because fraudulent claims are rare compared with valid claims, it can be hard to develop a model that reliably identifies fraud. Techniques such as oversampling, undersampling, and cost-sensitive learning can be used to balance the data and improve the model's performance.
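As a rough illustration of two of these techniques, the sketch below first oversamples the minority class with `sklearn.utils.resample` and then, as an alternative, fits a cost-sensitive model using `class_weight = 'balanced'`. The synthetic data from `make_classification` and all variable names are assumptions made for this example; they do not come from the insurance dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# synthetic, imbalanced toy data (roughly 10% positives); not the insurance dataset
X_toy, y_toy = make_classification(n_samples = 1000, n_features = 8, weights = [0.9, 0.1], random_state = 0)
toy = pd.DataFrame(X_toy)
toy['target'] = y_toy

majority = toy[toy['target'] == 0]
minority = toy[toy['target'] == 1]

# option 1: randomly oversample the minority class until both classes have the same size
minority_upsampled = resample(minority, replace = True, n_samples = len(majority), random_state = 0)
balanced = pd.concat([majority, minority_upsampled])
print(balanced['target'].value_counts())

# option 2: cost-sensitive learning, where mistakes on the rare class are weighted more heavily
weighted_model = LogisticRegression(class_weight = 'balanced', max_iter = 1000)
weighted_model.fit(X_toy, y_toy)
```

Resampling is normally applied to the training split only, so that the test set keeps the original class ratio.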

Python Implementation

Here we will train several models that can be used for insurance fraud detection and compare their accuracy.

Importing Libraries

# necessary imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

plt.style.use('ggplot')

Reading Dataset

Output:

The dataset contains 40 columns.
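The loading code itself is not reproduced in this section. A minimal sketch, assuming the data is stored as a CSV file, could look like the following. The helper name `load_claims`, the file name in the comment, and the two sample rows are all hypothetical; only the column names and the variable name `dataframe` are taken from the code used later.

```python
import io
import pandas as pd

def load_claims(source):
    """Read the claims table from a CSV path or a file-like object."""
    return pd.read_csv(source)

# stand-in for the real file so that the sketch runs on its own (made-up values);
# with the actual data you would pass its path, e.g. load_claims('claims.csv') (hypothetical name)
sample_csv = io.StringIO(
    "months_as_customer,injury_claim,vehicle_claim\n"
    "120,1000,5000\n"
    "48,0,3200\n"
)

dataframe = load_claims(sample_csv)
print(dataframe.shape)
```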

# some missing values are denoted by '0', so let's replace missing values with np.nan

dataframe.replace('0', np.nan, inplace = True)
dataframe.describe()

Output:

Data Preprocessing

Data preprocessing is a critical step in machine learning that involves cleaning, transforming, encoding, selecting, integrating, and reducing data to prepare it for training a machine learning model. The quality of the data and how it is prepared can have a significant impact on the accuracy and performance of the model.

Here, we will do the following:

Visualizing Missing values
Handling Missing Values
Encoding Categorical columns
Outliers Detection

# looking for missing values
dataframe.isna().sum()

Output:

We do have missing values in our data.

Visualizing Missing Values

Missing values can be problematic for machine learning models as they may result in biased or inaccurate results. So visualizing them would help in understanding the extent and pattern of missing data.

import missingno as msno

msno.bar(dataframe)
plt.show()

Output:

Handling Missing Values

We will fill the missing values in each of these categorical columns with the column's most frequent value (its mode).

dataframe['collision_type'] = dataframe['collision_type'].fillna(dataframe['collision_type'].mode()[0])
dataframe['property_damage'] = dataframe['property_damage'].fillna(dataframe['property_damage'].mode()[0])
dataframe['police_report_available'] = dataframe['police_report_available'].fillna(dataframe['police_report_available'].mode()[0])

dataframe.isna().sum()

Output:

Now, there are no missing values left in these columns.

# heatmap

plt.figure(figsize = (18, 12))

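# numeric_only = True restricts the correlation to numeric columns (pandas 2.0+ raises an error on text columns otherwise)
corr = dataframe.corr(numeric_only = True)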

sns.heatmap(data = corr, annot = True, fmt = '.2g', linewidth = 1)
plt.show()

Output:

# dropping columns that are not necessary for prediction

to_drop = ['policy_number','policy_bind_date','policy_state','insured_zip','incident_location','incident_date',
           'incident_state','incident_city','insured_hobbies','auto_make','auto_model','auto_year', '_c39']

dataframe.drop(to_drop, inplace = True, axis = 1)
dataframe.head()

Output:

# checking for multicollinearity

plt.figure(figsize = (18, 12))

corr = dataframe.corr(numeric_only = True)
mask = np.triu(np.ones_like(corr, dtype = bool))

sns.heatmap(data = corr, mask = mask, annot = True, fmt = '.2g', linewidth = 1)
plt.show()

Output:

From the above plot, we can see that there is a high correlation between age and months_as_customer, so we will drop the age column. There is also a high correlation between total_claim_amount and injury_claim, property_claim, and vehicle_claim, because the total claim is the sum of the other three. So we will drop the total_claim_amount column as well.

dataframe.drop(columns = ['age', 'total_claim_amount'], inplace = True)
dataframe.head()

Output:
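Instead of reading the heatmap by eye, strongly correlated column pairs can also be listed programmatically. The helper below is a hypothetical sketch (the name `high_correlation_pairs`, the threshold, and the toy columns are assumptions), shown on made-up data in which `total` is the sum of two parts, similar in spirit to the claim columns above.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold = 0.8):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = df.corr(numeric_only = True)
    # keep only the strict upper triangle so that each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype = bool), k = 1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(value, 3)) for (a, b), value in pairs.items()]

# made-up numbers: 'total' is the sum of two independent parts, 'noise' is unrelated
rng = np.random.default_rng(0)
toy = pd.DataFrame({'part_a': rng.normal(size = 200), 'part_b': rng.normal(size = 200)})
toy['total'] = toy['part_a'] + toy['part_b']
toy['noise'] = rng.normal(size = 200)

print(high_correlation_pairs(toy, threshold = 0.5))
```

On the real data, a call such as `high_correlation_pairs(dataframe)` would be one way to confirm which columns are candidates for dropping.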

# separating the feature and target columns

X = dataframe.drop('fraud_reported', axis = 1)
y = dataframe['fraud_reported']

Encoding Categorical Variable

Encoding converts categorical data into numerical data that machine learning models can process.

We will encode the categorical variables as numbers so that the models can use them to predict insurance fraud.

# extracting categorical columns
dataframe_cat = X.select_dtypes(include = ['object'])

Output:

# printing unique values of each column
for col in dataframe_cat.columns:
    print(f"{col}: \n{dataframe_cat[col].unique()}\n")

Output:

dataframe_cat = pd.get_dummies(dataframe_cat, drop_first = True)
dataframe_cat.head()

Output:
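To see what `drop_first = True` does, consider this small sketch on a made-up column (the column name and its categories are assumptions for illustration). With `drop_first = True`, the first category is dropped because it is implied when all remaining dummy columns are zero, which avoids a redundant column.

```python
import pandas as pd

# made-up categories, used only to show the effect of drop_first
toy = pd.DataFrame({'severity': ['Minor', 'Major', 'Total', 'Minor']})

full = pd.get_dummies(toy, drop_first = False)
reduced = pd.get_dummies(toy, drop_first = True)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # the first category (alphabetically) is dropped
```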

# extracting the numerical columns

dataframe_num = X.select_dtypes(include = ['int64'])
dataframe_num.head()

Output:

# combining the Numerical and Categorical dataframes to get the final dataset

X = pd.concat([dataframe_num, dataframe_cat], axis = 1)
X.head()

Output:

plt.figure(figsize = (25, 20))
plotnumber = 1

for col in X.columns:
    if plotnumber <= 24:
        ax = plt.subplot(5, 5, plotnumber)
        sns.distplot(X[col])
        plt.xlabel(col, fontsize = 15)
       
    plotnumber += 1
   
plt.tight_layout()
plt.show()

Output:

The data looks good. Let's check for outliers.

Outliers Detection

Data points known as outliers differ dramatically from other data points in a dataset. Outliers can appear for a number of reasons, including measurement mistakes, data input problems, or inherent data variability. Statistical analysis and machine learning models can be significantly impacted by outliers because they might provide estimates that are skewed or forecasts that are incorrect.

We will try to look out for the outliers in our data.

plt.figure(figsize = (20, 15))
plotnumber = 1

for col in X.columns:
    if plotnumber <= 24:
        ax = plt.subplot(5, 5, plotnumber)
        sns.boxplot(X[col])
        plt.xlabel(col, fontsize = 15)
   
    plotnumber += 1
plt.tight_layout()
plt.show()

Output:

Outliers are present in some numerical columns. We will scale numerical columns later.
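The box plots flag points beyond 1.5 times the interquartile range (IQR) from the quartiles. The same rule can be applied numerically to count outliers per column. The helper name `iqr_outlier_counts` and the toy values below are assumptions for illustration, not part of the original code.

```python
import pandas as pd

def iqr_outlier_counts(df, k = 1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for every numeric column."""
    numeric = df.select_dtypes(include = 'number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# made-up values: one extreme claim amount, no extreme witness counts
toy = pd.DataFrame({'claim': [10, 12, 11, 13, 12, 11, 300], 'witnesses': [0, 1, 2, 3, 1, 2, 0]})
print(iqr_outlier_counts(toy))
```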

# splitting data into a training set and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
X_train.head()

Output:

dataframe_num= X_train[['months_as_customer', 'policy_deductable', 'umbrella_limit',
       'capital-gains', 'capital-loss', 'incident_hour_of_the_day',
       'number_of_vehicles_involved', 'bodily_injuries', 'witnesses', 'injury_claim', 'property_claim',
       'vehicle_claim']]

# Scaling the numeric values in the dataset

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(dataframe_num)
scaled_dataframe_num = pd.DataFrame(data = scaled_data, columns = dataframe_num.columns, index = X_train.index)
scaled_dataframe_num.head()

Output:

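X_train.drop(columns = scaled_dataframe_num.columns, inplace = True)
X_train = pd.concat([scaled_dataframe_num, X_train], axis = 1)
# scale the test set with the scaler fitted on the training data, so both sets share one scale
scaled_test_num = pd.DataFrame(data = scaler.transform(X_test[dataframe_num.columns]), columns = dataframe_num.columns, index = X_test.index)
X_test = pd.concat([scaled_test_num, X_test.drop(columns = dataframe_num.columns)], axis = 1)
X_train.head()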

Output:
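One way to keep the scaling of training and test data consistent is to wrap the scaler and the model in a scikit-learn `Pipeline` with a `ColumnTransformer`. The sketch below is an alternative arrangement, not the approach used in this tutorial. The toy data, the column selection, and the choice of `LogisticRegression` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# made-up data: two numeric columns on very different scales and one already-encoded flag
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'vehicle_claim': rng.normal(40000, 15000, size = 300),
    'witnesses': rng.integers(0, 4, size = 300),
    'flag': rng.integers(0, 2, size = 300),
})
target = (toy['vehicle_claim'] > 45000).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size = 0.25, random_state = 0)

# scale only the listed numeric columns; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    transformers = [('scale', StandardScaler(), ['vehicle_claim', 'witnesses'])],
    remainder = 'passthrough',
)

model = Pipeline(steps = [('preprocess', preprocess), ('classifier', LogisticRegression(max_iter = 1000))])

# fit() learns the scaling statistics from the training rows only;
# score() and predict() reuse those statistics on the test rows
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```

Because the scaler lives inside the pipeline, the test set cannot accidentally be left unscaled or scaled with its own statistics.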

Models

Now, we will train and test the following models:

Support Vector Classifier
Knn
Decision Tree Classifier
Random Forest Classifier
Ada Boost Classifier
Gradient Boosting Classifier
Stochastic Gradient Boosting (SGB)
XgBoost
Cat Boost Classifier
Extra Trees Classifier
LGBM Classifier
Voting Classifier

We will also check the accuracy of each model.
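Every model in this section follows the same pattern: fit, predict, then print the accuracy, confusion matrix, and classification report. As a sketch, that pattern could be wrapped in a small helper. The function name `evaluate_model` and the synthetic demo data are assumptions and are not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    """Fit the model, print the usual reports, and return (train_accuracy, test_accuracy)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, y_pred)

    print(f"Training accuracy of {name} is : {train_acc}")
    print(f"Test accuracy of {name} is : {test_acc}")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    return train_acc, test_acc

# synthetic stand-in data so that the sketch runs on its own
X_demo, y_demo = make_classification(n_samples = 400, n_features = 10, random_state = 0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, test_size = 0.25, random_state = 0)

demo_train_acc, demo_test_acc = evaluate_model('Decision Tree', DecisionTreeClassifier(random_state = 0),
                                               Xd_train, Xd_test, yd_train, yd_test)
```

A large gap between the two returned values, such as a perfect training score next to a much lower test score, is a quick hint that a model is overfitting.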

1. SVM

from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)

# accuracy_score, confusion_matrix and classification_report

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

acc_svc_train = accuracy_score(y_train, svc.predict(X_train))
acc_svc_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of Support Vector Classifier is : {acc_svc_train}")
print(f"Test accuracy of Support Vector Classifier is : {acc_svc_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

2. KNN

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 30)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
# accuracy_score, confusion_matrix and classification_report

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

acc_knn_train = accuracy_score(y_train, knn.predict(X_train))
acc_knn_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of KNN is : {acc_knn_train}")
print(f"Test accuracy of KNN is : {acc_knn_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

3. Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
# accuracy_score, confusion_matrix and classification_report

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

acc_dt_train = accuracy_score(y_train, dt.predict(X_train))
acc_dt_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of Decision Tree is : {acc_dt_train}")
print(f"Test accuracy of Decision Tree is : {acc_dt_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

# hyper parameter tuning

from sklearn.model_selection import GridSearchCV

params_grid = {
    'criterion' : ['gini', 'entropy'],
    'max_depth' : [3, 5, 7, 10],
    'min_samples_split' : range(2, 10, 1),
    'min_samples_leaf' : range(2, 10, 1)
}

search_grid = GridSearchCV(dt, params_grid, cv = 5, n_jobs = -1, verbose = 1)
search_grid.fit(X_train, y_train)

Output:

# best parameters and the best score

print(search_grid.best_params_)
print(search_grid.best_score_)

Output:

# best estimator

dt = search_grid.best_estimator_

y_pred = dt.predict(X_test)
# accuracy_score, confusion_matrix and classification_report

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

acc_dt_train = accuracy_score(y_train, dt.predict(X_train))
acc_dt_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of Decision Tree is : {acc_dt_train}")
print(f"Test accuracy of Decision Tree is : {acc_dt_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

4. Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(criterion= 'entropy', max_depth= 10, max_features= 'sqrt', min_samples_leaf= 1, min_samples_split= 3, n_estimators= 140)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)
# accuracy_score, confusion_matrix and classification_report

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

acc_rfc_train = accuracy_score(y_train, rfc.predict(X_train))
acc_rfc_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of Random Forest is : {acc_rfc_train}")
print(f"Test accuracy of Random Forest is : {acc_rfc_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

5. Ada Boost Classifier

from sklearn.ensemble import AdaBoostClassifier

# scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `base_estimator` on older releases
ada = AdaBoostClassifier(estimator = dt)

parameters = {
    'n_estimators' : [50, 70, 90, 120, 180, 200],
    'learning_rate' : [0.001, 0.01, 0.1, 1, 10],
    'algorithm' : ['SAMME', 'SAMME.R']
}

search_grid = GridSearchCV(ada, parameters, n_jobs = -1, cv = 5, verbose = 1)
search_grid.fit(X_train, y_train)

Output:

# best parameter and the best score

print(search_grid.best_params_)
print(search_grid.best_score_)

Output:

# best estimator

ada = search_grid.best_estimator_

y_pred = ada.predict(X_test)
# accuracy_score, confusion_matrix and classification_report

acc_ada_train = accuracy_score(y_train, ada.predict(X_train))
acc_ada_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of Ada Boost is : {acc_ada_train}")
print(f"Test accuracy of Ada Boost is : {acc_ada_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

6. Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# accuracy score, confusion matrix, and classification report of gradient boosting classifier

acc_gb = accuracy_score(y_test, gb.predict(X_test))

print(f"Training Accuracy of Gradient Boosting Classifier is {accuracy_score(y_train, gb.predict(X_train))}")
print(f"Test Accuracy of Gradient Boosting Classifier is {acc_gb} \n")

print(f"Confusion Matrix :- \n{confusion_matrix(y_test, gb.predict(X_test))}\n")
print(f"Classification Report :- \n {classification_report(y_test, gb.predict(X_test))}")

Output:

7. Stochastic Gradient Boosting (SGB)

sgb = GradientBoostingClassifier(subsample = 0.90, max_features = 0.70)
sgb.fit(X_train, y_train)

# accuracy score, confusion matrix, and classification report of stochastic gradient boosting classifier

acc_sgb = accuracy_score(y_test, sgb.predict(X_test))

print(f"Training Accuracy of Stochastic Gradient Boosting is {accuracy_score(y_train, sgb.predict(X_train))}")
print(f"Test Accuracy of Stochastic Gradient Boosting is {acc_sgb} \n")

print(f"Confusion Matrix :- \n{confusion_matrix(y_test, sgb.predict(X_test))}\n")
print(f"Classification Report :- \n {classification_report(y_test, sgb.predict(X_test))}")

Output:

8. XGBoost Classifier

from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
# accuracy_score, confusion_matrix and classification_report

acc_xgb_train = accuracy_score(y_train, xgb.predict(X_train))
acc_xgb_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of XgBoost is : {acc_xgb_train}")
print(f"Test accuracy of XgBoost is : {acc_xgb_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

# hyperparameter tuning ('criterion' is not an XGBClassifier parameter, so it is left out of the grid)

grid_params = {"n_estimators": [10, 50, 100, 130],
               "max_depth": range(2, 10, 1)}

search_grid = GridSearchCV(estimator = xgb, param_grid = grid_params, cv = 5, verbose = 3, n_jobs = -1)
search_grid.fit(X_train, y_train)

Output:

# best estimator

xgb = search_grid.best_estimator_

y_pred = xgb.predict(X_test)
# accuracy_score, confusion_matrix and classification_report

acc_xgb_train = accuracy_score(y_train, xgb.predict(X_train))
acc_xgb_test = accuracy_score(y_test, y_pred)

print(f"Training accuracy of XgBoost is : {acc_xgb_train}")
print(f"Test accuracy of XgBoost is : {acc_xgb_test}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

9. Cat Boost Classifier

from catboost import CatBoostClassifier

cat = CatBoostClassifier(iterations=10)
cat.fit(X_train, y_train)

Output:

# accuracy score, confusion matrix, and classification report of cat boost

acc_cat = accuracy_score(y_test, cat.predict(X_test))

print(f"Training Accuracy of Cat Boost Classifier is {accuracy_score(y_train, cat.predict(X_train))}")
print(f"Test Accuracy of Cat Boost Classifier is {acc_cat} \n")

print(f"Confusion Matrix :- \n{confusion_matrix(y_test, cat.predict(X_test))}\n")
print(f"Classification Report :- \n {classification_report(y_test, cat.predict(X_test))}")

Output:

10. Extra Trees Classifier

from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier()
etc.fit(X_train, y_train)

# accuracy score, confusion matrix, and classification report of extra trees classifier

acc_etc = accuracy_score(y_test, etc.predict(X_test))

print(f"Training Accuracy of Extra Trees Classifier is {accuracy_score(y_train, etc.predict(X_train))}")
print(f"Test Accuracy of Extra Trees Classifier is {acc_etc} \n")

print(f"Confusion Matrix :- \n{confusion_matrix(y_test, etc.predict(X_test))}\n")
print(f"Classification Report :- \n {classification_report(y_test, etc.predict(X_test))}")

Output:

11. LGBM Classifier

from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(learning_rate = 1)
lgbm.fit(X_train, y_train)

# accuracy score, confusion matrix, and classification report of lgbm classifier

acc_lgbm = accuracy_score(y_test, lgbm.predict(X_test))

print(f"Training Accuracy of LGBM Classifier is {accuracy_score(y_train, lgbm.predict(X_train))}")
print(f"Test Accuracy of LGBM Classifier is {acc_lgbm} \n")

print(f"{confusion_matrix(y_test, lgbm.predict(X_test))}\n")
print(classification_report(y_test, lgbm.predict(X_test)))

Output:

12. Voting Classifier

from sklearn.ensemble import VotingClassifier

classifiers = [('Support Vector Classifier', svc), ('KNN', knn),  ('Decision Tree', dt), ('Random Forest', rfc),
               ('Ada Boost', ada), ('XGboost', xgb), ('Gradient Boosting Classifier', gb), ('SGB', sgb),
               ('Cat Boost', cat), ('Extra Trees Classifier', etc), ('LGBM', lgbm)]

vc = VotingClassifier(estimators = classifiers)
vc.fit(X_train, y_train)

y_pred = vc.predict(X_test)

Output:

# accuracy_score, confusion_matrix and classification_report

vc_train_acc = accuracy_score(y_train, vc.predict(X_train))
vc_test_acc = accuracy_score(y_test, y_pred)

print(f"Training accuracy of Voting Classifier is : {vc_train_acc}")
print(f"Test accuracy of Voting Classifier is : {vc_test_acc}")

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

Comparing Models

We have already trained and tested our models, and now it's time to compare them so that we can find the most suitable one for insurance fraud detection.

models = pd.DataFrame({
    'Model' : ['SVC', 'KNN', 'Decision Tree', 'Random Forest','Ada Boost', 'Gradient Boost', 'SGB', 'Cat Boost', 'Extra Trees', 'LGBM', 'XgBoost', 'Voting Classifier'],
    'Score' : [acc_svc_test, acc_knn_test, acc_dt_test, acc_rfc_test, acc_ada_test, acc_gb, acc_sgb, acc_cat, acc_etc, acc_lgbm, acc_xgb_test, vc_test_acc]
})


models.sort_values(by = 'Score', ascending = False)

Output:

The Decision Tree Classifier has the highest test accuracy at 79%, while Stochastic Gradient Boosting (SGB) has the lowest at 31%.

Based on this comparison, the Decision Tree Classifier is the most suitable of the tested models for this dataset.

Visualizing the model comparison.

px.bar(data_frame = models, x = 'Score', y = 'Model', color = 'Score', template = 'plotly_dark',
       title = 'Models Comparison')

Output:
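Because fraudulent claims are rare, accuracy alone can hide a model that misses most fraud. As a hedged sketch, the comparison table could also carry precision, recall, and F1 for the fraud class. The synthetic data, the two example models, and the assumption that fraud is encoded as class 1 are all choices made for this illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic, imbalanced stand-in data; class 1 plays the role of "fraud"
X_demo, y_demo = make_classification(n_samples = 1000, n_features = 10, weights = [0.8, 0.2], random_state = 0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, test_size = 0.25,
                                                        stratify = y_demo, random_state = 0)

candidates = {
    'Decision Tree': DecisionTreeClassifier(max_depth = 5, random_state = 0),
    'Random Forest': RandomForestClassifier(n_estimators = 100, random_state = 0),
}

rows = []
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    pred = model.predict(Xd_test)
    rows.append({
        'Model': name,
        'Accuracy': accuracy_score(yd_test, pred),
        'Precision': precision_score(yd_test, pred, zero_division = 0),
        'Recall': recall_score(yd_test, pred, zero_division = 0),
        'F1': f1_score(yd_test, pred, zero_division = 0),
    })

metrics_table = pd.DataFrame(rows).sort_values(by = 'F1', ascending = False)
print(metrics_table)
```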

Conclusion

Insurance fraud is a severe issue that can negatively affect insurance providers and their clients. By locating patterns and abnormalities in the data, machine learning algorithms may be utilized to detect and stop fraud. To guarantee the accuracy and efficiency of the model, it is crucial to select the appropriate method and manage the unbalanced nature of the data.

Keep in mind that we need to be selective when choosing a model, as the choice has a large impact on the quality of the predictions.
