Liver Disease Prediction Using Machine Learning

Liver disease is a significant global health concern, affecting millions of individuals worldwide. Early and accurate detection of liver disease is crucial for effective treatment and prevention of further complications. In recent years, machine learning has emerged as a powerful tool in the field of healthcare, enabling the development of predictive models that can assist in the diagnosis and prognosis of various medical conditions, including liver disease.

Application of Machine Learning in Liver Disease Prediction

Machine learning algorithms have found diverse applications in the field of liver disease prediction. By analyzing patient data and medical records, machine-learning models can identify patterns and risk factors associated with liver diseases. Some of the key applications include:

Machine learning models can detect early signs of liver disease, even before symptoms manifest. This allows healthcare providers to intervene at an early stage and potentially prevent the progression of the disease.
Machine learning algorithms can classify patients based on their risk levels for developing liver diseases. This enables personalized treatment plans and better allocation of medical resources.
By continuously analyzing patient data, machine learning models can monitor disease progression and provide real-time updates to medical professionals.
Machine learning can predict how patients will respond to different treatment options, optimizing treatment strategies and improving patient outcomes.

Benefits of Using Machine Learning for Liver Disease Prediction

The integration of machine learning into liver disease prediction offers several benefits:

Enhanced Accuracy: Machine learning models can process vast amounts of data and identify complex patterns, leading to more accurate predictions compared to traditional methods.
Early Detection: Machine learning algorithms can detect liver diseases in their early stages, allowing for timely medical intervention and potential prevention of severe complications.
Personalized Medicine: By analyzing individual patient data, machine learning enables personalized treatment plans tailored to each patient's unique needs.
Improved Patient Outcomes: Accurate predictions and early detection contribute to better patient outcomes and quality of life.
Cost-Efficiency: Machine learning can optimize healthcare resource utilization by identifying high-risk patients and reducing unnecessary tests and hospitalizations.

Challenges in Liver Disease Prediction Using Machine Learning

Despite the numerous advantages, there are challenges associated with applying machine learning to liver disease prediction:

Access to high-quality and diverse medical data is essential for training robust machine learning models. However, obtaining labeled datasets for liver disease prediction can be challenging.
Liver disease datasets are often imbalanced, with a small number of positive cases compared to negative cases. Imbalanced datasets can lead to biased models, affecting their predictive performance.
Some machine learning models, such as deep learning algorithms, are often considered "black boxes" due to their complex nature. Interpreting these models' predictions can be challenging for medical professionals.
Handling sensitive medical data raises ethical and privacy concerns. Safeguarding patient data while ensuring its accessibility for research is a delicate balance.

Here, we will try to implement this in code.

Data Summary

The number of patients with liver disease has been rising steadily because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs. This dataset was used to evaluate prediction algorithms in an effort to reduce the burden on doctors.

Content

This data set contains 416 liver patient records and 167 non-liver patient records collected from the north-east of Andhra Pradesh, India. The "Dataset" column is a class label that divides the records into two groups: liver patient (liver disease) or not (no disease). The data set contains 441 male patient records and 142 female patient records.

Any patient whose age exceeded 89 is listed as being of age "90".

Columns

Age of the patient
Gender of the patient
Total Bilirubin
Direct Bilirubin
Alkaline Phosphatase
Alanine Aminotransferase (spelled "Alamine_Aminotransferase" in the data file)
Aspartate Aminotransferase
Total Proteins
Albumin
Albumin and Globulin Ratio
Dataset: field used to split the data into two sets (patient with liver disease or no disease)

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data=pd.read_csv("../input/indian_liver_patient.csv")
data.head()

Output:

EDA
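The inspection cell for this step is not shown in the source, so the following is only a minimal sketch of a typical first look at the data. It assumes a small toy frame (`toy`, with made-up values) in place of the real file; `info()` and `isnull().sum()` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the liver dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [65, 62, 58, 72],
    "Gender": ["Female", "Male", "Male", "Male"],
    "Albumin_and_Globulin_Ratio": [0.9, np.nan, 0.89, 1.0],
    "Dataset": [1, 1, 2, 1],
})

toy.info()                               # column dtypes and non-null counts
missing_per_column = toy.isnull().sum()  # number of missing values per column
print(missing_per_column)
```

A column whose non-null count is lower than the number of rows needs attention during preprocessing.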

Output:

The feature named "Albumin_and_Globulin_Ratio" is incomplete: it has fewer non-null values than the 583 rows in the dataset. We therefore need to handle the missing entries during the data preprocessing phase. Next, we assess the balance of the data by plotting the class counts.

# Check the class counts.
# The dataset description lists 416 liver disease patients and 167 non-liver disease patients.
# The classes need to be remapped to liver disease = 1 and no liver disease = 0 (the usual convention).
count_classes = data['Dataset'].value_counts().sort_index()
count_classes.plot(kind='bar')
plt.title("Liver disease classes histogram")
plt.xlabel("Dataset")
plt.ylabel("Frequency")

Output:

In order to simplify the class labels, we need to reassign them. For patients without liver disease, we will assign the label 0, and for patients with liver disease, we will assign the label 1.
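The remapping cell is not shown in the source. A minimal sketch of how it could look, assuming the raw file encodes liver disease as 1 and no disease as 2 (an assumption about the file, not something the article states), and using a toy frame instead of the real data:

```python
import pandas as pd

# Assumed raw encoding: 1 = liver disease, 2 = no liver disease.
toy = pd.DataFrame({"Dataset": [1, 2, 1, 1, 2]})

# Map to the conventional binary labels: 1 = disease, 0 = no disease.
toy["Dataset"] = toy["Dataset"].map({1: 1, 2: 0})
print(toy["Dataset"].value_counts().sort_index())
```

Using `map` with an explicit dictionary makes the intended encoding visible and turns any unexpected label into NaN, which is easy to spot.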

Output:

At this point, I will replace the missing values with zeros.
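The cell that fills the missing values is not shown in the source. A minimal sketch of the step described, on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Albumin_and_Globulin_Ratio": [0.9, np.nan, 0.89, np.nan]})

# Replace missing entries with 0, as the text describes.
toy_filled = toy.fillna(0)
print(toy_filled)
```

Filling with zero is the simplest choice. Filling with the column median, for example `toy.fillna(toy.median())`, is a common alternative when zero is not a plausible value for the feature.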

data_features=data.drop(['Dataset'],axis=1)
data_num_features=data.drop(['Gender','Dataset'],axis=1)
data_num_features.head()

Output:

data_num_features.describe() # check whether feature scaling has to be performed or not 

Output:

The table shows that the features span very different ranges, so feature scaling is necessary.

from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
cols=list(data_num_features.columns)
data_features_scaled=data_features.copy() # copy so the unscaled features are not modified
data_features_scaled[cols]=scaler.fit_transform(data_features[cols])
data_features_scaled.head()

Output:

To encode the categorical data as numerical values, we use the pandas function "get_dummies". Since only one column requires encoding, this function is sufficient for the task.

data_exp=pd.get_dummies(data_features_scaled)
data_exp.head()

Output:

To examine the relationships between the features, we compute their pairwise correlations with the "corr()" function and draw them as a heatmap. This gives a visual representation of the correlations between the features.

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 10))
plt.title('Pearson Correlation of liver disease Features')
# Draw the heatmap using seaborn
sns.heatmap(data_num_features.astype(float).corr(),linewidths=0.25,vmax=1.0, square=True, cmap="YlGnBu", linecolor='black',annot=True)

Output:

The heatmap shows a strong correlation between three pairs of features: "Direct_Bilirubin" and "Total_Bilirubin," "Alamine Aminotransferase" and "Aspartate Aminotransferase," and "Total Protiens" and "Albumin."
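Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The following is a sketch on synthetic data; the column names `a`, `b`, `c` and the 0.8 threshold are illustrative assumptions, not values from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                    # independent noise
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs > 0.8]  # 0.8 is an arbitrary illustrative threshold
print(strong_pairs)
```

On the real data, the same pattern could be applied to `data_num_features.corr()`.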

We now fit a Support Vector Classifier (SVC) to the dataset without any sampling techniques, solely to evaluate its baseline performance.

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report

import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix `cm` with one tick per entry in `classes`.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to per-row (per true class) fractions before plotting
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Annotate each cell, using white text on dark cells for contrast
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

X=data_exp
y=data['Dataset'] 
X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=0.3,random_state=0)

Output:

clf=SVC(random_state=0,kernel='rbf')
clf.fit(X_train,Y_train)
predictions=clf.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output:

The confusion matrix shows no true negatives: the classifier predicts liver disease for every patient in the test set. This is most likely a consequence of the imbalanced dataset, in which liver disease cases dominate. We need to tune the model.
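To make the reading of the confusion matrix concrete, here is a small sketch with hypothetical labels (not the article's test set) for a model that predicts class 1 for every patient. It shows how recall and specificity follow from the four cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: a model that predicts class 1 for every patient.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.ones_like(y_true)

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
recall = tp / (tp + fn)          # share of actual positives that were found
specificity = tn / (tn + fp)     # share of actual negatives that were found
print(tn, fp, fn, tp, recall, specificity)
```

Such a model reaches perfect recall while misclassifying every negative case, which is why recall alone is not enough to judge it.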

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output:

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output:
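The ROC computation above passes hard 0/1 predictions to `roc_curve`, which yields at most three points on the curve. A minimal sketch of the alternative, passing continuous scores from `SVC.decision_function`, is shown here on synthetic data from `make_classification`; the names `X_demo`, `svc`, and `scores` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (not the liver dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.3, 0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0, stratify=y_demo)

svc = SVC(kernel="rbf", random_state=0).fit(X_tr, y_tr)

hard_labels = svc.predict(X_te)        # 0/1 predictions
scores = svc.decision_function(X_te)   # signed distance to the decision boundary

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)

print("ROC points from hard labels:", len(fpr_hard))
print("ROC points from scores:", len(fpr_soft))
print("AUC from scores:", roc_auc_score(y_te, scores))
```

With scores, the curve reflects every possible decision threshold rather than the single one baked into `predict`.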

The ROC curve and confusion matrix show that the number of false positives must be reduced, since every patient without liver disease is currently misclassified. To optimize the model, we use the GridSearchCV method.

# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score,accuracy_score
#from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Initialize the classifier
clf = SVC(random_state=0,kernel='rbf')

#  Create the parameters list you wish to tune, using a dictionary if needed.
#  parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {'C': [10,50,100,200],'kernel':['poly','rbf','linear','sigmoid']}

# Make an fbeta_score scoring object using make_scorer() (beta=0.5 weights precision more heavily than recall)
scorer = make_scorer(fbeta_score,beta=0.5)

# Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf,parameters,scoring=scorer,n_jobs=-1)

# Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(X_train,Y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train,Y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print ("Unoptimized model\n------")
print ("Accuracy score on testing data: {:.4f}".format(accuracy_score(Y_test, predictions)))
print ("F-score on testing data: {:.4f}".format(fbeta_score(Y_test, predictions, beta = 2)))
print ("\nOptimized Model\n------")
print ("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(Y_test, best_predictions)))
print ("Final F-score on the testing data: {:.4f}".format(fbeta_score(Y_test, best_predictions, beta = 2)))
print (best_clf)

Output:
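The F-beta score used for scoring above combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below checks this formula against `fbeta_score` on hypothetical labels; `manual_fbeta` is a helper introduced only for illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels chosen so that precision and recall differ.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

def manual_fbeta(y_true, y_pred, beta):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for beta in (0.5, 2):
    print(beta, manual_fbeta(y_true, y_pred, beta),
          fbeta_score(y_true, y_pred, beta=beta))
```

Here precision (2/3) is higher than recall (1/2), so the score with beta=0.5, which favors precision, comes out higher than the score with beta=2, which favors recall.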

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,best_predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output:

Now that the model produces true negatives, the ROC curve is expected to improve.

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, best_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output:

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output:

The AUC improves to 0.58 compared with the unoptimized model. However, this is still only slightly better than the 0.5 of a random classifier, so the model cannot be considered effective. This could be attributed to the unbalanced nature of the dataset, which limits the improvement in AUC. Additionally, the relatively small size of the dataset may also contribute to the limitations of the model's performance.

I will apply the SMOTE oversampling technique to the training set to balance the classes and increase the amount of training data.

from imblearn.over_sampling import SMOTE
oversampler=SMOTE(random_state=0)
os_features,os_labels=oversampler.fit_resample(X_train,Y_train) # fit_sample was removed in newer imbalanced-learn releases

Output:
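The core idea behind SMOTE is to create new minority-class rows by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea, not the imbalanced-learn implementation; the function `smote_like` and the toy array `X_min` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbours = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_minority))          # pick a minority sample
        j = neighbours[i, rng.integers(1, k + 1)]  # pick one of its k neighbours
        gap = rng.random()                         # position along the segment
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))   # hypothetical minority-class features
X_new = smote_like(X_min, n_new=30)
print(X_new.shape)
```

Because every synthetic row lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class. Oversampling only the training split, as done above, keeps the test set free of synthetic rows.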

clf=SVC(random_state=0,kernel='rbf') # unoptimized Model
clf.fit(os_features,os_labels)

Output:

# perform predictions on test set
predictions=clf.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output:

The recall metric shows a low value, indicating the need to optimize the model for improvement.

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output:

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output:

# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score,accuracy_score
#from sklearn.ensemble import RandomForestClassifier
# Initialize the classifier
clf = SVC(random_state=0,kernel='rbf')

#  Create the parameters list you wish to tune, using a dictionary if needed.
#  parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {'C': [10,50,100,200],'kernel':['poly','rbf','linear','sigmoid']}

# Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score,beta=2)

# Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf,parameters,scoring=scorer,n_jobs=-1)

#  Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(os_features,os_labels)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(os_features,os_labels)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print ("Unoptimized model\n------")
print ("Accuracy score on testing data: {:.4f}".format(accuracy_score(Y_test, predictions)))
print ("F-score on testing data: {:.4f}".format(fbeta_score(Y_test, predictions, beta = 2)))
print ("\nOptimized Model\n------")
print ("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(Y_test, best_predictions)))
print ("Final F-score on the testing data: {:.4f}".format(fbeta_score(Y_test, best_predictions, beta = 2)))
print (best_clf)

Output:

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,best_predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output:

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, best_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output:

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output:

Despite the use of SMOTE, the performance of SVC is still not satisfactory. Both the recall and the AUC score are approximately 0.67, which falls short of the desired level. Therefore, we decided to explore the RandomForestClassifier as an alternative.

from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state=0) # unoptimized Model
clf.fit(os_features,os_labels)

Output:

# perform predictions on test set
predictions=clf.predict(X_test)

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output:

The recall metric has shown improvement compared to SVC after using the RandomForestClassifier. However, the model still requires further tuning to optimize its performance.

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output:

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output:

# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score,accuracy_score
from sklearn.ensemble import RandomForestClassifier
# Initialize the classifier
clf = RandomForestClassifier(random_state=0)

# Create the parameters list you wish to tune, using a dictionary if needed.
# parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {'n_estimators': [100,250,500], 'max_depth': [3,6,9]}

# Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score,beta=2)

# Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf,parameters,scoring=scorer,n_jobs=-1)

# Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(os_features,os_labels)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(os_features,os_labels)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print ("Unoptimized model\n------")
print ("Accuracy score on testing data: {:.4f}".format(accuracy_score(Y_test, predictions)))
print ("F-score on testing data: {:.4f}".format(fbeta_score(Y_test, predictions, beta = 2)))
print ("\nOptimized Model\n------")
print ("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(Y_test, best_predictions)))
print ("Final F-score on the testing data: {:.4f}".format(fbeta_score(Y_test, best_predictions, beta = 2)))
print (best_clf)

Output:

# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,best_predictions)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

Output:

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, best_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

Output:

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

Output:

After applying GridSearchCV to optimize the RandomForestClassifier, the model achieves a recall of 0.76 and an AUC of 0.69.
Considering these results, the RandomForestClassifier is the best of the models tried here for predicting liver disease in a patient, because it combines many decision trees that each consider different subsets of the features.
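A fitted random forest exposes how much each feature contributed to its splits through the `feature_importances_` attribute. The sketch below illustrates this on synthetic data from `make_classification`; the feature names are placeholders, and on the real data the columns of `X` would be used instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; feature names are placeholders.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, normalized so that they sum to 1.
importances = pd.Series(forest.feature_importances_, index=names)
print(importances.sort_values(ascending=False))
```

Inspecting these values is one way to check which measurements the model actually relies on.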

Future Aspects of Liver Disease Prediction Using Machine Learning

As machine learning continues to evolve, several future aspects hold promise for liver disease prediction:

Integrating machine learning algorithms with EHR systems can enhance real-time prediction capabilities and enable continuous patient monitoring.
Combining multiple machine learning models through ensemble methods can improve predictive accuracy and robustness.
Research on explainable AI techniques can provide insights into the decision-making process of complex machine-learning models, making them more transparent and interpretable.
Integrating various types of omics data, such as genomics, proteomics, and metabolomics, can enhance the predictive power of machine-learning models for liver disease.
Developing adaptive machine learning models that can continuously learn from new data can improve prediction accuracy over time.

Conclusion

Machine learning has emerged as a valuable tool for liver disease prediction, offering significant benefits in terms of accuracy, early detection, and personalized medicine. However, challenges such as data availability, model interpretability, and ethical considerations need to be addressed. The future holds immense potential for further advancements in machine learning techniques, enabling more accurate and efficient liver disease prediction. By harnessing the power of machine learning, we can improve patient outcomes and make significant strides in combating liver diseases worldwide.
