Probabilistic Model in Machine Learning

A probabilistic model in machine learning is a mathematical representation of a real-world process that incorporates uncertain or random variables. The goal of probabilistic modeling is to estimate the probabilities of the possible outcomes of a system based on data or prior knowledge.

Probabilistic models are used in a variety of machine learning tasks such as classification, regression, clustering, and dimensionality reduction. Some popular probabilistic models include:

Gaussian Mixture Models (GMMs)
Hidden Markov Models (HMMs)
Bayesian Networks
Markov Random Fields (MRFs)

Probabilistic models allow for the expression of uncertainty, making them particularly well-suited for real-world applications where data is often noisy or incomplete. Additionally, these models can often be updated as new data becomes available, which is useful in many dynamic and evolving systems.
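To make the idea of updating a model as new data arrives concrete, here is a minimal sketch using a Beta-Bernoulli model. The uniform Beta(1, 1) prior, the helper names, and the toy observations are assumptions chosen purely for illustration; they are not part of the OSIC example that follows.

```python
# Minimal sketch of Bayesian updating with a Beta-Bernoulli model.
# The prior Beta(1, 1) and the 0/1 observations are illustrative assumptions.

def update_beta(a, b, observations):
    """Return the posterior Beta(a, b) parameters after 0/1 observations."""
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Start from a uniform prior over the unknown success probability.
a, b = 1, 1

# First batch of noisy binary outcomes.
a, b = update_beta(a, b, [1, 0, 1, 1])
first_estimate = beta_mean(a, b)

# A second batch arrives later: the previous posterior acts as the new prior.
a, b = update_beta(a, b, [0, 0, 1, 0, 0, 0])
second_estimate = beta_mean(a, b)

print(f"estimate after batch 1: {first_estimate:.3f}")
print(f"estimate after batch 2: {second_estimate:.3f}")
```

The estimate is never a single fixed number: it is the mean of a distribution that narrows and shifts as more observations come in.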

For better understanding, we will implement a probabilistic model for the OSIC Pulmonary Fibrosis problem on Kaggle.

Problem Statement: "In this competition, you'll predict a patient's severity of decline in lung function based on a CT scan of their lungs. You'll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input."

Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

EDA

train_Datafame = pd.read_csv('train.csv')
test_Dataframe = pd.read_csv('test.csv')

Let's see this decline in lung function for three different patients.

def chart_builder(patient_id, ax):
    d = train_Datafame[train_Datafame['Patient'] == patient_id]
    x = d['Weeks']
    y = d['FVC']
    ax.set_title(patient_id)
    # Recent seaborn versions require x and y as keyword arguments
    sns.regplot(x=x, y=y, ax=ax, ci=None, line_kws={'color': 'red'})

f, axes = plt.subplots(1, 3, figsize=(15, 5))
chart_builder('ID00007637202177411956430', axes[0])
chart_builder('ID00009637202177434476278', axes[1])
chart_builder('ID00010637202177584971671', axes[2])

Output:

Lung capacity is clearly declining in all three cases. Yet, as we can see, the rate of decline varies greatly from patient to patient.
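The red line drawn by regplot is an ordinary least-squares fit. As a rough sketch of what that fit computes for a single patient, the snippet below estimates the intercept and slope by hand. The (week, FVC) pairs and the helper name fit_line are made up for illustration and are not taken from train.csv.

```python
# Sketch: ordinary least-squares intercept and slope for one patient's visits.
# The (week, FVC) pairs below are made-up values, not rows from train.csv.

def fit_line(weeks, fvc):
    """Return (intercept, slope) of the least-squares line fvc ~ weeks."""
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(fvc) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, fvc))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

toy_weeks = [0, 10, 20, 30]
toy_fvc = [3000, 2900, 2800, 2700]

intercept, slope = fit_line(toy_weeks, toy_fvc)
print(f"intercept={intercept:.1f}, slope={slope:.1f} per week")
```

A negative slope corresponds to declining lung function; the point of the model developed below is to estimate such slopes with uncertainty rather than as single numbers.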

Postulate the model

It's time to get creative. This tabular dataset could be modeled in a variety of ways. Here are a few tools we might employ:

Hidden Markov Models
Gaussian Processes
Variational Autoencoders

Since we are still learning, we will start with the most basic model, a linear regression. We will, however, make it a little more sophisticated. Our assumptions are as follows:

The linear regression parameters (α and β) are specific to each patient. By inferring these parameters, we can predict the line for each patient and, as a result, their FVC in any week.
These parameters are not entirely independent of one another, though. All patients are governed by a common underlying model.
Both α and β are normally distributed, each with its own mean and variance.
These means and variances are determined by the baseline measurement (baseline week, FVC, and Percent), as well as the patient's age, sex, and smoking status.
In a later version, we will go further and assume that the parameters are also functions of latent variables extracted from the CT scans.
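The assumptions above describe a generative story: every patient has an individual intercept and slope, but these are drawn from shared population-level distributions. A small simulation can make this tangible. This is only a sketch; the function name simulate_patient and every numeric value in it are illustrative assumptions, not quantities estimated from the data.

```python
import random

# Sketch of the generative story: each patient gets an individual intercept
# and slope, but both are drawn from shared population-level distributions.
# All numeric values are illustrative assumptions, not fitted values.

rng = random.Random(0)

def simulate_patient(baseline_fvc, weeks, slope_mean=-5.0, slope_sd=2.0,
                     intercept_sd=50.0, noise_sd=100.0):
    """Draw one patient's (alpha, beta) and noisy FVC readings."""
    alpha = rng.gauss(baseline_fvc, intercept_sd)  # patient-specific intercept
    beta = rng.gauss(slope_mean, slope_sd)         # patient-specific slope
    fvc = [rng.gauss(alpha + beta * week, noise_sd) for week in weeks]
    return alpha, beta, fvc

weeks = list(range(0, 60, 10))
for baseline in (2500, 3200):
    alpha, beta, fvc = simulate_patient(baseline, weeks)
    print(f"alpha={alpha:.0f}, beta={beta:.2f}, FVC at week 0={fvc[0]:.0f}")
```

Inference runs this story in reverse: given the observed FVC readings, it recovers plausible values for each patient's alpha and beta and for the shared parameters.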

Our model is represented by the Bayesian network shown below:

The logic behind this model:

FVC_ij is the observed variable we are interested in. At any week j, -12 ≤ j ≤ 133, the FVC of patient i is assumed to be normally distributed with mean α_i + β_i j and variance σ_i² (the confidence asked for by the competition).
α_i, the intercept of the decline function for each patient i, logically is a function of FVC_i^b (the baseline measurement for patient i) and ω_i^b (the week when the baseline FVC was measured). We assume it is normally distributed with mean FVC_i^b + ω_i^b β^int and variance σ^int.
β_i, the slope of the decline function for each patient i, logically is a function of A_i (the patient's age), sex, and smoking status. We assume it is normally distributed with mean α^s + A_i β_c^s and variance σ^s. We considered six different β_c^s: for women who currently smoke, men who currently smoke, women ex-smokers, men ex-smokers, women who never smoked, and men who never smoked.
For now, to simplify, we left the Percent random variable out. We will include it in a second version.
Finally, we know nothing about the priors β^int, α^s, σ_i, σ^int, and σ^s. We will model the first two as normals and the last three as half-normals.

Mathematical Model Specification:
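Based on the description above, the specification can be written out roughly as follows. This is a reconstruction from the text and from the PyMC3 code later in this article, not the original figure. The dispersion parameters are written as the text names them; in the code they are passed as standard deviations through the sigma argument, and a single shared sigma is used for the likelihood instead of one σ_i per patient. The prior scales (100 and 200) and the normal prior on β_c^s are taken from that code.

```latex
\begin{aligned}
FVC_{ij} &\sim \mathcal{N}\left(\alpha_i + \beta_i\, j,\ \sigma_i^2\right),
  \qquad -12 \le j \le 133 \\
\alpha_i &\sim \mathcal{N}\left(FVC_i^{b} + \omega_i^{b}\,\beta^{int},\ \sigma^{int}\right) \\
\beta_i &\sim \mathcal{N}\left(\alpha^{s} + A_i\,\beta_c^{s},\ \sigma^{s}\right),
  \qquad c \in \{0, \dots, 5\} \\
\beta^{int},\ \alpha^{s},\ \beta_c^{s} &\sim \mathcal{N}(0,\ 100) \\
\sigma^{int},\ \sigma^{s} &\sim \mathrm{HalfNormal}(100) \\
\sigma_i &\sim \mathrm{HalfNormal}(200)
\end{aligned}
```

Here c indexes the six sex and smoking-status classes created in the pre-processing step.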

Simple Data Pre-processing

# Importing required library
import pymc3 as pm
import theano
import arviz as az
from sklearn import preprocessing

# Pre-processing that is quite basic: adding the patient class
def patient_class(r):
    if r['Sex'] == 'Male':
        if r['SmokingStatus'] == 'Currently smokes':
            return 0
        elif r['SmokingStatus'] == 'Ex-smoker':
            return 1
        elif r['SmokingStatus'] == 'Never smoked':
            return 2
    else:
        if r['SmokingStatus'] == 'Currently smokes':
            return 3
        elif r['SmokingStatus'] == 'Ex-smoker':
            return 4
        elif r['SmokingStatus'] == 'Never smoked':
            return 5

train_Datafame['Class'] = train_Datafame.apply(patient_class, axis=1)

# Very simple pre-processing: adding the baseline week and baseline FVC
auxi = train_Datafame[['Patient', 'Weeks']].groupby('Patient')\
    .min().reset_index()
auxi = pd.merge(auxi, train_Datafame[['Patient', 'Weeks', 'FVC']], how='left', 
               on=['Patient', 'Weeks'])
auxi = auxi.groupby('Patient').mean().reset_index()
auxi['Weeks'] = auxi['Weeks'].astype(int)
auxi['FVC'] = auxi['FVC'].astype(int)
train_Datafame = pd.merge(train_Datafame, auxi, how='left', on='Patient', suffixes=('', '_base'))

# Very simple pre-processing: creating patient indexes
label_encoder = preprocessing.LabelEncoder()
train_Datafame['PatientID'] = label_encoder.fit_transform(train_Datafame['Patient'])

patients = train_Datafame[['Patient', 'PatientID', 'Age', 'Class', 'Weeks_base', 'FVC_base']].drop_duplicates()
data_fvc = train_Datafame[['Patient', 'PatientID', 'Weeks', 'FVC']]

patients.head()

Output:

Modeling in PyMC3

b_FVC = patients['FVC_base'].values
b_w = patients['Weeks_base'].values
age = patients['Age'].values
class_patient = patients['Class'].values

t = data_fvc['Weeks'].values
obs_FVC = data_fvc['FVC'].values
id_patient = data_fvc['PatientID'].values

with pm.Model() as hierarchical_model:
    # Hyperpriors for Alpha
    int_beta = pm.Normal('int_beta', 0, sigma=100)
    int_sigma = pm.HalfNormal('int_sigma', 100)
    
    # Alpha
    alpha_mu = b_FVC + int_beta * b_w
    alpha = pm.Normal('alpha', mu=alpha_mu, sigma=int_sigma, 
                      shape=train_Datafame['Patient'].nunique())
    
    # Hyperpriors for Beta
    s_sigma = pm.HalfNormal('s_sigma', 100)
    s_aplha = pm.Normal('s_aplha', 0, sigma=100)
    cs_beta = pm.Normal('cs_beta', 0, sigma=100, shape=6)
    
    # Beta
    beta_mu = s_aplha + age * cs_beta[class_patient]
    beta = pm.Normal('beta', mu=beta_mu, sigma=s_sigma,
                     shape=train_Datafame['Patient'].nunique())
    
    # Model variance
    sigma = pm.HalfNormal('sigma', 200)
    
    # Model estimate
    est_FVC = alpha[id_patient] + beta[id_patient] * t
    
    # Data likelihood
    like_FVC = pm.Normal('like_FVC', mu=est_FVC,
                          sigma=sigma, observed=obs_FVC)

Fit the model

# Inference button (TM)!
with hierarchical_model:
    trace = pm.sample(2000, tune=2000, target_accept=.9)

Output:

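We have just drawn 4000 posterior samples, each one a distinct set of parameter values (a distinct model) that accounts for the data. The total of 4000 assumes two chains of 2000 draws each; with more chains the count grows accordingly.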
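To give a feel for what sampling "distinct models" means, here is a minimal sketch of a random-walk Metropolis sampler for the mean of a normal model. Note that this is not the NUTS sampler that PyMC3 uses by default; it is a much simpler technique swapped in for illustration. The data, the prior, the step size, and all function names are assumptions.

```python
import math
import random

# Sketch: random-walk Metropolis sampling of the mean of a normal model.
# This is NOT the NUTS sampler PyMC3 uses; it only illustrates the idea of
# drawing many plausible parameter values instead of one point estimate.
# Data, prior, and step size are illustrative assumptions.

rng = random.Random(42)
data = [2.1, 1.9, 2.4, 2.0, 2.2]
NOISE_SD = 0.5    # assumed known observation noise
PRIOR_SD = 10.0   # wide normal prior centred at 0

def log_posterior(mu):
    """Unnormalised log posterior of mu given the data."""
    log_prior = -0.5 * (mu / PRIOR_SD) ** 2
    log_like = sum(-0.5 * ((x - mu) / NOISE_SD) ** 2 for x in data)
    return log_prior + log_like

def metropolis(n_draws, n_tune, step=0.5):
    """Return n_draws samples of mu after discarding n_tune warm-up steps."""
    mu = 0.0
    current = log_posterior(mu)
    draws = []
    for i in range(n_tune + n_draws):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        if i >= n_tune:
            draws.append(mu)
    return draws

draws = metropolis(n_draws=2000, n_tune=500)
posterior_mean = sum(draws) / len(draws)
print(f"{len(draws)} draws, posterior mean is about {posterior_mean:.2f}")
```

Each element of draws plays the same role as one row of the PyMC3 trace: a parameter value that is plausible given the data and the prior.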

Check the model

Let's have a look at the generative model we developed.

with hierarchical_model:
    pm.traceplot(trace);

Output:

It appears that our model has learned unique alphas and betas for each patient.
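Besides looking at the trace plot, a common numerical check is whether independent chains agree with each other. The sketch below computes the basic Gelman-Rubin R-hat statistic on simulated chains; ArviZ reports a more refined version of this statistic in its summaries. The function name r_hat and the toy chains are assumptions for illustration only.

```python
import random
import statistics

# Sketch of the basic Gelman-Rubin R-hat statistic for one parameter.
# Values close to 1 suggest the chains agree; the toy chains are simulated.

def r_hat(chains):
    """Compute R-hat from a list of equally long chains of draws."""
    n = len(chains[0])                                   # draws per chain
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

rng = random.Random(7)
good_chains = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(2)]
# Shift the second chain to mimic chains stuck in different regions.
bad_chains = [good_chains[0], [x + 3.0 for x in good_chains[1]]]

good_value = r_hat(good_chains)
bad_value = r_hat(bad_chains)
print(f"agreeing chains:    R-hat = {good_value:.3f}")
print(f"disagreeing chains: R-hat = {bad_value:.3f}")
```

An R-hat well above 1 for any parameter would be a reason to sample longer or to revisit the model before using its predictions.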

Checking some patients

ArviZ, a powerful visualization library, is installed alongside PyMC3. Here, however, we use Seaborn and Matplotlib.

def chart_builder(patient_id, ax):
    d = train_Datafame[train_Datafame['Patient'] == patient_id]
    x = d['Weeks']
    y = d['FVC']
    ax.set_title(patient_id)
    # Recent seaborn versions require x and y as keyword arguments
    sns.regplot(x=x, y=y, ax=ax, ci=None, line_kws={'color': 'red'})

    x2 = np.arange(-12, 133, step=0.1)
    
    pid = patients[patients['Patient'] == patient_id]['PatientID'].values[0]
    for sample in range(100):
        alpha = trace['alpha'][sample, pid]
        beta = trace['beta'][sample, pid]
        sigma = trace['sigma'][sample]
        y2 = alpha + beta * x2
        ax.plot(x2, y2, linewidth=0.1, color='green')
        y2 = alpha + beta * x2 + sigma
        ax.plot(x2, y2, linewidth=0.1, color='yellow')
        y2 = alpha + beta * x2 - sigma
        ax.plot(x2, y2, linewidth=0.1, color='yellow')

f, axes = plt.subplots(1, 3, figsize=(15, 5))
chart_builder('ID00007637202177411956430', axes[0])
chart_builder('ID00009637202177434476278', axes[1])
chart_builder('ID00010637202177584971671', axes[2])

Output:

Here we plot 100 of the 4000 sampled models for each patient. The fitted regression lines are shown in green, and the lines one standard deviation above and below them are shown in yellow. Let's put it all together!

(Iterate and) Use the model

Let's use our generative model now.

Simple Data Pre-processing

# Very simple pre-processing: adding patient class
def patient_class(row):
    if row['Sex'] == 'Male':
        if row['SmokingStatus'] == 'Currently smokes':
            return 0
        elif row['SmokingStatus'] == 'Ex-smoker':
            return 1
        elif row['SmokingStatus'] == 'Never smoked':
            return 2
    else:
        if row['SmokingStatus'] == 'Currently smokes':
            return 3
        elif row['SmokingStatus'] == 'Ex-smoker':
            return 4
        elif row['SmokingStatus'] == 'Never smoked':
            return 5

test_Dataframe['Class'] = test_Dataframe.apply(patient_class, axis=1)
test_Dataframe = test_Dataframe.rename(columns={'FVC': 'FVC_base', 'Weeks': 'Weeks_base'})
test_Dataframe.head()

Output:

# prepare submission dataset
submission = []
for i, patient in enumerate(test_Dataframe['Patient'].unique()):
    df = pd.DataFrame(columns=['Patient', 'Weeks', 'FVC'])
    df['Weeks'] = np.arange(-12, 134)
    df['Patient'] = patient
    df['PatientID'] = i
    df['FVC'] = 0
    submission.append(df)
    
submission = pd.concat(submission).reset_index(drop=True)
submission.head()

Output:

Posterior Prediction

PyMC3 offers two ways to make predictions on held-out data that the model has not yet seen. The first uses theano.shared variables and takes only 4-5 lines of code. We tested it and it worked flawlessly, but we will use the second approach here because it is easier to understand.

Although it is a little longer than 4-5 lines of code, we find it far more instructive. The PyMC3 developers explain the concept in this response from Luciano Paz. Using the distributions of the parameters learned by the first model as priors, we will build a second model to predict FVCs on hold-out data. This follows the Bayesian methodology of continuously updating our models as we gather new data.
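The idea of reusing a posterior as the next prior can be checked on a model simple enough to solve by hand. The sketch below uses a conjugate normal model with known noise: updating in two steps gives the same answer as updating once on all the data. The function name update_normal and all numbers are illustrative assumptions. In the PyMC3 code that follows, the posteriors are only summarised by their mean and standard deviation, so the reuse there is an approximation and not an exact update like this one.

```python
# Sketch: for a conjugate normal model with known noise sd, updating in two
# steps (first posterior reused as second prior) matches a single update on
# all the data. All numeric values are illustrative assumptions.

def update_normal(prior_mean, prior_sd, data, noise_sd):
    """Posterior (mean, sd) of a normal mean with known noise_sd."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = len(data) / noise_sd ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + sum(data) / noise_sd ** 2) / post_prec
    return post_mean, post_prec ** -0.5

batch_1 = [-4.0, -6.0, -5.5]
batch_2 = [-3.0, -4.5]
NOISE_SD = 2.0

# Two-step update: the posterior after batch_1 becomes the prior for batch_2.
mean_1, sd_1 = update_normal(0.0, 100.0, batch_1, NOISE_SD)
mean_2, sd_2 = update_normal(mean_1, sd_1, batch_2, NOISE_SD)

# One-step update on all the data at once.
mean_all, sd_all = update_normal(0.0, 100.0, batch_1 + batch_2, NOISE_SD)

print(f"two-step: mean={mean_2:.4f}, sd={sd_2:.4f}")
print(f"one-step: mean={mean_all:.4f}, sd={sd_all:.4f}")
```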

b_FVC = test_Dataframe['FVC_base'].values
b_w = test_Dataframe['Weeks_base'].values
age = test_Dataframe['Age'].values
class_patient = test_Dataframe['Class'].values
t = submission['Weeks'].values
id_patient = submission['PatientID'].values
            
with pm.Model() as new_model:
    # Hyperpriors for Alpha
    int_beta = pm.Normal('int_beta', 
                         trace['int_beta'].mean(), 
                         sigma=trace['int_beta'].std())
    int_sigma = pm.TruncatedNormal('int_sigma', 
                                   trace['int_sigma'].mean(),
                                   sigma=trace['int_sigma'].std(),
                                   lower=0)
    
    # Alpha
    alpha_mu = b_FVC + int_beta * b_w
    alpha = pm.Normal('alpha', mu=alpha_mu, sigma=int_sigma, 
                      shape=test_Dataframe['Patient'].nunique())
    
    # Hyperpriors for Beta
    s_sigma = pm.TruncatedNormal('s_sigma', 
                                 trace['s_sigma'].mean(),
                                 sigma=trace['s_sigma'].std(),
                                 lower=0)
    s_aplha = pm.Normal('s_aplha', 
                        trace['s_aplha'].mean(), 
                        sigma=trace['s_aplha'].std())
    cov = np.zeros((6, 6))
    np.fill_diagonal(cov, trace['cs_beta'].var(axis=0))
    cs_beta = pm.MvNormal('cs_beta',
                          mu=trace['cs_beta'].mean(axis=0),
                          cov=cov,
                          shape=6)
    
    # Beta
    beta_mu = s_aplha + age * cs_beta[class_patient]
    beta = pm.Normal('beta', mu=beta_mu, sigma=s_sigma,
                     shape=test_Dataframe['Patient'].nunique())
    
    # Model variance
    sigma = pm.TruncatedNormal('sigma', 
                               trace['sigma'].mean(),
                               sigma=trace['sigma'].std(),
                               lower=0)
    
    # Model estimate
    # There are two ways to compute FVC here: deterministic or stochastic.
    # Deterministic: FVC is the regression line itself, and sigma is computed
    # afterwards as the std dev over the 4000 sampled models. This results in
    # more confidence (lower sigmas).
    # Stochastic: FVC is drawn around the line (see the commented-out code
    # below), which results in uneven lines. The mean FVC values are roughly
    # the same, but the confidence is much lower (sigmas about 2x higher).
    # Try both, beginning with the deterministic assumption.

    est_FVC = pm.Deterministic('est_FVC', alpha[id_patient] + beta[id_patient] * t)
    
    # sigma = pm.HalfNormal('sigma', 200)
    # FVC_like = pm.Normal('FVC_like', mu=alpha[id_patient] + beta[id_patient] * t, 
    #                      sigma=sigma,
    #                      shape=submission.shape[0])

with new_model:
    trace2 = pm.sample(2000, tune=2000, target_accept=.9)

Output:

Let's go! 4000 forecasts for every point!
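The difference between the deterministic and the stochastic estimate mentioned in the code comments can be illustrated with a small simulation. The deterministic spread only reflects uncertainty in alpha and beta, while the stochastic version also adds observation noise. The posterior draws below are simulated stand-ins with made-up values, not draws from the fitted trace.

```python
import random
import statistics

# Sketch: spread of the regression line alone versus spread of a full
# predictive draw that also adds observation noise. The "posterior draws"
# are simulated stand-ins, not values taken from the fitted trace.

rng = random.Random(1)
week = 30
n_draws = 4000

line_values = []        # deterministic: alpha + beta * week
predictive_values = []  # stochastic: line value plus Normal(0, sigma) noise
for _ in range(n_draws):
    alpha = rng.gauss(3000.0, 40.0)
    beta = rng.gauss(-5.0, 1.0)
    sigma = abs(rng.gauss(150.0, 10.0))
    mean_fvc = alpha + beta * week
    line_values.append(mean_fvc)
    predictive_values.append(rng.gauss(mean_fvc, sigma))

sd_line = statistics.stdev(line_values)
sd_pred = statistics.stdev(predictive_values)
print(f"mean FVC: line={statistics.fmean(line_values):.0f}, "
      f"predictive={statistics.fmean(predictive_values):.0f}")
print(f"std dev:  line={sd_line:.0f}, predictive={sd_pred:.0f}")
```

Both versions agree on the mean, but the predictive spread is wider because it includes the noise term. Which spread is more appropriate as the reported confidence depends on how the predictions are scored.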

Generating Final Predictions

# The Deterministic variable was registered in new_model as 'est_FVC'
preds = pd.DataFrame(data=trace2['est_FVC'].T)
submission = pd.merge(submission, preds, left_index=True, right_index=True)
submission['Patient_Week'] = submission['Patient'] + '_' + submission['Weeks'].astype(str)
submission = submission.drop(columns=['Patient', 'Weeks', 'FVC', 'PatientID'])

FVC = submission.iloc[:, :-1].mean(axis=1)
confidence = submission.iloc[:, :-1].std(axis=1)
submission['FVC'] = FVC
submission['Confidence'] = confidence
submission = submission[['Patient_Week', 'FVC', 'Confidence']]
submission.to_csv('submission.csv', index=False)
submission.head()

Output:

Note: We generate the final prediction so that we can submit it to the competition for evaluation.
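Because the submission contains both an FVC prediction and a Confidence value, it helps to see how the two interact in the score. The sketch below implements the modified Laplace log likelihood as we understand it from the competition's evaluation description (confidence clipped at 70, absolute error capped at 1000). Treat the formula and the function name as assumptions and check them against the competition page before relying on them.

```python
import math

# Sketch of the competition metric as we understand it: a modified Laplace
# log likelihood. The clipping constants (70 and 1000) are assumptions taken
# from our reading of the competition's evaluation description.

def laplace_log_likelihood(fvc_true, fvc_pred, confidence):
    """Score one prediction; higher (closer to zero) is better."""
    sigma_clipped = max(confidence, 70.0)
    delta = min(abs(fvc_true - fvc_pred), 1000.0)
    return (-math.sqrt(2) * delta / sigma_clipped
            - math.log(math.sqrt(2) * sigma_clipped))

# Same error of 200 mL, scored with an overconfident and a realistic sigma.
overconfident = laplace_log_likelihood(2800, 3000, confidence=70)
realistic = laplace_log_likelihood(2800, 3000, confidence=250)
print(f"confidence=70:  {overconfident:.3f}")
print(f"confidence=250: {realistic:.3f}")
```

Under this metric, a confidence that is too small is penalised heavily when the prediction is off, which is why a well-calibrated standard deviation matters as much as the mean prediction.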

Conclusion

At its core, a probabilistic model is simply a model that incorporates uncertainty. In machine learning, this often involves representing the relationships between different variables in a system using probability distributions. For example, in a classification task, a probabilistic model might represent the probability of a particular input belonging to each possible class.
