Protein Folding Using Machine Learning

Proteins are like superheroes in our body, playing crucial roles in supporting the functions of our tissues, organs, and overall body processes. These incredible molecules are composed of 20 different building blocks, each one known as an amino acid. It's mind-blowing to think that within our body, there exists a vast array of proteins, each possessing a unique sequence of dozens or even hundreds of amino acids.

The fascinating part is that the specific sequence of amino acids in a protein is like a secret code that determines its superpowers, such as its functions. This sequence actually dictates the protein's 3D structure and how it behaves under different circumstances. And guess what? This unique 3D structure then defines the protein's special role in various biological processes. So, it's not just any ordinary code; it's like a super blueprint that shapes the protein's form and unleashes its extraordinary functions, making it an essential aspect of how our body works.

But that's not all! Let's delve into the captivating world of protein folding. Picture this: proteins are like masterpieces, formed by long chains of amino acids, and their 3D structure holds the key to unlocking their powers. The process of protein folding is like an intricate dance, where the protein chain elegantly and precisely folds into its own extraordinary and functional shape. It's like the protein discovers its true identity, revealing its unique and powerful abilities to fulfill its mission in our body.

Understanding protein folding is no easy feat, though. It's a complex puzzle to solve, given the immense intricacies of the process and the countless possible ways a protein can adopt its shape. But scientists are on a quest to unlock this mystery, as it holds the key to predicting protein structure, which has far-reaching implications in fields like discovering new medicines, researching diseases, and even advancing bioengineering.

Understanding it is essential because it directly connects to predicting protein structures. This prediction has broad implications in drug discovery, disease research, and bioengineering. However, protein folding poses challenges due to its complex nature and the countless possible conformations that a protein can take.

Machine learning algorithms can be trained on existing protein folding data to learn patterns and relationships between protein sequences and their corresponding structures. These algorithms can then be used to predict the structure of new proteins based on their amino acid sequences. By analyzing large datasets of known protein structures, machine-learning models can uncover hidden patterns and principles that govern protein folding.

Benefits of Machine Learning in Protein Folding

Here are some of the benefits of machine learning in understanding Protein folding:

In recent years, machine learning methods have proven to be highly beneficial in the study of protein folding. These techniques offer researchers the ability to delve into the intricate connections between protein structures and their specific functions, particularly in the context of disease-related proteins. Uncovering such valuable information sheds light on the molecular intricacies of various diseases, opening up possibilities for the development of targeted and effective therapeutic strategies.
In the realm of protein research, machine learning has emerged as a formidable ally. Through the utilization of extensive protein folding data, scientists can now employ machine learning models to predict the complex 3D structures of proteins based solely on their amino acid sequences. This remarkable advancement is revolutionary, considering the conventional approach of determining protein structures through time-consuming and costly experiments.
Understanding the 3D shapes of proteins plays a vital role in drug discovery. When creating new medications, it's crucial to identify proteins that drugs can interact with and modify their functions. By leveraging machine learning, researchers can make precise predictions about protein structures. This valuable information enables them to discover potential drug targets and develop innovative medications that can efficiently interact with these proteins, providing effective treatments for various diseases.
The connection between protein engineering and machine learning holds immense potential with far-reaching implications. In the fields of biotechnology and synthetic biology, the opportunities are vast and exciting. Engineered proteins can find utility in various areas, such as enzyme production, where they can act as catalysts for essential reactions. Additionally, the landscape of biofuel production can be transformed, offering greener and more sustainable alternatives. Even in bioremediation, the process of using organisms to cleanse pollutants, we can witness the benefits of the remarkable progress achieved through the integration of machine learning and protein engineering. The possibilities are endless and promising, offering novel solutions to real-world challenges.

Disadvantages of Protein Folding prediciton using Machine Learning

While protein folding prediction using machine learning offers numerous advantages, there are also some challenges and disadvantages associated with this approach, that are:

Protein folding is a highly complex process that entails a multitude of interactions and shapes. Trying to predict the 3D structure of proteins based on their amino acid sequences is a challenging and resource-intensive endeavor. The intricate nature of folding demands significant computational power and can result in extended processing times, particularly for sizable protein sequences.
Even though there have been notable advancements, the present machine-learning models for predicting protein folding still encounter challenges in accuracy. The intricate nature of protein structures and the immense range of conformations pose difficulties in achieving complete precision in predicting folding patterns. While machine learning models offer valuable insights, experimental methods like X-ray crystallography and NMR spectroscopy remain indispensable for obtaining exceptionally accurate protein structures.
Proteins display notable diversity in their folding patterns, even with slight variations in the amino acid sequences. Machine learning models might face challenges in capturing this intrinsic biological variability, resulting in discrepancies between predicted structures and actual experimental observations.
Proteins are incredibly dynamic and have the ability to take on various shapes in response to their surroundings and interactions with other molecules. Integrating this dynamic information into machine learning models for predicting protein folding is a challenging endeavor.
Machine learning models can sometimes suffer from overfitting, particularly when they are trained on small datasets. In the case of protein folding prediction, overfitting can lead to models that appear to work well on the training data but struggle to make accurate predictions on new and unseen protein sequences. Ensuring that machine learning models are robust and capable of generalizing to different protein structures remains a significant challenge in the field.

Protien Folding Prediction in Machine Learning Using Python

About the Dataset

This dataset contains protein information retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB). The PDB archive is a vast collection of data that includes atomic coordinates and other details about proteins and important biological macromolecules. To determine the location of each atom within the molecule, structural biologists use various methods like X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Once they obtain this information, they deposit it into the archive, where it is annotated and made publicly available by the wwPDB.

The PDB archive is constantly growing as research progresses in laboratories worldwide. This makes it an exciting resource for researchers and educators. It provides structures for many proteins and nucleic acids involved in crucial life processes, including ribosomes, oncogenes, drug targets, and even entire viruses. However, due to the vastness of the database, it can be challenging to navigate and find specific information. There are often multiple structures available for a single molecule or structures that are partial, modified, or different from their natural form.

Despite the challenges, the PDB archive remains a valuable source of data for the scientific community, offering a wealth of information about the structures of various biological molecules. Researchers and educators can explore this vast repository to gain insights into the intricacies of proteins and other macromolecules, supporting advancements in the field of structural biology.

Content

There are two data files. Both are arranged on the "structureId" of the protein:

pdb_data_no_dups.csv contains protein metadata which includes details on protein classification, extraction methods, etc.
data_seq.csv contains >400,000 protein structure sequences.

Now, we will try to make a model that can predict protein structure.

Code:

Importing Libraries

import random
import os
import torch
from torch import nn, einsum

import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import gc

from einops import rearrange, repeat, reduce
from einops.layers.torch import Rearrange
from inspect import isfunction

import sidechainnet as scn
from sidechainnet.examples import losses, models
from sidechainnet.structure.structure import inverse_trig_transform
from sidechainnet.structure.build_info import NUM_ANGLES
import py3Dmol

seed = 0

random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

Output:

We utilized sidechainnet for training our machine learning models, aiming to predict protein structure (angles or coordinates) based on the given amino acid sequences. These examples are almost at the minimum level required for comprehensive model training.

The code here is set to train on the debug dataset by default. However, you have the freedom to modify the call to "scn.load" and select a different SidechainNet dataset, such as CASP12, for further experimentation and training.

Here, we will be working with two simplified recurrent neural networks (RNNs) to predict angle representations of proteins using their corresponding amino acid sequences:

The sequence + PSSM Net_Protein model uses a combination of the amino acid sequence (one-hot vectors, ), the Position Specific Scoring Matrix (PSSM, ), and the information content () as its input.
The sequence-only Net_Protein model receives amino acid sequences represented as integer tensors (() as its input.

The internal RNN processes the amino acid sequences, generating angle vectors for each amino acid. While other models used only 3 angles, we can predict all 12 angles provided by SidechainNet in our case.

Data Accessing Using Pytorch

When requesting DataLoaders, you will receive a dictionary that maps split names to their respective DataLoaders.

#Prepare the data in a suitable format for training.
load_data = scn.load(
             with_pytorch="dataloaders",
             batch_size=4, 
             dynamic_batching=False)
print("Available Dataloaders =", list(load_data.keys()))

Output:

When batches are yielded, each DataLoader returns a Batch namedtuple object with the following attributes:

pids: A tuple containing Net_Protein/SidechainNet ids for the proteins in this batch.
seqs:A tensor that encodes sequences, presented either as integers or as one-hot vectors, depending on the setting of the scn.load(...seq_as_onehot) parameter.
msks: A tensor of missing residue masks, which might overlap with padding in data.
evos: A tensor of PSSM (Position Specific Scoring Matrix) + information content.
secs:A tensor that represents a secondary structure, presented either as integers or one-hot vectors, depending on the setting of the scn.load(...seq_as_onehot) parameter.
angs: A tensor representing angles.
crds: A tensor representing coordinates.
ress: A tuple containing X-ray crystallographic resolutions.

batch = next(iter(load_data['train']))
print("Protein IDs\n   ", batch.pids)
print("Sequences\n   ", batch.seqs.shape)
print("Evolutionary Data\n   ", batch.evos.shape)
print("Secondary Structure\n   ", batch.secs.shape)
print("Angle Data\n   ", batch.angs.shape)
print("Coordinate Data\n   ", batch.crds.shape)
print("X-ray Resolution\n   ", batch.resolutions)
print("Concatenated Data (seq/evo/2ndary)\n   ", batch.seq_evo_sec.shape)
print("Integer sequence")
print("\tShape:", batch.int_seqs.shape)
print("\tEx:", batch.int_seqs[0,:3])

print("1-hot sequence")
print("\tShape:", batch.seqs.shape)
print("\tEx:\n", batch.seqs[0,:3])

Output:

# In the default integer sequence representation, padding is done using the integer value 20. For instance, if we observe the last 15 amino-acid "characters" of sequence #1, it can be seen that padding with the 20s has been used to match the batch size.

example = 0 # 308, note many indices point to structures that have gaps, and thus cannot be visualzed/constructed from angles
seq, ang, crd, mask, sec = ( batch.str_seqs[example],batch.angs[example], 
                             batch.crds[example], batch.msks[example], 
                             batch.secs[example]
                            )
name = batch.pids[example]

print(f"\nExample using {name}.\n")
print(f"Sequence, Mask, and Secondary Structure:\n{seq}\n{mask}\n{sec}\n")
print(f"Angles:\n{ang[:3]} ...\n")
print(f"Coordinates:\n{crd[:3]} ...\n")

Output:

#sb = scn.StructureBuilder(seq,crd)
#sb.to_3Dmol(width=600, height=300)
# If you want to train with a GPU, navigate to Runtime > Change runtime type
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using {device} for training.")

Output:

Helper Functions

Helper functions are small, reusable pieces of code that assist in performing specific tasks within a larger program or script. These functions are designed to simplify complex operations, improve code readability, and avoid code duplication. By breaking down complex tasks into smaller, manageable units, helper functions make the main code more organized and easier to maintain.

# helpers

def exists(val):
    return val is not None

def default(val, d):
    if exists(val):
        return val
    return d() if isfunction(d) else d

def cast_tuple(val, depth = 1):
    return val if isinstance(val, tuple) else (val,) * depth

def init_zero_(layer):
    nn.init.constant_(layer.weight, 0.)
    if exists(layer.bias):
        nn.init.constant_(layer.bias, 0.)

Attention Layers

Attention layers are important in deep learning models because they help the model focus on the most relevant parts of the data. They work like human attention, where some things are given more importance than others in the learning process.

class Attention(nn.Module):
    def __init__(
        self,
        dim,
        len_seq = None,
        heads = 8,
        dim_head = 64,
        dropout = 0.0,
        gating = True
    ):
        super().__init__()
        inner_dim = dim_head * heads
        self.len_seq = len_seq
        self.heads= heads
        self.scale = dim_head ** -0.5

        self.to_q = nn.Linear(dim, inner_dim, bias = False)
        self.to_kv = nn.Linear(dim, inner_dim * 2, bias = False)
        self.to_out = nn.Linear(inner_dim, dim)

        self.gating = nn.Linear(dim, inner_dim)
        nn.init.constant_(self.gating.weight, 0.)
        nn.init.constant_(self.gating.bias, 1.)

        self.dropout = nn.Dropout(dropout)
        init_zero_(self.to_out)

    def forward(self, x, mask = None, attn_bias = None, context = None, mask_context = None, tie_dim = None):
        device, orig_shape, h, has_context = x.device, x.shape, self.heads, exists(context)
        context = default(context, x)
        q_0, k_0, v = (self.to_q(x), *self.to_kv(context).chunk(2, dim = -1))
        i, j = q_0.shape[-2], k.shape[-2]
        q_0, k_0, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), (q, k_0, v))

        # scale
        q_0= q_0* self.scale

        # query / key similarities
        if exists(tie_dim):
           # In accordance with the paper, for the additional Multiple Sequence Alignments (MSAs),
            # They take the average of the queries along the rows of the MSAs.
            # They referred to this specific module as MSAColumnGlobalAttention.

            q_0, k_0 = map(lambda t: rearrange(t, '(b r) ... -> b r ...', r = tie_dim), (q, k))
            q_0= q_0.mean(dim = 1)

            0_dots = einsum('b h i d, b r h j d -> b r h i j', q_0, k)
            0_dots = rearrange(0_dots, 'b r ... -> (b r) ...')
        else:
            0_dots = einsum('b h i d, b h j d -> b h i j', q_0, k)

        # If provided, include attention bias to enable communication from pairwise to msa attention.
        if exists(attn_bias):
            0_dots = 0_dots + attn_bias

        # masking
        if exists(mask):
            mask = default(mask, lambda: torch.ones(1, i, device = device).bool())
            mask_context = mask if not has_context else default(mask_context, lambda: torch.ones(1, k.shape[-2], device = device).bool())
            mask_value = -torch.finfo(0_dots.dtype).max
            mask = mask[:, None, :, None] * mask_context[:, None, None, :]
            try:
                mask = mask.to(torch.bool)
                0_dots = 0_dots.masked_fill(~mask, mask_value)
            except:
                0_dots = 0_dots.masked_fill(mask, mask_value)

        # attention
        0_dots = 0_dots - 0_dots.max(dim = -1, keepdims = True).values
        attn = 0_dots.softmax(dim = -1)
        attn = self.dropout(attn)
        # aggregate
        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        # merge heads
        out = rearrange(out, 'b h n d -> b n (h d)')
        # gating
        gates = self.gating(x)
        out = out * gates.sigmoid()
        # combine to out
        out = self.to_out(out)
        return out

class Net_Protein(nn.Module):
    """A model for predicting protein angles from integer-encoded sequences."""
    def __init__(self,
                 d_hidden,
                 dim,
                 d_in=21,
                 d_embedding=32,
                 heads = 8,
                 integer_sequence=True,
                 n_angles=scn.structure.build_info.NUM_ANGLES):
        
        super(Net_Protein, self).__init__()
        # Dimensionality of RNN hidden state
        self.d_hidden = d_hidden
      
        self.attn = Attention(dim = dim, heads = heads)
        # Output vector dimensionality (per amino acid)
        self.d_out = n_angles * 2
        # Output projection layer. (from RNN -> target tensor)
        self.hidden2out = nn.Sequential(
                            nn.Linear(d_embedding, d_hidden),
                            nn.GELU(),
                            nn.Linear(d_hidden, self.d_out)
                                    )
        self.out2attn = nn.Linear(self.d_out, dim)
        self.final = nn.Sequential(
                            nn.GELU(),
                            nn.Linear(dim, self.d_out))
        self.norm_0 = nn.LayerNorm([dim])
        self.norm_1 = nn.LayerNorm([dim])
        self.activation_0 = nn.GELU()
        self.activation_1 = nn.GELU()

        # The activation function used for the output values is designed to bind them within the range of [-1, 1].                            
        self.output_activation = torch.nn.Tanh()

        # The way we embed the input of our model varies depending on the type of input it receives.
        self.integer_sequence = integer_sequence
        if self.integer_sequence:
            self.input_embedding = torch.nn.Embedding(d_in, d_embedding, padding_idx=20)
        else:
            self.input_embedding = torch.nn.Linear(d_in, d_embedding)
    def get_lengths(self, sequence):
        """Calculate the lengths of each sequence in the batch."""
        if self.integer_sequence:
            lengths = sequence.shape[-1] - (sequence == 20).sum(axis=1)
        else:
            lengths = sequence.shape[1] - (sequence == 0).all(axis=-1).sum(axis=1)
        return lengths.cpu()

    def forward(self, sequence, mask=None):
        """Perform a single forward pass of the model."""
        # First, we compute sequence lengths
        lengths = self.get_lengths(sequence)

        #After computing the lengths of each sequence in the batch, we proceed to embed our input tensors to prepare them for input to the Recurrent Neural Network (RNN).
        sequence = self.input_embedding(sequence)

        # After embedding the input tensors, we pass our data into the RNN using PyTorch's `pack_padded_sequences` function. This function helps handle sequences of variable lengths efficiently by packing them together and padding where necessary before passing them to the RNN for processing.
        sequence = torch.nn.utils.rnn.pack_padded_sequence(sequence,
                                                         lengths,
                                                         batch_first=True,
                                                         enforce_sorted=False)
        output, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(sequence,
                                                                      batch_first=True)
       # At this stage, the output tensor has the same dimensionality as the RNN's hidden state, i.e., (batch, length, d_hidden).

      # To obtain the desired output dimensionality of (batch, length, 24), we perform a linear transformation on the output tensor. This transformation ensures that the output is appropriately reshaped to match the required dimensions for further processing.

        output = self.hidden2out(output)
        output = self.out2attn(output)
        output = self.activation_0(output)
        output = self.norm_0(output)
        output = self.attn(output, mask=mask)
        output = self.activation_1(output)
        output = self.norm_1(output)
        output = self.final(output)
      
        # After obtaining the output tensor with dimensions (batch, length, 24), the next step is to bound the output values within the range [-1, 1]. This step is essential to ensure that the predicted angles fall within the valid range and make the predictions meaningful for protein structure analysis. Bounding the output values restricts them to a specific interval, allowing for more accurate and reliable predictions.
        output = self.output_activation(output)

        # Lastly, we reshape the output tensor to have dimensions (batch, length, angle, (sin/cos value)). This reshaping process organizes the predicted angles in a more structured way, where each angle is represented by its corresponding sine and cosine values. This representation is useful for further analysis and interpretation of the predicted protein structures. By organizing the output in this manner, we can easily extract the sine and cosine values of each predicted angle for downstream applications and evaluations.
        output = output.view(output.shape[0], output.shape[1], 12, 2)

        return output

Training

Here, we are going to train the model, such as the Secondary Protein Structure matrix as input.

Model Inputs

The model input is enhanced by incorporating PSSMs, secondary structure, and information content, which are accessed from the batch.seq_evo_sec attribute.
The dataset used is the smallest version of the CASP 12 dataset, with 30% thinning to reduce complexity.
The size of the model is scaled up by increasing the hidden state dimension to 1024 for improved performance.

PSSM

A PSSM, also known as a Position Specific Scoring Matrix or Position Weight Matrix in the context of DNA, represents a matrix that provides specific scores or probabilities for each position in a sequence.

It is like a special code that tells us how likely each letter (amino acid) appears at different positions in a secret message (protein sequence). Scientists create this code by comparing many similar secret messages from different creatures. The PSSM helps them understand which letters are important and which ones can change without affecting the message's meaning. It's like having a secret decoder that helps scientists learn more about the secret messages in proteins and how they work.

Since PSSMs and sequences both have 20 different pieces of information, the secondary structure has 8 possibilities, and the information content is a single number for each piece; when we put all these together, we need a total of 49 values to represent them correctly.

model_pssms = Net_Protein(d_hidden=512,
                           dim=256,
                           d_in=49,
                           d_embedding=32,
                           integer_sequence=False)
model_pssms = model_pssms.to(device)
model_pssms

Output:

def init_loss_optimizer(model):
    optimizer = torch.optim.Adam(model.parameters())
    batch_losses = []
    epoch_training_losses = []
    epoch_validation10_losses = []
    epoch_validation90_losses = []
    mse_loss = torch.nn.MSELoss()
    
    return optimizer, batch_losses, epoch_training_losses, epoch_validation10_losses, epoch_validation90_losses, mse_loss
optimizer, batch_losses, epoch_training_losses, epoch_validation10_losses, epoch_validation90_losses, mse_loss = init_loss_optimizer(model_pssms)

def validation(model, datasplit, mode):
    """Assess the model's performance by evaluating its ability to predict angles represented as sin/cos values between -1 and 1 using the Mean Squared Error (MSE) metric. This evaluation allows us to understand how well the model performs in capturing the relationship between the input sequences and the corresponding angles and provides valuable insights into its accuracy in predicting protein structures."""
    total = 0.0
    n = 0
    with torch.no_grad():
        for batch in datasplit:
            # Set up variables and generate a mask to identify missing angles in the data, which are then padded with zeros. This preparation ensures that the model can handle and process the data effectively, accounting for any missing information in the input sequences.
            # The mask is duplicated along the last dimension to match the sin/cos representation of the data. This ensures that the mask aligns correctly with the format of the angles for proper processing and evaluation. represenation.
            if mode == 'seqs':
                seqs = batch.int_seqs.to(device).long()
            elif mode == 'pssms':
                seqs = batch.seq_evo_sec.to(device)
            mask_ = batch.msks.to(device)
            true_angles_sincosine = scn.structure.trig_transform(batch.angs).to(device)
            mask = (batch.angs.ne(0)).unsqueeze(-1).repeat(1, 1, 1, 2)

            # Generate predictions using the model and perform optimization to improve the model's performance.
            angles_predicted= model(seqs, mask = mask_)
            loss = mse_loss(predicted_angles[mask], true_angles_sincosine[mask])
            
            total += loss
            n += 1

    return torch.sqrt(total/n)

def train(model, mode, n_epoch):
    for epoch in range(n_epoch):
        print(f'Epoch {epoch}')
        bar_progress = tqdm(total=len(load_data['train']), smoothing=0)
        for batch in load_data['train']:
            # Prepare the necessary variables and create a mask to identify missing angles, which will be filled with zeros.
            # Please take note that the mask is repeated in the last dimension to align with the sin/cos representation.
            if mode == 'seqs':
                seqs = batch.int_seqs.to(device).long()
            elif mode == 'pssms':
                seqs = batch.seq_evo_sec.to(device)
            mask_ = batch.msks.to(device)
            true_angles_sincos = scn.structure.trig_transform(batch.angs).to(device)
            mask = (batch.angs.ne(0)).unsqueeze(-1).repeat(1, 1, 1, 2)

            #Generate predictions using the model and perform optimization to improve the model's performance.
            angles_predicted= model(seqs, mask = mask_)
            loss = mse_loss(predicted_angles[mask], true_angles_sincos[mask])
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 2)
            optimizer.step()

            # Housekeeping
            batch_losses.append(float(loss))
            bar_progress.update(1)
            bar_progress.set_description(f"\rRMSE Loss = {np.sqrt(float(loss)):.4f}")
        # Assess the model's performance on the train-eval set, which has been downsampled for efficiency reasons.
        epoch_training_losses.append(validation(model, load_data['train-eval'], mode))
        #Assess the model's performance on different validation sets.
        epoch_validation10_losses.append(validation(model, load_data['valid-10'], mode))
        epoch_validation90_losses.append(validation(model, load_data['valid-90'], mode))
        print(f"     Train-eval loss = {epoch_training_losses[-1]:.4f}")
        print(f"     Valid-10   loss = {epoch_validation10_losses[-1]:.4f}")
        print(f"     Valid-90   loss = {epoch_validation90_losses[-1]:.4f}")
    # Finally, evaluate the model on the test set
    print(f"Test loss = {validation(model, load_data['test'], mode):.4f}")

train(model_pssms, 'pssms', 10)

Output:

# Create a plot showing the loss of each batch over time.
plt.plot(np.sqrt(np.asarray(batch_losses)), label='batch loss')
plt.ylabel("RMSE")
plt.xlabel("Step")
plt.title("Training Loss over Time")
plt.show()

Output:

# The previous plot illustrates the loss for each batch during the training process. However, the next plot presents the model's performance on various data splits at the end of each epoch.
plt.plot([x.cpu().detach().numpy() for x in epoch_training_losses], label='train-eval')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation10_losses], label='valid-10')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation90_losses], label='valid-90')
plt.ylabel("RMSE")
plt.xlabel("Epoch")
plt.title("Training and Validation Losses over Time")
plt.legend()
plt.show()

Output:

Visualizing Predictions

In many situations, we use the scn.BatchedStructureBuilder, which needs two things:

A tensor of numbers that represent the protein sequences in a group. These numbers come from the data we have when we go through the training or testing process.
A tensor of numbers that show the predicted angles for each part of the protein. These numbers should be in the range of -π to +π.

We have a model that knows how to guess the sin and cos values of some angles. But we need the actual angles, not the sin and cos values. So we use a special tool called scn.structure.inverse_trig_transform to change the sin and cos values back into the real angles. Once we have the real angles, we can give them to the BatchedStructureBuilder.

def build_visualizable_structures(model, data, mode=None):
    """Create visual representations of the predicted structures for one batch of data using the model's output. These visualizations allow us to better understand and analyze the model's performance.."""
    with torch.no_grad():
        for batch in data:
            if mode == "seqs":
                model_input = batch.int_seqs.to(device)
            elif mode == "pssms":
                model_input = batch.seq_evo_sec.to(device)
            mask_ = batch.msks.to(device)
            #Generate predictions for the angles of the protein structures and then use these angle predictions to construct 3D atomic coordinates for the proteins. This step is essential in converting the predicted angles into a spatial representation of the protein structures.
            predicted_angles_sincos = model(model_input, mask = mask_)
            #As the model predicts sin/cos values for the angles, we need to use a function that converts these values back into the original angles. This function helps us recover the true angles from the predicted sin/cos values, allowing us to interpret the results accurately.
            angles_predicted= inverse_trig_transform(predicted_angles_sincos)

            # EXAMPLE
            # We utilize the BatchedStructureBuilder to construct an entire batch of protein structures. This allows us to efficiently create structures for multiple data points in the batch simultaneously, improving the speed and performance of the process. The BatchedStructureBuilder takes the input data, such as sequences and predicted angles, and generates the corresponding 3D atomic coordinates for each protein in the batch.
            sb_pred = scn.BatchedStructureBuilder(batch.int_seqs, predicted_angles.cpu())
            sb_true = scn.BatchedStructureBuilder(batch.int_seqs, batch.crds.cpu())
            break
    return sb_pred, sb_true

def protein_plot(exp1, exp2):
    p = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js', viewergrid=(2,1))
    p.addModel(open(exp1,'r').read(),'pdb', viewer=(0,0))
    p.addModel(open(exp2,'r').read(),'pdb', viewer=(1,0))
    p.setStyle({'cartoon': {'color':'spectrum'}})
    p.zoomTo()
    p.show()

Inference

Here, we compare our model's predicted protein structure with the actual protein structure. To make it easier to understand, we visualize these comparisons using 3D plots. Each example has two plots: the top plot shows the model's prediction of the protein structure, and the bottom plot displays the real protein structure. This allows us to see how well our model's predictions match the actual protein structures.

Example(01)

pred_s, true_s = build_visualizable_structures(model_pssms, load_data["train"], mode="pssms")
z_idx = 0
idx = 0
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(02)

idx = 1
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(03)

idx = 2
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(04)

idx = 3
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(05)

pred_s, true_s = build_visualizable_structures(model_pssms, load_data["train"], mode="pssms")
z_idx = 1
idx = 0
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(06)

idx = 1
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

idx = 2
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

idx = 3
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Training(Sequence→ Angles)

Now we are going to train the model while taking Protein Sequence as input.

Information Flow: Information flow here in a simple Transformer (Attention) model that works with sequence data. The input, represented as [Layers*21], goes through an Embedding layer, resulting in [Layers. Dense Embedding]. Then, it passes through an LSTM layer, transforming into [Layers. Dense Hidden]. Finally, the output comes out of the LSTM and goes through the [Layers Dense Output] layer. Throughout this process, the model processes the input data, extracting relevant information and producing the final output without modifying it.

Handling the circular nature of angles: To help our model understand that angles π and -π are the same, we use a special trick. Instead of directly predicting angles, we predict two values for each angle: sin and cos. Then, we use the atan2 function to combine these two values and recover the angles. This way, the model's output will be in the shape of L×12×2, where L is the length of the protein sequence, and the values are between -1 and 1. This approach allows us to handle angles properly and improve the accuracy of our predictions.

model_seqonly = Net_Protein(d_hidden=512,
                           dim=256,
                           d_in=49,
                           d_embedding=32,
                           integer_sequence=True)
model_seqonly = model_seqonly.to(device)
optimizer, batch_losses, epoch_training_losses, epoch_validation10_losses, epoch_validation90_losses, mse_loss = init_loss_optimizer(model_seqonly)
train(model_seqonly, 'seqs', 9)

Output:

# We can visualize the loss of each batch over time during the training process using a line plot. The x-axis represents the training iterations or epochs, while the y-axis shows the loss values for each batch. By observing the plot, we can gain insights into how the model's performance is improving over time and whether the loss is converging or fluctuating during training. This visualization helps us monitor the model's learning progress and make informed decisions on training adjustments if needed.
plt.plot(np.sqrt(np.asarray(batch_losses)), label='batch loss')
plt.ylabel("RMSE")
plt.xlabel("Step")
plt.title("Training Loss over Time")
plt.show()

Output:

plt.plot([x.cpu().detach().numpy() for x in epoch_training_losses], label='train-eval')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation10_losses], label='valid-10')
plt.plot([x.cpu().detach().numpy() for x in epoch_validation90_losses], label='valid-90')
plt.ylabel("RMSE")
plt.xlabel("Epoch")
plt.title("Training and Validation Losses over Time")
plt.legend()
plt.show()

Output:

Inference(Sequence-> Angles)

Example(09)

pred_s, true_s = build_visualizable_structures(model_seqonly, load_data["train"], mode="seqs")
z_idx = 2
idx = 0
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(10)

idx = 1
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(11)

idx = 2
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

Example(12)

idx = 3
pred_s.to_pdb(idx,path='{}_{}_pred.pdb'.format(idx, z_idx))
true_s.to_pdb(idx,path='{}_{}_true.pdb'.format(idx, z_idx))
protein_plot('{}_{}_pred.pdb'.format(idx, z_idx), '{}_{}_true.pdb'.format(idx, z_idx))

Output:

We have successfully built an attention-based model that can predict protein structure with high accuracy. We trained the model using two different approaches: one with the Secondary Protein Structure matrix as input and the other with the Protein Sequence as input. Both approaches yielded promising results.
One potential improvement for our model is to use Multiple Sequence Alignment (MSA) as training data. MSA provides additional information about the evolutionary conservation of amino acids, which could help enhance the model's performance.
Currently, our model predicts angles as the target for protein structure prediction. However, we could explore using coordinate distance and coordinates as the target instead. This approach might lead to even more precise and accurate predictions of protein structure.
Overall, our model shows great potential in the field of protein structure prediction, and further exploration of different training data and target variables could further improve its performance.

Future Aspect of Protein Folding Using Machine Learning

The potential of protein folding in machine learning for the future is incredibly promising, and it has the capability to transform our comprehension of protein structure and function. By harnessing the capabilities of machine learning and adopting interdisciplinary strategies, we stand on the verge of discovering fresh opportunities and expanding the frontiers of scientific exploration. As we persist in uncovering the enigmas surrounding protein folding, we embark on a path of revolutionary research and pioneering applications that will have profound effects on human well-being and beyond.

Conclusion

Protein folding stands as a critical and intricate process that profoundly impacts the behavior and functions of proteins. The fusion of machine learning with bioinformatics presents an exciting avenue to delve into this intricate world, equipping us with the ability to predict protein structures with unparalleled accuracy. The journey into machine learning and bioinformatics promises to uncover transformative discoveries that will revolutionize medicine and biotechnology. As we venture forth, the enigma of protein folding becomes closer to being unraveled, revealing the profound intricacies of life itself. With machine learning as our ally, we inch ever closer to unveiling the secrets that lie within the realm of protein folding and its vast implications in the grand tapestry of life.

Next TopicSentiment Analysis Using Machine Learning

← prev next →