Hidden Markov Model in Machine Learning

Hidden Markov Models (HMMs) are a type of probabilistic model that are commonly used in machine learning for tasks such as speech recognition, natural language processing, and bioinformatics. They are a popular choice for modelling sequences of data because they can effectively capture the underlying structure of the data, even when the data is noisy or incomplete. In this article, we will give a comprehensive overview of Hidden Markov Models, including their mathematical foundations, applications, and limitations.

What are Hidden Markov Models?

A Hidden Markov Model (HMM) is a probabilistic model that consists of a sequence of hidden states, each of which generates an observation. The hidden states are usually not directly observable, and the goal of HMM is to estimate the sequence of hidden states based on a sequence of observations. An HMM is defined by the following components:

A set of N hidden states, S = {s1, s2, ..., sN}.
A set of M observations, O = {o1, o2, ..., oM}.
An initial state probability distribution, ? = {?1, ?2, ..., ?N}, which specifies the probability of starting in each hidden state.
A transition probability matrix, A = [aij], defines the probability of moving from one hidden state to another.
An emission probability matrix, B = [bjk], defines the probability of emitting an observation from a given hidden state.

The basic idea behind an HMM is that the hidden states generate the observations, and the observed data is used to estimate the hidden state sequence. This is often referred to as the forward-backwards algorithm.

Applications of Hidden Markov Models

Now, we will explore some of the key applications of HMMs, including speech recognition, natural language processing, bioinformatics, and finance.

Speech Recognition
One of the most well-known applications of HMMs is speech recognition. In this field, HMMs are used to model the different sounds and phones that makeup speech. The hidden states, in this case, correspond to the different sounds or phones, and the observations are the acoustic signals that are generated by the speech. The goal is to estimate the hidden state sequence, which corresponds to the transcription of the speech, based on the observed acoustic signals. HMMs are particularly well-suited for speech recognition because they can effectively capture the underlying structure of the speech, even when the data is noisy or incomplete. In speech recognition systems, the HMMs are usually trained on large datasets of speech signals, and the estimated parameters of the HMMs are used to transcribe speech in real time.
Natural Language Processing
Another important application of HMMs is natural language processing. In this field, HMMs are used for tasks such as part-of-speech tagging, named entity recognition, and text classification. In these applications, the hidden states are typically associated with the underlying grammar or structure of the text, while the observations are the words in the text. The goal is to estimate the hidden state sequence, which corresponds to the structure or meaning of the text, based on the observed words. HMMs are useful in natural language processing because they can effectively capture the underlying structure of the text, even when the data is noisy or ambiguous. In natural language processing systems, the HMMs are usually trained on large datasets of text, and the estimated parameters of the HMMs are used to perform various NLP tasks, such as text classification, part-of-speech tagging, and named entity recognition.
Bioinformatics
HMMs are also widely used in bioinformatics, where they are used to model sequences of DNA, RNA, and proteins. The hidden states, in this case, correspond to the different types of residues, while the observations are the sequences of residues. The goal is to estimate the hidden state sequence, which corresponds to the underlying structure of the molecule, based on the observed sequences of residues. HMMs are useful in bioinformatics because they can effectively capture the underlying structure of the molecule, even when the data is noisy or incomplete. In bioinformatics systems, the HMMs are usually trained on large datasets of molecular sequences, and the estimated parameters of the HMMs are used to predict the structure or function of new molecular sequences.
Finance
Finally, HMMs have also been used in finance, where they are used to model stock prices, interest rates, and currency exchange rates. In these applications, the hidden states correspond to different economic states, such as bull and bear markets, while the observations are the stock prices, interest rates, or exchange rates. The goal is to estimate the hidden state sequence, which corresponds to the underlying economic state, based on the observed prices, rates, or exchange rates. HMMs are useful in finance because they can effectively capture the underlying economic state, even when the data is noisy or incomplete. In finance systems, the HMMs are usually trained on large datasets of financial data, and the estimated parameters of the HMMs are used to make predictions about future market trends or to develop investment strategies.

Limitations of Hidden Markov Models

Now, we will explore some of the key limitations of HMMs and discuss how they can impact the accuracy and performance of HMM-based systems.

Limited Modeling Capabilities
One of the key limitations of HMMs is that they are relatively limited in their modelling capabilities. HMMs are designed to model sequences of data, where the underlying structure of the data is represented by a set of hidden states. However, the structure of the data can be quite complex, and the simple structure of HMMs may not be enough to accurately capture all the details. For example, in speech recognition, the complex relationship between the speech sounds and the corresponding acoustic signals may not be fully captured by the simple structure of an HMM.
Overfitting
Another limitation of HMMs is that they can be prone to overfitting, especially when the number of hidden states is large or the amount of training data is limited. Overfitting occurs when the model fits the training data too well and is unable to generalize to new data. This can lead to poor performance when the model is applied to real-world data and can result in high error rates. To avoid overfitting, it is important to carefully choose the number of hidden states and to use appropriate regularization techniques.
Lack of Robustness
HMMs are also limited in their robustness to noise and variability in the data. For example, in speech recognition, the acoustic signals generated by speech can be subjected to a variety of distortions and noise, which can make it difficult for the HMM to accurately estimate the underlying structure of the data. In some cases, these distortions and noise can cause the HMM to make incorrect decisions, which can result in poor performance. To address these limitations, it is often necessary to use additional processing and filtering techniques, such as noise reduction and normalization, to pre-process the data before it is fed into the HMM.
Computational Complexity
Finally, HMMs can also be limited by their computational complexity, especially when dealing with large amounts of data or when using complex models. The computational complexity of HMMs is due to the need to estimate the parameters of the model and to compute the likelihood of the data given in the model. This can be time-consuming and computationally expensive, especially for large models or for data that is sampled at a high frequency. To address this limitation, it is often necessary to use parallel computing techniques or to use approximations that reduce the computational complexity of the model.

Implementation of HMM using Python

For reference, we will implement the Hidden Markov Model for POS Tagging in the code.

Importing Libraries

import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from matplotlib import pyplot as plt

from sklearn.model_selection import GroupShuffleSplit
from hmmlearn import hmm
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

Importing Data

dataset = pd.read_csv("NER dataset.csv", encoding='latin1')
dataset = dataset.fillna(method="ffill")
dataset = dataset.rename(columns={'Sentence #': 'sentence'})
dataset.head(5)

Output:

Count the number of tags and words there are overall in the data. This will come in handy later.

tags = list(set(dataset.POS.values))
words = list(set(dataset.Word.values))
len(tags), len(words)

Output:

We are unable to divide data properly using "train test split" since doing so would result in certain sentence components being included in the training set while others are included in the testing set. We substitute "GroupShuffleSplit" instead.

y = dataset.POS
X = dataset.drop('POS', axis=1)

groupshufflesplit = GroupShuffleSplit(n_splits=2, test_size=.33, random_state=42)
ix_train, ix_test = next(groupshufflesplit.split(X, y, groups=dataset['sentence']))

dataset_train = dataset.loc[ix_train]
dataset_test = dataset.loc[ix_test]

dataset_train

Output:

After examining the split data, everything seemed to be in order.

Verify the number of tags and words in the practice set.

tags = list(set(dataset_train.POS.values))
words = list(set(dataset_train.Word.values))
len(tags), len(words)

Output:

The number of tags is sufficient, but the amount of words (29k vs 35k) is insufficient.

Due to this, we must erratically add some UNKNOWN words to the training dataset, after which we must recalculate the word list and generate a number-to-word mapping.

dataframe_update = dataset_train.sample(frac=.15, replace=False, random_state=42)
dataframe_update.Word = 'UNKNOWN'
dataset_train.update(dataframe_update)
words = list(set(dataset_train.Word.values))
# Convert words and tags into numbers
word2id = {w: i for i, w in enumerate(words)}
tag2id = {t: i for i, t in enumerate(tags)}
id2tag = {i: t for i, t in enumerate(tags)}
len(tags), len(words)

Output:

The Hidden Markov Model may be trained through the use of the Baum-Welch algorithm. However, the training's only input is a dataset (Words).

We are unable to link the states back to the POS tag.

Because of this, we must determine the model parameters for "hmmlearn."

tags_count = dict(dataset_train.POS.value_counts())
tags_to_word_count = dataset_train.groupby(['POS']).apply(lambda grp: grp.groupby('Word')['POS'].count().to_dict()).to_dict()
init_tags_count = dict(dataset_train.groupby('sentence').first().POS.value_counts())


tags_to_next_tags_count = np.zeros((len(tags), len(tags)), dtype=int)
sentences = list(dataset_train.sentence)
pos = list(dataset_train.POS)
for i in range(len(sentences)) :
    if (i > 0) and (sentences[i] == sentences[i - 1]):
        prevtagid = tag2id[pos[i - 1]]
        nexttagid = tag2id[pos[i]]
        tags_to_next_tags_count[prevtagid][nexttagid] += 1

my_start_prob = np.zeros((len(tags),))
my_transmat = np.zeros((len(tags), len(tags)))
my_emission_prob = np.zeros((len(tags), len(words)))
num_sentences = sum(init_tags_count.values())
sum_tags_to_next_tags = np.sum(tags_to_next_tags_count, axis=1)
for tag, tagid in tag2id.items():
    floatCountTag = float(tags_count.get(tag, 0))
    my_start_prob[tagid] = init_tags_count.get(tag, 0) / num_sentences
    for word, wordid in word2id.items():
        my_emission_prob[tagid][wordid]= tags_to_word_count.get(tag, {}).get(word, 0) / floatCountTag
    for tag2, tagid2 in tag2id.items():
        my_transmat[tagid][tagid2]= tags_to_next_tags_count[tagid][tagid2] / sum_tags_to_next_tags[tagid]

Initialize HMM

model = hmm.MultinomialHMM(n_components=len(tags), algorithm='viterbi', random_state=42)
model.startprob_ = my_start_prob
model.transmat_ = my_transmat
model.emissionprob_ = my_emission_prob

We must first change certain words into the term "UNKNOWN" because they might not ever exist in the training set.

The "data test" is then divided into "samples" and "lengths" and sent to HMM.

dataset_test.loc[~dataset_test['Word'].isin(words), 'Word'] = 'UNKNOWN'
test_word = list(dataset_test.Word)
samples_of = []
for i, val in enumerate(test_word):
    samples_of.append([word2id[val]])
   
# TODO use panda solution
lengths = []
count = 0
sentences = list(dataset_test.sentence)
for i in range(len(sentences)) :
    if (i > 0) and (sentences[i] == sentences[i - 1]):
        count += 1
    elif i > 0:
        lengths.append(count)
        count = 1
    else:
        count = 1

# This code is very slow
predict_pos = model.predict(samples_of, lengths)
predict_pos

Output:

tags_test = list(dataset_test.POS)
pos_test = np.zeros((len(tags_test), ), dtype=int)
for i, val in enumerate(tags_test):
    pos_test[i] = tag2id[val]
len(predict_pos), len(pos_test), len(samples_of), len(test_word)

Output:

def reportTest(y_pred, y_test):
    print("The accuracy is {}".format(accuracy_score(y_test, y_pred)))
    print("The precision is {}".format(precision_score(y_test, y_pred, average='weighted')))
    print("The recall is {}".format(recall_score(y_test, y_pred, average='weighted')))
    print("The F1-Score is {}".format(f1_score(y_test, y_pred, average='weighted')))

min_length = min(len(pos_test), len(pos_test))

reportTest(pos_test[:min_length], pos_test[:min_length])

Output: