Document Classification Using Machine Learning

In today's digital era, businesses and institutions must manage large volumes of information stored in many different document formats. Organizing and classifying this information efficiently is essential for fast retrieval and informed decision-making. Applying machine learning to document classification has emerged as an effective way to automate and streamline these processes.

Document classification plays a central role in information management, supporting streamlined storage, retrieval, and analysis. By sorting documents into relevant categories, organizations can build organized repositories, share knowledge more easily, and improve overall productivity. Traditional manual classification is laborious, error-prone, and time-consuming, which is why automated machine learning techniques are so valuable here.

Machine learning algorithms can analyze document content, structure, and metadata to classify documents accurately. Supervised learning techniques such as Naive Bayes, Support Vector Machines (SVM), and Random Forests are frequently used for classification. These algorithms learn from annotated training data, in which each document is assigned a category. In addition, unsupervised learning methods such as K-means clustering and hierarchical clustering can uncover hidden patterns and group similar documents without predefined category information.
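
To make the idea of learning from annotated training data concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written in plain Python. It is not the model trained later in this tutorial; the function names and the toy documents are made up for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count documents per class and words per class on whitespace tokens."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def predict_naive_bayes(model, text):
    """Return the class with the highest log-posterior for `text`."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy annotated training set (hypothetical data).
train_docs = [
    "invoice payment due amount",
    "payment received invoice number",
    "meeting agenda project schedule",
    "project meeting notes schedule",
]
train_labels = ["finance", "finance", "meeting", "meeting"]

nb_model = train_naive_bayes(train_docs, train_labels)
print(predict_naive_bayes(nb_model, "please send the invoice amount"))         # finance
print(predict_naive_bayes(nb_model, "schedule for the next project meeting"))  # meeting
```

The classifier only ever sees word counts per category, which is why the quality and coverage of the annotated examples matter so much for supervised methods.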

Next, we will implement a document classifier in code.

Code:

Importing Libraries

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.model_selection import train_test_split

Loading the Dataset

DATASET = "/kaggle/input/document-classification/file.txt"
print(f"Dataset found = {os.path.exists(DATASET)}")

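data = []
with open(DATASET, "r") as fp:
    rows = fp.read().split("\n")
    for i, row in enumerate(rows):
        # Skip the first row and any blank rows (for example, a trailing newline),
        # which would otherwise make int() fail below.
        if i == 0 or not row.strip():
            continue

        # Each row starts with an integer label, followed by the document text.
        texts = row.split(" ")
        label = int(texts[0])
        text = " ".join(texts[1:])
        data.append({'label': label, 'text': text})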

df = pd.DataFrame(data)
df.head(5)

Output:

print(f"Number of rows in the dataset = {df.shape[0]}")

num_labels = df["label"].nunique()
print(f"Number of unique labels = {num_labels}")

# Label distribution

df["label"].value_counts()

Output:

Note: The class imbalance in the dataset is moderate, so the data is split in a stratified way.

test_split = 0.2

train_df, test_df = train_test_split(df, test_size=test_split, stratify=df['label'].values)

# Split the held-out 20% in half: one half for validation, the other for testing.
val_df = test_df.sample(frac=0.5)
test_df = test_df.drop(val_df.index)

print(f"Number of rows in training set: {len(train_df)}")
print(f"Number of rows in validation set: {len(val_df)}")
print(f"Number of rows in test set: {len(test_df)}")

Output:
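
The split above stratifies the train/hold-out step by label, while the validation/test step uses a plain random sample. As a rough sketch of what stratification does, the plain-Python helper below (a hypothetical `stratified_split`, not part of scikit-learn or pandas) splits row indices so that each class keeps roughly the same proportion in both parts. The label column is toy data.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fraction, seed=0):
    """Return (rest, part) index lists; about `fraction` of each class goes to `part`."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rest, part = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * fraction)
        part.extend(indices[:cut])
        rest.extend(indices[cut:])
    return rest, part

# Hypothetical imbalanced label column: 80 rows of class 1, 20 rows of class 2.
toy_labels = [1] * 80 + [2] * 20

# 80% train, 20% held out, preserving the class ratio.
train_idx, holdout_idx = stratified_split(toy_labels, fraction=0.2)

# Split the held-out rows in half, again per class.
holdout_labels = [toy_labels[i] for i in holdout_idx]
val_pos, test_pos = stratified_split(holdout_labels, fraction=0.5)
val_idx = [holdout_idx[p] for p in val_pos]
test_idx = [holdout_idx[p] for p in test_pos]

print(Counter(toy_labels[i] for i in train_idx))  # Counter({1: 64, 2: 16})
print(Counter(toy_labels[i] for i in val_idx))    # Counter({1: 8, 2: 2})
print(Counter(toy_labels[i] for i in test_idx))   # Counter({1: 8, 2: 2})
```

With small or rare classes, stratifying the second split as well keeps every label represented in both the validation and the test set.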

GPU APIs

First, we need to check whether a GPU is available.

gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs Available: ", len(gpus))

if gpus:
  try:
    # Limit TensorFlow's use to just the first GPU.
    tf.config.set_visible_devices(gpus[0], 'GPU')
    # Currently, memory growth across GPUs must be uniform.
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    # Listing logical devices initializes the GPUs, so it comes last:
    # visible devices and memory growth cannot be changed afterwards.
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Raised if the devices were already initialized before configuration.
    print(e)

tf.config.run_functions_eagerly(False)

Output:

Preparing the Data

Next, we convert the data into a form that is suitable for training.

# Convert the labels to a one-hot encoding.

labels = tf.ragged.constant(train_df["label"].values)

# One-hot encoding is used for multi-class classification tasks.
label_lookup = tf.keras.layers.IntegerLookup(output_mode="one_hot")
label_lookup.adapt(labels)
vocab = label_lookup.get_vocabulary()

def invert_multi_hot(encoded_labels):
    """Map a single one-hot (or multi-hot) encoded label back to its vocabulary terms."""
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(vocab, hot_indices)


print("Vocabulary:\n")
print(vocab)

Output:

# Each unique class in the label pool is assigned a position in the vocabulary, and a label is then represented as a vector of 0s with a single 1 at that position. Here is an illustration.

sample_label = train_df["label"].iloc[1]
print(f"Original label: {sample_label}")

label_binarized = label_lookup([sample_label])
print(f"Label-binarized representation: {label_binarized}")

Output:
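
The lookup layer hides the mechanics of the encoding, so the following plain-Python sketch shows the same idea by hand. The helper names are hypothetical, and the vocabulary is simply sorted here; the Keras layer may order its vocabulary differently. Index 0 is reserved for unseen labels, mirroring the -1 entry that appears in the layer's vocabulary.

```python
def build_label_vocab(labels, oov_token=-1):
    """Index 0 is reserved for unseen labels; known labels follow in sorted order."""
    return [oov_token] + sorted(set(labels))

def one_hot(label, vocab):
    """Return a vector of 0s with a single 1 at the label's vocabulary index."""
    index = vocab.index(label) if label in vocab else 0
    return [1.0 if i == index else 0.0 for i in range(len(vocab))]

def invert_one_hot(vector, vocab):
    """Map a one-hot vector back to the label it encodes."""
    return vocab[vector.index(max(vector))]

toy_vocab = build_label_vocab([3, 1, 2, 3, 1])
encoded = one_hot(2, toy_vocab)

print(toy_vocab)                           # [-1, 1, 2, 3]
print(encoded)                             # [0.0, 0.0, 1.0, 0.0]
print(invert_one_hot(encoded, toy_vocab))  # 2
print(one_hot(7, toy_vocab))               # [1.0, 0.0, 0.0, 0.0] (unseen label)
```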

Preprocessing the Data

# We begin by looking at the distribution of sequence lengths (in words), which shows how long the documents typically are.

train_df["text"].apply(lambda x: len(x.split(" "))).describe()

Output:
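
One common use of these statistics is to pick a length cut-off that covers most documents. The sketch below, using only the standard library, shows how such percentiles can be computed; the helper name and the toy texts are assumptions for illustration, and the values are unrelated to the dataset used here.

```python
import statistics

def token_length_percentiles(texts, percentiles=(50, 90, 95, 99)):
    """Return selected percentiles of the whitespace-token count per text."""
    lengths = [len(text.split()) for text in texts]
    # quantiles(..., n=100) returns the 99 cut points between percentiles 1..99.
    cut_points = statistics.quantiles(lengths, n=100, method="inclusive")
    return {p: cut_points[p - 1] for p in percentiles}

# Toy corpus: 100 texts with 1, 2, ..., 100 words each.
toy_texts = ["word " * n for n in range(1, 101)]

length_stats = token_length_percentiles(toy_texts)
for p, value in length_stats.items():
    print(f"{p}th percentile: {value} tokens")
```

Choosing, say, the 95th percentile as a maximum length keeps almost all documents intact while limiting the influence of a few very long ones.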

max_seqlen = 107
batch_size = 128
padding_token = "<pad>"
auto = tf.data.AUTOTUNE

def make_dataset(dataframe, is_train=True):
    labels = tf.ragged.constant(dataframe["label"].values)
    label_binarized = label_lookup(labels).numpy()
    dataset = tf.data.Dataset.from_tensor_slices(
        (dataframe["text"].values, label_binarized)
    )
    dataset = dataset.shuffle(batch_size * 10) if is_train else dataset
    return dataset.batch(batch_size)


train_dataset = make_dataset(train_df, is_train=True)
validation_dataset = make_dataset(val_df, is_train=False)
test_dataset = make_dataset(test_df, is_train=False)


""" Dataset preview """

text_batch, label_batch = next(iter(train_dataset))

for i, text in enumerate(text_batch[0:5]):
    label = label_batch[i].numpy()[None, ...]
    print(f"Text: {text}")
    print(f"Label(s): {invert_multi_hot(label[0])}")
    print(" ")

Output:

"""
Vectorize the text data with a TextVectorization layer in TF-IDF mode.
"""

vocabulary = set()
train_df["text"].str.lower().str.split().apply(vocabulary.update)
vocabulary_size = len(vocabulary)
print(f"Vocabulary size = {vocabulary_size}")

text_vectorizer = layers.TextVectorization(
    max_tokens=vocabulary_size, ngrams=2, output_mode="tf_idf"
)

# The `TextVectorization` layer must be adapted to the vocabulary of our training set.
with tf.device("/CPU:0"):
    text_vectorizer.adapt(train_dataset.map(lambda text, label: text))

train_dataset = train_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)

validation_dataset = validation_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)

test_dataset = test_dataset.map(
    lambda text, label: (text_vectorizer(text), label), num_parallel_calls=auto
).prefetch(auto)

Output:
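
To give a feel for what a TF-IDF representation over unigrams and bigrams contains, here is a small hand-rolled sketch in plain Python. It uses one common textbook IDF variant, so the exact weights should not be expected to match those produced by the Keras layer; the function names and the three toy documents are made up.

```python
import math
from collections import Counter

def tokenize_with_bigrams(text):
    """Return lowercase unigrams followed by adjacent-word bigrams."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def tfidf_vectors(documents):
    """Return the sorted term list and one TF-IDF vector per document."""
    tokenized = [tokenize_with_bigrams(doc) for doc in documents]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    terms = sorted(doc_freq)
    n_docs = len(documents)
    # Smoothed inverse document frequency (a textbook variant, an assumption here).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in terms}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[t] * idf[t] for t in terms])
    return terms, vectors

toy_docs = ["the cat sat", "the dog sat", "the cat ran"]
terms, vectors = tfidf_vectors(toy_docs)

for term, weight in zip(terms, vectors[1]):
    if weight > 0:
        print(f"{term!r}: {weight:.3f}")
```

Terms that occur in every document, such as "the", receive the lowest weight, while terms specific to one document, such as "the dog", are weighted up.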

Helper Methods

Helper methods are functions or procedures that assist in performing specific tasks within a program. These methods are designed to handle repetitive or common operations, making the code more modular, readable, and maintainable.

def plot_result(history, item):
    plt.plot(history.history[item], label=item)
    plt.plot(history.history["val_" + item], label="val_" + item)
    plt.xlabel("Epochs")
    plt.ylabel(item)
    plt.title("Train and Validation {} Over Epochs".format(item), fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()

Modeling

Here, we will train the models and look at their accuracy.

1. Simple Feed Forward Network

def make_model():
    model = keras.Sequential(
        [
            layers.Dense(512, activation="relu"),
            layers.Dense(256, activation="relu"),
            layers.Dense(label_lookup.vocabulary_size(), activation="sigmoid")
        ]
    )
    return model

model1 = make_model()
model1.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])

epochs = 10
with tf.device('/device:GPU:0'):
    history = model1.fit(
        train_dataset, validation_data=validation_dataset, epochs=epochs
    )

    plot_result(history, "loss")
    plot_result(history, "binary_accuracy")

Output:

# Evaluate
with tf.device('/device:GPU:0'):
    _, binary_acc = model1.evaluate(test_dataset)
    print(f"Binary accuracy on the test set: {round(binary_acc * 100, 2)}%.")

Output:

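Fairly high training and test accuracy were observed: 99.95% and 99.09%, respectively.
One of the epochs had better validation accuracy than the last, so checkpointing the best model would help.
Given the class imbalance, it is important to check per-label performance as well. Binary accuracy is averaged over every position of the one-hot vector, so it can stay high even when some classes are predicted poorly.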
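
The gap between binary accuracy and per-sample correctness can be illustrated with a few lines of arithmetic. The sketch below defines simplified versions of the two metrics in plain Python (these are illustrative helpers, not the Keras implementations) and applies them to a degenerate model that outputs 0 everywhere. The number of outputs is a made-up value.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual vector positions that match after thresholding."""
    matches = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            matches += int((p > threshold) == (t == 1.0))
            total += 1
    return matches / total

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows whose highest-scoring position is the true class."""
    hits = 0
    for true_row, pred_row in zip(y_true, y_pred):
        hits += int(true_row.index(max(true_row)) == pred_row.index(max(pred_row)))
    return hits / len(y_true)

num_outputs = 10  # hypothetical number of output units

# One sample per class, one-hot encoded.
y_true = [[1.0 if i == c else 0.0 for i in range(num_outputs)] for c in range(num_outputs)]
# A degenerate model that outputs 0 for every position.
y_pred = [[0.0] * num_outputs for _ in range(num_outputs)]

print(binary_accuracy(y_true, y_pred))       # 0.9
print(categorical_accuracy(y_true, y_pred))  # 0.1
```

Nine out of ten positions in every one-hot vector are 0, so predicting 0 everywhere already scores 90% binary accuracy while getting only one sample in ten right. This is why the per-label breakdown that follows is more informative than the headline figure.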

## Computing metrics per class.

def per_class_accuracy(model):
    # Chain the vectorizer and the classifier so that raw strings can be passed in.
    model_for_inference = keras.Sequential([text_vectorizer, model])
    label_vocab = label_lookup.get_vocabulary()

    per_label_results = {}
    for label in label_vocab:
        if label == -1:
            # Skip the out-of-vocabulary token.
            continue
        per_label_results[label] = {'correct': 0, 'incorrect': 0}

    # Integer labels are used directly as row/column indices,
    # so one extra slot is allocated.
    num_labels = len(per_label_results) + 1
    accuracy_map = np.zeros((num_labels, num_labels), dtype=np.int64)

    inference_dataset = make_dataset(test_df, is_train=False)

    for text_batch, label_batch in inference_dataset:
        # Make predictions for the whole batch.
        with tf.device('/device:GPU:0'):
            predicted_probabilities = model_for_inference.predict(text_batch)

        for j in range(len(predicted_probabilities)):
            label_gt = invert_multi_hot(label_batch[j].numpy())[0]

            # The predicted label is the vocabulary entry with the highest probability.
            label_predicted = label_vocab[int(np.argmax(predicted_probabilities[j]))]

            # Rows are true labels, columns are predicted labels.
            accuracy_map[label_gt][label_predicted] += 1

            if label_predicted == label_gt:
                per_label_results[label_gt]['correct'] += 1
            else:
                per_label_results[label_gt]['incorrect'] += 1

    return per_label_results, accuracy_map

def print_per_class_accuracy(per_label_results):
    for k, v in per_label_results.items():
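        correct = v['correct']
        incorrect = v['incorrect']
        total = correct + incorrect
        # Guard against labels that have no samples in the test set.
        accuracy = correct / total * 100 if total else float('nan')
        print(f"Label = {k}, Test Accuracy = {accuracy:.2f}%")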
    
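def plot_accuracy_map(accuracy_map):
    sns.set(font_scale=1.2)
    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(accuracy_map, ax=ax)
    # Rows hold the true labels and columns the predicted labels.
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')
    ax.set_title('Confusion Matrix on the Test Set')
    plt.show()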


per_label_results, accuracy_map = per_class_accuracy(model1)
print_per_class_accuracy(per_label_results)
plot_accuracy_map(accuracy_map)

Output:

The model achieves relatively high test accuracy for most labels. Label 6 has a lower accuracy of 88.89%, and label 8 is lower still at 77.78%, indicating that the model struggles to classify instances of these labels correctly. Labels 4 and 5 reach 85.71% and 80.00%, respectively, which also leaves room for improvement.
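
The per-label accuracy reported above is the share of each true label's samples that were predicted correctly, which is usually called recall. Precision, the share of predictions for a label that were actually correct, can be read from the same confusion matrix. The following plain-Python sketch shows both calculations on a made-up 3x3 matrix; the helper name is hypothetical.

```python
def per_class_precision_recall(confusion):
    """confusion[i][j] counts samples of true class i predicted as class j."""
    n = len(confusion)
    results = {}
    for c in range(n):
        true_positives = confusion[c][c]
        actual = sum(confusion[c])                          # row sum
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        recall = true_positives / actual if actual else float("nan")
        precision = true_positives / predicted if predicted else float("nan")
        results[c] = {"precision": precision, "recall": recall}
    return results

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [
    [8, 1, 0],
    [2, 7, 1],
    [0, 0, 5],
]

metrics = per_class_precision_recall(toy_confusion)
for label, values in metrics.items():
    print(f"class {label}: precision={values['precision']:.3f}, recall={values['recall']:.3f}")
```

A label can have high recall but low precision (or the reverse), so looking at both gives a fuller picture of where the classifier confuses classes.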

2. Deeper FNN

"""
1. Deeper network
2. Add regularization with dropout
"""
dropout_rate = 0.5

model2 = keras.Sequential(
    [
        layers.Dense(512, activation="relu"),
        layers.Dropout(rate=dropout_rate),
        layers.Dense(256, activation="relu"),
        layers.Dropout(rate=dropout_rate),
        layers.Dense(128, activation="relu"),
        layers.Dense(label_lookup.vocabulary_size(), activation="sigmoid")
    ]
)

model2.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])

epochs = 10
with tf.device('/device:GPU:0'):
    history = model2.fit(
        train_dataset, validation_data=validation_dataset, epochs=epochs
    )

    plot_result(history, "loss")
    plot_result(history, "binary_accuracy")

Output:

# Evaluate
with tf.device('/device:GPU:0'):
    _, binary_acc = model2.evaluate(test_dataset)
    print(f"Binary accuracy on the test set: {round(binary_acc * 100, 2)}%.")
    
per_label_results, accuracy_map = per_class_accuracy(model2)
print_per_class_accuracy(per_label_results)
plot_accuracy_map(accuracy_map)

Output:

The deeper model reaches a binary accuracy of 99.05% on the test set, essentially the same as the simpler network (99.09%).

Most labels achieve high accuracy, although labels 8 and 5 remain lower than the rest. Further analysis and refinement of the model could improve its performance across all labels.
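
The deeper network relies on dropout for regularization. As a rough illustration of what a dropout layer does, the plain-Python sketch below implements "inverted" dropout: during training each activation is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate), while at inference time the input passes through unchanged. The function is a simplified stand-in written for this example, not the Keras layer itself.

```python
import random

def dropout(activations, rate, training=True, seed=None):
    """Zero each activation with probability `rate`; scale the rest by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

toy_activations = [1.0] * 10000

dropped = dropout(toy_activations, rate=0.5, seed=42)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)

print(f"Fraction kept during training: {kept_fraction:.3f}")
print(f"Mean activation during training: {mean_activation:.3f}")
print(dropout([0.2, 0.4], rate=0.5, training=False))  # [0.2, 0.4]
```

Because the surviving activations are scaled up, the expected activation stays about the same during training and inference, so no rescaling is needed when the model is evaluated.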

Conclusion

Machine learning offers an effective way to organize and retrieve information through automated document classification. Automating the categorization process helps organizations streamline their operations, improve decision-making, and unlock the value held in their document repositories. As the technology advances and its remaining challenges are addressed, document classification systems are likely to become increasingly sophisticated and accurate.
