Digit Recognition Using Machine Learning

In today's digital era, where vast amounts of data are generated every second, the ability to accurately recognize and classify digits holds immense value. Whether it's automated form processing, optical character recognition, or even enhancing user experience in various applications, digit recognition plays a vital role.

At its core, digit recognition is the process of identifying and classifying handwritten or printed digits. Traditionally, this task required complex algorithms and extensive manual effort. However, with the advent of machine learning techniques, we can now automate this process by training models on large datasets of labelled digits.

Machine learning algorithms learn useful features directly from raw input data and pick up the patterns that connect them. For digit recognition, Convolutional Neural Networks (CNNs) have become the standard tool. CNNs are loosely inspired by the human visual system and are well suited to detecting patterns within images. Trained on datasets of labelled digits, they learn to identify and classify digits accurately.

Applications of Digit Recognition Using Machine Learning

Digit recognition has a wide range of applications across various industries. In the banking sector, it enables automated check processing, making transactions faster and more efficient. In postal services, it plays a crucial role in automating sorting processes by recognizing postal codes. Moreover, digit recognition is leveraged in the field of document analysis, where it aids in extracting information from forms, invoices, and other handwritten documents. It has also found applications in the field of healthcare, where it assists in analyzing medical records and prescription digitization.

We will now implement digit recognition with a CNN in Python using Keras.

Code:

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline

np.random.seed(2)

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools

from keras.utils import to_categorical # convert to one-hot-encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau


sns.set(style='white', context='notebook', palette='deep')

Data Preparation

Data preparation is a crucial step in any machine-learning project. It involves transforming raw data into a format that is suitable for analysis and modelling. This process ensures that the data is clean, consistent, and ready to be used by machine learning algorithms.

Loading the Data

# Load the data
train = pd.read_csv("../input/train.csv")

Y_train = train["label"]

# Drop the 'label' column
X_train = train.drop(labels = ["label"],axis = 1) 

# free some space
del train 

g = sns.countplot(x=Y_train)

Y_train.value_counts()

Output:

We have approximately equal frequencies for each of the 10 digits.

Checking for Null and Missing Values

Before going further, we check the data for null and missing values.

# Check the data
X_train.isnull().any().describe()

Output:

We examine the dataset for any corrupted images or missing values. Fortunately, there are no missing values in the dataset, allowing us to proceed confidently.

Normalization

We apply grayscale normalization to minimize the impact of variations in illumination. Additionally, Convolutional Neural Networks (CNNs) tend to converge more quickly when trained on data ranging from 0 to 1 rather than 0 to 255.

# Normalize the data
X_train = X_train / 255.0

Reshape

# Reshape image in 3 dimensions (height = 28px, width = 28px , channel = 1)
X_train = X_train.values.reshape(-1,28,28,1)

The training images, initially represented as 1D vectors of 784 values, have been stored in a pandas DataFrame. We then reshape the data into 3D matrices of size 28x28x1.

Grayscale images such as those in the MNIST dataset need only one channel. RGB images have three colour channels, so a 28x28 RGB image holds 28 x 28 x 3 = 2,352 values, and we would reshape those vectors into 3D matrices of size 28x28x3.
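
The normalization and reshape steps can be tried on made-up data without loading the CSV. The sketch below is only an illustration: the arrays `flat_gray` and `flat_rgb` are invented stand-ins for real pixel data.

```python
import numpy as np

# Three fake flattened grayscale images, 784 pixel values each (0-255).
flat_gray = np.random.default_rng(0).integers(0, 256, size=(3, 784))

# Scale to [0, 1], then reshape to (samples, height, width, channels).
gray = (flat_gray / 255.0).reshape(-1, 28, 28, 1)
print(gray.shape)  # (3, 28, 28, 1)

# An RGB image of the same size holds 28 * 28 * 3 = 2352 values.
flat_rgb = np.zeros((3, 28 * 28 * 3))
rgb = flat_rgb.reshape(-1, 28, 28, 3)
print(rgb.shape)  # (3, 28, 28, 3)
```

The leading -1 lets NumPy infer the number of images from the total number of values.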

Label Encoding

# Encode labels to one hot vector (ex : 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes = 10)

The labels in our dataset are single digits from 0 to 9, giving 10 classes. To process these labels in our machine-learning model, we need to encode them as one-hot vectors. For example, the label '2' would be encoded as [0, 0, 1, 0, 0, 0, 0, 0, 0, 0], where the '1' marks the position of the corresponding digit and the other positions are filled with '0's.
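
To make the encoding concrete, here is a minimal NumPy sketch of one-hot encoding. The helper `one_hot` is a hypothetical name used for illustration; it is not the Keras implementation, although it should produce the same kind of array for integer labels.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]

encoded = one_hot([2, 0, 9])
print(encoded[0])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]

# argmax recovers the original class indices.
print(encoded.argmax(axis=1))  # [2 0 9]
```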

Splitting the Dataset

# Set the random seed
random_seed = 2

# Split the train and the validation set for the fitting
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=random_seed)

We have decided to divide the training set into two parts: a small portion (10%) will be used as the validation set to evaluate the model, and the remaining portion (90%) will be used to train the model.

Since we have a total of 42,000 training images with balanced labels, a random split of the training set will not result in any labels being overrepresented in the validation set. However, it is important to note that when working with unbalanced datasets, a simple random split could lead to inaccurate evaluation during the validation process.

To address this issue, you can pass the labels to the "stratify" option of the "train_test_split" function, for example "stratify=Y_train" (available in sklearn versions >=0.17), so that the class distribution is maintained in both the training and validation sets.
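
To show what stratification does, the sketch below splits an imbalanced toy label array class by class. The helper `stratified_split` and the toy labels are assumptions made for illustration; in practice scikit-learn does this for you.

```python
import numpy as np

def stratified_split(labels, val_fraction=0.1, seed=2):
    """Return (train_idx, val_idx) with val_fraction of each class held out."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then cut off the validation share.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_val = int(round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

# Imbalanced toy labels: 900 zeros and 100 ones.
labels = np.array([0] * 900 + [1] * 100)
train_idx, val_idx = stratified_split(labels)
print(np.bincount(labels[val_idx]))  # [90 10]
```

Both classes keep their 9:1 ratio in the validation set, which a plain random split does not guarantee.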

A good way to understand an example is to display the image next to its label. The image shows the pixel data the model will see, and the label tells us which digit it is meant to be.

# Some examples
g = plt.imshow(X_train[0][:,:,0])

Output:

Model

We employed the Sequential API in Keras, which allows us to add one layer at a time, starting from the input.

The first layer is the convolutional (Conv2D) layer, consisting of a set of learnable filters. We opted for 32 filters in the first two Conv2D layers and 64 filters in the remaining two. Each filter applies a transformation to a portion of the image, defined by the kernel size. The kernel filter matrix is then applied to the entire image. These filters can be viewed as a way to transform the image.

The CNN can extract relevant features from these transformed images, which are represented as feature maps.
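
The filtering step can be sketched in plain NumPy. This is a simplified illustration, not the Keras implementation: it handles one channel, uses no padding (the model in this article pads its inputs), and the edge-detecting kernel is hand-made, whereas a CNN learns its kernel values during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the patch by the kernel element-wise and sum the result.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-made vertical-edge filter.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map[0])  # [0. 3. 3. 0.]
```

The feature map responds only where the window straddles the dark-to-bright edge.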

The next crucial layer in CNN is the pooling (MaxPool2D) layer. This layer acts as a downsampling filter by selecting the maximum value from neighbouring pixels. It helps reduce computational complexity and, to some extent, mitigates overfitting. The pooling size, determining the area pooled at each step, affects the level of downsampling. By combining convolutional and pooling layers, CNNs are capable of capturing both local and global features of the image.
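
As a rough illustration of 2x2 max pooling, the NumPy sketch below halves each dimension of a small feature map. The function name `max_pool_2x2` and the sample values are made up for this example.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D array by taking the max of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]  # drop an odd last row/column
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])
print(max_pool_2x2(fm))
# [[4 2]
#  [2 8]]
```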

To prevent overfitting, we incorporated dropout regularization. This technique randomly ignores a proportion of nodes in a layer (setting their outputs to zero) during training. It introduces randomness, which forces the network to learn features in a distributed manner and improves generalization.
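
A minimal sketch of the idea, assuming the common "inverted dropout" formulation in which surviving units are rescaled so the expected activation stays the same. The function below is illustrative and is not the Keras layer.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is switched off at prediction time
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000,))
dropped = dropout(x, rate=0.25, rng=rng)

# Roughly a quarter of the units are zeroed; survivors are scaled by 1 / 0.75.
print((dropped == 0).mean(), dropped.mean())
```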

The activation function 'relu' (rectifier) introduces non-linearity to the network, enhancing its learning capacity. The Flatten layer is used to convert the final feature maps into a 1D vector. This flattening step is necessary to utilize fully connected layers after convolutional and pooling layers. It combines all the local features identified by the preceding layers.

Finally, we employed two fully connected (Dense) layers, which act as a conventional artificial neural network (ANN) classifier on top of the extracted features. The last layer (Dense(10, activation="softmax")) outputs a probability distribution over the 10 classes.
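
The two activation functions mentioned above are short enough to write out. This is a NumPy sketch for illustration; the example scores are invented.

```python
import numpy as np

def relu(x):
    """Rectifier: keep positive values, replace negatives with zero."""
    return np.maximum(x, 0.0)

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, -1.0, 0.5]])
probs = softmax(scores)
print(relu(scores))        # negative score becomes 0
print(probs, probs.sum())  # probabilities over the three classes
```

The largest score always receives the largest probability, so taking the argmax of the softmax output gives the predicted class.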

# Set the CNN model 
# my CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*2 -> Flatten -> Dense -> Dropout -> Out

model = Sequential()

model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'same', 
                 activation ='relu', input_shape = (28,28,1)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))


model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'same', 
                 activation ='relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'same', 
                 activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))


model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))
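
As a sanity check on the architecture above, the number of trainable parameters can be worked out by hand. The sketch below assumes 'same' padding (so only the pooling layers shrink the image) and one bias per filter or unit; the helper names are made up for this example.

```python
def conv_params(kernel, in_channels, filters):
    """Weights of a square kernel over all input channels, plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair, plus one bias per unit."""
    return inputs * units + units

side = 28
counts = [conv_params(5, 1, 32), conv_params(5, 32, 32)]
side //= 2  # first 2x2 max pooling: 28 -> 14
counts += [conv_params(3, 32, 64), conv_params(3, 64, 64)]
side //= 2  # second 2x2 max pooling: 14 -> 7
flat = side * side * 64  # size of the flattened feature vector
counts += [dense_params(flat, 256), dense_params(256, 10)]

print(flat, sum(counts))  # 3136 887530
```

Most of the parameters sit in the first Dense layer, which connects the 3,136 flattened features to 256 units.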

Setting Optimizer and Annealer

Once we have added the layers to our model, we need to configure a score function, a loss function, and an optimization algorithm.

The loss function measures how far the model's predictions are from the true labels of the images: the lower the loss, the better the predicted probabilities match the observed labels. For classification tasks with more than two classes, we use the loss function called "categorical_crossentropy".
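
For intuition, categorical cross-entropy can be sketched in a few lines of NumPy. This follows the textbook definition (mean negative log-probability of the true class); the clipping constant `eps` and the toy predictions are assumptions made for the example.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# One-hot labels for two samples over three classes.
y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)
confident = np.array([[0.05, 0.05, 0.90], [0.80, 0.10, 0.10]])
unsure = np.array([[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Both sets of predictions pick the right class, but the confident one earns a lower loss because it puts more probability on the true label.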

The optimizer iteratively adjusts the parameters of the model, such as filter kernel values, weights, and biases of neurons, to minimize the loss. We have chosen RMSprop as our optimizer, which adapts the step size for each parameter. The RMSProp update is a modification of the Adagrad method that aims to reduce its aggressive, monotonically decreasing learning rate. Alternatively, we could have used a Stochastic Gradient Descent (SGD) optimizer, but it tends to converge more slowly than RMSprop.
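
The RMSprop update can be sketched as follows. This is the textbook form of the rule applied to a toy quadratic, not the Keras source; details such as where epsilon is added differ between libraries, and the function name and learning rate used here are illustrative.

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update: scale the step by a running average of squared gradients."""
    cache = rho * cache + (1.0 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Minimize f(w) = sum(w ** 2), whose gradient is 2 * w.
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
for _ in range(500):
    w, cache = rmsprop_step(w, 2.0 * w, cache, lr=0.01)

print(w)  # both coordinates end up close to 0
```

Because the running average decays old gradients, the effective step size does not shrink monotonically the way it does in Adagrad.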

The metric function "accuracy" is used to evaluate the performance of our model. It measures how well the model predicts the correct labels. It is important to note that the results from the metric evaluation are not used during the training of the model; they are only used for evaluation purposes.

# Define the optimizer
optimizer = RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-08)

# Compile the model
model.compile(optimizer = optimizer , loss = "categorical_crossentropy", metrics=["accuracy"])

To facilitate faster convergence of the optimizer towards the global minimum of the loss function, we implemented an annealing method for the learning rate (LR).

The learning rate determines the size of the steps taken by the optimizer as it navigates the landscape of the loss function. A higher learning rate results in larger steps and faster initial progress. However, with a high learning rate the optimizer samples the loss surface coarsely: it can overshoot good minima or settle in a poor local minimum.

To overcome this, we employed a decreasing learning rate during training to ensure more efficient convergence towards the global minimum of the loss function.

To keep the speed benefit of a high learning rate, we reduced the learning rate dynamically, checking every few epochs and lowering it only when necessary, that is, when the validation accuracy stopped improving.

We used the ReduceLROnPlateau callback from the keras.callbacks module, which halves the learning rate if the validation accuracy has not improved after 3 epochs. This approach helped us fine-tune the learning rate and optimize the model's performance.

# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)

epochs = 1 # Turn epochs to 30 to get 0.9967 Accuracy
batch_size = 86
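
The annealing rule configured above can be sketched in plain Python. This is a simplified model of the behaviour described in the text, not the Keras callback itself (which has further options such as a minimum improvement threshold and a cooldown); the function name and the accuracy values are invented for the example.

```python
def reduce_lr_on_plateau(val_accuracies, lr=0.001, factor=0.5, patience=3, min_lr=1e-5):
    """Return the learning rate used at each epoch.

    The rate is multiplied by `factor` after `patience` consecutive epochs
    without a new best validation accuracy, and never drops below `min_lr`.
    """
    best = float("-inf")
    wait = 0
    schedule = []
    for acc in val_accuracies:
        schedule.append(lr)
        if acc > best:
            best, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule

val_acc_per_epoch = [0.97, 0.98, 0.98, 0.98, 0.98, 0.985, 0.985]
schedule = reduce_lr_on_plateau(val_acc_per_epoch)
print(schedule)  # [0.001, 0.001, 0.001, 0.001, 0.001, 0.0005, 0.0005]
```

The rate stays at 0.001 while accuracy improves, then halves after three epochs stuck at 0.98.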

Data Augmentation

To address the issue of overfitting, we employed data augmentation techniques to expand our existing handwritten digit dataset. This approach involved artificially altering the training data through various transformations to replicate the variations that occur when someone writes a digit.

For instance, we accounted for scenarios where the number was not centred, the scale varied (some individuals write with larger or smaller numbers), or the image was rotated.

Data augmentation techniques involve modifying the training data while keeping the label the same, thereby changing the array representation. Popular augmentations include grayscaling, horizontal and vertical flips, random crops, colour jitters, translations, rotations, and more.

By applying just a few of these transformations to our training data, we significantly increased the number of training examples, effectively doubling or even tripling the dataset. This augmentation process enhanced the robustness of our model, enabling it to better generalize and mitigate the risk of overfitting.

Note: The improvement achieved through data augmentation is substantial. When training the model without data augmentation, we obtained an accuracy of 98.114%. However, by implementing data augmentation techniques, we were able to significantly enhance the model's performance, resulting in an impressive accuracy of 99.67%.

# Without data augmentation, I obtained an accuracy of 0.98114
#history = model.fit(X_train, Y_train, batch_size = batch_size, epochs = epochs, 
#          validation_data = (X_val, Y_val), verbose = 2)


# With data augmentation to prevent overfitting (accuracy 0.99286)

datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False)  # randomly flip images


datagen.fit(X_train)

To augment the training data, we implemented several transformations to introduce variations and increase the diversity of the dataset. Specifically, we chose the following augmentation techniques:

Random Rotation: We randomly rotated some training images by up to 10 degrees. This helps the model learn to recognize digits from different orientations and improves its robustness to variations in writing styles.
Random Zoom: We randomly applied a zoom effect to some training images, increasing or decreasing their size by up to 10%. This variation enables the model to better handle different scales at which the digits may appear in real-world scenarios.
Random Horizontal Shift: We randomly shifted images horizontally by up to 10% of their width. This simulates variations in the positioning of the digits within the image and enhances the model's ability to accurately classify digits regardless of their horizontal placement.
Random Vertical Shift: Similarly, we randomly shifted images vertically by up to 10% of their height. This introduces variations in the vertical positioning of the digits and helps the model generalize well to different vertical alignments.

We made a deliberate choice not to apply vertical or horizontal flips to the images. Digits are not flip-invariant: a 6 flipped both vertically and horizontally looks like a 9, so flipped images could end up with misleading labels and cause misclassification. By excluding flips, we ensure that the model learns the distinctive features of each digit in its normal orientation.
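
To illustrate one of these transformations, here is a small NumPy sketch of a random shift. It is an assumption-laden stand-in for what the generator does: it pads the vacated pixels with zeros, whereas ImageDataGenerator fills them according to its fill_mode setting, and the function name is made up for this example.

```python
import numpy as np

def random_shift(image, max_fraction, rng):
    """Shift a 2D image by a random offset of up to max_fraction of its size."""
    h, w = image.shape
    max_dy, max_dx = int(h * max_fraction), int(w * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    shifted = np.zeros_like(image)
    # Copy the part of the image that stays inside the frame after the shift.
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted

rng = np.random.default_rng(0)
digit = np.zeros((28, 28))
digit[10:18, 10:18] = 1.0  # a square standing in for a centred digit

augmented = [random_shift(digit, 0.1, rng) for _ in range(5)]
print([a.shape for a in augmented])
```

Each call returns a new image with the same label, which is how augmentation enlarges the effective training set.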

# Fit the model
# model.fit accepts generators directly; fit_generator is deprecated in recent Keras versions
history = model.fit(datagen.flow(X_train,Y_train, batch_size=batch_size),
                    epochs = epochs, validation_data = (X_val,Y_val),
                    verbose = 2, steps_per_epoch=X_train.shape[0] // batch_size,
                    callbacks=[learning_rate_reduction])

Output:

Evaluating the Model

To evaluate the performance of our model, we used the validation set, which contains a separate set of images that the model has not seen during training. This allows us to assess how well the model generalizes to new and unseen data.

Training and Validation Curves

# Plot the loss and accuracy curves for training and validation 
fig, ax = plt.subplots(2,1)
ax[0].plot(history.history['loss'], color='b', label="Training loss")
ax[0].plot(history.history['val_loss'], color='r', label="Validation loss")
legend = ax[0].legend(loc='best', shadow=True)

ax[1].plot(history.history['accuracy'], color='b', label="Training accuracy")
ax[1].plot(history.history['val_accuracy'], color='r',label="Validation accuracy")
legend = ax[1].legend(loc='best', shadow=True)

Output:

The model performs well, reaching an accuracy of nearly 99% on the validation dataset after just 2 epochs. Notably, the validation accuracy stays above the training accuracy throughout training. This is expected here, because dropout and data augmentation are applied only during training, and it indicates that the model is generalizing rather than overfitting the training set.

Confusion Matrix

Analyzing the confusion matrix enables us to identify specific areas where the model may struggle or encounter challenges. This information helps us understand the model's limitations and provides guidance for potential improvements.

To achieve this, we plot the confusion matrix based on the validation results, allowing us to visualize the model's performance and identify any patterns or trends in misclassifications.
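
What the confusion matrix contains can be shown with a hand-rolled NumPy version. This sketch is for illustration only: the function name and the toy labels are made up.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, num_classes):
    """cm[i, j] counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulates correctly for repeated pairs
    return cm

y_true = np.array([4, 4, 9, 9, 4])
y_pred = np.array([4, 9, 9, 9, 4])
cm = confusion_matrix_np(y_true, y_pred, num_classes=10)
print(cm[4, 4], cm[4, 9], cm[9, 9])  # 2 1 2
```

Correct predictions sit on the diagonal, and each off-diagonal cell counts one specific kind of mistake, here a single 4 read as a 9.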

# Look at the confusion matrix 

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1) 
# Convert one-hot validation labels to class indices
Y_true = np.argmax(Y_val,axis = 1) 
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10)) 

Output:

In our evaluation of the CNN model, we observe excellent performance across all digits with minimal errors, considering the size of the validation set, which consists of 4,200 images.

However, we do notice a slight challenge for our CNN when classifying the digit 4, as it occasionally misclassifies it as 9. This can be attributed to the inherent difficulty in distinguishing between these two digits when their curves are smooth and visually similar.

Despite this minor issue, overall, our CNN demonstrates impressive accuracy and proficiency in recognizing and classifying the various digits in the dataset.

Let's examine the errors more closely.

Our aim is to identify the most significant errors by examining the disparity between the probabilities of the actual values and the predicted values in the results. This will allow us to pinpoint the instances where the model's predictions deviate the most from the true values.
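
The ranking idea can be tried on a toy example first. The sketch below uses invented probabilities for four images over three classes and picks out the true-class probabilities with NumPy index arrays, which is one possible way to do the lookup.

```python
import numpy as np

# Toy softmax outputs for four validation images over three classes.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.60, 0.30],
                  [0.50, 0.40, 0.10],
                  [0.05, 0.05, 0.90]])
true = np.array([0, 2, 1, 0])

pred = probs.argmax(axis=1)
wrong = np.flatnonzero(pred != true)   # indices of misclassified images

pred_prob = probs[wrong, pred[wrong]]  # confidence in the wrong answer
true_prob = probs[wrong, true[wrong]]  # probability given to the right answer
gap = pred_prob - true_prob

# Largest gap first: the model was most confidently wrong on these images.
worst_first = wrong[np.argsort(gap)[::-1]]
print(worst_first)  # [3 1 2]
```

Image 3 ranks first because the model gave 0.90 to the wrong class and only 0.05 to the true one.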

# Display some error results 

# Errors are the differences between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)

Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]

def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis=1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors 
most_important_errors = sorted_delta_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)

Output:

The most crucial errors are also the most intriguing. In these six cases, the model's performance is not absurd. Some of these errors could also be made by humans, particularly in the case of one instance where a 9 closely resembles a 4. The last 9 is also quite misleading, as it appears to be more like a 0, in my opinion.

Challenges and Future Aspects of Digit Recognition Using Machine Learning

While machine learning-based digit recognition has achieved remarkable success, challenges still exist. One significant challenge is the ability to handle variations in writing styles, especially when dealing with handwritten digits. Ongoing research focuses on improving model robustness and addressing these challenges. Researchers are exploring techniques such as data augmentation, where the training dataset is artificially expanded to include variations in writing styles, scale, and orientation. Furthermore, advancements in deep learning, such as the integration of recurrent neural networks, hold promise for enhancing digit recognition accuracy.

Conclusion

Digit recognition using machine learning has revolutionized various industries by automating and streamlining processes that involve the identification and classification of digits. With the power of convolutional neural networks and other machine learning algorithms, we have witnessed significant advancements in digit recognition accuracy. As research and technology continue to evolve, we can expect even more sophisticated models capable of handling complex variations in writing styles. Digit recognition is undoubtedly a field that will continue to thrive, making significant contributions to diverse applications and shaping the future of artificial intelligence.
