Image Captioning Using Machine Learning

The convergence of natural language processing and computer vision has produced notable progress in image captioning, the task of producing accurate and descriptive written accounts of images. By applying machine learning techniques, image captioning systems can deliver rich, contextually appropriate captions, which improves the accessibility and comprehension of visual material.

Image captioning is the practice of automatically creating written descriptions for photos. It goes beyond typical image recognition algorithms, which only produce labels or tags for pictures; instead, it generates human-like descriptions that capture the relationships, content, and context of the image. Doing so requires integrating computer vision models for image interpretation with natural language processing models for text generation.

Applications of Image Captioning

Image captioning has numerous real-world applications across various domains:

  • Accessibility: By providing detailed textual descriptions of visual content, such as social media posts, news articles, and instructional materials, image captioning makes visual content accessible to people with visual impairments.
  • Content Understanding: By allowing users to search, explore, and navigate vast collections of images using textual queries, image captioning improves content understanding and retrieval in multimedia applications.
  • Assistive Technologies: Image captioning supports applications such as augmented reality, navigation assistance, and scene interpretation in assistive technologies and human-computer interaction.

Code:

Now we will build a model that captions images, working through each step below.

Importing Libraries
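
The article does not reproduce the import statements. A minimal set for the steps that follow, assuming a TensorFlow 2.x / Keras environment, might look like this:

```
# Libraries assumed for this tutorial (TensorFlow/Keras 2.x).
import string
import pickle

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
```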

Preprocess Image Data
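
The pre-processing code itself is not shown in the article. Below is a rough sketch of the feature-extraction step it describes (a VGG16 model with the output layer removed). The directory name 'photos' and the output file 'features.pkl' are placeholders chosen for this sketch.

```
from os import listdir
from pickle import dump
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def extract_features(directory):
    """Extract a 4,096-element feature vector for every photo in a directory."""
    model = VGG16()
    # Drop the final classification layer; keep the 4,096-unit fc2 output.
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    features = {}
    for name in listdir(directory):
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        image = img_to_array(image)
        image = image.reshape((1,) + image.shape)
        image = preprocess_input(image)
        feature = model.predict(image, verbose=0)
        image_id = name.split('.')[0]
        features[image_id] = feature
    return features

# 'photos' is a placeholder for the directory holding the image files.
features = extract_features('photos')
dump(features, open('features.pkl', 'wb'))
```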


Preparing Text Data

Clean the descriptions with the following steps:

  • Make every word lowercase.
  • Eliminate all punctuation.
  • Eliminate any words that are one character or shorter, such as "a."
  • Eliminate any word that contains a number.

We aim for a vocabulary that is as small and as expressive as possible. A smaller vocabulary produces a smaller model that trains faster. To get a sense of the size of our dataset's vocabulary, we can convert the cleaned descriptions into a set and print its size; sets hold no duplicate values and are backed by a hash table, so this is efficient. The result is a short yet expressive vocabulary. A sketch of the cleaning and vocabulary code follows.
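
The cleaning code is not reproduced in the article. The sketch below applies the steps listed above, assuming the descriptions are held in a dictionary mapping an image identifier to a list of caption strings; saving them to 'descriptions.txt' is this sketch's own choice so that later steps can reload them.

```
import string

def clean_descriptions(descriptions):
    """Lowercase, strip punctuation, drop one-character words
    and words containing numbers."""
    table = str.maketrans('', '', string.punctuation)
    for image_id, desc_list in descriptions.items():
        for i, desc in enumerate(desc_list):
            words = desc.split()
            words = [w.lower() for w in words]
            words = [w.translate(table) for w in words]
            words = [w for w in words if len(w) > 1]
            words = [w for w in words if w.isalpha()]
            desc_list[i] = ' '.join(words)

def to_vocabulary(descriptions):
    """Build the set of all words used across the cleaned descriptions."""
    vocab = set()
    for desc_list in descriptions.values():
        for desc in desc_list:
            vocab.update(desc.split())
    return vocab

def save_descriptions(descriptions, filename):
    """Persist the cleaned descriptions, one 'image_id caption' per line."""
    lines = []
    for image_id, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(image_id + ' ' + desc)
    with open(filename, 'w') as f:
        f.write('\n'.join(lines))

# 'descriptions' is assumed to be a dict of {image_id: [caption, ...]}.
# clean_descriptions(descriptions)
# vocabulary = to_vocabulary(descriptions)
# print('Vocabulary Size: %d' % len(vocabulary))
# save_descriptions(descriptions, 'descriptions.txt')
```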


Loading the Data
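
The loading code is not shown either. The sketch below assumes the cleaned descriptions and photo features were saved to 'descriptions.txt' and 'features.pkl' by the earlier sketches (both filenames are placeholders), and that a set of training image identifiers is available. Each caption is wrapped in startseq/endseq markers, matching the generated caption shown at the end of the article.

```
from pickle import load

def load_clean_descriptions(filename, dataset):
    """Load cleaned descriptions for the given image ids,
    wrapping each one in startseq/endseq markers."""
    descriptions = {}
    with open(filename, 'r') as f:
        for line in f.read().strip().split('\n'):
            tokens = line.split()
            image_id, image_desc = tokens[0], tokens[1:]
            if image_id in dataset:
                descriptions.setdefault(image_id, []).append(
                    'startseq ' + ' '.join(image_desc) + ' endseq')
    return descriptions

def load_photo_features(filename, dataset):
    """Load the pre-computed VGG16 features for the given image ids."""
    all_features = load(open(filename, 'rb'))
    return {k: all_features[k] for k in dataset}

# 'train' is assumed to be a set of training image identifiers.
# train_descriptions = load_clean_descriptions('descriptions.txt', train)
# train_features = load_photo_features('features.pkl', train)
```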


Before the description text can be fed to the model or compared with the model's predictions, it must be encoded as numbers. The first step is to create a consistent mapping from words to unique integer values. The Tokenizer class provided by Keras can learn this mapping from the loaded description data.

The functions to_lines(), which converts the dictionary of descriptions into a list of strings, and create_tokenizer(), which fits a tokenizer on the loaded photo description text, are defined below.
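
A sketch of these two helpers, using the Keras Tokenizer described above:

```
from tensorflow.keras.preprocessing.text import Tokenizer

def to_lines(descriptions):
    """Flatten the dictionary of descriptions into a single list of strings."""
    lines = []
    for desc_list in descriptions.values():
        lines.extend(desc_list)
    return lines

def create_tokenizer(descriptions):
    """Fit a Keras Tokenizer on all of the loaded description text."""
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# tokenizer = create_tokenizer(train_descriptions)
# vocab_size = len(tokenizer.word_index) + 1
# print('Vocabulary Size: %d' % vocab_size)
```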


The text can now be encoded. Each description will be split into words. The model will be given one word and the photo, and it will generate the next word. Then the first two words of the description will be fed to the model together with the image to generate the next word, and so on. This is how the model will be trained.
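
This word-by-word scheme can be implemented roughly as follows. The name create_sequences() is this sketch's own choice, and max_length is the fixed caption length (34 words in the model described below).

```
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(tokenizer, max_length, descriptions, photos, vocab_size):
    """Expand each caption into (photo, partial caption) -> next-word samples."""
    X1, X2, y = [], [], []
    for image_id, desc_list in descriptions.items():
        for desc in desc_list:
            seq = tokenizer.texts_to_sequences([desc])[0]
            for i in range(1, len(seq)):
                # Input is the prefix of the caption, output is the next word.
                in_seq, out_seq = seq[:i], seq[i]
                in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                X1.append(photos[image_id][0])
                X2.append(in_seq)
                y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)
```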

Defining the Model

The whole model is divided into 3 parts:

  • Photo Feature Extractor: A 16-layer VGG model pre-trained on the ImageNet dataset. The photos are pre-processed with this VGG model (with the output layer removed), and the extracted features it predicts are used as input.
  • Sequence Processor: A word embedding layer that handles the text input, followed by a recurrent neural network layer with long short-term memory (LSTM) units.
  • Decoder: Both the feature extractor and the sequence processor output a fixed-length vector. These are merged and processed by a Dense layer to make the final prediction. The Photo Feature Extractor model expects the input photo features as a vector with 4,096 elements; a Dense layer processes them into a 256-element representation of the image.

The Sequence Processor model feeds input sequences into an Embedding layer that utilizes a mask to disregard padding values. The input sequences have a pre-defined length of 34 words. An LSTM layer with 256 memory units comes next.

Both input models produce a 256-element vector, and both use 50% dropout regularization. Because this configuration learns quickly, the dropout helps reduce overfitting on the training dataset.

Using an addition operation, the Decoder model merges the vectors from the two input models. The result is passed to a Dense layer with 256 neurons and then to a final output Dense layer, which predicts the next word in the sequence with a softmax over the whole output vocabulary. A sketch of this model definition follows.
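
Putting the three parts together, the model described above can be sketched as below. The function name define_model() is chosen for this sketch; the layer sizes follow the description in the text.

```
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add

def define_model(vocab_size, max_length):
    """Merge model: photo features + caption prefix -> next word."""
    # Photo Feature Extractor: 4,096-element VGG16 features -> 256 elements.
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # Sequence Processor: embedding (masking padding) followed by an LSTM.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Decoder: add the two 256-element vectors, then predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

# model = define_model(vocab_size, max_length)
# model.summary()
```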


Evaluate the Model

Let us have a look at the accuracy of the model.
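
The evaluation code is not shown, and the article does not name the metric beyond "accuracy". A common choice for caption quality is the BLEU score; the sketch below uses NLTK's corpus_bleu under that assumption. The generate_desc() helper performs greedy word-by-word generation using the startseq/endseq markers.

```
from numpy import argmax
from nltk.translate.bleu_score import corpus_bleu
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_for_id(integer, tokenizer):
    """Map a predicted integer back to its word."""
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    """Greedily generate a caption, one word at a time."""
    in_text = 'startseq'
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = model.predict([photo, sequence], verbose=0)
        word = word_for_id(argmax(yhat), tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'endseq':
            break
    return in_text

def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    """Compare generated captions against the reference captions with BLEU."""
    actual, predicted = [], []
    for key, desc_list in descriptions.items():
        yhat = generate_desc(model, tokenizer, photos[key], max_length)
        actual.append([d.split() for d in desc_list])
        predicted.append(yhat.split())
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
```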


Generating New Descriptions

Now we are going to generate new descriptions for the images.
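
A sketch of the generation step for a brand-new photo. The filenames 'model.h5', 'tokenizer.pkl', and 'example.jpg' are placeholders for artifacts saved during training, and generate_desc() is the greedy decoding helper from the evaluation sketch above.

```
from pickle import load
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model, load_model

def extract_features(filename):
    """Extract the 4,096-element VGG16 feature vector for a single photo."""
    model = VGG16()
    model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
    image = load_img(filename, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1,) + image.shape)
    image = preprocess_input(image)
    return model.predict(image, verbose=0)

# Placeholder filenames for the trained model and fitted tokenizer.
tokenizer = load(open('tokenizer.pkl', 'rb'))
model = load_model('model.h5')
max_length = 34

photo = extract_features('example.jpg')
# generate_desc() as defined in the evaluation sketch above.
description = generate_desc(model, tokenizer, photo, max_length)
print(description)
```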


Output:

"startseq man in the red shirt is standing on rock endseq"
