Attention Mechanism

Just as the name "Attention" suggests, in real life, when we look at a picture or listen to music, we usually pay more attention to some parts and less attention to others. The Attention mechanism in Deep Learning follows the same idea, paying more attention to specific areas of the input while processing it. Attention is one component of a network's architecture.

The encoder and decoder differ depending on the task. In machine translation, the encoder is often an LSTM/GRU/Bi-RNN, whereas in image captioning, the encoder is typically a CNN.

Consider translating the French sentence 'le chat est noir' into the English sentence 'the cat is black'.

The input consists of four words plus an EOS token at the end (a stop word), giving five time steps when translating to English. Attention is applied at each time step by allocating weights to the input words; the more important words receive larger weights (learned through the backpropagation gradient process). As a result, 5 different sets of weights (corresponding to the 5 time steps) are allocated. The following is the general architecture of seq2seq:

[Figure: the general seq2seq encoder-decoder architecture]

Without attention, the input to the decoder is made up of two parts: the initial decoder input (typically a start-of-sentence token) and the final encoder hidden state. The disadvantage of this method is that some information from the earliest encoder cells may be lost along the way. To deal with this issue, attention weights are applied to all encoder outputs.

[Figure: attention weights over the encoder inputs at each decoder time step]

As we can see, the colors representing the attention weights over the encoder inputs change with each decoder output word, according to how important each input word is at that step.

You may wonder how we can weight the encoder outputs correctly. The answer is that the parameters producing these weights are simply initialized at random, and the backpropagation gradient mechanism adjusts them during training. What we must do is construct the forward computational graph appropriately.

[Figure: the forward computation of the attention weights]

After calculating the attention weights, we have three components: the decoder input, the decoder hidden state, and the context vector (attention weights * encoder outputs), which we feed to the decoder to produce the decoder output.
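
As a rough illustration of this step (the tensor shapes and variable names below are our own, not taken from the article's code), the weighted combination is just a batched matrix product of the attention weights with the encoder outputs:

    import torch

    # Illustrative shapes: batch of 1, source length 5, hidden size 8.
    encoder_outputs = torch.randn(1, 5, 8)                        # one vector per source word
    attention_weights = torch.softmax(torch.randn(1, 5), dim=1)   # sums to 1 over the source

    # Context vector: weighted sum of the encoder outputs, shape (1, 8).
    context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs).squeeze(1)

    decoder_input = torch.randn(1, 8)    # embedding of the previous output word
    decoder_hidden = torch.randn(1, 8)   # previous decoder hidden state

    # The decoder step then consumes all three pieces, e.g. by concatenating
    # the context vector with the decoder input before the RNN cell.
    step_input = torch.cat([decoder_input, context], dim=1)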

Attention is commonly classified into two types: Bahdanau attention and Luong attention. Luong attention builds on Bahdanau attention and has demonstrated superior performance on a variety of tasks. This article focuses on Luong attention.

Computational Graph for Luong Attention

[Figure: computational graph for Luong attention]

Step 1: Calculating the encoder hidden states

Step 2: Calculating the decoder hidden state for the current time step

Step 3: Investigating the Attention Class and Calculating the Alignment Score

There are three approaches to calculating alignment scores in Luong Attention (dot, general, and concat).

  • Dot function: This is the simplest of the three: the alignment score is the dot product of the decoder hidden state and the encoder hidden state.
    SCORE = H(decoder)ᵀ * H(encoder)
  • General function: Similar to the dot function, except that a weight matrix is inserted between the two hidden states.
    SCORE = H(decoder)ᵀ * W * H(encoder)
  • Concat function: The decoder and encoder hidden states are concatenated first, then fed through a linear layer (nn.Linear) and a tanh activation; finally, a second weight W2 maps the result to the score.
    SCORE = W2 * tanh(W1 * [H(decoder) ; H(encoder)])

Implementing Attention Class
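
The article's original attention code is not reproduced here, so below is a minimal PyTorch sketch of a Luong-style attention class covering the three scoring variants described above; the class name, argument shapes, and default method are our own choices rather than the article's exact code.

    import torch
    import torch.nn as nn

    class LuongAttention(nn.Module):
        """Luong attention with 'dot', 'general', or 'concat' scoring."""

        def __init__(self, hidden_size, method="general"):
            super().__init__()
            self.method = method
            if method == "general":
                self.W = nn.Linear(hidden_size, hidden_size, bias=False)
            elif method == "concat":
                self.W1 = nn.Linear(2 * hidden_size, hidden_size, bias=False)
                self.W2 = nn.Linear(hidden_size, 1, bias=False)

        def score(self, decoder_hidden, encoder_outputs):
            # decoder_hidden: (batch, hidden); encoder_outputs: (batch, src_len, hidden)
            if self.method == "dot":
                return torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
            if self.method == "general":
                return torch.bmm(self.W(encoder_outputs), decoder_hidden.unsqueeze(2)).squeeze(2)
            # concat: SCORE = W2 * tanh(W1 * [H(decoder) ; H(encoder)])
            src_len = encoder_outputs.size(1)
            expanded = decoder_hidden.unsqueeze(1).expand(-1, src_len, -1)
            energy = torch.tanh(self.W1(torch.cat([expanded, encoder_outputs], dim=2)))
            return self.W2(energy).squeeze(2)

        def forward(self, decoder_hidden, encoder_outputs):
            scores = self.score(decoder_hidden, encoder_outputs)        # (batch, src_len)
            weights = torch.softmax(scores, dim=1)                      # attention weights
            context = torch.bmm(weights.unsqueeze(1), encoder_outputs)  # (batch, 1, hidden)
            return context.squeeze(1), weights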

For the sake of better understanding, we will now implement a translation system using the attention mechanism.

Code:

Importing Libraries
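
The article's import cell is not shown, so the sketch below simply gathers the libraries that the remaining sketches rely on (PyTorch for the model, scikit-learn only for the dataset split); the `device` variable is our own convention, reused by the later sketches.

    import re
    import random
    import unicodedata

    import torch
    import torch.nn as nn
    import torch.optim as optim
    from sklearn.model_selection import train_test_split

    # Use a GPU when available; the later sketches assume this `device` object.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")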

Output:


Preprocessing the Data

Output:

Here, we will define the function that will preprocess English sentences

Here, we will define the function that will preprocess Portuguese sentences

Now, we will generate pairs of cleaned English and Portuguese sentences with start and end tokens added.
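
The following is a combined sketch of the two preprocessing functions and the pair generation described above. The file name `por.txt` (a tab-separated English/Portuguese file, as in the Anki/Tatoeba data) and the function names are assumptions rather than the article's exact code; the block reuses the imports from the earlier sketch.

    def unicode_to_ascii(s):
        # Strip accents so that e.g. 'não' becomes 'nao'.
        return "".join(c for c in unicodedata.normalize("NFD", s)
                       if unicodedata.category(c) != "Mn")

    def preprocess_english(sentence):
        s = sentence.lower().strip()
        s = re.sub(r"([?.!,])", r" \1 ", s)      # keep basic punctuation as separate tokens
        s = re.sub(r"[^a-z?.!,]+", " ", s)       # drop everything else
        s = re.sub(r"\s+", " ", s).strip()
        return "<start> " + s + " <end>"

    def preprocess_portuguese(sentence):
        # Same cleaning, applied after removing accents.
        return preprocess_english(unicode_to_ascii(sentence))

    def create_pairs(path, num_examples=30000):
        # Assumed format: English sentence <tab> Portuguese sentence per line.
        with open(path, encoding="utf-8") as f:
            lines = f.read().strip().split("\n")
        pairs = []
        for line in lines[:num_examples]:
            eng, por = line.split("\t")[:2]
            pairs.append((preprocess_english(eng), preprocess_portuguese(por)))
        return pairs

    pairs = create_pairs("por.txt")
    print(random.choice(pairs))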

Output:


Now, we will create a class to map every word to an index and vice-versa for any given vocabulary.
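
A minimal sketch of such a vocabulary class, built from the `pairs` list of the previous sketch; the name `LanguageIndex` and the use of index 0 for the padding token are our own conventions.

    class LanguageIndex:
        """Maps every word of a vocabulary to an index and back."""

        def __init__(self, sentences):
            self.word2idx = {"<pad>": 0}
            self.idx2word = {0: "<pad>"}
            vocab = set()
            for sentence in sentences:
                vocab.update(sentence.split(" "))
            for i, word in enumerate(sorted(vocab), start=1):
                self.word2idx[word] = i
                self.idx2word[i] = word

        def __len__(self):
            return len(self.word2idx)

    # One index per language, built from the cleaned pairs.
    inp_lang = LanguageIndex(eng for eng, por in pairs)
    targ_lang = LanguageIndex(por for eng, por in pairs)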

Tokenization and Padding
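
A sketch of this step, converting every cleaned sentence to a list of indices with the vocabularies above and right-padding each sequence to the longest sentence of its language:

    def sentence_to_indices(sentence, lang):
        return [lang.word2idx[w] for w in sentence.split(" ")]

    def pad_batch(index_lists, max_len):
        # Right-pad every sequence with the <pad> index (0) up to max_len.
        return torch.tensor([seq + [0] * (max_len - len(seq)) for seq in index_lists],
                            dtype=torch.long)

    input_seqs = [sentence_to_indices(eng, inp_lang) for eng, por in pairs]
    target_seqs = [sentence_to_indices(por, targ_lang) for eng, por in pairs]

    max_length_inp = max(len(s) for s in input_seqs)
    max_length_targ = max(len(s) for s in target_seqs)

    input_tensor = pad_batch(input_seqs, max_length_inp)
    target_tensor = pad_batch(target_seqs, max_length_targ)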


Splitting the Dataset
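
A sketch of the split; holding out 20% of the pairs for validation is an assumption, not necessarily the article's ratio.

    indices = list(range(len(pairs)))
    train_idx, val_idx = train_test_split(indices, test_size=0.2, random_state=42)

    input_train, target_train = input_tensor[train_idx], target_tensor[train_idx]
    input_val, target_val = input_tensor[val_idx], target_tensor[val_idx]

    print(len(input_train), len(input_val))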

Output:


GRU Units

We'll be using GRUs instead of LSTMs because a GRU has only one state (no separate cell state), which makes the implementation easier.

The input to the encoder will be the sentence in English, and the output will be the sequence of GRU hidden states (the encoder outputs) together with the final hidden state.
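
A GRU encoder sketch consistent with that description (batch-first tensors and a single GRU layer; all sizes are placeholders):

    class Encoder(nn.Module):
        def __init__(self, vocab_size, embedding_dim, hidden_size):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
            self.gru = nn.GRU(embedding_dim, hidden_size, batch_first=True)

        def forward(self, x):
            # x: (batch, src_len) of word indices.
            embedded = self.embedding(x)             # (batch, src_len, emb)
            outputs, hidden = self.gru(embedded)     # outputs: (batch, src_len, hidden)
            return outputs, hidden                   # hidden: (1, batch, hidden)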

The next step is to define the decoder. The decoder will have two main inputs: the final hidden state from the encoder and the target sentence, which is actually the output sentence with a start token prepended; in addition, the attention layer needs the encoder outputs.
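
A matching decoder sketch that runs one time step at a time, attends over the encoder outputs using the LuongAttention class sketched earlier, and predicts the next target word:

    class Decoder(nn.Module):
        def __init__(self, vocab_size, embedding_dim, hidden_size):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
            self.gru = nn.GRU(embedding_dim, hidden_size, batch_first=True)
            self.attention = LuongAttention(hidden_size, method="general")
            self.out = nn.Linear(2 * hidden_size, vocab_size)

        def forward(self, x, hidden, encoder_outputs):
            # x: (batch, 1) holding the previous target word.
            embedded = self.embedding(x)                    # (batch, 1, emb)
            output, hidden = self.gru(embedded, hidden)     # output: (batch, 1, hidden)
            context, attn_weights = self.attention(output.squeeze(1), encoder_outputs)
            # Combine the decoder state with the context vector before predicting.
            logits = self.out(torch.cat([output.squeeze(1), context], dim=1))
            return logits, hidden, attn_weights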

Now, we will create encoder and decoder objects from their respective classes.
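
For instance (the embedding size of 256 and hidden size of 512 are placeholder hyper-parameters, not the article's values):

    embedding_dim = 256
    hidden_size = 512

    encoder = Encoder(len(inp_lang), embedding_dim, hidden_size).to(device)
    decoder = Decoder(len(targ_lang), embedding_dim, hidden_size).to(device)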

Here, we will define the optimizer and the loss function.
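
A sketch of this step: Adam over both models' parameters, and cross-entropy that ignores the padding index so padded positions do not contribute to the loss. The learning rate is an assumption.

    optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                           lr=1e-3)

    # Ignore the <pad> index (0) so padding does not affect the loss.
    criterion = nn.CrossEntropyLoss(ignore_index=0)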

Training

Here, we will train our model.
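
One possible training loop, sketched with teacher forcing; the batch size, number of epochs, and gradient-clipping value are assumptions.

    batch_size = 64
    epochs = 10
    num_batches = max(1, len(input_train) // batch_size)

    for epoch in range(epochs):
        total_loss = 0.0
        for start in range(0, len(input_train), batch_size):
            src = input_train[start:start + batch_size].to(device)
            trg = target_train[start:start + batch_size].to(device)

            optimizer.zero_grad()
            encoder_outputs, hidden = encoder(src)

            loss = 0.0
            decoder_input = trg[:, 0].unsqueeze(1)          # the <start> token
            for t in range(1, trg.size(1)):
                logits, hidden, _ = decoder(decoder_input, hidden, encoder_outputs)
                loss = loss + criterion(logits, trg[:, t])
                decoder_input = trg[:, t].unsqueeze(1)      # teacher forcing

            loss.backward()
            torch.nn.utils.clip_grad_norm_(
                list(encoder.parameters()) + list(decoder.parameters()), 1.0)
            optimizer.step()
            total_loss += loss.item() / (trg.size(1) - 1)

        print(f"Epoch {epoch + 1}: loss {total_loss / num_batches:.4f}")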

Output:


Testing
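
A greedy-decoding sketch for testing: it preprocesses one English sentence, runs the encoder, then follows the decoder's own predictions until it emits <end> (or hits a length limit). The helper name `translate` and the length limit are our own choices.

    def translate(sentence, max_len=40):
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            cleaned = preprocess_english(sentence)
            indices = [inp_lang.word2idx[w] for w in cleaned.split(" ")
                       if w in inp_lang.word2idx]
            src = torch.tensor([indices], dtype=torch.long, device=device)

            encoder_outputs, hidden = encoder(src)
            decoder_input = torch.tensor([[targ_lang.word2idx["<start>"]]], device=device)

            result = []
            for _ in range(max_len):
                logits, hidden, _ = decoder(decoder_input, hidden, encoder_outputs)
                predicted = logits.argmax(dim=1).item()
                word = targ_lang.idx2word[predicted]
                if word == "<end>":
                    break
                result.append(word)
                decoder_input = torch.tensor([[predicted]], device=device)
            return " ".join(result)

    print(translate("the cat is black"))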



Output:

[Output screenshots: sample translations produced by the model at different points during training]

Over a period of time, we can see that the model is improving.





