Attention Mechanism

Just as the name suggests, in real life, when we look at a picture or listen to music, we usually pay more attention to some parts and less attention to others. The attention mechanism in deep learning follows the same idea, paying more attention to specific areas of the input while processing it. Attention is one component of a network's architecture, and the encoder and decoder around it differ depending on the task: in machine translation the encoder is often an LSTM, GRU, or bidirectional RNN, whereas in image captioning it is typically a CNN.

Consider translating the French sentence "le chat est noir" into the English sentence "the cat is black". The input consists of four words plus an EOS token at the end (a stop word), which gives five time steps when translating to English. Attention is applied at each time step by assigning weights to the input words; the more important words receive larger weights, learned through the backpropagation gradient process. As a result, five different sets of attention weights (one per time step) are produced.

(Figure: the general architecture of a seq2seq model.)

Without attention, the input to the decoder consists of two parts: the initial decoder input (typically the start token) and the final encoder hidden state. The disadvantage of this method is that some information from the earliest encoder cells may be lost along the way. To deal with this issue, attention weights are applied to all of the encoder outputs. The attention weights over the encoder inputs change with each decoder output word, reflecting how important each input word is for that particular step.

You may wonder how we can weight the encoder outputs correctly. The answer is that we simply initialize the attention parameters at random, and the backpropagation gradient mechanism takes care of them during training; what we must do is construct the forward computational graph appropriately. After calculating the attention weights, we have three components: the decoder input, the decoder hidden state, and the context vector (attention weights * encoder outputs), which we feed to the decoder to produce the decoder output.

Attention is classified into two types: Bahdanau attention and Luong attention. Luong attention is built on top of Bahdanau attention and has demonstrated superior performance on a variety of tasks. This tutorial focuses on Luong attention.

Computational Graph for Luong Attention

Step 1: Calculating the encoder hidden states
Step 2: Calculating the decoder hidden state
Step 3: Investigating the attention class and calculating the alignment scores

There are three approaches to calculating alignment scores in Luong attention: dot, general, and concat.
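For reference, the three scoring functions can be written as follows, in the notation of Luong et al. (2015), where $h_t$ is the current decoder (target) hidden state, $\bar{h}_s$ is an encoder (source) hidden state, and $W_a$, $v_a$ are learned parameters:

$$
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top} \bar{h}_s & \text{(dot)} \\
h_t^{\top} W_a \bar{h}_s & \text{(general)} \\
v_a^{\top} \tanh\!\left(W_a [h_t ; \bar{h}_s]\right) & \text{(concat)}
\end{cases}
$$

The scores are normalized with a softmax over the source positions to give the attention weights, and the context vector is the corresponding weighted sum of the encoder outputs.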
Implementing the Attention Class

For the sake of better understanding, we will now implement a translation system using the attention mechanism.

Importing Libraries

First, we import the required libraries.

Preprocessing the Data

Here, we define a function that preprocesses the English sentences and another that preprocesses the Portuguese sentences. We then generate pairs of cleaned English and Portuguese sentences with start and end tokens added, and create a class that maps every word to an index, and vice versa, for any given vocabulary.

Tokenization and Padding

The cleaned sentences are tokenized and padded to a common length.

Splitting the Dataset

The tokenized data is split into training and test sets.

GRU Units

We will use GRUs instead of LSTMs because a GRU maintains only a single state, which makes the implementation simpler. The input to the encoder is the sentence in English, and its output is the sequence of encoder outputs along with the final hidden state of the GRU. The next step is to define the decoder. The decoder takes two inputs: the hidden state from the encoder and the input sentence, which is actually the target sentence with a start token appended at the beginning. Finally, we create encoder and decoder objects from their respective classes and define the optimizer and the loss function.

Training

Here we train the model.

Testing

Over time, we can see that the translations produced by the model are improving.
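For the GRU-based encoder-decoder described above, below is a minimal sketch of how a Luong-style attention layer and a single decoder step can be wired together, assuming a TensorFlow/Keras implementation; the framework choice and all names (LuongAttention, Decoder, units, vocab_size, embedding_dim) are illustrative assumptions, not the exact code of this tutorial.

```python
# Sketch of a Luong-style ("general" score) attention layer and a GRU decoder
# step, assuming TensorFlow/Keras. All names are illustrative assumptions.
import tensorflow as tf


class LuongAttention(tf.keras.layers.Layer):
    """Scores encoder outputs against the decoder state and builds a context vector."""

    def __init__(self, units):
        super().__init__()
        self.Wa = tf.keras.layers.Dense(units, use_bias=False)

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, units); encoder_outputs: (batch, src_len, units)
        query = tf.expand_dims(decoder_hidden, 1)               # (batch, 1, units)
        scores = tf.matmul(query, self.Wa(encoder_outputs),
                           transpose_b=True)                    # (batch, 1, src_len)
        weights = tf.nn.softmax(scores, axis=-1)                # attention weights
        context = tf.matmul(weights, encoder_outputs)           # (batch, 1, units)
        return tf.squeeze(context, 1), tf.squeeze(weights, 1)


class Decoder(tf.keras.Model):
    """One decoding step: embed the previous token, run the GRU, attend, predict."""

    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.attention = LuongAttention(units)
        self.wc = tf.keras.layers.Dense(units, activation="tanh")  # attentional state
        self.fc = tf.keras.layers.Dense(vocab_size)                # vocabulary logits

    def call(self, token, hidden, encoder_outputs):
        # token: (batch, 1) previous target word id; hidden: (batch, units)
        x = self.embedding(token)                                # (batch, 1, emb_dim)
        _, state = self.gru(x, initial_state=hidden)             # state: (batch, units)
        context, weights = self.attention(state, encoder_outputs)
        attentional = self.wc(tf.concat([context, state], axis=-1))
        logits = self.fc(attentional)                            # (batch, vocab_size)
        return logits, state, weights


# Illustrative usage with random tensors (shapes only, no real data).
batch, src_len, units, vocab_size, embedding_dim = 4, 10, 32, 100, 16
encoder_outputs = tf.random.normal((batch, src_len, units))
encoder_hidden = tf.random.normal((batch, units))
start_token = tf.ones((batch, 1), dtype=tf.int32)  # assume id 1 is the start token

decoder = Decoder(vocab_size, embedding_dim, units)
logits, state, weights = decoder(start_token, encoder_hidden, encoder_outputs)
print(logits.shape, weights.shape)  # (4, 100) (4, 10)
```

During training this step runs once per target word (typically with teacher forcing), and at inference time it runs in a loop, feeding each predicted word back in as the next input; the returned attention weights can be plotted to see which source words the decoder focused on for each output word.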