Seq2Seq-Encoder-Decoder-LSTM-Model
Recurrent Neural Networks (or, more precisely, their LSTM/GRU variants) have been found to be very effective in solving complex sequence-related problems given a large amount of data. They have real-world applications in speech recognition, Natural Language Processing (NLP), and more.
Sequence to Sequence (Seq2Seq) models, also known as Encoder-Decoder models, are a special class of Recurrent Neural Network architectures typically used (but not restricted) to solve complex language-related problems like Machine Translation, Question Answering, building chat-bots, Text Summarization, etc.
“The Unreasonable Effectiveness of Recurrent Neural Networks (which explains how RNNs can be used to build language models) and Understanding LSTM Networks (which explains the working of LSTMs with solid intuition) are two brilliant blogs that I strongly suggest going through if you haven’t. The concepts explained in these blogs are used extensively in this post.”
Encoder — Decoder Architecture
The most common architecture used to build Seq2Seq models is the Encoder-Decoder architecture.
- Both the encoder and the decoder are typically LSTM models (or sometimes GRU models).
- The encoder reads the input sequence and summarizes the information in what are called internal state vectors (in the case of an LSTM, the hidden state and cell state vectors). We discard the outputs of the encoder and preserve only the internal states.
- The decoder is an LSTM whose initial states are set to the final states of the Encoder LSTM. Using these initial states, the decoder starts generating the output sequence.
- The decoder behaves a bit differently during training and inference. During training, we use a technique called teacher forcing, which helps train the decoder faster. During inference, the input to the decoder at each time step is its own output from the previous time step.
- In short, the encoder summarizes the input sequence into state vectors (sometimes also called Thought vectors), which are then fed to the decoder, which starts generating the output sequence given those Thought vectors. The decoder is just a language model conditioned on the initial states (see the sketch after this list).
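For the Keras-minded reader, here is a minimal sketch of this wiring. All layer sizes and feature dimensions below are illustrative assumptions; a fuller word-level version with embedding layers appears in the sections that follow.

```python
# Minimal Encoder-Decoder wiring in Keras (illustrative sizes only).
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim = 256           # number of LSTM units (assumed)
num_encoder_features = 50  # per-time-step input size on the encoder side (assumed)
num_decoder_tokens = 80    # output vocabulary size on the decoder side (assumed)

# Encoder: read the input sequence, keep only the final internal states.
encoder_inputs = Input(shape=(None, num_encoder_features))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]              # the "Thought vectors"

# Decoder: an LSTM whose initial state is the encoder's final state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```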
Let's now understand all of the above steps in detail by considering the example of translating an English sentence (input sequence) into its equivalent Marathi sentence (output sequence).
Encoder LSTM
The LSTM reads the data one element of the sequence at a time. Thus if the input is a sequence of length ‘k’, we say that the LSTM reads it in ‘k’ time steps.
Referring to the above diagram, below are the 3 main components of an LSTM:
Xi => Input sequence at time step i
hi and ci => The LSTM maintains two states (‘h’ for hidden state and ‘c’ for cell state) at each time step. Taken together, these form the internal state of the LSTM at time step i.
Yi => Output sequence at time step i
Let’s try to map all of these to the context of our problem: translating an English sentence into its Marathi equivalent. We will consider the below example:
Input sentence (English)=> “Rahul is a good boy”
Output sentence (Marathi) => “राहुल चांगला मुलगा आहे”
Explanation for Xi:
Input of the LSTM model
We will break the sentence into words, as this scheme is more common in real-world applications. Hence the name ‘Word Level NMT’ (Neural Machine Translation). So, referring to the diagram above, we have the following input:
X1 = ‘Rahul’, X2 = ‘is’, X3 = ‘a’, X4 = ‘good’, X5 = ‘boy’.
The LSTM will read this sentence word by word in 5 time steps as follows.
We first use an embedding layer before the LSTM layer, since the LSTM cannot consume raw words directly. There are various word-embedding techniques that map a word into a fixed-length vector.
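As a small illustration, here is how such an embedding layer might be set up in Keras. The vocabulary size, embedding size, and the word indices below are assumptions for demonstration.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size = 10000    # assumed size of the English vocabulary
embedding_dim = 100   # assumed length of each word vector

# Maps every integer word index to a dense 100-dimensional vector;
# mask_zero=True lets padded positions (index 0) be ignored downstream.
embedding_layer = Embedding(vocab_size, embedding_dim, mask_zero=True)

# "Rahul is a good boy" as (made-up) word indices -> shape (1, 5, 100)
word_ids = np.array([[12, 7, 3, 85, 44]])
embedded_sentence = embedding_layer(word_ids)
print(embedded_sentence.shape)   # (1, 5, 100): 5 time steps, one vector per word
```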
Explanation for hi and ci: In very simple terms, they remember what the LSTM has read (learned) till now. For example:
h3, c3 => These two vectors will remember that the network has read “Rahul is a” till now. Basically, this is the summary of information up to time step 3, stored in the vectors h3 and c3 (thus called the states at time step 3).
Similarly, h5, c5 will contain the summary of the entire input sentence, since this is where the sentence ends (at time step 5). The states coming out of the last time step are also called the “Thought vectors”, as they summarize the entire sequence in vector form.
What about h0, c0? These vectors are typically initialized to zero, as the model has not yet started reading the input.
Note: The size of both of these vectors is equal to the number of units (neurons) used in the LSTM cell.
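A quick way to see this, sketched in Keras with assumed sizes:

```python
import numpy as np
from tensorflow.keras.layers import LSTM

units = 256   # number of LSTM units (assumed)
lstm = LSTM(units, return_state=True)

# One sentence of 5 time steps, each a 100-dimensional word vector.
x = np.random.rand(1, 5, 100).astype('float32')
output, h, c = lstm(x)
print(h.shape, c.shape)   # (1, 256) (1, 256): state size equals the number of units
```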
Explanation for Yi:
These are the outputs (predictions) of the LSTM model at each time step.
Each Yi is actually a probability distribution over the entire vocabulary, generated by using a softmax activation. Thus each Yi is a vector of size “vocab_size” representing a probability distribution.
Depending on the context of the problem, these outputs might be used or discarded.
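To make this concrete, here is a small sketch (with assumed sizes) showing that each Yi coming out of a softmax layer is a vector of length vocab_size, and that the most probable word is simply its argmax:

```python
import numpy as np
from tensorflow.keras.layers import Dense, LSTM

vocab_size = 10000   # assumed vocabulary size
units = 256          # assumed number of LSTM units

# Per-time-step LSTM outputs -> probability distribution over the vocabulary.
lstm = LSTM(units, return_sequences=True)
softmax_layer = Dense(vocab_size, activation='softmax')

x = np.random.rand(1, 5, 100).astype('float32')   # 5 time steps of word vectors
y = softmax_layer(lstm(x))                        # shape (1, 5, 10000)
print(y.shape)                # each Yi is a vector of size vocab_size
print(np.argmax(y[0, 0]))     # index of the most probable word at time step 1
```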
Summary of the encoder
- We read the input sequence (English sentence) word by word.
- We preserve the internal states of the LSTM generated after the last time step, hk and ck (assuming the sentence has ‘k’ words).
- These vectors (states hk and ck) are called the encoding of the input sequence, as they encode (summarize) the entire input in vector form.
- Since we only start generating output once the entire input sequence has been read, the outputs (Yi) of the Encoder at each time step are discarded (a minimal Keras sketch of the encoder follows this list).
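A minimal sketch of such an encoder in Keras, assuming integer-encoded English sentences (the vocabulary size, embedding size, and number of units are assumptions):

```python
from tensorflow.keras.layers import Input, LSTM, Embedding

eng_vocab_size = 10000   # assumed English vocabulary size
embedding_dim = 100      # assumed embedding size
latent_dim = 256         # assumed number of LSTM units

# Encoder input: a variable-length sequence of English word indices.
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(eng_vocab_size, embedding_dim, mask_zero=True)(encoder_inputs)

# return_state=True gives us hk and ck; the per-time-step outputs are discarded.
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
encoder_states = [state_h, state_c]   # the encoding of the input sentence
```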
Decoder LSTM — Training Mode
Just as the Encoder scanned the input sequence word by word, the Decoder will generate the output sequence word by word.
For some technical reasons (explained later), we will add two tokens to the output sequence as follows:
Output sequence => “START_ राहुल चांगला मुलगा आहे _END”
Now consider the diagram below:
The most important point is that the initial states (h0, c0) of the decoder are set to the final states of the encoder. This means that the decoder is trained to start generating the output sequence depending on the information encoded by the encoder.
Obviously the translated Marathi sentence must depend on the given English sentence.
In the first time step we provide the START_ token so that the decoder starts generating the next token (the actual first word of the Marathi sentence). And after the last word in the Marathi sentence, we make the decoder learn to predict the _END token. This will be used as the stopping condition during the inference procedure.
We use a technique called “Teacher Forcing”, wherein the input at each time step is the actual output (and not the predicted output) from the previous time step. This helps in faster and more efficient training of the network. To understand more about teacher forcing, refer to this blog.
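Concretely, teacher forcing only affects how the decoder's training data is laid out: the decoder input is the target sentence shifted right by one position. A small sketch for our example sentence (variable names are hypothetical):

```python
# Teacher forcing: at every time step the decoder is fed the ground-truth
# previous word, not its own prediction from the previous step.
#
#   time step      :    1        2         3        4       5
#   decoder input  : START_    राहुल    चांगला    मुलगा    आहे
#   decoder target : राहुल    चांगला    मुलगा     आहे     _END

decoder_input_words  = ['START_', 'राहुल', 'चांगला', 'मुलगा', 'आहे']
decoder_target_words = ['राहुल', 'चांगला', 'मुलगा', 'आहे', '_END']
```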
The entire training process (Encoder + Decoder) can be summarized in the below diagram:
The loss is calculated on the predicted outputs from each time step, and the errors are backpropagated through time in order to update the parameters of the network.
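Continuing the encoder sketch from the previous section, the decoder and the full training model might look like this in Keras. All sizes are assumptions, and `encoder_inputs`, `encoder_states`, `embedding_dim`, and `latent_dim` come from that earlier sketch.

```python
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model

mar_vocab_size = 8000   # assumed Marathi vocabulary size

# Decoder input: the target sentence shifted right by one (teacher forcing).
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(mar_vocab_size, embedding_dim, mask_zero=True)
dec_emb = dec_emb_layer(decoder_inputs)

# The decoder LSTM is initialized with the encoder's final states.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_lstm_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)

# Softmax over the Marathi vocabulary at every time step.
decoder_dense = Dense(mar_vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# Training model: (English sentence, shifted Marathi sentence) -> Marathi sentence.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
# model.fit([encoder_input_data, decoder_input_data], decoder_target_data, ...)
```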
Decoder LSTM — Inference Mode
As already stated, the Encoder LSTM plays the same role: reading the input sequence (English sentence) and generating the thought vectors (hk, ck).
However, the decoder now has to predict the entire output sequence (Marathi sentence) given these thought vectors.
Let’s try to visually understand by taking the same example.
Input sequence => “Rahul is a good boy”
(Expected) Output Sequence => “राहुल चांगला मुलगा आहे”
Step 1: Encode the input sequence into the Thought Vectors:
Step 2: Start generating the output sequence in a loop, word by word:
At t = 1: the decoder is fed the START_ token (with the thought vectors as its initial states) and predicts the first word, “राहुल”.
At t = 2: the predicted word “राहुल” is fed back in, and the decoder predicts “चांगला”.
At t = 3: “चांगला” is fed in, and the decoder predicts “मुलगा”.
At t = 4: “मुलगा” is fed in, and the decoder predicts “आहे”.
At t = 5: “आहे” is fed in, and the decoder predicts the _END token, which stops the loop.
Inference Algorithm:
a. During inference, we generate one word at a time. Thus the Decoder LSTM is called in a loop, every time processing only one time step.
b. The initial states of the decoder are set to the final states of the encoder.
c. The initial input to the decoder is always the START_ token.
d. At each time step, we preserve the states of the decoder and set them as initial states for the next time step.
e. At each time step, the predicted output is fed as input in the next time step.
f. We break the loop when the decoder predicts the _END token.
The entire inference procedure can be summarized in the below diagram:
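Putting steps (a) through (f) into code, and continuing the training sketch above: we build two inference-time models that reuse the trained layers, then loop one word at a time. The lookup dictionaries `mar_word_index` (word to index) and `mar_index_word` (index to word) are assumed to have been built while preparing the data.

```python
import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# Encoder inference model: English sentence -> thought vectors (hk, ck).
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder inference model: (previous word, previous states) ->
# (probabilities for the next word, updated states). Reuses the trained layers.
state_h_in = Input(shape=(latent_dim,))
state_c_in = Input(shape=(latent_dim,))
states_in = [state_h_in, state_c_in]

dec_emb_inf = dec_emb_layer(decoder_inputs)
dec_out_inf, h_inf, c_inf = decoder_lstm(dec_emb_inf, initial_state=states_in)
dec_out_inf = decoder_dense(dec_out_inf)
decoder_model = Model([decoder_inputs] + states_in, [dec_out_inf, h_inf, c_inf])

def decode_sequence(input_seq, mar_word_index, mar_index_word, max_len=20):
    # (a)/(b) Encode the input; its final states become the decoder's initial states.
    states = encoder_model.predict(input_seq)

    # (c) The very first decoder input is the START_ token.
    target_seq = np.array([[mar_word_index['START_']]])

    decoded_words = []
    for _ in range(max_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states)

        # (e) The most probable word becomes the input of the next time step.
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_word = mar_index_word[sampled_index]

        # (f) Stop as soon as the decoder predicts the _END token.
        if sampled_word == '_END':
            break
        decoded_words.append(sampled_word)

        # (d) Carry the updated states forward to the next time step.
        target_seq = np.array([[sampled_index]])
        states = [h, c]

    return ' '.join(decoded_words)
```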