There are many similarities between the Transformer encoder and decoder, such as their implementation of multi-head attention, layer normalization, and a fully connected feed-forward network as their final sub-layer. Having implemented the Transformer encoder, we will now proceed to apply our knowledge in implementing the Transformer decoder as a further step toward implementing the complete Transformer model. Our end goal remains to apply the complete model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement the Transformer decoder from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The layers that form part of the Transformer decoder.
- How to implement the Transformer decoder from scratch.
Let's get started.

Implementing the Transformer Decoder From Scratch in TensorFlow and Keras
Photograph by François Kaiser, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Decoder
- Implementing the Transformer Decoder From Scratch
  - The Decoder Layer
  - The Transformer Decoder
  - Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The Transformer model
- The scaled dot-product attention
- The multi-head attention
- The Transformer positional encoding
- The Transformer encoder
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure: the encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations, while the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Encoder-Decoder Structure of the Transformer Architecture
Taken from "Attention Is All You Need"
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
We have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. This tutorial will explore these similarities.
The Transformer Decoder
Similar to the Transformer encoder, the Transformer decoder also consists of a stack of $N$ identical layers. The Transformer decoder, however, implements an additional multi-head attention block, for a total of three main sub-layers:
- The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
- The second sub-layer comprises a second multi-head attention mechanism.
- The third sub-layer comprises a fully connected feed-forward network.

The Decoder Block of the Transformer Architecture
Taken from "Attention Is All You Need"
Each of these three sub-layers is also followed by layer normalization, where the input to the layer normalization step is its corresponding sub-layer input (through a residual connection) and output.
On the decoder side, the queries, keys, and values that are fed into the first multi-head attention block also represent the same input sequence. However, this time around, it is the target sequence that is embedded and augmented with positional information before being supplied to the decoder. The second multi-head attention block, on the other hand, receives the encoder output in the form of keys and values, and the normalized output of the first decoder attention block as the queries. In both cases, the dimensionality of the queries and keys remains equal to $d_k$, while the dimensionality of the values remains equal to $d_v$.
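To make the wiring of this second attention block concrete, here is a small, self-contained illustration using Keras's built-in MultiHeadAttention layer (not the custom MultiHeadAttention class used later in this tutorial). The queries come from the decoder side, while the keys and values come from the encoder output:

import tensorflow as tf

# Illustration only: Keras's built-in layer, with the head count and key/value
# dimensionalities matching the paper's values used later in this tutorial
cross_attention = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64, value_dim=64)

decoder_queries = tf.random.normal((64, 5, 512))  # (batch_size, target length, d_model)
encoder_output = tf.random.normal((64, 5, 512))   # (batch_size, source length, d_model)

# Queries from the decoder; keys and values from the encoder output
output = cross_attention(query=decoder_queries, value=encoder_output, key=encoder_output)
print(output.shape)  # (64, 5, 512)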
Vaswani et al. introduce regularization into the model on the decoder side, too, by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before they are fed into the decoder.
Let's now see how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Implementing the Transformer Decoder From Scratch
The Decoder Layer
Since we have already implemented the required sub-layers when we covered the implementation of the Transformer encoder, we will create a class for the decoder layer that makes use of these sub-layers directly:
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from encoder import AddNormalization, FeedForward

class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()
    ...
Notice here that since my code for the different sub-layers has been saved into several Python scripts (namely, multihead_attention.py and encoder.py), it was necessary to import them in order to use the required classes.
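As a reminder of what these imports provide, the two classes imported from encoder.py look roughly as follows. This is only a sketch, assuming the definitions from the encoder tutorial: a residual Add & Norm wrapper and a two-layer, position-wise fully connected network.

from tensorflow.keras.layers import Layer, Dense, LayerNormalization, ReLU

class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()

    def call(self, x, sublayer_x):
        # Sum the sub-layer input and output, then apply layer normalization
        return self.layer_norm(x + sublayer_x)

class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)     # inner fully connected layer
        self.fully_connected2 = Dense(d_model)  # projection back to d_model
        self.activation = ReLU()

    def call(self, x):
        return self.fully_connected2(self.activation(self.fully_connected1(x)))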
As we did for the Transformer encoder, we will now proceed to create the class method, call(), that implements all of the decoder sub-layers:
...
    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)
The multi-head attention sub-layers can also receive a padding mask or a look-ahead mask. As a brief reminder of what we discussed in a previous tutorial, the padding mask is necessary to suppress the zero padding in the input sequence from being processed along with the actual input values. The look-ahead mask prevents the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on the known outputs for the words that come before it.
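We will not be building these masks in this tutorial, but as an illustration, they could be generated along the following lines. These are hypothetical helpers, assuming the convention that a value of 1 marks a position to be suppressed in the attention scores and that 0 is the padding token id:

import tensorflow as tf

def lookahead_mask(seq_length):
    # Ones above the main diagonal: position i may not attend to positions after i
    return 1 - tf.linalg.band_part(tf.ones((seq_length, seq_length)), -1, 0)

def padding_mask(token_ids, pad_id=0):
    # Ones wherever the token equals the (assumed) padding id
    return tf.cast(tf.math.equal(token_ids, pad_id), tf.float32)

print(lookahead_mask(4))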
The same call() class method can also receive a training flag to only apply the Dropout layers during training, when the value of this flag is set to True.
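The effect of this flag is easy to see in isolation with a standalone Dropout layer (a quick illustration, not part of the decoder code):

import tensorflow as tf

dropout = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 4))

print(dropout(x, training=True))   # some values are zeroed out at random (and the rest scaled up)
print(dropout(x, training=False))  # identical to x, since dropout is disabled at inference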
The Transformer Decoder
The Transformer decoder takes the decoder layer that we have just implemented and replicates it identically $N$ times.
We will be creating the following Decoder() class to implement the Transformer decoder:
from positional_encoding import PositionEmbeddingFixedWeights

class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
    ...
As in the Transformer encoder, the input to the first multi-head attention block on the decoder side receives the input sequence after it has undergone a process of word embedding and positional encoding. For this purpose, an instance of the PositionEmbeddingFixedWeights class (covered in the positional encoding tutorial) is initialized, and its output is assigned to the pos_encoding variable.
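If you would like to sanity-check this layer on its own, the expected shapes are as follows. This assumes the class behaves as it is used here, mapping a batch of token indices of shape (batch_size, sequence_length) to vectors of shape (batch_size, sequence_length, d_model):

from numpy import random
from positional_encoding import PositionEmbeddingFixedWeights

pos_embedding = PositionEmbeddingFixedWeights(5, 20, 512)
dummy_tokens = random.randint(0, 20, size=(64, 5))

print(pos_embedding(dummy_tokens).shape)  # expected: (64, 5, 512)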
The final step is to create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result, together with the encoder output, to $N$ decoder layers:
...
    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass the positionally encoded values on to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x
The code listing for the full Transformer decoder is the following:
from tensorflow.keras.layers import Layer, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights
from encoder import AddNormalization, FeedForward

# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output, encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)

# Implementing the Decoder
class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass the positionally encoded values on to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x
Testing Out the Code
We will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
As for the input sequence, we will be working with dummy data for the time being until we arrive at the stage of training the complete Transformer model in a separate tutorial, at which point we will be using actual sentences:
...
dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))
...
Next, we will create a new instance of the Decoder class, assigning its output to the decoder variable, subsequently passing in the input arguments, and printing the result. We will be setting the padding and look-ahead masks to None for the time being, but we will return to these when we implement the complete Transformer model:
...
decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
Tying everything together produces the following code listing:
from numpy import random

dec_vocab_size = 20  # Vocabulary size for the decoder
input_seq_length = 5  # Maximum length of the input sequence

h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the decoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))
enc_output = random.random((batch_size, input_seq_length, d_model))

decoder = Decoder(dec_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(decoder(input_seq, enc_output, None, None, True))
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.
tf.Tensor(
[[[-0.04132953 -1.7236308   0.5391184  ... -0.76394725  1.4969798   0.37682498]
  [ 0.05501875 -1.7523409   0.58404493 ... -0.70776534  1.4498456   0.32555297]
  [ 0.04983566 -1.8431275   0.55850077 ... -0.68202156  1.4222856   0.32104644]
  [-0.05684051 -1.8862512   0.4771412  ... -0.7101341   1.431343    0.39346313]
  [-0.15625843 -1.7992781   0.40803364 ... -0.75190556  1.4602519   0.53546077]]

 ...

 [[-0.58847624 -1.646842    0.5973466  ... -0.47778523  1.2060764   0.34091905]
  [-0.48688865 -1.6809179   0.6493542  ... -0.41274604  1.188649    0.27100053]
  [-0.49568555 -1.8002801   0.61536175 ... -0.38540334  1.2023914   0.24383534]
  [-0.59913146 -1.8598882   0.5098136  ... -0.3984461   1.2115746   0.3186561 ]
  [-0.71045107 -1.7778647   0.43008155 ... -0.42037937  1.2255307   0.47380894]]], shape=(64, 5, 512), dtype=float32)
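Since the exact values are run-dependent, a simpler check is to print only the shape of the returned tensor (a small addition to the listing above):

output = decoder(input_seq, enc_output, None, None, True)
print(output.shape)  # expected: (64, 5, 512)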
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Attention Is All You Need, 2017
Summary
In this tutorial, you discovered how to implement the Transformer decoder from scratch in TensorFlow and Keras.
Specifically, you learned:
- The layers that form part of the Transformer decoder.
- How to implement the Transformer decoder from scratch.
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.