Having seen how to implement the scaled dot-product attention and integrate it within the multi-head attention of the Transformer model, let's progress one step further toward implementing a complete Transformer model by implementing its encoder. Our end goal remains to apply the complete model to Natural Language Processing (NLP).

In this tutorial, you will discover how to implement the Transformer encoder from scratch in TensorFlow and Keras.

After completing this tutorial, you will know:

- The layers that form part of the Transformer encoder.
- How to implement the Transformer encoder from scratch.

Let's get started.

Implementing the Transformer Encoder From Scratch in TensorFlow and Keras
Picture by ian dooley, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:

- Recap of the Transformer Architecture
  - The Transformer Encoder
- Implementing the Transformer Encoder From Scratch
  - The Fully Connected Feed-Forward Neural Network and Layer Normalization
  - The Encoder Layer
  - The Transformer Encoder
- Testing Out the Code
Prerequisites

For this tutorial, we assume that you are already familiar with:

- The Transformer model
- The scaled dot-product attention
- The multi-head attention
- The Transformer positional encoding
Recap of the Transformer Architecture

Recall having seen that the Transformer architecture follows an encoder-decoder structure: the encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Encoder-Decoder Structure of the Transformer Architecture
Taken from “Attention Is All You Need”
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.

We have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. In this tutorial, we will be focusing on the components that form part of the Transformer encoder.
The Transformer Encoder
The Transformer encoder consists of a stack of $N$ identical layers, where each layer further consists of two main sub-layers:

- The first sub-layer comprises a multi-head attention mechanism that receives the queries, keys, and values as inputs.
- The second sub-layer comprises a fully connected feed-forward network.

The Encoder Block of the Transformer Architecture
Taken from “Attention Is All You Need”
Following each of these two sub-layers is layer normalization, into which the sub-layer input (through a residual connection) and output are fed. The output of each layer normalization step is the following:

LayerNorm(Sublayer Input + Sublayer Output)

In order to facilitate such an operation, which involves an addition between the sublayer input and output, Vaswani et al. designed all sub-layers and embedding layers in the model to produce outputs of dimension $d_{\text{model}} = 512$.

Recall, as well, that the queries, keys, and values are the inputs to the Transformer encoder. Here, they carry the same input sequence after it has been embedded and augmented with positional information, where the queries and keys have dimensionality $d_k$, while the values have dimensionality $d_v$.

Furthermore, Vaswani et al. also introduce regularization into the model by applying dropout to the output of each sub-layer (before the layer normalization step), as well as to the positional encodings before they are fed into the encoder.
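Putting these pieces together, the computation wrapped around each encoder sub-layer can be summarized by the following minimal sketch (an illustration under the assumptions just described, not the code of the final implementation), where sublayer stands for either the multi-head attention or the feed-forward network:

# Minimal sketch of the residual + dropout + layer normalization wrapping,
# assuming sublayer, dropout, and layer_norm are the corresponding Keras layers
def sublayer_block(x, sublayer, dropout, layer_norm, training):
    # Dropout is applied to the sub-layer output before the Add & Norm step
    sublayer_output = dropout(sublayer(x), training=training)

    # The residual connection adds the sub-layer input back in, then layer-normalizes
    return layer_norm(x + sublayer_output)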
Let's now see how to implement the Transformer encoder from scratch in TensorFlow and Keras.
Implementing the Transformer Encoder From Scratch
The Fully Connected Feed-Forward Neural Network and Layer Normalization
We will begin by creating classes for the Feed Forward and Add & Norm layers that are shown in the diagram above.

Vaswani et al. tell us that the fully connected feed-forward network consists of two linear transformations with a ReLU activation in between. The first linear transformation produces an output of dimensionality $d_{ff} = 2048$, while the second linear transformation produces an output of dimensionality $d_{\text{model}} = 512$.
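In equation form, as given in the paper, the feed-forward network computes:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

where $W_1$ has shape $d_{\text{model}} \times d_{ff}$ and $W_2$ has shape $d_{ff} \times d_{\text{model}}$.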
For this purpose, let's first create the class, FeedForward, which inherits from the Layer base class in Keras, and initialize the dense layers and the ReLU activation:
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)     # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()                # ReLU activation layer
        ...
We will add to it the class method, call(), that receives an input and passes it through the two fully connected layers with ReLU activation, returning an output of dimensionality equal to 512:
...
    def call(self, x):
        # The input is passed into the two fully connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)

        return self.fully_connected2(self.activation(x_fc1))
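As a quick sanity check (not part of the original listing, and assuming the FeedForward class above and its Keras imports are in scope), the layer can be applied to a dummy tensor to confirm that the last dimension comes out as 512:

from numpy import random

# Dummy batch of shape (batch_size, sequence_length, d_model)
dummy_input = random.random((64, 5, 512))

feed_forward = FeedForward(2048, 512)   # d_ff = 2048, d_model = 512
print(feed_forward(dummy_input).shape)  # Expected: (64, 5, 512)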
The next step is to create another class, AddNormalization, which also inherits from the Layer base class in Keras, and initialize a layer normalization layer:
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer
        ...
In it, we will include the following class method that sums its sub-layer's input and output, which it receives as inputs, and applies layer normalization to the result:
...
    def call(self, x, sublayer_x):
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x

        # Apply layer normalization to the sum
        return self.layer_norm(add)
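A similar quick check (again, not part of the original listing) shows that the Add & Norm layer preserves the input shape, provided the two tensors being summed match:

from numpy import random

add_norm = AddNormalization()

# Sub-layer input and output must share the shape (batch_size, sequence_length, d_model)
x = random.random((64, 5, 512))
sublayer_x = random.random((64, 5, 512))

print(add_norm(x, sublayer_x).shape)  # Expected: (64, 5, 512)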
The Encoder Layer
Next, we will implement the encoder layer, which the Transformer encoder will replicate identically $N$ times.
For this purpose, let's create the class, EncoderLayer, and initialize all of the sub-layers that it consists of:
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        ...
Here, you may notice that we have initialized instances of the FeedForward and AddNormalization classes, which we created in the previous section, and assigned their outputs to the respective variables, feed_forward and add_norm (1 and 2). The Dropout layer is self-explanatory, where rate defines the frequency at which the input units are set to 0. We created the MultiHeadAttention class in a previous tutorial, and if you saved that code into a separate Python script, do not forget to import it. I saved mine in a Python script named multihead_attention.py, and for this reason I need to include the line of code, from multihead_attention import MultiHeadAttention.
Let's now proceed to create the class method, call(), that implements all of the encoder sub-layers:
...
    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)

        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)
In addition to the input data, the call() method can also receive a padding mask. As a brief reminder of what we said in a previous tutorial, the padding mask is necessary to suppress the zero padding in the input sequence from being processed along with the actual input values.
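As an aside, such a mask could be built along the following lines (a sketch only, not part of this tutorial's code, and assuming that the token id 0 marks padding and that a value of 1.0 flags a position to be suppressed):

import tensorflow as tf

# Sketch of a padding mask: positions equal to 0 are flagged with 1.0
# so that the attention mechanism can suppress them
def padding_mask(input):
    return tf.cast(tf.math.equal(input, 0), tf.float32)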
The same class method can also receive a training flag which, when set to True, will only apply the Dropout layers during training.
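To get a feel for the encoder layer in isolation, the following minimal smoke test (not part of the original tutorial) can be run, assuming the MultiHeadAttention class from the previous tutorial is importable; the input is dummy data cast to float32 so that the residual addition matches the float32 sub-layer outputs:

from numpy import random

# Dummy input of shape (batch_size, sequence_length, d_model), cast to float32
x = random.random((64, 5, 512)).astype("float32")

# Parameter values follow Vaswani et al. (2017)
encoder_layer = EncoderLayer(h=8, d_k=64, d_v=64, d_model=512, d_ff=2048, rate=0.1)

# training=True activates the Dropout layers; training=False bypasses them
print(encoder_layer(x, None, True).shape)   # Expected: (64, 5, 512)
print(encoder_layer(x, None, False).shape)  # Expected: (64, 5, 512)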
The Transformer Encoder
The last step is to create a class for the Transformer encoder, which we shall be naming Encoder:
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]
        ...
The Transformer encoder receives an input sequence after it has undergone a process of word embedding and positional encoding. In order to compute the positional encoding, we will make use of the PositionEmbeddingFixedWeights class described by Mehreen Saeed in this tutorial.
As we have done in the previous sections, here we will also create a class method, call(), that applies word embedding and positional encoding to the input sequence and feeds the result to $N$ encoder layers:
...
    def call(self, input_sentence, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)

        return x
The code listing for the full Transformer encoder is the following:
from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout
from multihead_attention import MultiHeadAttention
from positional_encoding import PositionEmbeddingFixedWeights

# Implementing the Add & Norm Layer
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer

    def call(self, x, sublayer_x):
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x

        # Apply layer normalization to the sum
        return self.layer_norm(add)

# Implementing the Feed-Forward Layer
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)     # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()                # ReLU activation layer

    def call(self, x):
        # The input is passed into the two fully connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)

        return self.fully_connected2(self.activation(x_fc1))

# Implementing the Encoder Layer
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)

        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)

# Implementing the Encoder
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate) for _ in range(n)]

    def call(self, input_sentence, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)

        return x
Testing Out the Code
We will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers
...
As for the input sequence, we will be working with dummy data for the time being, until we arrive at the stage of training the complete Transformer model in a separate tutorial, at which point we will be using actual sentences:
...
enc_vocab_size = 20  # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence

input_seq = random.random((batch_size, input_seq_length))
...
Next, we will create a new instance of the Encoder class, assigning its output to the encoder variable, subsequently feeding in the input arguments and printing the result. We will set the padding mask argument to None for the time being, but we shall return to this when we implement the complete Transformer model:
...
encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))
Tying everything together produces the following code listing:
from numpy import random

# Note: this assumes the Encoder class from the listing above (together with its
# dependencies) is defined in the same script or imported

enc_vocab_size = 20  # Vocabulary size for the encoder
input_seq_length = 5  # Maximum length of the input sequence
h = 8  # Number of self-attention heads
d_k = 64  # Dimensionality of the linearly projected queries and keys
d_v = 64  # Dimensionality of the linearly projected values
d_ff = 2048  # Dimensionality of the inner fully connected layer
d_model = 512  # Dimensionality of the model sub-layers' outputs
n = 6  # Number of layers in the encoder stack

batch_size = 64  # Batch size from the training process
dropout_rate = 0.1  # Frequency of dropping the input units in the dropout layers

input_seq = random.random((batch_size, input_seq_length))

encoder = Encoder(enc_vocab_size, input_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate)
print(encoder(input_seq, None, True))
Running this code produces an output of shape (batch size, sequence length, model dimensionality). Note that you will likely see a different output due to the random initialization of the input sequence and the parameter values of the Dense layers.
tf.Tensor(
[[[-0.4214715  -1.1246173  -0.8444572  ...  1.6388322  -0.1890367   1.0173352 ]
  [ 0.21662089 -0.61147404 -1.0946581  ...  1.4627445  -0.6000164  -0.64127874]
  [ 0.46674493 -1.4155326  -0.5686513  ...  1.1790234  -0.94788337  0.1331717 ]
  [-0.30638126 -1.9047263  -1.8556844  ...  0.9130118  -0.47863355  0.00976158]
  [-0.22600567 -0.9702025  -0.91090447 ...  1.7457147  -0.139926   -0.07021569]]

 ...

 [[-0.48047638 -1.1034104  -0.16164204 ...  1.5588069   0.08743562 -0.08847156]
  [-0.61683714 -0.8403657  -1.0450369  ...  2.3587787  -0.76091915 -0.02891812]
  [-0.34268388 -0.65042275 -0.6715749  ...  2.8530657  -0.33631966  0.5215888 ]
  [-0.6288677  -1.0030932  -0.9749813  ...  2.1386387   0.0640307  -0.69504136]
  [-1.33254    -1.2524267  -0.230098   ...  2.515467   -0.04207756 -0.3395423 ]]], shape=(64, 5, 512), dtype=float32)
Further Reading

This section provides more resources on the topic if you are looking to go deeper.
Books
Papers
- Attention Is All You Need, 2017
Summary

In this tutorial, you discovered how to implement the Transformer encoder from scratch in TensorFlow and Keras.

Specifically, you learned:

- The layers that form part of the Transformer encoder.
- How to implement the Transformer encoder from scratch.
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.