From Self-Attention to the Transformer
This exposition is going to assume you already have an idea of what self-attention is and how it works. If you don't, I recommend reading this post first. Though attention is the backbone of the Transformer model, it doesn’t quite get us there on its own. There are a few other building blocks that make up the Transformer (and are used, in various permutations, in pretty much all Transformer-y models like BERT and GPT-3).
Token Embedding and Positional Encoding
Token Embedding
Implicit in all the previous discussions of neural machine translation is that you can feed “words” to a neural network. Really, you can’t do that directly. Instead, each word (or, more precisely, each token from a fixed vocabulary) is represented as a word embedding: a learned vector of some fixed dimension, looked up from a table that is trained along with the rest of the model.
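As a minimal sketch (in PyTorch, with made-up vocabulary and embedding sizes), the embedding is just a learned lookup table from token ids to vectors:

import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512      # illustrative sizes, not fixed by anything above
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 921, 3, 7]])   # one "sentence" of four token ids
vectors = embedding(token_ids)               # shape: (1, 4, 512), one vector per token

These embedding vectors are what actually get fed into the first layer of the model.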
Positional Encoding
Unlike the recurrent neural networks they’ve largely replaced, where tokens are fed to the model one at a time, Transformers digest a sequence all at once. And unlike convolutional neural networks, which employ sliding windows to compute functions of “nearby” values, there is no notion of “local” relationships: the Transformer computes self-attention of all input vectors with respect to all input vectors for the entire sequence, and then summarizes the result. This makes self-attention permutation-equivariant: if you shuffle the input, you’d get back a correspondingly shuffled output, but nothing else about the result would change. This poses a problem, because the order of words in a sentence actually matters. “The dog bit the man” means something different from “the man bit the dog.” Because the Transformer doesn’t really have a notion of order, we have to add information to give the model a hint about the order of tokens in the input sequence: a positional encoding.
In practice, the way this is done is with periodic functions of various wavelengths. For each position pos in the sequence and each pair of embedding dimensions indexed by i, the original paper uses sin(pos / 10000^(2i/d_model)) for the even dimensions and cos(pos / 10000^(2i/d_model)) for the odd ones, so every position gets a unique pattern of values across wavelengths ranging from 2π up to 10000·2π. The resulting vector is simply added to the token embedding at that position.
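Here is a sketch of that encoding (the 10000 constant and the sine/cosine interleaving are from the original paper; the sizes are just for illustration):

import torch

def sinusoidal_positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).float().unsqueeze(1)      # (max_len, 1)
    i = torch.arange(d_model // 2).float().unsqueeze(0)   # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))           # one wavelength per pair of dims
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
# pe[t] is added to the embedding of the token at position t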
Normalization
Various forms of normalization and standardization are common in machine learning in order to improve learning. A more recent proposal is to explicitly include normalization at multiple stages of neural network architectures, rather than simply as a data preprocessing step. Because normalization (subtracting a mean, dividing by standard deviation) is a mathematical operation like any other, we can compute gradients and back-propagate through normalization layers. This helps training by reducing internal covariate shift, i.e. the shifting distributions of values at each layer as parameters update, which can cause problems with gradients and saturating nonlinearities. Normalization helps networks train faster, and even provides some regularization.
BatchNorm proposes normalizing the same neuron across a batch of inputs. LayerNorm averages in the other direction, i.e. across all neurons in a layer for a single training example. This has some advantages over BatchNorm: it is agnostic to batch size, and it can be used in recurrent architectures. In the Transformer, it is only used to average across the embedding dimension (i.e. each position in the sequence is normalized independently, across its d_model embedding features, rather than across positions or across the batch).
RMSNorm is a more recent proposal that is similar to LayerNorm: it also normalizes each vector across the embedding dimension, but it skips the mean subtraction (and the bias term) and simply divides by the root mean square of the activations, scaled by a learned gain. It isn’t part of the original Transformer, but it has become a popular drop-in replacement for LayerNorm in more recent Transformer variants.
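A sketch of both, operating over the last (embedding) dimension; the small eps constant and the learned gain and bias are the standard ingredients:

import torch

def layer_norm(x, gain, bias, eps=1e-5):
    # subtract the mean and divide by the standard deviation over the embedding dim,
    # then apply a learned per-feature rescale and shift
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gain * (x - mean) / torch.sqrt(var + eps) + bias

def rms_norm(x, gain, eps=1e-5):
    # no mean subtraction and no bias: just divide by the root mean square
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gain * x / rms

x = torch.randn(2, 4, 512)                       # (batch, sequence, embedding)
gain, bias = torch.ones(512), torch.zeros(512)   # learned parameters in a real model
out = layer_norm(x, gain, bias)                  # same shape as x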
Dropout
Dropout is a regularization technique for neural networks that zeroes out each neuron with some probability during training. This results in a model that is more robust: it can’t rely too heavily on any particular neuron, and it prevents neurons from “co-adapting”. This prevents the model from overfitting, and effectively gives rise to a neural network that, at test time, behaves like an average of an “ensemble” of different sub-networks. Dropout is applied at various points in the Transformer for regularization.
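Conceptually it looks like this (“inverted” dropout, which rescales the surviving activations during training so that nothing needs to change at test time); p = 0.1 is the rate used in the original Transformer:

import torch

def dropout(x, p=0.1, training=True):
    if not training or p == 0.0:
        return x                              # at test time dropout is a no-op
    mask = (torch.rand_like(x) > p).float()   # keep each element with probability 1 - p
    return mask * x / (1.0 - p)               # rescale so the expected value is unchanged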
Residual Connections
Residual connections are simpler than they sound: instead of having a layer output f(x), you have it output x + f(x), so the layer only has to learn a change to its input rather than a whole new representation from scratch. This also gives gradients a direct, unobstructed path back through the network, which makes very deep models much easier to train (this is the key idea behind ResNets). They’re used around every sub-layer in the Transformer, and in pretty much every other modern architecture.
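In code, a residual connection really is a one-liner (how it combines with dropout and LayerNorm shows up in the encoder block sketch further down):

def residual(f, x):
    # instead of replacing x with f(x), keep x and add the sub-layer's output on top
    return x + f(x)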
Position-Wise Feed Forward Network
Each Transformer self-attention layer is followed by a “position-wise” feed-forward network, which is a fancy way of saying the same small neural network is applied to each sequence element independently: the same weights at every position, with no interaction between positions. In the original Transformer this is just two linear layers with a ReLU in between, expanding from the model dimension of 512 up to 2048 and then projecting back down.
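A sketch in PyTorch, with the sizes from the original paper (d_model = 512 expanding to d_ff = 2048); the dropout placement inside the block is one common implementation choice, not something specified above:

import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(d_ff, d_model),   # project back down
        )

    def forward(self, x):
        # x is (batch, seq_len, d_model); the same weights are applied at every position
        return self.net(x)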
The Encoder Block
Putting the pieces together, an encoder block is: Attention -> Add & LayerNorm -> Feed-Forward -> Add & LayerNorm. Each sub-layer (the self-attention and the position-wise feed-forward network) has dropout applied to its output, is wrapped in a residual connection, and is followed by layer normalization.
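Here’s a sketch of one encoder block along those lines, using PyTorch’s built-in multi-head attention for brevity (hyperparameters are the original paper’s defaults; a real implementation would also pass an attention mask for padding):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, p=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p)

    def forward(self, x):
        # self-attention sub-layer: dropout, residual add, then LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # feed-forward sub-layer: dropout, residual add, then LayerNorm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

block = EncoderBlock()
x = torch.randn(2, 10, 512)   # (batch, sequence length, d_model)
print(block(x).shape)         # torch.Size([2, 10, 512])

The full encoder is just a stack of these blocks (six of them in the original paper), applied to the embedded, position-encoded input.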
The end!