Benjamin Anderson

From Self-Attention to the Transformer

Posted Nov 19, 2022 by Benjamin Anderson

This exposition is going to assume you already have an idea of what self-attention is and how it works. If you don't, I recommend reading this post first. Though attention is the backbone of the Transformer model, it doesn’t quite get us there on its own. There are a few other building blocks that make up the Transformer (and are used, in various permutations, in pretty much all Transformer-y models like BERT and GPT-3).

Token Embedding and Positional Encoding

Before self-attention ever sees a sentence, the raw tokens have to be turned into vectors, and those vectors need some notion of where they sit in the sequence. That's what these first two components handle.

Token Embedding

Implicit in all the previous discussions of neural machine translation is that you can feed “words” to a neural network. Really, you can’t do that directly. Instead, words are represented as a word embedding in ℝ^d. There is a fixed vocabulary, and each word in the vocabulary corresponds to a d-dimensional vector in a lookup table. In the past, these might have been learned separately (via a procedure like GloVe or Word2Vec), and then taken as fixed inputs to the model. In more recent research, including the Transformer, embeddings are initialized randomly, and learned during training to minimize loss.
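
As a concrete sketch (my own illustration, not code from the post), a learned embedding table in PyTorch looks like the following; the vocabulary size and dimension are made-up values:

import torch
import torch.nn as nn

vocab_size = 10000   # hypothetical fixed vocabulary
d_model = 512        # the embedding dimension d

# A lookup table: row i is the learned d-dimensional vector for token id i.
# It is initialized randomly and updated by backprop like any other weight.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 42, 7, 301]])   # a batch with one 4-token sequence
x = embedding(token_ids)                      # shape: (1, 4, 512)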

Positional Encoding

Unlike the recurrent neural networks they’ve largely replaced, where tokens are fed to the model one at a time, Transformers digest a sequence all at once. And unlike convolutional neural networks, which employ sliding windows to compute functions of “nearby” values, there is no notion of “local” relationships: the Transformer computes self-attention of every input vector with respect to every other input vector in the sequence, and then summarizes the result. This makes self-attention permutation-equivariant: if you shuffle the input, you get back a correspondingly shuffled output, but nothing else about the result changes. That poses a problem, because the order of words in a sentence actually matters. “The dog bit the man” means something different than “the man bit the dog.” Because the Transformer doesn’t really have a notion of order, we have to add information to give the model a hint about the order of tokens in the input sequence: a positional encoding.

In practice, the way this is done is with periodic functions of various wavelengths. For each position 1, 2, …, n we compute a series of sine and cosine functions to get position vectors (p_1, p_2, …, p_n). Certain mathematical properties of these periodic functions make them useful for allowing the model to attend to relative positions. (If you’re interested, you can read a much more detailed explanation here.) These vectors are then added to (or, in some variants, concatenated with) the input sequence (x_1, x_2, …, x_n); the original Transformer simply adds them. Positional encoding only happens once, right before the input sequence (i.e. a sequence of word embedding vectors) is fed to the model.
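
For concreteness, here is a minimal PyTorch sketch of the sinusoidal encoding from the original paper (the function name and shapes are my own, and it assumes an even d_model):

import math
import torch

def sinusoidal_positional_encoding(n, d_model):
    # One d_model-dimensional vector per position 0..n-1.
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # (n, 1)
    # Frequencies form a geometric progression, giving wavelengths from 2*pi to 10000*2*pi.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                              # (d_model/2,)
    pe = torch.zeros(n, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cosine
    return pe

# x: (batch, n, d_model) word embeddings; in the original paper the encoding is simply added once:
# x = x + sinusoidal_positional_encoding(n, d_model)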

Normalization

Various forms of normalization and standardization are common in machine learning in order to improve learning. A more recent proposal is to explicitly include normalization at multiple stages of neural network architectures, rather than simply as a data preprocessing step. Because normalization (subtracting a mean, dividing by standard deviation) is a mathematical operation like any other, we can compute gradients and back-propagate through normalization layers. This helps training by reducing internal covariate shift, i.e. the shifting distributions of values at each layer as parameters update, which can cause problems with gradients and saturating nonlinearities. Normalization helps networks train faster, and even provides some regularization.

BatchNorm proposes normalizing the same neuron across a batch of inputs. LayerNorm averages in the other direction, i.e. across all the neurons in a layer for a single training example. This has some advantages over BatchNorm: it is agnostic to batch size, and it can be used in recurrent architectures. In the Transformer, it is used to normalize across the embedding dimension (i.e. we normalize each element s_1, s_2, …, s_n of the input sequence S separately). After normalization, there is a learned elementwise affine transformation: each feature is multiplied by a learned weight and shifted by a learned bias.
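
A from-scratch version of this per-element LayerNorm might look like the following PyTorch sketch (class name mine; torch also ships its own nn.LayerNorm):

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))   # learned elementwise weight
        self.bias = nn.Parameter(torch.zeros(d_model))    # learned elementwise bias
        self.eps = eps

    def forward(self, x):
        # x: (batch, seq_len, d_model); each sequence element is normalized
        # across its own embedding dimension, independent of the rest of the batch.
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True, unbiased=False)
        return self.weight * (x - mean) / (std + self.eps) + self.bias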

RMSNorm is a more recent proposal that, like LayerNorm, normalizes across the embedding dimension, but it skips the mean subtraction entirely: each element is simply divided by the root mean square of its features and rescaled by a learned gain, with no bias term. It isn’t part of the original Transformer, which uses LayerNorm, but it is cheaper to compute and has become a popular substitute in later Transformer-style models.
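
For comparison, a PyTorch sketch of RMSNorm (again my own code, not the post's): divide by the root mean square of the features and apply a learned gain, with no mean subtraction and no bias.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-8):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # Rescale each sequence element by the RMS of its features; no centering, no bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gain * x / rms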

Dropout

Dropout is a regularization technique for neural networks that zeroes out neurons with some small probability during training. This results in a model that is more robust: it can’t rely too heavily on any particular neuron, and neurons are prevented from “co-adapting”. This reduces overfitting, and at test time the network behaves roughly like the average of an “ensemble” of many different thinned-out networks. Dropout is applied at various points in the Transformer for regularization.
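
A quick PyTorch illustration of the train-versus-test behavior (this is standard inverted dropout, nothing specific to the Transformer):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.1)   # zero out each element with probability 0.1
x = torch.ones(2, 4)

drop.train()
print(drop(x))   # during training, survivors are scaled by 1/(1-p) so expected values match
drop.eval()
print(drop(x))   # at test time dropout is a no-op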

Residual Connections

Residual connections (or skip connections) are the last ingredient. Instead of a layer computing y = F(x), it computes y = x + F(x): the input is added back to the output, so the layer only has to learn a “residual” adjustment on top of the identity. This makes very deep networks much easier to train, because gradients can flow straight through the addition instead of having to pass through every transformation. Introduced with ResNets for image recognition, residual connections now show up in pretty much every deep architecture; in the Transformer, each sub-layer (self-attention and the feed-forward network) is wrapped in one, followed by layer normalization.
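
In code, a residual connection is literally just an addition. Here is a sketch of the “Add & Norm” wrapper the Transformer places around each sub-layer (class name mine, and I’m using torch’s built-in LayerNorm):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Wraps any sub-layer (self-attention or feed-forward) as
    # LayerNorm(x + Dropout(sublayer(x))), the "Add & Norm" step.
    def __init__(self, d_model, sublayer, p_drop=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x):
        return self.norm(x + self.dropout(self.sublayer(x)))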

Position-Wise Feed Forward Network

Each Transformer self-attention layer is followed by a “position-wise” feed-forward network, which is a fancy way of saying the same small neural network is applied to each sequence element s_1, s_2, …, s_n. Each layer of the Transformer has its own separate position-wise feed-forward network, but the same network is shared across sequence elements (i.e. the first word in a sentence has the same feed-forward network applied to it as the last word).
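
In the original paper this small network is just two linear layers with a ReLU in between (inner dimension 2048 for a model dimension of 512). A PyTorch sketch, with the class name my own:

import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model). nn.Linear acts on the last dimension,
        # so the same weights are applied to every position independently.
        return self.net(x)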

The Encoder Block

Attention -> LayerNorm -> Feed-Forward -> LayerNorm, with a residual connection around each of the two sub-layers (so each step is really LayerNorm(x + Sublayer(x))).
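
Putting the pieces together, here is a sketch of one post-norm encoder block in PyTorch; nn.MultiheadAttention stands in for the self-attention described in the earlier post, and the class name and hyperparameters are my own:

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p_drop, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x):
        # Self-attention sub-layer, then residual connection and LayerNorm.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer, again with Add & Norm.
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

# Usage: the block maps (batch, seq_len, d_model) to the same shape.
# block = EncoderBlock()
# out = block(torch.randn(2, 10, 512))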

The Decoder Block

Training vs. Inference