Wednesday 8th December, 2021 - Bruce Sterling

Transformers were introduced in the 2017 paper "Attention Is All You Need" as a tool for sequence transduction: converting one sequence of symbols into another. The most popular example is translation, as from English to German.

They have also been adapted to perform sequence completion: given a starting prompt, continue on in the same vein and style. Transformers have quickly become an indispensable tool for research and product development in natural language processing.
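To make the idea of sequence completion concrete, here is a minimal sketch. It uses a toy corpus and a simple first-order (bigram) model of the kind built up later in the article, not a transformer; the corpus, the `complete` function, and the greedy choice of the most common next word are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy corpus, an assumption for illustration only.
corpus = "show me my directories please . show me my files please".split()

# Count word-to-next-word transitions: a first-order (bigram) model.
transitions = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def complete(prompt, n_words=3):
    """Continue a prompt by greedily picking the most common next word."""
    words = prompt.split()
    for _ in range(n_words):
        nexts = transitions.get(words[-1])
        if not nexts:
            break
        words.append(max(nexts, key=nexts.get))
    return " ".join(words)
```

Given the prompt `"show me"`, `complete("show me", 2)` carries the sequence forward word by word using the transition counts. A transformer does the same job of predicting what comes next, but with a far richer model of context than the single previous word used here.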

Before we start, just a heads-up. We’re going to be talking a lot about matrix multiplications and touching on backpropagation (the algorithm for training the model), but you don’t need to know any of it beforehand. We’ll add the concepts we need one at a time, with explanation.

This isn’t a short journey, but I hope you’ll be glad you came.

► One-hot encoding

► Dot product

► Matrix multiplication

► Matrix multiplication as a table lookup

► First order sequence model

► Second order sequence model

► Second order sequence model with skips

► Masking

► Rest Stop and an Off Ramp

► Attention as matrix multiplication

► Second order sequence model as matrix multiplications

► Sequence completion

► Embeddings

► Positional encoding

► De-embeddings

► Softmax

► Multi-head attention

► Single head attention revisited

► Skip connection

► Multiple layers

► Decoder stack

► Encoder stack

► Cross-attention

► Tokenizing

► Byte pair encoding

► Audio input

► Resources and credits…