NLP Course - Self-Attention and Positional Encoding

Introduction to Self-Attention

In the previous notebook, we learned about the Attention mechanism, which allows a model to focus on relevant parts of an input sequence when generating an output, such as focusing on "cat" when producing "chat" while translating "The cat is on the mat" into French. Self-Attention, popularized by the Transformer model of Vaswani et al. (2017), is a special type of Attention in which the model attends to all words within the same sequence to understand their relationships.

Imagine reading a sentence: "The cat, which is fluffy, sleeps." To understand "fluffy," you need to connect it to "cat," not "sleeps." Self-Attention enables each word in the sentence to "look" at every other word, computing how much each word (e.g., "cat") influences the representation of another word (e.g., "fluffy"). This is done using queries, keys, and values, as introduced in the previous notebook.

In Self-Attention, for each word in the input sequence, the model:

1. Creates a query vector qᵢ from the word's embedding via a learned linear transformation.
2. Creates a key vector kᵢ via a second learned transformation.
3. Creates a value vector vᵢ via a third learned transformation.

These vectors are used to compute attention weights, which determine how much each word contributes to the representation of every other word, forming a new, context-aware representation for each word.

Why Use Self-Attention?

Self-Attention is a cornerstone of modern NLP models like Transformers because it addresses limitations of previous approaches, such as RNNs and standard Attention in Seq2Seq models:

- Parallelization: every word in the sequence is processed at the same time, rather than step by step as in an RNN, which makes training much faster.
- Long-range dependencies: any word can attend directly to any other word, so relationships between distant words do not have to survive many intermediate recurrence steps.
- Context within the input: standard Attention relates the decoder to the encoder, while Self-Attention also lets the input words contextualize each other (e.g., linking "fluffy" to "cat").

For example, in translating "The cat, which is fluffy, sleeps" to French, Self-Attention ensures "fluffy" is correctly associated with "cat," producing an accurate translation like "Le chat, qui est pelucheux, dort."

Differences from Standard Attention

While both Self-Attention and standard Attention (as used in Seq2Seq models) use queries, keys, and values to compute weighted sums, they differ in their application and scope:

- In standard Attention, the queries come from the decoder and the keys and values come from the encoder, so the mechanism relates two different sequences (e.g., the French output attending to the English input).
- In Self-Attention, the queries, keys, and values all come from the same sequence, so every word attends to the other words of its own sentence to build a context-aware representation.

In mathematical terms, both use similar formulations (e.g., scaled dot-product attention), but Self-Attention applies it internally:

Attention(Q, K, V) = softmax((Q Kᵀ) / √dₖ) V
Where Q, K, V are matrices of queries, keys, and values, and dₖ is the key dimension. In Self-Attention, Q, K, V are derived from the same input sequence.
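
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product Self-Attention for a single sequence; the function and matrix names (self_attention, W_q, W_k, W_v) are illustrative choices, not code from the original notebook.

import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X holds the embeddings of ONE sequence, shape (seq_len, d_model).
    # Q, K and V are all derived from the same input X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how much each word "matches" every other word
    weights = softmax(scores, axis=-1)    # attention weights, each row sums to 1
    return weights @ V                    # context-aware representation of each word

# Toy usage with random embeddings and random projection matrices.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (5, 8): one new vector per word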

Position-Agnostic Issue and Positional Encoding

A key limitation of Self-Attention is that it is position-agnostic. It treats the input sequence as a set of words, not an ordered sequence, because it computes attention weights based solely on content (queries and keys), ignoring word positions. For example, "The cat chased the dog" and "The dog chased the cat" would produce the same word representations (just assigned to different positions), as the mechanism doesn’t know which word comes first.

In NLP, word order is critical (e.g., subject-verb-object changes meaning). To address this, Positional Encoding is added to the input embeddings to encode the position of each word in the sequence.
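
We can check the position-agnostic behaviour directly with the illustrative self_attention function and tensors defined in the sketch above: shuffling the input words only shuffles the output rows, so each word ends up with exactly the same representation wherever it appears.

# Reuses X, W_q, W_k, W_v, rng and self_attention from the sketch above.
perm = rng.permutation(seq_len)                       # shuffle the "word order"
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)
print(np.allclose(out[perm], out_shuffled))           # True: only the row order changed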

What is Positional Encoding?

Positional Encoding adds a vector to each word’s embedding that represents its position in the sequence. This vector ensures the model can distinguish "cat" in position 2 from "cat" in position 5. The Transformer model uses fixed sinusoidal functions for positional encodings, defined as:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

- pos is the position of the word in the sequence (0, 1, 2, ...).
- i indexes the dimensions of the encoding vector, so 2i and 2i+1 are its even and odd dimensions.
- d_model is the dimensionality of the embeddings (and therefore of the positional encoding).

These sinusoidal functions create unique, periodic patterns for each position, allowing the model to learn relative positions (e.g., "word 3 is two positions after word 1"). The input to the model becomes:

xᵢ = embedding(wordᵢ) + PE(posᵢ)
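
As an illustration, here is a small NumPy sketch of the sinusoidal encoding and of adding it to the embeddings; the helper name sinusoidal_positional_encoding is ours, and d_model is assumed to be even.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # One row per position, one column per embedding dimension (d_model assumed even).
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine
    return pe

# The model input is the word embedding plus the encoding of its position.
seq_len, d_model = 5, 16
embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
x = embeddings + sinusoidal_positional_encoding(seq_len, d_model)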

Alternatively, positional encodings can be learned (trainable parameters), but fixed sinusoidal encodings are often used because they generalize well to longer sequences.

Why Positional Encoding Works

Each position receives a unique combination of sine and cosine values, so the same word gets a different input vector depending on where it appears, and the periodic structure makes it easy for the model to relate nearby positions. For example, in "The cat chased the dog," positional encodings ensure "cat" (position 2) and "dog" (position 5) have distinct representations, allowing Self-Attention to capture the correct subject-object relationship.

Linearity Problem and Feed-Forward Networks

Self-Attention is a linear operation because it computes a weighted sum of value vectors:

cᵢ = Σⱼ αᵢⱼ vⱼ
Where αᵢⱼ are attention weights and vⱼ are value vectors. This linearity limits the model’s ability to capture complex, non-linear relationships between words, which are often crucial in NLP tasks (e.g., understanding nuanced sentiment or grammar).

To address this, a Feed-Forward Neural Network (FFNN) or Multi-Layer Perceptron (MLP) is applied to each word’s attention output independently. In Transformers, the FFNN is applied after Self-Attention in each layer:

FFNN(x) = max(0, x W₁ + b₁) W₂ + b₂

Where:

- W₁ and W₂ are learned weight matrices, and b₁ and b₂ are bias vectors.
- max(0, ·) is the ReLU activation, which is what introduces the non-linearity.

The FFNN transforms each word’s representation, allowing the model to learn complex patterns. For example, it can capture non-linear interactions like "not happy" implying negative sentiment, which a linear operation might miss.
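
A minimal NumPy sketch of this position-wise FFNN; the function name position_wise_ffnn and the hidden size d_ff are illustrative choices.

import numpy as np

def position_wise_ffnn(x, W1, b1, W2, b2):
    # Applied to each word's vector independently, with the same weights at every position.
    hidden = np.maximum(0.0, x @ W1 + b1)     # ReLU: the non-linear step
    return hidden @ W2 + b2

# Toy usage: transform the attention output of a 5-word sequence.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                        # hidden size picked for illustration
attention_output = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffnn(attention_output, W1, b1, W2, b2).shape)   # (5, 16)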

Why FFNN Helps

As noted above, the attention output for each word is just a weighted sum of value vectors, so on its own it cannot model non-linear interactions between features. The ReLU inside the FFNN introduces exactly this non-linearity, and because the same FFNN is applied to every position independently, it adds expressive power without giving up the parallelism of Self-Attention.

Masked Self-Attention for Generative Tasks

In generative tasks, like language modeling or machine translation, the model generates output words one at a time, using only the words it has already produced. For example, when generating "Je t'aime," the model predicts "t'aime" based on "Je," not future words like "aime." Masked Self-Attention ensures the model only attends to previous words in the sequence, preventing it from "cheating" by looking at future words.

How Masked Self-Attention Works

In standard Self-Attention, each word attends to all words in the sequence, including future ones. In Masked Self-Attention, a mask is applied to the attention scores to block attention to future positions:

eᵢⱼ = (qᵢᵀ kⱼ) / √dₖ
eᵢⱼ = -∞ if j > i (mask future positions)
αᵢⱼ = softmax(eᵢⱼ)

Where:

- eᵢⱼ is the raw attention score between the query of word i and the key of word j.
- Setting eᵢⱼ = -∞ for every future position j > i makes the corresponding softmax weight zero, so word i cannot attend to words that come after it.
- αᵢⱼ are the resulting attention weights, with the softmax taken over j.

The mask is typically a triangular matrix (1s on and below the diagonal, 0s above), ensuring word i only attends to words at positions 1 to i.
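
The masking step can be sketched in NumPy as follows, mirroring the earlier self_attention example but with future positions blocked; names are again illustrative.

import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    # Identical to plain Self-Attention except that scores for future positions
    # are set to -inf before the softmax, so their weights become exactly zero.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)                       # block attention to future words
    scores = scores - scores.max(axis=-1, keepdims=True)             # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V   # word i now depends only on words 1..i

# Toy usage with random embeddings and projections.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
print(masked_self_attention(X, W_q, W_k, W_v).shape)   # (5, 8)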

Why Masked Self-Attention?

During training, the entire target sentence is available at once, so without a mask each position could simply look at the very word it is supposed to predict. Masking keeps training consistent with inference, where future words do not exist yet. For example, in generating "Je t'aime," Masked Self-Attention ensures "t'aime" is predicted based only on "Je," preventing the model from using "aime" prematurely.

Practical Example

Let’s build a Transformer-based model for translating English to French using a small parallel corpus. The encoder uses Self-Attention to process the English sentence (e.g., "I love you"), with positional encodings added to word embeddings to capture word order. Each word attends to all others, creating context-aware representations (e.g., linking "love" to "you"). The decoder uses Masked Self-Attention to generate the French translation ("Je t'aime") word by word, attending only to previous words in the output sequence (e.g., "t'aime" attends to "Je"). A Feed-Forward Network is applied after each attention layer to introduce non-linearity, enhancing expressiveness. This model leverages Self-Attention’s parallelization, positional encodings for order, and masking for autoregressive generation, outperforming RNN-based Seq2Seq models in accuracy and speed.
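
To make this walkthrough concrete, here is a minimal PyTorch sketch of such an encoder-decoder translator built on nn.Transformer, which bundles the Self-Attention, Masked Self-Attention, cross-attention and position-wise FFNN layers described above. The class name, vocabulary sizes, token ids and hyperparameters are made up for illustration; a real model would also need a tokenizer, padding masks and a training loop.

import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    # Hypothetical sizes, chosen only for illustration.
    def __init__(self, src_vocab=1000, tgt_vocab=1000, d_model=128, max_len=50):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.register_buffer("pos_enc", self._sinusoidal(max_len, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, dim_feedforward=256, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)   # scores over the French vocabulary

    @staticmethod
    def _sinusoidal(max_len, d_model):
        # Same sinusoidal positional encoding as defined earlier, written in torch.
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        dims = torch.arange(0, d_model, 2, dtype=torch.float32)
        angles = pos / (10000.0 ** (dims / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe

    def forward(self, src_ids, tgt_ids):
        # Embeddings plus positional encodings for source (English) and target (French).
        src = self.src_embed(src_ids) + self.pos_enc[: src_ids.size(1)]
        tgt = self.tgt_embed(tgt_ids) + self.pos_enc[: tgt_ids.size(1)]
        # Causal mask so the decoder only attends to previous target words.
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)

# Toy forward pass: batch of 1, "I love you" -> the French prefix generated so far (made-up ids).
model = TranslationModel()
logits = model(torch.tensor([[5, 6, 7]]), torch.tensor([[1, 2]]))
print(logits.shape)   # torch.Size([1, 2, 1000])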

Created by wikm.ir with ❤️