Master LLM and Gen AI with 600+ Real Interview Questions
What is the difference between self-attention and multi-head attention in the Transformer architecture?
A) Self-attention focuses on global dependencies, while multi-head attention combines local features.
B) Self-attention processes individual tokens, while multi-head attention applies parallel attention mechanisms.
C) Self-attention computes the relationship between tokens, while multi-head attention applies multiple attention mechanisms to capture diverse relationships.
D) Self-attention normalizes token embeddings, while multi-head attention scales them for better gradient flow.
Correct Answer:
C) Self-attention computes the relationship between tokens, while multi-head attention applies multiple attention mechanisms to capture diverse relationships.
Explanation:
Self-attention computes, for each token, a weighted combination of all tokens in the sequence, capturing their pairwise relationships. Multi-head attention extends this by splitting the mechanism into multiple “heads”: each head applies scaled dot-product attention to its own learned projections of the queries, keys, and values, and the head outputs are concatenated and projected back to the model dimension. Because each head can attend to different aspects of the input, the model captures a richer, more diverse set of relationships than a single head could.
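To make the distinction concrete, here is a minimal NumPy sketch, not taken from any particular implementation: the dimensions, random weight initialisation, and function names are illustrative assumptions (in a real Transformer the projection matrices are learned). It shows a single scaled dot-product self-attention step, and then multi-head attention as several such steps run in parallel on separate projections, concatenated, and mixed by an output projection.

```python
# Minimal sketch contrasting single-head self-attention with multi-head attention.
# All weights are random stand-ins for learned parameters; shapes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Scaled dot-product attention: relates every token to every other token."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq, seq) similarity scores
    weights = softmax(scores, axis=-1)                # attention distribution per token
    return weights @ v                                # weighted sum of value vectors

def multi_head_attention(x, num_heads, rng):
    """Run several attention heads in parallel on separate projections,
    then concatenate and mix their outputs."""
    batch, seq, d_model = x.shape
    d_head = d_model // num_heads
    outputs = []
    for _ in range(num_heads):
        # Each head gets its own Q, K, V projections (random here, learned in practice),
        # so it can attend to a different aspect of the sequence.
        w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        outputs.append(self_attention(x @ w_q, x @ w_k, x @ w_v))
    concat = np.concatenate(outputs, axis=-1)         # (batch, seq, d_model)
    w_o = rng.standard_normal((d_model, d_model))     # final output projection
    return concat @ w_o

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 5, 16))                   # 1 sequence, 5 tokens, d_model = 16
print(self_attention(x, x, x).shape)                  # (1, 5, 16) — one head over the full dimension
print(multi_head_attention(x, num_heads=4, rng=rng).shape)  # (1, 5, 16) — 4 heads of size 4, concatenated
```

Note how both paths produce the same output shape: multi-head attention does not add capacity by enlarging the output, but by letting several smaller heads attend to the sequence in different ways before their results are combined.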

