What is the difference between self-attention and multi-head attention in the Transformer architecture?

A) Self-attention focuses on global dependencies, while multi-head attention combines local features.
B) Self-attention processes individual tokens, while multi-head attention applies parallel attention mechanisms.
C) Self-attention computes the relationship between tokens, while multi-head attention applies multiple attention mechanisms to capture diverse relationships.
D) Self-attention normalizes token embeddings, while multi-head attention scales them for better gradient flow.

Correct Answer:
C) Self-attention computes the relationship between tokens, while multi-head attention applies multiple attention mechanisms to capture diverse relationships.

Explanation:
Self-attention computes the relationships between tokens within a sequence: each token's query is scored against every token's key, and the resulting weights combine the values into a context-aware representation. Multi-head attention extends this by splitting the mechanism into multiple “heads” that run in parallel on lower-dimensional projections of the input. Each head can attend to a different aspect of the sequence, and the head outputs are concatenated and projected back to the model dimension, enriching the model’s understanding of the data.
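
A minimal NumPy sketch may make the distinction concrete. The function names (`self_attention`, `multi_head_attention`) and the random weight matrices below are illustrative assumptions, not part of any particular library or the original question; the point is that multi-head attention simply runs scaled dot-product self-attention several times in parallel on split subspaces and then recombines the results.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Scaled dot-product attention: every token attends to every token
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Project the input, split the model dimension into num_heads subspaces,
    # run self-attention independently in each head, then concatenate and project.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads = self_attention(Q, K, V)                        # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                    # final output projection

# Toy usage with random (untrained, purely illustrative) weights
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (4, 8)
```

Note how each head sees only a `d_model / num_heads`-dimensional slice, so the total compute is comparable to a single full-width attention pass while allowing the heads to capture different relationships.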
