What is the purpose of using multi-head attention in Transformer models?

Question:
What is the purpose of using multi-head attention in Transformer models?
A) To reduce the complexity of training by splitting attention across multiple layers.
B) To capture diverse relationships in the data by attending to different parts of the sequence simultaneously.
C) To enhance gradient flow across layers using parallel attention heads.
D) To perform a hierarchical clustering of tokens based on similarity.

Correct Answer:
B) To capture diverse relationships in the data by attending to different parts of the sequence simultaneously.

Explanation:
Multi-head attention runs several attention operations in parallel, each with its own learned query, key, and value projections. Because each head can specialize in a different kind of relationship, such as short-range syntactic dependencies versus long-range semantic ones, the model captures a richer view of the input context than a single attention head could. The heads' outputs are concatenated and projected back to the model dimension, so this added flexibility comes at roughly the same total cost as one full-width attention operation.
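
To make the mechanics concrete, here is a minimal PyTorch sketch of multi-head self-attention in the style of Vaswani et al. (2017). The dimensions used in the example (d_model=512, num_heads=8) match the original Transformer paper, but the class and variable names are illustrative rather than any particular library's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative, unoptimized)."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One linear projection each for queries, keys, values, plus the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split the model dimension into independent heads:
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention, computed for all heads in parallel.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)  # each head learns its own attention pattern
        context = weights @ v                # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads and mix them with the output projection.
        context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(context)

# Example: 8 heads over a 512-dimensional model, as in the original Transformer.
mha = MultiHeadAttention(d_model=512, num_heads=8)
out = mha(torch.randn(2, 10, 512))  # (batch=2, seq_len=10)
print(out.shape)                    # torch.Size([2, 10, 512])
```

Note how the softmax is applied independently per head: each head produces its own attention distribution over the sequence, which is exactly what lets different heads attend to different parts of the input simultaneously.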
