How does Multi-head attention differ from standard attention?
Answer options
A
It is faster
B
It is only used in GPT
C
It allows the model to focus on multiple parts of the input
D
simultaneously
E
It uses fewer parameters
F
None of the options given
Correct answer: It allows the model to focus on multiple parts of the input, simultaneously
Explanation
Multi-head attention runs multiple parallel attention mechanisms (heads), each attending to different representation subspaces, allowing the model to jointly attend to information from multiple positions simultaneously and capture diverse relationships. Options [2] and [3] form the complete answer: 'It allows the model to focus on multiple parts of the input simultaneously.'