Encoder-Decoder Transformers (T5)
The third Transformer family
The Transformer comes in three flavours. BERT uses only the encoder, which suits understanding tasks. GPT uses only the decoder, which suits generation. The original design uses both: an encoder to read the input and a decoder to write the output, joined by cross-attention. T5 and BART are the best-known examples of this encoder-decoder family.
Encoder: bidirectional self-attention builds a rich representation of the entire input in parallel.
Decoder: causal self-attention over its own output so far, generating one token at a time.
Cross-attention: at every step the decoder attends over all of the encoder's outputs, giving a learned, soft alignment between output and input positions.
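The three attention patterns above can be sketched with a single scaled dot-product attention function and different masks. This is a minimal NumPy illustration, not how T5 itself is implemented: it assumes one head, random vectors in place of real token embeddings, and no learned query/key/value projections, residual connections, or layer normalisation. The names `attention`, `enc_x`, and `dec_x` are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values, mask=None):
    """Single-head scaled dot-product attention.

    mask[i, j] is True where query i is allowed to attend to key j.
    Returns the attended values and the attention weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # blocked positions get zero weight
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
d_model = 8
enc_x = rng.normal(size=(2, d_model))  # 2 input tokens, e.g. "the", "cat"
dec_x = rng.normal(size=(3, d_model))  # 3 output positions produced so far

# Encoder: bidirectional self-attention, every token sees every other token.
enc_out, enc_w = attention(enc_x, enc_x, enc_x)

# Decoder: causal self-attention, position i sees only positions 0..i.
causal_mask = np.tril(np.ones((3, 3), dtype=bool))
dec_h, dec_w = attention(dec_x, dec_x, dec_x, mask=causal_mask)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
cross_out, cross_w = attention(dec_h, enc_out, enc_out)

print(enc_w.shape, dec_w.shape, cross_w.shape)
```

Each row of `cross_w` is a distribution over the input tokens, which is the soft alignment the decoder learns; in `dec_w` everything above the diagonal is zero because of the causal mask.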
Read in full, then generate
Translate "the cat" → "le chat". The encoder processes both input tokens together; the decoder then emits the output token by token, each time looking back at the encoded input through cross-attention.
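The control flow of "read in full, then generate" can be shown with a toy loop. Everything here is a stand-in: `encode`, `decode_step`, and `TOY_LEXICON` are hypothetical helpers that fake a trained model with a two-word lookup table, so the sketch illustrates only the shape of greedy decoding (encode once, then emit one token per step until an end marker), not real translation.

```python
END = "</s>"  # assumed end-of-sequence marker for this toy example

# Hypothetical lookup table standing in for what a trained model has learned.
TOY_LEXICON = {"the": "le", "cat": "chat"}

def encode(source_tokens):
    """Stand-in encoder: reads the whole input at once, one state per token."""
    return [{"token": tok, "position": i} for i, tok in enumerate(source_tokens)]

def decode_step(encoder_states, generated):
    """Stand-in decoder step: uses the encoder states and the output so far.

    A real decoder would use cross-attention to weigh all encoder states;
    here that is faked by looking at the source position matching the step.
    """
    step = len(generated)
    if step >= len(encoder_states):
        return END
    return TOY_LEXICON[encoder_states[step]["token"]]

def greedy_translate(source_tokens, max_len=10):
    encoder_states = encode(source_tokens)  # the encoder runs exactly once
    generated = []
    for _ in range(max_len):                # the decoder runs once per output token
        next_token = decode_step(encoder_states, generated)
        if next_token == END:
            break
        generated.append(next_token)
    return generated

print(greedy_translate(["the", "cat"]))  # ['le', 'chat']
```

The point of the split is cost: the encoder's work is done once and reused, while each decoder step only adds one new token on top of it.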
Which family for which job?
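Encoder-only (BERT): classification, NER, search, and anything else that maps text to a label or embedding.
Decoder-only (GPT): open-ended generation and chat, predicting the next token autoregressively.
Encoder-decoder (T5, BART): translation, summarization, and other tasks that map an input text to a different output text.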
T5's trick was to frame every task as text-in, text-out — prefix the input with "translate English to German:" or "summarize:" and the same model handles them all. A clean, unified take on the seq2seq idea.
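The text-in, text-out framing amounts to a string transformation applied before the model ever sees the input. The sketch below uses the two prefixes quoted above; the helper `to_text_to_text` and the task keys are hypothetical names chosen for illustration, and no model is called.

```python
# Prefixes quoted in the text above; the dictionary keys are made-up task names.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German:",
    "summarize": "summarize:",
}

def to_text_to_text(task, text):
    """Turn a (task, input) pair into the single string a text-to-text model reads."""
    return f"{TASK_PREFIXES[task]} {text}"

inputs = [
    to_text_to_text("translate_en_de", "The cat sleeps."),
    to_text_to_text("summarize", "A long article about cats..."),
]
for line in inputs:
    print(line)
```

Because the task is carried in the input string, one model with one set of weights can be trained and queried on all tasks, and its answer is always plain text as well.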