Encoder-Decoder Transformers (T5)

The third Transformer family

The Transformer comes in three flavours. BERT uses only the encoder, which suits understanding tasks. GPT uses only the decoder, which suits generation. The original Transformer design uses both: an encoder to read the input and a decoder to write the output, joined by cross-attention. T5 and BART are the best-known examples.

Encoder reads all at once

Bidirectional self-attention builds a rich representation of the entire input in parallel.

Decoder writes left-to-right

Causal self-attention lets each position see only the output produced so far, so the decoder generates one token at a time.
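The difference between the two kinds of self-attention comes down to a mask. The sketch below is a simplified illustration in NumPy, not a full Transformer layer: it assumes a single head and skips the learned query/key/value projections, and the function name `self_attention_weights` is made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_weights(x, causal=False):
    """Attention weights of shape (seq_len, seq_len) for inputs x of shape (seq_len, d).

    Simplification: queries and keys are the raw inputs (no learned projections).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Position i may only attend to positions <= i, so mask the upper triangle.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8 features each

encoder_weights = self_attention_weights(x, causal=False)  # every token sees every token
decoder_weights = self_attention_weights(x, causal=True)   # lower-triangular pattern
```

In the encoder-style case every entry of the weight matrix is non-zero; in the decoder-style case everything above the diagonal is exactly zero, which is what stops a token from looking at later tokens.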

Cross-attention is the bridge

At every step the decoder attends over all of the encoder's outputs, learning a soft alignment between each output token and the input positions that matter for it.
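As a rough sketch of that bridge: in cross-attention the queries come from the decoder's states, while the keys and values come from the encoder's outputs. The NumPy code below assumes a single head with random, untrained weights; `cross_attention` and the `w_q`, `w_k`, `w_v` matrices are illustrative names, not any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v):
    """Single-head cross-attention.

    decoder_states:  (tgt_len, d_model), one row per output position
    encoder_outputs: (src_len, d_model), one row per input token
    """
    q = decoder_states @ w_q    # queries come from the decoder
    k = encoder_outputs @ w_k   # keys come from the encoder
    v = encoder_outputs @ w_v   # values come from the encoder
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (tgt_len, src_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_outputs = rng.normal(size=(2, d_model))  # e.g. two input tokens
decoder_states = rng.normal(size=(3, d_model))   # three output positions
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_outputs, w_q, w_k, w_v)
print(weights.shape)  # (3, 2): one distribution over the input per output position
```

Each row of `weights` sums to 1 and can be read as how strongly that output position attends to each input token. No causal mask is needed here, because the whole input is already available.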

Read in full, then generate

Translate "the cat" → "le chat". The encoder processes both input tokens together; the decoder then emits the output token by token, each time looking back at the encoded input through cross-attention.
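The control flow of "read once, then generate step by step" can be sketched as a greedy decoding loop. Everything model-like below is a toy stand-in: `encode`, `decode_step`, the `TOY_DICTIONARY` lookup and the `</s>` end marker are hypothetical placeholders for a trained encoder and decoder, used only to make the loop runnable.

```python
END = "</s>"  # assumed end-of-sequence marker for this toy example
TOY_DICTIONARY = {"the": "le", "cat": "chat"}

def encode(source_tokens):
    # Stand-in encoder: a real one would return contextual vectors for all tokens at once.
    return tuple(source_tokens)

def decode_step(memory, generated):
    """Stand-in decoder step: pick the next token from the encoded input
    and the output produced so far."""
    position = len(generated)
    if position >= len(memory):
        return END
    return TOY_DICTIONARY[memory[position]]

def greedy_translate(source_tokens, max_len=10):
    memory = encode(source_tokens)  # the encoder runs once, over the full input
    generated = []
    for _ in range(max_len):        # the decoder runs once per output token
        token = decode_step(memory, generated)
        if token == END:
            break
        generated.append(token)
    return generated

print(greedy_translate(["the", "cat"]))  # ['le', 'chat']
```

The point is the shape of the loop rather than the lookup table: the encoded input is computed a single time and reused at every decoding step.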

Which family for which job?

Encoder-only: BERT

Classification, NER, search — anything that maps text to a label or embedding.

Decoder-only: GPT

Open-ended generation and chat — predict the next token, autoregressively.

Encoder-decoder: T5 / BART

Transform one sequence into another: translation, summarization, Q&A.

"Text-to-text" everything

T5's trick was to frame every task as text-in, text-out — prefix the input with "translate English to German:" or "summarize:" and the same model handles them all. A clean, unified take on the seq2seq idea.
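On the input side, the text-to-text framing amounts to string formatting. The sketch below uses the two prefixes quoted above; the task keys and the helper name `to_text_to_text` are invented for this example, and a real setup would pass the resulting string to a tokenizer and model.

```python
# Task prefixes as quoted in the text; the dictionary keys are illustrative.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}

def to_text_to_text(task, text):
    """Turn a (task, input) pair into the single string the model would read."""
    return TASK_PREFIXES[task] + text

print(to_text_to_text("translate_en_de", "The cat sleeps."))
# translate English to German: The cat sleeps.
print(to_text_to_text("summarize", "A long article about cats."))
# summarize: A long article about cats.
```

Because the task is encoded in the input string, supporting another task in this sketch means adding a prefix rather than adding a new output head.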