Recurrent Neural Networks (RNNs) · Suman Bhadra Notes

Networks with memory

Dense nets and CNNs take a fixed-size input all at once. But language, audio, and time-series arrive as sequences, where order and context matter. RNNs handle them by processing one element at a time and carrying a hidden state forward — a memory of everything seen so far.

The recurrence

hₜ = f(W·xₜ + U·hₜ₋₁ + b)

The new hidden state hₜ depends on the current input xₜ and the previous hidden state hₜ₋₁, combined and passed through a nonlinearity f (typically tanh). The same weights W and U and the same bias b are reused at every step.
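As a minimal sketch of the recurrence, assuming f is tanh and using made-up weights (2 input features, 3 hidden units), the loop below applies one cell to each element of a sequence and carries the hidden state forward. The names `rnn_step` and `rnn_forward` are illustrative, not from any library.

```python
import math

def rnn_step(x, h_prev, W, U, b):
    """One recurrence step: h_t = tanh(W·x_t + U·h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W[i][j] * x[j] for j in range(len(x)))            # input term
        z += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))  # memory term
        h.append(math.tanh(z))
    return h

def rnn_forward(xs, h0, W, U, b):
    """Run the same cell over a whole sequence and keep every hidden state."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(x, h, W, U, b)
        states.append(h)
    return states

# Illustrative (assumed) parameters: W is 3x2, U is 3x3, b has 3 entries.
W = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
U = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.1, -0.2, 0.4]]
b = [0.0, 0.1, -0.1]
h0 = [0.0, 0.0, 0.0]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # a 3-step input sequence
states = rnn_forward(xs, h0, W, U, b)
for t, h in enumerate(states, start=1):
    print(t, [round(v, 3) for v in h])
```

Starting from a zero hidden state is a common convention rather than a requirement; the first step then depends only on the first input.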

Unroll it through time

Unrolled through time, the same cell processes a sentence word by word, passing its hidden state along at each step. Gradients then flow backward through that unrolled chain.

Key ideas

Hidden state: the memory

A vector summarizing the sequence so far. It's the channel through which the past influences the present.

Shared weights: same cell each step

One set of weights handles any sequence length. It is the same idea as a CNN sharing its filter weights across positions, but applied across time.
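To make the two ideas above concrete, here is a small sketch with a scalar RNN (one input, one hidden unit, tanh assumed as f). The parameter values and the name `run_rnn` are assumptions chosen for illustration. The same three numbers process a 2-step and a 100-step sequence, and swapping the order of the inputs changes the final hidden state.

```python
import math

def run_rnn(xs, w, u, b, h0=0.0):
    """Apply h_t = tanh(w*x_t + u*h_{t-1} + b) over xs and return the final state."""
    h = h0
    for x in xs:
        h = math.tanh(w * x + u * h + b)
    return h

params = (0.7, 0.5, 0.0)   # (w, u, b): illustrative values, fixed for every call

two_steps = run_rnn([1.0, -1.0], *params)
hundred_steps = run_rnn([1.0, -1.0] * 50, *params)   # same weights, longer input
swapped = run_rnn([-1.0, 1.0], *params)              # same elements, other order

print(two_steps, hundred_steps, swapped)
```

The parameter count never depends on the sequence length; only the number of loop iterations does.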

BPTT: backprop through time

Unroll the loop into a deep chain, then backpropagate through every time step to update the shared weights.
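The procedure can be sketched for a scalar tanh RNN. This is a simplified illustration under stated assumptions: a squared-error loss on the final hidden state only, hypothetical parameter values, and hand-written gradients rather than an autodiff library. The backward loop walks the unrolled chain from the last step to the first and accumulates gradients into the shared weights.

```python
import math

def forward(xs, w, u, b):
    """Return [h_0, h_1, ..., h_T] with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * x + u * hs[-1] + b))
    return hs

def loss(xs, target, w, u, b):
    return 0.5 * (forward(xs, w, u, b)[-1] - target) ** 2

def bptt(xs, target, w, u, b):
    """Gradients of the loss with respect to the shared w, u, b."""
    hs = forward(xs, w, u, b)
    dh = hs[-1] - target                 # dL/dh_T
    dw = du = db = 0.0
    for t in range(len(xs), 0, -1):      # t = T, T-1, ..., 1
        dz = dh * (1.0 - hs[t] ** 2)     # back through tanh at step t
        dw += dz * xs[t - 1]             # every step adds to the same weights
        du += dz * hs[t - 1]
        db += dz
        dh = dz * u                      # hand the gradient to h_{t-1}
    return dw, du, db

xs, target = [0.5, -1.0, 0.3, 0.8], 0.2  # illustrative data
w, u, b = 0.7, 0.5, 0.1
dw, du, db = bptt(xs, target, w, u, b)

# Sanity check against a central finite difference on u.
eps = 1e-6
du_numeric = (loss(xs, target, w, u + eps, b) - loss(xs, target, w, u - eps, b)) / (2 * eps)
print(du, du_numeric)
```

Note the line `dh = dz * u`: each step back multiplies the gradient by another factor involving u and the tanh slope.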

The Achilles' heel

Short memory

Because BPTT multiplies gradients across many time steps, plain RNNs suffer badly from vanishing and exploding gradients: the product either shrinks toward zero or blows up as the chain gets longer. When gradients vanish, the network cannot learn from what happened more than a few steps ago. It struggles to connect "I grew up in France … I speak fluent ___" across 50 tokens, while a short gap like "The clouds are in the sky" is fine.
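A rough numerical illustration of that multiplication, assuming a scalar tanh RNN: the sensitivity of the final state to the initial state is a product of one factor u·(1 − hₜ²) per step. The inputs are set to zero here so the state stays at 0, where tanh has slope 1 and each factor is exactly u. That is a simplification chosen to isolate the effect, and the function name and values are hypothetical.

```python
import math

def state_sensitivity(xs, w, u, b, h0=0.0):
    """d h_T / d h_0 for a scalar tanh RNN: the product of u * (1 - h_t^2)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * x + u * h + b)
        grad *= u * (1.0 - h * h)
    return grad

zeros = [0.0] * 50

short_gap = state_sensitivity(zeros[:3], w=1.0, u=0.5, b=0.0)   # 3 steps
vanishing = state_sensitivity(zeros, w=1.0, u=0.5, b=0.0)       # 50 steps, |u| < 1
exploding = state_sensitivity(zeros, w=1.0, u=1.5, b=0.0)       # 50 steps, |u| > 1

print(short_gap, vanishing, exploding)
```

Over 3 steps the signal is still usable; over 50 steps it is either far below floating-point noise for practical purposes or hundreds of millions of times larger. With nonzero inputs the tanh slope drops below 1, which pushes the product further toward vanishing.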

The fix was gated cells that learn what to remember and forget: see LSTM and GRU. (And for very long range, transformers dropped recurrence entirely.)