LSTM — Long Short-Term Memory · Suman Bhadra Notes

Memory that lasts

Plain RNNs forget quickly. The LSTM (Hochreiter and Schmidhuber, 1997) fixed this with a clever design: a separate cell state that runs through time like a conveyor belt, with gates that carefully add to or remove from it.

The cell state is the key. Information can ride along it almost unchanged for many steps — so gradients survive and long-range dependencies are learned. The gates decide what flows where.

The three gates in action

[Animation: the cell state flows across the top, while the forget, input, and output gates decide what to drop, what to store, and what to reveal.]

What each gate does

Forget gate: what to drop

Looks at the current input and the previous hidden state, and decides how much of each element of the old cell state to keep: a value from 0 (erase) to 1 (keep) per element.

Input gate: what to store

Decides how much of each new candidate value to write into the cell state.

Output gate: what to reveal

Decides which parts of the (updated) cell state to expose as this step's hidden output.
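
The three gates above can be sketched as one time step in NumPy. This is a minimal sketch of the standard LSTM formulation, not code from these notes: the names lstm_step, W, and b, the stacked weight layout, and the random toy sizes are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (hypothetical helper, standard formulation).

    x: input (d,), h_prev and c_prev: previous hidden and cell state (n,).
    W: stacked weights (4n, d + n), b: stacked biases (4n,).
    Assumed row order in W and b: forget, input, output, candidate.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b

    f = sigmoid(z[0:n])              # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])          # input gate: how much candidate to store
    o = sigmoid(z[2 * n:3 * n])      # output gate: how much memory to reveal
    c_tilde = np.tanh(z[3 * n:4 * n])  # candidate values

    c = f * c_prev + i * c_tilde     # additive cell-state update
    h = o * np.tanh(c)               # hidden output for this step
    return h, c, (f, i, o)

# Toy usage with random weights (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
d, n = 3, 4
W = rng.normal(scale=0.1, size=(4 * n, d + n))
b = np.zeros(4 * n)
x = rng.normal(size=d)
h, c, (f, i, o) = lstm_step(x, np.zeros(n), np.zeros(n), W, b)
```

Every gate value lands strictly between 0 and 1 because of the sigmoid, which is what lets each gate act as a soft per-element switch.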

Why gates beat plain RNNs

Each gate is a small sigmoid layer that learns when to remember and when to forget. Because the cell state is updated additively, the near-linear cell-state highway lets gradients flow far back in time with far less vanishing than in a plain RNN.
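
To put a number on the highway idea: along the direct cell-state path, the gradient of C(T) with respect to C(0) is just the product of the forget-gate values in between. The sketch below is a simplification under stated assumptions: it treats the forget values as constants and ignores the indirect paths through the gates and the hidden state. The function name cell_path_gradient is hypothetical.

```python
def cell_path_gradient(forget_values):
    """Product of forget-gate values: the gradient dC(T)/dC(0)
    along the direct cell-state path only (assumption: gates are
    treated as constants, indirect paths are ignored)."""
    grad = 1.0
    for f in forget_values:
        grad *= f
    return grad

steps = 100
open_gate = cell_path_gradient([0.99] * steps)  # about 0.366
half_gate = cell_path_gradient([0.5] * steps)   # about 7.9e-31
```

With the forget gate held near 1, a useful fraction of the gradient still arrives after 100 steps. With it at 0.5, the signal is effectively gone, which is roughly the situation a plain RNN is stuck in.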

Work through the gates by hand. Old memory C(t−1) = 0.8 arrives on the belt and a candidate C̃ = 0.6 wants in. The three gate values f, i, and o each lie between 0 and 1, and the cell update is the arithmetic C(t) = f·C(t−1) + i·C̃.


Set f = 1, i = 0: perfect memory — the old value rides through untouched (this is how LSTMs span hundreds of steps). Set f = 0: total amnesia. And with o = 0 the cell can know something yet reveal nothing.
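
The same settings can be checked in a few lines of Python. This is a minimal sketch using the scalar values from the example above. The function name cell_update is hypothetical, and the output rule h = o·tanh(C) is the standard LSTM formula, assumed here because the notes do not spell it out.

```python
import math

def cell_update(c_prev, c_tilde, f, i, o):
    """Scalar LSTM update (hypothetical helper)."""
    c = f * c_prev + i * c_tilde  # C(t) = f*C(t-1) + i*C~
    h = o * math.tanh(c)          # assumption: standard output rule
    return c, h

C_PREV, C_TILDE = 0.8, 0.6

# Perfect memory: the old value rides through untouched.
c_keep, h_keep = cell_update(C_PREV, C_TILDE, f=1.0, i=0.0, o=1.0)

# Total amnesia: the old value is erased and replaced by the candidate.
c_forget, h_forget = cell_update(C_PREV, C_TILDE, f=0.0, i=1.0, o=1.0)

# Closed output gate: the cell still holds 0.8 but reveals nothing.
c_hidden, h_hidden = cell_update(C_PREV, C_TILDE, f=1.0, i=0.0, o=0.0)

print(c_keep, c_forget, c_hidden, h_hidden)
```

In the third case the memory is intact (the cell state is unchanged) while the hidden output is zero, which is the "know something yet reveal nothing" behavior.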

Where LSTMs fit

Great for

Long-range sequence dependencies
Time series, speech, handwriting, pre-transformer NLP
When data is modest and sequential

But

Sequential — can't parallelize across time
Heavier than the simpler GRU
For very long range, transformers now dominate
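
To make "heavier than the GRU" concrete, here is a rough parameter count for one layer of each. It is a sketch under assumptions: an LSTM layer has four weight blocks (forget, input, output, candidate), a GRU has three, and each block has a single bias vector. Some libraries store two bias vectors per block, so their reported counts will be slightly higher. The function name and the layer sizes are made up for illustration.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Parameter count for one gated recurrent layer (hypothetical helper).

    Each block maps [input, previous hidden] to hidden_size units,
    with one bias vector per block (an assumption).
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

lstm_params = gated_rnn_params(input_size=128, hidden_size=256, n_blocks=4)
gru_params = gated_rnn_params(input_size=128, hidden_size=256, n_blocks=3)

print(lstm_params, gru_params)  # 394240 295680
```

Under these assumptions the LSTM carries 4/3 the parameters of a GRU of the same size, whatever the layer dimensions.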