LSTM — Long Short-Term Memory
Memory that lasts
Plain RNNs forget quickly because their gradients shrink as they are propagated back through many time steps. The LSTM (Hochreiter and Schmidhuber, 1997) addresses this with a separate cell state that runs through time like a conveyor belt, with gates that carefully add to or remove from it.
The cell state is the key. Information can ride along it almost unchanged for many steps, so gradients survive and long-range dependencies can be learned. The gates decide what flows where.
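To make "gradients survive" concrete, here is a rough sketch. It assumes the cell update C(t) = f·C(t−1) + i·C̃ and looks only at the direct cell-state path, where each step scales the gradient by the forget gate value f. The per-step factors 0.5 and 0.99 are illustrative assumptions, not measured values, and `surviving_gradient` is a hypothetical helper name.

```python
def surviving_gradient(per_step_factor: float, steps: int) -> float:
    """Fraction of a gradient left after `steps` multiplications by the same factor.

    Simplification: along the cell-state path the per-step factor is the
    forget gate value f; the indirect paths through the gates are ignored.
    """
    return per_step_factor ** steps


for steps in (10, 50, 100):
    shrinking = surviving_gradient(0.5, steps)    # assumed factor for a path that halves each step
    highway = surviving_gradient(0.99, steps)     # assumed forget gate held near 1
    print(f"{steps:>3} steps: factor 0.5 -> {shrinking:.3e}, factor 0.99 -> {highway:.3f}")
```

With a factor near 1 a usable share of the signal remains after 100 steps, while a factor of 0.5 leaves essentially nothing.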
The three gates in action
Watch the cell state flow across the top, while the forget, input, and output gates decide what to drop, what to store, and what to reveal.
What each gate does
Forget gate: looks at the current input and the previous hidden output, and decides how much of each element of the old cell state to keep (a value between 0 and 1 per element, where 0 erases and 1 keeps).
Input gate: decides how much of each new candidate value to write into the cell state.
Output gate: decides which parts of the (updated) cell state to expose as this step's hidden output.
Each gate is a small sigmoid layer with learned weights, so the network learns when to remember and when to forget. The near-linear cell-state highway lets gradients flow far back in time with much less vanishing than in a plain RNN.
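The three gates can be sketched as a single time step in NumPy. This is a minimal illustration of the commonly used LSTM formulation, not a reference implementation: the stacked weight layout, the function name `lstm_step`, the sizes `D` and `H`, and the random initialisation are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    Assumed layout: W has shape (4*H, D+H) and b has shape (4*H,), with the
    rows stacked in the order [forget, input, candidate, output].
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:H])                 # forget gate: how much old memory to keep
    i = sigmoid(z[H:2 * H])             # input gate: how much of the candidate to write
    c_tilde = np.tanh(z[2 * H:3 * H])   # candidate values
    o = sigmoid(z[3 * H:4 * H])         # output gate: how much to reveal
    c = f * c_prev + i * c_tilde        # cell-state update
    h = o * np.tanh(c)                  # hidden output for this step
    return h, c


rng = np.random.default_rng(0)
D, H = 3, 4                              # illustrative input and hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):        # run five time steps
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Only the line `c = f * c_prev + i * c_tilde` touches the cell state, and it is elementwise and additive, which is what keeps the highway near-linear.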
Operate the gates yourself. Old memory C(t−1) = 0.8 arrives on the belt and a candidate C̃ = 0.6 wants in. The three sliders are the gates — watch the arithmetic C(t) = f·C(t−1) + i·C̃ happen in the bars.
Set f = 1, i = 0: perfect memory — the old value rides through untouched (this is how LSTMs span hundreds of steps). Set f = 0: total amnesia. And with o = 0 the cell can know something yet reveal nothing.
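The same arithmetic can be checked in a few lines of Python. The values C(t−1) = 0.8 and C̃ = 0.6 come from the example above. The output rule h = o·tanh(C(t)) is the usual LSTM convention and is assumed here, since the text only says the output gate decides what to reveal. `cell_update` is a hypothetical helper name.

```python
import math


def cell_update(c_prev, c_tilde, f, i, o):
    """Apply the three gates to scalar values and return (new cell state, hidden output)."""
    c = f * c_prev + i * c_tilde   # C(t) = f*C(t-1) + i*C~
    h = o * math.tanh(c)           # assumed output rule: h = o * tanh(C(t))
    return c, h


C_PREV, C_TILDE = 0.8, 0.6

settings = [
    ("perfect memory", 1.0, 0.0, 1.0),   # f = 1, i = 0
    ("total amnesia", 0.0, 1.0, 1.0),    # f = 0, candidate written in full
    ("silent cell", 1.0, 0.0, 0.0),      # o = 0
]
for label, f, i, o in settings:
    c, h = cell_update(C_PREV, C_TILDE, f, i, o)
    print(f"{label:>14}: C(t) = {c:.2f}, h(t) = {h:.3f}")
```

In the "silent cell" row the cell state still holds the old memory while the hidden output is zero, which is the "know something yet reveal nothing" case.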
Where LSTMs fit
Good for:
- Long-range sequence dependencies
- Time-series, speech, handwriting, older NLP
- When data is modest and sequential
Limitations:
- Sequential: can't parallelize across time
- Heavier than the simpler GRU
- For very long range, transformers now dominate
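As a rough illustration of "heavier than the simpler GRU", the sketch below counts parameters for one layer. It assumes the textbook formulations: an LSTM has four weight blocks (three gates plus the candidate) and a GRU has three, each with one bias vector. The function names and the sizes used are hypothetical, and real libraries may report slightly different totals (some keep two bias vectors per block, for example).

```python
def block_params(input_size: int, hidden_size: int) -> int:
    """Parameters in one gate-sized block: weights over [input, hidden] plus a bias."""
    return hidden_size * (input_size + hidden_size) + hidden_size


def lstm_params(input_size: int, hidden_size: int) -> int:
    # forget, input, and output gates plus the candidate
    return 4 * block_params(input_size, hidden_size)


def gru_params(input_size: int, hidden_size: int) -> int:
    # update and reset gates plus the candidate
    return 3 * block_params(input_size, hidden_size)


d, h = 128, 256   # illustrative sizes
print("LSTM:", lstm_params(d, h))
print("GRU: ", gru_params(d, h))
print("ratio:", lstm_params(d, h) / gru_params(d, h))
```

Under these assumptions the LSTM carries one third more parameters than a GRU of the same width, whatever the sizes.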