9. LSTMs and GRUs

LSTMs

From: Understanding LSTM Networks

Long Short-Term Memory nets

  • A special kind of RNN capable of learning long-term dependencies, designed to mitigate the vanishing gradient problem

  • Introduced by Hochreiter and Schmidhuber in 1997. LSTMs later became the default choice of RNN architecture.

Suppose we want to predict the last words of the following passage:

Alice is allergic to nuts. (…). She can’t eat peanut butter.

If (…) contained many sentences, the relevant context ("allergic to nuts") might get lost in a standard RNN.

LSTM structure

A chain-like structure, just like a standard RNN.

Instead of a single hidden layer, the repeating module has four interacting components: the cell state, the forget gate layer, the input gate layer, and the output gate layer.

LSTM structure

Sigmoid and tanh


\[\sigma(z)=\frac{1}{1+e^{-z}}\]

\[\text{tanh}(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}\]
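
A minimal NumPy sketch of these two activations (the function names are my own):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1); used as a soft gate.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real value into (-1, 1); used for candidate values.
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # equivalent to np.tanh(z)
```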

The cell state

The cell state represents the long-term memory.

It runs along the entire chain with only minor linear interactions, so information can flow through it largely unchanged.

The LSTM can remove information from the cell state or add information to it, regulated by the gates.

LSTM cell state

The forget gate layer

LSTM forget gate

It decides what information to keep and what to discard from the cell state, looking at the previous hidden state and the current input. It uses a sigmoid function, which outputs a number between 0 and 1 for each element of the cell state:

  • 0: completely forget this
  • 1: completely keep this.
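
Following the notation of Understanding LSTM Networks, with \(h_{t-1}\) the previous hidden state and \(x_t\) the current input:

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]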

Input gate layer

It decides which values we’ll update:

  • The tanh layer creates a vector of new candidate values: a potential memory to add to the long-term memory.
  • The \(\sigma\) layer decides what fraction of the potential memory to add to the state.
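
In the same notation, the input gate and the candidate values are:

\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\]

\[\tilde{C}_t = \text{tanh}(W_C \cdot [h_{t-1}, x_t] + b_C)\]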

LSTM input gate

Update the cell state

  • We forget the things we decided to forget and add the candidate values, scaled by how much we decided to update each state value.
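
In equation form, the new cell state combines both decisions (\(\odot\) denotes elementwise multiplication):

\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]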

LSTM update cell state

The output gate

It produces a filtered version of the cell state, which becomes the new short-term memory (the hidden state).

Again, a \(\sigma\) layer decides which parts of the cell state to output, and a tanh pushes the cell state values between \(-1\) and \(1\) before they are scaled by the gate.
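
In the same notation:

\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\]

\[h_t = o_t \odot \text{tanh}(C_t)\]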

LSTM output gate
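
Putting the four steps together, one LSTM time step can be sketched in NumPy; the weight and bias names (`W["f"]`, `b["f"]`, …) are illustrative, not from the original post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W and b are dicts holding one weight
    matrix and one bias vector per gate (hypothetical layout)."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate values
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    C_t = f * C_prev + i * C_tilde            # new cell state (long-term memory)
    h_t = o * np.tanh(C_t)                    # new hidden state (short-term memory)
    return h_t, C_t
```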

GRUs

Gated Recurrent Unit

  • A variation of the LSTM with two gates:

    • reset gate: how much of the previous hidden state to use when computing the new candidate
    • update gate: how much of the new candidate replaces the previous hidden state
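
In the common formulation (biases omitted for brevity), with \(z_t\) the update gate and \(r_t\) the reset gate:

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\]

\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\]

\[\tilde{h}_t = \text{tanh}(W \cdot [r_t \odot h_{t-1}, x_t])\]

\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]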

GRUs
  • Introduced by Cho et al. in 2014 as a simplification of LSTMs, with fewer parameters (the cell state and hidden state are merged).

  • In practice, there is often little performance difference between LSTMs and GRUs.
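
Since both cells expose nearly the same interface, swapping one for the other is a one-line change in most frameworks. A minimal PyTorch sketch (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(5, 3, 10)             # (batch, seq_len, input_size)

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)

out_lstm, (h_n, c_n) = lstm(x)        # the LSTM also returns a cell state
out_gru, h_n_gru = gru(x)             # the GRU has no separate cell state

print(out_lstm.shape, out_gru.shape)  # both: torch.Size([5, 3, 20])
```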