An RNN using an LSTM architecture to generate text based on a prime word.
This project is maintained by infiniteoverflow
In this notebook, I’ll construct a character-level LSTM with PyTorch. The network will train character by character on some text, then generate new text character by character. As an example, I will train on Anna Karenina. This model will be able to generate new text based on the text from the book!
This network is based on Andrej Karpathy's post on RNNs and his implementation in Torch. Below is the general architecture of the character-wise RNN.
To train on this data, we also want to create mini-batches for training. Remember that we want our batches to be multiple sequences of some desired number of sequence steps. Considering a simple example, our batches would look like this:
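For a small concrete illustration (with made-up numbers rather than the book text), suppose `arr` encodes twelve characters and we use `batch_size=2` and `seq_length=3`:

```python
import numpy as np

# Toy encoded text: 12 "characters", batch_size = 2, seq_length = 3
arr = np.arange(1, 13)           # [ 1  2  3  4  5  6  7  8  9 10 11 12]

arr = arr.reshape((2, -1))       # [[ 1  2  3  4  5  6]
                                 #  [ 7  8  9 10 11 12]]

x = arr[:, 0:3]                  # first batch of inputs:   [[1 2 3], [7 8  9]]
y = arr[:, 1:4]                  # targets shifted by one:  [[2 3 4], [8 9 10]]
```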
In this example, we’ll take the encoded characters (passed in as the `arr` parameter) and split them into multiple sequences, given by `batch_size`. Each of our sequences will be `seq_length` long.
1. The first thing we need to do is discard some of the text so we only have completely full mini-batches.
Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences in a batch) and $M$ is the `seq_length`, or number of time steps in a sequence. Then, to get the total number of batches, $K$, that we can make from the array `arr`, you divide the length of `arr` by the number of characters per batch. Once you know the number of batches, you can get the total number of characters to keep from `arr`: $N \times M \times K$.
2. After that, we need to split `arr` into $N$ batches.

You can do this using `arr.reshape(size)` where `size` is a tuple containing the dimension sizes of the reshaped array. We know we want $N$ sequences in a batch, so let’s make that the size of the first dimension. For the second dimension, you can use `-1` as a placeholder in the size; it’ll fill up the array with the appropriate data for you. After this, you should have an array that is $N \times (M * K)$.
3. Now that we have this array, we can iterate through it to get our mini-batches.

The idea is that each batch is an $N \times M$ window on the $N \times (M * K)$ array. For each subsequent batch, the window moves over by `seq_length`. We also want to create both the input and target arrays. Remember that the targets are just the inputs shifted over by one character. The way I like to do this window is to use `range` to take steps of size `seq_length` from $0$ to `arr.shape[1]`, the total number of tokens in each sequence. That way, the integers you get from `range` always point to the start of a batch, and each window is `seq_length` wide (see the code sketch after this list).
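Putting the three steps together, a minimal sketch of this batching routine could look like the following, assuming `arr` is a NumPy array of encoded characters (the function name `get_batches` is just illustrative):

```python
import numpy as np

def get_batches(arr, batch_size, seq_length):
    """Yield mini-batches of shape (batch_size, seq_length) from an encoded array."""
    # 1. Keep only enough characters to make completely full mini-batches
    chars_per_batch = batch_size * seq_length
    n_batches = len(arr) // chars_per_batch
    arr = arr[:n_batches * chars_per_batch]

    # 2. Reshape into batch_size rows; -1 lets NumPy infer the remaining M * K columns
    arr = arr.reshape((batch_size, -1))

    # 3. Slide an N x M window across the columns, seq_length steps at a time
    for n in range(0, arr.shape[1], seq_length):
        x = arr[:, n:n + seq_length]
        # Targets are the inputs shifted over by one character
        y = np.zeros_like(x)
        y[:, :-1] = x[:, 1:]
        # The final target column wraps around to the start of the array
        y[:, -1] = arr[:, (n + seq_length) % arr.shape[1]]
        yield x, y
```

Each yielded `x`, `y` pair is one $N \times M$ mini-batch of inputs together with its shifted targets.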
Below is where you’ll define the network.
<img src="images/charRNN.png" width=500px>
Next, you’ll use PyTorch to define the architecture of the network. We start by defining the layers and operations we want. Then, define a method for the forward pass. You’ve also been given a method for predicting characters.
In `__init__` the suggested structure is as follows:

- Define an LSTM layer that takes as parameters: an input size (the number of characters), a hidden layer size `n_hidden`, a number of layers `n_layers`, a dropout probability `drop_prob`, and a `batch_first` boolean (True, since we are batching)
- Define a dropout layer with `drop_prob`
- Define a fully-connected layer with input size `n_hidden` and output size (the number of characters)

Note that some parameters have been named and given in the `__init__` function, and we use and store them by doing something like `self.drop_prob = drop_prob`.
You can create a basic LSTM layer as follows:

```python
self.lstm = nn.LSTM(input_size, n_hidden, n_layers,
                    dropout=drop_prob, batch_first=True)
```
where `input_size` is the number of characters this cell expects to see as sequential input, and `n_hidden` is the number of units in the hidden layers in the cell. We can add dropout by passing a `dropout` parameter with a specified probability; this automatically adds dropout to the outputs of each LSTM layer except the last. The `n_layers` argument stacks the LSTM layers for us, sending the output of one layer into the next. Finally, in the `forward` function, we use `.view` to reshape the LSTM outputs so that every time step becomes its own row that can be passed to the fully-connected layer.
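As a rough sketch, a `forward` pass along these lines might look as follows (the layer names `self.dropout` and `self.fc` are illustrative assumptions matching the suggested structure above, not necessarily the project's exact code):

```python
def forward(self, x, hidden):
    """Forward pass through the network; x is a batch of one-hot encoded characters."""
    # Run the batch through the (stacked) LSTM layers
    r_output, hidden = self.lstm(x, hidden)

    # Apply dropout to the LSTM outputs
    out = self.dropout(r_output)

    # Stack up the LSTM outputs with .view so that each time step
    # becomes its own row before the fully-connected layer
    out = out.contiguous().view(-1, self.n_hidden)

    # Score every character in the vocabulary for each time step
    out = self.fc(out)

    return out, hidden
```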
We also need to create an initial hidden state of all zeros; this is done with `self.init_hidden()`.
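A minimal sketch of what such an `init_hidden` method could look like (one common pattern; the project's exact implementation may differ):

```python
def init_hidden(self, batch_size):
    """Create two zero tensors for the LSTM hidden state and cell state."""
    # Use an existing parameter so the new tensors share its dtype and device
    weight = next(self.parameters()).data
    hidden = (weight.new_zeros(self.n_layers, batch_size, self.n_hidden),
              weight.new_zeros(self.n_layers, batch_size, self.n_hidden))
    return hidden
```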
Now that the model is trained, we’ll want to sample from it and make predictions about next characters! To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character. Just keep doing this and you’ll generate a bunch of text!
The `predict` function

The output of our RNN comes from a fully-connected layer, and it outputs a distribution of next-character scores.
To actually get the next character, we apply a softmax function, which gives us a probability distribution that we can then sample to predict the next character.
Our predictions come from a categorical probability distribution over all the possible characters. We can make the sampled text more reasonable to handle (with fewer variables) by only considering the $K$ most probable characters. This will prevent the network from giving us completely absurd characters while allowing it to introduce some noise and randomness into the sampled text. Read more about `topk` here.
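Here is a rough sketch of how such a prediction step could be written, assuming a trained `net` with the `forward` and `init_hidden` methods sketched above plus `char2int`/`int2char` lookup dictionaries (these attribute names are illustrative assumptions, not necessarily the project's):

```python
import numpy as np
import torch
import torch.nn.functional as F

def predict(net, char, h=None, top_k=None):
    """Given a character and a hidden state, return the predicted next character."""
    if h is None:
        h = net.init_hidden(1)

    # One-hot encode the single input character: shape (batch=1, seq_len=1, n_chars)
    idx = net.char2int[char]
    one_hot = np.zeros((1, 1, len(net.char2int)), dtype=np.float32)
    one_hot[0, 0, idx] = 1.0
    inputs = torch.from_numpy(one_hot)

    # Detach the hidden state so we don't backprop through the entire history
    h = tuple(each.data for each in h)
    out, h = net(inputs, h)

    # Apply softmax to turn the fully-connected scores into probabilities
    p = F.softmax(out, dim=1).data.squeeze()

    # Optionally keep only the top_k most probable characters
    if top_k is None:
        top_ch = np.arange(len(net.char2int))
    else:
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.numpy()

    # Sample the next character from the (possibly truncated) distribution
    p = p.numpy()
    next_idx = np.random.choice(top_ch, p=p / p.sum())

    return net.int2char[next_idx], h
```

Calling this repeatedly, feeding each predicted character and hidden state back in, generates the sampled text.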