Generating Sequences With Recurrent Neural Networks
A recurrent neural network is trained on the first 96 MB of enwik8 and tested on the last 4 MB, achieving compression of 1.42 bpc on the training set and 1.33 bpc on the test set. This is quite a good result. I compared several compressors under the same conditions by compressing the first 96 MB, then the complete file, and subtracting the two compressed sizes to get the size (and thus bpc) of the test data.
For durilca'kingsize I included the size of the compressed dictionary EnWiki.dur and UnDur.exe from durilca4_decoder, as tested for the #1 spot on LTCB. Although it uses 13 GB memory for enwik9, it needs only 1.1 GB for enwik8. I did not include the decompressor size for the other programs since they don't use external dictionaries.
  train  test   compressor (bits per character)
  1.668  1.537  bsc -b100
  1.626  1.528  zpaq -m 57
  1.528  1.427  ppmonstr -o16 -m1700
  1.531  1.426  zpaq -m 67
  1.494  1.384  nanozip -cc -m1.6g
  1.42   1.33   RNN (from paper)
  1.334  1.202  durilca'kingsize_4 -o32 -m3500 -t2
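The arithmetic behind the table is just a subtraction of compressed sizes. A small Python sketch (the file names are placeholders, not the compressors' actual output names):

  import os

  TRAIN_CHARS = 96_000_000    # first 96 MB of enwik8
  TEST_CHARS  =  4_000_000    # last 4 MB of enwik8

  # hypothetical names for the two compressor runs
  train_size = os.path.getsize("enwik8.first96m.cmp")  # compressed 96 MB prefix
  full_size  = os.path.getsize("enwik8.cmp")           # compressed full 100 MB file

  train_bpc = 8.0 * train_size / TRAIN_CHARS
  test_bpc  = 8.0 * (full_size - train_size) / TEST_CHARS
  print(f"train {train_bpc:.3f} bpc, test {test_bpc:.3f} bpc")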
He uses a neural network with 204 input neurons (one for each character that appears at least once), 7 hidden layers with 100 long short-term memory (LSTM) cells each, and one output layer with 204 neurons. The input neuron representing the current character is set to 1 and all others to 0. The outputs, after normalization, give the probability distribution over the next character.
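Something like the following PyTorch sketch captures the shape of the model as I read the paper. The layer sizes come from the description; the class and variable names are mine, the skip connections it builds are the ones described in the next paragraph, and this is not the author's code.

  import torch
  import torch.nn as nn

  N_CHARS, N_CELLS, N_LAYERS = 204, 100, 7

  class CharLSTM(nn.Module):
      def __init__(self):
          super().__init__()
          # Hidden layer i sees the one-hot input plus the outputs of all
          # previous hidden layers (skip connections from the input).
          self.layers = nn.ModuleList(
              nn.LSTM(N_CHARS + i * N_CELLS, N_CELLS, batch_first=True)
              for i in range(N_LAYERS))
          # The output layer sees all 7 hidden layers.
          self.out = nn.Linear(N_LAYERS * N_CELLS, N_CHARS)

      def forward(self, x, states=None):           # x: (batch, time, 204) one-hot
          states = states if states is not None else [None] * N_LAYERS
          hiddens, new_states = [], []
          for i, layer in enumerate(self.layers):
              h, s = layer(torch.cat([x] + hiddens, dim=-1), states[i])
              hiddens.append(h)
              new_states.append(s)
          logits = self.out(torch.cat(hiddens, dim=-1))
          return logits, new_states                # softmax of logits = next-char distribution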
LSTM cells have a gated feedback loop to retain state information, as well as gated inputs and outputs, with each gate controlled by a separate weight matrix from the input layer and all previous hidden layers. The output layer also receives input from all 7 hidden layers. These extra connections speed up backpropagation. Weights are trained by gradient descent to minimize coding cost (like paq8 and zpaq) rather than RMSE, over 4 epochs with a learning rate of 0.0001 and momentum of 0.9. Weights are clamped to [-1, 1], weights are updated every 100 characters, and the LSTM cell states are reset every 10K characters. Training continues on the test set in a single pass; without this adaptive training, the static model only compresses the test data to 1.67 bpc.
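A similarly hedged sketch of that training schedule, taking cross-entropy as the coding cost and plain SGD with momentum for the gradient descent. It reuses CharLSTM and N_CHARS from the sketch above; adaptive_pass, train_data and test_data are my own names, with the two splits assumed to be LongTensors of character indices.

  import math
  import torch.nn.functional as F

  def adaptive_pass(model, data, opt, update_len=100, reset_len=10_000):
      """One pass over data (1-D LongTensor of char indices); returns bits per character."""
      states, total_bits = None, 0.0
      for start in range(0, len(data) - 1, update_len):
          if start % reset_len == 0:
              states = None                        # reset LSTM cell states every 10K characters
          x = F.one_hot(data[start:start + update_len], N_CHARS).float().unsqueeze(0)
          y = data[start + 1:start + update_len + 1]
          logits, states = model(x, states)
          loss = F.cross_entropy(logits[0, :len(y)], y)   # coding cost, nats per character
          opt.zero_grad()
          loss.backward()
          opt.step()                               # update weights every 100 characters
          with torch.no_grad():
              for p in model.parameters():
                  p.clamp_(-1.0, 1.0)              # clamp weights to [-1, 1]
          states = [(h.detach(), c.detach()) for h, c in states]  # cut gradients at the window
          total_bits += loss.item() * len(y) / math.log(2)
      return total_bits / (len(data) - 1)

  # train_data, test_data: character indices for the 96 MB and 4 MB splits (assumed)
  model = CharLSTM()
  opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
  for epoch in range(4):                           # 4 epochs on the training split
      adaptive_pass(model, train_data, opt)
  test_bpc = adaptive_pass(model, test_data, opt)  # single adaptive pass over the test split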