Compression is a tricky thing: complicated methods do not always produce the best results. The same goes for math. Simple is better; worse is better.
Background: I am building a compressor for a database engine that must support random access into the compressed data.
What is a self-describing dictionary? It is simply a portion of the original data placed at an agreed-upon position, which later data can reference. Sounds too easy? But it is surprisingly powerful.
I divide the data into 16 MB big chunks. The first big chunk is left uncompressed, and a suffix array is built over it. Any position in that chunk can be referenced with 3 bytes (24 bits).
Every big chunk except the very first is further divided into 64 KB small chunks. Again, the first small chunk is left uncompressed and a suffix array is built over it, so 2 bytes (16 bits) are enough to point anywhere in the small chunk.
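The chunk layout above can be sketched in a few lines of Python. The constants match the post; the naive suffix-array builder is only illustrative and would be far too slow for a real 16 MB chunk:

```python
BIG = 16 * 1024 * 1024   # 16 MB big chunks -> offsets fit in 24 bits
SMALL = 64 * 1024        # 64 KB small chunks -> offsets fit in 16 bits

def big_chunks(data: bytes) -> list[bytes]:
    """Split the whole input into 16 MB big chunks."""
    return [data[i:i + BIG] for i in range(0, len(data), BIG)]

def small_chunks(big: bytes) -> list[bytes]:
    """Split one big chunk into 64 KB small chunks."""
    return [big[i:i + SMALL] for i in range(0, len(big), SMALL)]

def naive_suffix_array(text: bytes) -> list[int]:
    """Sort all suffix start positions lexicographically.
    O(n^2 log n) -- fine for a sketch, not for a 16 MB chunk."""
    return sorted(range(len(text)), key=lambda i: text[i:])

# Every offset inside a big chunk fits in 3 bytes, inside a small chunk in 2:
assert (BIG - 1) < 2**24 and (SMALL - 1) < 2**16
```

A production version would use a linear-time suffix array construction; the point here is only the 24-bit/16-bit addressing that the chunk sizes buy you.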
The first big chunk and the first small chunk of each big chunk form the so-called "self-describing dictionary": they are data and, at the same time, dictionary. The rest of the data is compressed using one of 4 options:
1. a reference into the big-chunk dictionary
2. a reference into the small-chunk dictionary
3. a reference into itself (a normal LZ77 sliding window)
4. original (literal) data
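A greedy encoder over those four options might look like the sketch below. The token names, tuple layout, and the `MIN_MATCH` threshold are my own assumptions; the real compressor would query the suffix arrays instead of doing a brute-force search:

```python
MIN_MATCH = 3  # assumed threshold: shorter matches cost more than literals

def find_match(dic: bytes, s: bytes) -> tuple[int, int]:
    """Longest prefix of s found anywhere in dic, as (offset, length).
    Brute force for clarity; the real scheme queries a suffix array."""
    for length in range(min(len(s), len(dic)), MIN_MATCH - 1, -1):
        off = dic.find(s[:length])
        if off >= 0:
            return (off, length)
    return (0, 0)

def encode(data: bytes, big_dic: bytes, small_dic: bytes) -> list:
    """Greedy tokenizer: at each position, take the longest match among
    the big-chunk dictionary, the small-chunk dictionary, and the
    already-emitted data (LZ77 window); otherwise emit a literal byte."""
    out, i = [], 0
    while i < len(data):
        candidates = [
            ("big",   *find_match(big_dic, data[i:])),    # 3-byte offset
            ("small", *find_match(small_dic, data[i:])),  # 2-byte offset
            ("self",  *find_match(data[:i], data[i:])),   # sliding window
        ]
        kind, off, length = max(candidates, key=lambda c: c[2])
        if length >= MIN_MATCH:
            out.append((kind, off, length))
            i += length
        else:
            out.append(("lit", data[i]))  # option 4: original byte
            i += 1
    return out
```

Note that decoding a token never needs anything outside the current chunk plus the shared uncompressed chunks, which is what makes random access cheap.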
In plain English: this simple, naive method is competitive with gzip in compression ratio (9 GB down to 3 GB) and is much, much faster than double-scan/offline methods such as Re-Pair or order-0/order-1 entropy coders. That is good enough for me!
The method can also benefit from offline processing: we can carefully select which records go into the uncompressed big and small chunks.
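As one illustration of what that offline selection could look like, here is a greedy n-gram-frequency heuristic. This is my own sketch, not the post's algorithm: it scores each record by how often its 8-byte grams occur across the corpus and fills the uncompressed chunk with the top scorers:

```python
from collections import Counter

def select_dictionary_records(records: list[bytes], budget: int,
                              gram: int = 8) -> list[bytes]:
    """Greedy offline selection (illustrative heuristic only):
    prefer records whose substrings are common corpus-wide, so that
    references into the uncompressed chunk pay off most often."""
    freq = Counter()
    for r in records:
        for i in range(len(r) - gram + 1):
            freq[r[i:i + gram]] += 1

    def score(r: bytes) -> int:
        return sum(freq[r[i:i + gram]] for i in range(len(r) - gram + 1))

    chosen, used = [], 0
    for r in sorted(records, key=score, reverse=True):
        if used + len(r) <= budget:
            chosen.append(r)
            used += len(r)
    return chosen
```

A real selection would also need to avoid picking near-duplicate records, since a second copy of already-covered content wastes dictionary space.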
Finally, finding the optimal dictionary has been proved to be NP-hard! You will never get the best one, but a good-enough one is within reach.