
Thread: Self-Describe Dictionary Compression

  1. #1
    Member
    Join Date
    Apr 2018
    Location
    GUANGZHOU
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Self-Describe Dictionary Compression

    Compression is a tricky thing. Complicated methods do not always give the best results, and the same goes for the math behind them. Simple is better / worse is better.

    Background: I am building a compressor for a database engine that must support random access on compressed data.

    What is a self-describe dictionary? It is simply a portion of the original data placed at an agreed, fixed position, which later data can reference. It sounds too easy, but it is surprisingly strong.

    I divide the data into 16 MB big chunks. The first big chunk is not compressed, and a suffix array is built over it, so any position in that chunk can be referenced with 3 bytes (24 bits).

    Every big chunk except the very first is further divided into 64 KB small chunks. Again, the first small chunk is left uncompressed and a suffix array is built over it, so 2 bytes (16 bits) are enough to point anywhere in that small chunk.
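    To make the layout concrete, here is a minimal sketch in Python. The chunk sizes come from the description above; the function and variable names are my own and not from the actual engine, and the naive suffix-array construction is only for illustration.

        # Sketch of the chunk layout: first big chunk raw + suffix array,
        # first 64 KB of every later big chunk raw + suffix array.
        BIG_CHUNK = 16 * 1024 * 1024   # any offset fits in 3 bytes (24 bits)
        SMALL_CHUNK = 64 * 1024        # any offset fits in 2 bytes (16 bits)

        def build_suffix_array(data: bytes) -> list:
            # Naive O(n^2 log n) construction, enough to show the idea;
            # a real build would use SA-IS or another linear-time algorithm.
            return sorted(range(len(data)), key=lambda i: data[i:])

        def split_chunks(data: bytes):
            big_chunks = [data[i:i + BIG_CHUNK] for i in range(0, len(data), BIG_CHUNK)]
            # The very first big chunk is stored raw and indexed: the global dictionary.
            global_dict = big_chunks[0]
            global_sa = build_suffix_array(global_dict)
            # Each later big chunk keeps its first 64 KB raw as a local dictionary.
            local_dicts = [(c[:SMALL_CHUNK], build_suffix_array(c[:SMALL_CHUNK]))
                           for c in big_chunks[1:]]
            return global_dict, global_sa, local_dicts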

    The first big chunk and the first small chunk of every other big chunk form the so-called "self-describe dictionary": they are data and, at the same time, they are the dictionary. The rest of the data is compressed using one of 4 options (a decoding sketch follows the list):
    1. a reference into the big chunk
    2. a reference into the small chunk
    3. a reference into itself (a normal LZ77 sliding window)
    4. original data
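    A minimal Python decoder for these four options might look like the following. The tag values and token layout are assumptions made for the sketch; the post does not give the real wire format.

        # Illustrative decoder for the four reference types above; the byte
        # layout (tags, payload shapes) is assumed, not the real format.
        TAG_BIG, TAG_SMALL, TAG_SELF, TAG_LITERAL = 0, 1, 2, 3

        def decode_block(tokens, global_dict, local_dict):
            out = bytearray()
            for tag, payload in tokens:
                if tag == TAG_BIG:        # 24-bit offset + length into the 16 MB dictionary
                    off, length = payload
                    out += global_dict[off:off + length]
                elif tag == TAG_SMALL:    # 16-bit offset + length into the 64 KB local dictionary
                    off, length = payload
                    out += local_dict[off:off + length]
                elif tag == TAG_SELF:     # classic LZ77 back-reference into decoded output
                    dist, length = payload
                    for _ in range(length):      # byte-wise copy handles overlapping matches
                        out.append(out[-dist])
                else:                     # TAG_LITERAL: raw bytes copied through
                    out += payload
            return bytes(out)

    Note that options 1 and 2 always point into raw, uncompressed data, so decoding never has to chase a reference that itself points to another reference.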

    In plain English: this simple and naive method is competitive with gzip in terms of compression ratio (9 GB down to 3 GB), and it is much, much faster than double-scan/offline methods such as Re-Pair or an order-0/order-1 entropy coder. It is good enough for me!

    The method can also benefit from offline processing: we can carefully select which records go into the uncompressed big and small chunks.

    Finally, finding the optimal dictionary is proven to be NP-hard! You will never get the best one, but a suitable one is achievable.
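    Since the optimal choice is out of reach, a greedy heuristic is the natural fallback. The sketch below scores candidate records by how much new 8-byte material they cover; this particular heuristic is my own illustration, not the selection rule actually used.

        # Greedy selection of records for the uncompressed dictionary chunk.
        # Scoring by previously-unseen 8-byte substrings is an illustrative choice.
        def select_dictionary(records, budget, k=8):
            def grams(rec):
                return {rec[i:i + k] for i in range(len(rec) - k + 1)}

            covered, chosen, used = set(), [], 0
            candidates = list(records)
            while candidates and used < budget:
                # Pick the record that covers the most not-yet-covered material.
                best = max(candidates, key=lambda r: len(grams(r) - covered))
                if used + len(best) > budget:
                    break
                chosen.append(best)
                covered |= grams(best)
                used += len(best)
                candidates.remove(best)
            return chosen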
    Last edited by JIMLYN; 12th April 2018 at 20:04.

  2. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    I don't see why you cannot compress the first big chunk using your method 3/4 and the first small chunk using 1/3/4.

    As far as the scheme goes, this seems equivalent to the strategy mentioned in another thread here of using non-contiguous history for LZ encoding.

  3. #3
    Member
    Join Date
    Apr 2018
    Location
    GUANGZHOU
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    I am new here, so sorry if I added some redundancy for others. I was just surprised by how much a simple strategy can improve the compression ratio. As I said, it is competitive with gzip, and I hope it helps others. An acceptable solution can be the simplest one, rather than spending 100+ hours reading and 1000+ hours implementing cutting-edge techniques.

    The reason the first big chunk and the first small chunk of each big chunk cannot themselves be compressed is that a database engine is very latency-sensitive. When I hit a reference I cannot afford too many jumps; a reference to another reference is quite bad.
    Firstly, I would have to dereference several times.
    Secondly, it is very cache-unfriendly. Modern file systems can do better when the access pattern is known (less randomness/jumping).
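    As a sketch of what this buys on the read path: because the dictionary chunks are raw, they can be read or cached once, and every point read then touches exactly one compressed block. The file layout and helper names here are assumptions, not the engine's real on-disk format.

        # Dictionaries are raw, so they can be cached once; a point read is
        # then a single block read plus in-memory dictionary lookups, with no
        # chains of references to chase. Layout here is assumed.
        class Reader:
            def __init__(self, f):
                self.f = f
                self.global_dict = self._read_at(0, 16 * 1024 * 1024)  # first big chunk, raw
                self.local_dicts = {}                                  # big-chunk index -> raw 64 KB

            def _read_at(self, off, size):
                self.f.seek(off)
                return self.f.read(size)

            def local_dict(self, big_idx):
                # big_idx >= 1; big chunk 0 is the global dictionary itself.
                # The first 64 KB of big chunk `big_idx` is stored raw; cache on first use.
                if big_idx not in self.local_dicts:
                    self.local_dicts[big_idx] = self._read_at(big_idx * 16 * 1024 * 1024, 64 * 1024)
                return self.local_dicts[big_idx]

            def read_block(self, big_idx, block_off, block_len):
                compressed = self._read_at(block_off, block_len)
                # Decode as in the earlier decode_block() sketch: every reference
                # resolves into global_dict, the local dict, or decoded output.
                return compressed, self.global_dict, self.local_dict(big_idx)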

    Take RocksDB, the most popular open-source database engine. RocksDB divides data into roughly 4 KB blocks (configurable), and each block is compressed separately. Do you see the problem? It lacks "globality": there is redundancy among blocks.
    I think this is mainly because LevelDB (RocksDB's ancestor) did it that way.
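    One way to put some globality back while keeping blocks independently decompressable is a shared dictionary. The sketch below uses the Python zstandard bindings purely as an illustration of that idea; it is not RocksDB code, and any real engine would handle this differently.

        # Illustration of a shared dictionary across independently-compressed
        # blocks, using the `zstandard` Python bindings (assumed installed).
        import zstandard as zstd

        def compress_blocks(blocks, dict_size=112 * 1024):
            # Train one dictionary on the sample blocks (training needs a
            # reasonable number of samples), then compress each block against
            # it: blocks stay individually seekable, but material they share
            # is stored once, in the dictionary.
            d = zstd.train_dictionary(dict_size, blocks)
            cctx = zstd.ZstdCompressor(dict_data=d)
            return d, [cctx.compress(b) for b in blocks]

        def decompress_block(d, compressed_block):
            dctx = zstd.ZstdDecompressor(dict_data=d)
            return dctx.decompress(compressed_block)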

    I know there are many compression experts here, but it is a different story for a database engine. Facebook currently has two nice teams, zstd and RocksDB; I think they may deal with this issue in the not-too-distant future.
    Last edited by JIMLYN; 13th April 2018 at 15:20.
