
Thread: clz - Context Modelling for lz77

  1. #1
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts

    clz - Context Modelling for lz77

    Quoting Mauro Vezzosi from Community Archiver thread:

    Quote Originally Posted by Mauro Vezzosi View Post
    I don't know if this is the right place to put my posts about clz_x64 or if you already know these:
    Code:
    clz_x64 c ENWIK8 ENWIK8.clz-c
     Compressed 100000000 bytes into 29390608, ratio 29.39%
     Completed in 109.04 seconds
    
    clz_x64 d ENWIK8.clz-c ENWIK8.clz-d
    position 32785629, match 13, distance 396842944, block size 32891136
    
     Error: Invalid or corrupt file
    
    clz_x64 c FP.LOG FP.LOG.clz-c
     Compressed 20617071 bytes into 1104360, ratio 5.36%
     Completed in 30.06 seconds
    
    clz_x64 d FP.LOG.clz-c FP.LOG.clz-d
    position 19505673, match 3, distance 688742203, block size 20617071
    
     Error: Invalid or corrupt file
    Code:
    clz_x64 c FileSize0 FileSize0.clz-c
     Compressed 0 bytes into 8, ratio 1.#J%
     Completed in 0.00 seconds
    
    clz_x64 d FileSize0.clz-c FileSize0.clz-d
     Decompressed 8 bytes into 0, ratio 1.#J%
     Completed in 0.00 seconds

    This is not my tool; it is developed by Lucas Marsh.

  2. #2
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Looks interesting. Is there any more info available?

  3. #3
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Quote Originally Posted by comp1 View Post
    Looks interesting. Is there any more info available?
    Yes. It uses lz77 to add words to an age-sorted, word-wise dictionary. When a word is found in the dictionary it encodes a single flag and a reduced offset (a word node); the reduced offset has an implicit length equal to the highest available order. When a longer word is observed, the word node is swapped for the current match position. The dictionary only holds 128K words and is implemented as a single 8N array, so it uses about 1MB of memory for the whole dictionary. It's still in the R&D phase and there's no entropy coding, so there's definitely room for improvement. Right now I'm rewriting the match finder to run on a parallel binary tree similar to Radyx. After that I'll experiment with using a single word node to encode varying orders, which should make it stronger.
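    Roughly, the token output looks like this (just a sketch of the idea with a toy bit writer, not the actual CLZ bitstream code; the 17-bit node index simply follows from the 128K-word dictionary):
    Code:
    #include <stdint.h>
    #include <stdio.h>
    
    #define DICT_BITS 17                         /* 2^17 = 128K word slots */
    
    /* Toy bit writer, stand-in for the real bit-level output */
    static void put_bits(uint32_t value, int count)
    {
        printf("%u:%d ", (unsigned)value, count);
    }
    
    /* Emit one token: either a dictionary hit (flag = 1 plus a reduced offset,
       with the match length left implicit) or a literal (flag = 0 plus the byte). */
    void encode_token(int found_in_dict, uint32_t node_index, uint8_t literal)
    {
        if (found_in_dict) {
            put_bits(1, 1);                      /* flag: word found in the dictionary */
            put_bits(node_index, DICT_BITS);     /* reduced offset into the word table */
        } else {
            put_bits(0, 1);                      /* flag: plain literal */
            put_bits(literal, 8);                /* the new word then enters the dictionary */
        }
    }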

    Edit: found the bug, will fix it.

  4. The Following 2 Users Say Thank You to Lucas For This Useful Post:

    Stephan Busch (6th March 2018),willvarfar (7th March 2018)

  5. #4
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    68
    Thanks
    31
    Thanked 22 Times in 15 Posts
    So it is something like GLZA? If so, good parsing is difficult.

  6. #5
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    GLZA is lz78-based and creates codes for global context; CLZ just uses lz77 with an alternative encoding method for context, so it remains a single-pass compressor. It's not going to be as strong as ACB, but it will be a decent middle ground between ACB and lz77.

  7. The Following 3 Users Say Thank You to Lucas For This Useful Post:

    jibz (7th March 2018),Stephan Busch (6th March 2018),willvarfar (7th March 2018)

  8. #6
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Sounds great Lucas. Please post here with updates. I'm quite interested in testing this and seeing further development.

  9. #7
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    me too..

  10. #8
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Okay, so I've got the first version of a parallel binary tree in place. Without a radix sort it uses 8N memory, but it's really mean to cache lines (it tends to invalidate a lot of them when more than one thread accesses the same line), so with 16 threads it only runs 2.4x faster than the single-threaded version (73 sec vs 170 sec on enwik8). I'll likely end up using an initial radix pass to spread the threads' cache lines apart, though that may use 12N memory.
    As for implementing a parallel binary tree without a radix sort, here's some pseudo code:
    Code:
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    void parallel_bst(uint8_t *data, uint32_t length)
    {
        uint32_t bst_entry[256] = {0};
        uint32_t *inverse_buffer = (uint32_t*)calloc(length, sizeof(uint32_t));
        if(inverse_buffer == NULL)
        {
            perror("Failed to allocate inverse bst buffer");
            return;
        }
    
        // Count symbol occurrences, offset by one so the prefix sum yields bucket starts
        uint32_t buffer_buckets[257] = {0};
        for(uint32_t i = 0; i < length; i++)
            buffer_buckets[data[i] + 1]++;
    
        for(uint32_t i = 1; i < 257; i++)
            buffer_buckets[i] += buffer_buckets[i - 1];
    
        // Build an inverse mapping for all entry points into the bst array.
        // Note that this scatter advances each bucket start to its bucket end.
        for(uint32_t i = 0; i < length; i++)
            inverse_buffer[buffer_buckets[data[i]]++] = i;
    
        // Restore the bucket boundaries (shift right) so buffer_buckets[i] is
        // once again the start of bucket i and buffer_buckets[i + 1] its end
        for(uint32_t i = 256; i > 0; i--)
            buffer_buckets[i] = buffer_buckets[i - 1];
        buffer_buckets[0] = 0;
    
        // Now build the tree concurrently, one bucket (leading symbol) per virtual thread
        #pragma omp parallel for
        for(int i = 0; i < 256; i++)  // signed index keeps older OpenMP versions happy
        {
            for(uint32_t k = buffer_buckets[i]; k < buffer_buckets[i + 1]; k++)
            {
                // Use the inverse buffer to grab the next node without signaling
                uint32_t current_match_pos = inverse_buffer[k];
                uint32_t previous_match_pos = bst_entry[i];  // Entry point into the bst; it depends solely on the symbol, which is where the parallelism comes in.
                bst_entry[i] = current_match_pos;            // Chain the symbol to the current position.
                // while(previous_match_pos > 0) ...         // Exercise: write the rest of the match finder (the tree descent) yourself
            }
        }
        free(inverse_buffer);
    }
    This system is actually derived from how hash chains are linked together, so theoretically a hash chain can be made to run in parallel, as can MMC, and even a BST. The only thing preventing linear performance scaling is the CPU architecture. This is just something I think you folks might find interesting, since I haven't seen it described anywhere else before.
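    For comparison, here's roughly how the same trick would look applied to a plain hash chain (again just a sketch, not CLZ code): positions are bucketed by their hash, each bucket forms an independent chain, and all buckets can then be linked concurrently without any locking or signaling.
    Code:
    #include <stdint.h>
    #include <stdlib.h>
    
    #define HASH_BITS  16
    #define HASH_SIZE  (1u << HASH_BITS)
    
    static uint32_t hash2(const uint8_t *p)
    {
        return (((uint32_t)p[0] << 8) | p[1]) & (HASH_SIZE - 1);
    }
    
    /* head: HASH_SIZE entries, prev: length entries, both caller-allocated */
    void parallel_hash_chain(const uint8_t *data, uint32_t length,
                             uint32_t *head, uint32_t *prev)
    {
        if(length < 2) return;
        uint32_t positions = length - 1;             /* last byte has no 2-byte hash */
    
        uint32_t *buckets = (uint32_t*)calloc(HASH_SIZE + 1, sizeof(uint32_t));
        uint32_t *order   = (uint32_t*)malloc(positions * sizeof(uint32_t));
        if(buckets == NULL || order == NULL) { free(buckets); free(order); return; }
    
        for(uint32_t i = 0; i < positions; i++)      /* count per-hash occurrences */
            buckets[hash2(data + i) + 1]++;
        for(uint32_t h = 1; h <= HASH_SIZE; h++)     /* prefix sum -> bucket starts */
            buckets[h] += buckets[h - 1];
        for(uint32_t i = 0; i < positions; i++)      /* scatter positions per hash */
            order[buckets[hash2(data + i)]++] = i;
        for(uint32_t h = HASH_SIZE; h > 0; h--)      /* restore bucket starts */
            buckets[h] = buckets[h - 1];
        buckets[0] = 0;
    
        #pragma omp parallel for
        for(int h = 0; h < (int)HASH_SIZE; h++)
        {
            uint32_t chain_head = 0;                 /* 0 = end-of-chain sentinel */
            for(uint32_t k = buckets[h]; k < buckets[h + 1]; k++)
            {
                uint32_t pos = order[k];
                prev[pos] = chain_head;              /* link to previous same-hash position */
                chain_head = pos;
            }
            head[h] = chain_head;                    /* newest position for this hash */
        }
        free(buckets);
        free(order);
    }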
    Attached Files
    Last edited by Lucas; 17th March 2018 at 18:36. Reason: Added test executable using new concept match finder

  11. The Following 2 Users Say Thank You to Lucas For This Useful Post:

    elit (15th February 2019),Stephan Busch (17th March 2018)

  12. #9
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Due to a small error on my part the previous versions were capped at 1 thread per symbol per real thread. This new version removes that limit and uses 256 virtual threads at once, and boy does that help a lot. Speed on enwik8 went from 74 seconds to 36 seconds. Memory usage is 8N for the binary tree, and 1N for the input.
    It's still mean on cache lines, but it's better than I anticipated. I don't think I'll bother with the radix variant I was working on, since this is already new territory for binary trees and I'm happy with it. Since the radix stage is avoided, the pbst match finder supports threaded sliding windows, and that in turn means GPU parallelism for lz77 is possible; expect to see it in future versions of CLZ.
    Here's the new exe: CLZ.exe

    Have fun with it

  13. The Following 2 Users Say Thank You to Lucas For This Useful Post:

    hunman (19th March 2018),Stephan Busch (20th March 2018)

  14. #10
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    clz_x64 (Community Archiver/Lobby 2018/05/03 22:17) crashes on ENWIK0 (1st byte of ENWIK9), ENWIK1 (first 10 bytes of ENWIK9) and ENWIK2 (first 100 bytes of ENWIK9):
    clz_x64 c ENWIK0 ENWIK0.clz_x64
    clz_x64 c ENWIK1 ENWIK1.clz_x64
    clz_x64 c ENWIK2 ENWIK2.clz_x64
    Compression + decompression are ok on ENWIK3..ENWIK8, Maximum Compression, Silesia and some other files.

  15. #11
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    clz_x64 (Community Archiver/Lobby 2018/05/03 22:17) crashes on ENWIK0 (1st byte of ENWIK9), ENWIK1 (first 10 bytes of ENWIK9) and ENWIK2 (first 100 bytes of ENWIK9):
    clz_x64 c ENWIK0 ENWIK0.clz_x64
    clz_x64 c ENWIK1 ENWIK1.clz_x64
    clz_x64 c ENWIK2 ENWIK2.clz_x64
    Compression + decompression are ok on ENWIK3..ENWIK8, Maximum Compression, Silesia and some other files.
    Thanks for spotting the bug with tiny files; that release was a complete rewrite built for higher speed, so I guess I missed some cases.

    It's fixed now and an updated version available: https://drive.google.com/drive/folde...mdZllcQFcLuace

    All bug fixes and new versions will be posted to this folder.

  16. #12
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    clz_x64 still crashes on files of size 1.
    -----
    clz_x64 c ENWIK9 ENWIK9.clz
    clz error code 2 : failed to allocate match finder!
    At the moment I have ~1800 MB of free memory; does clz need more?

  17. #13
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Fixed it; now it works on zero-byte files. Yeah, currently it's very memory intensive: it does memory-to-memory compression of the whole input up to 2GB, whereas the old one used streaming and supported a lot more memory control options. I think the old link is still up if you want to try that; the main difference is the decode speed of this release vs the old one. Encoding uses about 10N memory currently, decoding 1N to 2N. I'm currently implementing the streaming version, so memory-limiting controls will be added back soon and you'll be able to benchmark enwik9 on your system.

  18. #14
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Could you post a win executable with this fix?

  19. #15
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Quote Originally Posted by Darek View Post
    Could you post win executable with this fix?
    Quote Originally Posted by Lucas View Post
    It's fixed now and an updated version available: https://drive.google.com/drive/folde...mdZllcQFcLuace

    All bug fixes and new versions will be posted to this folder.
    I uploaded it to the same link as in the Gitter post. It should be fixed; if you manage to find any more bugs, let me know :)

  20. The Following User Says Thank You to Lucas For This Useful Post:

    Darek (4th May 2018)

  21. #16
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Scores for enwik7, 8 and 9:
    [Attached thumbnail: clz_020a.jpg with the enwik7/8/9 scores]

  22. The Following User Says Thank You to Darek For This Useful Post:

    Lucas (5th May 2018)

  23. #17
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Wait a minute, did it actually take less than 1 second to decode enwik9? Correct me if I'm wrong, but I'm assuming the comma separates the encode and decode timings, right?

  24. #18
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Quote Originally Posted by Lucas View Post
    Wait a minute did it actually take less than 1 second to decode enwik9? Correct me if I'm wrong but I'm assuming the comma separates the encode and decode timings, right?
    No. The comma separates seconds from milliseconds.

  25. #19
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    That makes more sense, thanks for the clarification.

  26. #20
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Code:
    file     original size        compressed size      compress time      decompress time
    enwik8   100,000,000 bytes,  27,993,192 bytes,     31.172 sec.,  0.345 sec., CLZ v0.20a
    enwik9 1,000,000,000 bytes, 225,248,568 bytes,    441.857 sec.,  3.670 sec., CLZ v0.20a
    nowiki   618,960,704 bytes,  75,631,068 bytes, 12,901.655 sec.,  1.098 sec., CLZ v0.20a
    iis      623,108,642 bytes,  27,004,788 bytes,    154.370 sec.,  0.664 sec., CLZ v0.20a
    gaia     481,136,694 bytes,  96,569,304 bytes,    145.225 sec.,  1.428 sec., CLZ v0.20a
    nowiki takes a strangely long time to compress.
    Last edited by Sportman; 5th May 2018 at 20:01. Reason: Added nowiki

  27. #21
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Could you please provide a link to the nowiki dataset? My guess is there are a ton of high-lcp matches just below the current greedy threshold. The original version had a heuristic threshold, so I might have to bring that back.
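    The kind of threshold I mean looks roughly like this (just a sketch of the general idea with made-up constants, not what's actually in CLZ): cap the number of candidates examined and accept greedily once a match is "good enough", so files full of long near-matches don't explode the search time.
    Code:
    #include <stdint.h>
    
    #define NICE_LEN   128    /* made-up: accept greedily once a match reaches this */
    #define MAX_TRIES  64     /* made-up: cap on candidates examined per position   */
    
    static uint32_t match_len(const uint8_t *a, const uint8_t *b, uint32_t limit)
    {
        uint32_t len = 0;
        while(len < limit && a[len] == b[len]) len++;
        return len;
    }
    
    /* candidates[] holds earlier positions to try (e.g. pulled off a chain or tree walk) */
    uint32_t best_match(const uint8_t *data, uint32_t pos, uint32_t end,
                        const uint32_t *candidates, uint32_t num_candidates,
                        uint32_t *best_dist)
    {
        uint32_t best_len = 0;
        uint32_t tries = num_candidates < MAX_TRIES ? num_candidates : MAX_TRIES;
    
        for(uint32_t i = 0; i < tries; i++)
        {
            uint32_t len = match_len(data + pos, data + candidates[i], end - pos);
            if(len > best_len)
            {
                best_len   = len;
                *best_dist = pos - candidates[i];
                if(best_len >= NICE_LEN)
                    break;    /* heuristic threshold: good enough, stop searching */
            }
        }
        return best_len;
    }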

  28. #22
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Lucas View Post
    Could you please provide a link to the nowiki dataset?
    https://encode.ru/threads/2829-RAZOR...ll=1#post56605

  29. #23
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    New release is out; I fixed the cause of the insane slowdown on the nowiki test set. Memory usage got reduced in this version. Encode memory is now 9N and decode memory is 1N. Thread controls and memory controls are added back so you should be able to run any file on any system with a reasonable block size.
    Last edited by Lucas; 8th May 2018 at 03:32.

  30. The Following User Says Thank You to Lucas For This Useful Post:

    Stephan Busch (8th May 2018)

  31. #24
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    I'm running some tests on clz. I wonder if memory use scales linearly with the number of CPU cores, or whether 9N is a fixed global ratio. Care to elaborate? Thanks in advance!

  32. #25
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    CLZ uses a fixed amount of memory regardless of the number of threads you use for encoding: it's always 8N for the binary tree and 1N for the window, plus 25MB for the optimal parsing buffer and 1MB for the IO buffer. The only downside to using threads is that performance doesn't scale linearly due to cache invalidation, because the threads access a ton of data very close together in memory and trip each other up. But in pretty much every case using threads is much faster than a single thread while using the same amount of memory as a single-threaded run. I think it's pretty neat.
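    As a back-of-envelope example (just adding up those figures; the thread count doesn't change it):
    Code:
    #include <stddef.h>
    
    /* Approximate encode-side memory for one block of n bytes,
       summing the figures above (independent of thread count) */
    size_t clz_encode_memory(size_t n)
    {
        return 8 * n                 /* binary tree           */
             + 1 * n                 /* window                */
             + 25 * (1 << 20)        /* optimal parsing buffer */
             +  1 * (1 << 20);       /* IO buffer             */
    }
    /* e.g. a 100 MB block -> roughly 926 MB */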

  33. The Following 2 Users Say Thank You to Lucas For This Useful Post:

    elit (15th February 2019),Gonzalo (9th May 2018)

  34. #26
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Pretty neat indeed!
    I noticed that clz is relatively weak for the time it spends on compression. I always compare it to lzma and flzma2: usually lzma is stronger for the same time, and flzma2 is orders of magnitude faster for the same or better ratio, with all threads used and approximately the same memory. Is this only due to the early state of development, or a limitation of the algorithm itself? I.e., can we expect a leap in speed eventually?
    By the way: amazing decompression speed

  35. The Following User Says Thank You to Gonzalo For This Useful Post:

    Lucas (9th May 2018)

  36. #27
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    I have yet to add optimizations like prefetching hashes ahead of the current position, so there's definitely room for speeding up the encoder by at least 20%, maybe more. There is currently no entropy coding, so the only systems it's comparable to would be Ilya's CRUSH, Hamid's lzturbo 29 mode, and LZOMA. I'm currently developing the models to encode the lz77 structure and making appropriate changes to the optimal parser. The next release will have entropy coding, which will make its ratios more comparable to lzma.

    My future plans do involve a GPU implementation of the parallel binary tree, which should scale better since GPU architectures don't have large cache lines and should allow thousands of threads to build the tree without signaling. But that will come once the format is finalized.

  37. #28
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Thanks for the insights. I'll be trying to make it crash in the meantime
