
Thread: NanoZip compressed 2GB to 95KB! How is it even possible?

  1. #1
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts

NanoZip compressed 2GB to 95KB! How is it even possible?

    Hi there,
    I have a txt file that I generated using Crunch. Basically it's a word list txt file containing every combination of the English alphabet, limited to 6 characters, for example:
    aaaaaa
    aaaaab
    aaaaac
    .
    .
    zzzzzz

    The txt file size is ~2 gigabytes, which I tried to compress using everything from 7zip, zpaq, arc, zstandard, etc... the results were in the hundreds of MB. I even tested the nanozip nz_cm compressor (NanoZip's strongest mode), which got similar results to the others.
    Finally, I tested the nanozip nz_optimum1 compressor, and jackpot!! I got 95KB in 30 seconds @ 76 MB/s! The MD5 hashes match.
    - My first question is: how is this even possible? How did nz_optimum1 get seemingly impossible results, 1000x better than other state-of-the-art data compression algorithms?
    - My second question is: how can I achieve similar or even better results with other compression tools/software?
    nz_optimum uses QLFC-based block-sorting compression (BWT).

    The compressed nanozip file is attached
    Attached Files

  2. #2
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    The sequences are easily predictable; each line differs from the previous one by only one byte. Nz uses a delta coder for this type of data, and WinRAR also has special handling for it.

  3. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Clearly it is possible, because we could write a program that prints the file out and is itself even smaller.

    Likely this compressor has some form of string delta that produced lots of +1 values (probably not handling the carry-over case of az to ba, but that's still a good start), and those deltas then get compressed again.

    Eg a really basic string delta:

    Code:
    #include <stdio.h>
    #include <string.h>
    
    /* Encode: each output line becomes <shared-prefix-length byte><differing suffix>.
     * Decode (-d): rebuild each line from the previous line plus the stored suffix. */
    int main(int argc, char **argv) {
      char last_a[8192], *last = last_a;
      char curr_a[8192], *curr = curr_a;
      *last = 0;
    
      if (argc > 1 && strcmp(argv[1], "-d") == 0) {
        while (fgets(curr, 8192, stdin)) {
            // First byte holds how many leading chars are shared with the previous line
            int d = *curr, l = strlen(curr+1)-1 + d;
            memmove(curr+d, curr+1, l);   // shift the suffix into its final position
            memcpy(curr, last, d);        // copy the shared prefix from the previous line
            curr[l] = '\0';
            puts(curr);
            char *cp1 = last; last = curr; curr = cp1;
        }
    
      } else {
        while (fgets(curr, 8192, stdin)) {
            char *cp1, *cp2;
    
            // Find delta point: first position where this line differs from the last
            for (cp1 = last, cp2 = curr; *cp1 == *cp2; cp1++, cp2++)
                ;
    
            // Chomp newline
            for (cp1 = cp2; *cp1 != '\n'; cp1++)
                ;
            *cp1 = 0;
    
            // Encode: shared-prefix length as a single byte, then the differing suffix
            putchar(cp2-curr);
            puts(cp2);
            cp1 = last; last = curr; curr = cp1;
        }
      }
      return 0;
    }
    Running bsc -m0e2b1000M5 on the output of this then gets you to 712 bytes! Whereas on the original file it was much slower and gave around 100MB.
    (Amazingly, 712 bytes is less than 2x the size of the trivial little program I wrote to create the input data.)

    Edit: tweaked bsc params.

  4. #4
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Well, that is a very good example of why we need a project like Fairytale.
    You can see how much more efficient one method can be than the others depending on the kind of data involved. Digital information in the real world is seldom that predictable. Even when it is, most archivers will fail to even see it. An ideal compressor would be capable of "seeing" and understanding the contents of a given file, and of adapting itself to the best-fitting solution. In this case, only NanoZip realized it could use a delta transform, and even that probably isn't what really happened; most likely nz -co is simply set up to use delta while the rest of the modes don't.


    A truly intelligent archiver would even be capable of creating a computer program that outputs that file of yours (like JamesB did) without human intervention. But that is plain science fiction right now.

  5. #5
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts
    JamesB, I have compiled your string delta program (on Windows); now how can I use it to compress the txt file?
    Btw, you only need to run bsc -e2b1000M5; m0 is the default.
    Attached Files

  6. #6
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
    I got almost 4TB of data down to a handful of bytes by using a 64-bit counter RLE encoding.
    It was all o's, but the MD5 hash checked out afterwards.


    Actually you should be more impressed with yourself.
    You managed to compress the entire file down to below 1KB:

    Code:
    Hi there,
    I have a txt file that I generated using Crunch, basically it's a word list txt file containing every combination of the English alphabets limited to 6 characters, example:
    aaaaaa
    aaaaab
    aaaaac
    .
    .
    zzzzzz
    That contains enough information for any human being to decompress it back to the full size.
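    As an aside, the 64-bit counter RLE SvenBent describes can be sketched in a few lines (this assumes a simple <byte, 64-bit native-endian count> output format, not any particular tool's): 4TB of a single repeated character collapses to one 9-byte record, and a matching decoder just writes each byte back count times.
    Code:
    #include <stdio.h>
    #include <stdint.h>
    
    /* Encode stdin as a sequence of <byte, 64-bit run length> records. */
    int main(void) {
        int c, prev = EOF;
        uint64_t count = 0;
        while ((c = getchar()) != EOF) {
            if (c == prev) {
                count++;                         /* extend the current run */
            } else {
                if (prev != EOF) {               /* flush the finished run */
                    putchar(prev);
                    fwrite(&count, sizeof count, 1, stdout);
                }
                prev = c;
                count = 1;
            }
        }
        if (prev != EOF) {                       /* flush the final run */
            putchar(prev);
            fwrite(&count, sizeof count, 1, stdout);
        }
        return 0;
    }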

  7. #7
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts
    .
    Last edited by 2pact; 15th May 2018 at 04:58.

  8. #8
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by SvenBent View Post
    It was all o's, but the MD5 hash checked out afterwards.
    I see, so basically you had 1 letter repeating for 4TB.

    This topic is about finding the common/universal pattern in a wordlist txt file, so SvenBent, take your 4TB file somewhere else; perhaps the Recycle Bin would be a more efficient algorithm for you.

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    NanoZip just had good enough heuristics and flexible enough filters to apply a delta filter on fixed-size records. That made the data more compressible for later stages. You can create such filters as standalone applications, just as word transforms are often distributed as separate programs (xml-wrt, drt, etc.).

    I don't see much sense in thoroughly analyzing this case, though. How often do you encounter such files in reality? It makes no sense to store on disk a file that you can generate on the fly, faster than disk transfer, with a simple program.
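    As a rough sketch of such a standalone filter: the following generic stride-based byte delta (an illustration, not NanoZip's actual filter) can be piped in front of any general-purpose compressor and undone afterwards with -d. A stride of 7 matches an LF-terminated 6-letter word list; use 8 if the lines end in CRLF.
    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define STRIDE 7   /* bytes per record: 6 letters + LF */
    
    /* Forward: replace each byte with its difference from the byte one record
     * (STRIDE bytes) earlier.  Inverse (-d): add the previous record back. */
    int main(int argc, char **argv) {
        int decode = argc > 1 && strcmp(argv[1], "-d") == 0;
        unsigned char prev[STRIDE] = {0};   /* bytes of the previous record */
        unsigned char buf[1 << 16];
        size_t n, i, phase = 0;
        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
            for (i = 0; i < n; i++) {
                unsigned char in = buf[i];
                if (decode) {
                    buf[i] = (unsigned char)(in + prev[phase]);
                    prev[phase] = buf[i];   /* remember the reconstructed byte */
                } else {
                    buf[i] = (unsigned char)(in - prev[phase]);
                    prev[phase] = in;       /* remember the original byte */
                }
                phase = (phase + 1) % STRIDE;
            }
            fwrite(buf, 1, n, stdout);
        }
        return 0;
    }
    On the word list the filtered output is almost entirely zero bytes plus roughly one +1 per line (with occasional carry bytes), which is what the later stages can then squeeze down.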

  10. #10
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Last edited by Gonzalo; 16th May 2018 at 16:27.

  11. #11
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 150 Times in 18 Posts


    Hundreds of MB... It's not that bad.

    7zip compresses the file to 14.763.253. Razor needs 675.573 bytes. But both archives can be compressed once more (see attachments):
    LZMA -> 1.308.467
    rz -> 85.701

    Since LZMA and RZ have no fitting filter or transform, their LZ-syntax must suffice. Thus, this file's content is an extreme example/spam of MRU0-match + symbol-after-match over and over again. And this is the reason why both archives can be compressed once more.

    If an LZ algorithm's syntax has no MRU matches, no symbol-after-match, a bad parser, or uses just prefix coding, it'll perform pretty badly on this file.

    Anyway, you can construct a billion examples of funky generating functions. It would be a bad design decision to bloat a general-purpose compression algorithm for these corner cases.

    Interesting example. Thanks!
    Attached Files
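    To get a feel for how extreme that MRU0-match + symbol-after-match spam is, here is a rough counting sketch (just an illustration of the idea, not Razor's or LZMA's actual token format): it greedily models the file as <rep-match at the constant line stride of 7, one literal> tokens and reports how many such tokens cover the whole file.
    Code:
    #include <stdio.h>
    #include <stdint.h>
    
    /* Count the <rep-match at offset 7, literal> tokens needed to describe stdin:
     * a match is extended while the current byte equals the byte seen 7 positions
     * earlier; every mismatching byte ends one token as its literal. */
    int main(void) {
        unsigned char hist[7] = {0};     /* bytes seen 7 positions ago */
        uint64_t pos = 0, tokens = 0, matched = 0;
        int c;
        while ((c = getchar()) != EOF) {
            if (pos >= 7 && hist[pos % 7] == (unsigned char)c)
                matched++;               /* byte covered by the rep-match */
            else
                tokens++;                /* match ends; this byte is the literal */
            hist[pos % 7] = (unsigned char)c;
            pos++;
        }
        printf("%llu bytes -> %llu <rep-match, literal> tokens "
               "(%llu bytes covered by matches)\n",
               (unsigned long long)pos, (unsigned long long)tokens,
               (unsigned long long)matched);
        return 0;
    }
    On the word list this works out to roughly one tiny token per 7-byte line, which is exactly the "over and over again" pattern described above and the reason the resulting archives can themselves be compressed once more.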

  12. #12
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    The released version of GLZA doesn't support compression of 2 GB files, but I added large file support in my development version and got a compressed file that is 100 bytes using GLZA c -r8500 englishlist.txt englishlist.glza. That seemed too good, but I decompressed the file and it matched. It uses a delta filter with a stride of 7 and then massive hierarchical deduplication of repeating strings. The final grammar size is 195 symbols using 37 grammar rules, 5 terminals, and 115 repeats. I will try to release an update with large file support later this month.

    Code:
    c:\timer64 GLZA d englishlist.glza a:englishlist.txt
    Decompressed 100 bytes -> 2162410432 bytes (0.0000 bpB) in 2.872 seconds.
    
    
    Kernel  Time =     0.546 =   19%
    User    Time =     3.468 =   120%
    Process Time =     4.015 =   139%    Virtual  Memory =    884 MB
    Global  Time =     2.875 =   100%    Physical Memory =    703 MB
    Last edited by Kennon Conrad; 16th May 2018 at 02:00.
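    For some intuition on how a few dozen grammar rules can "describe" a 2 GB file: after a stride-7 delta the stream is almost entirely one repeated 7-byte record, and a straight-line grammar can cover 2^k copies of a record with only k+1 rules by letting each rule expand to two copies of the previous one. The toy expander below illustrates only that doubling principle (it is not GLZA's actual grammar format; '.' and '+' stand in for delta values 0 and +1):
    Code:
    #include <stdio.h>
    #include <stdint.h>
    
    #define K 20   /* R1..R20 are doubling rules; R0 is the terminal record */
    
    /* Expand rule r, printing at most 'budget' leading symbols, and return the
     * full expanded length.  R0 -> one 7-symbol record, R_i -> R_{i-1} R_{i-1}. */
    static uint64_t expand(int r, uint64_t budget) {
        if (r == 0) {
            if (budget) fputs(".....+.", stdout);
            return 7;
        }
        uint64_t left  = expand(r - 1, budget);
        uint64_t right = expand(r - 1, budget > left ? budget - left : 0);
        return left + right;
    }
    
    int main(void) {
        uint64_t total = expand(K, 4 * 7);   /* print only the first 4 records */
        printf("\n%d rules expand to %llu symbols\n",
               K + 1, (unsigned long long)total);
        return 0;
    }
    The real grammar obviously also has to describe the first line and the occasional carry records where a letter wraps from z back to a, which is presumably where most of the remaining symbols go.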

  13. #13
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Here is a 102 byte file that decompresses to englishlist.txt with GLZA v0.10.1 and the following command. It is NOT a .zip file but the site filters won't allow me to post without an extension or with the .glza extension.
    Code:
    GLZA d englishlist.zip englishlist.txt
    Attached Files

  14. #14
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts
    .
    Last edited by 2pact; 17th May 2018 at 00:48.

  15. #15
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts
    .
    Last edited by 2pact; 17th May 2018 at 00:46.

  16. #16
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I don't see much sense in throughly analyzing this case, though. How often you encounter such files in reality? It makes no sense to store on disk a file that you can generate on the fly faster than disk transfer and with a simple program.
    On my computer, generating the list using Crunch took ~120 seconds, while extracting the NanoZip archive took 30 seconds (4x faster than generating).

  17. #17
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    I think Piotr meant that if someone wrote a program to print that output in a language like C, with proper optimizations, it could reach higher throughput than any decompressor can achieve; this is because compression itself is an abstraction of the original content. Building a dedicated algorithm to produce a known result will always be faster than reconstructing it from a ton of references into a window, plus delta transforming, plus whatever added overhead of entropy coding, etc. Honestly, you could probably build an algorithm to recreate this file at over 1GB/s. I'm surprised this needs to be explained, considering you joined this forum so long ago.

  18. #18
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts


    Quote Originally Posted by Lucas View Post
    I think Piotr meant that if someone wrote a program to print that output in a language like C, with proper optimizations, it could reach higher throughput than any decompressor can achieve; this is because compression itself is an abstraction of the original content. Building a dedicated algorithm to produce a known result will always be faster than reconstructing it from a ton of references into a window, plus delta transforming, plus whatever added overhead of entropy coding, etc. Honestly, you could probably build an algorithm to recreate this file at over 1GB/s. I'm surprised this needs to be explained, considering you joined this forum so long ago.
    Absolutely. Even GLZA gets 750 MB/s; it should be easy to bump that up to 1 GB/s by taking out all the decoding and maybe doing a little optimization. And Piotr could probably beat that by 2x.

    Back to 2pact's comment, I want to be clear that GLZA does not have BWT capability. It compresses this file well because it is able to "describe" the file with a tiny grammar compared to the file size. There are very few compressors that use grammar rules, and it's not necessarily the first thing that pops into people's heads as an answer to your questions. GLZA works really well on files like yours, but if you make the data more random the advantage will disappear at some point.

    These are good people posting here and they deserve to be respected. They are trying to help even if the answers were not helpful to you and perhaps slightly hostile. People come on here looking for magic potions, insisting they should exist. I'm not saying that's you, it's just that sometimes people having bad days may not think things through all the way and the expectation of the existence of magic potions can get tiring. When I am in similar situations, I often find I can learn more by asking people to explain their comment or why they consider it helpful than I can by replying in kind.

  19. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Gonzalo (16th May 2018),hexagone (16th May 2018)

  20. #19
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    I'm sorry for my comments. I see that they hurt you, and that wasn't the intention. I didn't mean to come across as patronizing in any way. I was just describing what was, or what I wanted to be, my own path from noob to not-so-noob, while making a little fun of myself and other similar people along the way.
    It did seem to me like you didn't understand basic compression concepts, though. Perhaps I misunderstood, but you made some generalizations that people without much knowledge usually make, like trying to extrapolate one case to all others. Your take on how delta works, and on what a natural-language description of a string is, looks plain wrong to me. Even so, I have made worse comments and worse mistakes over the years, so I don't have a problem with that...
    I repeat, maybe it's just me who misunderstood something, so in any case, let me erase my post and apologize once more. Have a nice day!

  21. #20
    Member
    Join Date
    May 2012
    Location
    usa
    Posts
    24
    Thanks
    5
    Thanked 0 Times in 0 Posts
    Thank you all for your help and clarification. You are all appreciated.
    I created a Python script that can generate such a list promptly.
    https://github.com/christophered/PyC...PyCrunchpy3.py

    Can anyone share a script that can generate a UTF-8 wordlist at about 1GB/s? It would be helpful to me.

  22. #21
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Here's some C code for generating the "word" list:
    Code:
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    /* Pre-build one "section": all 26^4 lines that share the same first two
     * letters, 8 bytes per line (6 letters + CRLF). Then, for every
     * first-two-letter combination, patch bytes 0 and 1 of each line and
     * write the whole section out. */
    int main(int argc, char** argv) {
        int64_t ALPHABET_SIZE = 'z' - 'a' + 1;
        int64_t LINE_SIZE = 8;
        int64_t SECTION_SIZE = ALPHABET_SIZE * ALPHABET_SIZE * ALPHABET_SIZE
                * ALPHABET_SIZE * LINE_SIZE;
        uint8_t * section = malloc(SECTION_SIZE);
        // Fill in the last four letters and the line terminators once.
        for (uint8_t l2 = 'a'; l2 <= 'z'; l2++) {
            int64_t i2 = l2 - 'a';
            for (uint8_t l3 = 'a'; l3 <= 'z'; l3++) {
                int64_t i3 = l3 - 'a';
                for (uint8_t l4 = 'a'; l4 <= 'z'; l4++) {
                    int64_t i4 = l4 - 'a';
                    for (uint8_t l5 = 'a'; l5 <= 'z'; l5++) {
                        int64_t i5 = l5 - 'a';
                        int64_t line_index = ((i2 * ALPHABET_SIZE + i3)
                                * ALPHABET_SIZE + i4) * ALPHABET_SIZE + i5;
                        int64_t byte_index = line_index * LINE_SIZE;
                        section[byte_index + 0] = 'a';
                        section[byte_index + 1] = 'a';
                        section[byte_index + 2] = l2;
                        section[byte_index + 3] = l3;
                        section[byte_index + 4] = l4;
                        section[byte_index + 5] = l5;
                        section[byte_index + 6] = 13; // CR
                        section[byte_index + 7] = 10; // LF
                    }
                }
            }
        }
        // Rewrite only the first two letters for each section and flush it.
        for (uint8_t l0 = 'a'; l0 <= 'z'; l0++) {
            for (uint8_t l1 = 'a'; l1 <= 'z'; l1++) {
                for (uint64_t byte_index = 0; byte_index < SECTION_SIZE;
                        byte_index += LINE_SIZE) {
                    section[byte_index + 0] = l0;
                    section[byte_index + 1] = l1;
                }
                // fwrite_unlocked is a glibc extension; plain fwrite works too.
                fwrite_unlocked(section, 1, SECTION_SIZE, stdout);
            }
        }
        return EXIT_SUCCESS;
    }
    Test run:
    Code:
    $ wordlist | head
    aaaaaa
    aaaaab
    aaaaac
    aaaaad
    aaaaae
    aaaaaf
    aaaaag
    aaaaah
    aaaaai
    aaaaaj
    
    real	0m0.011s
    user	0m0.001s
    sys	0m0.012s
    $ wordlist | tail
    zzzzzq
    zzzzzr
    zzzzzs
    zzzzzt
    zzzzzu
    zzzzzv
    zzzzzw
    zzzzzx
    zzzzzy
    zzzzzz
    
    real	0m1.910s
    user	0m1.670s
    sys	0m0.900s
    $ wordlist | wc
    308915776 308915776 2471326208
    
    real	0m30.642s
    user	0m30.677s
    sys	0m0.877s
    $ wordlist > /dev/null
    
    real	0m0.203s
    user	0m0.203s
    sys	0m0.000s
    $
    Generation speed is about 12 GB/s on my computer. I could probably push it somewhat higher, but the result isn't that bad for 20 minutes of coding.

  23. The Following 2 Users Say Thank You to Piotr Tarsa For This Useful Post:

    Kennon Conrad (17th May 2018),load (17th May 2018)

  24. #22
    Member
    Join Date
    Mar 2016
    Location
    USA
    Posts
    47
    Thanks
    5
    Thanked 22 Times in 14 Posts
    Not that fast, but compact code:

    Code:
    perl -E"say for'a'x6..'z'x6">word.txt

  25. The Following 2 Users Say Thank You to MegaByte For This Useful Post:

    JamesB (17th May 2018),load (17th May 2018)

  26. #23
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    It is possible to get essentially zero-byte compression by computing the keys on the fly using base-26 conversion.

    Here is a program to generate the file for a given word length (default 6):
    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    
    #define N 16
    /* Write v in base 26 ('a'..'z'), right-aligned into s[0..n-1] with a
     * terminating NUL at s[n]; untouched leading positions keep their
     * previous 'a's, which act as leading zeros. */
    char *utostr26(size_t v, char *s, int n) {
      char *p = &s[n];
      *p = 0;
      do *--p = (v % 26) + 'a'; while (v /= 26);
      return p;
    }
    
    int main(int argc, char *argv[]) {
      size_t n = (1ull << 40), i, len = 6;     /* line limit and word length */
      char s[N+1] = "aaaaaaaaaaaaaaaa", e[N+1] = "zzzzzzzzzzzzzzzz";
      if (argc > 1) len = atoi(argv[1]);
      if (argc > 2) n   = atoi(argv[2]);
      if (len > N) len = N;
      for (i = 0; i < n; i++) {
        utostr26(i, s, len);                   /* i-th key in base 26 */
        puts(s);
        if (!memcmp(s, e, len)) break;         /* stop at "zzz...z" */
      }
      return 0;
    }
