
Thread: zpaq updates

  1. #271
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    I thought this might be a problem when I took out the "e" command. Maybe I will put it back, but I don't want to duplicate a lot of stuff you could do with shell commands. Meanwhile you would have to create the archive with relative paths or no paths. Otherwise you can only extract them to the same place or extract them one at a time with new names. Build the archive like this:

    cd /tmp/compresiontest/folder/subfolder/content
    rm /tmp/archive.zpaq
    zpaq a /tmp/archive *
    cd /home/user/Downloads/cal
    zpaq x /tmp/archive

  2. #272
    Member
    Join Date
    Nov 2011
    Location
    Germany
    Posts
    6
    OK, but just imagine I'm not the one who built it, and the content is /etc/something. That can only be overwritten by root, so there would be no way to extract it.

  3. #273
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    Yeah, that's a problem. I'm thinking about adding the following:

    zpaq x archive out/
    To strip off internal paths and extract to directory out.

    zpaq c archive files...
    To create a new archive. Yes you could delete the old one and add files, but it's a nuisance when you are tuning a config file.

    zpaq c archive
    To compress archive to archive.zpaq without saving the file name. Extraction would drop the .zpaq extension. When you are compressing just one file, the name isn't needed. It's also needed to make tiny archives like http://mattmahoney.net/dc/pi.txt.zpaq (pi.txt compressed to 114 bytes).

    zpaq -mF l
    To translate the ZPAQL code in F.cfg to a byte string that you can paste into a C++ program. I plan to use this when I add better compression to future versions. Eventually I want compression that applies different models by file type like bmp, jpg, exe, text, and binary.

    Edit: zpaq v4.02 is released with these changes. http://mattmahoney.net/dc/zpaq.html
    zpaq.exe (Win32), zpaq (Linux 64 bit), zpaq.402.zip (source, POD), zpaq.1.html (docs), pi.cfg (modified comments for new version).
    Last edited by Matt Mahoney; 29th November 2011 at 01:36.

  4. #274
    Member
    Join Date
    Nov 2011
    Location
    Germany
    Posts
    6
    OK, this works fine. However, the internal folder structure of the archive is lost, so the last improvement needed for perfect archiving would be for zpaq x archive.zpaq out/ to extract to out/etc/somefolder/somefile.


    I think the file permissions are not that important, because for full disk backup they are too complex (symlinks, SUID and all these differences for all filesystems) - people would use .tar of course in order to be very sure.
    Thank you for this fast reaction and bringing a usable ZPAQ to us.

  5. #275
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    Yes, you lose directory structure, but you can avoid the problem by making archives with relative paths like it's usually done. And I agree about file system metadata. It is OS dependent and you lose information if you don't run as root (and even if you do, depending on how permissions are set). zpaq backs up files, not file systems.

  6. #276
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    Added a FAQ to http://mattmahoney.net/dc/zpaq.html
    If you can think of anything else, let me know and I'll add it.

  7. #277
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    libzpaq v4.01. Minor fix for Mac OS. Changed MAP_ANONYMOUS to MAP_ANON in libzpaq.cpp. Also removed commented out debugging code and unneeded #include <stdio.h>. This has no effect on the posted Linux and Windows zpaq executables, so I did not update them.

  8. #278
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    Unofficial pre-release of zpaq v4.03. http://mattmahoney.net/dc/zpaq403b1.zip (source code only).

    - Fixed bug where "zpaq u archive" would not save filenames of updated files.
    - If zpaq skips comparing files due to missing checksums or file sizes, it now reports that files are not compared rather than that they are different (which might be wrong). However it still treats them as different.
    - -bs (solid archive) option saves locator tags, file sizes, and checksums just like other block size options. Added -n option if you don't want to save these to get the absolute smallest archive. Either option may cause incremental update to fail.
    - Added -r option to recurse subdirectories for compression. So far it only works in Windows. Final release will also support Linux, but not today.
    - Wildcard expansion works in Windows with any compiler.
    - Tried to make the help screen more clear.
    - zpaq.1.pod describes new options.

    Changes I am considering, but probably not all in the next version:
    - Faster decompression for -m1 and -m2 using cache-friendly IBWT (already developed as config files).
    - File-specific models for -m4 (.bmp, .jpg, .exe, .wav, etc)
    - A new faster LZ-based model -m0 (not yet developed).

  9. #279
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    zpaq v4.03 is released (source, Win32 exe, Linux64 exe, docs). http://mattmahoney.net/dc/zpaq.html

    In addition to the above changes, I implemented -r for Linux and added a -f option to force overwrite during extraction. (I found it useful when testing big archives.) Compression improvements will wait for the next version. In the Linux version, -r only compresses regular files and ignores symbolic links, devices, named pipes, sockets, etc. You can still compress these without -r. The Windows version doesn't treat files any differently with -r. It will compress system and hidden files and other special types either way. You probably don't want to do that because attributes are not preserved.

    But already it seems that disk I/O is a significant fraction of total time with default -m1 compression. I need to think about how to fix this. Right now all compression is to temporary files in separate threads, and then appended to the archive serially if there are no errors. Extraction is to temporary files only if the file is split into blocks. I suppose it is possible to buffer output to memory if not too big, and use a separate thread to append output in parallel with compression. I did optimize the append() function to use a larger buffer and not open and close the destination file for every source file, but this didn't seem to help any.

  10. #280
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    649
    Quote: Originally posted by Matt Mahoney
    ...Right now all compression is to temporary files in separate threads, and then appended to the archive serially if there are no errors...
    Why don't you use a virtual stream instead of temporary files? I mean, allocate a predefined memory block (say 1 MiB) for each output stream and write your output to that memory area instead of a file until the block is full. When it is full, write the whole block to a temporary file and continue working on that temporary file. With that approach you can clearly speed up processing (especially on directories with lots of small files). IIRC, 7-zip uses the same approach for its multi-stream coders.
    BIT Archiver homepage: www.osmanturan.com
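
    The spill-to-disk idea described in this post could look roughly like the following C sketch. The names SpillStream, spill_init, spill_write, spill_append_to and spill_close are invented for illustration; nothing here is taken from zpaq, libzpaq, or 7-zip.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output stream: bytes stay in memory until the buffer
   would overflow, then everything moves to a temporary file. */
typedef struct {
  unsigned char *buf;  /* in-memory buffer */
  size_t cap, used;    /* capacity and bytes currently held in buf */
  FILE *tmp;           /* NULL until the stream has spilled to disk */
} SpillStream;

int spill_init(SpillStream *s, size_t cap) {
  assert(cap > 0);
  s->buf = malloc(cap);
  s->cap = cap;
  s->used = 0;
  s->tmp = NULL;
  return s->buf != NULL;
}

/* Returns 1 on success, 0 on I/O failure. */
int spill_write(SpillStream *s, const void *data, size_t n) {
  if (!s->tmp && s->used + n <= s->cap) {  /* still fits in memory */
    memcpy(s->buf + s->used, data, n);
    s->used += n;
    return 1;
  }
  if (!s->tmp) {                           /* first overflow: spill */
    s->tmp = tmpfile();
    if (!s->tmp) return 0;
    if (fwrite(s->buf, 1, s->used, s->tmp) != s->used) return 0;
    s->used = 0;
  }
  return fwrite(data, 1, n, s->tmp) == n;
}

/* Append the whole stream to dst (for example, the archive). */
int spill_append_to(SpillStream *s, FILE *dst) {
  if (!s->tmp) return fwrite(s->buf, 1, s->used, dst) == s->used;
  char block[4096];
  size_t n;
  rewind(s->tmp);
  while ((n = fread(block, 1, sizeof block, s->tmp)) > 0)
    if (fwrite(block, 1, n, dst) != n) return 0;
  return !ferror(s->tmp);
}

void spill_close(SpillStream *s) {
  if (s->tmp) fclose(s->tmp);
  free(s->buf);
}
```

    With a 1 MiB capacity, small files never touch the disk before the final append, and only large outputs pay for a temporary file.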

  11. #281
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    That's the approach I am thinking about. Currently zpaq divides the input into blocks and allocates them to threads from largest to smallest so that all of the cores can finish at about the same time. To use in-memory buffers I would need to do this over a sliding window so there are not too many buffers in memory at once. Then a separate thread at the back of the window would append the output buffers while others continue to work. This would lose a safety feature where if any threads detect compression errors, then all of the temporaries are discarded and the archive is not updated.
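
    The largest-to-smallest allocation can be illustrated with a small greedy simulation in C. This is only a sketch of the general heuristic, not zpaq's scheduler: assign_blocks is a made-up name, and it assumes compression time is proportional to block size.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: larger sizes first. */
static int by_size_desc(const void *a, const void *b) {
  long x = *(const long *)a, y = *(const long *)b;
  return (x < y) - (x > y);
}

/* Sort sizes[] in place, largest first, then give each block to the
   worker with the least total work so far. load[] receives the
   per-worker totals. The return value is the largest total, a rough
   stand-in for the time at which the last worker finishes. */
long assign_blocks(long *sizes, int nblocks, long *load, int nworkers) {
  assert(nworkers > 0);
  for (int w = 0; w < nworkers; ++w) load[w] = 0;
  qsort(sizes, (size_t)nblocks, sizeof sizes[0], by_size_desc);
  for (int i = 0; i < nblocks; ++i) {
    int best = 0;
    for (int w = 1; w < nworkers; ++w)
      if (load[w] < load[best]) best = w;
    load[best] += sizes[i];
  }
  long max = 0;
  for (int w = 0; w < nworkers; ++w)
    if (load[w] > max) max = load[w];
  return max;
}
```

    For block sizes 3, 7, 2, 8, 5 on two workers, the totals come out as 13 and 12, so both finish at about the same time. Starting with the small blocks instead risks leaving one worker alone with the largest block at the end.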

  12. #282
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    I am developing a level 2 update to the ZPAQ standard. The idea is to support very fast compression algorithms like lzpre by omitting the context model entirely and using only the postprocessor to decompress. Here is an unreleased preview of the standard: http://mattmahoney.net/dc/zpaq200.pdf

    The change is minimal, but unfortunately it breaks forward compatibility with older decompressors, thus level 2. It allows the number of components n in the COMP section to be 0. In this case, the input to the postprocessor is read directly from the compressed data, except that it is almost uncompressed. I say "almost" because we need to avoid runs of 4 consecutive 0 bytes to mark the end of the segment, as is done with the arithmetic coded data. This is achieved by encoding runs of k=1...255 zero bytes as (0 0 0 k). Runs of 1 or 2 zeros do not need to be RLE encoded, but runs of 3 or more do. The arithmetic decoder is now called a ZRLE/arithmetic decoder (ZRLE = zero run length encoding).
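
    As a worked illustration of the (0 0 0 k) rule, here is a minimal in-memory sketch in C. The function names zrle_encode and zrle_decode are invented for this example and are not part of libzpaq; the code follows the description in this post, not any final version of the standard, and the decoder assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>

/* Append the 4-byte code (0 0 0 k). k = 0 is the end-of-segment marker. */
static size_t put_run(unsigned char *out, size_t n, unsigned k) {
  assert(k <= 255);
  out[n++] = 0; out[n++] = 0; out[n++] = 0;
  out[n++] = (unsigned char)k;
  return n;
}

/* Encode in[0..len) into out and return the encoded size.
   out must have room for at least len + 8 bytes. */
size_t zrle_encode(const unsigned char *in, size_t len, unsigned char *out) {
  size_t n = 0;
  unsigned z = 0;  /* length of the pending zero run */
  for (size_t i = 0; i < len; ++i) {
    if (in[i] == 0) {
      if (z == 255) { n = put_run(out, n, z); z = 0; }
      ++z;
    } else {
      if (z > 2) n = put_run(out, n, z);     /* long run: code it */
      else while (z) { out[n++] = 0; --z; }  /* 1 or 2 zeros: literal */
      z = 0;
      out[n++] = in[i];
    }
  }
  if (z) n = put_run(out, n, z);  /* trailing zeros are always coded */
  return put_run(out, n, 0);      /* (0 0 0 0) ends the segment */
}

/* Decode until the (0 0 0 0) marker and return the decoded size. */
size_t zrle_decode(const unsigned char *in, unsigned char *out) {
  size_t i = 0, n = 0;
  for (;;) {
    if (in[i] == 0 && in[i + 1] == 0 && in[i + 2] == 0) {
      unsigned k = in[i + 3];
      if (k == 0) return n;      /* end of segment */
      while (k--) out[n++] = 0;  /* expand the run */
      i += 4;
    } else {
      out[n++] = in[i++];        /* literal byte, possibly a lone 0 */
    }
  }
}
```

    For the 10 input bytes 1 0 0 2 0 0 0 0 0 3, the encoder emits the 13 bytes 1 0 0 2 0 0 0 5 3 0 0 0 0: the two-zero run is copied as is, the five-zero run becomes (0 0 0 5), and (0 0 0 0) closes the segment.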

    There are a few other changes.

    - I specified maximum sizes of the components, usually 4-16 GB memory. These are already the limits supported by libzpaq.
    - I fixed an error in the description of the find() (hash table lookup) function. In one place in the standard I wrote <=, but in the code I wrote <, so I updated the standard to match the code so as to preserve compatibility. I had previously written that I would resolve errors this way.
    - I made libzpaq rather than unzpaq the reference standard. libzpaq is available under a less restrictive license (modified MIT rather than GPL).

    Before I release the standard I need to update libzpaq and zpaq and test it with lzpre and maybe a few other fast algorithms. The zpaq changes would be to allow the COMP and HCOMP sections to be empty (n = 0 in the COMP header). It will create a level 2 archive in this case and level 1 otherwise, to be compatible with older programs. libzpaq will need updates to the arithmetic encoder and decoder to read and write ZRLE data. It will also need updates to the comments to describe which parts of the code are parts of the standard. The JIT code and the compression code will not be. Probably that means compiling with -DNOJIT -DDEBUG to test for conformance. A test case for level 2 compliance is also needed because calgarytest.zpaq only tests level 1.

    Comments are welcome.

  13. #283
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    1,895
    > The idea is to support very fast compression algorithms like lzpre
    > by omitting the context model entirely

    Maybe it's possible to set up a "transparent" context model instead?
    (one that would process all bits with probability 0.5).
    Then maybe you'd be able to optimize this specific case for speed
    without breaking the compatibility.

    > I say "almost" because we need to avoid runs of 4 consecutive 0
    > bytes to mark the end of the segment

    This idea actually always seemed bothersome.
    It only accidentally appeared simple to implement with your current rangecoder.
    And it adds some clearly visible redundancy:
    Code:
    calgarytest.zpaq      1003792
    calgarytest.zpaq.zip  1001127
    Afaik, the best way is to process the data as a sequence of small blocks (<=64k),
    with some headers (like size==max flag, size, stored data flag etc).
    Having such a header is good for various asymmetric optimizations, and knowing
    the code size in advance makes it possible to cut off the rangecoder's "tail"
    (ie code after EOF can be considered an infinite stream of 00 or FF, and such
    bytes can be cut off from the end of the stream).

    > update libzpaq and zpaq and test it with lzpre and maybe a few other fast algorithms

    I hope you understand that zpaq would never be a competition for any fast coders.
    I suppose you could get some support from corresponding communities by using a
    java or python (or php : ) bytecode instead of zpaql, but you're not doing that either,
    and having to use zpaql to create a new codec is much harder than using any
    normal programming language.

    Imho the tasks where zpaq could be helpful are completely different.
    For example, automatically generating efficient models by a format spec
    similar to http://www.sweetscape.com/010editor/templates.html .
    Also data analysis (detecting various types of data based on model properties),
    maybe segmentation (reordering of data segments to improve further compression).
    Also experiments with massively parallel compression - having a formal model description
    is very useful there, because the model is much easier to understand in sequential form,
    but "massively parallel" implies things like each counter running in its own thread
    (it doesn't need to sync with anything if it stores the predictions), which would
    require a complete manual rewrite, if there's no way to automatically generate it.
    Surely, atm that's only applicable to encoder, but maybe we'd be able to find some ways
    to thread the decoder too.

  14. #284
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    Quote: Originally posted by Shelwien
    > The idea is to support very fast compression algorithms like lzpre
    > by omitting the context model entirely

    Maybe it's possible to set up a "transparent" context model instead?
    (one that would process all bits with probability 0.5).
    Then maybe you'd be able to optimize this specific case for speed
    without breaking the compatibility.
    I thought about that but unfortunately I would have a lot of optimizing to do. Even the simplest possible model is slow, in fact slower than lz1.cfg for decompression because there is more data to decode.
    Code:
    comp 0 0 0 0 1
      0 const 128 (fixed probability = 1/2).
    hcomp
      halt
    post 0 end
    
    (enwik8.zpaq -> 100,000,393, 26.7s, 35.0s, 2.5s)
    (enwik8.zpaq.zpaq-m3 -> 100,004,946, 326s)
    The times are shown for compression, decompression, and a second decompression which computes the SHA-1 checksum and stops because it is identical. So removing the SHA-1 would only improve the first 2 times by 2.5 sec. Also, the output is not further compressible, so there isn't an obvious way to decode it that would be faster than the current arithmetic coding.

    > I say "almost" because we need to avoid runs of 4 consecutive 0
    > bytes to mark the end of the segment

    This idea actually always seemed bothersome.
    It only accidentally appeared simple to implement with your current rangecoder.
    And it adds some clearly visible redundancy:
    Code:
    calgarytest.zpaq      1003792
    calgarytest.zpaq.zip  1001127
    Yes, but most of this is due to some very large models for testing that exercise all possible test cases. There are 36 segments, each ending with a marker of 4 zero bytes, so only 36 x 4 = 144 bytes are due to this.

    Afaik, the best way is to process the data as a sequence of small blocks (<=64k),
    with some headers (like size==max flag, size, stored data flag etc).
    Having such a header is good for various asymmetric optimizations, and knowing
    the code size in advance makes it possible to cut off the rangecoder's "tail"
    (ie code after EOF can be considered an infinite stream of 00 or FF, and such
    bytes can be cut off from the end of the stream).
    Yes, having pointers to the end of segments (and a maximum size) would be faster.

    I also thought about a completely different but universal self-describing format, a string description language that would just be a program followed by its uncompressed input data.

    > update libzpaq and zpaq and test it with lzpre and maybe a few other fast algorithms

    I hope you understand that zpaq would never be a competition for any fast coders.
    I suppose you could get some support from corresponding communities by using a
    java or python (or php : ) bytecode instead of zpaql, but you're not doing that either,
    and having to use zpaql to create a new codec is much harder than using any
    normal programming language.
    Well, I'm hoping to fix at least the speed problem.

    enwik8.gz -> 36,518,325, 10.6s, 1.7s
    enwik8.lzpre -> 37,144,854, 19.2s, 2.0s (3.8s with zpaq -mlz1 r)
    enwik8-mlz1-q.zpaq -> 30,746,245, 35.6s, 24.3s

    I designed ZPAQL to be fast rather than easy to program in. I assumed it would be possible to write a translator from Java byte code, C, or some other language to ZPAQL, but didn't focus my efforts there. As for speed, I could probably enlarge the ZPAQL instruction set to support a more direct mapping to x86. Currently the second test shows that ZPAQL is almost twice as slow as optimized C++, but most of the bottleneck is in an inefficient OUT instruction that goes through 3 layers of function calls, including one virtual. I plan to fix this by adding a block write function to class Writer, which should improve the decompression time to 2.4 seconds (based on removing the OUT instruction from lz1.cfg). I will still need to test the ZRLE decoder, but I don't suppose it would be any slower than using small blocks with sizes in the headers.

    Imho the tasks where zpaq could be helpful are completely different.
    For example, automatically generating efficient models by a format spec
    similar to http://www.sweetscape.com/010editor/templates.html .
    Also data analysis (detecting various types of data based on model properties),
    maybe segmentation (reordering of data segments to improve further compression).
    Also experiments with massively parallel compression - having a formal model description
    is very useful there, because the model is much easier to understand in sequential form,
    but "massively parallel" implies things like each counter running in its own thread
    (it doesn't need to sync with anything if it stores the predictions), which would
    require a complete manual rewrite, if there's no way to automatically generate it.
    Surely, atm that's only applicable to encoder, but maybe we'd be able to find some ways
    to thread the decoder too.
    Yeah, I have in mind compressors that analyze the data and choose a model or search over a large space of models for the best compression. This would make compression very much slower than decompression, but compression could be made parallel. Or it is possible to statically analyze an input file like a database or image to find the row length and modify the model to contain a PAQ-like repeat model with a fixed size in ZPAQL. But these things all take a lot of time to develop.

  15. #285
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Kce, PL
    Posts
    1,037
    The decoder could recognise one particular identity model and, knowing that it has no effect, skip it. You don't need anything generic IMO. And it seems much better than breaking the format. After all, I see forward compatibility as the biggest value of ZPAQ.

  16. #286
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    Yes, but ZPAQ has not yet been widely adopted, and I think the reason is the lack of very fast compression models. Going to a higher level maintains backward compatibility with old archives, but you will need to update the decompressor to read new archives. Also, existing models (non-empty COMP section) would still be saved in level 1 format.

    Here are some process times (2.0 GHz T3200, Win32, g++ 4.6.1 -O3) on a ZRLE encoder and decoder with enwik8:

    zrle c enwik8 enwik8.copy 1.0s (copy)
    zrle c enwik8 nul: 0.6s
    zrle e enwik8 enwik8.zrle 1.4s (encode)
    zrle e enwik8 nul: 1.0s
    zrle d enwik8.zrle enwik8 1.7s (decode)
    zrle d enwik8.zrle nul: 1.3s

    The encoder encodes runs of k=3..255 zero bytes as (0 0 0 k) and marks EOF with (0 0 0 0).

    Code:
    #include <stdio.h>
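    int main(int argc, char** argv) {
      // Expect: zrle command input output
      if (argc<4) return fprintf(stderr, "usage: zrle command input output\n"), 1;
      FILE* in=fopen(argv[2], "rb");
      FILE* out=fopen(argv[3], "wb");
      if (!in || !out) return perror("fopen"), 1;
      int c;
      unsigned int z=0;  // zero run length (encoder) or 4 byte window (decoder)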
      switch(argv[1][0]) {
        case 'c':  // copy
          while ((c=getc(in))!=EOF) putc(c, out);
          break;
        case 'e':  // ZRLE encode
          while ((c=getc(in))!=EOF) {
            if (c) {
              if (z>2) fprintf(out, "%c%c%c%c", 0, 0, 0, z), z=0;
              else for (; z; --z) putc(0, out);
              putc(c, out);
            }
            else {
              if (z==255) fprintf(out, "%c%c%c%c", 0, 0, 0, 255), z=0;
              ++z;
            }
          }
          if (z) fprintf(out, "%c%c%c%c", 0, 0, 0, z);
          fprintf(out, "%c%c%c%c", 0, 0, 0, 0);
          break;
        case 'd': // ZRLE decode
          z=getc(in);
          z=z<<8|getc(in);
          z=z<<8|getc(in);
          z=z<<8|getc(in);
          while (z) {
            if (z>255)
              c=z>>24, z=z<<8|getc(in), putc(c, out);
            else {
              putc(0, out);
              if (--z==0) {
                z=getc(in);
                z=z<<8|getc(in);
                z=z<<8|getc(in);
                z=z<<8|getc(in);
              }
            }
          }
          break;
      }
      return 0;
    }
    Edit: the following decodes 0.1 sec. faster:

    Code:
          while (z) {
            if (z>255) {
              z=z<<8|z>>24;
              putc(z, out);
              z=z&-256|getc(in);
            }
    Edit: some more tests using block encoding with the block size in the header. This appears to be faster.

    zrle b enwik8 enwik8.block 1.0s
    zrle b enwik8 nul: 0.6s
    zrle a enwik8.block enwik8 1.0s
    zrle a enwik8.block nul: 0.6s

    Code:
        case 'b':  // block encode with size in header
          fseek(in, 0, SEEK_END);
          z=ftell(in);
          rewind(in);
          fprintf(out, "%c%c%c%c", z>>24, z>>16, z>>8, z);
          while ((c=getc(in))!=EOF) putc(c, out);
          break;
        case 'a': // block decode
          z=getc(in);
          z=z<<8|getc(in);
          z=z<<8|getc(in);
          z=z<<8|getc(in);
          for (; z; --z) putc(getc(in), out);
          break;
    Last edited by Matt Mahoney; 6th January 2012 at 02:21.

  17. #287
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    I took Shelwien's advice and replaced ZRLE with uncompressed blocks, each preceded by a 4 byte length, since my tests show this to be faster. http://mattmahoney.net/dc/zpaq200.pdf
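
    A sequence of length-prefixed blocks can be sketched in C as follows. The 4 byte big-endian length matches the byte order used by the 'b' and 'a' test commands earlier in the thread. The zero-length block used as a terminator and the names put_block and get_blocks are assumptions made for this example, so check the specification for the exact layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write one block at out[pos]: a 4 byte big-endian length, then the
   bytes. Returns the new write position. */
size_t put_block(unsigned char *out, size_t pos,
                 const unsigned char *data, unsigned long len) {
  assert(len <= 0xFFFFFFFFul);
  out[pos++] = (unsigned char)(len >> 24);
  out[pos++] = (unsigned char)(len >> 16);
  out[pos++] = (unsigned char)(len >> 8);
  out[pos++] = (unsigned char)len;
  if (len) memcpy(out + pos, data, len);
  return pos + len;
}

/* Copy block payloads to out until a zero length is read (assumed
   terminator). Returns the number of payload bytes copied. */
size_t get_blocks(const unsigned char *in, unsigned char *out) {
  size_t pos = 0, n = 0;
  for (;;) {
    unsigned long len = (unsigned long)in[pos] << 24
                      | (unsigned long)in[pos + 1] << 16
                      | (unsigned long)in[pos + 2] << 8
                      | (unsigned long)in[pos + 3];
    pos += 4;
    if (len == 0) return n;
    memcpy(out + n, in + pos, len);
    pos += len;
    n += len;
  }
}
```

    Because the decoder knows each block size up front, it can move the payload with one bulk copy instead of inspecting every byte for a marker, which is the likely source of the speed difference measured above.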

  18. #288
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    zpaq 5.00 and libzpaq 5.00 preview. http://mattmahoney.net/dc/zpaq500b1.zip

    Implements proposed level 2 standard. Files can be stored uncompressed (with or without preprocessing). It also has some performance enhancements. Decompression with -n skips the checksum check, saving 1.3 seconds on enwik8. Also adds read() and write() functions to libzpaq Reader and Writer classes. enwik8 compresses to 37 MB with lz1.cfg with an empty COMP section (bytewise LZ77 with no modeling). Then it decompresses in 2.4 seconds on my laptop in 1 thread, same speed as lzpre d.

  19. #289
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    ZPAQ level 2 documentation and libzpaq support is released. http://mattmahoney.net/dc/zpaq.html

    ZPAQ level 2 is backward compatible with level 1 but not forward compatible. New programs can read old archives, but older programs cannot read files saved in the new format. The only change to the format is that n = 0 components are now allowed. In this case, the data is stored uncompressed (but possibly preprocessed). This change will allow very fast algorithms (like LZ77+Huffman) to be coded in ZPAQL.

    Specification: http://mattmahoney.net/dc/zpaq200.pdf
    Reference decoder: http://mattmahoney.net/dc/unzpaq200.cpp
    Test archive: http://mattmahoney.net/dc/calgarytest2.zpaq
    Updated libzpaq v5.00: http://mattmahoney.net/dc/libzpaq500.zip
    Updated libzpaq documentation: http://mattmahoney.net/dc/libzpaq.3.html

    Also, zpaq.exe, zpaq (64 bit Linux executable), and zpipe.exe are linked to the new libzpaq 5.00, but the source is not changed. They can now all read level 2, but still save in level 1 format. The exception is when compressing with zpaq using a config file with an empty COMP section (n = 0). In this case, it saves in level 2 format. All of the built-in compression modes are still level 1.

    The reference decoder is derived from the new libzpaq v5.00 but is a stand-alone program with optimizations and unnecessary code removed. To make it as simple as possible, it decompresses to a single output file (or stdout), listing but otherwise ignoring stored filenames. I tested it with g++, Microsoft, Borland, and Mars.

    calgarytest2.zpaq is just calgarytest.zpaq with trans replaced with several blocks compressed in level 2 format, with and without preprocessing. There are both empty and non-empty blocks.

    libzpaq compresses in level 1 format whenever Compressor::startBlock() is given an HCOMP string with n > 0, or level 2 otherwise. The old version would reject n = 0. Another change (so far not used) is the addition of Reader::read() and Writer::write() to read and write blocks of data faster than get() and put() (like fread() and fwrite()). I plan to use this feature in the next version of zpaq when I introduce a fast LZ77 mode.

  20. #290
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    I accidentally left some debugging code in libzpaq.cpp causing it to create a file "obj" (containing the x86 JIT code) in the current directory when compressing or decompressing. I fixed it and updated libzpaq to v5.01 and updated all the executables that used it (zpaq.exe, zpaq, zpipe.exe).

  21. #291
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    I put ZPAQ on sourceforge. https://sourceforge.net/projects/zpaq/

    Not sure which one I want to maintain older versions on. So far sourceforge has only the latest version, but it also has version control tools that I haven't figured out how to use yet.

  22. #292
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    238
    Is there a specific reason why you use g++ 4.6.1 instead of g++ 4.6.2 (Oct 2011)?

    best regards

  23. #293
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    It's just the version on my computer.

  24. #294
    Member
    Join Date
    Nov 2011
    Location
    France
    Posts
    13
    Hi, I might have found a possible bug in the algorithm. The other day, when I was trying to compress text files with zpaq, I immediately saw that the archive was too large to be normal. I did a bit of forensics and found that the problematic files were *.m3u and maybe also *.cue. So I isolated them, and it confirmed my hypothesis:
    97.9% of m3u +2.1% of cue in 4 496 930 bytes
    zpaq -m4 ----------------> 4 020 630 bytes !
    tar + zpaq -m4 -------------> 766 222
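
    For reading these figures and the bpc column in the log that follows: bits per character is 8 times the compressed size divided by the original size, so anything above 8 means the file grew. A tiny C helper (the name bpc is just for illustration) makes the arithmetic explicit.

```c
#include <assert.h>

/* Bits per character: compressed bits divided by original bytes. */
double bpc(long original_bytes, long compressed_bytes) {
  assert(original_bytes > 0);
  return 8.0 * (double)compressed_bytes / (double)original_bytes;
}
```

    For example, bpc(36, 284) is about 63.11, matching the first log line. Over the whole set, bpc(4496930, 4020630) is about 7.15 for separate files and bpc(4496930, 766222) is about 1.36 after tar. Each small playlist in the log grows by roughly 250 bytes, which is consistent with a fixed cost per archived file rather than with poor modeling of the text itself.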

    And here is a short log of the end of the compression (zpaq -m4). I wasn't able to output a log file with the command for unknown reasons, so I will just copy-paste it:

    Compressed: cuem3u\936.m3u 36 -> 284 (63.1111 bpc)
    Compressed: cuem3u\1749.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\2121.m3u 35 -> 283 (64.6857 bpc)
    Compressed: cuem3u\2333.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\2594.m3u 35 -> 281 (64.2286 bpc)
    Compressed: cuem3u\2868.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\33.m3u 35 -> 280 (64.0000 bpc)
    Compressed: cuem3u\6480.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\7676.m3u 35 -> 283 (64.6857 bpc)
    Compressed: cuem3u\7895.m3u 35 -> 283 (64.6857 bpc)
    Compressed: cuem3u\8523.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\8859.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\8536.m3u 35 -> 281 (64.2286 bpc)
    Compressed: cuem3u\8874.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\9060.m3u 35 -> 282 (64.4571 bpc)
    Compressed: cuem3u\1731.m3u 34 -> 281 (66.1176 bpc)
    Compressed: cuem3u\2851.m3u 34 -> 282 (66.3529 bpc)
    Compressed: cuem3u\3358.m3u 34 -> 282 (66.3529 bpc)
    Compressed: cuem3u\3484.m3u 34 -> 281 (66.1176 bpc)
    Compressed: cuem3u\5605.m3u 34 -> 282 (66.3529 bpc)
    Compressed: cuem3u\6701.m3u 34 -> 281 (66.1176 bpc)
    Compressed: cuem3u\8943.m3u 34 -> 281 (66.1176 bpc)
    Compressed: cuem3u\1439.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\2420.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\2655.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\3435.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\4463.m3u 33 -> 279 (67.6364 bpc)
    Compressed: cuem3u\4476.m3u 33 -> 280 (67.8788 bpc)
    Compressed: cuem3u\4488.m3u 33 -> 280 (67.8788 bpc)
    Compressed: cuem3u\4848.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\5154.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\5425.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\5792.m3u 33 -> 280 (67.8788 bpc)
    Compressed: cuem3u\6133.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\8525.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\8862.m3u 33 -> 281 (68.1212 bpc)
    Compressed: cuem3u\9074.m3u 33 -> 280 (67.8788 bpc)
    Compressed: cuem3u\1454.m3u 32 -> 280 (70.0000 bpc)
    Compressed: cuem3u\1695.m3u 32 -> 278 (69.5000 bpc)
    Compressed: cuem3u\2344.m3u 32 -> 281 (70.2500 bpc)
    Compressed: cuem3u\2352.m3u 32 -> 280 (70.0000 bpc)
    Compressed: cuem3u\3025.m3u 32 -> 280 (70.0000 bpc)
    Compressed: cuem3u\4605.m3u 32 -> 280 (70.0000 bpc)
    Compressed: cuem3u\3379.m3u 32 -> 280 (70.0000 bpc)
    Compressed: cuem3u\5980.m3u 32 -> 280 (70.0000 bpc)
    Compressed: cuem3u\6008.m3u 32 -> 279 (69.7500 bpc)
    Compressed: cuem3u\8093.m3u 32 -> 279 (69.7500 bpc)
    Compressed: cuem3u\2360.m3u 31 -> 279 (72.0000 bpc)
    Compressed: cuem3u\2371.m3u 31 -> 279 (72.0000 bpc)
    Compressed: cuem3u\2503.m3u 31 -> 280 (72.2581 bpc)
    Compressed: cuem3u\2586.m3u 31 -> 279 (72.0000 bpc)
    Compressed: cuem3u\2793.m3u 31 -> 279 (72.0000 bpc)
    Compressed: cuem3u\4059.m3u 31 -> 278 (71.7419 bpc)
    Compressed: cuem3u\4473.m3u 31 -> 278 (71.7419 bpc)
    Compressed: cuem3u\7359.m3u 31 -> 279 (72.0000 bpc)
    Compressed: cuem3u\1421.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\1390.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\2321.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\2718.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\2781.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\4522.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\4678.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\4850.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\4700.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\5629.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\6036.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\6134.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\7919.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\8089.m3u 30 -> 277 (73.8667 bpc)
    Compressed: cuem3u\8090.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\8101.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\8423.m3u 30 -> 279 (74.4000 bpc)
    Compressed: cuem3u\8764.m3u 30 -> 280 (74.6667 bpc)
    Compressed: cuem3u\963.m3u 30 -> 278 (74.1333 bpc)
    Compressed: cuem3u\1586.m3u 29 -> 278 (76.6897 bpc)
    Compressed: cuem3u\1587.m3u 29 -> 278 (76.6897 bpc)
    Compressed: cuem3u\2114.m3u 29 -> 278 (76.6897 bpc)
    Compressed: cuem3u\3275.m3u 29 -> 278 (76.6897 bpc)
    Compressed: cuem3u\4872.m3u 29 -> 277 (76.4138 bpc)
    Compressed: cuem3u\5978.m3u 29 -> 278 (76.6897 bpc)
    Compressed: cuem3u\6048.m3u 29 -> 277 (76.4138 bpc)
    Compressed: cuem3u\6344.m3u 29 -> 277 (76.4138 bpc)
    Compressed: cuem3u\7545.m3u 29 -> 278 (76.6897 bpc)
    Compressed: cuem3u\8096.m3u 29 -> 278 (76.6897 bpc)
    Compressed: cuem3u\8353.m3u 29 -> 277 (76.4138 bpc)
    Compressed: cuem3u\1521.m3u 28 -> 276 (78.8571 bpc)
    Compressed: cuem3u\1902.m3u 28 -> 277 (79.1429 bpc)
    Compressed: cuem3u\1706.m3u 28 -> 277 (79.1429 bpc)
    Compressed: cuem3u\2435.m3u 28 -> 277 (79.1429 bpc)
    Compressed: cuem3u\921.m3u 28 -> 277 (79.1429 bpc)
    Compressed: cuem3u\2036.m3u 27 -> 276 (81.7778 bpc)
    Compressed: cuem3u\170.m3u 27 -> 276 (81.7778 bpc)
    Compressed: cuem3u\2193.m3u 27 -> 276 (81.7778 bpc)
    Compressed: cuem3u\3940.m3u 27 -> 276 (81.7778 bpc)
    Compressed: cuem3u\5592.m3u 27 -> 276 (81.7778 bpc)
    Compressed: cuem3u\8181.m3u 27 -> 276 (81.7778 bpc)
    Compressed: cuem3u\84.m3u 27 -> 276 (81.7778 bpc)
    Compressed: cuem3u\2314.m3u 26 -> 275 (84.6154 bpc)
    Compressed: cuem3u\2872.m3u 26 -> 275 (84.6154 bpc)
    Compressed: cuem3u\3155.m3u 26 -> 275 (84.6154 bpc)
    Compressed: cuem3u\9061.m3u 26 -> 274 (84.3077 bpc)
    Compressed: cuem3u\441.m3u 25 -> 272 (87.0400 bpc)
    Compressed: cuem3u\168.m3u 25 -> 274 (87.6800 bpc)
    Compressed: cuem3u\7841.m3u 25 -> 275 (88.0000 bpc)
    Compressed: cuem3u\1746.m3u 24 -> 273 (91.0000 bpc)
    Compressed: cuem3u\4455.m3u 24 -> 273 (91.0000 bpc)
    Compressed: cuem3u\4874.m3u 24 -> 272 (90.6667 bpc)
    Compressed: cuem3u\7500.m3u 24 -> 274 (91.3333 bpc)
    Compressed: cuem3u\4530.m3u 23 -> 272 (94.6087 bpc)
    Compressed: cuem3u\7785.m3u 23 -> 273 (94.9565 bpc)
    Compressed: cuem3u\1743.m3u 22 -> 272 (98.9091 bpc)
    Compressed: cuem3u\2635.m3u 22 -> 272 (98.9091 bpc)
    Compressed: cuem3u\4062.m3u 21 -> 271 (103.2381 bpc)
    Compressed: cuem3u\4402.m3u 21 -> 271 (103.2381 bpc)
    Compressed: cuem3u\2610.m3u 18 -> 268 (119.1111 bpc)
    Compressed: cuem3u\4895.m3u 17 -> 267 (125.6471 bpc)
    Compressed: cuem3u\2289.m3u 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\5602.m3u 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\5729.m3u 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\5984.m3u 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\6492.m3u 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\6829.m3u 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\8218.cue 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\8904.cue 0 -> 252 (2016000000.0000 bpc)
    Compressed: cuem3u\8969.cue 0 -> 252 (2016000000.0000 bpc)

    It actually started at 1.1314 bpc, increased slowly, then rose almost exponentially for the last files.
    The tested files are attached. I hope this helps.
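
    The bpc column in the log above appears to be bits of output per input byte, that is, 8 times the compressed size divided by the original size. A minimal Python sketch to check that reading against a few of the lines; the regular expression and the function names are mine, not part of zpaq:

```python
import re

# Matches log lines such as:
#   Compressed: cuem3u\4476.m3u 33 -> 280 (67.8788 bpc)
LINE = re.compile(r"Compressed: (\S+) (\d+) -> (\d+) \(([\d.]+) bpc\)")

def bpc(original_size, compressed_size):
    """Bits of compressed output per byte of input (assumed definition)."""
    return 8.0 * compressed_size / original_size

def check(line, tolerance=1e-4):
    """Return (name, reported, recomputed, matches) for one log line."""
    name, orig, comp, reported = LINE.search(line).groups()
    orig, comp, reported = int(orig), int(comp), float(reported)
    if orig == 0:
        # Empty input: the ratio is undefined, so there is nothing to recompute.
        return name, reported, None, False
    value = bpc(orig, comp)
    return name, reported, value, abs(value - reported) < tolerance

sample = [
    r"Compressed: cuem3u\4476.m3u 33 -> 280 (67.8788 bpc)",
    r"Compressed: cuem3u\1454.m3u 32 -> 280 (70.0000 bpc)",
    r"Compressed: cuem3u\4895.m3u 17 -> 267 (125.6471 bpc)",
    r"Compressed: cuem3u\2289.m3u 0 -> 252 (2016000000.0000 bpc)",
]
results = [check(line) for line in sample]
for r in results:
    print(r)
```

    Anything above 8 bpc means the output is larger than the input. The huge figure for the empty files is not a real ratio, since there are no input bytes to divide by.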

  25. #295
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    zpaq is adding a couple hundred bytes of overhead for each file. You could try -bs -n to create a solid archive, with no comments, checksums, or locator tags. That will save some space but it still has to store the filenames. You can cd to cuem3u so it won't store the path. But you still might do better with tar when you have lots of tiny files.
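
    To illustrate why bundling tiny files helps, here is a small Python sketch that compares compressing many files one at a time with compressing a single tar of them. It uses the standard-library lzma module as a stand-in compressor, not zpaq, and the file names and contents are made up, so only the direction of the result carries over:

```python
import io
import lzma
import tarfile

# Made-up tiny playlist-like files (hypothetical names and contents).
files = {
    f"{n:04d}.m3u": f"album_{n:04d}_disc1.flac\r\n".encode()
    for n in range(200)
}

# One compressed stream per file: every stream pays its own fixed overhead.
separate = sum(len(lzma.compress(data)) for data in files.values())

# One tar containing all the files, compressed once.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
solid = len(lzma.compress(buf.getvalue()))

raw = sum(len(data) for data in files.values())
print(f"raw bytes:             {raw}")
print(f"compressed separately: {separate}")
print(f"tar, then compressed:  {solid}")
```

    The tar adds its own padded header per file, but those headers are nearly identical and compress well, whereas per-stream headers and checksums are paid in full every time.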

  26. #296
    Member
    Join Date
    Nov 2011
    Location
    France
    Posts
    13
    So you're telling me that it's not a bug at all, and that the only cause of the problem is that zpaq stores a long relative path, so with lots of tiny files the archive ends up much larger than expected?
    Well, at least I'm glad that this is not a bug. Thank you for your answer.

  27. #297
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    Yes. By default, it stores each file in a separate block that includes a header locator tag (13 bytes) and a description of the algorithm (a couple hundred bytes for -m4). Then the block is divided into segments. Each segment stores a filename, the original size as a comment (a few bytes), and a 20-byte SHA-1 checksum. The arithmetic coder adds 4 bytes to mark EOF. The options mentioned can remove much of this overhead, except for the filenames.
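
    As a rough check of that breakdown against the log earlier in the thread, here is a Python sketch that adds up the fixed costs for one file stored in its own block. The 13, 20 and 4 byte figures come from the post above; the 2-byte comment size is a guess at "a few bytes", and the model-description size is inferred from the 252-byte result for an empty file, not taken from the zpaq source:

```python
LOCATOR_TAG = 13   # header locator tag (from the post)
SHA1 = 20          # SHA-1 checksum (from the post)
EOF_MARK = 4       # arithmetic coder end-of-file marker (from the post)
COMMENT = 2        # assumption: "a few bytes" for the size comment

def fixed_overhead(filename, model_description):
    """Bytes spent on one file before any of its data is coded."""
    return (LOCATOR_TAG + model_description + len(filename)
            + COMMENT + SHA1 + EOF_MARK)

# An empty file from the log came out at 252 bytes, all of it overhead.
name = r"cuem3u\2289.m3u"
observed = 252
known = fixed_overhead(name, model_description=0)
model_guess = observed - known   # what is left for the -m4 description

print(f"filename bytes: {len(name)}")
print(f"overhead excluding model description: {known}")
print(f"implied model description size: {model_guess}")
```

    The implied figure of roughly 200 bytes is consistent with "a couple hundred bytes for -m4", though it also absorbs any framing bytes not listed in the post.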

  28. #298
    Member
    Join Date
    Nov 2011
    Location
    France
    Posts
    13
    That's great advice, and it could definitely save a lot of space in some cases. It's not really intuitive, so it may be hard to guess. I think you should put it somewhere in the documentation (in the readme, maybe?).

  29. #299
    Member
    Join Date
    Mar 2012
    Location
    Paris
    Posts
    3
    Matt,

    Would it be possible to create blocks by extension group? I mean compress all .txt files in one block, then all .bmp files in another. I don't like the -bs mode because it doesn't use multithreading, and "one block per file" or "16MB per file" doesn't enable high compression ratios.
    Could you add a partial solid mode that groups files by extension?
    Thanks

  30. #300
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    1,294
    It would be possible, but I haven't implemented it. There are a lot of ways that I could improve compression, but it will take time. Meanwhile you can do something like:

    zpaq -m4 -bs c archive *.txt
    zpaq -m4 -bs a archive *.bmp

    to create 2 blocks. If you want to do this in parallel, run these in separate windows to different archives, then concatenate them when done.
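
    A minimal Python sketch of the bookkeeping around that workflow: group file names by extension, then join separately built archives byte for byte. It does not run zpaq itself, the function and file names are hypothetical, and it relies only on the statement above that the archives can be concatenated:

```python
import os
import shutil
import tempfile
from collections import defaultdict

def group_by_extension(paths):
    """Map lowercase extension (e.g. '.txt') to the paths that have it."""
    groups = defaultdict(list)
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        groups[ext].append(path)
    return dict(groups)

def concatenate(parts, destination):
    """Write each part, in order, into a new file at destination."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

groups = group_by_extension(["a.txt", "b.bmp", "c.TXT", "d.bmp"])
print(groups)

# Demonstration with stand-in files; real use would pass the
# per-extension archives produced by the zpaq commands above.
workdir = tempfile.mkdtemp()
part1 = os.path.join(workdir, "txt.part")
part2 = os.path.join(workdir, "bmp.part")
joined = os.path.join(workdir, "joined.bin")
for path, payload in ((part1, b"first"), (part2, b"second")):
    with open(path, "wb") as f:
        f.write(payload)
concatenate([part1, part2], joined)
with open(joined, "rb") as f:
    joined_bytes = f.read()
print(joined_bytes)
shutil.rmtree(workdir)
```

    Each group would be compressed to its own archive in a separate process, and the results joined in whatever order the blocks should appear.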

    Or better yet, use custom models. bmp_j4.cfg gets very good compression for .bmp files but it will fail in solid mode (-bs). Use:

    zpaq -mbmp_j4 a archive *.bmp
