
Thread: RH4 - Solid multifile compressor

  1. #1
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts

    RH4 - Solid multifile compressor

    I rewrote RH2 to compress and decompress faster, and to use much less memory (still a 16 MB dictionary). I've also added the ability to pack and unpack multiple files and directories into a solid archive. Appending isn't supported. x86 and x64 builds produce compatible archives.

    Example:
    RH4 c archive single_file.txt *.txt *.log dir dir2\subdir

    RH4 d archive unpacked
    unpacked is created as a file if the archive contains one file.
    unpacked is created as a folder if the archive contains multiple files.

    Edit: Fixed bugs in paths and matching; it is OK now.
    Attached Files
    Last edited by cade; 21st March 2014 at 01:49. Reason: Updated attachments to new file format

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    How do I get RH4 to restore the directory hierarchy? I am trying to test on the 10GB benchmark, but all the files are restored in one directory.

  3. #3
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Add a directory instead of individual files.

    RH4 c out *.txt dir dirxyz/*.txt
    RH4 d out unpacked

    unpacked/ now has all the text files from the root and from dirxyz, while dir keeps its original structure.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I was testing in Ubuntu with Wine 1.6 like this:

    time RH4_x64.exe c1 usb/10gb.rh4 10gb
    time RH4_x64.exe d usb/10gb.rh4 tmp

    usb is a link to an external drive.

  5. #5
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Sorry, I broke how paths are recreated; it's definitely a silly mistake somewhere. I will fix it and re-upload with a notification post.

  6. #6
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Double post. Oops.
    Last edited by cade; 6th March 2014 at 20:09. Reason: Double post because the previous didn't show up, sorry.

  7. #7
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Both builds are incompatible with WinXP x64 SP2.
    The procedure entry point InitOnceExecuteOnce could not be located in the dynamic link library KERNEL32.dll

  8. #8
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Quote Originally Posted by Matt Mahoney View Post
    How do I get RH4 to restore the directory hierarchy? I am trying to test on the 10GB benchmark, but all the files are restored in one directory.
    With the silly mistake that I have now, you have to do dir/*, not just dir. For example,
    RH4 c out dir/*
    I will fix this in the next build.

    Quote Originally Posted by Skymmer View Post
    Both builds are incompatible with WinXP x64 SP2.
    The procedure entry point InitOnceExecuteOnce could not be located in the dynamic link library KERNEL32.dll
    In the next build, I will add an x86 version compiled with g++. I currently use MSVC 2012.

  9. #9
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    OK, I have fixed the mistake in stored paths and uploaded an x86 version compiled with gcc in the first post.

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Tested RH4_x64.exe c1 10gb in Linux under Wine 1.6. 76140 of 83437 extracted files differ, although the sizes are correct. It looks like most of them have long strings from the wrong files. Compressed size is 4557377678 bytes if it matters. I didn't test the other compression levels.

    Edit: tested c2 10gb, which also fails, and c1 1gb and c2 1gb, which also fail. However, all compression levels work on the 100mb set. All sets can be downloaded from http://mattmahoney.net/dc/10gb.html
    Last edited by Matt Mahoney; 7th March 2014 at 04:48.

  11. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    cade (7th March 2014)

  12. #11
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Strange. Could you please try with a total size under 4 GB? I've compressed several source trees (all <1 GB), so I'm guessing I have an overflow somewhere.

    Edit: Something is wrong with the matcher, I'll keep looking.
    Last edited by cade; 7th March 2014 at 06:27.

  13. #12
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Note to self: stop matching when the window ends, not just at the maximum match length.

    I have fixed it and updated the first post.
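
    For illustration only, here is a minimal sketch of the general idea behind that fix, not RH4's actual code: the match length must be clamped to the bytes remaining in the window as well as to the maximum match length, otherwise the matcher can read past the end of the buffer. All names here are hypothetical.
    Code:
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    static const std::size_t MAX_MATCH_LEN = 2047;  // maximum match length (illustrative)

    std::size_t match_length(const std::uint8_t* buf, std::size_t window_end,
                             std::size_t pos, std::size_t match_pos)
    {
        // Clamp by the bytes left in the window, not only by MAX_MATCH_LEN.
        std::size_t limit = std::min(MAX_MATCH_LEN, window_end - pos);
        std::size_t len = 0;
        while (len < limit && buf[match_pos + len] == buf[pos + len])
            ++len;
        return len;
    }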

  14. #13
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Works now. Tested on system 4. http://mattmahoney.net/dc/10gb.html

  15. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    cade (8th March 2014)

  16. #14
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Some small updates: cleaner size formatting, the l command lists packed files, x86_gcc is slightly faster, and c5/c6 now use forward optimal parsing (much slower, slightly better compression). Decompression uses the same model as before.

    If people are interested, I will probably try to turn this proof-of-concept into an open-source command-line archive compressor with more features (e.g., filters).
    Attached Files

  17. #15
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Updates:
    Speed improvements for c1 to c3; decompression is also faster now.
    Compressing with the -sfx flag creates a self-extracting exe.
    Slightly less memory is used for compression (still 16 MB non-overlapping windows). enwik8 is 25,740 KB with a 128 MB window + ring binary tree search, and 24,193 KB with a simple order-0 CM, but that isn't my objective (too much memory).
    Multiple files are packed with their duplicates together.
    Maximum match length is 2047.
    Attached Files

  18. #16
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Another update:
    Duplicate files are detected.
    Decompression is faster now; x86 is ~25% faster than before.
    x86 version works on XP.

    The format is slightly different now because I re-did IO, so it's not compatible with the old versions.
    Attached Files

  19. #17
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Quote Originally Posted by cade View Post
    Another update:
    Duplicate files are detected.
    Decompression is faster now; x86 is ~25% faster than before.
    x86 version works on XP.

    The format is slightly different now because I re-did IO, so it's not compatible with the old versions.
    Will test later...

  20. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    What techniques for duplicate file detection & removal are used? Is it safe?

  21. #19
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    I sort the files by size at input time. I go through the list, and when two neighbours have the same size, I open both and compare them by reading blocks; this continues while the following neighbours also have the same size. Testing hashes first (then comparing contents to guard against collisions) would be smarter, but I have to read the file to hash it anyway, so this isn't a terrible method for me.
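
    A minimal C++ sketch of that scheme, assuming the details above (sort by size, then compare equal-size neighbours block by block); the structure, names, and 64 KB block size are illustrative, not RH4's actual code:
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    // Compare two files block by block; true only if their contents are identical.
    static bool same_contents(const std::string& a, const std::string& b)
    {
        std::FILE* fa = std::fopen(a.c_str(), "rb");
        std::FILE* fb = std::fopen(b.c_str(), "rb");
        bool same = (fa != 0 && fb != 0);
        char ba[1 << 16], bb[1 << 16];              // 64 KB blocks (arbitrary)
        while (same) {
            std::size_t na = std::fread(ba, 1, sizeof(ba), fa);
            std::size_t nb = std::fread(bb, 1, sizeof(bb), fb);
            if (na != nb || std::memcmp(ba, bb, na) != 0) same = false;
            if (na == 0) break;                     // both reached EOF together
        }
        if (fa) std::fclose(fa);
        if (fb) std::fclose(fb);
        return same;
    }

    struct Entry { std::string path; std::uint64_t size; int dup_of; };

    // Sort by size, then compare runs of equal-size neighbours pairwise.
    void find_duplicates(std::vector<Entry>& files)
    {
        std::sort(files.begin(), files.end(),
                  [](const Entry& x, const Entry& y) { return x.size < y.size; });
        for (std::size_t i = 0; i < files.size(); ++i) {
            files[i].dup_of = -1;
            if (i > 0 && files[i].size == files[i - 1].size &&
                same_contents(files[i].path, files[i - 1].path))
                files[i].dup_of = (int)(i - 1);     // duplicate of its neighbour
        }
    }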

    I do solid packing in a stream; it's a replacement for fread and fwrite. If a file is flagged "neighbour is duplicate", then whenever a block is written to it, the block is also written to all of its duplicates. Example with files whose contents are A, B and C, where B has two duplicate copies:
    Write A
    Write B -> copy B -> copy B
    Write C

    This way, the total output is ABBBC, but the amount actually processed is only ABC.
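
    A small sketch of that write fan-out, assuming the stream layer described above; OutFile, dup_handles, and write_block are hypothetical names, not RH4's real interface:
    Code:
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // One output file plus open handles for its known duplicate copies.
    struct OutFile {
        std::FILE* f;
        std::vector<std::FILE*> dup_handles;
    };

    // fwrite replacement in the unpacking stream: every block written to the
    // primary file is also written to each duplicate, so duplicate contents
    // never have to pass through the decompressor again.
    std::size_t write_block(OutFile& out, const char* buf, std::size_t n)
    {
        std::size_t written = std::fwrite(buf, 1, n, out.f);
        for (std::size_t i = 0; i < out.dup_handles.size(); ++i)
            std::fwrite(buf, 1, n, out.dup_handles[i]);
        return written;
    }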

    Sorry, I don't understand what you mean by safe.

  22. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    E.g. comparing files by CRC32 would be unsafe; I just want to help by checking your scheme.

  23. #21
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Certainly; I thought hash collisions would make it unreliable. Since I have to read the whole file anyway, this was easier.

  24. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Hash comparison is reliable when you have 1) a cryptographically strong hash and 2) enough bits to cope with the birthday paradox. RAM/HDD fails on the order of one in 1e10..1e20 bits, so with 1e40 ~= 2**128, i.e. a 128-bit cryptographic hash, you are fine against inadvertent collisions.
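
    As a back-of-the-envelope check of that claim (the numbers here are illustrative, not from the thread), the usual birthday approximation p ~= n*(n-1)/2^(b+1) gives a vanishingly small collision probability for a 128-bit hash even over millions of files:
    Code:
    #include <cmath>
    #include <cstdio>

    // Approximate birthday-paradox collision probability for n random b-bit
    // hashes: p ~= n*(n-1) / 2^(b+1). With b = 128 and n = 10 million files,
    // p is about 1.5e-25, far below typical RAM/HDD error rates.
    int main()
    {
        double n = 1e7;      // number of files (illustrative)
        double b = 128.0;    // hash size in bits
        double p = n * (n - 1.0) / std::pow(2.0, b + 1.0);
        std::printf("approximate collision probability: %.3g\n", p);
        return 0;
    }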

  25. #23
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Testing on 10GB benchmark.
    1. Files with 0 bytes (and also empty directories) are not restored.
    2. Most of the compression time seems to be disk access (comparing files?) rather than CPU.
    More later.

  26. #24
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    I don't store directories, only file paths. 0-byte files are considered "done" before they are ever written; I'll take a look at this again. I'm limited by IO when checking for duplicates or compressing redundant data. Thanks for testing.

  27. #25
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Test results on system 4. http://mattmahoney.net/dc/10gb.html
    I tested c2 and c5. Compression is much improved but slower. It might be faster to use a secure hash instead of comparing files directly. There is public domain code for SHA-1 and SHA-256 in libzpaq.
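
    For reference, a sketch of how a whole-file hash could be computed with libzpaq's SHA1 class. This is based on my recollection of the public libzpaq interface (put() one byte at a time, result() returning a 20-byte digest, and a user-supplied libzpaq::error handler), so treat the exact calls as assumptions and check them against libzpaq.h:
    Code:
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include "libzpaq.h"   // public domain; provides libzpaq::SHA1

    // libzpaq expects the caller to define this error handler.
    void libzpaq::error(const char* msg)
    {
        std::fprintf(stderr, "libzpaq error: %s\n", msg);
        std::exit(1);
    }

    // Hash a whole file and return the 20-byte SHA-1 digest (empty on failure).
    std::string sha1_of_file(const char* path)
    {
        libzpaq::SHA1 sha;
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return std::string();
        int c;
        while ((c = std::getc(f)) != EOF)
            sha.put(c);                        // feed one byte at a time
        std::fclose(f);
        return std::string(sha.result(), 20);  // result() returns 20 bytes
    }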

  28. #26
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    It is a good idea, but I still have to read the file to hash it. In the current system, a pair (or more) of files with the same size are read together once.

  29. #27
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    323
    Thanks
    174
    Thanked 51 Times in 37 Posts
    Simple test comparing RH4 with RAR 2.60 (currently my best low-resource archiver).

    COMPUTER SPECS
    Intel i7-3770K
    16GB RAM
    Windows XP SP3 (32-bit)

    SFC DIRECTORY (53,136,054 bytes)
    Code:
    A10.jpg
    AcroRd32.exe
    english.dic
    FlashMX.pdf
    FP.LOG
    MSO97.DLL
    ohs.doc
    rafale.bmp
    vcfiu.hlp
    world95.txt
    Time was taken with ptime.

    PROG - SIZE (bytes) - COMP TIME (s) - DECOMP TIME (s)
    RH4 (c6) - 13,949,609 - 16.055 - 0.875
    RAR (-s -m5 -mm -md1024) - 13,890,337 - 9.635 - 0.371

    Not bad cade!

  30. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    If you have n = hundreds of files of the same size, do you read them all in parallel and make n*(n-1) comparisons? Or do you read and compare them pair-wise, reading each file up to n times?

    The 10GB benchmark has lots of identical files and lots more that are the same size. I noticed that time to compress with c2 went from 1144 to 2941 seconds and c6 went from 1665 to 4919 seconds. I am testing RH4_x64.exe under Wine 1.6 in Ubuntu. I watch the CPU in top and I noticed that c2 runs at about 58% CPU for a few minutes, then drops to 2-3% for most of the time. On the system I am testing, libzpaq::SHA1 adds about 6 seconds per GB, but overall time should get faster because there is less data to compress.

  31. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    cade (22nd March 2014)

  32. #29
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Sorry, you're right: I thought I was doing n*(n-1) tests in parallel, but I didn't finish that and left pairwise comparisons in. The worst case is reading 2n files (A-B, B-C, C-D, ...). Even then, the difference in speed shouldn't be so large. I will take a look at your SHA1 implementation, thanks for the tip. I have CRC32 implemented, but I disabled it because it was slowing down packing.
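
    One possible way to avoid the chained reads (a sketch under stated assumptions, not cade's plan): read each file exactly once, hash it, and group files by (size, digest). The sha1_of_file helper is assumed to exist (see the libzpaq sketch earlier in the thread); all other names are hypothetical.
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    std::string sha1_of_file(const char* path);   // assumed helper, returns raw digest

    struct FileEntry { std::string path; std::uint64_t size; int dup_of; };

    // Each file is read once; files sharing both size and digest are duplicates.
    void find_duplicates_by_hash(std::vector<FileEntry>& files)
    {
        typedef std::pair<std::uint64_t, std::string> Key;
        std::map<Key, std::size_t> seen;           // (size, digest) -> first index
        for (std::size_t i = 0; i < files.size(); ++i) {
            Key key(files[i].size, sha1_of_file(files[i].path.c_str()));
            std::map<Key, std::size_t>::iterator it = seen.find(key);
            if (it == seen.end()) { seen[key] = i; files[i].dup_of = -1; }
            else files[i].dup_of = (int)it->second;  // duplicate of an earlier file
        }
    }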

    @comp1:
    I don't use any filters. c4/c5 are much faster than c6, with a small difference in ratio. RAR can find 2- and 3-byte matches, but I test fewer matches (at c1-c3) and don't spend as much space on match offsets.
    Last edited by cade; 22nd March 2014 at 04:00.

  33. #30
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    SHA is pretty slow on CPUs without dedicated SHA instructions (only 64-bit ARM and the future Intel Skylake have them). The fastest cryptographically strong hashing I know of is the VMAC algorithm employed by SREP; VMAC-128 on x64 handles 10 GB/s per core on an i7-2600.

    CRC32C is even faster, running at 32 GB/s per core, but CRC-based hashes aren't cryptographically strong.
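
    The CRC32C speed comes from the dedicated CRC32 instruction in SSE 4.2. Below is a minimal sketch using the _mm_crc32_u64/_mm_crc32_u8 intrinsics (requires an SSE 4.2 CPU; compile with -msse4.2 on gcc). A simple loop like this will not reach the 32 GB/s figure quoted above, which needs several interleaved CRC streams; it only illustrates the instruction, and it is not a substitute for a cryptographic hash:
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <nmmintrin.h>   // SSE 4.2 intrinsics

    // CRC32C over a buffer, 8 bytes at a time, falling back to bytes at the tail.
    std::uint32_t crc32c(const std::uint8_t* data, std::size_t len)
    {
        std::uint32_t crc = 0xFFFFFFFFu;
        std::size_t i = 0;
        for (; i + 8 <= len; i += 8) {
            std::uint64_t chunk;
            std::memcpy(&chunk, data + i, 8);            // unaligned-safe load
            crc = (std::uint32_t)_mm_crc32_u64(crc, chunk);
        }
        for (; i < len; ++i)
            crc = _mm_crc32_u8(crc, data[i]);
        return ~crc;
    }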

  34. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    cade (22nd March 2014)
