Results 1 to 18 of 18

Thread: Radix sort match finder

  1. #1
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts

    Radix sort match finder

    I've uploaded a project to github that you guys may be interested in:
    https://github.com/conor42/Radyx

    It's an LZMA2 implementation based on the radix sort match finder that I used in Imp back in the late 90s. It's not a complete sort because the radix bins are not sorted relative to each other. The output is similar to a suffix array + LCP, but because the suffixes aren't sorted in lexicographic order it's not a true suffix array. Using it for dictionary compression is very simple - it's just a lookup table containing the longest match for every position, and the match length. The one significant advantage over BT4 is in parallelization. The memory requirement increases little as more threads are added. More details are in the project readme file.

    Currently there are binaries for Windows x86 and x64. It compiles with g++ 4.9.1 under Ubuntu but I've done very little testing on that platform.
    https://github.com/conor42/Radyx/rel...yx090a_x64.zip
    https://github.com/conor42/Radyx/rel...yx090a_x86.zip

    My computer doesn't have the ram to test >680Mb dictionary size so if anyone tests 2Gb or 4Gb please let me know!

    Conor
    Last edited by Conor; 12th February 2015 at 16:06.

  2. The Following 4 Users Say Thank You to Conor For This Useful Post:

    Bulat Ziganshin (11th February 2015),Cyan (11th February 2015),encode (11th February 2015),Nania Francesco (11th February 2015)

  3. #2
    Member
    Join Date
    May 2009
    Location
    France
    Posts
    95
    Thanks
    13
    Thanked 72 Times in 42 Posts
    Hello,

    Don't know what to test so... Ask me what you would possibly want me to check.
    (7z 9.20, radyx 0.9)

    Code:
    enwik8               Process Time  Global Time  Virt. Mem  Phys. Mem  Compressed size
    radyx a                 116.485 s      6.240 s     295 MB     256 MB       26,887,485
    radyx a -md=1g          154.035 s      8.314 s    7345 MB    1613 MB       25,939,628
    radyx a -md=1g -ma=2    177.373 s      9.310 s    7351 MB    1621 MB       25,288,644
    radyx a -md=2g          170.774 s      9.204 s   14623 MB    2698 MB       25,939,628
    radyx a -md=4g          205.687 s     11.388 s   29179 MB    4857 MB       25,939,628
    7z a -m0=LZMA            74.568 s     49.764 s     195 MB     197 MB       25,895,909
    7z a -m0=LZMA2           84.287 s     38.142 s    1877 MB     461 MB       26,080,291
    
    enwik9               Process Time  Global Time  Virt. Mem  Phys. Mem  Compressed size
    radyx a                1265.339 s     64.553 s     295 MB     259 MB      236,205,819
    radyx a -md=1g         2467.889 s    118.398 s    7345 MB    5929 MB      213,044,387
    radyx a -md=1g -ma=2   2533.986 s    122.655 s    7351 MB    5936 MB      207,194,396
    radyx a -md=2g         2570.958 s    122.302 s   14623 MB    7044 MB      213,044,387
    radyx a -md=4g         2696.087 s    128.388 s   29179 MB    9253 MB      213,044,387
    7z a -m0=LZMA           658.932 s    459.503 s     195 MB     197 MB      227,905,645
    7z a -m0=LZMA2         1280.518 s    109.113 s    3527 MB    2934 MB      230,135,764
    AiZ
    Last edited by AiZ; 11th February 2015 at 19:33. Reason: -ma=2 results added, columns

  4. The Following 2 Users Say Thank You to AiZ For This Useful Post:

    Bulat Ziganshin (11th February 2015),Conor (11th February 2015)

  5. #3
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    Quote Originally Posted by AiZ View Post
    Hello,

    Don't know what to test so... Ask me what you would possibly want me to check.
    Just that it produces an error-free output according to 7z t. You will need to use at least 4Gb of data, and enwik9 is only 953Mb which is why the compressed size is unchanged above 1Gb. For more than 1 file you also need to set the solid unit size above 4Gb, eg. -ms5g

    Thanks,
    Conor

  6. #4
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Z:\>radyx a -ma=2 -md=3gb -mmt=1 -mds=64kb -mfb=254 e:\test.radyx MOVIE.DVD9.PAL.iso
    Radyx [64] 0.9 beta Copyright 2015 Conor McCarthy 2015-02-10
    This software has NO WARRANTY and is released under the
    GNU General Public License: www.gnu.org/licenses/gpl.html

    Found 1 file.
    Adding MOVIE.DVD9.PAL.iso

    Completed after ~45 minutes and used about 22GB of memory. Used 13% of CPU for the most part, but exceeded a few times.

    7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18

    Listing archive: e:\test.radyx

    Type = 7z
    Method = LZMA2 (lzma2:3072m shown in GUI)
    Solid = -
    Blocks = 1
    Physical Size = 6972076308
    Headers Size = 196

    2014-12-18 08:02:46 ....A 7293587456 6972076112 MOVIE.DVD9.PAL.iso
    Tested with 7-zip file manager 9.20 (GUI) and There Are No Errors!

    7z-GUI compression comparizon (Ultra,LZMA2,1024MB,273,solid,3threads) (~40minutes)
    2014-12-18 08:02:46 ....A 7293587456 6967329735 MOVIE.DVD9.PAL.iso

    If I have set the parameters badly, feel free to suggest new syntax.

  7. The Following User Says Thank You to a902cd23 For This Useful Post:

    Conor (11th February 2015)

  8. #5
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    Good to know there were no errors

    Do you have enough ram to do 4Gb? Using the -ar=off switch reduces the requirement. It's ok to use more threads if you like.

  9. #6
    Member
    Join Date
    May 2009
    Location
    France
    Posts
    95
    Thanks
    13
    Thanked 72 Times in 42 Posts
    Hi,

    First results, I've to leave the office now...

    I've tested radyx with the 10GB benchmark set from Matt Mahoney. Unfortunately, first run went bad.
    Code:
    Z:\>timer64 radyx a -md=4g -ma=2 -ms=5g -r -q 10gb.7z 10gb\
    Radyx [64] 0.9 beta  Copyright 2015 Conor McCarthy  2015-02-10
    This software has NO WARRANTY and is released under the
    GNU General Public License: www.gnu.org/licenses/gpl.html
    
    Found 79431 files.
    
    Kernel  Time =  2486.687 =   76%
    User    Time = 23723.228 =  727%
    Process Time = 26209.915 =  803%    Virtual  Memory =  29208 MB
    Global  Time =  3260.898 =  100%    Physical Memory =  29173 MB
    
    Z:\>dir 10gb.7z
     Volume in drive Z has no label.
     Volume Serial Number is 9E91-C1F1
    
     Directory of Z:\
    
    02/11/2015  12:37 PM     3,150,781,491 10gb.7z
                   1 File(s)  3,150,781,491 bytes
                   0 Dir(s)  18,741,440,512 bytes free
    
    Z:\>7z t 10gb.7z
    
    7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
    
    Processing archive: 10gb.7z
    
    
    Exit code: -1073741819
    
    Z:\>
    So, I tested it gradually.
    'radyx a' was Ok (639 seconds, 3,854,783,580 bytes), decompressed and tested with 'zpaq list 10gb -not == -force'.
    'radyx a -md=1g' was Ok (1434 seconds, 3,237,864,427 bytes), decompressed and tested with 'zpaq list 10gb -not == -force'.
    'radyx a -md=2g' was Ok (1855 seconds, 2,990,393,682 bytes), decompressed and tested with 'zpaq list 10gb -not == -force'.
    'radyx a -md=4g' was Ok (2974 seconds, 2,990,131,209 bytes), decompressed and tested with 'zpaq list 10gb -not == -force'.

    Then, I made a tar file from 10gb directory. I compressed it with '-md=4g' (865 seconds, 3,582,670,511 bytes). Testing it was impossible, 7z.exe showed 'Testing 10gb.tar' and after perhaps a minute, I got the dialog '7-Zip Console has stopped working'.

    Also, when adding a folder in archive with 'radyx a <archive name> subdir\', the filenames in archive don't contain the subdir\ prefix (different behaviour than the original 7z command).

    AiZ
    Last edited by AiZ; 11th February 2015 at 23:24. Reason: last line, typo

  10. #7
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    radyx a -ma=2 -md=4gb -mmt=8 -mds=64kb -mfb=254 -ar=off e:\test.radyx Movie.DVD9.PAL.iso
    Radyx [64] 0.9 beta Copyright 2015 Conor McCarthy 2015-02-10
    This software has NO WARRANTY and is released under the
    GNU General Public License: www.gnu.org/licenses/gpl.html

    Found 1 file.
    Adding Movie.DVD9.PAL.iso

    7z l \test.radyx

    7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18

    Listing archive: \test.radyx

    --
    Path = \test.radyx
    Type = 7z
    Method = LZMA2 (lzma2:0m shown in GUI)
    Solid = -
    Blocks = 1
    Physical Size = 6965619557
    Headers Size = 196

    Date Time Attr Size Compressed Name
    ------------------- ----- ------------ ------------ ------------------------
    2014-12-18 08:02:46 ....A 7293587456 6965619361 Movie.DVD9.PAL.iso
    ------------------- ----- ------------ ------------ ------------------------

    E:\Downloads>7z t \test.radyx

    7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18

    Processing archive: \test.radyx

    Testing Movie.DVD9.PAL.iso

    Everything is Ok

    Size: 7293587456
    Compressed: 6965619557

    Time = 26 minutes. Firefox crashed while watching flash-news. Explorer.exe is still unresponsive, needed to kill and restart it after this.
    After a few minutes 99% CPU usage, after written almost 4GB to disk CPU was 0% and disk activity almost 100% but it looked like nothing was done.
    After a long time it continued with compression at 99% until done.
    Memory usage 25.636MB
    7zG test reports There Are No Errors!

    C:\HD-TEMP>7z x E:\test.radyx

    7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18

    Processing archive: E:\test.radyx

    Extracting Jackie.Brown.(1997).DVD9.PAL.[CDC4673E].iso

    Everything is Ok

    Size: 7293587456
    Compressed: 6965619557

    CRC32 checked and OK!
    7z CLI and GUI used 4GB while testing and extracting.

    -ar is in textfile but there is no mention of ON or OFF.

  11. #8
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    Thanks both of you for the tests. It looks like 7-zip has an issue with decompression of a 4Gb dictionary in some cases. The problem with the progress monitor occurs because only the main thread can write compressed data before encoding is complete. The rest must wait for completion. So with 8 threads, 7 large buffers were being written while progress was 99%. I'll try to improve that at some point.

    I made a new release with -ar documentation fixed and there's a 32-bit version which is limited to a 192Mb dictionary.

  12. #9
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Something is wrong, today I watched some clip from facebook and chrome crashed. I tried to restart chrome but it looked like it could not connect to internet.
    This during compression of 2352267 files (438GB) with md=3968m, 4g was not possible due to insufficient memory. Perhaps I should mention that this could be my fault because I have swapfile disabled. Memory is 32GB and ramdisk of 6GB.
    But I have not encountered this with 7zip / zpaq or any other application that requires a lot of memory

  13. #10
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    If you got an insufficient memory error at 4Gb then I'd say 3968Mb with no swapfile could easily cause instability.

  14. #11
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Quote Originally Posted by Conor View Post
    If you got an insufficient memory error at 4Gb then I'd say 3968Mb with no swapfile could easily cause instability.
    I removed my ramdisk and tried 4g again, still crashes. Change to 4m, still crashes. This means that radyx should not be used without swapfile.

    radyx.exe a -ar=off -ma=2 -md=4m -mds=64k -mf=off -mfb=64 -r save.7z * (same 2352267 files)

    This line still causes alot of memory fault in windows, since no other program has done this for me (bcm zpaq 7z rar nz zcm and alot more) I say there is something wrong with memory allocation in radyx.

  15. #12
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    Looks like it has a problem with a large number of files. I'll have to look into that. Has radyx crashed or does it only cause other programs to crash?

  16. #13
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Other program! Crash perhaps isnt the correct description, more like halting/freezing/stuttering. And this did occur with single file also.

    added:
    Currently I am not using my SSD C: so I enabled windows swapfile @ 32GB. Chrome was running fine until radyx started writing 4GB to disk (compressed same 7GB dvd-movie againg). A few seconds later everything become unresponsiv. After 1 or 2 minutes (i didnt looked at the time) windows began redrawing things on screen, really slow here and there. After 1 or 2 minutes later more chrome began to show webpage again, a few moments after that I could continue watch flashvideo. Later I started firefor to write this post and windows is still jerky. Now maybe windows is back to normal again.
    Last edited by a902cd23; 12th February 2015 at 23:04.

  17. #14
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    Thanks for your help with this.
    It could be a problem with large write operations in Windows. The OS allocates ram to handle it which is eventually freed but maybe not right away. AFAIK most compression programs don't write all of the data in one operation. Can you try once more with nul as the archive name? This causes the data to be discarded instead of written to disk. I can also try changing the write operations so they are done in small chunks, and then I can upload the exe for you to try.

    Update: yes Windows does allocate a lot of ram for writing, but writing in chunks doesn't help because it caches it all anyway. I may need to write a custom output stream to turn off caching.
    Last edited by Conor; 13th February 2015 at 04:55.

  18. #15
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Conor View Post
    Thanks for your help with this.
    It could be a problem with large write operations in Windows. The OS allocates ram to handle it which is eventually freed but maybe not right away. AFAIK most compression programs don't write all of the data in one operation. Can you try once more with nul as the archive name? This causes the data to be discarded instead of written to disk. I can also try changing the write operations so they are done in small chunks, and then I can upload the exe for you to try.

    Update: yes Windows does allocate a lot of ram for writing, but writing in chunks doesn't help because it caches it all anyway. I may need to write a custom output stream to turn off caching.
    Are you flushing the output between writes when writing in chunks?

  19. #16
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    I wasn't but just tried it, and Windows seems to go ahead and cache to the max anyway. The bump in ram usage is the same and I still get 7-8Mb/s write activity to the file after the program terminates. I did manage to make the display go blank by trying to open facebook too.

    Turning off caching if free ram is low seems to be the best solution. Apart from that, people will need to avoid using nearly all of the ram. That can cause instability with any program but the big write at the end makes it worse.

  20. #17
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Using NUL as filename did look good until it was time to write "nothing" to disk. Chrome really crashed this time.

    But of course we should use all ram, its microsoft that needs to learn how to make an OS... i dont have x32 but i bet there still is EDLIN.EXE somewhere in windows folder.
    Last edited by a902cd23; 13th February 2015 at 11:22.

  21. #18
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    Funny, it doesn't seem to do any caching writing to NUL (well it shouldn't anyway ) and the ram usage graph is dead flat to the end. Maybe it still runs into some kind of resource limitation.

    Quote Originally Posted by a902cd23 View Post
    But of course we should use all ram, its microsoft that needs to learn how to make an OS... i dont have x32 but i bet there still is EDLIN.EXE somewhere in windows folder.
    Haha, I wouldn't be surprised Even now, VS2013 sees fit to reset the project properties while I'm editing the command line. Drives me nuts

Similar Threads

  1. Duplicate File Finder Engine
    By david_werecat in forum Download Area
    Replies: 10
    Last Post: 10th February 2018, 14:08
  2. Index-Compress-Update: parallel LZ match finder algo
    By Bulat Ziganshin in forum Data Compression
    Replies: 22
    Last Post: 10th January 2012, 20:36
  3. Replies: 7
    Last Post: 19th March 2011, 10:50
  4. A new match searching structure ?
    By Cyan in forum Data Compression
    Replies: 71
    Last Post: 3rd January 2011, 10:19
  5. CM Match model
    By toffer in forum Forum Archive
    Replies: 32
    Last Post: 1st February 2008, 20:26

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •