
Thread: REP, SREP: FreeArc deduplication engines

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    REP, SREP: FreeArc deduplication engines

    I should have created a separate SREP thread many years ago. Finally I'm fixing that omission.

    Please ask (S)REP-related questions here. SREP definitely suffers from a lack of documentation, but the program is so large now that I will write it up incrementally, in the form of a FAQ.


  3. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Q: Can SREP process data from stdin to stdout?

    SREP deduplicates data at distances larger than the available RAM, which makes it impossible to use stdin+stdout on both the compression and the decompression stage. Something has to be sacrificed, and the various SREP modes let you choose the victim.

    -f (Future-LZ) mode is ideal for installers and other compress-once, decompress-many scenarios. It sacrifices stdin on the compression stage: input data are read twice. The first pass builds a list of matches; the second pass inserts each match into the stream at its source position. On decompression, match data are saved to RAM/vmfile when the source position is reached and flushed at the destination position. Decompression therefore needs as much RAM+vmfile as the maximum amount of match data that may need to be stored at any point in the file.
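    As a rough illustration, the Future-LZ decompression loop can be sketched like this. This is a simplified model of the idea, not SREP's actual code: it assumes the match list is sorted by source position, source ranges don't overlap, the input is well-formed, and pending data is kept in RAM only (real SREP can spill it to a vmfile):

        #include <cstdint>
        #include <cstdio>
        #include <map>
        #include <vector>

        struct Match { uint64_t src, dst, len; };   // produced by the compressor

        void decompress(FILE* literals, const std::vector<Match>& matches, FILE* out)
        {
            std::map<uint64_t, std::vector<uint8_t>> pending; // dst -> saved bytes
            uint64_t pos = 0;                                 // current output position
            size_t m = 0;                                     // first still-relevant match

            // emit one byte, saving a copy for every future match that reuses it
            auto emit = [&](uint8_t b) {
                while (m < matches.size() && matches[m].src + matches[m].len <= pos) ++m;
                for (size_t i = m; i < matches.size() && matches[i].src <= pos; ++i)
                    if (pos < matches[i].src + matches[i].len)
                        pending[matches[i].dst].push_back(b);
                fputc(b, out);
                ++pos;
            };

            bool eof = false;
            while (!eof || !pending.empty()) {
                auto it = pending.find(pos);
                if (it != pending.end()) {            // a match is due at this position:
                    std::vector<uint8_t> data = std::move(it->second);
                    pending.erase(it);                // flush it (its bytes may in turn
                    for (uint8_t b : data) emit(b);   // feed even later matches)
                } else {
                    int c = fgetc(literals);
                    if (c == EOF) eof = true;
                    else emit((uint8_t)c);
                }
            }
        }

    The peak size of `pending` here is exactly the "match data stored at any point of the file" mentioned above.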

    The default (index-LZ) mode is ideal for standalone SREP usage as well as quick compression. It sacrifices stdin on the decompression stage: input data are read out of order. It saves the index (the list of matches) at the end of the compressed file, so the decompressor first reads the index, then returns to the start of the file and proceeds exactly like the Future-LZ decompressor.

    -o (IO-LZ) mode is just the old mode from pre-3.0 versions. It sacrifices stdout on decompression. Matches are stored at their destination position, and on decompression their data are literally read back from the already-decompressed file. Decompression doesn't allocate RAM itself, but it seriously trashes the disk cache, making it inferior to index-LZ mode.
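    To make the stream restrictions concrete, here are illustrative command lines (assuming the usual srep calling convention srep [options] infile outfile, with - standing for stdin/stdout as used elsewhere in this thread; exact syntax may vary between versions):

        tar c mydir | srep - archive.srep     index-LZ (default): compressor may read stdin
        srep -f mydir.tar archive.srep        Future-LZ: input must be a real file (read twice)
        srep -o mydir.tar archive.srep        IO-LZ: decompressor will later re-read its own output

    On the decompression side the roles reverse: only Future-LZ can run fully streamed, stdin to stdout, at the cost of RAM/vmfile for pending matches; index-LZ needs a seekable compressed file, and IO-LZ needs a seekable decompressed file.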


  5. #3
    Member
    Join Date
    May 2017
    Location
    Sealand
    Posts
    15
    Thanks
    7
    Thanked 2 Times in 2 Posts
    Does SREP always need large compression block sizes, or can block sizes smaller than 8 MB produce smaller output than larger ones (>8 MB)?

  6. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    The block size has very little influence on the compressed size.


  8. #5
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    33
    Thanks
    18
    Thanked 15 Times in 7 Posts
    - How do I calculate how much memory srep needs during compression with -m0? I thought the dictionary setting controlled that, so I set e.g. -d12g expecting it to use up to ~12 GB max (I don't mind a little variation), but it went all the way past 16 GB. It seems that, just like in the -m3..-m5 modes, parameters like -dl and -dc have the same big impact on memory, not the dictionary?

    - Could you please explain in more depth how to control/calculate the amount of memory needed for compression in -m0 mode, and how it relates to the dictionary -d and to -dc/-dl?
    For example, if I want srep to use at most ~12 GB in -m0 mode during compression with -dl512 -dc512 set, how do I calculate and set the dictionary and/or other parameters to stay within that?

    - What exactly is the "hash size" -dh, and how does it relate to the "hash chunk" -dc? I understand the hash chunk is the minimum size of data that is found (and hashed), so I thought -dh meant the size (in bits) of the hash itself, like CRC32 vs CRC64 vs SHA-256?

    - In -m0 (and -m3f) mode during compression, if I exceed the amount of RAM, does srep continue with a virtual-memory file or fail because of insufficient RAM?

    - Is it OK to disable hashing (-hash-) in -m0 (and -m3f) mode if I use FreeArc (assuming FreeArc itself does a hash check on the overall archive)? So that as long as FA can unpack without error, the data are guaranteed correct?

    - Does -m0 mode also use the accelerator -aXY (the srep default is -a4/4), and is there much point in using it in -m0? Would disabling it save a significant amount of RAM?

    - Is there really much difference between -m0 and -m3f for compression if I use - - <stdin> <stdout>?

    - Is -sBYTES used in -m0 mode?

    - For -m3f..-m5f in - <stdin> mode during compression, what is the point of setting -sBYTES manually when it could be hardcoded to INT64_MAX internally? srep does not allocate all the needed RAM immediately anyway: the default is 25 GB, but there is no error if the input data is smaller (for example 4 GB), and it seems to allocate memory on the go as needed, depending on the data. Considering that an -sBYTES smaller than the data itself would result in an error or crash anyway, there seems little reason not to set it to the maximum possible value, even if the input data is much smaller?

    - How exactly do -c and -l cooperate? If -l is the minimum match length, that is the minimum size of duplicate data to be found, right? So why is -c needed in addition: is it used to search within -l-sized matches? I thought the dedup process was simply: a first match "ABCD" is found, then a second identical "ABCD", so the second one is replaced with just a reference to the first. Why is the further -c chunking needed (to search within -l)? (See the sketch after this list for one generic way such a chunk size and minimum match length can work together.)

    Thank you.
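    On that last question, here is the usual reason an engine has both knobs. This is a general illustration of chunk-based match finding, not necessarily SREP's exact scheme: hashing every possible l-byte match at every offset would be too expensive, so the engine indexes only aligned c-byte chunks of the data already seen, scans the incoming data at every byte offset, and on a hash hit verifies and extends the match, keeping it only if it reaches at least l bytes. Any duplicate of length >= 2c-1 must fully contain an aligned chunk, so c can be much smaller than l without missing matches, while the index stays about 1/c-th the size of the data:

        #include <cstddef>
        #include <string>
        #include <string_view>
        #include <unordered_map>
        #include <vector>

        struct Match { size_t src, dst, len; };

        std::vector<Match> find_matches(const std::string& data, size_t c, size_t l)
        {
            std::hash<std::string_view> h;
            std::string_view sv(data);
            std::unordered_map<size_t, size_t> index;  // chunk hash -> source position
            std::vector<Match> out;
            size_t next_chunk = 0;                     // next aligned chunk to index

            for (size_t pos = 0; pos + c <= data.size(); ) {
                while (next_chunk + c <= pos) {        // index chunks we've passed
                    index.emplace(h(sv.substr(next_chunk, c)), next_chunk);
                    next_chunk += c;
                }
                auto it = index.find(h(sv.substr(pos, c)));
                if (it != index.end() && sv.substr(it->second, c) == sv.substr(pos, c)) {
                    size_t s = it->second, d = pos;    // verified hit: extend it
                    size_t len = c;
                    while (s > 0 && data[s - 1] == data[d - 1]) { --s; --d; ++len; }
                    while (d + len < data.size() && data[s + len] == data[d + len]) ++len;
                    if (len >= l) {                    // long enough to be worth a reference
                        out.push_back({s, d, len});
                        pos = d + len;                 // resume scanning after the match
                        continue;
                    }
                }
                ++pos;   // real engines use a rolling hash here instead of re-hashing
            }
            return out;
        }

    In this picture -c trades index RAM against the shortest duplicates that can be detected, while -l decides which verified matches are long enough to be worth emitting as references.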

  9. #6
    Member
    Join Date
    Mar 2018
    Location
    sun
    Posts
    33
    Thanks
    18
    Thanked 15 Times in 7 Posts
    [Attached image: srep.png]

    Btw lol, looks like another underrated parameter most people won't touch ^

