Thread: Fairytale Archive Structure

  1. #1
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts

    Fairytale Archive Structure

    A proposal for the optimal steps during archiving. As this doesn't exist right now, I call it a fairytale.

    [Attached image: fairytale-.png (diagram of the proposed archiving stages)]

    * only red-colored groups benefit from solid archiving
    * data segmentation means checking whether input data is smaller when encoded or simply stored, and applying the better choice (to be tested on each block of each group, e.g. every few megabytes; see the sketch below)
    * deduplication in the style of ZPAQ, ExDupe...
    * an enhanced match finder such as REP / SREP
    * integrity checks using Adler-32
    * encryption using AES
    * Unicode support

    Data segmentation is the important feature here, and it is not widely used or researched at the moment.
    To my knowledge, only the following programs/algorithms use segmentation: WinRK, BSC, RAR5 (see the unrar source).
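
    A minimal sketch of the per-block "encoded or stored" test, using zlib's compress() as a stand-in encoder (the block size and zlib itself are illustrative assumptions, not part of the proposal):

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    #define BLOCK_SIZE (4u << 20)   /* test granularity: every few megabytes, per the proposal */

    enum block_kind { BLOCK_STORED = 0, BLOCK_ENCODED = 1 };

    /* Encode one block; if the encoded form is not smaller, store it raw.
       The caller must size `out` to at least compressBound(in_len). */
    static enum block_kind
    encode_or_store(const uint8_t *in, size_t in_len,
                    uint8_t *out, size_t *out_len)
    {
        uLongf enc_len = compressBound(in_len);
        if (compress(out, &enc_len, in, in_len) == Z_OK && enc_len < in_len) {
            *out_len = enc_len;      /* encoder won: keep the compressed form */
            return BLOCK_ENCODED;
        }
        memcpy(out, in, in_len);     /* encoder lost: store the block verbatim */
        *out_len = in_len;
        return BLOCK_STORED;
    }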

  2. #2
    Member
    Join Date
    Mar 2010
    Location
    Germany
    Posts
    116
    Thanks
    18
    Thanked 32 Times in 11 Posts
    Quote Originally Posted by Stephan Busch View Post
    A proposal for the optimal steps during archiving. As this doesn't exist right now, I call it a fairytale.
    * only red-colored groups benefit from solid archiving
    * integrity checks using Adler-32
    * encryption using AES
    IMHO:
    - Adler-32 is too weak: too many collisions, and detection of manipulation is not guaranteed (BLAKE2 does a better job imho; a usage sketch follows below)
    - After the NSA disaster, I'm not sure AES is a good solution anymore.
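
    A minimal sketch of the BLAKE2 suggestion above, using the streaming API declared in the BLAKE2 reference implementation's blake2.h (the 256-bit digest size and the hash_block helper are my own illustrative choices):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include "blake2.h"   /* BLAKE2 reference implementation / libb2 */

    /* Hash one buffer with BLAKE2b-256; the streaming API lets large
       archive members be fed in block-sized pieces. */
    static int hash_block(const void *data, size_t len, uint8_t out[32])
    {
        blake2b_state S;
        if (blake2b_init(&S, 32) < 0) return -1;   /* 32-byte (256-bit) digest */
        blake2b_update(&S, data, len);
        return blake2b_final(&S, out, 32);
    }

    int main(void)
    {
        const char msg[] = "fairytale";
        uint8_t digest[32];
        if (hash_block(msg, strlen(msg), digest) == 0) {
            for (int i = 0; i < 32; i++) printf("%02x", digest[i]);
            putchar('\n');
        }
        return 0;
    }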
    Last edited by Biozynotiker; 17th December 2013 at 20:20.

  3. #3
    Programmer
    Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    - data segmentation usually means exactly what you've called grouping
    - it's hard to revert the result of compression once you find that some part of the data is hardly compressible. usually this means you need to keep an older copy of all compression stats. moreover, since we have already sorted out incompressible data, checking compressibility again will usually be useless
    - adler32 is a joke; at the very least you can use CRC32C, which with hardware support runs at about 8 bytes/cycle * 4 cores * 4 GHz = 128 GB/s on modern cpus (see the sketch below)
    - rep/srep is rather close to deduplication. also, you didn't say where it sits in your scheme. inside lzma?
    - unicode, encryption, recovery records, security attributes and so on are just another level - don't mix everything up
    - the same note about AES: since it's actively pushed by american companies like Intel, i cannot rely on it 100%. better to use multiple encryptors
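
    A minimal sketch of hardware CRC32C via the SSE4.2 intrinsics (compile with -msse4.2; the function name is mine). One crc32 instruction handles 8 bytes, which is where the back-of-the-envelope figure above comes from; reaching it in practice needs several independent streams interleaved, because the instruction has multi-cycle latency:

    #include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* CRC32C (Castagnoli) over a buffer, 8 bytes per crc32 instruction. */
    static uint32_t crc32c(const uint8_t *p, size_t n, uint32_t crc)
    {
        crc = ~crc;
        while (n >= 8) {                  /* bulk: one 64-bit word at a time */
            uint64_t w;
            memcpy(&w, p, 8);             /* safe unaligned load */
            crc = (uint32_t)_mm_crc32_u64(crc, w);
            p += 8; n -= 8;
        }
        while (n--)                       /* tail: byte at a time */
            crc = _mm_crc32_u8(crc, *p++);
        return ~crc;
    }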

  4. #4
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    What you described already exists in Pcompress, except for Rzip/Srep-style long-range match finding and Unicode support. The current Pcompress code sorts files by extension and size and then passes them through LibArchive. The output data stream from LibArchive is split at boundaries (segmentation) where file types change, or at a hash-anchor-based boundary if the maximum chunk size is reached (a sketch of the anchor test follows below). This way similar files are grouped together.
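
    A minimal sketch of such a hash-anchor boundary test (content-defined chunking) with a simple additive rolling sum; the window, mask, and ceiling are illustrative, and real implementations typically use a stronger rolling hash such as a Rabin fingerprint:

    #include <stdint.h>
    #include <stddef.h>

    #define WIN   48                 /* rolling window size, illustrative */
    #define MASK  0x1FFF             /* ~8 KB average chunk, illustrative */
    #define MAXCH (64u * 1024)       /* hard ceiling on chunk size */

    /* Return the length of the next chunk: cut where the rolling sum over
       the last WIN bytes hits the anchor pattern, or at MAXCH at the latest. */
    static size_t next_chunk(const uint8_t *p, size_t n)
    {
        uint32_t sum = 0;
        size_t i;
        for (i = 0; i < n && i < MAXCH; i++) {
            sum += p[i];
            if (i >= WIN)
                sum -= p[i - WIN];             /* slide the window */
            if (i >= WIN && (sum & MASK) == MASK)
                return i + 1;                  /* anchor hit: boundary here */
        }
        return i;                              /* end of input or ceiling */
    }

    Because boundaries depend only on local content, an insertion early in a file shifts chunk starts but the anchors re-synchronize afterwards, which is what makes this useful for deduplication.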

    Multiple chunks are compressed in parallel. If compression does not reduce a chunk's size, the chunk is stored instead.
    The following sequence of processing takes place (a dispatch sketch follows the list):
    * Enumerate and sort files (external merge sort)
    * Detect file types based on extension and magic-number analysis
    * Apply PackJPG to JPEGs
    * LibArchive
    * Chunk splitting (segmentation)
    * Deduplication (block size selectable from 2 KB up to 64 KB)
    * Filtering (DisPack for 32-bit EXEs; LZP excluding BMP and TIFF; Delta2 excluding certain file types)
    * Compression (fast LZ4 for already-compressed data; Libbsc for markup, BMP, DNA sequences, MP4, FLAC and AVI; PPMD8 for generic text; LZMA for the rest)
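
    A minimal sketch of that final compression dispatch (the enum and function are hypothetical names for illustration, not Pcompress identifiers):

    /* Hypothetical content classes mirroring the routing described above. */
    enum content_type {
        CT_COMPRESSED,   /* already-compressed data                   */
        CT_MARKUP,       /* markup, BMP, DNA sequence, MP4, FLAC, AVI */
        CT_TEXT,         /* generic text                              */
        CT_OTHER         /* everything else                           */
    };

    enum codec { CODEC_LZ4, CODEC_BSC, CODEC_PPMD8, CODEC_LZMA };

    static enum codec pick_codec(enum content_type t)
    {
        switch (t) {
        case CT_COMPRESSED: return CODEC_LZ4;    /* cheap pass; little to gain */
        case CT_MARKUP:     return CODEC_BSC;
        case CT_TEXT:       return CODEC_PPMD8;
        default:            return CODEC_LZMA;
        }
    }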

    Each chunk gets a checksum (selectable from BLAKE2-512/256 (the default), SHA512, SHA512-256, SKEIN512/256, KECCAK512/256).
    Encryption uses AES or Salsa20.
    An HMAC provides authentication.

    I am looking to add WavPack processing.

  5. #5
    Tester
    Stephan Busch
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    Data segmentation can be seen as splitting files into compressible and incompressible parts and sorting similar parts together to improve compression.
    Grouping is just collecting similar files, where we don't care about the mixed parts inside each file.

    I don't know how RAR5 segmentation (aka the fragmented window) works in detail, but it compresses large installers and ISO files better than all competitors.

    Unicode, encryption and integrity-check variants can of course differ, and everybody may choose his or her best fit.

    Deduplication takes care of the whole data set, while a match finder can't see beyond a specific window size. Each one has its advantages.
    I am impressed by both. Maybe there is a way to use both (on blocks it would make sense); one way to classify blocks first is sketched below.
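
    A minimal sketch of splitting data into compressible and incompressible parts, as described earlier in this post: estimate the byte entropy of each block and compare it to a threshold (the 7.9 bits/byte cutoff is an illustrative assumption):

    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Shannon entropy of a block in bits per byte: close to 8.0 for random
       or already-compressed data, noticeably lower for compressible data. */
    static double block_entropy(const uint8_t *p, size_t n)
    {
        size_t freq[256] = {0};
        for (size_t i = 0; i < n; i++)
            freq[p[i]]++;
        double h = 0.0;
        for (int b = 0; b < 256; b++) {
            if (freq[b] == 0) continue;
            double pr = (double)freq[b] / (double)n;
            h -= pr * log2(pr);
        }
        return h;
    }

    /* Route low-entropy blocks to the compressed group and
       high-entropy blocks to the store-only group. */
    static int is_compressible(const uint8_t *p, size_t n)
    {
        return block_entropy(p, n) < 7.9;
    }

    Byte entropy only sees order-0 statistics, so it misses repeats and higher-order structure; it is a cheap first filter, not a replacement for the encoded-vs-stored test.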

    I cannot test Pcompress because my Linux system is not working at the moment.

  6. #6
    Programmer
    Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Stephan Busch View Post
    Data segmentation can be seen as splitting files into compressible and incompressible parts and sorting similar parts together to improve compression.
    so it should be performed instead of grouping, no?

  7. #7
    Tester
    Stephan Busch
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    That has to be tested... but I think segmentation should improve compression with every archiver/compressor.
    I think my image above might be wrong: segmentation should be placed inside, or instead of, the grouping stage.
    What do you think?

  8. #8
    Programmer
    Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    maybe it's my poor english, but i wrote exactly about that: grouping is a poor man's segmentation. it has its own advantages, but at the very least they should be placed into one box: "grouping/segmentation".
