Thread: Fairytale

  1. #1
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts

    Fairytale

    The code for the Fairytale community archiver is now available in the official GitLab repository.

    It includes a very rough prototype that allows for testing of the pre-processing library.
    If anyone wishes to contribute, feel free to join our Gitter chat for more information.

    Best regards
    Last edited by mpais; 11th June 2018 at 20:02.

  2. The Following 21 Users Say Thank You to mpais For This Useful Post:

    78372 (23rd March 2018),Christoph Diegelmann (28th March 2018),dado023 (19th March 2018),Darek (18th March 2018),encode (20th May 2018),Gonzalo (18th March 2018),inikep (18th March 2018),kaitz (18th March 2018),kassane (8th April 2018),load (18th March 2018),Mauro Vezzosi (18th March 2018),Mike (18th March 2018),necros (23rd March 2018),RamiroCruzo (18th March 2018),Razor12911 (27th May 2018),Simorq (23rd March 2018),slipstream (17th May 2018),snowcat (20th March 2018),Stephan Busch (18th March 2018),Zonder (20th March 2018),ZuLeweiner (18th March 2018)

  3. #2
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Quick update: Now we have confirmed working binaries for:

    * ARM Linux 32 bits
    * x86-64 Windows MSVC and GCC
    * x86-64 Linux GCC and clang
    * OSX clang

    So now every major platform is supported, and compiling is as easy as typing "make" or "make -f Makefile.your_platform". Use -j8 on multi-core machines to compile faster.

  4. The Following 4 Users Say Thank You to Gonzalo For This Useful Post:

    78372 (23rd March 2018),dado023 (19th March 2018),Simorq (23rd March 2018),SolidComp (20th March 2018)

  5. #3
    Member
    Join Date
    Feb 2017
    Location
    none
    Posts
    17
    Thanks
    2
    Thanked 10 Times in 5 Posts
    I tried a few examples and it works fast. What about adding these "compressors" to the detection, like zlib/gzip/deflate?

    - lzo
    - lzx (xnb/pkg, as used in the games Transistor and Bastion)
    - lzma
    - lz4

    I don't know if you already know these tools:
    AFR (Anvil Forge Recompressor, for Ubisoft games)
    http://krinkels.org/threads/afr-anvi...mpressor.3589/

    Unreal Package Extractor
    http://krinkels.org/resources/unreal...or.171/history

    Punity game file precompressor (I know Razor12911 is on this forum too)
    http://fileforums.com/showthread.php?t=101557

  6. #4
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Quote Originally Posted by redrabbit View Post
    I tried a few examples and it works fast. What about adding these "compressors" to the detection, like zlib/gzip/deflate?

    - lzo
    - lzx (xnb/pkg, as used in the games Transistor and Bastion)
    - lzma
    - lz4

    I don't know if you already know these tools:
    AFR (Anvil Forge Recompressor, for Ubisoft games)
    http://krinkels.org/threads/afr-anvi...mpressor.3589/

    Unreal Package Extractor
    http://krinkels.org/resources/unreal...or.171/history

    Punity game file precompressor (I know Razor12911 is on this forum too)
    http://fileforums.com/showthread.php?t=101557
    I really like the concept of recompression. But it's not a trivial task. Are those tools you mentioned open-sourced? If so, maybe their authors could make a little extra effort and port them to the infrastructure of Fairytale. Márcio Pais has done great work keeping it neat and easy to modify. Right now, Fairytale doesn't support external compressors, and I doubt it ever will.

    This is a field with endless possibilities. The quickbms source alone contains enough algorithms to recompress practically 100% of all the games in the world and many other file types. But! Work. A whole lot of it.

  7. #5
    Member
    Join Date
    Apr 2017
    Location
    Bangladesh
    Posts
    9
    Thanks
    50
    Thanked 2 Times in 2 Posts
    Stuff like pZlib, plz4, pOodle, pzstd and plzo is really fast, but the problem is that they aren't universal (except pZlib) and they are specially optimized for games only.

  8. The Following User Says Thank You to 78372 For This Useful Post:

    Gonzalo (23rd March 2018)

  9. #6
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Here is our current draft file format proposal.

    As always, your criticism and suggestions are highly valued.

  10. The Following 3 Users Say Thank You to mpais For This Useful Post:

    load (8th April 2018),Mike (8th April 2018),Zonder (8th April 2018)

  11. #7
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by Gonzalo View Post
    I really like the concept of recompression.
    https://github.com/google/grittibanzli is a new deflate/gzip re-compressor.

  12. #8
    Member
    Join Date
    May 2008
    Location
    Kuwait
    Posts
    301
    Thanks
    26
    Thanked 22 Times in 15 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    https://github.com/google/grittibanzli is a new deflate/gzip re-compressor.
    Do you have a compiled Windows binary?

  13. #9
    Member cfeck's Avatar
    Join Date
    Jan 2012
    Location
    Germany
    Posts
    50
    Thanks
    0
    Thanked 17 Times in 9 Posts
    Quote Originally Posted by mpais View Post
    Here is our current draft file format proposal. As always, your criticism and suggestions are highly valued.
    Is the CRC32 the CRC32C variant? It's faster on newer processors because of the dedicated x86 instruction.

  14. #10
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    @Jyrki Alakuijala
    Thanks, we'll look into it.
    There are currently a few options for DEFLATE recompression: preflate, difflate (unreleased, by Christian Schneider) and reflate (closed-source, by Shelwien).
    Do you have any data on how this new recompressor fares against reflate?

    @cfeck
    Currently no, but it's just a matter of changing the LUT. Optimized implementations will come later.
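
    For illustration, a minimal sketch of what "changing the LUT" means here (this is not the actual Fairytale checksum code): a table-driven CRC-32 and CRC-32C differ only in the reflected polynomial used to build the table. A later optimized version could use the SSE4.2 crc32 instruction for the Castagnoli variant.
    Code:
    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Build the byte-wise lookup table for a reflected CRC-32 polynomial.
    std::array<uint32_t, 256> makeCrcTable(uint32_t poly) {
        std::array<uint32_t, 256> table{};
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t crc = i;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ ((crc & 1) ? poly : 0);
            table[i] = crc;
        }
        return table;
    }

    // Same driver loop for both variants; only the table changes.
    uint32_t crc32(const uint8_t* data, size_t len, const std::array<uint32_t, 256>& table) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++)
            crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFF];
        return crc ^ 0xFFFFFFFFu;
    }

    // CRC-32  (zlib/PNG):           makeCrcTable(0xEDB88320u)
    // CRC-32C (Castagnoli, SSE4.2): makeCrcTable(0x82F63B78u)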

  15. #11
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by mpais View Post
    Do you have any data on how this new recompressor fares against reflate?
    Grittibanzli -- we didn't benchmark it, and it is more of a quick hack than a well-thought-out, optimized system.

    It is likely particularly bad on unoptimized PNGs and other non-predicted, slowly varying signals, and okayish on text. You will likely get a 20% density increase over gzip, even on small files, and the decoding speed should be better than with other approaches.

    We usually develop on Ubuntu-like computers, and it is not easy for us to produce high-quality Windows releases. Boot into Linux from a live DVD and you can build grittibanzli relatively easily.

  16. #12
    Member
    Join Date
    May 2008
    Location
    France
    Posts
    78
    Thanks
    436
    Thanked 22 Times in 17 Posts
    Quote Originally Posted by maadjordan View Post
    Do you have a compiled Windows binary?
    Shelwien's compile:
    http://nishi.dreamhosters.com/u/grittibanzli.exe

    Another bread: Grittibänz

  17. #13
    Member
    Join Date
    Sep 2017
    Location
    Berlin, Germany
    Posts
    20
    Thanks
    7
    Thanked 15 Times in 5 Posts
    Quote Originally Posted by mpais View Post
    Here is our current draft file format proposal.

    As always, your criticism and suggestions are highly valued.
    I am probably a little too tired right now, but maybe I can still write some useful comments:

    The ordering File header, Compressed data, Metadata probably makes compression easy, but isn't decompression maybe more important? What about partially downloaded files or streaming decompression? Maybe even Header, metadata part 1, compressed data part 1, metadata part 2, compressed data part 2 could be an option?
    Edit:
    This could also be useful to allow adding more files after the initial compression.

    How about additional recovery information for damaged archives, or at least some reserved fields for later addition?

    It looks like this format is not strictly optimized for size.

    I saw a file header somewhere that contained a URL as one of the first parts; maybe this could be useful (for a small audience).

    In the appendix there is only the long name of VLI, not the abbreviation. This is bad for searching the PDF.

    VLI: in the case of a 64-bit number, the last end-of-number flag/bit needs to be implicit, otherwise it won't fit.

    Table of contents?

    I don't quite get where the checksums actually are. Maybe you can add a full example of a compressed file in a nice tree view.
    Edit:
    Somewhat like
    Code:
    File header
            Size VLI
            ...
    Compressed data
            ...
    Concerning compression formats: I am not sure if the following recompression use case is currently possible: extract two differently compressed files and after that solidly compress both together. Except maybe by specifying a compression format that is able to (de-)compress both of the different file formats.

    And another thing that maybe is a little beyond the scope of this document:
    I am not sure how you are planning to use different blocks, maybe for deduplication?

    Consider this case: you have a match finder that only removes long matches but otherwise another compression method is working on the file. Imagine two partially redundant wav-files. The first one is compressed normally, the second one will have some parts missing so the wav-model can probably not work properly directly behind the "missing" parts. So how about a way to inform the wav model, that there are some missing parts or even inform it which data exactly are missing so it can use them as context but it knows that it doesn't actually need to store them. Maybe this is more realistic with multiple versions of text files or something. Will probably be complicated, though.

    Edit:
    what about multi-part archives?

    Edit:
    Efficient storage of identical files consisting of multiple blocks?
    Efficient storage of files of size 0 (number of blocks implicitly = 0)
    Better specification of file metadata (attributes/access rights/hardlink/softlink/whatever)
    Last edited by Urist McComp; 15th May 2018 at 01:38. Reason: typo, indenting, some additions

  18. #14
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Thanks for your interest Urist.

    Quote Originally Posted by Urist McComp View Post
    The ordering File header, Compressed data, Metadata probably makes compression easy, but isn't decompression maybe more important? What about partially downloaded files or streaming decompression? Maybe even Header, metadata part 1, compressed data part 1, metadata part 2, compressed data part 2 could be an option?
    Edit:
    This could also be useful to allow adding more files after the initial compression.
    Yes, this order is chosen to make compression easier, since the size and composition of the structures can only be determined after the compression stage, so this way we avoid having to use a temp file to store the compressed data, which would be needed since we'd have to first write those structures and only then copy the actual compressed data to the output file. And since compression relies on the complex pre-processing stage that makes heavy use of temporary storage itself, it seems best to avoid having 2 different components battling it out for temp files. It's something that is still in limbo right now.

    Quote Originally Posted by Urist McComp View Post
    How about additional recovery information for damaged archives, or at least some reserved fields for later addition?
    Data is compressed in chunks of blocks of the same type. A chunk may encode a single block (non-solid mode) or all the existing blocks of that type (full solid mode), or any other configuration in this interval (quasi-solid mode). Each chunk info sequence contains a metadata tag list, so defining a tag to represent recovery information handles that option. If no metadata is specified, a single tag with id set to 0 (zero) acts as a terminator, wasting only 1 byte.
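
    To illustrate the terminator convention, here is a rough sketch of how a reader might walk such a metadata tag list; the readVLI helper, the length-prefixed payload and the MetadataTag layout are assumptions for illustration, not the draft's actual wire format.
    Code:
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Minimal, simplified VLI reader (7 payload bits per byte, MSB = "more bytes
    // follow"); stands in for the draft's real variable-length integer decoder.
    uint64_t readVLI(const uint8_t*& p, const uint8_t* end) {
        uint64_t value = 0;
        for (int shift = 0; p != end && shift < 64; shift += 7) {
            uint8_t byte = *p++;
            value |= static_cast<uint64_t>(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) return value;
        }
        throw std::runtime_error("truncated or invalid VLI");
    }

    // Hypothetical in-memory view of one metadata tag.
    struct MetadataTag {
        uint64_t id;                   // 0 is reserved as the list terminator
        std::vector<uint8_t> payload;  // e.g. recovery info, per-block checksum, ...
    };

    // Read tags until an id of 0 is seen; an empty list therefore costs 1 byte.
    std::vector<MetadataTag> readTagList(const uint8_t*& p, const uint8_t* end) {
        std::vector<MetadataTag> tags;
        for (uint64_t id; (id = readVLI(p, end)) != 0; ) {
            uint64_t len = readVLI(p, end);   // assumed: payload preceded by its length
            if (static_cast<uint64_t>(end - p) < len)
                throw std::runtime_error("truncated tag payload");
            tags.push_back({id, std::vector<uint8_t>(p, p + len)});
            p += len;
        }
        return tags;
    }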

    Quote Originally Posted by Urist McComp View Post
    It looks like this format is not strictly optimized for size.
    It was designed for maximum extensibility, so it could support all the proposed features of Fairytale. Attention was given to providing ways of efficiently storing numeric data and to omitting unnecessary/unused fields with minimum redundancy.

    Quote Originally Posted by Urist McComp View Post
    In the appendix there is only the long name of VLI, not the abbreviation. This is bad for searching the pdf.
    Thanks, that's why I hate writing technical documentation, I always forget something

    Quote Originally Posted by Urist McComp View Post
    VLI: in the case of a 64-bit number, the last end-of-number flag/bit needs to be implicit, otherwise it won't fit.
    We only use VLIs to represent positive numbers, hence for a 64-bit signed integer we only need 63 bits. And it gives us another opportunity to validate the archive for corruption: if on the 9th byte of a VLI the flag bit is set, the archive is corrupt.
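
    As an illustration, a sketch of that scheme, assuming the common "MSB of each byte is a continuation flag" convention with 7 payload bits per byte; the draft may define the flag bit the other way around, but the check on the 9th byte works the same either way.
    Code:
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Encode a non-negative value (at most 63 bits) into at most 9 bytes.
    std::vector<uint8_t> encodeVLI(int64_t value) {
        if (value < 0)
            throw std::invalid_argument("VLI encodes non-negative values only");
        std::vector<uint8_t> out;
        uint64_t v = static_cast<uint64_t>(value);
        do {
            uint8_t byte = v & 0x7F;
            v >>= 7;
            if (v != 0) byte |= 0x80;   // flag set: more bytes follow
            out.push_back(byte);
        } while (v != 0);
        return out;
    }

    // Decode a VLI; a set flag bit on the 9th byte can only mean corruption,
    // since 9 * 7 = 63 bits already cover the largest representable value.
    int64_t decodeVLI(const uint8_t*& p, const uint8_t* end) {
        uint64_t value = 0;
        for (int i = 0; i < 9; i++) {
            if (p == end)
                throw std::runtime_error("truncated VLI");
            const uint8_t byte = *p++;
            value |= static_cast<uint64_t>(byte & 0x7F) << (7 * i);
            if ((byte & 0x80) == 0)
                return static_cast<int64_t>(value);     // flag clear: last byte
            if (i == 8)                                  // 9th byte with the flag
                throw std::runtime_error("corrupt VLI"); // still set: corruption
        }
        return 0;  // unreachable
    }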

    Quote Originally Posted by Urist McComp View Post
    Table of contents?
    See above about my willingness to write documentation.

    Quote Originally Posted by Urist McComp View Post
    I don't quite get where the checksums actually are.
    Checksums are stored per chunk, to validate the actual compressed data. But since checksums are also used for block-based deduplication, it is possible to store a checksum for each individual block, to allow for much faster deduplication when adding new content to an existing Fairytale archive. If that option (Use checksum-per-block) is set in the flags field of the file header, it signals any compatible encoder that when adding new content, we request that it stores a checksum for any new compressed block.
    These checksums are stored in the metadata tag lists of the chunk info sequences and the block node sequences.

    Quote Originally Posted by Urist McComp View Post
    Concerning compression formats: I am not sure if the following recompression use case is currently possible: extract two differently compressed files and after that solidly compress both together. Except maybe by specifying a compression format that is able to (de-)compress both of the different file formats.
    That is possible. For instance, a GIF image will be decompressed to the actual 8bpp pixel data, which will be an image block (as a child block of the GIF block), so it can be compressed in solid mode alongside other image blocks from BMP/PNG/PGM/etc. Should the actual pixel data match, we could even deduplicate a GIF image against a PNG image.

    Quote Originally Posted by Urist McComp View Post
    And another thing that maybe is a little beyond the scope of this document:
    I am not sure how you are planning to use different blocks, maybe for deduplication?

    Consider this case: you have a match finder that only removes long matches but otherwise another compression method is working on the file. Imagine two partially redundant wav-files. The first one is compressed normally, the second one will have some parts missing so the wav-model can probably not work properly directly behind the "missing" parts. So how about a way to inform the wav model, that there are some missing parts or even inform it which data exactly are missing so it can use them as context but it knows that it doesn't actually need to store them. Maybe this is more realistic with multiple versions of text files or something. Will probably be complicated, though.
    The block segmentation is designed for content-aware deduplication and similarity clustering. Besides allowing us to deduplicate content across recursion depth levels and even different file formats, we can then apply an optional clustering stage before compression, to order the blocks by similarity for better gains from solid compression. Texts can be ordered by language, images would first be grouped by bits-per-pixel and then analysed to extract the general object structure for further sorting, JPEG images could be sorted by analysing EXIF data about the camera model and the date the picture was taken, or, if EXIF data is unavailable, by similar resolution, chroma-subsampling profile, quantization level, etc.
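
    Purely as an illustration of the clustering idea (this is not the prototype code, and the fields are made up), a naive pass could simply sort the block list by a composite similarity key before forming solid chunks:
    Code:
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <tuple>
    #include <vector>

    struct Block {
        int type;              // e.g. 0 = default, 1 = text, 2 = image (hypothetical ids)
        std::string language;  // filled in for text blocks
        int bitsPerPixel;      // filled in for image blocks
        uint64_t offset, size;
    };

    // Order blocks so that similar ones end up adjacent in the solid stream.
    void clusterBySimilarity(std::vector<Block>& blocks) {
        std::stable_sort(blocks.begin(), blocks.end(),
            [](const Block& a, const Block& b) {
                // group by type first, then by the type-specific similarity hints
                return std::tie(a.type, a.language, a.bitsPerPixel)
                     < std::tie(b.type, b.language, b.bitsPerPixel);
            });
    }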

    Quote Originally Posted by Urist McComp View Post
    what about multi-part archives?
    Currently we have no plans for doing that.

    Quote Originally Posted by Urist McComp View Post
    Efficient storage of identical files consisting of multiple blocks?
    Identical files can be deduplicated, independently of the number of blocks that compose them.

    Quote Originally Posted by Urist McComp View Post
    Efficient storage of files of size 0 (number of blocks implicitly = 0)
    A file entry with field number of blocks set to 0.

    Quote Originally Posted by Urist McComp View Post
    Better specification of file metadata (attributes/access rights/hardlink/softlink/whatever)
    Each file entry has its own metadata tag list.

  19. #15
    Member
    Join Date
    Sep 2017
    Location
    Berlin, Germany
    Posts
    20
    Thanks
    7
    Thanked 15 Times in 5 Posts
    Well, today I found the original thread ( https://encode.ru/threads/2856-Community-Archiver ) and also saw that many of my points were already addressed.


    Quote Originally Posted by mpais View Post
    Yes, this order is chosen to make compression easier, since the size and composition of the structures can only be determined after the compression stage, so this way we avoid having to use a temp file to store the compressed data, which would be needed since we'd have to first write those structures and only then copy the actual compressed data to the output file. And since compression relies on the complex pre-processing stage that makes heavy use of temporary storage itself, it seems best to avoid having 2 different components battling it out for temp files. It's something that is still in limbo right now.
    Using temp files can be avoided for some modern file systems, if you don't mind some small gaps (could be ok when storing a single big block of metadata in the beginning):
    http://man7.org/linux/man-pages/man2/fallocate.2.html (look for FALLOC_FL_INSERT_RANGE)
    https://docs.microsoft.com/en-us/win.../block-cloning
    It will require non-portable code, though.
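
    For reference, a minimal Linux-only sketch of that first approach (assuming ext4/XFS and a reasonably recent kernel); offset and length must be multiples of the filesystem block size, and a portable implementation would need a fallback path:
    Code:
    #include <cerrno>
    #include <cstdio>
    #include <fcntl.h>          // fallocate() (glibc, with _GNU_SOURCE)
    #include <linux/falloc.h>   // FALLOC_FL_INSERT_RANGE
    #include <unistd.h>

    // Insert 'gap' bytes of space at 'offset' in the file open on 'fd', shifting
    // the existing data (e.g. the already written compressed stream) towards the
    // end without rewriting it. Returns false if the filesystem can't do it, in
    // which case the caller would fall back to a temp file or a full rewrite.
    bool insertGap(int fd, off_t offset, off_t gap) {
        if (fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, gap) != 0) {
            perror("fallocate(FALLOC_FL_INSERT_RANGE)");
            return false;
        }
        return true;
    }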

    In case the block sizes are small, it could be OK to keep one block in memory until it is compressed completely, and then first write its metadata (once it is compressed you hopefully know all the required information) and then the block. But you probably already thought of that and its drawbacks. And small block sizes will probably be good for deduplication anyway. In case deduplication is implemented like in one of the hints Shelwien gave (if I understand it correctly):
    Quote Originally Posted by Shelwien View Post
    I suggest to look at zstd (it implements anchor-hashing version),
    Using this, block positions/sizes probably don't even need to be stored explicitly; they can be inferred during decompression. Maybe being able to store not only a list of single block ids but also ranges of block ids for files could reduce the needed amount of metadata considerably.


    Quote Originally Posted by mpais View Post
    Data is compressed in chunks of blocks of the same type. A chunk may encode a single block (non-solid mode) or all the existing blocks of that type (full solid mode), or any other configuration in this interval (quasi-solid mode). Each chunk info sequence contains a metadata tag list, so defining a tag to represent recovery information handles that option. If no metadata is specified, a single tag with id set to 0 (zero) acts as a terminator, wasting only 1 byte.
    This would at least allow for recovery of errors in the chunks, but not for recovery of metadata errors, which are probably even worse. I only used Hamming codes once and you probably want something more sophisticated in a special position for good recoverability of as many errors as possible. So I guess the best way (recovery-wise) would be if the recovery data existed "on top of"/"one layer above" the normal archive structure, but maybe there is someone else here who knows a lot more about it than I do.

    Quote Originally Posted by mpais View Post
    It was designed for maximum extensibility, so it could support all the proposed features of Fairytale. Attention was given to providing ways of efficiently storing numeric data and to omitting unnecessary/unused fields with minimum redundancy.
    I like that you don't store unneeded fields, and the VLI encoding, even if I would prefer some context modelling - it would probably be slower, more complicated, more error-prone and require more memory, though.
    So I was mostly thinking about compression of file names.

    Some thoughts about tag lists for file metadata: should there be tag ids which represent multiple data items at once? For example, a tag for Unix metadata which contains date, time, access rights... to conserve a few bits?


    Quote Originally Posted by mpais View Post
    Thanks, that's why I hate writing technical documentation, I always forget something
    I somehow like writing technical documentation. A huge part of my motivation comes from reading bad documentation or undocumented code. But it requires me to know the topic well, so I probably can't help you out with content, just with structure. I guess it is not written in LaTeX and on GitHub? It looks more like a Word document.


    Quote Originally Posted by mpais View Post
    We only use VLIs to represent positive numbers, hence for a 64-bit signed integer we only need 63 bits. And it gives us another opportunity to validate the archive for corruption: if on the 9th byte of a VLI the flag bit is set, the archive is corrupt.
    Well, I overlooked this, I was already too tired, sorry.

    Quote Originally Posted by mpais View Post
    Checksums are stored per chunk, to validate the actual compressed data. But since checksums are also used for block-based deduplication, it is possible to store a checksum for each individual block, to allow for much faster deduplication when adding new content to an existing Fairytale archive. If that option (Use checksum-per-block) is set in the flags field of the file header, it signals any compatible encoder that when adding new content, we request that it stores a checksum for any new compressed block.
    These checksums are stored in the metadata tag lists of the chunk info sequences and the block node sequences.
    Now I got it, reading it again helped. I missed what you meant by "structure" before. I'd still like a complete example, though. Maybe someone else can do it, too (maybe me), if you could provide the sources for the .pdf. So how about LaTeX or Markdown or HTML somewhere in the repository ( https://github.com/schnaader/fairytale )? Or did I just overlook it?

    Quote Originally Posted by mpais View Post
    That is possible. For instance, a GIF image will be decompressed to the actual 8bpp pixel data, which will be an image block (as a child block of the GIF block), so it can be compressed in solid mode alongside other image blocks from BMP/PNG/PGM/etc. Should the actual pixel data match, we could even deduplicate a GIF image against a PNG image.
    That is exactly what I would like to have

    Edit: writing this part took me a long time and I (hopefully) now understand a lot more. So in the beginning it still looks like I barely understand anything. I decided to keep it anyway to make it more clear what was easy and what was hard to understand (for me).

    I still don't really understand how the chunk and block structure is supposed to work, though. This is what I hopefully got right:
    - One chunk contains one (non-solid) or more blocks
    - Compression can be defined for each chunk

    So at least one question that I can clearly describe without feeling too stupid:
    Both the chunk info sequence and the block node sequence have "Block type" fields of the VLI type. But in the example in the appendix they look like strings, though. Where, or at least how (hardcoded?), is the mapping between VLI value and string defined? Some block types indicate the presence of an optional info field. Where is the information on whether it exists?

    Even if it's all written down it took me quite a while to understand how new artificial blocks are created by block node sequences. I still don't understand their meaning, though. Is this information how to restore the original files after decompression?

    Does this mean there is compression information in the chunks and after that there is also information on how to restore the original compression for some blocks? I have to think about this - and I hope this is even correct. Maybe you could specify it a little further in the pdf.

    Quote Originally Posted by mpais View Post
    The block segmentation is designed for content-aware deduplication and similarity clustering. Besides allowing us to deduplicate content across recursion depth levels and even different file formats, we can then apply an optional clustering stage before compression, to order the blocks by similarity for better gains from solid compression. Texts can be ordered by language, images would first be grouped by bits-per-pixel and then analysed to extract the general object structure for further sorting, JPEG images could be sorted by analysing EXIF data about the camera model and the date the picture was taken, or, if EXIF data is unavailable, by similar resolution, chroma-subsampling profile, quantization level, etc.
    Thanks, this answers the first part of my question. The second one was also raised by Shelwien before:
    Quote Originally Posted by Shelwien View Post
    3. Proper implementation would require archiver framework support,
    in particular because of context/dictionary mode.
    I mean, suppose we're analyzing a file, and see that some chunks can
    be better compressed with ppmd, some with lzma.
    Independent compression of these chunks is obvious, but we actually
    can implement solid mode too - by providing previously unpacked data
    as a dictionary for the next codec.
    This certainly requires support in archiver framework, though.
    So is there a possibility to mark blocks as context-only blocks for a given compressor? That could allow even better reordering: the current reordering seems to be linear, while this would allow for (limited) "multi-dimensional reordering": one block could precede multiple blocks at the same time but only be stored once. Or is this already possible and I only overlooked it?

    Quote Originally Posted by mpais View Post
    Currently we have no plans for doing that.


    Quote Originally Posted by mpais View Post
    Identical files can be deduplicated, independently of the number of blocks that compose them.
    This is not what I meant; I meant not having to list a (possibly) large number of blocks for a 2nd file again.
    But I think I got it now:
    Does this work by specifying one "virtual block" consisting of all the blocks in a file and then having both files just refer to this one virtual block?

    Quote Originally Posted by mpais View Post
    A file entry with field number of blocks set to 0.
    My description probably was too short. Make the "number of blocks" field optional. It should only be present if a file has a size > 0.

    Quote Originally Posted by mpais View Post
    Each file entry has its own metadata tag list.
    I meant there should be more entries in the File metadata tag list in the appendix,
    and maybe also a link for quickly jumping inside the PDF.


    One more question came to my mind:
    How will all the data compressed by different codecs (chunks/blocks) be appended to each other? Is there some padding after each chunk in case bytes are not fully used?



    To conclude:
    Two things took me quite a while to understand, maybe you could try to make it easier for future readers:
    - It is possible to define "virtual" blocks - which is awesome!
    - There are two kinds of compression codecs stored in different places. The ones that the archive is compressed with (and that need to be decompressed for unpacking the archive) are specified in the chunk metadata, and the ones that are needed to restore other compressed formats that are stored inside the archive in an uncompressed form are specified in the block metadata.
    Last edited by Urist McComp; 15th May 2018 at 23:49. Reason: typos

  20. #16
    Member
    Join Date
    Sep 2017
    Location
    Berlin, Germany
    Posts
    20
    Thanks
    7
    Thanked 15 Times in 5 Posts
    I think there is one use case which is already supported by this specification but which could be supported better.

    Consider the following folder structure:

    Code:
    dir_to_comp
        1.rar_uncompressed_dir
        1.rar
    Rar files (for example) probably will not be able to be reconstructed from uncompressed files any time soon, so they need to be stored directly. The extracted files, on the other hand, may be extracted from the rar instead of being stored separately. Right now I think each file needs its own block node entry, and even more so if you want to split it into multiple blocks for deduplication. So how about allowing a single block node entry to create multiple blocks, defined by the underlying block itself? Maybe it would be helpful to be able to redundantly store just the information of how many blocks there will be, to allow for easier/faster parsing of the remaining block nodes without having to decompress the .rar first. This count could also be useful for some corruption checks.

    Some examples where I think this could be useful (for reducing the number of explicitly listed block IDs):
    - Compressed archives, which cannot be recreated but where (some parts of) the included files are also stored somewhere else again. Maybe even jpegs which are (for some weird reason) also stored as a bitmap or even png extracted from them.
    - Splitting files into multiple blocks for deduplication, for example using anchor hashing. It should only be used as long as the blocks should only be referenced and not removed by deduplication itself. So the most relevant use case will probably be files inside a non recreatable archive.

    On the other hand this may increase the count of block IDs that "pollute the namespace" if only a few of them are needed. Although files where no blocks at all are needed could be skipped.
    Last edited by Urist McComp; 16th May 2018 at 10:10.

  21. #17
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Quote Originally Posted by Urist McComp View Post
    Using temp files can be avoided for some modern file systems...
    The problem with temporary storage comes from the pre-processing stage, that needs a lot of it. Since the idea is to have recompression capabilities for as many formats as possible, we need a robust temporary storage management backend to handle it. If we can skip using temp files as much as possible for the actual compression stage, we'd put less pressure on the storage manager. But like I said, this is just a proposal for a possible file format, nothing is set in stone yet.

    Quote Originally Posted by Urist McComp View Post
    In case deduplication is implemented like in one of the hints shelwien gave (if I understand it correctly):
    Deduplication will be performed in 2 stages, both optional. The first stage dedupes blocks by comparing them against each other, and the second stage will further process all default blocks with an anchor-hashing approach, which will sub-divide these default blocks and dedupe them. Currently only the first stage is implemented in the prototype.
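
    To make that second stage more concrete, here is a generic content-defined chunking sketch (a gear-hash variant with made-up size parameters); it is not the algorithm the prototype will actually use, just an illustration of how a rolling hash picks anchors so that identical runs of bytes end up in identical sub-blocks regardless of their position.
    Code:
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Deterministic pseudo-random per-byte constants for the rolling (gear) hash.
    static std::array<uint64_t, 256> makeGearTable() {
        std::array<uint64_t, 256> t{};
        uint64_t x = 0x9E3779B97F4A7C15ull;
        for (auto& v : t) {
            x ^= x >> 12; x ^= x << 25; x ^= x >> 27;   // xorshift step
            v = x * 0x2545F4914F6CDD1Dull;
        }
        return t;
    }
    static const std::array<uint64_t, 256> gear = makeGearTable();

    // Split 'data' into sub-blocks at content-defined anchors; returns the sizes.
    // minSize/maxSize/avgMask are illustrative (roughly 2 KiB / 64 KiB / 8 KiB avg).
    std::vector<size_t> anchorSplit(const uint8_t* data, size_t len,
                                    size_t minSize = 2048, size_t maxSize = 65536,
                                    uint64_t avgMask = 0x1FFF) {
        std::vector<size_t> sizes;
        size_t start = 0;
        uint64_t h = 0;
        for (size_t i = 0; i < len; i++) {
            h = (h << 1) + gear[data[i]];
            const size_t blockLen = i + 1 - start;
            if ((blockLen >= minSize && (h & avgMask) == 0) || blockLen >= maxSize) {
                sizes.push_back(blockLen);   // anchor found (or block grew too large)
                start = i + 1;
                h = 0;
            }
        }
        if (start < len) sizes.push_back(len - start);   // trailing partial block
        return sizes;
    }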

    Quote Originally Posted by Urist McComp View Post
    Using this, block positions/sizes probably don't even need to be stored explicitly; they can be inferred during decompression. Maybe being able to store not only a list of single block ids but also ranges of block ids for files could reduce the needed amount of metadata considerably.
    We only segment into blocks of specific data types recognized by the parsers, or any default block segmented by the 2nd stage deduplication. Any file starts as just 1 default block, and can then be segmented into blocks of a myriad of types. The block ids themselves aren't stored, they're assigned in order of appearance in the chunks, which themselves group blocks of the same type, so even though a file may be described by 3 blocks of different types, their ids will probably not be consecutive.

    Quote Originally Posted by Urist McComp View Post
    This would at least allow for recovery of errors in the chunks, but not for recovery of metadata errors, which are probably even worse. I only used Hamming codes once and you probably want something more sophisticated in a special position for good recoverability of as many errors as possible. So I guess the best way (recovery-wise) would be if the recovery data existed "on top of"/"one layer above" the normal archive structure, but maybe there is someone else here who knows a lot more about it than I do.
    We could handle it with a flag bit in the file header that signals that, after reading the data size field, we should read another 64-bit signed integer which specifies the size of all the structures. This would allow us to append recovery data for the structures at the end of the archive.

    Quote Originally Posted by Urist McComp View Post
    I like that you don't store unneeded fields, and the VLI encoding, even if I would prefer some context modelling - it would probably be slower, more complicated, more error-prone and require more memory, though.
    So I was mostly thinking about compression of file names.
    I wouldn't mind it either, but if we're going for practical usage as our goal, we probably shouldn't add even more unnecessary complexity.

    Quote Originally Posted by Urist McComp View Post
    Some thoughts about tag lists for file metadata: should there be tag ids which represent multiple data items at once? For example, a tag for Unix metadata which contains date, time, access rights... to conserve a few bits?
    Sure, it's just a matter of defining tags. We must decide which tags a compatible decoder must support, so that it handles those and can safely ignore any unknown tags.

    Quote Originally Posted by Urist McComp View Post
    I somehow like writing technical documentation. A huge part of my motivation comes from reading bad documentation or undocumented code. But it requires me to know the topic well, so I probably can't help you out with content, just with structure. I guess it is not written in LaTeX and on GitHub? It looks more like a Word document.
    Yes, bingo, it was just a quick sketch in Word (actually my 3rd sketch, but who's counting.. ). I know it's probably hard to grasp taken out of context, we discussed most options on Gitter so those who followed it probably found it much easier to understand. And at least it made sense to me (at the time, don't hold me to it).

    Quote Originally Posted by Urist McComp View Post
    Now I got it, reading it again helped. I missed what you meant by "structure" before. I'd still like a complete example, though. Maybe someone else can do it, too (maybe me), if you could provide the sources for the .pdf. So how about LaTeX or Markdown or HTML somewhere in the repository ( https://github.com/schnaader/fairytale )? Or did I just overlook it?
    Sure, I'll try to cook something up to serve as a better example. Gonzalo is our documentation guy (he drew the short straw ) so I defer to him on that. I know he was writing stuff for the wiki but don't know how far along it is.

    Quote Originally Posted by Urist McComp View Post
    So at least one question that I can clearly describe without feeling too stupid:
    Both the chunk info sequence and the block node sequence have "Block type" fields of the VLI type. But in the example in the appendix they look like strings, though. Where, or at least how (hardcoded?), is the mapping between VLI value and string defined? Some block types indicate the presence of an optional info field. Where is the information on whether it exists?
    The strings are just for making it easier to understand than listing types by a number. Default is type 0, Deflate may be 1, etc. As for the optional field, it depends on the data type itself, so yes, you can say it's "hardcoded" in the sense that any decoder that supports a certain block type must know if it has optional reconstruction info. For instance, with Deflate we currently need to store some zlib parameters and possibly diff bytes (this will hopefully change if we integrate preflate). But if we're only doing an RGB color transform on pixel data from an image, we don't need any additional info.
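
    Conceptually, something like this (the ids other than Default = 0 and the type names are invented, purely for illustration):
    Code:
    #include <cstdint>

    // The draft stores block types as VLI numbers; the strings in the appendix
    // are just human-readable names for them.
    enum class BlockType : uint64_t {
        Default = 0,
        Deflate = 1,   // hypothetical id
        Image   = 2,   // hypothetical id
        Audio   = 3,   // hypothetical id
    };

    // Whether a block type carries the optional reconstruction-info field is a
    // property of the type itself, hard-coded in every compatible decoder.
    bool hasOptionalInfo(BlockType t) {
        switch (t) {
            case BlockType::Deflate: return true;   // zlib parameters, diff bytes, ...
            default:                 return false;  // e.g. a plain RGB transform needs none
        }
    }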

    Quote Originally Posted by Urist McComp View Post
    Even if it's all written down it took me quite a while to understand how new artificial blocks are created by block node sequences. I still don't understand their meaning, though. Is this information how to restore the original files after decompression?

    Does this mean there is compression information in the chunks and after that there is also information on how to restore the original compression for some blocks? I have to think about this - and I hope this is even correct. Maybe you could specify it a little further in the pdf.
    Yes, that's it. The actual data that is compressed in the chunks is that of the deepest recursion levels, i.e., the final child blocks. But we still need to know how to reconstruct their parent blocks, what was their type, do we need the optional reconstruction data to do it, etc. For example, a zip file would be decompressed and each stream would be a child block, marked as default on creation. After further parsing is done on them, some may have been detected as JPEGs or GIFs or whatever, so they themselves may become parent blocks. Since we only compress the child blocks, we need information about the parents to be able to recreate the original file.
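
    A purely conceptual sketch of that relationship (the structures and the reencode hook are illustrative, not the draft's actual layout): only leaf blocks carry bytes taken from the chunks, and a parent is rebuilt by re-applying its original encoding to its reconstructed children.
    Code:
    #include <cstdint>
    #include <vector>

    struct BlockNode {
        uint64_t type;                            // VLI block type from the block node sequence
        std::vector<uint8_t> reconstructionInfo;  // optional, depends on the type (e.g. zlib params)
        std::vector<BlockNode> children;          // empty for leaf (child-most) blocks
        std::vector<uint8_t> data;                // filled for leaves when their chunk is decoded
    };

    // Placeholder: a real implementation would re-apply the original encoding
    // (Deflate, GIF, ...) for the given type; here it just passes the data through.
    static std::vector<uint8_t> reencode(uint64_t /*type*/,
                                         std::vector<uint8_t> childData,
                                         const std::vector<uint8_t>& /*info*/) {
        return childData;
    }

    std::vector<uint8_t> reconstruct(const BlockNode& node) {
        if (node.children.empty())
            return node.data;                     // leaf: bytes come straight from a chunk
        std::vector<uint8_t> childData;
        for (const BlockNode& child : node.children) {
            std::vector<uint8_t> bytes = reconstruct(child);  // deepest levels first
            childData.insert(childData.end(), bytes.begin(), bytes.end());
        }
        return reencode(node.type, std::move(childData), node.reconstructionInfo);
    }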

    Quote Originally Posted by Urist McComp View Post
    So is there a possibility to mark blocks as context-only blocks for a given compressor? That could allow even better reordering: the current reordering seems to be linear, while this would allow for (limited) "multi-dimensional reordering": one block could precede multiple blocks at the same time but only be stored once. Or is this already possible and I only overlooked it?
    Not currently, and honestly I don't see the point. The similarity clustering stage will already try to order the blocks in such a way as to maximize compression when in solid mode, so that at any point we're already trying to have the best possible previous blocks as context.

    Quote Originally Posted by Urist McComp View Post
    This is not what I meant; I meant not having to list a (possibly) large number of blocks for a 2nd file again.
    But I think I got it now:
    Does this work by specifying one "virtual block" consisting of all the blocks in a file and then having both files just refer to this one virtual block?
    No, I considered that (keeping a root block per file, so any initial segmentation implies we descend a level) but that would bloat the structures even more, for the possibility of slightly better gains on some files. But deduping a whole file is already as good compression as you're going to get, so even if we need to store its segmentation info, it's still pretty good in terms of ratio. But again, this is just my humble opinion, and it's why we're only in the planning stage, to make sure we get this right.

    Quote Originally Posted by Urist McComp View Post
    My description probably was too short. Make the "number of blocks" field optional. It should only be present if a file has a size > 0.
    We don't explicitly store file sizes; they're deduced from the block lengths, so if the number of blocks is 0, it's a 0-byte file, which works out to the same thing (note that one of the proposed metadata tags for the file entries is precisely a file size tag, to speed up listing archive contents).

    Quote Originally Posted by Urist McComp View Post
    I meant there should be more entries in the File metadata tag list in the appendix,
    and maybe also a link for quickly jumping inside the PDF.
    Sure, we just haven't defined them yet (damn lazy coders )

    Quote Originally Posted by Urist McComp View Post
    One more question came to my mind:
    How will all the data compressed by different codecs (chunks/blocks) be appended to each other? Is there some padding after each chunk in case bytes are not fully used?
    All codecs must flush when finishing compression of a chunk, so it is implementation dependent. A chunk's data must always start and end at a byte boundary.

    Quote Originally Posted by Urist McComp View Post
    I think there is one use case which is already supported by this specification but which could be supported better.

    Consider the following folder structure:

    Code:
    dir_to_comp
        1.rar_uncompressed_dir
        1.rar
    Rar files (for example) probably will not be able to be reconstructed from uncompressed files any time soon, so they need to be stored directly. The extracted files, on the other hand, may be extracted from the rar instead of being stored separately. Right now I think each file needs its own block node entry, and even more so if you want to split it into multiple blocks for deduplication. So how about allowing a single block node entry to create multiple blocks, defined by the underlying block itself? Maybe it would be helpful to be able to redundantly store just the information of how many blocks there will be, to allow for easier/faster parsing of the remaining block nodes without having to decompress the .rar first. This count could also be useful for some corruption checks.

    Some examples where I think this could be useful (for reducing the number of explicitly listed block IDs):
    - Compressed archives, which cannot be recreated but where (some parts of) the included files are also stored somewhere else again. Maybe even jpegs which are (for some weird reason) also stored as a bitmap or even png extracted from them.
    - Splitting files into multiple blocks for deduplication, for example using anchor hashing. It should only be used as long as the blocks should only be referenced and not removed by deduplication itself. So the most relevant use case will probably be files inside a non recreatable archive.

    On the other hand this may increase the count of block IDs that "pollute the namespace" if only a few of them are needed. Although files where no blocks at all are needed could be skipped.
    If we can't reconstruct the compressed original data, we can't extract its content and compress that. So we couldn't "see" the files inside the Rar archive.

  22. #18
    Member
    Join Date
    Sep 2017
    Location
    Berlin, Germany
    Posts
    20
    Thanks
    7
    Thanked 15 Times in 5 Posts
    Quote Originally Posted by mpais View Post
    The problem with temporary storage comes from the pre-processing stage, that needs a lot of it. Since the idea is to have recompression capabilities for as many formats as possible, we need a robust temporary storage management backend to handle it. If we can skip using temp files as much as possible for the actual compression stage, we'd put less pressure on the storage manager. But like I said, this is just a proposal for a possible file format, nothing is set in stone yet.
    I thought only of one special case of temp storage:
    Consider that you produce the data in the order File header -> Compressed data -> Metadata, but you want File header -> Metadata -> Compressed data.
    In this case you can avoid writing the compressed data to a temp file. Just write it to the final file, and after that (when you hopefully know the size of the metadata) use these special functions to shift the compressed data back, using only a metadata operation to make room for the metadata at the beginning.


    Quote Originally Posted by mpais View Post
    Sure, I'll try to cook something up to serve as a better example. Gonzalo is our documentation guy (he drew the short straw ) so I defer to him on that. I know he was writing stuff for the wiki but don't know how far along it is.
    Ah, OK. So is it possible to create the PDF from the wiki? Because having two different documents that are intended to document the same stuff often leads to inconsistencies (at least in my experience).


    Quote Originally Posted by mpais View Post
    The strings are just for making it easier to understand than listing types by a number. Default is type 0, Deflate may be 1, etc. As for the optional field, it depends on the data type itself, so yes, you can say it's "hardcoded" in the sense that any decoder that supports a certain block type must know if it has optional reconstruction info. For instance, with Deflate we currently need to store some zlib parameters and possibly diff bytes (this will hopefully change if we integrate preflate). But if we're only doing an RGB color transform on pixel data from an image, we don't need any additional info.
    What I actually wanted to say: I'd like a (short) explanation of what it is supposed to mean in the documentation.


    Quote Originally Posted by mpais View Post
    No, I considered that (keeping a root block per file, so any initial segmentation implies we descend a level) but that would bloat the structures even more, for the possibility of slightly better gains on some files. But deduping a whole file is already as good compression as you're going to get, so even if we need to store its segmentation info, it's still pretty good in terms of ratio. But again, this is just my humble opinion, and it's why we're only in the planning stage, to make sure we get this right.
    You don't need any additional structures for that. Just a copy "Block type": it would just append all the blocks listed in its block node entry one after another to create a new, bigger "virtual" block.


    Quote Originally Posted by mpais View Post
    We don't explicitly store file sizes; they're deduced from the block lengths, so if the number of blocks is 0, it's a 0-byte file, which works out to the same thing (note that one of the proposed metadata tags for the file entries is precisely a file size tag, to speed up listing archive contents).
    Sorry, I somehow missed that. I probably interpreted the string length as the file size or something.


    Quote Originally Posted by mpais View Post
    Sure, we just haven't defined them yet (damn lazy coders )
    What I actually wanted to say: I like writing "TODO" in my documentation in such cases


    Quote Originally Posted by mpais View Post
    All codecs must flush when finishing compression of a chunk, so it is implementation dependent. A chunk's data must always start and end at a byte boundary.
    Could be included in the documentation



    Quote Originally Posted by mpais View Post
    If we can't reconstruct the compressed original data, we can't extract its content and compress that. So we couldn't "see" the files inside the Rar archive.
    Oh well, you must have thought: what the ???? is he talking about, this is ridiculously stupid. So let me try to rephrase what I wanted to say:

    Imagine someone has a folder where he downloads a lot of stuff. Usually these downloads are compressed, so he decompresses them into the same folder. But the user doesn't delete the compressed files. Now he wants to make a backup of this folder using Fairytale. So the *.rar files need to be stored inside the Fairytale archive because they cannot be recreated. But the files that were unpacked from these *.rar files don't need to also be stored (redundantly) in the *.ftl. They can just be extracted again from the *.rar files (which need to be stored inside the *.ftl archive) during Fairytale decompression.

  23. #19
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    For a start, I recommend looking into other formats:
    - classic: zip, rar
    - modern: 7z, arc
    - future: thoughts, some ideas
    - inspiration: zstd

    Basic stuff:
    - everything, including the info you saved in the archive header, should be protected by checksums
    - it would be great to have an archive format that can be written strictly sequentially; this means that the index to the metadata block(s) should be placed at the archive end
    - in 99.9% of cases, it's enough to put all metadata into a single block, checksummed, compressed and encrypted as a single entity
    - all metainfo should be encoded as SoA rather than AoS
    - allow disabling ANY metainfo field (CRC, filesize, ...) - this requires that every field be preceded by a tag, but in order to conserve space you can place flags for a standard set of fields at the block start, so you can still disable any of them, at very small cost (see the sketch after this list)
    - similarly, if you have default compression/deduplication/checksumming/... algorithms, allow encoding them with just a few bits, but provide the option to replace them with arbitrary custom algos
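
    A sketch of that flags idea (the field names are invented): one small bitmask at the start of the metadata block says which of the standard per-file fields are present at all, so any of them can be disabled for the cost of one bit, while rarer fields still use explicit tags.
    Code:
    #include <cstdint>

    enum StandardFieldFlags : uint32_t {
        HAS_CRC       = 1u << 0,
        HAS_FILESIZE  = 1u << 1,
        HAS_MTIME     = 1u << 2,
        HAS_UNIX_MODE = 1u << 3,
        // remaining bits reserved for future standard fields
    };

    struct MetadataBlockHeader {
        uint32_t standardFields;   // OR of StandardFieldFlags
        // ... followed by the enabled standard fields in a fixed order,
        //     then tag-prefixed custom fields for everything else
    };

    inline bool hasField(const MetadataBlockHeader& h, StandardFieldFlags f) {
        return (h.standardFields & f) != 0;
    }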

    I started to develop a nextgen archive format based on all those ideas, but aborted it pretty early. Nevertheless, I recommend using it as the basis and just filling in the unfinished parts of my spec.

  24. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Mike (20th May 2018)

  25. #20
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    - everything including info you saved in the archive header should be protected by checksums
    I'll probably add a checksum byte to the file header; all other structures are already checksummed.
    Quote Originally Posted by Bulat Ziganshin View Post
    - it will be great to have archive format that can be written strictly sequentially. this means that index to metadata block(s) should be placed at the archive end
    As-is, everything can be written sequentially if we use a temporary stream to hold the compressed data before committing it to the output archive (since we need to know the size of the compressed data for the data size field of the file header).
    Quote Originally Posted by Bulat Ziganshin View Post
    - in 99.9% cases, it's enough to put all metadata into single block, checksummed, compressed and encrypted as the single entity
    In the current proposal, metadata refers exclusively to non-essential data. We use 4 different structures to hold the info required to describe and perform operations on the archive, checksummed and possibly encrypted. As suggested here, we might add an additional recovery structure for these 4 main structures. Compression of these structures is as yet not defined, but can be turned on with a flag in the file header if needed.
    Quote Originally Posted by Bulat Ziganshin View Post
    - all metainfo should be encoded as SoA rather than AoS
    I disagree, seeing as almost all structures have variable length, so encoding as SoA won't really provide many benefits and makes it harder to simply delete unneeded metadata, should the user wish to strip it.
    Quote Originally Posted by Bulat Ziganshin View Post
    - allow to disable ANY metainfo field (crc, filesize...) - this requires that any field should be preceded by tag. but in order to conserve space, you can place flags for standard set of fields into the block start, so you can still disable any, but at very small cost
    Checksums are only mandatory when needed to protect data from corruption, and those needed for block-based deduplication are optional. All metadata is stored in an efficient tag list representation, and only data that is required is actually stored (for instance, filesizes are optional metadata since they can be deduced from the total of the blocks that represent a file).
    Quote Originally Posted by Bulat Ziganshin View Post
    - similarly, if you have default compression/deduplication/checksumming/... algorithms, allow to encode them with just a few bits, but provide the option to replace them with arbitrary custom algos
    Deduplication is optional and can use 2 stages, the first is implemented and consists of content-aware block deduplication, the second stage will then use anchor-hashing sequentially on all default block types to further segment them. Checksumming is currently handled by CRC32 (we're considering changing to the CRC32c variant) and can eventually be changed by a flag on the file header. The compression algorithms used are user-defined, since that is one of the core ideas of the project. They are represented in the codec definitions structure, which allows for multiple sequential codecs to be used per block-type.
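
    As an illustration of the codec-definitions idea (names and fields are placeholders, not the draft's wire format), each block type maps to an ordered chain of codecs applied one after the other on compression and undone in reverse on decompression:
    Code:
    #include <cstdint>
    #include <string>
    #include <vector>

    // One entry of a hypothetical codec definitions structure.
    struct CodecDefinition {
        uint64_t blockType;              // VLI block type this chain applies to
        std::vector<std::string> chain;  // applied in order when compressing,
                                         // reversed when decompressing
    };

    // Example: audio blocks might go through a delta filter and then LZMA,
    // while default blocks use a single general-purpose codec.
    std::vector<CodecDefinition> exampleDefinitions() {
        return {
            {0, {"cm"}},             // default blocks (hypothetical codec name)
            {3, {"delta", "lzma"}},  // audio blocks (hypothetical type id)
        };
    }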

  26. #21
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    The project is migrating to GitLab because of this GitLab offer and for better coordination using the tools there. GitHub and GitLab are very similar, so this should be no big deal. As for the Microsoft part of it, I doubt they'll ruin GitHub as they did with Skype, but you never know, and there's no reason why we shouldn't give GitLab a chance; they seem to have some nice additional infrastructure over there.
    http://schnaader.info
    Damn kids. They're all alike.

  27. #22
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    122
    Thanks
    36
    Thanked 33 Times in 24 Posts
    New link?

  28. #23
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts

  29. The Following User Says Thank You to Lucas For This Useful Post:

    hexagone (11th June 2018)

  30. #24
    Member
    Join Date
    Mar 2016
    Location
    Croatia
    Posts
    181
    Thanks
    74
    Thanked 10 Times in 10 Posts
    I am just wondering... does anyone know what is happening with the Fairytale community archiver?

