Quick update: Now we have confirmed working binaries for:
* ARM Linux 32 bits
* x86-64 Windows MSVC and GCC
* x86-64 Linux GCC and clang
* OSX clang
So now every major platform is supported, and compiling is as easy as typing "make" or "make -f Makefile.your_platform". Use -j8 on multi-core machines to compile faster.
I tried a few examples and it works fast. What about adding these "compressors" to the detection, like you do for zlib/gzip/deflate?
- lzx (xnb/pkg, as used in the games Transistor and Bastion)
I don't know if you already know these tools:
AFR (Anvil Forge Recompressor, for Ubisoft games)
Unreal Package Extractor
Punity game file precompressor (I know Razor12911 is in this forum too)
This is a field with endless possibilities. The quickbms source alone contains enough algorithms to recompress practically 100% of all the games in the world and many other file types. But! It's work. A whole lot of it.
Stuff like pZlib, plz4, pOodle, pzstd, plzo is really fast, but the problem is that they aren't universal (except pZlib) and are specifically optimized for games only.
Here is our current draft file format proposal.
As always, your criticism and suggestions are highly valued.
Thanks, we'll look into it.
There are currently a few options for DEFLATE recompression: preflate, difflate (unreleased, by Christian Schneider) and reflate (closed-source, by Shelwien).
Do you have any data on how this new recompressor fares against reflate?
Currently no, but it's just a matter of changing the LUT. Optimized implementations will come later.
It is likely particularly bad on unoptimized PNGs and other non-predicted, slowly varying signals, and okayish on text. You will likely get about a 20% density increase over gzip, even on small files, and the decoding speed should be better than with other approaches.
We usually develop on Ubuntu-like computers, and it is not easy for us to produce high quality Windows releases. Boot to Linux with a boot dvd and you can build grittibanzli relatively easily?
The ordering File header, Compressed data, Metadata probably makes compression easy, but isn't decompression maybe more important? What about partially downloaded files or streaming decompression? Maybe even Header, metadata part 1, compressed data part 1, metadata part 2, compressed data part 2 could be an option?
This could also be useful to allow adding more files after initial compression
How about additional recovery information for damaged archives, or at least some reserved fields for later addition?
It looks like this format is not strictly optimized for size.
I saw a file header somewhere that contained a url as one of the first parts, maybe this could be useful (for a small audience)
In the appendix there is only the long name of VLI, not the abbreviation. This is bad for searching the pdf.
VLI: In the case of a 64-bit number, the last end-of-number flag/bit needs to be implicit, otherwise it won't fit.
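Not from the draft spec, just an illustrative sketch of one common way to get that implicit end bit: cap the VLI at 9 bytes, let bytes 1-8 carry 7 payload bits plus a continuation flag, and treat all 8 bits of a 9th byte as payload, so 8*7 + 8 = 64 bits fit exactly.
Code:
#include <cstdint>
#include <vector>

// Hypothetical VLI scheme (not the spec's): 7 payload bits + continuation flag
// per byte, capped at 9 bytes; the 9th byte has no flag, all 8 bits are payload.
std::vector<uint8_t> encodeVLI(uint64_t value) {
    std::vector<uint8_t> out;
    for (int i = 0; i < 8; i++) {
        uint8_t low7 = value & 0x7F;
        value >>= 7;
        if (value == 0) {            // last byte: continuation flag cleared
            out.push_back(low7);
            return out;
        }
        out.push_back(low7 | 0x80);  // more bytes follow
    }
    out.push_back(static_cast<uint8_t>(value)); // 9th byte: top 8 bits, end implicit
    return out;
}

uint64_t decodeVLI(const uint8_t* p) {
    uint64_t value = 0;
    for (int i = 0; i < 8; i++) {
        value |= static_cast<uint64_t>(p[i] & 0x7F) << (7 * i);
        if (!(p[i] & 0x80))
            return value;            // explicit end-of-number flag
    }
    return value | (static_cast<uint64_t>(p[8]) << 56);  // implicit end on byte 9
}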
Table of contents?
I don't quite get where the checksums actually are. Maybe you can add a full example of a compressed file in a nice tree view.
Concerning compression formats: I am not sure if the following recompression use case is currently possible: extract two differently compressed files and then solidly compress both together. Except for maybe specifying a compression format that is able to (de-)compress both different file formats.
Code:
File header
Size VLI
...
Compressed data
...
And another thing that maybe is a little beyond the scope of this document:
I am not sure how you are planning to use different blocks, maybe for deduplication?
Consider this case: you have a match finder that only removes long matches but otherwise another compression method is working on the file. Imagine two partially redundant wav-files. The first one is compressed normally, the second one will have some parts missing so the wav-model can probably not work properly directly behind the "missing" parts. So how about a way to inform the wav model, that there are some missing parts or even inform it which data exactly are missing so it can use them as context but it knows that it doesn't actually need to store them. Maybe this is more realistic with multiple versions of text files or something. Will probably be complicated, though.
What about multi-part archives?
Efficient storage of identical files consisting of multiple blocks?
Efficient storage of files of size 0 (number of blocks implicitly = 0)
Better specification of file metadata (attributes/access rights/hardlink/softlink/whatever)
Thanks for your interest Urist.
These checksums are stored in the metadata tag lists of the chunk info sequences and the block node sequences.
Well, today I found the original thread ( https://encode.ru/threads/2856-Community-Archiver ) and also saw that many of my points were already addressed.
http://man7.org/linux/man-pages/man2/fallocate.2.html (look for FALLOC_FL_INSERT_RANGE)
It will require non-portable code, though.
In case the block sizes are small, it could be OK to keep one block in memory until it is compressed completely, and then first write its metadata (once it is compressed you hopefully know all the required information) and then the block. But you probably already thought of that and its drawbacks. And small block sizes will probably be good for deduplication anyway. In case deduplication is implemented like in one of the hints Shelwien gave (if I understand it correctly):
So I was mostly thinking about compression of file names.
Some thoughts about tag lists for file metadata: should there be tag IDs which represent multiple data items at once? For example, a tag for Unix metadata which contains date, time, access rights ... to conserve a few bits?
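For what it's worth, a combined tag could look something like this; the tag IDs and payload layout are purely hypothetical, not taken from the draft:
Code:
#include <cstdint>
#include <vector>

// Hypothetical tag IDs, not from the draft spec: one combined "unix metadata"
// tag whose payload packs mtime, uid, gid and mode back to back, instead of
// spending a separate tag byte on each field.
enum MetaTag : uint8_t {
    TAG_END       = 0,
    TAG_MTIME     = 1,
    TAG_MODE      = 2,
    TAG_OWNER     = 3,
    TAG_UNIX_META = 4,   // combined: mtime + uid + gid + mode
};

static void putLE(std::vector<uint8_t>& out, uint64_t v, int bytes) {
    for (int i = 0; i < bytes; i++)
        out.push_back(static_cast<uint8_t>(v >> (8 * i)));   // little-endian
}

void writeUnixMetaTag(std::vector<uint8_t>& out,
                      uint64_t mtime, uint32_t uid, uint32_t gid, uint16_t mode) {
    out.push_back(TAG_UNIX_META);  // one tag byte instead of three
    putLE(out, mtime, 8);
    putLE(out, uid, 4);
    putLE(out, gid, 4);
    putLE(out, mode, 2);           // 18 payload bytes + 1 tag byte total
}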
https://github.com/schnaader/fairytale )? Or did I just overlook it?
Edit: writing this part took me a long time and I (hopefully) now understand a lot more. So in the beginning it still looks like I barely understand anything. I decided to keep it anyway to make it more clear what was easy and what was hard to understand (for me).
I still don't really understand how the chunk and block structure is supposed to work, though. This is what I hopefully got right:
- One chunk contains one (non-solid) or more blocks
- Compression can be defined for each chunk
So at least one question that I can clearly describe without feeling too stupid:
Both the chunk info sequence and the block node sequence have "Block type" fields of the VLI type. In the example in the appendix they look like strings, though. Where, or at least how (hardcoded?), is the mapping between VLI value and string defined? Some block types indicate the presence of an optional info field. Where is the information on whether it exists?
Even if it's all written down it took me quite a while to understand how new artificial blocks are created by block node sequences. I still don't understand their meaning, though. Is this information how to restore the original files after decompression?
Does this mean there is compression information in the chunks and after that there is also information on how to restore the original compression for some blocks? I have to think about this - and I hope this is even correct. Maybe you could specify it a little further in the pdf.
But I think I got it now:
Does this work by specifying one "virtual block" consisting of all the blocks in a file and then the files both just refer to this one virtual block?
And maybe also a link for quickly jumping inside the PDF.
One more question came to my mind:
How will all the data compressed by different codecs (chunks/blocks) be appended to each other, is there some padding after each chunk in case bytes are not fully used?
Two things took me quite a while to understand, maybe you could try to make it easier for future readers:
- It is possible to define "virtual" blocks - which is awesome!
- There are two kinds of compression codecs stored in different places. The ones that the archive is compressed with (and that need to be decompressed for unpacking the archive) are specified in the chunk metadata. And the ones that are needed to restore other compressed formats, which are stored inside the archive in an uncompressed form, are specified in the block metadata.
I think there is one use case which is already supported by this specification but which could be supported better.
Consider the following folder structure:
Code:
dir_to_comp
  1.rar_uncompressed_dir
  1.rar
Rar (for example) files probably will not be able to be reconstructed from uncompressed files any time soon, so they need to be stored directly. The extracted files, on the other hand, may be extracted from the rar instead of being stored separately. Right now I think each file needs its own block node entry; if you want to split it into multiple blocks for deduplication, even more. So how about allowing a single block node entry to create multiple blocks, defined by the underlying block itself? Maybe it would be helpful to be able to redundantly store just the information about how many blocks it will be, to allow for easier/faster parsing of the remaining block nodes without having to decompress the .rar first. This count could also be useful for some corruption checks.
Some examples where I think this could be useful (for reducing the number of explicitly listed block IDs):
- Compressed archives, which cannot be recreated but where (some parts of) the included files are also stored somewhere else again. Maybe even jpegs which are (for some weird reason) also stored as a bitmap or even png extracted from them.
- Splitting files into multiple blocks for deduplication, for example using anchor hashing. It should only be used as long as the blocks should only be referenced and not removed by deduplication itself. So the most relevant use case will probably be files inside a non recreatable archive.
On the other hand this may increase the count of block IDs that "pollute the namespace" if only a few of them are needed. Although files where no blocks at all are needed could be skipped.
Consider you produce the data in the order File header -> Compressed data -> Metadata, but you want File header -> Metadata -> Compressed data.
In this case you can avoid writing the compressed data to a temp file. Just write it to the final file, and after that (when you hopefully know the size of the metadata) use these special functions to move the compressed data back, using only a metadata operation to make room for the metadata at the beginning.
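On Linux this could look roughly like the sketch below. It's hedged: it needs a filesystem that supports FALLOC_FL_INSERT_RANGE (e.g. ext4 or xfs), and the offset and length passed to fallocate() must be multiples of the filesystem block size, so the header and metadata areas would have to be padded to that alignment.
Code:
#include <fcntl.h>        // fallocate (glibc, needs _GNU_SOURCE, which g++ defines)
#include <linux/falloc.h> // FALLOC_FL_INSERT_RANGE
#include <unistd.h>       // lseek
#include <cstdio>         // perror

// Sketch: the compressed data was already written starting at headerEnd.
// Insert a block-aligned gap for the metadata without rewriting the data;
// the filesystem only updates extent mappings.
bool insertMetadataGap(int fd, off_t headerEnd, off_t metadataSize) {
    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, headerEnd, metadataSize) != 0) {
        perror("fallocate(FALLOC_FL_INSERT_RANGE)");
        return false;  // unsupported filesystem or unaligned offset/length
    }
    // The range [headerEnd, headerEnd + metadataSize) now reads as zeroes;
    // seek back and fill it with the metadata block.
    return lseek(fd, headerEnd, SEEK_SET) == headerEnd;
}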
Imagine someone has a folder where he downloads a lot of stuff. Usually these downloads are compressed, so he decompresses them into the same folder. But the user doesn't delete the compressed files. Now he wants to make a backup of this folder using Fairytale. So the *.rar files need to be stored inside the Fairytale archive because they cannot be recreated. But the files that were unpacked from these *.rar files don't need to also be stored (redundantly) in the *.ftl. They can just be extracted again from the *.rar files (which need to be stored inside the *.ftl archive) during Fairytale decompression.
For a start, I recommend looking into other formats:
- classic: zip, rar
- modern: 7z, arc
- future: thoughts, some ideas
- inspiration: zstd
- everything, including the info you saved in the archive header, should be protected by checksums
- it would be great to have an archive format that can be written strictly sequentially; this means that the index to the metadata block(s) should be placed at the archive end
- in 99.9% of cases, it's enough to put all metadata into a single block, checksummed, compressed and encrypted as a single entity
- all metainfo should be encoded as SoA rather than AoS (see the sketch after this list)
- allow disabling ANY metainfo field (crc, filesize...) - this requires that every field be preceded by a tag. But in order to conserve space, you can place flags for the standard set of fields at the block start, so you can still disable any of them, but at very small cost
- similarly, if you have default compression/deduplication/checksumming/... algorithms, allow encoding them with just a few bits, but provide the option to replace them with arbitrary custom algos
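To illustrate the SoA and field-flag points, here is a minimal sketch; the field names and flag values are made up for illustration and not taken from any existing Fairytale or FreeArc spec.
Code:
#include <cstdint>
#include <string>
#include <vector>

// Sketch of struct-of-arrays metadata: each field is its own array (better for
// compression than one record per file), and a small flag word at the start of
// the metadata block says which of the standard fields are present at all.
enum StandardFields : uint32_t {
    kHasFileSize = 1u << 0,
    kHasModTime  = 1u << 1,
    kHasCRC32    = 1u << 2,
};

struct MetadataBlock {
    uint32_t presentFields = kHasFileSize | kHasModTime | kHasCRC32;
    // SoA layout: all names together, all sizes together, and so on.
    // A disabled field simply has an empty array and costs nothing per file.
    std::vector<std::string> names;
    std::vector<uint64_t>    fileSizes;  // only if presentFields & kHasFileSize
    std::vector<uint64_t>    modTimes;   // only if presentFields & kHasModTime
    std::vector<uint32_t>    crcs;       // only if presentFields & kHasCRC32
    // Non-standard fields would follow as explicit (tag, array) pairs,
    // so any custom field can still be added without changing the format.
};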
I had started to develop a next-gen archive format based on all those ideas, but aborted it pretty early. Nevertheless, I recommend using it as the basis and just filling in the unfinished parts of my spec.
The project is migrating to GitLab because of this GitLab offer and for better coordination using the tools there. GitHub and GitLab are very similar, so this should be no big deal. As for the Microsoft part of it, I doubt they'll ruin GitHub as they did with Skype, but you never know, and there's no reason why we shouldn't give GitLab a chance; they seem to have some nice additional infrastructure over there.
Damn kids. They're all alike.
New link ?
I am just wondering, ...does anyone know what is happening with Fairytale community archiver?