
Thread: Universal archive format

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    Universal archive format

    All archive signatures are 8-byte values with a common 7-byte prefix specific to the archive type; the last byte determines the particular signature (start of archive, archive descriptor, and so on).

    The archive starts with the archive start signature, used by file-type identification tools. It is optionally followed by an archive header block that holds global information about the archive.

    Then comes the compressed archive data, followed by the archive directory describing these data, followed by the directory descriptor used to find and decipher the directory. This sequence of three blocks may be repeated several times. Optionally, an archive footer block finishes the archive. Alternative ways to identify the archive are by the signature contained in the archive footer, or by the signature of the archive descriptor, which should be looked for in the last 4 KB of the file.
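    The layout above can be sketched in code. Note that the 7-byte prefix and the last-byte type values below are invented for illustration; the post does not specify the actual bytes:

    ```python
    # Hypothetical signatures: a shared 7-byte prefix plus one type byte.
    # The real FA'Next prefix and type bytes are NOT specified in the post.
    SIG_PREFIX = b"FA'NEXT"                      # 7 bytes, assumed
    SIG_ARCHIVE_START = SIG_PREFIX + b"\x01"     # start-of-archive signature
    SIG_ARCHIVE_DESCRIPTOR = SIG_PREFIX + b"\x02"
    SIG_ARCHIVE_FOOTER = SIG_PREFIX + b"\x03"

    def identify(data: bytes) -> bool:
        """File-type identification: the archive must begin with the start signature."""
        return data.startswith(SIG_ARCHIVE_START)

    def find_descriptor(data: bytes) -> int:
        """Fallback identification: scan the last 4 KB for the descriptor
        signature. Returns its offset, or -1 if not found."""
        tail_start = max(0, len(data) - 4096)
        return data.rfind(SIG_ARCHIVE_DESCRIPTOR, tail_start)
    ```
    
    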

    ... TBD

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    For FreeArc? Aren't there already a bunch of universal archive formats (zip, 7zip, rar, zpaq...)

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    it is based on my experience with the existing FreeArc format and should be used in the next FreeArc format. i'm publishing it because it may be useful for others, and in the hope of receiving feedback

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    It's going to depend a lot on what features you want to support, like what compression formats, deduplication, fast incremental updates, history (like zpaq), encryption, volumes, error detection, error correction, remote backup support, Windows/Linux compatibility, and which file attributes to save (can you back up and restore the OS?). What are your goals?

  5. #5
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    IMHO, there's no need to redo what is already done well. But if you want to include support for more features than other archive formats, I have to say 'very good'.

    Here is something I was wondering:
    What if I want to, for example:
    * pre-process all .gif images from the set to be compressed in the way "precomp -cn"
    * pre-process all $exe with rep+dispack+delta
    * pre-process all $compressed with nothing more than rep
    * then compress everything with LZMA as a single solid block.

    In the archive formats I've seen, the first three items of the description above would each make a separate block.

    Will the ARC format allow me to use this alternative scheme? I'm not talking about a workaround like those that can be used to include packMP3 seamlessly by editing arc.ini.
    I mean no more temp files than those required by precomp in this example...

    I repeat the question: "Is it possible to include different pre-processing stages in one solid stream?" If not, I'd like you to include that possibility in this new format.

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Gonzalo, it may be preprocessed with rep+precomp+dispack+delta, since precomp/dispack/delta don't touch unsupported data types and quickly pass them to the next stage (at least on decompression)

  7. #7
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Gonzalo, it may be preprocessed with rep+precomp+dispack+delta, since precomp/dispack/delta don't touch unsupported data types and quickly pass them to the next stage (at least on decompression)
    Thanks, but they actually do touch unsupported data. I mean, if we have 5 MB in *.gif and 595 MB of other data types, and we select "precomp", FA will make a ~600 MB temp file to be digested by the whole chain. And precomp will write a lot of unneeded data to disk. Even more so if we're compressing an 8 GB folder, like I do every once in a while...

    Also, the other filters will try to find meaningful data where there is none, e.g. executable code inside an .mp4, wasting time and resources.
    Last edited by Gonzalo; 26th November 2014 at 04:36. Reason: I wrote 'hole' instead of 'whole'... Now I know I'm a beast XD

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Gonzalo, the main problem in your first example is that precomp is an external packer. with cls-precomp.dll, the data are streamed through precomp in memory (on decompression). overall, it may be useful, just low priority for me. now, i plan to support two features extending standard solid blocks:

    1. support for compression algorithms with multiple output streams like BCJ2. it will require saving, for each solid block, its list of chunks, i.e. pairs (outstream_number, chunk_size)
    2. support for interleaved solid blocks, as already implemented in nanozip. it also requires a list of chunks (solid_block_number, chunk_size)
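    Both features boil down to storing a list of chunk pairs per solid block. Here is a minimal (de)serializer sketch, assuming an invented fixed-width layout (a 4-byte count, then a 1-byte stream/block number and 8-byte size per chunk) - the actual FA'Next encoding is not specified in the post:

    ```python
    import struct

    def encode_chunks(chunks):
        """Serialize [(stream_or_block_number, chunk_size), ...] for one solid block."""
        out = bytearray(struct.pack("<I", len(chunks)))  # chunk count
        for number, size in chunks:
            out += struct.pack("<BQ", number, size)      # 1-byte id, 8-byte size
        return bytes(out)

    def decode_chunks(data):
        """Inverse of encode_chunks."""
        (n,) = struct.unpack_from("<I", data, 0)
        chunks, off = [], 4
        for _ in range(n):
            number, size = struct.unpack_from("<BQ", data, off)
            chunks.append((number, size))
            off += struct.calcsize("<BQ")
        return chunks
    ```
    
    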

    you propose to support compression algorithms with multiple input streams. i will keep it in mind, but i don't have any concrete plans to implement it

    e.g. executable code inside an .mp4, wasting time and resources.
    it can be made faster by keeping mp4 and executables in separate solid blocks, or we can ensure maximum compression by joining them into one solid block and looking for executable data in mp4 too - just in case. you propose an intermediate solution that may not be very meaningful - if we are sure that mp4 cannot contain executable code, then why do we expect it to have common data with executables and other well-compressible files? and without common data, why do we need to join mp4 and well-compressible data into the same solid block?

  9. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by Matt Mahoney View Post
    It's going to depend a lot on what features you want to support, like what compression formats, deduplication, fast incremental updates, history (like zpaq), encryption, volumes, error detection, error correction, remote backup support, Windows/Linux compatibility, and which file attributes to save (can you back up and restore the OS?). What are your goals?
    well, 10 years ago i developed the FreeArc archive format based on my previous experience with ARJZ plus ideas from the 7-zip format. some people asked me to describe it so they could use these ideas for their own archivers

    now i'm starting the FA'Next project, and eventually i will implement a new archive format as part of this project, since the existing FA format has some serious drawbacks: it doesn't allow disabling standard fields, and its variable-length integer encoding makes (de)serialization slower. maybe instead of making the format incompatible in both directions, i will just make backward-incompatible extensions to the existing FreeArc format so that new code will be able to extract old archives too (alternatively, i can just include the old code to extract old archives)

    my goals are 1) to present the new format for discussion with program users, and 2) to share with other developers ideas they can use in their programs. maybe i should rename the thread to "New FreeArc archive format"

    compared to the existing FreeArc format, i want to fix the two above-mentioned problems - allow faster (de)serialization and make the archive format more flexible, for example allowing CRCs or Unix filetimes to be dropped - as well as smaller issues such as the inflexible descriptors of solid blocks i mentioned in the previous post

    the archive format i will describe doesn't support zpaq-style deduplication (with global deduplication map and quasifixed-size solid blocks), but other features you mentioned may be implemented as work goes on. in particular:

    1) compression formats - those supported by freearc (internal/cls/external)
    2) deduplication - only as the compression method rep/srep. although whole-file deduplication may be implemented too, as it was requested by many FreeArc users
    3) fast updates - yes. the archive directory may consist of multiple parts, so new data may just be added to the end of the archive
    4) history - yes, the directory block descriptor will include a "generation" number, plus we may implement "antifiles"
    5) encryption - yes, i plan to improve the support compared to FreeArc. eventually, it will be great to implement AE/AEAD and asymmetric encryption
    6) volumes - only in the form of split archive files, or multiple proper archive files considered as a single archive
    7) error detection - not sure what you mean. i plan to use a tree version of SHA256 to protect whole files, and maybe umac/vmac to protect fixed-size blocks of compressed data
    8) error correction - yes, i will implement a RAR5-like algorithm, although i don't yet know how to fit it into the archive format
    9) remote backup support - i don't know what you mean
    10) of course, linux/windows will be supported from the start, and i hope that other platforms (mac, smartphones) will be supported too, as well as arm/mips/...
    11) which file attributes to save - since the archive format will be flexible, extra attributes will be saved as optional fields. it just needs some time to add full support for them
    Last edited by Bulat Ziganshin; 26th November 2014 at 12:14.

  10. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    seth (28th November 2014)

  11. #10
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    1. support for compression algorithms with multiple output streams like BCJ2. it will require saving, for each solid block, its list of chunks, i.e. pairs (outstream_number, chunk_size)
    Good. What I'm about to say is just a side comment. In almost every comparison I did (a few years ago I spent some time on exe filters), dispack proved much more useful than bcj2, even on ELF. Of course it is slower, but still very fast considering the huge impact it has on the final ratio. Perhaps you've found some cases where bcj2 is better than dispack. If so, please let me know the details, as I'm still interested.

    2. support for interleaved solid blocks, as already implemented in nanozip. it also requires a list of chunks (solid_block_number, chunk_size)
    This feature (if I'm understanding it correctly) would be extremely convenient, since it would allow a heuristic separation inside unknown files (a good example is a Linux distro live CD preprocessed by precomp, or a tar-like archive), compressing every different data type found with its own method. At present the last example is packed with just one method that is supposed to perform reasonably well on everything, i.e. rep+exe+delta+lzma for the whole file.
    If I'm not mistaken, this is already attempted by pcompress, so you have something to start with.

    you propose to support compression algorithms with multiple input streams. i will keep it in mind, but i don't have any concrete plans to implement it
    Now we're understanding each other!
    Even if you don't implement it right now, it is very important that you design the archive architecture with the idea in mind.

    without common data, why do we need to join mp4 and well-compressible data into the same solid block?
    You have a point there. But that is actually what is done right now when FA stores a whole folder, without looking at what's inside, into a temp file and then throws it at precomp or srep, for instance. So srep eats a lot of memory trying to find correlations where there are none.
    What I propose is: first separate the different data types, pre-process them individually (or not, if not needed), and then use the final codec to pack everything - with the exception of leaving $compressed out, and very specific methods like $wav too.
    The need for this is not clearly seen in the exe+mp4 example. But think about this scenario: many small BMP images, plus executables with BMPs compiled in as resources, plus PNG images after they've been inflated. This scheme is very likely to be found in a normal program folder. Although these three seem to be of different kinds, they're actually very similar... and would benefit from solid packing, BUT the three need different pre-processing stages.

    UF! For now, 'nuff said.
    Last edited by Gonzalo; 26th November 2014 at 22:11.

  12. #11
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Besides everything else... There are some features that, in my opinion, can make a real difference in the user experience. For example:

    *Comments for every item (files and folders). This can be done by at least two methods:
    1) User-specified descriptions. You can arbitrarily write "this is an awful photo of mine while sleeping on the chair". Most modern file managers support similar behavior by reading a file called "descript.ion".
    2) This one requires more effort: formal descriptions extracted from the metadata of the actual stored file, for example ID3 tags, EXIF info, etc.
    The comments can be displayed in a side column. Explorer.exe and Dolphin from KDE do it this way.
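    As a side note, the "descript.ion" convention mentioned in 1) is simple to parse: one line per file, with names containing spaces wrapped in double quotes, followed by the free-text comment. A minimal sketch (edge cases handled by real implementations are ignored):

    ```python
    def parse_descript_ion(text):
        """Map file name -> comment from descript.ion-style lines."""
        comments = {}
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith('"'):
                # quoted name: everything up to the closing quote
                end = line.index('"', 1)
                name, comment = line[1:end], line[end + 1:]
            else:
                name, _, comment = line.partition(" ")
            comments[name] = comment.strip()
        return comments
    ```
    
    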

    *Thumbnails pre-computed and stored while compressing, for a faster and better browsing experience. Let's hope the DLI author wants to open its source.

    *Icons for everything displayed as %1 (*.exe, *.icl, etc)

    The reason for these odd comments of mine is rather simple. For example... if you have a folder with 500 images named "DCIM-YYYYMMDD-HHMMSS.jpg", how in the world will you know which one you want to see??? Now, if you have thumbnails... Who is WinZip???

    *I'm very tired so let's continue at another time
    Last edited by Gonzalo; 26th November 2014 at 22:45.

  13. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    9) remote backup support - i don't know what you mean
    Like I just added to zpaq. You do an incremental backup and it produces a file that you upload and delete locally. You need a local index to know what files you already have backed up and when they were last modified. If you want dedupe then you need to store hashes too.
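    The local-index idea Matt describes can be sketched as follows; the index layout and function names are illustrative, not zpaq's actual format:

    ```python
    import json
    import os

    def load_index(path):
        """Load the local index: {file_path: {"mtime": ...}}. Empty if absent."""
        try:
            with open(path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def files_to_backup(index, root):
        """Return files under root that are new or modified since the last
        backup, judged purely by the local index - no remote access needed."""
        changed = []
        for dirpath, _, names in os.walk(root):
            for name in sorted(names):
                p = os.path.join(dirpath, name)
                mtime = os.path.getmtime(p)
                entry = index.get(p)
                if entry is None or entry["mtime"] != mtime:
                    changed.append(p)
        return changed
    ```

    After each incremental backup, the archiver would record the new mtimes (and content hashes, if dedupe is wanted) in the index, then the uploaded archive part can be deleted locally.
    
    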

    Anyway, designing an archive format is a hard problem. When I designed zpaq originally, I thought I had enough hooks to add stuff I might want later, but I didn't have a good way to add dedupe and rollback without breaking forward compatibility. So now it is more complex than it needs to be to support streaming and journaling format for backward compatibility. I would rather keep it as simple as possible. The longer the spec, the more code you have to write.

    But I guess you don't plan to write a general purpose decompression language either.

