
Thread: Community Archiver

  1. #1
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts

    Community Archiver

    Lately, some of the people here, including me, have been thinking about joining forces to build an encode.ru community archiver. There are many talented people here and they are already working together in small, loosely coupled teams - this is great, and I really like the spirit of the forum and the various discussions, projects and benchmarks here. But it seems that although all the puzzle pieces for a very good archiver are there (compression, archiving, usability, extensibility, benchmarking, experimental stuff), there's no common codebase yet. So let's create one!

    The latest discussion was in the paq8px and Precomp threads (I think there were other threads, feel free to link them):
    paq8px thread (Aug 2017), post by mpais
    Precomp thread (Oct 2017), posts by mpais/schnaader/Gonzalo - post 1, post 2, post 3, post 4

    Main goals:
    - Getting many people involved: After the rough archiver skeleton is ready, it should be as easy as possible for people to join the project, test their ideas, benchmark the archiver and use it to get their stuff compressed.
    - Modularity/extensibility: For developers, it should be easy to create new modules to compress certain file formats or replace existing modules
    - Flexibility: Users will decide what they want - experimental state-of-the-art compression (something like paq/cmix/EMMA), practical max. compression (everything above 1 MB/s, for example), very fast compression or even deduplication only
    - Early prototypes: Users and benchmarkers should be able to start right away with the first versions, both a CLI and a GUI should be available, satisfying both simple (just compress it) and advanced (let's see what this parameter does) use cases
    - Wide range of algorithms: Texts, images, audio, structured data, recompressing already compressed data, ... - and if you're missing something or want to test something, feel free to do so!
    - Documentation, teaching, best practices, showcase: I really like Matt's DCE, as it shows that compression is no black magic and there are many different facets of compression - documentation in the project should be good and design choices should be explained, so even if people don't contribute, they can browse the code or the wiki and learn something.

    I created a GitHub repository for the planning stage and used issues to enable discussion on things like naming, programming language and license choice. Feel free to share your own ideas and comments there or in this thread. The "restrict editing to collaborators only" option is disabled, so the only barrier should be a GitHub account, but there might be others - please tell me if you're not allowed to do something and I'll try to fix it.
    http://schnaader.info
    Damn kids. They're all alike.

  2. The Following 10 Users Say Thank You to schnaader For This Useful Post:

    78372 (17th February 2018),Bulat Ziganshin (6th November 2017),Darek (1st November 2017),load (1st November 2017),nuclear_hangul (25th February 2018),PrinceGupta (1st November 2017),samsat1024 (2nd November 2017),Shelwien (2nd November 2017),Simorq (17th February 2018),Stephan Busch (1st November 2017)

  3. #2
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    I would suggest calling this community archiver 'fairytale' because it matches my dreams of a parsing and archiving engine.
    Fairytale also stands for everything magical and unbelievable, everything that is beyond the border of what is possible at this time.

  4. #3
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Well, I'm obviously in

    I already have a rough sketch of what I'd do in the pre-processing stage, as I mentioned in the precomp thread.
    I'll see if I can write it up in detail, including pros and cons of existing solutions, so others can chime in with their insights.

    I suppose anyone interested in joining should describe what they bring to the table, so we can better assign tasks. I'll start.

    I'm mostly just a researcher, so most of my projects aren't meant for wide distribution, meaning that I lack experience in the "best practices"
    of software development. I also never really cared much about low-level stuff, so I'm not the guy to squeeze every last drop of performance
    from some routine. I'm not well versed in C++ but agree that it should be the language of choice, at least for the main core.
    My experience is in writing compression routines, as in EMMA/PackRAW/paq8, and to a lesser extent, in writing parsers/transforms for many formats.

  5. The Following 3 Users Say Thank You to mpais For This Useful Post:

    PrinceGupta (1st November 2017),samsat1024 (2nd November 2017),Stephan Busch (1st November 2017)

  6. #4
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    If I could help with testing, I'm also in.
    Regarding the name: Fairytale - I like it!

  7. The Following User Says Thank You to Darek For This Useful Post:

    Stephan Busch (1st November 2017)

  8. #5
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    I can help with bug hunting and translation of the documentation and interface to Spanish.

    As to the name, my choice would be 'Definitive', because if this works, it will be the definitive answer to the data compression issue! It's kind of catchy too. But Fairytale also sounds good.

  9. #6
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Self description - good idea, here's mine:

    In addition to the C++ and compression experience I got from Precomp, I work as a software developer and study something closely related to software development, so I know most of the best practices (and at work, I even do my best to follow them). I did a lot of C# and GUI development at work and had a look at C++ with Qt last year at university, so I'm looking forward to doing some GUI-related things in the project as well.
    http://schnaader.info
    Damn kids. They're all alike.

  10. #7
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Some top of the class open source algorithms and programs that can be reused:


    Archive parsers and extractors:

    *libzpaq
    *7-zip code for about two dozen different types.
    *The unarchiver for "more formats than I can remember", "stuff I don't even know what it is", in its author's words.
    *QuickBMS: supports tons of file formats, archives, encryptions, compressions (over 500), obfuscations and other algorithms. Currently, over 2,100 plugins to open different archives. Mostly games but also normal packers, like balz.


    Specific compressors and recompressors:

    Uncompressed audio:
    *TTA: very fast while maintaining good compression.
    *Optimfrog: stronger/slower
    *Wavpack: the one used on zipx
    *FLAC: "the fastest and most widely supported lossless audio codec" according to its authors.
    *ALAC.
    JPG images:
    *Lepton: fastest, weakest.
    *PackJPG: medium, no arithmetic.
    *Paq model: strongest, slowest, no progressive.
    MP3 audio:
    *PackMP3: ~15% savings.
    MP2 audio:
    *unpackMP2+grzip:m3 (as in fazip): ~19% savings, 2-3x faster than packMP3.
    Deflate, bzip and LZW (gif):
    *precomp
    zlib:
    *Anti-z
    Microsoft algorithms:
    *wimlib (no working recompressor implemented yet, but the code is ready to use)


    General purpose codecs:

    Asymmetric:
    *LZMA - Deprecated in favour of LZMA2
    *LZMA2
    *Radyx: LZMA2 with a more parallelizable match finder, can fit a larger dictionary in the same RAM so helpful with ~2-4gb machines and large archives.
    *CSArc: faster than LZMA2, still good compression and good filters too.
    *BSC: A little stronger/slower than LZMA.
    *ZSTD: very efficient on fast compression.
    *LZO: hellishly fast compression.
    *GLZA: good on text, not so much on binary.

    Symmetric:
    *MCM: fast cm
    *Grzip: bwt
    *ppmd: good and fast on text, not so much on binary
    *paq* family: best ratios, worst speed.


    Filters:

    Dedupe:
    *Per file: as in WIM or squashfs files.
    *Bulat's rep: Very fast and efficient; memory hungry.
    *zpaq's hash based: works best at large distances and can be reused in an incremental run.
    *rzip
    *zstd new implementation

    Executables:
    *BCJ2
    *E8E9
    *Dispack
    Delta:
    *Bulat's
    *Igor's
    Text:
    *XWRT
    *FA's lzp and dict


    Data rippers (used to identify, for example, a JPG image embedded in an unknown container and process it with a corresponding algorithm):

    *paq8px detection code for uncompressed audio and bitmaps, exe code, gif, jpeg and zlib
    *precomp detection code for gif, jpeg, mp3, pdf bitmaps, deflate and bz2 streams
    *extrJPG (from the author of packJPG)
    *Dragon UnPACKer / Hyper Ripper: 23 formats supported: AVI;BIK;BMP;DDS;EMF;GIF;IFF;JPEG;MIDI;MOV;MPEG Audio;OGG;PNG;TGA;VOC;WAV;WMF;XM and a few more prone to false positives. Pretty slow if the container is unknown.

    Those are just a few. Feel free to add your thoughts.

  11. #8
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    377
    Thanks
    139
    Thanked 198 Times in 108 Posts
    What I would like to see in the archiver is something similar to RarVM or zpaql, but a higher-level language, designed with compression and filters in mind. I attempted to test something similar in https://encode.ru/threads/1464-Paq8p...ll=1#post47706,
    but more advanced. I can't explain exactly what I mean.

    This is probably not in the scope of Community Archiver.
    KZo


  12. #9
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    I'm often thinking about something like that, too - and that's one of the reasons why I wouldn't say it's out of scope (although it definitely is a thing for later stages and needs lots of work and planning).

    The other reason is that it supports the goals: Extensibility (kind of updating the archiver without the need for new binaries) and modularity (in a very extreme and often useful way, making it possible to use specialized routines for some seldom used format or even just a single unique file).

    However, even with a high-level abstraction, it will be very hard to explain to most users how to use it or even what it is, so there's kind of a conflict with usability. E.g. even for me as an experienced developer, zpaql has a very steep learning curve and I haven't managed to make use of it yet because of that.

    On the other hand, when properly done, people can tell the archiver things like "Well, my file consists of 2 columns, sorted by the first, which are dates from this year and in the second column, there are descending hex numbers" assisted by a high level description language or even a GUI assistant. Which is much better than "Learn C++ and write an extension/module for the archiver"
    http://schnaader.info
    Damn kids. They're all alike.

  13. #10
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    An alternative to writing an archiver from scratch, which is a daunting task, is to apply ourselves to improving an existing archiver. For example, PeaZip is already cross-platform and has a GUI. FreeArc was one of the best in its day and still holds up. It also has a GUI and is multi-platform, with a very flexible archive structure. It is abandoned now, so why not pick up where the author left off? With the additional advantage that it is pretty well known.

    Of course, the internal design may not be what we have in mind, but it is good to start with, and we can change it any time we want until it reaches a stable version.

    https://github.com/svn2github/freearc/network/members
    Last edited by Gonzalo; 2nd November 2017 at 00:02.

  14. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Actually codecs are not very important at this point.
    What's much more important is a modern format design.
    Years ago, when formats like .zip and .rar were designed, it was ok for an archive to be just a sequence of compressed files (with headers containing metainfo).
    But now we have new features, like MT, dedup and recompression, which require support in the format.
    And unfortunately there's no existing format which we could use as a reference - for example, only .7z supports filter trees and multiple output streams
    (which is very helpful for recompression), while .rar5 has file dedup and error recovery which .7z lacks.

    In short, I think the new format has to integrate as much support for new ideas as possible.
    Filter trees and related stuff, codec switching, new solid modes (eg. dictionary-based compression), virtual folders, generated contexts...
    And for that we need to gather these ideas and think what kind of format support they would require.

    Btw, I think the rarVM/zpaql idea proved to be a failure - it adds potential exploits and redundancy to archives,
    while making it _much_ harder to extend the format. Neither .zpaq nor .rar ever had any useful additions based on this feature.
    So my proposed solution is signed binary plugins - once the archiver encounters an unknown codec id, it can simply look it up
    and download it from some repository, while the plugin signature would prevent exploits.

    ...Also, it's always a good idea to consider the use cases first.
    Like, do we care about compression time or archive size, incremental backup, pipe i/o support (aka stdio), etc.

  15. #12
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Quote Originally Posted by Shelwien View Post
    What's much more important is a modern format design.
    I agree, that should be one of the first concrete steps in the planning stage, so we've got a first draft of the format and can evaluate its pros and cons. Some comments on the things you mentioned in your post:

    Filter trees

    With the recursive two-stage deterministic/non-deterministic concept and the flexible codec choice that mpais has suggested, we already have something like the filter trees in 7-Zip. The user can define which codecs should be used, and they are recursively applied to the data until either the recursion limit is hit or the parsers don't find anything more.
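
    For illustration, here is a minimal sketch of how such a recursive two-stage loop could look. The Block/detect/transform names are assumptions for the sake of the example, not the project's actual API, and the hooks are stubs:

    Code:
    // Recursive two-stage processing sketch: detect known blocks, transform them,
    // then re-run detection on the transformed output until the recursion limit
    // is hit or the parsers find nothing more.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Block { std::size_t offset = 0, length = 0; int type = 0; };

    // Stub hooks; real ones would wrap e.g. the deflate or JPEG handlers.
    static std::vector<Block> detect(const std::vector<std::uint8_t>&) { return {}; }
    static std::vector<std::uint8_t> transform(std::vector<std::uint8_t> d, const Block&) { return d; }

    static std::vector<std::uint8_t> process(std::vector<std::uint8_t> data, int depth, int maxDepth) {
        if (depth >= maxDepth) return data;                       // recursion limit reached
        for (const Block& b : detect(data)) {
            std::vector<std::uint8_t> part(data.begin() + b.offset,
                                           data.begin() + b.offset + b.length);
            // The transformed output may itself contain further known streams,
            // so recurse on it before recording it in the block tree.
            std::vector<std::uint8_t> out = process(transform(part, b), depth + 1, maxDepth);
            (void)out;                                            // splicing the result back is omitted here
        }
        return data;
    }

    int main() { process({}, 0, 8); }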

    (File) deduplication

    We need and will have deduplication - what mpais suggested sounds similar to the deduplication I want to implement in Precomp, kind of a "parser-assisted deduplication" with markers at useful places (beginning of a file, beginning of a stream, ...) as a good balance between file deduplication (much too coarse, doesn't work if only a part of a file changes) and "bytewise" full-blown deduplication like SREP (which needs a lot of time and memory). Although I tend towards implementing the latter as an optional choice for users.
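
    As a minimal sketch of the marker-based idea (assumed structures only; FNV-1a stands in for a proper collision-resistant hash): chunks are delimited by parser-provided markers and looked up in a hash table, so it sits between per-file dedup and full bytewise dedup like SREP.

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    static std::uint64_t fnv1a(const std::uint8_t* p, std::size_t n) {
        std::uint64_t h = 1469598103934665603ULL;
        for (std::size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
        return h;
    }

    struct Chunk { std::size_t offset, length; };   // delimited by parser markers (file/stream starts)

    int main() {
        std::vector<std::uint8_t> data;             // archive input
        std::vector<Chunk> chunks;                  // filled in by the parsers
        std::unordered_map<std::uint64_t, std::size_t> seen;   // chunk hash -> first offset
        for (const Chunk& c : chunks) {
            std::uint64_t h = fnv1a(data.data() + c.offset, c.length);
            auto it = seen.find(h);
            if (it != seen.end())   // duplicate: store a back-reference instead of the data
                std::printf("chunk at %zu duplicates chunk at %zu\n", c.offset, it->second);
            else
                seen.emplace(h, c.offset);
        }
    }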

    Codec switching

    Not sure exactly what you mean here, but it sounds like something we have - we want to give users the chance to define which codecs they use, both in a very coarse "fast .. slow" way and in a finer way (packJPG on/off, exe filter on/off, ...) for more advanced users. This also covers the "compression time or archive size" topic you mentioned - the user is free to choose, and we try to implement both as well as we can.

    Signed binary plugins

    Agree with you here, this would be a nice modern feature and is easier to implement and more flexible than the other "forward compatibility/VM/intermediate language" solutions.

    (Incremental) backup

    Looking at ZPAQ, I'm not sure if we should mix backup with compression. It adds a new layer of complexity and requirements (that sometimes contradict compression). The ZPAQ CLI especially suffered from it: first it was very overloaded, and after Matt cleaned it up, people needed some of the removed functionality added back again. Also, when deduplication and recompression are combined in an archiver and done right, it's not that far from an incremental backup solution and can be adjusted/used in a separate backup project.

    Pipe support (aka stdio)
    I think this is ruled out by the recursive two-stage concept. If I got mpais' idea right, it's about processing the whole file first and building kind of a tree map of it where we can see what parts it consists of. After that, we can let the codecs process the parts of the file where they fit, in parallel, which supports MT. We also want reordering of the data later for better locality. Doing all this with pipe support is only possible if we buffer the whole file, or at least a very large part of it, in memory. This memory need conflicts with three requirements I find much more important than pipe support: MT (which basically multiplies the memory needed by the number of cores), advanced compression (which needs all the memory it can get, e.g. think of cmix) and big file support/deduplication (sorry, we can't process your 100 GB file unless you have X + 100 GB of RAM).
    Yes, I'm sure there are ways to deal with it, but my personal opinion on this point is to be very careful about it because it might get very complex very fast (similar to the incremental backup). But I'm also curious about other opinions here, especially mpais' as he followed such a strict streaming strategy in EMMA.
    http://schnaader.info
    Damn kids. They're all alike.

  16. The Following 2 Users Say Thank You to schnaader For This Useful Post:

    Bulat Ziganshin (1st March 2018),Stephan Busch (2nd November 2017)

  17. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > The user can define which codecs should be used and they are recursively
    > applied to the data until either recursion limit was hit or the parsers
    > didn't find more.

    What I mean is something like this: nishi.dreamhosters.com/7zcmd.html
    Filter chains are different from filter trees.
    Also, multiple-output-stream support is very important imho - it's usually
    possible to integrate stream interleaving into any specific codec/filter,
    but then you have to integrate specialized compression for the streams, too.

    > We need and will have deduplication

    I suggest looking at zstd (it implements an anchor-hashing version),
    zpaq (there's an interesting tweak with data entropy estimation -
    redundant data ends up having longer chunks) and pcompress
    (it has minhash and bsdiff for similar chunks).

    > as a good balance between file deduplication (much too coarse, doesn't work
    > if only a part of a file changes) and a "bytewise" full-blown deduplication
    > like SREP (needs much time + memory).

    1. Actually a good archive format probably needs all of them.
    File dedup can be more efficient than stream dedup, because you
    can simply copy files at extract - no need to keep everything in memory.

    Also, sometimes it can be better to dedup filtered data, sometimes not.

    And, as I mentioned, I'm also considering generated contexts for dedup -
    for example, we can compress reference data with lzma to dedup against an lzma stream,
    instead of unpacking the lzma stream (which is very hard to do with a small diff size).

    > Codec switching
    > Not sure what you exactly mean here, but sounds like something we have

    1. Most formats can only assign codecs per file, not for random chunks of the file.
    Only rar had proper LZ/PPMd switching, but they dropped it in rar5.

    2. An important feature is optimization - i.e. assigning codecs to chunks
    in a way that improves compression (a rough sketch of this follows after point 3).
    For example, lzma2 has an option to set different lc/lp/pb for specific chunks,
    but I'm not aware of any tools to optimize the layout.

    3. Proper implementation would require archiver framework support,
    in particular because of context/dictionary mode.
    I mean, suppose we're analyzing a file, and see that some chunks can
    be better compressed with ppmd, some with lzma.
    Independent compression of these chunks is obvious, but we actually
    can implement solid mode too - by providing previously unpacked data
    as a dictionary for the next codec.
    This certainly requires support in archiver framework, though.
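
    A rough sketch of the chunk/codec layout optimization from point 2 (assumed names; compressedSize is a stub that would actually run each candidate codec or lzma2 parameter set):

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <vector>

    struct Codec { int id; };   // e.g. lzma2 with a given lc/lp/pb, ppmd, ...

    // Stub: would actually compress the chunk and return the compressed size.
    static std::size_t compressedSize(const std::vector<std::uint8_t>& chunk, const Codec&) {
        return chunk.size();
    }

    // For each chunk, keep whichever candidate codec yields the smallest output.
    // A real optimizer would also account for codec-switch overhead and solid mode.
    static std::vector<int> assignCodecs(const std::vector<std::vector<std::uint8_t>>& chunks,
                                         const std::vector<Codec>& codecs) {
        std::vector<int> choice(chunks.size(), -1);
        for (std::size_t i = 0; i < chunks.size(); i++) {
            std::size_t best = std::numeric_limits<std::size_t>::max();
            for (std::size_t c = 0; c < codecs.size(); c++) {
                std::size_t s = compressedSize(chunks[i], codecs[c]);
                if (s < best) { best = s; choice[i] = static_cast<int>(c); }
            }
        }
        return choice;
    }

    int main() { assignCodecs({}, {Codec{0}}); }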

    > Agree with you here, this would be a nice modern feature and is easier to
    > implement and more flexible than the other "forward compatibility/VM/intermediate language" solutions.

    The main problem with the rarvm-like forward-compatibility scripting approach
    is that even the original purpose for it was very different.
    Afaik it was intended for potential implementation of exe filters (for different platforms)
    and text filters (like space stuffing).

    Unfortunately there's a potential exploit every time a new script is added,
    because the unpacker checks the script's crc and uses a compiled implementation if it matches,
    while an older unpacker is supposed to interpret the script.
    So when a new version is released with an added script, it becomes possible to trick
    AVs that unpack rars to check files using an older version of the unrar library.

    Well, in any case, it might make sense for simple scripts like E8, but
    surely not for actual codecs - even if we had a gcc port targeting zpaql,
    just imagine how much overhead a scripted version of something like packJPG would add.

    > Looking at ZPAQ, I'm not sure if we should mix backup with compression.

    It's one of the main remaining applications, though.
    There's usually a lot of data, and the compression ratio is pretty important.

    As to other possibilities... zip is actually quite enough for mailing documents,
    and anyway you can't gain much without recompression (and afaik precomp can't handle .docx atm).

    Then, for filehosting and stuff you don't need archive formats.
    Same for game repacks.

    Sources are usually handled by repository software anyway.

    What else is there?
    As to me, I only create archives to upload some experiments to /u.
    And still use rar for these - mainly because rar has archive repair, unlike 7z.

    > Pipe support (aka stdio)
    > I think this is ruled out by the recursive two stage concept.

    Not really.
    Sequential processing for unpacking only may still be ok.
    Also, having format support for data analysis doesn't mean that
    you can't have a "fast mode" which adds stuff sequentially without analysis.

  18. #14
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    I've been really busy with work, so I haven't had the time to write a somewhat formal proposal for the analysis stage.
    But I'm with Shelwien on this: we first need a clear set of requirements, on which we will then base our decision on what is needed
    in terms of archive format to support all of them.
    I'm not aware of any open-source archiver skeleton that provides what we need.

    The pre-processing/transformation stage I proposed is designed to be as close to optimal as possible, within reasonable complexity.

    Currently, both precomp and paq8px parse the input a byte at a time, looking for signatures of known data types. When something is found,
    it's signaled and dealt with accordingly, and we proceed to parse the remaining input. This is sub-optimal because some data formats have further
    recompressible data types embedded in them.

    As an example, consider JPEG images. When I looked at precomp's code and saw that I could improve the parsing code for JPEG, I couldn't add
    code for detecting embedded thumbnails. Once the main image is found, it's passed to PackJPG, which itself doesn't compress the thumbnails.
    The only way to do so would be to recurse on the JPEG main image, detect any embedded thumbnails, compress those with PackJPG, pad the
    recompressed output with 0's to match the original length of the thumbnail (so as not to break the JPEG marker structure), overwrite the original
    thumbnail and only then call PackJPG on this modified JPEG stream.
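
    A minimal sketch of just the pad-and-overwrite step described above (the recompressThumbnail helper is a hypothetical stand-in, not PackJPG's real API):

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical stub standing in for a real thumbnail recompressor.
    static std::vector<std::uint8_t> recompressThumbnail(const std::vector<std::uint8_t>& t) { return t; }

    static void replaceThumbnail(std::vector<std::uint8_t>& jpeg,
                                 std::size_t thumbOffset, std::size_t thumbLength) {
        std::vector<std::uint8_t> thumb(jpeg.begin() + thumbOffset,
                                        jpeg.begin() + thumbOffset + thumbLength);
        std::vector<std::uint8_t> packed = recompressThumbnail(thumb);
        if (packed.size() > thumbLength) return;   // no gain, keep the original thumbnail
        packed.resize(thumbLength, 0);             // zero-pad to the original size, so the
                                                   // JPEG marker structure stays intact
        std::copy(packed.begin(), packed.end(), jpeg.begin() + thumbOffset);
        // The modified JPEG stream would then be handed to PackJPG as a whole.
    }

    int main() {
        std::vector<std::uint8_t> jpeg(64, 0xFF);
        replaceThumbnail(jpeg, 16, 8);
    }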

    Now consider almost all raw photo formats and TIFF images. A TIFF file can contain hundreds of images (multipage), embedded thumbnails and
    pretty much any other type of data imaginable. RAW photos commonly have 2 or even 3 JPEG thumbnails, aside from the raw data itself, which
    sometimes is also split. These container type formats, when detected, are processed to read the whole directory structure at once, so we need
    a way to signal possibly hundreds of detections at once. I do this in PackRAW, which already uses a very simplified version of what I'm proposing.
    And let's not forget that in the metadata for these formats there might be other known data types, so you can't just skip to the end of the last detected
    block and resume parsing from there. This is why in paq8px I made the JPEG model also process HDR blocks, because when TIFF images are
    detected, the whole block from the beginning of the TIFF header until the start of the image data is marked as HDR and no further parsing is done.

    Deduplication based on the detected block structure is a no-brainer: if you have all this structure describing the data, it makes sense to add a simple
    step to check for identical blocks. This means that, for the archive format itself, each block will need to have a "reference count" field, where we
    specify how many times it is used by the files included in the archive. This field would then be decremented when we delete a file from the archive,
    but the block itself would only be deleted when its reference count reaches 0. This is the kind of requirement I meant at the start of this post.
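
    A minimal sketch of such a reference-counted block table (the field names are assumptions, not a finalized format):

    Code:
    #include <cstdint>
    #include <vector>

    struct BlockEntry {
        std::uint64_t hash = 0;                    // identifies the deduplicated block
        std::uint64_t storedOffset = 0, storedSize = 0;
        std::uint32_t refCount = 0;                // how many files in the archive use this block
    };

    struct FileEntry { std::vector<std::uint32_t> blockIds; };   // ordered list of block ids

    static void deleteFile(const FileEntry& f, std::vector<BlockEntry>& blocks) {
        for (std::uint32_t id : f.blockIds)
            if (blocks[id].refCount > 0 && --blocks[id].refCount == 0) {
                // last reference gone: the stored block data can now be reclaimed
            }
    }

    int main() {
        std::vector<BlockEntry> blocks(1);
        blocks[0].refCount = 2;
        deleteFile(FileEntry{{0}}, blocks);        // count drops to 1; the block stays
    }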

    Codec switching is, for the end users, the main selling point of this archiver: the possibility of completely customizing compression to suit their needs,
    while knowing that the options available will provide the best possible tradeoff for what they want. Have an 8 TB multi-bay NAS full of your raw photo
    collection and want to reclaim some space without waiting 3 months? Use the models from PackRAW. Just want to send a raw photo by email and must
    absolutely try to get it under 10 MB, even if it takes 3 minutes to compress? Use the model from EMMA.

    As for piping support, EMMA can do it, but then we'd be limited as it is: no deflate recompression, no transform that needs to buffer a lot of data,
    online parsing only. EMMA only needs to buffer 16 bytes before it can start compressing, and everything is sequential. The reason for this 16-byte
    requirement is the x86/x64 address transform. It uses a simple parser for x86/x64 instructions, which can be up to 15 bytes long.
    The parsers and transforms run very slightly out of sync with the compression engine, and that makes everything so much more complex.
    All parsers must be able to resume parsing after a jump in position due to a different parser having signaled a detection.
    The transforms work on "future" data when compressing, and on "past" data when decompressing, so my buffer handler has to allow repositioning,
    peeking at values and keeping track of overflow bytes. All this extra complexity takes a heavy toll on compression speed.

    Reordering of the blocks is also a requirement for the archive format, since it's needed for solid compression mode.
    We need a structure that describes the directory tree of the files included in the archive, another structure that
    lists the information for each file (name, directory id, size, ..., ordered list of blocks that compose it) and a structure
    to hold the block segmentation info.
    For reordering, as I discussed on the paq8px thread with Shelwien, I'd like to try using some advanced heuristics to reorder
    blocks not just by their type, but also by their similarity. Note that even in solid mode and with reordering active,
    multi-threading is still possible. Though we compress, say, all JPEG blocks in order, we can at the same time compress, in order,
    all other block types.
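
    As a minimal sketch, the three index structures mentioned above might look something like this (names and fields are assumptions only, not a finalized format):

    Code:
    #include <cstdint>
    #include <string>
    #include <vector>

    struct DirEntry  { std::string name; std::int32_t parentId = -1; };              // directory tree node
    struct BlockInfo { std::uint64_t offset = 0, size = 0; std::uint8_t type = 0; };  // block segmentation info
    struct FileInfo  {
        std::string name;
        std::int32_t dirId = -1;
        std::uint64_t size = 0;
        std::vector<std::uint32_t> blockIds;       // ordered list of blocks composing the file
    };

    struct ArchiveIndex {
        std::vector<DirEntry>  dirs;
        std::vector<FileInfo>  files;
        std::vector<BlockInfo> blocks;             // can be reordered by type and similarity
    };

    int main() { ArchiveIndex idx; (void)idx; }
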
    Last edited by mpais; 2nd November 2017 at 23:41. Reason: Fix typos

  19. The Following 2 Users Say Thank You to mpais For This Useful Post:

    schnaader (2nd November 2017),Stephan Busch (2nd November 2017)

  20. #15
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    What about a mobile implementation? We are used to compiling programs only for desktop machines and generally using them from the console. Now, if the goal here is to produce something graphical for the general public, maybe we have to consider building an Android APK / iOS package.

    I know that this will have a very low priority, but I mention it now because if we are going to do it, we have to pick a suitable language from the very beginning to span all the different systems.

    In this case, I would name Xamarin as a very good platform to write in Ruby or C# and publish on all major mobile OSes.

    All compression engines we currently have are written in C/C++, so I think there is no discussion there.
    As to the graphical interface, maybe Qt would also be a good choice in order to reach all platforms. Or HTML5 - there are some wrappers that allow using it as a desktop app.

  21. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Both Android and iOS are unix-based, so it's not really a problem to use C++ libraries there.
    Starting from the GUI also won't solve the format issues described above.
    And in any case, C and C++ are the only languages for which good optimizing compilers are available.

    On the other hand, I've never had to use an archiver on iOS or Android.
    And MS is going to introduce an x86 emulator on Windows/ARM.
    So maybe there's no need to do anything special at all for mobile.

  22. #17
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Quote Originally Posted by Shelwien View Post
    Both Android and iOS are unix-based, so it's not really a problem to use C++ libraries there.
    Starting from the GUI also won't solve the format issues described above.
    And in any case, C and C++ are the only languages for which good optimizing compilers are available.

    On the other hand, I've never had to use an archiver on iOS or Android.
    And MS is going to introduce an x86 emulator on Windows/ARM.
    So maybe there's no need to do anything special at all for mobile.
    Exactly! No problem with existing libraries.
    And the GUI has little to nothing to do with format design. I'm talking about language selection for the project, one of the four issues stated by Christian.
    Regarding compression apps, well, there are a few that are actually very good. You can create zipx and rar5 archives on Android using official apps. There is also an app that can even decompress FreeArc archives, among others. And while I don't use any of them either, the trend is moving towards mobile computing. The most recent devices are more powerful than my laptop right now.

  23. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    There's no need to use the same language for the GUI and the archive format library.
    And the only language that makes sense for the format library is C++.
    Maybe Go/Rust in a few years, but not now.

  24. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (5th November 2017)

  25. #19
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Quote Originally Posted by Shelwien View Post
    There's no need to use the same language for the GUI and the archive format library.
    And the only language that makes sense for the format library is C++.
    Maybe Go/Rust in a few years, but not now.
    Well, that's exactly what I'm trying to say: we need two languages, one for the structure (archive format, compression algorithms, etc.) and another for the GUI.

    For the first, the only option is C++; there is no arguing there. Now, for the second one, that is where people might want to state their preferences. But it definitely needs to be something portable.

  26. #20
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear colleagues,

    my idea/recommendation is to use an appropriate file hash function as the file name for storing files inside the archive. The internal archive structure would therefore be very flat - only hash filenames. The directory/file logic could then be kept separate, "outside", for extended character and long filename support.

    Best regards,

    FatBit
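
    A minimal sketch of the flat, hash-named storage idea above (FNV-1a is only a stand-in for a real cryptographic hash; all names are illustrative):

    Code:
    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    static std::string hashName(const std::vector<std::uint8_t>& data) {
        std::uint64_t h = 1469598103934665603ULL;
        for (std::uint8_t b : data) { h ^= b; h *= 1099511628211ULL; }
        char buf[17];
        std::snprintf(buf, sizeof(buf), "%016llx", static_cast<unsigned long long>(h));
        return buf;
    }

    int main() {
        // Flat internal storage: contents live under their hash name; a separate
        // index maps the real (possibly long/Unicode) paths to those names.
        std::unordered_map<std::string, std::string> index;
        std::vector<std::uint8_t> content = {'h', 'i'};
        index["some/very/long/unicode path.txt"] = hashName(content);
        // Identical contents get the same internal name, which also gives file-level dedup.
    }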

  27. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. Files and data segments are different things.
    There doesn't have to be a direct mapping.
    For example, there can be a compressed segment for bcj2 addr streams (all addrs from the whole archive) -
    it doesn't contain anything meaningful by itself.

    2. It would be nice to have support for virtual files - for example, when compressing a bunch of .docx files
    (which are actually zip archives) it would be good to extract the files from these and sort them before compression.

    3. It may be a good idea to actually handle the directory tree as a tree in the archive index, rather than treating file paths as filenames.
    LZ compression of the archive index (like in .7z) can partially compensate for repeated path prefixes, but it's less efficient than
    a specialized data structure.
    Proper archive unpacking to NTFS is also only possible with a tree structure - folder timestamps have to be set in tree order to remain correct.
    Btw, the paths can be surprisingly long - winapi supports up to 32k unicode characters per filename, which is up to 96k as utf8.

    4. Properly working with windows filenames is very hard. The normal path limit for C/C++ standard library functions is 260 symbols -
    32k paths are only unlocked by prepending the \\?\ prefix to paths (win10 apparently has a workaround, but we can't make a win10+-only archiver).
    Then, a file can have multiple data streams, security attributes, and who knows what else.

    In particular, one unique feature of windows file systems is "short names" - the same file can be accessed both as "LongFilename" and "LONGFI~1".
    This is important because, if the directory scan is implemented using short names, it becomes potentially possible to break the 32k filepath limit for long names in the archive.
    I did a test once and was able to reach 12 MB or so per file path.
    Obviously, most programs won't work with that kind of filesystem, but NTFS itself doesn't have a problem with it.
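
    For illustration, a minimal sketch of the \\?\ handling mentioned in point 4 - just the path preparation step, assuming the path is already absolute (the prefix disables normalization and requires backslashes):

    Code:
    #include <string>

    static std::wstring extendPath(std::wstring path) {
        for (wchar_t& c : path)                    // the \\?\ form needs backslashes
            if (c == L'/') c = L'\\';
        if (path.compare(0, 4, L"\\\\?\\") != 0)   // prepend the prefix if it's not there yet
            path.insert(0, L"\\\\?\\");
        return path;                               // e.g. \\?\C:\very\long\path...
    }

    int main() {
        std::wstring p = extendPath(L"C:/some/very/long/path");
        (void)p;   // would then be passed to the wide-char WinAPI calls (CreateFileW, etc.)
    }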

  28. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (5th November 2017)

  29. #22
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    372
    Thanks
    26
    Thanked 22 Times in 15 Posts
    How about using the zpaq GUI as the GUI? I can modify it in a way where it can work with different compression libs. It is also possible to make it work on different operating systems.

  30. #23
    Member
    Join Date
    Dec 2014
    Location
    Berlin
    Posts
    29
    Thanks
    35
    Thanked 26 Times in 12 Posts
    @kaitz: If you like zpaql, you might have a look at this; questions are welcome: https://github.com/pothos/zpaqlpy
    Fun examples are PNM or brotli (the hcomp/pcomp functions in the Python files are the start), and for zpaql in general the adaptive pi prediction in test/mixedpi2.cfg

  31. #24
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Using deduplication as a means for sorting

    I was thinking about the deduplication step. Right now, AFAICT, there are two major types of methods, each one with its pros and cons:

    1) Hash based (zpaq for example) and
    2) LZ- based (FA's rep and srep).

    Type 1 scales very well in memory and can be used for incremental backups, but in my experience it is less efficient than type 2.
    The latter is a better helper for the final codec, but it requires a lot of memory.
    So, for example, if I have a 20 GB folder to compress (think dztest) but only 2 or 4 GB of RAM, I have to choose between using an efficient method on small blocks of ~1-2 GB, or trying to dedupe the whole thing with a less efficient algorithm.

    Now, why not use the two of them combined and get the best of both worlds? How would we do that? Well, before we start compressing, we need to sort files (or blocks) in order to have similar content grouped together. Traditionally, that is done based on the files' name, extension, size, and even a rough entropy estimation in FreeArc. But dedupe algorithms basically find similarities between different files...

    So I propose a double scheme. First, use a fast hash-based algorithm, let's say zpaq's, not to actually deduplicate anything but to dig into the whole set and just find out which files are similar to each other, and which files benefit from parts of which. Next, use that information to sort the file list / block list, and only then apply an LZ technique like, let's say, FA's rep. Then it doesn't matter if we have to split that 20 GB folder into 15 different blocks, because there is not much chance of having similar files far from each other...

    I think this approach could give us the best trade-off between speed, memory efficiency and ratio.
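
    A rough sketch of the two-step idea, with assumed names: cheap chunk-hash fingerprints are used only to measure which files share content, the files are greedily ordered so similar ones end up adjacent, and the real LZ-style dedup/compression then runs over the reordered list.

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    using Fingerprint = std::unordered_set<std::uint64_t>;   // chunk hashes of one file

    static std::size_t shared(const Fingerprint& a, const Fingerprint& b) {
        std::size_t n = 0;
        for (std::uint64_t h : a) n += b.count(h);
        return n;
    }

    // Greedy ordering: always pick the not-yet-placed file sharing the most chunks
    // with the file placed last, so similar files sit next to each other.
    static std::vector<std::size_t> orderBySimilarity(const std::vector<Fingerprint>& files) {
        std::vector<std::size_t> order;
        std::vector<bool> used(files.size(), false);
        std::size_t cur = 0;
        for (std::size_t k = 0; k < files.size(); k++) {
            used[cur] = true;
            order.push_back(cur);
            std::size_t best = 0, next = cur;
            for (std::size_t i = 0; i < files.size(); i++)
                if (!used[i] && shared(files[cur], files[i]) >= best) {
                    best = shared(files[cur], files[i]);
                    next = i;
                }
            cur = next;
        }
        return order;
    }

    int main() { orderBySimilarity({ Fingerprint{1, 2}, Fingerprint{3}, Fingerprint{2, 3} }); }
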
    Last edited by Gonzalo; 23rd February 2018 at 15:12. Reason: title in bold characters

  32. #25
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I implemented this a few months ago in the yet-unpublished srep5, so I can share some observations. My main test file is a 100 GB disk image of my system disk. The compression results:
    • 30 GB with lzma:1gb:max
    • 27 GB with reordering+lzma. Data are reordered in 8 MB chunks, compression memory = 300 MB, decompression memory is none
    • 24 GB with full-file deduplication + lzma. CMEM = 4 GB, DMEM = 9 GB
    • 24 GB with reordering+deduplication+lzma. CMEM = 4 GB, DMEM = 6 GB

    As you can see, the reordering effect is, on average, half of the deduplication effect, but it requires only a small amount of compression memory and no decompression memory (the output data is just written in a different order). Its drawback, though, is that it requires an extra pass over the input data during the compression stage.

    If reordering is combined with full-file deduplication, it doesn't improve the compression ratio at all (on any file I've checked), but it reduces the decompression memory required for the SREP algorithm (i.e. future-LZ decompression). Here it's a 1.5x reduction, and that's the minimum I've seen. On another of my usual test files, the 22 GB precomp'ed Little Big Planet game, reordering reduced DMEM from 2 GB to 1 GB, and on an unusual test file, 30 GB of ten combined Win7 edition installation disks, DMEM was reduced from 10 GB to 1 GB!



    Now, about deduplication algos. SREP 3.93 includes 3 main algos:
    • -m0: the same as REP, but supports dictionaries > 4 GB
    • -m1/2: similar to zpaq
    • -m3..5: the original SREP algo

    The REP algo indeed keeps the entire dictionary in RAM, plus it uses a smaller hashtable, so CMEM = 1.25*dictsize.

    The remaining algorithms split the input file into chunks and keep ~30 bytes per chunk. With the same chunk size (and thus memory occupied), SREP's own algo always compresses better than the ZPAQ one, but it is slower.

    The best result for the subsequent LZMA compression is reached with 512-byte chunks, so the best full-file deduplication method requires memory equal to ~6% of the input file size. That's 1.2 GB for the 20 GB file you mentioned.
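
    A quick check of that arithmetic, using only the numbers from the post:

    Code:
    #include <cstdio>

    int main() {
        const double bytesPerChunkEntry = 30.0, chunkSize = 512.0, inputGB = 20.0;
        const double ratio = bytesPerChunkEntry / chunkSize;          // 30/512 ~= 0.0586
        std::printf("index overhead: %.1f%% of input, ~%.2f GB for a %.0f GB input\n",
                    ratio * 100.0, ratio * inputGB, inputGB);         // ~5.9%, ~1.17 GB
    }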

  33. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    nuclear_hangul (25th February 2018)

  34. #26
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Thank you, Bulat, for the insights. This confirms my theory, so maybe it is worth a shot.

    As for the extra pass, FreeArc does it and is still faster than 7z - mostly because that extra step provides the program with useful information about which algorithm to use, and that directly affects memory consumption and speed. Also, it is much faster to "compress" a chunk by saying "this is just the same as the previous one" than to perform a full-blown parsing/coding process.

    I think it is possible to perform hash-based deduplication at real-time speed (hard drive speed), even more so now that almost every computer is multi-threaded. What I don't know is whether that's possible using existing code or whether we'd have to write something from scratch...

  35. #27
    Member
    Join Date
    Feb 2018
    Location
    Best Korea
    Posts
    6
    Thanks
    4
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Bulat Ziganshin View Post
    I implemented this a few months ago in the yet-unpublished srep5, so I can share some observations.
    Would you mind sharing some srep5 Windows x64 and Linux binaries for testing? The last release was in 2014; whilst srep 3.93a has aged well, it would be nice to try the new one on my Clonezilla images and backups.

  36. #28
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    Here's a first draft explaining the proposed pre-processing stage.
    It's mostly just a high level summary. Sorry if it's hard to understand, I'm not used to writing technical documentation.
    Last edited by mpais; 1st March 2018 at 14:39. Reason: Fix typo, formatting

  37. The Following 3 Users Say Thank You to mpais For This Useful Post:

    Bulat Ziganshin (1st March 2018),Jagannath (1st March 2018),Stephan Busch (1st March 2018)

  38. #29
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Thank you, Márcio, for your work. It's far easier to understand than most technical documentation, so don't worry about that.

    Now, about the temporary storage issue: the method you describe belongs to the highest-ratio end of the spectrum of what an ideal archiver should be capable of achieving. From an end-user perspective, I won't spend a second doing pre-compression if I just want to make a fast archive. If, on the other hand, I want to get the smallest size my program can obtain, I'll just let it run overnight. Between those two extremes there can be some trade-off - for example, reducing the recursion depth. In my experience, nothing below depth 3 is ever useful to the final ratio, and usually depth 2 is more than enough.
    Keep in mind that if the program is going to perform a dedupe step alongside recompression, it doesn't have to be limited to big blocks; we can perform a full-blown srep-like deduplication, and your 10 GB is more likely to end up between 3.5 and 6.5 GB after all...

  39. #30
    Member
    Join Date
    Mar 2018
    Location
    Somewhere
    Posts
    1
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Using temp files can be a real bottleneck.
    Can't we have a circular ring of parsers? Pour in the data and it will circle the ring until the final block is set. Or maybe, once a parser detects its data, buffer it, create a new instance of the parser ring, and recurse until the final block is set - something like streams, only taking data when needed and otherwise pausing the request.

    OK, now I see that memory use can go exponential in the recursion step. But isn't such strict parsing the paq compressors' arena? A user will never use such a slow compressor. Maybe we can do such a thing at higher levels, but for real-world purposes a recursion depth of 2 or 3 is more than sufficient.


