
Thread: Document Press Clone Post-Mortem

  1. #1
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    17
    Thanks
    24
    Thanked 5 Times in 3 Posts

    Document Press Clone Post-Mortem

    Over the last few weeks, I have been working on Compound File Binary Format (CFBF) optimization: MSI, old DOC/PPT/XLS files, etc. This is pretty much what Document Press does: no optimization of the file content, just of the container.

    tl;dr: I didn’t find an optimal solution. I reached Document Press’ compression rates; sometimes better, sometimes worse. Program & source code at the end of this post. Here’s what I learned:

    Previous readings:
    1. https://encode.ru/threads/336-Document-Press-6-012
    2. https://encode.ru/threads/897-Docume...-Version-6-013
    3. https://encode.ru/threads/2864-OLE-O...-for-cmd-tools




    1. CFBF is a FAT-like file system.

    Most of you will know this, but let’s have a clear foundation: CFBF is not compressed, but it does support features like concurrent editing, transactions, and reverting to earlier snapshots. It’s built on the FAT concept (details in the specification). Compression problems that arise from this:

    1. leftover data from earlier snapshots
    2. fragmentation

    2. There are two CFBF versions.

    Version 3 uses 512-B sectors while version 4 uses 4096-B sectors. Therefore, v4 has an advantage with large files. There is no exact threshold because it depends on the number of files in the container, but the break-even size is often between 3 and 8 MiB.
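To get a feeling for where the break-even point lies, here is a rough size model. It is only a sketch under my own simplifying assumptions: one flat root storage, every stream below 4096 B goes into the mini stream (64-B mini sectors), every other stream is padded to whole sectors, directory entries are 128 B, and each sector costs one 4-byte FAT entry. The function name `estimate_cfbf_size` is made up for this illustration; real files can deviate from the estimate.

```python
def ceil_div(a, b):
    """Integer division that rounds up."""
    return -(-a // b)

def estimate_cfbf_size(stream_sizes, sector_size):
    """Estimate the container size in bytes for the given stream sizes.

    Simplified model (assumption, not the spec's full layout rules):
    flat directory, perfectly defragmented, no leftover scratch space.
    """
    mini_sector, mini_cutoff, dir_entry = 64, 4096, 128
    small = [s for s in stream_sizes if 0 < s < mini_cutoff]
    large = [s for s in stream_sizes if s >= mini_cutoff]

    # Small streams live in the mini stream, indexed by the mini FAT.
    mini_sectors = sum(ceil_div(s, mini_sector) for s in small)
    mini_stream_sectors = ceil_div(mini_sectors * mini_sector, sector_size)
    mini_fat_sectors = ceil_div(mini_sectors * 4, sector_size)

    # Large streams are padded to whole sectors.
    large_sectors = sum(ceil_div(s, sector_size) for s in large)

    # Root entry plus one directory entry per stream.
    dir_sectors = ceil_div((1 + len(stream_sizes)) * dir_entry, sector_size)

    payload = mini_stream_sectors + mini_fat_sectors + large_sectors + dir_sectors

    # The FAT must also describe its own sectors, so iterate to a fixed point.
    # The header holds 109 FAT sector locations; the rest need DIFAT sectors.
    per_fat_sector = sector_size // 4
    fat_sectors = difat_sectors = 0
    while True:
        total = payload + fat_sectors + difat_sectors
        need_fat = ceil_div(total, per_fat_sector)
        need_difat = ceil_div(max(0, need_fat - 109), per_fat_sector - 1)
        if (need_fat, need_difat) == (fat_sectors, difat_sectors):
            break
        fat_sectors, difat_sectors = need_fat, need_difat

    # The header occupies one full sector in both versions.
    return (1 + payload + fat_sectors + difat_sectors) * sector_size
```

Calling it once with `sector_size=512` and once with `sector_size=4096` for the same list of stream sizes shows how padding per stream (worse in v4) trades off against FAT overhead (worse in v3).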

    V3 files are produced by old MS Office versions and by MSI setups prior to ~2010. MSI setups that have been built after ~2010 are almost exclusively v4 files.
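If you want to check which version a given file uses, the header tells you. The following is a minimal sketch that reads the major version and the sector shift at the offsets given in the [MS-CFB] specification; the function name `cfbf_version` is my own choice for this example.

```python
import struct

# Magic number at the start of every compound file.
CFBF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

def cfbf_version(header: bytes):
    """Return (major_version, sector_size) from the first 32+ bytes of a file."""
    if len(header) < 32 or header[:8] != CFBF_SIGNATURE:
        raise ValueError("not a CFBF file")
    # Little-endian uint16 fields: minor version @24, major version @26,
    # byte order mark @28, sector shift @30.
    _minor, major, _byte_order, sector_shift = struct.unpack_from("<4H", header, 24)
    return major, 1 << sector_shift
```

For a real file, pass in the result of `open(path, "rb").read(32)`. A v3 file should report a sector size of 512, a v4 file 4096.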


    3. Document Press never uses the v4 format.

    It tries to convert all v4 files to v3. This has huge advantages for small setups (e.g. 72 KiB → 55 KiB) but also great disadvantages for large setups. I have never seen Document Press reduce the size of a large v4 CFBF file; in those cases it only copies the input because that is smaller.


    4. Defragmentation *is* important.

    That’s what Document Press’s -opt does. CFBF allocates small files (<4096 B) and large files (≥4096 B) from different streams, but they share the underlying sector system. Writing small files, then large ones, and then small ones again will almost certainly lead to fragmentation. Tom Jebo from the Microsoft Open Specification Support Blog (defunct for two weeks now) has written about it in a two-part series (Exploring the Compound File Binary Format, Exploring the Compound File Binary Format (part deux)), but you should not trust the source code there because it bloats the resulting file. Document Press got it right as far as I can see.

    The true™ way of defragmenting a CFBF file (as far as I tried) is:

    1. Create the directory and all streams (with empty sizes!).
    2. Copy small files. Or large files? I don’t know, see below. But keep them separate from each other!
    3. Properly close all open handles or else there will be scratch space or snapshots left in the file.
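To see whether a stream is fragmented at all, you can walk its FAT chain and count the contiguous runs. This is only an illustrative sketch: it assumes the FAT has already been loaded into a plain list of integers, and `count_fragments` is a name I made up. The `ENDOFCHAIN` value is the one defined in the specification.

```python
ENDOFCHAIN = 0xFFFFFFFE  # FAT marker for the last sector of a chain

def count_fragments(fat, start):
    """Count contiguous sector runs in the chain that begins at `start`.

    `fat[i]` holds the number of the sector that follows sector i.
    A perfectly defragmented stream yields 1.
    """
    if start == ENDOFCHAIN:
        return 0  # empty stream
    fragments = 1
    seen = {start}
    current = start
    while fat[current] != ENDOFCHAIN:
        nxt = fat[current]
        if nxt in seen:
            raise ValueError("cycle in FAT chain")
        if nxt != current + 1:
            fragments += 1  # jump to a non-adjacent sector starts a new run
        seen.add(nxt)
        current = nxt
    return fragments
```

Two streams written alternately, sector by sector, end up with one fragment per sector; after defragmentation each chain should collapse into a single run.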

    5. Then, it gets really fuzzy.

    Once you have removed needless snapshots & scratch space from the file, you’ve reached the minimal size. Defragmentation will not reduce the size any further, so you have to compare sizes after compression, e.g. with LZMA.
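Comparing candidates after compression can be scripted. Here is a minimal sketch using Python's standard `lzma` module; the helper names `lzma_size` and `best_for_compression` are hypothetical, and the preset is just one reasonable choice, not necessarily the setting your final archiver will use.

```python
import lzma

def lzma_size(data: bytes) -> int:
    """Size of `data` after LZMA compression at the strongest preset."""
    return len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME))

def best_for_compression(candidates):
    """Return the name of the candidate that compresses smallest.

    `candidates` maps a label (e.g. a tool name) to the file's bytes.
    """
    return min(candidates, key=lambda name: lzma_size(candidates[name]))
```

Feed it the raw bytes of each optimized variant of the same input; the variant with equal raw size but better stream ordering should come out ahead.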

    And now you’re left with the dilemma of finding the correct way of sorting files to help compression. I didn’t look into this and neither did Document Press, I bet (I beat it and got beaten by it more or less at random).


    6. MSI probably uses a bastardized v4 format.

    This is something I didn’t know but would like to have information on.

    Even though v4 is defined to have 4096-B sectors, I have seen MSIs missing 7/8 of the last sector, i.e. it’s just 512 B instead of 4096. I don’t know why that is. But I certainly know that it defeats optimization because no matter how hard you defragment, your standards-compliant result will always be 3584 B larger than the fragmented version you started with!
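A quick way to spot such files is to check whether the file size is a whole multiple of the sector size. A tiny sketch (the function name `missing_tail_bytes` is my own):

```python
def missing_tail_bytes(file_size: int, sector_size: int = 4096) -> int:
    """Bytes missing from the last sector; 0 if the file ends on a sector boundary."""
    remainder = file_size % sector_size
    return 0 if remainder == 0 else sector_size - remainder
```

For a v4 file whose last sector holds only 512 B, this reports the 3584 B that a standards-compliant rewrite would have to add back.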

    Moreover, I have seen many (but not all!) v4 MSIs that were already optimized and defragmented, including some I generated myself via WiX or via the MSI API. Considering the last-sector trick, it’s impossible to beat them. (Meaning Microsoft has put some effort into this after ca. 2010, which is good, isn’t it?)





    That’s it. You can download my program here (Windows x86-32): https://papas-best.com/downloads/bes...6/bestcfbf.exe
    And the C++ source code here: https://papas-best.com/downloads/bes...e/bestcfbf.cpp

    If you want to optimize and defrag a CFBF file, run it with
    bestcfbf <in> <out> [-v4]

    For normal files, I recommend running it twice (once with -v4 and once without) and picking the smaller file. If the result is larger than the input, then you probably have a v4 MSI.
    If you are preparing the CFBF for compression, also run Document Press on it and select the variant that compresses best.
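The "run twice and keep the smaller one" routine is easy to automate. The sketch below assumes `bestcfbf` is on your PATH and uses the command line shown above; the helper names and the temporary file suffixes are made up for this example.

```python
import os
import shutil
import subprocess

def pick_smallest(paths):
    """Return the existing path with the smallest file size."""
    existing = [p for p in paths if os.path.isfile(p)]
    return min(existing, key=os.path.getsize)

def optimize(src, dst, exe="bestcfbf"):
    """Run the tool with and without -v4 and keep the smallest result.

    Falls back to the unmodified input if neither output is smaller.
    Returns a label telling which variant was kept.
    """
    variants = {"input": src, "v3": dst + ".v3.tmp", "v4": dst + ".v4.tmp"}
    subprocess.run([exe, src, variants["v3"]], check=False)
    subprocess.run([exe, src, variants["v4"], "-v4"], check=False)

    best = pick_smallest(variants.values())
    label = next(name for name, path in variants.items() if path == best)
    if os.path.abspath(best) != os.path.abspath(dst):
        shutil.copyfile(best, dst)
    for name in ("v3", "v4"):
        if os.path.exists(variants[name]):
            os.remove(variants[name])
    return label
```

If `optimize` returns "input", neither run beat the original, which is the case described above for v4 MSIs.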

    I think the last way to improve anything would be reading and writing individual FAT sectors (which I didn’t, I just used the Shell Lightweight COM API). Igor Pavlov is a skilled programmer and there’s some chance that Document Press already does that.

  2. The Following 3 Users Say Thank You to Krishty For This Useful Post:

    Bulat Ziganshin (1st April 2019),comp1 (1st April 2019),maadjordan (6th April 2019)

  3. #2
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    17
    Thanks
    24
    Thanked 5 Times in 3 Posts
    P.S. – Small benchmark (all sizes in KiB):

    Code:
                           original      bestcfbf   bestcfbf -v4  Document Press 6.01
    RSS Bandit setup        10687.5         10687          10656                10687
    Word 2000 sample file        70            64             88                   64
    STL Viewer setup             76            55             72                   55
    Movie Quiz.xls           3650.5          3650           3644                 3650
    As you can see, it performs like Document Press on small files and outperforms it on large files with -v4. I should probably add another test with large, post-2010 MSI files where neither gains anything.

  4. #3
    Member
    Join Date
    May 2008
    Location
    Kuwait
    Posts
    301
    Thanks
    26
    Thanked 22 Times in 15 Posts
    It’s a good start. Document Press also supports CHM files. Can you add that?

  5. #4
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    17
    Thanks
    24
    Thanked 5 Times in 3 Posts
    I’m afraid not – CHM cannot be opened through the compound file API I use, and the fact that 7-Zip opens them flawlessly makes me think that Igor Pavlov coded a special solution for it …

  6. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    7-zip actually has its own parsers both for compound format and for chm/lit, they're different:
    https://github.com/kornelski/7z/tree...ip/Archive/Chm
    https://github.com/kornelski/7z/blob...ndler.cpp#L875

  7. The Following User Says Thank You to Shelwien For This Useful Post:

    Krishty (6th April 2019)

  8. #6
    Member
    Join Date
    May 2008
    Location
    Kuwait
    Posts
    301
    Thanks
    26
    Thanked 22 Times in 15 Posts
    CHM and compound files are different.
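The difference is already visible in the first bytes of the file: compound files start with the CFBF signature, while CHM files start with the ASCII tag "ITSF". A small sketch to tell them apart (the function name `sniff_container` is only for illustration):

```python
CFBF_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file signature
CHM_MAGIC = b"ITSF"                             # CHM (ITSS) file signature

def sniff_container(first_bytes: bytes) -> str:
    """Classify a file by its leading bytes as 'cfbf', 'chm' or 'unknown'."""
    if first_bytes.startswith(CFBF_MAGIC):
        return "cfbf"
    if first_bytes.startswith(CHM_MAGIC):
        return "chm"
    return "unknown"
```

Passing the first 8 bytes of a file is enough for both checks.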

    What I was thinking is that CHM compaction might be done through a system-dependent call built into Windows, but I could not confirm this. I asked Igor to release the Document Press source code, but he confirmed that it was lost in a hard-disk failure.

    The same goes for keytools (which calls a system register): it lets CHM files reach LZX18, compared with the default LZX16 and Document Press’s LZX17. The spec allows up to LZX21, but no tool reaches that.

    Maybe we can decompile both programs to confirm that.

  9. #7
    Member
    Join Date
    May 2017
    Location
    Germany
    Posts
    17
    Thanks
    24
    Thanked 5 Times in 3 Posts
    To build my MSIs, I use the Cabinet API which is built into Windows and supports LZX21.

    With deep knowledge of the CHM format, one may be able to use that API to extract the existing data, re-pack it using LZX21, re-insert it into the CHM, and fix up everything around it.

    I have never had a look at the CHM format, though. I have also not seen references to CABINET.DLL in Document Press 6.01, so its LZX magic likely was hand-written by Igor.
