
Thread: Format priority for recompression

  1. #1 - Shelwien (Administrator)

    Format priority for recompression

    Here's a list of data types: deflate (zip, pdf, png, docx), jpeg, bmp, mp3, aac (mp4, m4a), ogg, flac, wav, COFF (exe/dll/etc.), ELF, html, xml.
    Please write them in the desired order of implementation (you can also add your own types).

  2. #2 - m^2 (Member)
    We're a bit off topic, but my order would be:
    JPEG
    AAC
    Deflate
    html, xml
    ELF, COFF
    wav
    MP3
    MPEG4
    Java classes
    MP2
    wavpack (because it's the strongest lossless audio codec that can be used in videos and played w/out problems)
    ogg
    bmp

    EDIT: added MP2
    Last edited by m^2; 7th March 2011 at 20:03.

  3. #3 - Member
    Quote Originally Posted by Shelwien View Post
    What I'd really appreciate though, is a real aac decoder source (not faad), one which would be able to decode
    heaac audio streams etc.
    Tried ffmpeg? Works for me.

  4. #4 - Shelwien (Administrator)
    > We're a bit off topic, but my order would be:
    > JPEG
    > AAC
    > Deflate
    > html, xml
    > ELF, COFF
    > wav
    > MP3
    > MPEG4
    > Java classes
    > MP2
    > wavpack (because it's the strongest lossless audio codec that can be used in videos and played w/out problems)
    > ogg
    > bmp

    Well, now let me explain what seems wrong with your priorities.

    1. Deflate comes first, because without it you can't reach jpegs in some filetypes
    (like pdf or jar or .docx).
    2. AAC (and also MP2) is mostly used for soundtracks in movie containers,
    thus video streams (h264 and mpeg1; xvid) should have priority.
    However, MP2 is not much different from MP3, and mp3 is the 2nd most popular format after jpeg,
    so MP2 support can be implemented much earlier.
    3. html is very complex, with many sublanguages (css, js, vbs), and the effect of its recompression
    would be fairly low, thus its priority should be very low.
    xml, on the other hand, has a much stricter structure, and can be encountered within various
    filetypes (including jpeg; docx is based on it).
    4. Java classes... unless you mean .jar (which only requires deflate support),
    I'd not expect much effect from their recompression, and multi-TB archives of .class
    files are hard to imagine.
    5. wavpack... overall, lossless formats should have much lower ranks compared to lossy ones,
    because there's usually no need to write an integrated recompressor for these.
    You can just extract your wavpack track and compress it with optimfrog, then extract
    and compress again with wavpack, then remux.
    Imho atm it's too obscure to handle. The same applies to flac actually, although it's getting
    pretty popular now. But again, it's lossless, and it's easier to write a better lossless
    audio compressor + player plugins than a proper recompressor.
    6. Windows EXEs are fairly important. They're getting more and more redundant
    (compiler inlining, x64), they're basically containers for many data types, and
    there's no good support (the best is likely durilca's disasm filter, which is
    still very simple and redundant; and the usual bcj/bcj2 is plain laughable).
    However, with a linux system you don't normally go around downloading ELF binaries,
    so that's not so important.
    7. bmp and wav do require custom models. They're pretty widespread as is, but
    can also be encountered within other formats. For example, if we removed deflate
    from png, it would basically become bmp; also, a good bitmap model is required for
    good jpeg compression, though there it won't be used directly.
    8. ogg... it takes a lot of time to develop a recompressor for a complex format,
    so I won't do it unless there's some demand. I have a use myself for the things listed
    above, but not for this.
    9. MPEG4, if you mean h264, would surely be cool, but video codecs are harder to
    deal with than others, so it can't have higher priority than bmp.

  5. #5 - Shelwien (Administrator)
    > Tried ffmpeg? Works for me.

    That's good to know. In the worst case I'd really have to go extracting modules from huge apps
    to build the utils I need on my own. Sometimes that can be harder than decompiling some mp4.dll though,
    because with a dll I at least know for certain that it does what I need.
    But what I want is something like lame for aac, which would handle all aac streams and nothing else.
    faac/faad try to be that, but it's not quite enough.

  6. #6 - m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    > We're a bit off topic, but my order would be:
    > JPEG
    > AAC
    > Deflate
    > html, xml
    > ELF, COFF
    > wav
    > MP3
    > MPEG4
    > Java classes
    > MP2
    > wavpack (because it's the strongest lossless audio codec that can be used in videos and played w/out problems)
    > ogg
    > bmp

    Well, now let me explain what seems wrong with your priorities.

    1. Deflate comes first, because without it you can't reach jpegs in some filetypes
    (like pdf or jar or .docx).
    I assumed you talked about a recompressor similar to the LZMA one, where you wouldn't be able to access them anyway. But knowing that you plan (also?) a Precomp-like thing, I still stand by the current order - because while there are solutions for both JPEG and deflate recompression, I feel both are lacking; however, I think that with deflate this will change soon, and I don't see a similar improvement for JPEG on the horizon.

    Quote Originally Posted by Shelwien View Post
    2. AAC (and also MP2) is mostly used for soundtracks in movie containers,
    thus video streams (h264 and mpeg1; xvid) should have priority.
    However, MP2 is not much different from MP3, and mp3 is the 2nd popular format after jpeg,
    so MP2 support can be implemented much earlier.
    I roughly evaluated savings/(processing time) for Ocarina's test codec on MPEG video and for SoundSlimmer on MP3, and I guessed the ratio for AAC would be better.

    Quote Originally Posted by Shelwien View Post
    3. html is very complex with many sublanguages (css,js,vbs), and effect from its recompression
    would be fairly low, thus its priority should be very low.
    xml, on other hand, has a much more strict structure, and can be encountered within various
    filetypes (including jpeg; docx is based on it).
    They are omnipresent; that's the only reason I suggested them this high. If you feel that the savings would be low, I agree that the priority should be lowered.
    Quote Originally Posted by Shelwien View Post
    4. Java classes... unless you mean .jar (which only requires deflate support),
    I'd not expect much effect from their recompression, and multi-TB archives of .class
    files are hard to imagine.
    Yes, I meant .class files. I know very little about them, but they seem much like exes, just much more structured and with more metadata - so I expect them to compress better than exes. But I'm very far from certain I'm right.
    And it's not something I meant for archival - but for installers. And what numbers are we talking about here? Petabytes per year?
    And there's nothing that targets them specifically (except for obfuscators), so the state of the art is underwhelming.
    Quote Originally Posted by Shelwien View Post
    5. wavpack... overall, lossless formats should have much lower ranks comparing to lossy,
    because there's usually no need to write an integrated recompressor for these.
    You can just extract your wavpack track and compress with optimfrog, then extract
    and compress again with wavpack, then remux.
    Imho atm its too obscure to handle. Same applies to flac actually, although its getting
    pretty popular now. But again, its lossless, and its easier to write a better lossless
    audio compressor + player plugins, than a proper recompressor.
    That's why I put wav first. But manual demux+remux... I didn't actually think about that, and I think you're right; it deserves a lower rank (even though for me it's too complicated to be worthwhile).
    Quote Originally Posted by Shelwien View Post
    6. Windows EXEs are fairly important. They're getting more and more redundant
    (compiler inlining, x64), they're basically containers for many data types, and
    there's no good support (the best is likely durilca's disasm filter, which is
    still very simple and redundant; and the usual bcj/bcj2 is plain laughable).
    Interesting opinion. Seeing who made dispack and how close it is to BCJ2, I expected the state of the art to be well developed. If you think you can save a lot, I strongly encourage you to do so. It goes to no. 2 then, just after JPEG.
    Quote Originally Posted by Shelwien View Post
    However with linux system you don't normally go around downloading ELF binaries,
    so that's not so important.
    Seriously? I don't see any difference, except that Windows users install big Microsoft stuff from DVDs while *nix users download the equivalents - and that *nix users tend to be more willing to experiment with different OSes, which they usually download too.
    IMO the big reason to favour COFF is that there are far more people using it. The reason to favour ELF is that all FOSS OSes use it. For me that levels the field.
    Quote Originally Posted by Shelwien View Post
    7. bmp and wav do require custom models. they're pretty widespread as is, but
    can be also encountered within other formats. For example, if we'd remove deflate
    from png, it would basically become bmp; also a good bitmap model is required for
    good jpeg compression, though there it won't be used directly.
    Yes, I know how important it is, I just don't think we need yet another image coder.
    Quote Originally Posted by Shelwien View Post
    8. ogg... it takes a lot of time to develop a recompressor for a complex format,
    so I won't do it unless there's some demand. I have use myself for things listed
    above, but not for this.
    Understandable.
    Quote Originally Posted by Shelwien View Post
    9. MPEG4, if you mean h264, would be surely cool, but video codecs are harder to
    deal with than others, so it can't have higher priority than bmp.
    Did I say MPEG4? I meant MPEG2. I've slept little the last 3 nights...
    Judging by Ocarina's codec, I think there's fairly little to be saved on MPEG2, and I guess much less with h264. Actually, I would be surprised if h264 recompression gave enough savings to be worthwhile any time soon.

    EDIT: fixing typos and grammar. I read everything twice before posting, yet when reading for a third time, I still see many errors...
    Last edited by m^2; 7th March 2011 at 23:18.

  7. #7 - Shelwien (Administrator)
    > I assumed you talked about a recompressor similar to the LZMA one,
    > where you wouldn't be able to access them anyway.

    No, unlike lzma, for deflate it's better to decode all the data, then
    compress it with a different method (for lzma it's in theory better too,
    but lossless reconstruction would be terribly slow).
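
    For illustration, a minimal sketch of that decode step using zlib's raw inflate (just a sketch with a made-up helper name, not code from any of the tools discussed; detecting stream boundaries and recovering the original encoder's settings for bit-exact re-encoding are the hard parts and aren't shown):
    Code:
    #include <zlib.h>
    #include <vector>
    #include <cstddef>
    
    // Try to fully decode a raw deflate stream starting at src; returns the
    // uncompressed bytes, or an empty vector if the data doesn't decode cleanly.
    // The decoded data can then be handed to a stronger compressor.
    std::vector<unsigned char> inflate_raw(const unsigned char* src, size_t srclen)
    {
        z_stream zs = {};                          // zalloc/zfree/opaque = 0 -> zlib defaults
        if (inflateInit2(&zs, -15) != Z_OK)        // -15 = raw deflate, no zlib/gzip header
            return {};
        zs.next_in  = const_cast<unsigned char*>(src);
        zs.avail_in = (uInt)srclen;
        std::vector<unsigned char> out;
        unsigned char buf[65536];
        int ret = Z_OK;
        while (ret == Z_OK) {                      // Z_OK = progress made, more to do
            zs.next_out  = buf;
            zs.avail_out = sizeof(buf);
            ret = inflate(&zs, Z_NO_FLUSH);
            if (ret == Z_OK || ret == Z_STREAM_END)
                out.insert(out.end(), buf, buf + (sizeof(buf) - zs.avail_out));
        }
        inflateEnd(&zs);
        if (ret != Z_STREAM_END)                   // corrupt or truncated stream
            out.clear();
        return out;
    }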

    > I still stand by the current order - because while there are solutions for both JPEG and deflate
    > recompression,
    > I feel both are lacking, however I think that with deflate it will
    > change soon and I don't see such improvement for JPEG on the horizon.

    There's no actual solution for deflate atm. Precomp is just a proof-of-concept,
    it can't be integrated into a popular archiver (not stable enough, no lib/dll,
    temp files, either too slow, or requires manual control).

    And there are two for jpeg - packjpg and the winzip one.
    Of these, packjpg is even ready for immediate use.
    There are also a few paq versions with different jpeg models, but these are not
    practical due to speed.
    So afaik the situation with jpeg is much better than with deflate.
    The actual problem with jpeg is still jpeg parsing - if you have a lossless parser,
    you can immediately reach nearly the level of winzip jpeg (as my pjpg showed).
    And it's also known how to reach paq8 jpeg results - the model is open-source and
    even described in Matt's DCE.
    So it's only a parser problem, and I already did most of the necessary work
    (like factoring down the djpeg source to ~100k + the pjpg parser).

    As to deflate, it requires more work afaik. Sure, cloning precomp is relatively
    easy, but the clone would have the same issues with stability, speed, stream processing,
    and format compatibility (i.e. it won't recompress non-zlib deflate streams).

    So from my point of view, I have to make an actual universal codec to compress
    the uncompressed deflate data, and also have to make a good symmetric parser
    for deflate (like I already did for lzma), and then a custom CM to compress
    the deflate codes in the context of the uncompressed data.
    To me it seems a lot more complicated than making a jpeg recompressor with a paq-like
    model but reasonable speed.

    Although for jpeg too, I won't be satisfied unless my model has better
    compression than paq8 or stuffit.

    > I roughly evaluated savings/(processing time) for Ocarina's test
    > codec for MPEG video and SoundSlimmer on MP3 I guessed the ratio for
    > AAC would be better.

    Yes, the ratio for AAC is expected to be better than for mp3, because AAC
    contains more redundant information (huffman table ids etc) which can
    be derived from the spectral coefs.
    I don't see how it's related to that mpeg codec though.
    Also its processing speed likely doesn't mean anything - it's either not
    optimized, or uses a zpaq backend, or both.

    > [html]
    > They are omnipresent, that's the only reason why I suggested them
    > this high. If you feel that savings would be low, I agree that the
    > priority should be lowered.

    Savings would probably be good, because a normal CM model can't
    predict the context changes in html.
    But it's just too complex. We need a browser-level html parser
    to make a really good html recompressor, and what's worse, we
    likely can't use open-source stuff like webkit and chromium,
    because all the normal parsers are lossy, and it eventually
    becomes an error-correction hell when you try to losslessly
    reconstruct the original data from a DOM in memory.

    Anyway, the absolute savings from html would be small compared
    to image/audio/video, so that's what should be done first.

    > Yes, I meant .class files. I know very little about them, but they
    > seem much like exes, just much more structured and with more
    > metadata - so I expect them to compress better than exes.
    > But I'm very far from certain I'm right.

    One problem is that they compress better with plain CM, without
    any special handling, so recompression gain would be smaller.
    (While for exe it can reach 100% - compare ppmd with winrk in
    http://www.maximumcompression.com/data/exe.php).

    > And it's not something I meant for archival - but for installers.

    So do you have java apps with lots of code?
    The apps which I happen to have are mostly 100-500k, aside from java itself.

    > And what numbers are we talking here? Petabytes per year?

    I'm talking about whether one can even imagine a cost reduction
    from .class (or .jar) storage on a filehosting service or something.

    > And there's nothing that targets them specifically (except for
    > obfuscators), so the state of art is underwhelming.

    Well, java programmers can't write strong compressors (because java
    is too slow for that), and others don't care, I guess.

    [exe]
    > Interesting opinion, seeing who made dispack and how close it is to BCJ2,

    Well, disasm quality doesn't immediately mean good compression.
    Care to compare dispack+ppmonstr vs durilca?
    I mean that for a disasm preprocessor, it's not only necessary
    to parse the code, but also to transform it into something compressible
    by universal compressors.

    > I expected the state of art to be well developed. If you think
    > you can save a lot, I strongly encourage you to do so.

    Yeah, well, unfortunately it's pretty complicated.
    Ideally we need an ida-level disassembler (not one-pass) + a custom
    CM model for code + an exe parser (identifying tables of constants,
    like floats, ascii/unicode text, various resources - images, icons,
    dialogs, signatures, manifests) + models for everything listed before.

    One interesting problem, btw, is that in x64 exes _all_ memory addresses
    are now relative, the same as with call/jmp in x86, but they can't be
    handled just by finding a signature byte like E8
    (which is essentially what BCJ/BCJ2 do).
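
    To illustrate that E8 trick - a minimal sketch of the classic call transform (just an illustration of the idea, not the actual BCJ/BCJ2 source; real filters add heuristics to skip false positives):
    Code:
    #include <cstdint>
    #include <cstddef>
    
    // E8 is the x86 "call rel32" opcode. Converting its relative operand into an
    // absolute target makes repeated calls to the same function byte-identical,
    // so a generic compressor can match them. x64 RIP-relative *data* references
    // have no such single signature byte, which is exactly the problem above.
    void e8_transform(uint8_t* buf, size_t size, bool encode)
    {
        for (size_t i = 0; i + 5 <= size; i++) {
            if (buf[i] != 0xE8) continue;
            uint32_t v = (uint32_t)buf[i+1] | ((uint32_t)buf[i+2] << 8)
                       | ((uint32_t)buf[i+3] << 16) | ((uint32_t)buf[i+4] << 24);
            uint32_t pos = (uint32_t)(i + 5);        // address of the next instruction
            v = encode ? v + pos : v - pos;          // rel->abs on encode, abs->rel on decode
            buf[i+1] = (uint8_t)v;         buf[i+2] = (uint8_t)(v >> 8);
            buf[i+3] = (uint8_t)(v >> 16); buf[i+4] = (uint8_t)(v >> 24);
            i += 4;                                  // don't rescan the operand bytes
        }
    }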

    > IMO the big reason to favour COFF is that there are far more people
    > using it. Reason to favour elf - all FOSS OSes use it.

    Ok, I'll try to clarify this. Windows programs are commonly distributed
    as _executable_ independent installers. Basically, each developer
    distributes his own app. Recently I also see something similar for
    Mac versions - the .DMG images - but the mac executable format is not really ELF,
    so let's leave it alone for now.
    I wanted to say that linux programs are either distributed as source
    archives by developers, or as ELF binaries in specialized repositories
    for a given linux distro.
    The difference is that it's much easier to convince an independent developer
    to try improving compression in his installer than to do the same
    with a whole linux distro.

    > [bmp]
    > Yes, I know how important it is, I just don't think we need yet another image coder.

    You may be surprised, but we don't have one atm.
    PNG compression is very bad (1194493 png+pngcrush, 539003 paq8px),
    paq is ok but too slow to be of any use, and the others are not open-source
    (including jpeg2000ls afaik).
    Also, paq's bmp model is clearly not very smart; I'd expect a considerable improvement to be possible.

    > I judging by Ocarina's codec I think there's fairly little to be
    > saved on MPEG2 and guess that much less with h264.

    Actually I think that Przemyslaw's results are pretty good - it's
    more than my 15% on mp3 at least.
    Also, surely there's less redundancy in h264 (if only because it
    actually uses a kind of arithmetic coding), but I'd still estimate
    at least 10-15%.

    > Actually I would be surprised if h264 recompression would give
    > enough savings to be worthwhile soon.

    In a way, it's a case where anything goes. It would likely be used
    to improve video quality at low bitrates, instead of changing
    the bitrate. And a 10% gain in entropy coding would certainly make
    a visible difference.

  8. #8 - m^2 (Member)
    Quote Originally Posted by Shelwien View Post
    There's no actual solution for deflate atm. Precomp is just a proof-of-concept,
    it can't be integrated into a popular archiver (not stable enough, no lib/dll,
    temp files, either too slow, or requires manual control).

    And there're two for jpeg - packjpg and winzip one.
    Of these packjpg is even ready for immediate use.
    There're also a few paq versions with different jpeg models, but these are not
    practical due to speed.
    So afaik the situation with jpeg is much better than with deflate.
    The actual problem with jpeg is still jpeg parsing - if you have a lossless parser,
    you can immediately reach nearly the level of winzip jpeg (as my pjpg showed).
    And its also known how to reach paq8 jpeg results - the model is open-source and
    even described in Matt's DCE.
    So its only a parser problem, and I already did most of necessary work
    (like factoring down djpeg source to ~100k + pjpg parser).
    Yes, I agree that JPEG is a bit better, but as I said, it will only be like this for a very short time. PAQ is very slow.

    Quote Originally Posted by Shelwien View Post
    As to deflate, it requires more work afaik. Sure cloning precomp is relatively
    easy, but the clone would have the same issues with stability, speed, stream processing,
    format compatibility (ie won't recompress non-zlib deflate streams).
    I don't think so. I see no inherent problems with stability. Speed - there's a ton of room for improvement; when it comes to speed, precomp is terribly implemented. Compatibility with foreign streams - I have some ideas how to attack the issue, but yes, to do it right you need an entirely different approach.
    I'm not sure what you mean by stream processing... input and output as streams with no chance to buffer them whole? It's not a big deal.

    Quote Originally Posted by Shelwien View Post
    So from my point of view, I have to make an actual universal codec to compress
    the uncompressed deflate data, and also have to make a good symmetric parser
    for deflate (like i already did for lzma), and then a custom CM to compress
    the deflate codes in context of uncompressed data.
    To me it seems a lot more complicated than making a jpeg recompressor with paq-like
    model, but reasonable speed.

    Although for jpeg too, I won't be satisfied unless my model would have better
    compression than paq8 or stuffit
    Good luck.

    Quote Originally Posted by Shelwien View Post
    > I roughly evaluated savings/(processing time) for Ocarina's test
    > codec for MPEG video and SoundSlimmer on MP3 I guessed the ratio for
    > AAC would be better.

    Yes, ratio for AAC is expected to be better than for mp3, because AAC
    contains more redundant information (huffman table ids etc) which can
    be derived from spectral coefs.
    I don't see how its related to that mpeg codec though.
    Also its processing speed likely doesn't mean anything - its either not
    optimized, or uses a zpaq backend, or both.
    Yes, I didn't expect the MPEG codec to be optimal, but my memory told me there was only ~10% saved. I just went back to it and see that it's more like 20%, about as good as SoundSlimmer.
    Quote Originally Posted by Shelwien View Post
    > [html]
    > They are omnipresent, that's the only reason why I suggested them
    > this high. If you feel that savings would be low, I agree that the
    > priority should be lowered.

    Savings would be good probably, because a normal CM model can't
    predict the context changes in html.
    But its just too complex. We need a browser-level html parser
    to make a really good html recompressor, and what's worse, we
    likely can't use open-source stuff like webkit and chromium,
    because all the normal parsers are lossy, and it eventually
    becomes an error-correction hell when you try to losslessly
    reconstruct original data from a DOM in memory.

    Anyway, the absolute savings from html would be little comparing
    to image/audio/video, so that's what should be done first.
    There was an xml codec that did full parsing, IIRC using a ready-made library. It was slow, unstable, and no better than XWRT though...

    Quote Originally Posted by Shelwien View Post
    > Yes, I meant .class files. I know very little about them, but they
    > seem much like exes, just much more structured and with more
    > metadata - so I expect them to compress better than exes.
    > But I'm very far from certain I'm right.

    One problem is that they compress better with plain CM, without
    any special handling, so recompression gain would be smaller.
    (While for exe it can reach 100% - compare ppmd with winrk in
    http://www.maximumcompression.com/data/exe.php).

    > And it's not something I meant for archival - but for installers.

    So do you have java apps with lots of code?
    Apps which I accidentally have are mostly 100-500k, aside from java itself.
    Before writing I looked at the most popular sourceforge Java apps. No. 1 was Vuze and no. 2 Sweet Home 3D; they get 250,000 and 150,000 downloads weekly. Vuze has 21.8 MB of classes and Sweet Home 3D 15 MB. That's ~380 TB yearly just for 2 apps.
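
    (Roughly, assuming those weekly download counts hold for a whole year, the arithmetic behind a figure in that ballpark is:)
    Code:
    250000 downloads/week * 21.8 MB * 52 weeks ~ 283 TB   // Vuze
    150000 downloads/week * 15.0 MB * 52 weeks ~ 117 TB   // Sweet Home 3D
                                         total ~ 400 TB   // a few hundred TB per year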

    Quote Originally Posted by Shelwien View Post
    > I expected the state of art to be well developed. If you think
    > you can save a lot, I strongly encourage you to do so.

    Yeah, well, unfortunatelly its pretty complicated.
    Ideally we need a ida-level disassembler (not onepass) + custom
    CM model for code + exe parser (identifying tables of constants,
    like floats, ascii/unicode text, various resources - images,icons,
    dialogs,signatures,manifests) + models for all listed before.
    Really, why do you care about anything but code before you've even started? I'd guesstimate that 90-95% of an average exe is code. Even though the variance is big and you can find files all the way down to 0% code, I don't think it's worthwhile to do something special for them.

    Quote Originally Posted by Shelwien View Post
    > IMO the big reason to favour COFF is that there are far more people
    > using it. Reason to favour elf - all FOSS OSes use it.

    Ok, I'd try to clarify this. Windows programs are commonly distributed
    as _executable_ independent installers. Basically each developer
    distributes his own app. Recently I also see something similar for
    Mac versions - the .DMG images, but mac executable is not really ELF,
    so lets leave it alone for now.
    I wanted to say that linux programs are either distributed as source
    archive by developers, or as ELF binaries on specialized repositories
    for given linux distros.
    The difference is that its much easier to coerce an independent developer
    to try improving compression in his installer, than to do the same
    with a whole linux distro.
    But gains from convincing a distro are way way bigger.

    Quote Originally Posted by Shelwien View Post
    > [bmp]
    > Yes, I know how important it is, I just don't think we need yet another image coder.

    You may be surprised, but we don't have one atm.
    PNG compression is very bad (1194493 png+pngcrush, 539003 paq8px),
    paq is ok but too slow to be of any use, others are not open-source
    (including jpeg2000ls afaik).
    Also paq's bmp model is clearly not very smart, I'd expect a considerable improvement possible.
    Yeah, png is crap and paq is useless. But there are so many other options... yes, the ones that are strongest yet fast enough are closed source only, but for me that's not a reason to ignore them.

    Quote Originally Posted by Shelwien View Post
    > I judging by Ocarina's codec I think there's fairly little to be
    > saved on MPEG2 and guess that much less with h264.

    Actually I think that Przemyslaw's results are pretty good - its
    more than my 15% on mp3 at least.
    Also surely there's less redundancy in h264 (if only because it
    actually uses a kind of arithmetic coding), but I'd still estimate
    at least 10-15%.
    As I already said, I remembered the MPEG2 savings wrong. But I'm still surprised by your estimate.

    Quote Originally Posted by Shelwien View Post
    > Actually I would be surprised if h264 recompression would give
    > enough savings to be worthwhile soon.

    In a way, its a case where anything goes. It would be likely used
    to improve video quality at low bitrates, instead of changing
    the bitrate. And 10% gain at entropy coding would certainly make
    a visible difference.
    I think most people won't bother for 10%. For me it's usually the limit of what I'm willing to do, and whenever I talk to people about the various tools I use because they give such significant savings, they treat me like a freak.

    BTW I suggest splitting the topic.

  9. #9 - chornobyl (Member)

  10. #10 - Shelwien (Administrator)
    @chornobyl:
    Thanks, but I need something like that with source.

  11. #11 - Shelwien (Administrator)
    >> As to deflate, it requires more work afaik.

    > I don't think so. I see no inherent problems with stability.

    I don't see any either - if you don't use zlib.
    Well, at least schnaader seems to have constant problems with it.

    > I'm not sure what do you mean with stream processing...
    > input and output as streams with no chance to buffer them all? It's not a big deal.

    I meant sequential processing, without seeking around.
    Though having to work with large blocks is bad too.

    > There was a xml codec that did full parsing and IIRC using a ready made library.
    > It was slow, unstable and not better than XWRT though...

    Unfortunately most parsers are inherently lossy (i.e. it's impossible to reproduce
    incomplete, erroneous, or simply non-standard inputs),
    or both lossy and redundant, which is even more fun.

    Also, format parsing doesn't mean good compression. Compression is provided by
    a specialized compression method (which usually requires parsing to implement),
    while a parser made for a different purpose can just as well hurt compression.

    It's even harder to make a good preprocessor, because the idea is basically the
    same, but you have to reorganize the format data in such a way that a specific
    "universal" compressor would be able to notice the dependencies.

    > Vuze has 21.8 MB of classes and Sweet Home 3D - 15. That's ~380 TB yearly just for 2 apps.

    Still, don't you think that's too little compared to other things?
    The single sqlservr.exe in MS SQL is 41M.
    I'm not saying that it's not worth doing at all, I'm discussing priorities.
    And in that sense, imho java .class is one of the least likely formats to
    ever get a recompressor :)

    > Really, why do you care for anything but code before you even started?

    Let's look at the already-mentioned sqlservr.exe (I specifically saved it
    for disasm testing and such, because it's a service, and doesn't have
    any pictures or many strings inside).

    sqlservr.exe 40999448
    .text 32654336 (others are not executable)
    .text-filt 30577564 (some text blocks removed)

    And now acrord32.exe from SFC, structure of which is more common for GUI apps:

    acrord32.exe 3870784
    .text 2233600 (mostly clean code here)

    powerarc.exe 8750424
    CODE 6887312
    CODE-filt 6609229 (RTTI structures etc removed)

    So it's actually wrong to approach exes as if they were all code.
    In fact, when existing disasm filters are applied only to the
    code section, instead of the whole exe, compression noticeably
    improves.
    In other words, .exe is a container format, and it doesn't
    make sense to start processing code before learning to parse .exe.

    > I'd guestimate that 90-95% of average exe is code.

    Well, that's wrong. It wasn't like that even in DOS, where the exe format
    was much more compact and strings were ascii instead of unicode.

    > Even though variance is big and you can find files all the way to 0%
    > code, I don't think it's worthwhile to do something special for them.

    Pictures/icons, unicode strings, and binary tables (especially floating-point)
    can all benefit greatly from proper handling (like 2x better compression).

    > But gains from convincing a distro are way way bigger.

    It's likely _lots_ more work (the compression engine has to already be stable
    and fully portable when presented), with no direct benefits.
    Using it with some small apps as soon as it works is much more attractive imho.

    > Yeah, png is crap and paq is useless. But there are so many other
    > options...yes, the strongest yet fast enough are closed source only,
    > but for me it's not a reason to ignore them.

    There aren't that many stable closed-source ones either.
    And even counting unstable ones... it's basically only BMF and Rhatushnyak's coders
    (no sense looking at others if they have worse compression:
    http://www.researchandtechnology.net...benchmarks.php)

    Anyway, the question is basically what to use for lossless
    image compression in my archiver. With the addition that I won't
    use anything based on huffman or broken arithmetic coding, because
    they're outdated.

    >> Also surely there's less redundancy in h264, but I'd still estimate at least 10-15%.

    > As I said already, I remembered MPEG2 savings wrong. But I still
    > feel surprised by your estimation.

    There's a known 5% result from replacing cabac with a precise rc (may be fake though).
    Also, h264 has huge speed restrictions, so I won't be surprised
    even by a larger improvement - for example, you can compare the jpeg/ari (or winzip jpeg)
    vs paq8 results and apply that ratio to h264.

    > I think most people won't bother for 10%.

    As I said, it can also be used to improve quality.
    Afaik a video quality improvement corresponding to 10% higher bandwidth
    would be noticeable, especially at low bitrates.

    > For me it's usually the limit of what I'm willing to do and whenever
    > I talk to people about various tools that I use because they give
    > such significant savings, they treat me as a freak.

    Even 1% helps when a file doesn't fit on a DVD :)
    Btw, here's my old avi dumper - http://nishi.dreamhosters.com/u/avip_v0.rar
    I actually used it to download movies (with mp3zip for the audio track) when
    I still had dialup :)

    > BTW I suggest splitting the topic.

    Just split this off from "compressing mp3" into some "Format priority for recompression" thread,
    or do you mean something else?

  12. #12 - Member
    As to Java classes:
    - there is the Pack200 utility, which performs a lossy transform on .class files and allows for, say, a 50% reduction in size,
    - Android uses the Dalvik VM, which uses .dex files instead of .class files, i.e. .class files are converted to .dex files before uploading to an Android device. .dex also offers a much more compact representation of code.

  13. #13 - m^2 (Member)
    I had to do some research before replying.

    Quote Originally Posted by Shelwien View Post
    >> As to deflate, it requires more work afaik.

    > I don't think so. I see no inherent problems with stability.

    I don't see any either - if you won't use zlib.
    Well, at least schnaader seems to have permanent problems with it.
    Sounds strange, I thought zlib was considered stable.
    Well, I'll have an occasion to find out myself.

    Quote Originally Posted by Shelwien View Post
    > I'm not sure what do you mean with stream processing...
    > input and output as streams with no chance to buffer them all? It's not a big deal.

    I meant sequential processing, without seeking around.
    Though having to work with large blocks is bad too.
    Large blocks are desirable for determining which compression mode to use, and unnecessary after that. If that's bad for you - OK, it's an inherent problem.

    Quote Originally Posted by Shelwien View Post
    > There was a xml codec that did full parsing and IIRC using a ready made library.
    > It was slow, unstable and not better than XWRT though...

    Unfortunately most parsers are inherently lossy (ie its impossible to reproduce
    incomplete or erroneous or simply non-standard inputs)
    Or both lossy and redundant, which is even more fun.

    Also format parsing doesn't mean good compression. Compression is provided by
    a specialized compression method (which usually requires parsing to implement),
    while a parser made for different purpose can as well hurt compression.

    Its even harder to make a good preprocessor, because the idea is basically the
    same, but you have to reorganize format data in such a way that a specific
    "universal" compressor would be able to notice the dependencies.
    Mhm. But having a deeper understanding of the data opens more possibilities. I was very disappointed when I saw how far it was from my expectations.

    Quote Originally Posted by Shelwien View Post
    > Vuze has 21.8 MB of classes and Sweet Home 3D - 15. That's ~380 TB yearly just for 2 apps.

    Still, don't you think that's too little, comparing to other things?
    The single sqlservr.exe in MS SQL is 41M.
    I'm not saying that its not worth doing at all, I'm discussing priorities.
    And in that sense, imho java .class is one of a least likely formats to
    ever get a recompressor
    No, I don't think that's too little. And I'm not the only one. I've seen 4 or 5 lossy recompressors that optimized the internal structures, but they were all supposed to produce code for direct execution, that is - still Java classes. Piotr mentioned Dalvik - it's another example, very similar to the .class tools, in that space was only one of many design factors.

    Quote Originally Posted by Shelwien View Post
    > Really, why do you care for anything but code before you even started?

    Let's look at already mentioned sqlservr.exe (I specially saved it
    for disasm testing and such, because its a service, and doesn't have
    any pictures or much strings inside).

    sqlservr.exe 40999448
    .text 32654336 (others are not executable)
    .text-filt 30577564 (some text blocks removed)

    And now acrord32.exe from SFC, structure of which is more common for GUI apps:

    acrord32.exe 3870784
    .text 2233600 (mostly clean code here)

    powerarc.exe 8750424
    CODE 6887312
    CODE-filt 6609229 (RTTI structures etc removed)

    So actually its wrong to approach exes as if they were all code.
    Actually when existing disasm filters are applied only to the
    code section, instead of whole exe, compression noticeably
    improves.
    In other words, .exe is a container format, and it doesn't
    make sense to start processing code before learning to parse .exe

    > I'd guestimate that 90-95% of average exe is code.

    Well, its wrong. It wasn't like that even in DOS, where exe format
    was much more compact and strings were ascii instead of unicode.

    > Even though variance is big and you can find files all the way to 0%
    > code, I don't think it's worthwhile to do something special for them.

    Pictures/Icons, unicode strings, binary tables (especially float-point)
    all can benefit greatly from proper handling (like 2x better compression),
    OK, I was wrong. And I guess that ELF and COFF store things other than code in significantly different ways?

    Quote Originally Posted by Shelwien View Post
    > But gains from convincing a distro are way way bigger.

    Its likely _lots_ more work (the compression engine has to be already stable
    and fully portable when presented), and no direct benefits.
    Using it with some small apps as soon as it works is much more attractive imho.
    I don't agree with you here.

    Quote Originally Posted by Shelwien View Post
    > Yeah, png is crap and paq is useless. But there are so many other
    > options...yes, the strongest yet fast enough are closed source only,
    > but for me it's not a reason to ignore them.

    There're not so much stable closed-source ones either.
    And even unstable... its like only BMF and Rhatushnyak's coders
    (no sense to look at others if they have worse compression
    http://www.researchandtechnology.net...benchmarks.php)

    Anyway, the question is basically what to use for lossless
    image compression in my archiver. With addition that I won't
    use anything based on huffman or broken arithmetic, because
    they're outdated.
    You know what? I went to research the state of the art in image compression and found (among other things) BCIF to be the FOSS frontier... and now I see that you mentioned it. When I read your post I missed it entirely.
    Anyway, if I were you I'd do one of 4 things:
    -Convince Rhatushnyak or Shkarin to cooperate.
    -Use BCIF
    -Take BCIF and see how it can be improved. I think you found something already.
    -Write something entirely new
    Honestly, I don't know which one I would do, as the straightforward options are no fun at all.
    Anyway, I agree that a bitmap codec is important in a compressor, so the priority of getting one, one way or another, should be high. I still wouldn't be thrilled to see yet another image codec though.

    Quote Originally Posted by Shelwien View Post
    > I think most people won't bother for 10%.

    As I said, it can be also used to improve quality.
    Afaik a video quality improvement corresponding to 10% higher bandwidth
    would be noticeable, especially at low bitrates.
    Yes, but h264 playback is already slow. My PC fails to play HD videos in real time. If you loosen the speed constraints, you shift from streaming to storage.

    Quote Originally Posted by Shelwien View Post
    > For me it's usually the limit of what I'm willing to do and whenever
    > I talk to people about various tools that I use because they give
    > such significant savings, they treat me as a freak.

    Even 1% helps when a file doesn't fit to a DVD
    Very correct.

    Quote Originally Posted by Shelwien View Post
    Btw, here's my old avi dumper - http://nishi.dreamhosters.com/u/avip_v0.rar
    I actually used it to download movies (with mp3zip for audio track) when
    I still had dialup
    Movies on dialup? You're extreme.

    Quote Originally Posted by Shelwien View Post
    > BTW I suggest splitting the topic.

    Just split this from "compressing mp3" to some "Format priority for recompression",
    or do you mean something else?
    Yeah, that's what I meant. We're deep off topic now.

  14. #14 - Shelwien (Administrator)
    >> Well, at least schnaader seems to have permanent problems with it.

    > Sounds strange, I thought zlib was considered stable.
    > Well, I'll have an occasion to find out myself.

    It's reasonably stable, but imho decoding random data until an error
    and bruteforcing options are not its intended uses.

    > Large blocks are desirable to determine what compression mode to use
    > and later unnecessary. If that's bad for you - OK, it's an inherent problem.

    I meant that if you need a 1MB block for decoding, then you won't be able
    to decode in parallel with (slow) download, which may add a significant delay.
    This is especially bad with BWT coders.

    > [xml preprocessing]
    > Mhm. But having deeper understanding of data opens more
    > possibilities. I was very disappointed when I saw how far was it
    > from my expectations.

    Yes, but as I said, making a good preprocessor is harder than
    making a good specialized compressor, because the task is basically the same,
    but the preprocessor has many more restrictions.

    1. A preprocessor doesn't need to remove duplicate strings, because
    that's where universal compressors are good.
    2. A preprocessor can't just randomly add control/escape codes - that
    can hurt compression if done wrong, because it's pure redundancy
    (there's no need to losslessly restore these codes, but the external
    compressor would still do that).
    3. It's necessary to be careful about the properties of specific
    compression algorithms. For example, deleting spaces from
    plaintext would significantly reduce the file size, but it can
    actually hurt overall compression.
    Another example: durilca encodes numbers (produced by its disasm filter)
    kinda like XYZT -> 1X 2Y 3Z 4T (X-T are nibbles here; see the sketch below). This adds
    redundancy, but ensures aligned matching of numbers, so compression
    surprisingly improves.
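
    A minimal sketch of that position-tagged nibble encoding (an illustration of the idea only, not durilca's actual code):
    Code:
    #include <cstdint>
    #include <vector>
    
    // A 16-bit value with nibbles X,Y,Z,T is written as the four bytes 1X 2Y 3Z 4T.
    // The output is twice as big, but equal numbers now always produce the same
    // aligned byte sequence, which an LZ/PPM/CM backend can match reliably.
    void put_tagged_nibbles(std::vector<uint8_t>& out, uint16_t v)
    {
        for (int k = 3; k >= 0; k--) {                  // X (the high nibble) first
            uint8_t nib = (uint8_t)((v >> (4 * k)) & 0xF);
            uint8_t tag = (uint8_t)(4 - k);             // 1 for X, 2 for Y, 3 for Z, 4 for T
            out.push_back((uint8_t)((tag << 4) | nib)); // one byte per nibble, tag in the high half
        }
    }
    
    uint16_t get_tagged_nibbles(const uint8_t* p)       // inverse: read 4 tagged bytes back
    {
        uint16_t v = 0;
        for (int k = 0; k < 4; k++)
            v = (uint16_t)((v << 4) | (p[k] & 0xF));
        return v;
    }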

    > [java classes]
    > No, I don't think that's too little.

    Obviously I meant that it's too little to pay attention to when
    developing a general-purpose archiver in C++.

    > And I'm not the only. I've seen 4 or 5 lossy recompressors that
    > optimized internal structures,

    It's a completely different task though, and these "recompressors"
    likely won't be of any use for lossless recompression even if they're
    open-source.

    But sure I know that java is popular.

    > I guess that ELF and COFF store things other than code in
    > significantly different ways?

    Not quite sure what you mean, but their structure is considerably different;
    they're not versions of the same format, though there are common concepts.

    Anyway, windows executables frequently contain a lot of GUI resources,
    because that's supported by the system.
    (And ELF executables frequently contain debug info :).

    Usually it's possible to identify executable sections by names
    and/or executable attributes, but there are common exceptions like
    exe packers.

    Also, the code section usually contains a lot of data
    (some strings; jump tables; RTTI structures; alignment padding),
    and it's basically impossible to identify which is which
    with a sequential scanner - most data can still be disassembled
    without any "invalid opcodes".

    That's basically the main problem of current disasm filters -
    they do improve compression of real code, but hurt compression
    of the data which they disassemble along with it.


    > Anyway, if I were you I'd do one of 4 things:
    > -Convince Rhatushnyak or Shkarin to cooperate.

    You mean, for me to use their codec in a commercial archiver?
    That would be expensive... and useless.

    > -Use BCIF

    I might look at it later (I'm not going to work on image compression atm anyway),
    but for now it didn't impress me at all.

    > -Take BCIF and see how can it be improved. I think you found something already.

    No sense.

    > -Write something entirely new

    That's most likely. Actually, atm it's pretty easy to make a CM image coder with good
    results.

    > Honestly, I don't know which one would I do as the straightforward
    > options give no fun at all.

    There's a lot of possible fun actually, because existing models are either primitive
    (paq, bmf, maybe flic/gralic) or do a lot of stuff which seems complex but doesn't
    make much sense (guessing from the fact that the first group has better results) - stuff
    like wavelets etc.
    For example, I posted some info about the paq8 bmp model here -
    http://encode.ru/threads/1195-Using-...ll=1#post23673

    So I really wonder what a good CM coder could do - like fuzzy pattern matching etc.

    > Anyway, I agree that a bitmap codec is important in a compressor, so
    > priority of getting it one way or another should be high. I still
    > wouldn't be thrilled to see yet another image codec though.

    It's not very high on my list anyway - I might need one for jpeg recompression,
    but the circumstances there are different, so that would be yet another lossless
    image codec, different from the bmp one :).


    > Yes, but h264 playback is already slow. My PC fails to play HD
    > videos in real time. If you loosen the speed constraints, you shift
    > from streaming to storage.

    That doesn't mean we can't use stronger/slower compression for
    things like youtube videos though.

    > Movies on dialup? You're extreme.

    Yeah... first there was only dialup, then adsl appeared, but only 64kbit
    was unlimited, so for a few months I used 48kbit dialup + 64kbit adsl
    (on the same phone line!), which required some tricks to download one
    file via both connections (in the end I solved that by binding local proxies
    to both interfaces, then using flashget's "multi-proxy" mode, plus some
    custom utils to reconfigure the proxies when the ip changed).

  15. #15 - Programmer
    Quote Originally Posted by Shelwien View Post
    Also surely there's less redundancy in h264 (if only because it
    actually uses a kind of arithmetic coding), but I'd still estimate
    at least 10-15%.
    I've done some research on H.264, and H.264 is much more complex than MPEG2. My estimates are:
    a) up to 10% improvement with CABAC H.264
    b) 15-25% improvement with CAVLC H.264

  16. #16 - Programmer
    Quote Originally Posted by Shelwien View Post
    for example, you can compare jpeg/ari (or winzip jpeg)
    vs paq8 result and apply that ratio to h264.
    It's not so easy. JPEG encodes images as DCT coefficients, but MPEG encodes images as DCT coefficients of the difference between neighboring frames (except for I-frames).

  17. #17 - Programmer
    Quote Originally Posted by Shelwien View Post
    The actual problem with jpeg is still jpeg parsing - if you have a lossless parser,
    you can immediately reach nearly the level of winzip jpeg (as my pjpg showed).
    You are right. It took me about a week to get nearly to the level of PackJPG (using the JPEG Open Source Developers Package). I can even release the sources if you want.

    Quote Originally Posted by Shelwien View Post
    Although for jpeg too, I won't be satisfied unless my model would have better
    compression than paq8 or stuffit
    It will be hard to achieve at the speed of WinZip or PackJPG, but good luck.

  18. #18 - Shelwien (Administrator)
    > I've made some research on H.264 and H.264 is much more complex than MPEG2. My estimations are
    > a) up to 10% improvement with CABAC H.264
    > b) 15-25% improvement with CAVLC H.264

    Thanks for the info. I somehow thought that CABAC was used in most cases.
    Can you tell what's used in popular sources,
    like web streams (youtube), HD movie rips, blu-rays?

    > It's not so easy. JPEG encodes images as DCT coefficients, but MPEG
    > encodes images as DCT coefficients of difference between neighboring
    > frames (except I-frames).

    Code:
    340018 // cjpeg -quality 95 -optimize image.bmp
    305776 // cjpeg -quality 95 -arithmetic image.bmp
    282789 // pjpg_v0  http://nishi.dreamhosters.com/u/pjpg_v0_bin.rar
    264967 // paq8px69 -7 image_huf.jpg
    (1-305776/340018)*100 = 10.07% // gain from jpeg AC
    (1-282789/305776)*100 =  7.51% // pjpg comparing to jpeg AC
    (1-264967/305776)*100 = 13.35% // paq8px comparing to jpeg AC
    (1-264967/340018)*100 = 22.07% // paq8px comparing to jpeg huffman
    I meant this. It's actually not so different from your estimates for h264.

    Also, imho the nature of the encoded data is not so important here, as that
    much can be gained simply by applying a precise AC + tuned counters
    (I mean pjpg - it's my prototype jpeg recompressor).

    > You are right. It took me about week to get a level nearly of
    > PackJPG (using JPEG Open Source Developers Package). I can even
    > release the sources if you want.

    That might be interesting. There are packjpg tools too, and I can
    post my djpeg rip, so we'd have quite a collection :).

    > It will be hard to achieve at speed of WinZip or PackJPG, but good luck

    pjpg is faster than winzip afaik... though well, dejpg is faster than
    winzip's code for the same thing, so it's no wonder :) - good compilers
    are good.

    Sure, the next version will be slower, to compete with paq, but it's
    really hard to reach paq's speed :)


    P.S. I'd really like to know whether you're using any tools for C/C++ refactoring
    (of format parsers) and/or any binary parser generator (like flavor).

  19. #19 - Programmer
    Quote Originally Posted by Shelwien View Post
    I somehow thought that CABAC is mostly applied.
    About 50% of my files are CAVLC. CABAC is not even supported in the following profiles (according to http://en.wikipedia.org/wiki/H.264/MPEG-4_AVC):
    Constrained Baseline Profile (CBP)
    Baseline Profile (BP)
    Extended Profile (XP)

    Quote Originally Posted by Shelwien View Post
    Can you tell what's used in popular sources?
    like web streams (youtube), HD movie rips, bluerays?
    CABAC:
    youtube.com
    Blu-ray (checked only 1 file)
    Panasonic SD700 movies

    CAVLC:
    QuickTime MPEG-4 (e.g. from apple.com)
    Canon 500D movies
    Sony HDR CX6 movies

    Quote Originally Posted by Shelwien View Post
    P.S. I'd really like to know whether you're using any tools for C/C++ refactoring
    (of format parsers) and/or any binary parser generator (like flavor).
    There was no need to make my own parser, but I like the idea of a parser generator

  20. #20 - Shelwien (Administrator)
    Thanks.

    There's that - http://flavor.sourceforge.net/ but somehow I don't like it.

  21. #21 - Programmer
    BTW, I've managed to improve the speed of OCA_MPEG to 500 kb/s at about 2% worse compression (18.7% in my experiments) than OCA_MPEG level 2. It's hard to get better speed, because the decompressed DCT is 10-20 times bigger than the input data.

  22. #22 - Programmer
    @Shelwien: Most ideas that work for JPEG (in rejpeg) don't work even for MPEG-2, therefore I don't think that estimates of H.264 recompression based on JPEG are accurate.
    Last edited by inikep; 12th March 2011 at 00:33.

  23. #23 - Shelwien (Administrator)
    > BTW, I've managed to improve speed of OCA_MPEG to 500 kb/s

    It's good that there's progress, but people say that even 5MB/s is slow for extraction of common files,
    and videos are commonly larger...

    > It's hard to get better speed, because decompressed DCT is 10-20 times bigger than input data.

    The common workaround is to encode more probable values with a single bit
    (unary coding or something similar).

    Or, in other words, you can apply some static compression first
    (some data transformation + huffman coding maybe)
    then a CM model.
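
    For illustration, a minimal sketch of that "one bit for the probable values" idea (just an illustration with made-up helper names, not the OCA_MPEG code): zero coefficients, which dominate decoded DCT data, cost a single flag bit, and anything else gets an escape bit plus an Exp-Golomb style code; a CM model can still be run over the emitted bits afterwards.
    Code:
    #include <cstdint>
    #include <vector>
    
    struct BitWriter {
        std::vector<uint8_t> bytes;
        int nbits = 0;
        void put(int bit) {                               // MSB-first bit output
            if (nbits % 8 == 0) bytes.push_back(0);
            bytes.back() |= (uint8_t)((bit & 1) << (7 - nbits % 8));
            nbits++;
        }
    };
    
    void put_coef(BitWriter& bw, int32_t c) {
        if (c == 0) { bw.put(1); return; }                // most probable value: 1 bit
        bw.put(0);                                        // escape flag
        uint32_t u = (c > 0) ? (uint32_t)c * 2 - 1        // map signed to unsigned
                             : (uint32_t)(-c) * 2;
        int len = 0;                                      // Exp-Golomb: 'len' zero bits,
        for (uint32_t t = u + 1; t > 1; t >>= 1) len++;   // then (u+1) in len+1 bits
        for (int i = 0; i < len; i++) bw.put(0);
        for (int i = len; i >= 0; i--) bw.put((int)(((u + 1) >> i) & 1));
    }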

    Also, in this specific case (mpeg), it might be possible to completely avoid any recoding -
    just parse the original mpeg bits and compute the right contexts for them.

    Though well, it's hard to help here, as I know nothing about your implementation.
