Here's a list of data types: deflate(zip,pdf,png,docx), jpeg, bmp, mp3, aac(mp4,m4a), ogg, flac, wav, COFF(exe/dll/etc), ELF, html, xml
Please write them in desired order of implementation (you can also add your own types).
We're a bit off topic, but my order would be:
wavpack (because it's the strongest lossless audio codec that can be used in videos and played w/out problems)
EDIT: added MP2
> We're a bit off topic, but my order would be:
> html, xml
> ELF, COFF
> Java classes
> wavpack (because it's the strongest lossless audio codec that can be used in videos and played w/out problems)
Well, now let me explain what seems wrong with your priorities.
1. Deflate comes first, because without it you can't reach the jpegs inside some filetypes
(like pdf or jar or .docx).
2. AAC (and also MP2) is mostly used for soundtracks in movie containers,
so video streams (h264 and mpeg1; xvid) should have priority.
However, MP2 is not much different from MP3, and mp3 is the second most popular format after jpeg,
so MP2 support can be implemented much earlier.
3. html is very complex with many sublanguages (css, js, vbs), and the effect of its recompression
would be fairly low, so its priority should be very low.
xml, on the other hand, has a much stricter structure, and can be encountered within various
filetypes (including jpeg; docx is based on it).
4. Java classes... unless you mean .jar (which only requires deflate support),
I'd not expect much effect from their recompression, and multi-TB archives of .class
files are hard to imagine.
5. wavpack... overall, lossless formats should have much lower ranks compared to lossy ones,
because there's usually no need to write an integrated recompressor for these.
You can just extract your wavpack track and compress it with optimfrog, then extract
and compress again with wavpack, then remux.
Imho atm it's too obscure to handle. The same applies to flac actually, although it's getting
pretty popular now. But again, it's lossless, and it's easier to write a better lossless
audio compressor + player plugins than a proper recompressor.
6. Windows EXEs are fairly important. They're getting more and more redundant
(compiler inlining, x64), they're basically containers for many data types, and
there's no good support (the best is likely durilca's disasm filter, which is
still very simple and redundant; and the usual bcj/bcj2 is plain laughable).
However, on a linux system you don't normally go around downloading ELF binaries,
so that's not so important.
7. bmp and wav do require custom models. They're pretty widespread as is, but
can also be encountered within other formats. For example, if we removed deflate
from png, it would basically become bmp; also, a good bitmap model is required for
good jpeg compression, though there it won't be used directly.
8. ogg... it takes a lot of time to develop a recompressor for a complex format,
so I won't do it unless there's some demand. I have a use myself for the things listed
above, but not for this.
9. MPEG4, if you mean h264, would surely be cool, but video codecs are harder to
deal with than others, so it can't have higher priority than bmp.
> Tried ffmpeg? Works for me.
That's good to know. In the worst case I'd really have to go extracting modules from huge apps
to build the utils I need on my own. Sometimes it can be harder than decompiling some mp4.dll though,
because with a dll I at least know for certain that it does what I need.
But what I want is something like lame for aac, which would handle all aac streams and nothing else.
faac/faad try to be that, but it's not exactly enough.
And it's not something I meant for archival - but for installers.
And what numbers are we talking here? Petabytes per year?
And there's nothing that targets them specifically (except for obfuscators), so the state of the art is underwhelming.
IMO the big reason to favour COFF is that there are far more people using it. Reason to favour elf - all FOSS OSes use it. For me it levels the field.
Judging by Ocarina's codec, I think there's fairly little to be saved on MPEG2, and I guess much less with h264. Actually I would be surprised if h264 recompression gave enough savings to be worthwhile soon.
EDIT: fixing typos and grammar. I read everything twice before posting, yet when reading for a third time, I still see many errors...
> I assumed you talked about a recompressor similar to the LZMA one,
> where you wouldn't be able to access them anyway.
No, unlike lzma, for deflate it's better to decode all data, then
compress with a different method (for lzma it's in theory better too,
but lossless reconstruction would be terribly slow).
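To make the "decode, then recompress and verify" idea concrete, here's a minimal sketch in the spirit of precomp, assuming zlib. The function names, the raw-deflate window setting and the single fixed level are my assumptions; a real recompressor would search over level/strategy/memLevel and store correction data for near-misses.
Code:
#include <zlib.h>
#include <cstring>
#include <vector>

// Inflate a raw deflate stream into 'out'; returns false on error.
static bool inflate_raw(const std::vector<unsigned char>& in, std::vector<unsigned char>& out) {
    z_stream zs{};
    if (inflateInit2(&zs, -15) != Z_OK) return false;          // -15 = raw deflate, 32k window
    out.clear();
    zs.next_in  = const_cast<unsigned char*>(in.data());
    zs.avail_in = static_cast<unsigned>(in.size());
    unsigned char buf[65536];
    int ret;
    do {
        zs.next_out  = buf;
        zs.avail_out = sizeof(buf);
        ret = inflate(&zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) { inflateEnd(&zs); return false; }
        out.insert(out.end(), buf, buf + (sizeof(buf) - zs.avail_out));
    } while (ret != Z_STREAM_END);
    inflateEnd(&zs);
    return true;
}

// Re-deflate 'data' with a given level and check that it reproduces 'original' bit-exactly.
// Only then can the archiver store the decoded data plus the deflate parameters.
static bool redeflate_matches(const std::vector<unsigned char>& data,
                              const std::vector<unsigned char>& original, int level) {
    z_stream zs{};
    if (deflateInit2(&zs, level, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY) != Z_OK) return false;
    std::vector<unsigned char> out(deflateBound(&zs, data.size()));
    zs.next_in   = const_cast<unsigned char*>(data.data());
    zs.avail_in  = static_cast<unsigned>(data.size());
    zs.next_out  = out.data();
    zs.avail_out = static_cast<unsigned>(out.size());
    int ret = deflate(&zs, Z_FINISH);
    out.resize(out.size() - zs.avail_out);
    deflateEnd(&zs);
    return ret == Z_STREAM_END && out == original;
}
Such a check only catches streams that zlib itself produced with matching settings, which is exactly the format-compatibility limitation mentioned below.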
> I still stand by the current order - because while there are solutions for both JPEG and deflate
> I feel both are lacking, however I think that with deflate it will
> change soon and I don't see such improvement for JPEG on the horizon.
There's no actual solution for deflate atm. Precomp is just a proof-of-concept,
it can't be integrated into a popular archiver (not stable enough, no lib/dll,
temp files, either too slow, or requires manual control).
And there are two for jpeg - packjpg and the winzip one.
Of these, packjpg is even ready for immediate use.
There're also a few paq versions with different jpeg models, but these are not
practical due to speed.
So afaik the situation with jpeg is much better than with deflate.
The actual problem with jpeg is still jpeg parsing - if you have a lossless parser,
you can immediately reach nearly the level of winzip jpeg (as my pjpg showed).
And it's also known how to reach paq8 jpeg results - the model is open-source and
even described in Matt's DCE.
So it's only a parser problem, and I already did most of the necessary work
(like factoring down the djpeg source to ~100k + the pjpg parser).
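To illustrate the parsing part: the marker-level structure of jpeg is simple to walk; the hard part is keeping every padding byte and non-standard quirk reproducible. A rough sketch of the easy part (my own illustration, not the pjpg parser):
Code:
#include <cstdint>
#include <cstddef>

// Returns the offset of the first entropy-coded byte (after the SOS header),
// or 0 if the buffer doesn't look like a jpeg. No handling of malformed files.
size_t find_scan_data(const uint8_t* p, size_t size) {
    if (size < 4 || p[0] != 0xFF || p[1] != 0xD8) return 0;    // SOI
    size_t i = 2;
    while (i + 4 <= size && p[i] == 0xFF) {
        uint8_t marker = p[i + 1];
        if (marker == 0xD9) return 0;                          // EOI before any scan
        if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) { i += 2; continue; }
        size_t len = (size_t(p[i + 2]) << 8) | p[i + 3];       // big-endian length, includes these 2 bytes
        if (marker == 0xDA) return i + 2 + len;                // SOS: scan data follows its header
        i += 2 + len;                                          // skip APPn/DQT/DHT/SOF/...
    }
    return 0;
}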
As to deflate, it requires more work afaik. Sure, cloning precomp is relatively
easy, but the clone would have the same issues with stability, speed, stream processing,
and format compatibility (ie it won't recompress non-zlib deflate streams).
So from my point of view, I have to make an actual universal codec to compress
the uncompressed deflate data, and also have to make a good symmetric parser
for deflate (like I already did for lzma), and then a custom CM to compress
the deflate codes in the context of the uncompressed data.
To me it seems a lot more complicated than making a jpeg recompressor with a paq-like
model but reasonable speed.
Although for jpeg too, I won't be satisfied unless my model has better
compression than paq8 or stuffit.
> I roughly evaluated savings/(processing time) for Ocarina's test
> codec for MPEG video and SoundSlimmer on MP3, and guessed the ratio for
> AAC would be better.
Yes, ratio for AAC is expected to be better than for mp3, because AAC
contains more redundant information (huffman table ids etc) which can
be derived from spectral coefs.
I don't see how it's related to that mpeg codec though.
Also, its processing speed likely doesn't mean anything - it's either not
optimized, or uses a zpaq backend, or both.
> They are omnipresent, that's the only reason why I suggested them
> this high. If you feel that savings would be low, I agree that the
> priority should be lowered.
Savings would probably be good, because a normal CM model can't
predict the context changes in html.
But it's just too complex. We need a browser-level html parser
to make a really good html recompressor, and what's worse, we
likely can't use open-source stuff like webkit and chromium,
because all the normal parsers are lossy, and it eventually
becomes an error-correction hell when you try to losslessly
reconstruct the original data from a DOM in memory.
Anyway, the absolute savings from html would be small compared
to image/audio/video, so that's what should be done first.
> Yes, I meant .class files. I know very little about them, but they
> seem much like exes, just much more structured and with more
> metadata - so I expect them to compress better than exes.
> But I'm very far from certain I'm right.
One problem is that they compress better with plain CM, without
any special handling, so the recompression gain would be smaller.
(While for exe it can reach 100% - compare ppmd with winrk results on exe files.)
> And it's not something I meant for archival - but for installers.
So do you have java apps with lots of code?
The apps that I happen to have are mostly 100-500k, aside from java itself.
> And what numbers are we talking here? Petabytes per year?
I'm talking about whether one can imagine a cost reduction
from .class (or .jar) storage on a filehosting service or something.
> And there's nothing that targets them specifically (except for
> obfuscators), so the state of the art is underwhelming.
Well, java programmers can't write strong compressors (because java
is too slow for that), and others don't care, I guess.
> Interesting opinion, seeing who made dispack and how close it is to BCJ2,
Well, disasm quality doesn't immediately mean good compression.
Care to compare dispack+ppmonstr vs durilca?
I mean that for a disasm preprocessor, it's not only necessary
to parse the code, but also to transform it into something compressible
by universal compressors.
> I expected the state of art to be well developed. If you think
> you can save a lot, I strongly encourage you to do so.
Yeah, well, unfortunately it's pretty complicated.
Ideally we need an ida-level disassembler (not one-pass) + a custom
CM model for code + an exe parser (identifying tables of constants,
like floats, ascii/unicode text, various resources - images, icons,
dialogs, signatures, manifests) + models for everything listed before.
One interesting problem, btw, is that in x64 exes now _all_
memory addresses are relative, the same thing as with call/jmp in x86,
but it can't be handled just by finding a signature byte like E8
(which is essentially what BCJ/BCJ2 do).
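For reference, here's a minimal sketch of the E8 trick being referred to (my own illustration, not the actual BCJ/BCJ2 code): relative call targets are rewritten as absolute offsets so that repeated calls to the same function produce repeated byte patterns. Real filters also handle E9 jumps, check target plausibility, and are exactly invertible.
Code:
#include <cstdint>
#include <cstring>
#include <cstddef>

// Rewrite CALL rel32 targets as absolute offsets within the buffer.
void e8_filter_encode(uint8_t* buf, size_t size) {
    for (size_t i = 0; i + 5 <= size; ) {
        if (buf[i] == 0xE8) {                                  // CALL rel32
            int32_t rel;
            std::memcpy(&rel, buf + i + 1, 4);                 // little-endian displacement
            int32_t abs = rel + static_cast<int32_t>(i + 5);   // relative to the next instruction
            std::memcpy(buf + i + 1, &abs, 4);                 // store absolute target instead
            i += 5;
        } else {
            i += 1;
        }
    }
}
// Decoding is the same scan with rel = abs - (i + 5).
The x64 problem above is exactly that RIP-relative data references have no such single signature byte, so a one-byte scan like this can't normalize them.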
> IMO the big reason to favour COFF is that there are far more people
> using it. Reason to favour elf - all FOSS OSes use it.
Ok, I'll try to clarify this. Windows programs are commonly distributed
as _executable_ independent installers. Basically each developer
distributes his own app. Recently I also see something similar for
Mac versions - the .DMG images, but the mac executable format is not really ELF,
so let's leave it alone for now.
I wanted to say that linux programs are either distributed as source
archives by developers, or as ELF binaries in specialized repositories
for given linux distros.
The difference is that it's much easier to convince an independent developer
to try improving compression in his installer than to do the same
with a whole linux distro.
> Yes, I know how important it is, I just don't think we need yet another image coder.
You may be surprised, but we don't have one atm.
PNG compression is very bad (1194493 png+pngcrush, 539003 paq8px),
paq is ok but too slow to be of any use, and the others are not open-source
(including jpeg2000ls afaik).
Also, paq's bmp model is clearly not very smart; I'd expect a considerable improvement to be possible.
> Judging by Ocarina's codec, I think there's fairly little to be
> saved on MPEG2, and I guess much less with h264.
Actually I think that Przemyslaw's results are pretty good - it's
more than my 15% on mp3, at least.
Also, surely there's less redundancy in h264 (if only because it
actually uses a kind of arithmetic coding), but I'd still estimate
at least 10-15%.
> Actually I would be surprised if h264 recompression gave
> enough savings to be worthwhile soon.
In a way, it's a case where anything goes. It would likely be used
to improve video quality at low bitrates, instead of changing
the bitrate. And a 10% gain at entropy coding would certainly make
a visible difference.
I'm not sure what you mean by stream processing... input and output as streams with no chance to buffer them all? It's not a big deal.
BTW I suggest splitting the topic.
Thanks, but I need something like that with source.
>> As to deflate, it requires more work afaik.
> I don't think so. I see no inherent problems with stability.
I don't see any either - if you don't use zlib.
Well, at least schnaader seems to have permanent problems with it.
> I'm not sure what you mean by stream processing...
> input and output as streams with no chance to buffer them all? It's not a big deal.
I meant sequential processing, without seeking around.
Though having to work with large blocks is bad too.
> There was an xml codec that did full parsing and IIRC used a ready-made library.
> It was slow, unstable and not better than XWRT though...
Unfortunately most parsers are inherently lossy (ie it's impossible to reproduce
incomplete or erroneous or simply non-standard inputs),
or both lossy and redundant, which is even more fun.
Also, format parsing doesn't mean good compression. Compression is provided by
a specialized compression method (which usually requires parsing to implement),
while a parser made for a different purpose can just as well hurt compression.
It's even harder to make a good preprocessor, because the idea is basically the
same, but you have to reorganize the format data in such a way that a specific
"universal" compressor would be able to notice the dependencies.
> Vuze has 21.8 MB of classes and Sweet Home 3D - 15. That's ~380 TB yearly just for 2 apps.
Still, don't you think that's too little, compared to other things?
The single sqlservr.exe in MS SQL is 41M.
I'm not saying that it's not worth doing at all, I'm discussing priorities.
And in that sense, imho java .class is one of the least likely formats to
ever get a recompressor :)
> Really, why do you care for anything but code before you even started?
Let's look at already mentioned sqlservr.exe (I specially saved it
for disasm testing and such, because its a service, and doesn't have
any pictures or much strings inside).
.text 32654336 (others are not executable)
.text-filt 30577564 (some text blocks removed)
And now acrord32.exe from SFC, whose structure is more common for GUI apps:
.text 2233600 (mostly clean code here)
CODE-filt 6609229 (RTTI structures etc removed)
So actually it's wrong to approach exes as if they were all code.
Actually, when existing disasm filters are applied only to the
code section, instead of the whole exe, compression noticeably improves.
In other words, .exe is a container format, and it doesn't
make sense to start processing code before learning to parse .exe.
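For illustration, here's roughly how one would walk the PE section table to see how much of an exe is actually marked executable (offsets from the published PE format; a sketch of my own, with no validation of malformed files and simple unaligned reads):
Code:
#include <cstdint>
#include <cstdio>
#include <cstring>

void list_sections(const uint8_t* p, size_t size) {
    if (size < 0x40) return;
    uint32_t pe = *reinterpret_cast<const uint32_t*>(p + 0x3C);            // e_lfanew
    if (pe + 24 > size || std::memcmp(p + pe, "PE\0\0", 4) != 0) return;
    uint16_t nsec   = *reinterpret_cast<const uint16_t*>(p + pe + 6);      // NumberOfSections
    uint16_t opthdr = *reinterpret_cast<const uint16_t*>(p + pe + 20);     // SizeOfOptionalHeader
    const uint8_t* sec = p + pe + 24 + opthdr;                             // first IMAGE_SECTION_HEADER
    for (int i = 0; i < nsec; ++i, sec += 40) {
        char name[9] = {};
        std::memcpy(name, sec, 8);
        uint32_t rawsize = *reinterpret_cast<const uint32_t*>(sec + 16);   // SizeOfRawData
        uint32_t flags   = *reinterpret_cast<const uint32_t*>(sec + 36);   // Characteristics
        std::printf("%-8s %10u %s\n", name, rawsize,
                    (flags & 0x20000000u) ? "executable" : "data");        // IMAGE_SCN_MEM_EXECUTE
    }
}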
> I'd guesstimate that 90-95% of an average exe is code.
Well, that's wrong. It wasn't like that even in DOS, where the exe format
was much more compact and strings were ascii instead of unicode.
> Even though variance is big and you can find files all the way to 0%
> code, I don't think it's worthwhile to do something special for them.
Pictures/icons, unicode strings, and binary tables (especially floating-point)
can all benefit greatly from proper handling (like 2x better compression), so they're worth treating specially.
> But gains from convincing a distro are way way bigger.
It's likely _lots_ more work (the compression engine has to be already stable
and fully portable when presented), with no direct benefits.
Using it with some small apps as soon as it works is much more attractive imho.
> Yeah, png is crap and paq is useless. But there are so many other
> options...yes, the strongest yet fast enough are closed source only,
> but for me it's not a reason to ignore them.
There aren't so many stable closed-source ones either.
And even among unstable ones... it's basically only BMF and Rhatushnyak's coders
(no sense to look at others if they have worse compression).
Anyway, the question is basically what to use for lossless
image compression in my archiver. With the addition that I won't
use anything based on huffman or broken arithmetic.
>> Also surely there's less redundancy in h264, but I'd still estimate at least 10-15%.
> As I said already, I remembered MPEG2 savings wrong. But I still
> feel surprised by your estimation.
There's a known 5% result from replacing cabac with a precise rc (may be fake though).
Also, h264 has huge speed restrictions, so I won't be surprised
even by a larger improvement - for example, you can compare the jpeg/ari (or winzip jpeg)
result vs the paq8 result and apply that ratio to h264.
> I think most people won't bother for 10%.
As I said, it can also be used to improve quality.
Afaik a video quality improvement corresponding to 10% higher bandwidth
would be noticeable, especially at low bitrates.
> For me it's usually the limit of what I'm willing to do and whenever
> I talk to people about various tools that I use because they give
> such significant savings, they treat me as a freak.
Even 1% helps when a file doesn't fit on a DVD :)
Btw, here's my old avi dumper - http://nishi.dreamhosters.com/u/avip_v0.rar
I actually used it to download movies (with mp3zip for the audio track) when
I still had dialup :)
> BTW I suggest splitting the topic.
Just split this off from "compressing mp3" into something like "Format priority for recompression",
or do you mean something else?
As to Java classes:
- there is the Pack200 utility, which performs a lossy transform on .class files and allows for, say, a 50% reduction in size;
- Android uses the Dalvik VM, which uses .dex files instead of .class files, ie .class files are converted to .dex files before uploading to an Android device; .dex also offers a much more compact representation of code.
I had to do some research before replying.
Well, I'll have an occasion to find out myself.
Anyway, if I were you I'd do one of 4 things:
-Convince Rhatushnyak or Shkarin to cooperate.
-Use BCIF.
-Take BCIF and see how it can be improved. I think you found something already.
-Write something entirely new.
Honestly, I don't know which one I would do, as the straightforward options give no fun at all.
Anyway, I agree that a bitmap codec is important in a compressor, so priority of getting it one way or another should be high. I still wouldn't be thrilled to see yet another image codec though.
>> Well, at least schnaader seems to have permanent problems with it.
> Sounds strange, I thought zlib was considered stable.
> Well, I'll have an occasion to find out myself.
It's reasonably stable, but imho decoding random data until an error occurs
and option brute-forcing are not its intended uses.
> Large blocks are desirable to determine what compression mode to use
> and later unnecessary. If that's bad for you - OK, it's an inherent problem.
I meant that if you need a 1MB block for decoding, then you won't be able
to decode in parallel with a (slow) download, which may add a significant delay.
This is especially bad with BWT coders.
> [xml preprocessing]
> Mhm. But having a deeper understanding of the data opens more
> possibilities. I was very disappointed when I saw how far it was
> from my expectations.
Yes, but as I said, making a good preprocessor is harder than
making a good specialized compressor, because the task is basically the same,
but the preprocessor has many more restrictions.
1. A preprocessor doesn't need to remove duplicate strings, because
that's where universal compressors are good.
2. A preprocessor can't just randomly add control/escape codes - it
can hurt compression if done wrong, because it's pure redundancy
(there's no need to losslessly restore these codes, but the external
compressor would still do that).
3. It's necessary to be careful about the properties of specific
compression algorithms. For example, deleting spaces from
plaintext would significantly reduce the file size, but it can
actually hurt overall compression.
Another example: durilca encodes numbers (produced by its disasm filter)
kinda like XYZT -> 1X 2Y 3Z 4T (X-T are nibbles here). This adds
redundancy, but ensures aligned matching of numbers, so compression improves overall.
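Here's a toy sketch of that nibble-tagging idea (the function name and the 16-bit input width are my own choices, not durilca's actual code): each nibble is emitted as a byte whose high nibble is a position tag. The output is larger, but equal numbers now produce identical, byte-aligned runs that an LZ/PPM backend can match.
Code:
#include <cstdint>
#include <vector>

// Expand a 16-bit value XYZT into the byte sequence 1X 2Y 3Z 4T.
std::vector<uint8_t> tag_nibbles(uint16_t x) {
    std::vector<uint8_t> out;
    for (int pos = 0; pos < 4; ++pos) {
        uint8_t nibble = (x >> (12 - 4 * pos)) & 0x0F;                    // X, Y, Z, T from high to low
        out.push_back(static_cast<uint8_t>(((pos + 1) << 4) | nibble));   // tag in the high nibble
    }
    return out;
}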
> [java classes]
> No, I don't think that's too little.
Obviously I meant that it's too little to pay attention to when
developing a general-purpose archiver in C++.
> And I'm not the only one. I've seen 4 or 5 lossy recompressors that
> optimized internal structures,
It's a completely different task though, and these "recompressors"
likely won't be of any use for lossless recompression even if they're open-source.
But sure, I know that java is popular.
> I guess that ELF and COFF store things other than code in
> significantly different ways?
Not quite sure what you mean, but their structure is considerably different;
they're not versions of the same format, though there are common concepts.
Anyway, windows executables frequently contain a lot of GUI resources,
because that's supported by the system.
(And ELF executables frequently contain debuginfo :).
Usually it's possible to identify executable sections by names
and/or executable attributes, but there are common exceptions.
Also, the code section usually contains a lot of data
(some strings; jump tables; RTTI structures; alignment paddings),
and it's basically impossible to identify which is which
with a sequential scanner - most data can still be disassembled
without any "invalid opcodes".
That's basically the main problem of current disasm filters -
they do improve compression of real code, but hurt compression
of the data which they disassemble too.
> Anyway, if I were you I'd do one of 4 things:
> -Convince Rhatushnyak or Shkarin to cooperate.
You mean, for me to use their codec in a commercial archiver?
That would be expensive... and useless.
> -Use BCIF
I might look at it later (I'm not going to work on image compression atm anyway),
but for now it didn't impress me at all.
> -Take BCIF and see how it can be improved. I think you found something already.
> -Write something entirely new
That's most likely. Actually, atm it's pretty easy to make a CM image coder with good compression.
> Honestly, I don't know which one I would do, as the straightforward
> options give no fun at all.
There's a lot of possible fun actually, because existing models are either primitive
(paq, bmf, maybe flic/gralic) or do a lot of stuff which seems complex, but doesn't
make much sense (judging from the fact that the first group has better results) - stuff
like wavelets etc.
For example, I posted some info about the paq8 bmp model there.
So I really wonder what a good CM coder could do - like fuzzy pattern matching etc.
> Anyway, I agree that a bitmap codec is important in a compressor, so
> priority of getting it one way or another should be high. I still
> wouldn't be thrilled to see yet another image codec though.
It's not very high on my list anyway - I might need one for jpeg recompression,
but the circumstances there are different, so that would be yet another lossless
image codec, different from the bmp one :).
> Yes, but h264 playback is already slow. My PC fails to play HD
> videos in real time. If you loosen the speed constraints, you shift
> from streaming to storage.
That doesn't mean that we can't use stronger/slower compression for stored content,
like youtube videos, though.
> Movies on dialup? You're extreme.
Yeah... first there was only dialup, then adsl appeared, but only 64kbit
was unlimited, so for a few months I used 48kbit dialup + 64kbit adsl
(on the same phone line!), which required some tricks to download one
file via both connections (in the end I solved that by binding local proxies
to both interfaces, then using flashget's "multi-proxy" mode, plus some
custom utils to reconfigure the proxy when the ip changed).
> I've made some research on H.264 and H.264 is much more complex than MPEG2. My estimations are
> a) up to 10% improvement with CABAC H.264
> b) 15-25% improvement with CAVLC H.264
Thanks for the info. I somehow thought that CABAC was used for most content.
Can you tell what's used in popular sources,
like web streams (youtube), HD movie rips, blu-rays?
> It's not so easy. JPEG encodes images as DCT coefficients, but MPEG
> encodes images as DCT coefficients of difference between neighboring
> frames (except I-frames).
I meant this. It's actually not so different from your estimations for h264.
Code:
340018 // cjpeg -quality 95 -optimize image.bmp
305776 // cjpeg -quality 95 -arithmetic image.bmp
282789 // pjpg_v0 http://nishi.dreamhosters.com/u/pjpg_v0_bin.rar
264967 // paq8px69 -7 image_huf.jpg

(1-305776/340018)*100 = 10.07% // gain from jpeg AC
(1-282789/305776)*100 =  7.51% // pjpg compared to jpeg AC
(1-264967/305776)*100 = 13.35% // paq8px compared to jpeg AC
(1-264967/340018)*100 = 22.07% // paq8px compared to jpeg huffman
Also, imho the nature of the encoded data is not so important here, as that
much can be gained simply by applying a precise AC + tuned counters
(I mean pjpg - it's my prototype jpeg recompressor).
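As an illustration of what "tuned counters" means here (this is the generic CM building block, not the actual pjpg code), a minimal sketch of a binary counter with a tunable update rate, as it would be fed to a precise range coder:
Code:
#include <cstdint>

// Probability of bit==1 in [0, 1<<15); 'rate' controls adaptation speed and is
// the kind of parameter that gets tuned per context class.
struct Counter {
    uint16_t p = 1 << 14;
    void update(int bit, int rate = 5) {
        if (bit) p += (32768 - p) >> rate;   // move towards 1
        else     p -= p >> rate;             // move towards 0
    }
};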
> You are right. It took me about week to get a level nearly of
> PackJPG (using JPEG Open Source Developers Package). I can even
> release the sources if you want.
That might be interesting. There are the packjpg tools too, and I can
post my djpeg rip, so we'd have quite a collection :).
> It will be hard to achieve at the speed of WinZip or PackJPG, but good luck
pjpg is faster than winzip afaik... though, well, dejpg is faster than
winzip's code for the same thing, so it's no wonder :) - good compilers help.
Sure, the next version would be slower, to compete with paq, but it's
really hard to reach paq's speed :)
P.S. I'd really like to know whether you're using any tools for C/C++ refactoring
(of format parsers) and/or any binary parser generator (like flavor).
Constrained Baseline Profile (CBP)
Baseline Profile (BP)
Extended Profile (XP)
Blu-ray (checked only 1 file)
Panasonic SD700 movies
QuickTime MPEG-4 (e.g. from apple.com)
Canon 500D movies
Sony HDR CX6 movies
BTW, I've managed to improve the speed of OCA_MPEG to 500 kb/s at about 2% worse compression (18.7% in my experiments) than OCA_MPEG level 2. It's hard to get better speed, because the decompressed DCT is 10-20 times bigger than the input data.
@Shelwien: Most ideas that work for JPEG (in rejpeg) don't work even for MPEG-2, therefore I don't think that estimates of H.264 recompression based on JPEG are accurate.
> BTW, I've managed to improve the speed of OCA_MPEG to 500 kb/s
It's good that there's progress, but people say that even 5MB/s is slow for extraction of common files,
and videos are commonly larger...
> It's hard to get better speed, because the decompressed DCT is 10-20 times bigger than the input data.
The common workaround is to encode the more probable values with a single bit
(unary coding or something similar).
Or, in other words, you can apply some static compression first
(some data transformation + huffman coding maybe),
then a CM model.
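A toy sketch of that idea (the names and the cutoff are mine, purely illustrative): small, frequent coefficient magnitudes get a short unary code and rare large values get an escape plus raw bits, so the CM model sees far fewer binary decisions than the fully expanded DCT would produce.
Code:
#include <cstdint>
#include <vector>

// Stand-in for feeding one binary decision to a CM model / range coder.
static void put_bit(std::vector<uint8_t>& bits, int bit) { bits.push_back(bit & 1); }

// Encode one quantized coefficient: unary magnitude up to 'cap', then a 16-bit escape.
void encode_coef(std::vector<uint8_t>& bits, int coef, int cap = 4) {
    int mag = coef < 0 ? -coef : coef;
    int m = mag < cap ? mag : cap;
    for (int i = 0; i < m; ++i) put_bit(bits, 1);   // unary prefix
    if (m < cap) put_bit(bits, 0);                  // terminator for small magnitudes
    else for (int i = 15; i >= 0; --i)              // escape: raw magnitude bits
        put_bit(bits, (mag >> i) & 1);
    if (mag) put_bit(bits, coef < 0);               // sign for nonzero values
}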
Also, in this specific case (mpeg), it might be possible to completely avoid any recoding -
just parse the original mpeg bits and compute the right contexts for them.
Though, well, it's hard to help here, as I know nothing about your implementation.