Thread: Split-stream compression (or what?) for x86 machine code

  1. #1 by Paul W. (Member)

    What do state-of-the-art compressors do to compress x86 binaries? (Besides relative/absolute address conversion for a couple of opcodes.)

    Does anybody have a fast table-driven stream-splitter for x86 code lying around? I found the filter in kkrunchy, but it's got all sorts of nested conditionals in it.

    Seems like a 64K-element table would work to process aligned byte pairs fast, if you rig the table to handle pairs that may hold one two-byte opcode, two one-byte opcodes, or a one-byte opcode plus a byte of operand. (But maybe not worth it cache-locality-wise, depending on how often you're hitting weird stuff in the table.)
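
    Roughly the inner loop I have in mind, sketched in C. Everything here is invented for illustration: the class names, the output streams, and of course the table contents, which would be generated offline from the x86 opcode map; the slow path would be a full prefix/ModRM decoder.

        #include <stddef.h>
        #include <stdio.h>

        /* Hypothetical pair classes; real table contents would be
           generated offline from the x86 opcode map. */
        enum pair_class { TWO_ONE_BYTE_OPS, ONE_TWO_BYTE_OP,
                          OP_PLUS_OPERAND, WEIRD };

        static unsigned char pair_class_tbl[65536];   /* 64K classifier */

        /* Stub: a real splitter would decode prefixes, escape bytes,
           ModRM/SIB etc. here and return the number of bytes consumed. */
        static size_t slow_path(const unsigned char *p, size_t n, FILE *misc) {
            (void)n;
            fputc(p[0], misc);
            return 1;
        }

        void split_pairs(const unsigned char *code, size_t n,
                         FILE *ops, FILE *operands, FILE *misc) {
            size_t i = 0;
            while (i + 1 < n) {
                unsigned pair = ((unsigned)code[i] << 8) | code[i + 1];
                switch (pair_class_tbl[pair]) {
                case TWO_ONE_BYTE_OPS:   /* both bytes are opcodes */
                case ONE_TWO_BYTE_OP:    /* one 0F-escaped opcode */
                    fputc(code[i], ops); fputc(code[i + 1], ops);
                    i += 2; break;
                case OP_PLUS_OPERAND:    /* opcode, then operand byte */
                    fputc(code[i], ops); fputc(code[i + 1], operands);
                    i += 2; break;
                default:                 /* the weird stuff */
                    i += slow_path(code + i, n - i, misc); break;
                }
            }
            if (i < n) (void)slow_path(code + i, n - i, misc);
        }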

  2. #2 by cade (Member)

    Have you seen BCJ2 in 7z? If I remember correctly, it splits instructions into separate streams based on type.

  3. #3 by Paul W. (Member)

    cade,

    The descriptions I've seen don't describe BCJ2 as doing what I'm talking about... it sounds like a filter that does relative-to-absolute address translation for jump and call target addresses, like the E8E9 filter in PAQ. (If I'm mistaken about either, I'll stand happily corrected.)

    I'm looking for something that actually splits the instruction stream into separate streams for opcodes, register numbers, and literals, all of which have different regularities. (That's what kkrunchy's filter does, but a table-driven splitter would be way faster.)
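
    To be concrete, here's the shape of the table-driven splitter I mean, again as a C sketch. The flag encoding and the op_flags table are invented; a real table would also have to cover prefixes, 0x0F escapes, SIB bytes, and displacements.

        #include <stddef.h>
        #include <stdio.h>

        /* Invented per-opcode flags; a real table (generated from the
           opcode map) would need far more detail than this. */
        #define HAS_MODRM 0x01
        #define IMM8      0x02
        #define IMM32     0x04

        static unsigned char op_flags[256];

        void split_streams(const unsigned char *code, size_t n,
                           FILE *ops, FILE *regs, FILE *lits) {
            size_t i = 0;
            while (i < n) {
                unsigned char op = code[i++];
                unsigned f = op_flags[op];
                fputc(op, ops);                    /* opcode stream */
                if ((f & HAS_MODRM) && i < n)
                    fputc(code[i++], regs);        /* register/ModRM stream */
                size_t imm = (f & IMM32) ? 4 : (f & IMM8) ? 1 : 0;
                for (size_t k = 0; k < imm && i < n; k++)
                    fputc(code[i++], lits);        /* literal stream */
            }
        }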

  4. #4 by cade (Member)

    BCJ2 has different streams: normal, jump, and call, split by instruction type, rather than just doing an in-place add-or-subtract relative-to-absolute conversion. But you're right, it doesn't split up other instructions.

    There is the exe filter in Durilca (its disassembler is here: http://compression.ru/ds/disasm32.rar), but I don't have any other information about it. There is a table (DisAsm32.tbl) that records, for each opcode, what other information it carries (whether an immediate follows, whether it's branching or conditional, etc.).
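
    Guessing at what such a table might record per opcode (purely illustrative; I haven't decoded the actual DisAsm32.tbl format):

        /* Illustrative guess, not the actual DisAsm32.tbl layout. */
        struct op_info {
            unsigned char has_modrm;   /* a ModRM byte follows */
            unsigned char imm_bytes;   /* 0, 1, 2 or 4 immediate bytes */
            unsigned char is_branch;   /* call/jmp/jcc: relative target */
            unsigned char is_cond;     /* conditional branch */
        };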

  5. The Following User Says Thank You to cade For This Useful Post:

    Paul W. (23rd April 2014)

  6. #5 by Paul W. (Member)

    Thanks, Cade. I wish I knew what Durilca was doing with x86 code... but a little searching based on your tip turned up this old thread here:

    http://encode.ru/threads/1024-Data-s...ve-compression

    Shelwien, responding to Osman, says this:

    Well, I uploaded some older durilca version which is better for such tests.
    http://shelwien.googlepages.com/durilca2_002a.rar
    It's the last version which still had the hidden "-l" switch for segmentation
    results dumping. (Usage: durilca e -t1 -l ccc.tar)

    > I noticed, Shkarin's seg_file has broken the Quake3 bsp file into pieces
    > nearly perfectly, like hand-optimized

    Better look at what it does to executables, it's much more interesting.

    When I get my Windows system un-hosed, I'll definitely have to experiment with that Durilca option.

  7. #6 by cade (Member)

    For example, if you know that a type of instruction has certain immediates, you can just remove the immediates and write the instructions out. If you count the number of those types occurring, that tells you the number of immediates in the immediate section to restore. This is just a rough sketch of what I do, though.

    Example:

    Input:  jmp X ... jmp Y ... jmp Z ...

    Encode: write jmp ... jmp ... jmp ... plus a side stream [absolute(X) absolute(Y) absolute(Z)]
    Decode: seeing jmp ? ... jmp ? ... jmp ?, we know there are 3 targets, so read three absolute targets, convert each back to relative, and restore them at the skipped positions (?).
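
    A minimal C round trip of that idea, assuming for simplicity that every 0xE9 byte really is a jmp rel32, that the caller sizes all buffers generously, and that the host is little-endian like the x86 code being filtered. A real filter (BCJ2 included) has to parse properly, or verify targets, to avoid false positives on stray 0xE9 bytes.

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Encode: copy everything except jmp displacements to ops;
           targets gets each displacement converted to an absolute
           address, 4 bytes per jmp. Returns the length of ops. */
        size_t encode(const uint8_t *in, size_t n,
                      uint8_t *ops, uint8_t *targets) {
            size_t o = 0, t = 0, i = 0;
            while (i < n) {
                ops[o++] = in[i];
                if (in[i] == 0xE9 && i + 5 <= n) {   /* jmp rel32 */
                    int32_t rel;
                    memcpy(&rel, in + i + 1, 4);
                    uint32_t abs = (uint32_t)(i + 5) + (uint32_t)rel;
                    memcpy(targets + t, &abs, 4);    /* absolute target */
                    t += 4;
                    i += 5;                          /* skip displacement */
                } else {
                    i += 1;
                }
            }
            return o;   /* counting 0xE9 bytes in ops gives t/4 */
        }

        /* Decode: after each jmp opcode, pull the next absolute target
           from the side stream and convert it back to relative. */
        size_t decode(const uint8_t *ops, size_t o,
                      const uint8_t *targets, uint8_t *out) {
            size_t i = 0, t = 0, w = 0;
            while (i < o) {
                out[w++] = ops[i++];
                if (out[w - 1] == 0xE9) {
                    uint32_t abs;
                    memcpy(&abs, targets + t, 4);
                    t += 4;
                    int32_t rel = (int32_t)(abs - (uint32_t)(w + 4));
                    memcpy(out + w, &rel, 4);
                    w += 4;
                }
            }
            return w;
        }

    Converting to absolute targets is what makes this pay off: repeated jumps to the same address become identical 4-byte groups in the target stream, which the back end can then model.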

  8. #7 by Matt Mahoney (Expert)

    I believe the best x86 compressors parse the instruction stream and use the parse state (opcode, addressing mode, operand) as context.
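
    For instance, the parse state might be folded into the context hash along these lines (purely illustrative, not any particular compressor's code):

        /* Fold the x86 parser's state into the context used to predict
           the next byte. States, constants, and table size are made up. */
        enum parse_state { ST_OPCODE, ST_MODRM, ST_DISP, ST_IMM };

        unsigned context_hash(enum parse_state st, unsigned char opcode,
                              unsigned char prev) {
            /* The state selects which history matters: after an opcode,
               key on the opcode itself; inside an immediate, mostly don't. */
            unsigned h = (unsigned)st * 2654435761u;
            if (st != ST_IMM)
                h ^= opcode * 0x9E3779B1u;
            h ^= prev;
            return h & ((1u << 20) - 1);   /* index into 1M predictors */
        }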

  9. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Paul W. (24th April 2014)

  10. #8 (Member)

    Quote Originally Posted by Matt Mahoney:
    I believe the best x86 compressors parse the instruction stream and use the parse state (opcode, addressing mode, operand) as context.
    There seems to be a fundamental choice here: re-order by context, or switch between contexts on the fly? In general purpose compression, the BWT would be the canonical example of re-ordering by context. There are many more examples of compressors that switch contexts on the fly.

  11. #9 by Paul W. (Member)

    Matt:
    I believe the best x86 compressors parse the instruction stream and use the parse state (opcode, addressing mode, operand) as context.
    By "the best x86 compressors" do you mean models specifically for x86, or are you including things like ZPAQ models that effectively do that with sparse
    contexts (a.k.a. "skip contexts" in Bloom's terminology or "binary contexts" in Taylor's, AIUI)?

    I'm wondering how much of this gets done implicitly and more or less statistically in various compressors, given enough of a hint in the general patterns they're set up to recognize, and sufficient statistics from a big training set to learn which concrete instances to use and which to ignore.

    For example, one problem in compressing code is that normal bytewise Markovish contexts often work okay, except for predicting the least-significant byte of instruction-stream literals, which is the most variable part and has very non-Markov (numerical) regularities.

    Given a strong enough back-end compressor, it might help to preprocess the instruction stream to pad short literals out to long ones, essentially canonicalizing the literal format so that fewer sparse patterns are needed: the least-significant literal byte will always be at the same offset from the opcode, and the padding bytes will be compressed away again at little cost because they're fixed. (Essentially you're unpacking variable-length instructions into a more "horizontal" RISC-like fixed format, with a regular stride for the hard-to-predict bytes, then letting the compressor find a better way to compress that.)
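
    A sketch of that canonicalization in C, widening x86's sign-extended imm8 ALU form (opcode 0x83) to the equivalent imm32 form (0x81). It handles register operands only and ignores prefixes, and note that it's not invertible as written: a lossless filter also needs a rule, or a side bit, to decide which 0x81s to shrink back on decode.

        #include <stddef.h>
        #include <stdint.h>

        /* Widen "0x83 /r imm8" (sign-extended) to "0x81 /r imm32" so the
           literal's low byte always sits at the same offset from the
           opcode. Register forms (ModRM mod == 11) only; out must allow
           for 2x growth in the worst case. */
        size_t widen_imm8(const uint8_t *in, size_t n, uint8_t *out) {
            size_t i = 0, w = 0;
            while (i < n) {
                if (in[i] == 0x83 && i + 3 <= n
                        && (in[i + 1] & 0xC0) == 0xC0) {
                    out[w++] = 0x81;                  /* imm32 form */
                    out[w++] = in[i + 1];             /* ModRM unchanged */
                    int32_t imm = (int8_t)in[i + 2];  /* sign-extend */
                    out[w++] = (uint8_t)imm;
                    out[w++] = (uint8_t)(imm >> 8);
                    out[w++] = (uint8_t)(imm >> 16);
                    out[w++] = (uint8_t)(imm >> 24);
                    i += 3;
                } else {
                    out[w++] = in[i++];
                }
            }
            return w;
        }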

    One reason I wonder about that is that in compressing ARM code for the Newton, it was surprisingly effective to ignore which bitfields of the 32-bit instructions were the opcode and the source and destination registers, and just focus on splitting each instruction into two parts: the well-behaved bits and the annoyingly variable bits. We/they didn't have to know why the less-variable parts of one instruction tended to predict the less-variable parts of the next one, while also constraining the likely values of the more-variable part, just that they did.

    For example, we thought for a while it might pay off to explicitly separate the source and destination register bitfields, but it turned out that just chopping things up into bytes happened to work better, for a simple back-end compressor, because it tended to lump the high couple of bits of one of the register bitfields in with the opcode, and those high bits were less variable than the low bits of the same field. (I forget whether it actually compressed better or just didn't hurt, but it was certainly faster.)

    The very lowest few bits of instruction-stream literals often behave in oddly regular, symbol-like ways, too, despite being part of a field that behaves in mostly numeric ways that mess up Markov models. For example, if you see a literal that ends in 00 or 000, it's likely being used in an address calculation, and in combination with the opcode bits, that may tell you something: it may help you predict other bits in the same instruction, or the next opcode and its operands, or whatever. (You can think of it as a hint that further specifies the instruction, heuristically resolving an overloading of the underlying ADD opcode or whatever into a conceptual add-address-offset instruction that has different regularities.)
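
    A model could peel that hint off cheaply, something like this (illustrative; __builtin_ctz is the GCC/Clang intrinsic):

        /* A cheap "hint" context from an instruction-stream literal:
           its low bits treated as symbols, plus an "addressy" flag for
           values that are multiples of 4. Made up for illustration. */
        unsigned literal_hint(unsigned imm) {
            unsigned trailing = imm ? (unsigned)__builtin_ctz(imm) : 32;
            unsigned addressy = (trailing >= 2);  /* aligned-offset-like */
            return (addressy << 3) | (imm & 7);   /* low 3 bits as symbol */
        }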

    That makes me think it would be interesting to try (1) canonicalizing the instructions in a horizontal format, (2) using fairly brute-force machine learning to figure out which bits of that whole thing are most correlated and most predictive of which bits in the next few instructions, and then (3) backward-mapping that onto a function that implicitly reformats instructions with the right bits together and the right bits apart, so that you don't actually need to unpack and repack every instruction.

    But maybe that's more or less what high-end compressors with sparse context models, mixing, etc. are effectively doing already for machine code (in one non-obvious way or another) without explicitly separating things out and lumping them back together again... ? I think I get the general ideas, but I'm clueless how the rubber really meets the road.
    Last edited by Paul W.; 26th April 2014 at 16:14.

  12. #10 (Member)

    One of the most advanced x86 code compressors is kkrunchy, used in several Farbrausch intros (notably in the 64K category). There is a pretty lengthy post from the author on the different tricks he used, including split-stream encoding:

    http://fgiesen.wordpress.com/2011/01...n-in-kkrunchy/

    So how much does it help?

    All these techniques combined typically reduce final compressed size by about 20% compared to no preprocessing at all, with both the LZ-based and the context model coder (this is true not just with the kkrunchy backends, but also with other compressors like ZIP, RAR, LZMA or PAQ). Compare to the simpler “call offsets only” preprocessing, which typically reduces size by about 10%. That’s clearly the low-hanging fruit right there, but an extra 10% isn’t to be sneezed at, particularly since both the encoding and decoding are really quite easy and quick.

    BTW, a couple of Motorola (now Freescale) PowerPC MCUs could execute compressed code directly from flash or RAM; decompression was done on the fly, just before the instructions were delivered to the CPU decoder. There were two implementations, and I think one of them also used split-stream encoding.

