Results 1 to 10 of 10

Thread: Help identifying LZSS-style firmware compression

  1. #1
    Member
    Join Date
    Jun 2017
    Location
    United States
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts

    Help identifying LZSS-style firmware compression

    Hi everyone,

    I am trying to figure out the compression algorithm being used on firmware for HDMI over Ethernet extenders that use video chips by ITE Tech. It seems to be similar to LZSS in that it uses a control byte that describes 8 chunks at a time, with a 1 bit meaning a literal and a 0 bit meaning a reference to previous data, but the reference seems to be more complicated than typical LZSS implementations (variable length?). I was hoping that someone with more experience than me might be able to recognize something that I'm missing.

    I have attached a zip file containing two sample binary files (bootloader, main firmware) that both appear to be compressed with the same scheme. Unfortunately I don't have any samples of the decompressed data, because it's handled internally in the chip. By looking at the compressed data you can definitely deduce what several uncompressed strings are supposed to be though. Here is what I have figured out so far about the file format:


    • Everything is big-endian.
    • The 32-bit word at 0x10 is the total length of compressed data, starting at (and including) the "SMAZ" identifier.
    • At 0x70 there is a 32-bit word that I think is describing the length of some data immediately afterward. I'm unsure whether this data is relevant to the compression algorithm, or if it's something else to do with initializing the chip. It seems to be in units of 4-byte words.
    • After a few more values I don't understand, the magic identifier "SMAZ" appears. I immediately did some searching for this to see if anyone else had already figured it out, and this is definitely not the Smaz small strings compression algorithm.
    • After SMAZ, there are three 32-bit values. My best guess is that the first one is a decompression buffer size, the second one is the length of the uncompressed data, and the third one is the length of the compressed data. Then, the compressed data immediately follows. In the larger of my two sample binary files, several more groups of [uncompressed len, compressed len, data] repeat after the first set of data, indicating the decompression is done in chunks of 512 KB post-decompression bytes.


    When I analyze the compressed data, it seems LZSS-esque, at least as far as the bits in the control byte are concerned. There are many places where I can find larger chunks of plain text separated by 0xFF every 8 bytes. For example, in the smaller sample file starting at 0x21547, you can find several chunks of 8 plain text bytes prefixed with 0xFF.

    Another good example is in the larger sample file starting around 0xE024B. If I start there and go forward, processing 0xFF control bytes followed by 8 bytes of uncompressed data, I eventually reach a control byte of 0xFC at 0xE0281. I think the higher bits are counted first, because that is followed by 6 more uncompressed bytes. That's where everything begins to fall apart in my understanding. The uncompressed data seems to start again at 0xE028D, before another control byte has even been reached (the next control byte is at 0xE0290). Also, there should have been two back references, but together they take up 5 bytes, which seems strange.

    Yet another example in the larger file is just afterward at 0xE0337. This is clearly another control byte (0xF0), and clearly the next 4 bytes are uncompressed. But then there are only 3 bytes before the uncompressed bytes start right back up, and the next control byte is at 0xE0346. I don't understand why the uncompressed bytes begin again so quickly.

    I ran the QuickBMS comtype scanner on the first sample file, and none of the results seemed to decode properly, so I don't think it's anything QuickBMS knows about. Does anybody have any clues about what's going on here? Maybe I'm making a mistake comparing it to LZSS. The control bytes seem so similar at first glance, yet they are so different when you start analyzing the data. Is this even enough information to determine anything without having the original uncompressed data available?

    Thanks for any insight!
    Attached Files Attached Files

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I tried looking at it, and I don't think its LZSS or anything similar.
    First, it actually looks compressed (mostly).
    So I think its bitcoded - maybe the tables in the header are related.
    As to text strings, there could be a "literal run" code, which just appeared byte-aligned
    in some places / for some run lengths.
    I doubt that its possible to decrypt this without an actual decoder.
    Here's what I've got:

    00021502 dc.b $CA ; ¦
    00021503 dc.b 'Main'
    00021507 dc.b $F0 ; Ё
    00021508 dc.b 'Task'
    0002150C dc.b $12
    0002150D dc.b $BF ; ¬
    0002150E dc.b 'r:norw'
    00021514 dc.b $F
    00021515 dc.b 'ERR'
    00021518
    00021518 dc.b $FF
    00021519 dc.b ':E:/lenk'
    00021521
    00021521 dc.b $6E ; n ; 01101110
    00021522 dc.b 4
    00021523 dc.b 'g/c'
    00021526 dc.b $DF ; -
    00021527 dc.b $3C ; <
    00021528 dc.b 'e_cod'
    0002152D
    0002152D dc.b $FF
    0002152E dc.b 'e/ITE_Ex'
    00021536
    00021536 dc.b $B3 ; 10110011
    00021537 dc.b 't' ; 1
    00021538 dc.w $2816
    0002153A
    0002153A dc.b $7F ; 01111111
    0002153B dc.b 'r/FINAL'
    00021542


    cdm also seems to confirm that its a bitcode - samples transformed by bitwise BWT are compressed better than bytewise.
    Not sure about direction though - neither is certainly better, so likely bit7-0 because of visible letters?

  3. The Following User Says Thank You to Shelwien For This Useful Post:

    dougg3 (18th June 2017)

  4. #3
    Member
    Join Date
    Jun 2017
    Location
    United States
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Thank you for taking a look, Shelwien! I really appreciate your expertise. I think the groups of 8 uncompressed bytes delimited by 0xFF threw me off and led me to jump to conclusions too quickly. It definitely seems like when I find places where the code is byte-aligned, 0xFF means that there are 8 upcoming plaintext bytes, 0xFE means 7 upcoming, 0xFC (or 0xFD) means 6, and so on...maybe that is just the way the "literal run" code works in this compression scheme.

    I think the decoding is being done in hardware, so I guess this is a dead end unless I am ever able to obtain an SDK for this chipset.

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, its not impossible, but would be really tough.
    I'd try if it was plaintext, but how to decrypt it if we don't know what to expect?

    But maybe its possible to find more samples like that - https://blog.danman.eu/new-version-o...ender-lkv373a/ seems to have same signature?

    Also it should be possible to analyze the codes before text (as bitcodes).
    Maybe try to collect bitstring stats and decode their meaning.

    Though well, I tried to shift the data by 1..7 bits and didn't find any obvious text with strings utility -
    maybe literal runs are byte-aligned?

  6. #5
    Member
    Join Date
    Jun 2017
    Location
    United States
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    The thread you linked is actually where I began I've been posting comments about trying to figure out the firmware update format. There is a link to Google Drive on there with lots of different compressed firmwares. I've already figured out how to extract sections from the .PKG files. There are certain parts of the plaintext that I already know, due to console dumps posted on that blog. For example, the string you were examining in your previous post is supposed to start with:

    E:/lenkeng/case_code/ITE_Extender/FINAL_

    and probably also contains IPTV_TX_SDK/project/bootloader...but I'm guessing based on the compressed data for that part.

    But yeah, it's probably still tricky, especially since I have no idea what the beginning part of the plaintext is supposed to look like. I only have clues toward the end where the strings are stored, probably the .rodata section. I'm also reasonably certain about the content of the chunks around 0xE024B and 0xE0337 because it's following a pattern and there are lots of 0xFF plaintext markers. Maybe I could try to find a way to dump the device's RAM content at runtime, which would probably contain the decompressed plaintext somewhere.

  7. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    So its like this:
      E:/lenkeng/case_code/ITE_Extender/FINAL_
    ':E:/lenk'
    01101110 00000100 // 'en'
    'g/c'
    11011111 00111100 // 'as'
    'e_cod'
    11111111 // ''
    'e/ITE_Ex'
    10110011 // ''
    't'
    00101000 00010110 01111111 // 'ende'
    'r/FINAL'


    I still don't understand the logic in this, though.
    One byte to a encode a literal run of specific length... well, ok.
    Two bytes to encode two symbols (a match?) _and_ another literal run? Now that's hard.
    Also, luckily there's a digram 'en' indirectly encoded two times - but I don't see any common bitstrings there. Different matches?

  8. #7
    Member
    Join Date
    Jun 2017
    Location
    United States
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    I'm having a hard time understanding the logic too. Forgive me if this is nonsense, but I'm trying to brainstorm:

    In the last part of your example above, it seems like those 7 '1' bits are specifying that there are 7 literal bytes to follow, but that byte started with a '0'. Is it possible that the '0' was actually still being used as part of the bit code for the symbol encoding for "ende", and then once it was finished, since we weren't at a byte boundary yet, the rest of the 7 remaining bits were allowed to be used for literal encoding?

    And a similar thought: in the '10110011', maybe the first '1' encoded a single literal byte, and once a '0' was reached, the rest of the bits started going toward the symbol encoding? So was the symbol encoding for the "ende" chunk actually "0110011 00101000 00010110 0"? (Spaces added to denote byte boundaries in the compressed data)

    The big problem is that it doesn't seem to fit with the first half of your example above, so maybe it's not that. The 7 '1' bits followed by 7 literal bytes definitely seems like a clue though.

  9. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. It seems to be a bitcode, but I wasn't able to find unaligned literal runs
    (as I said, I tried shifting the data both as bit7-0 bitstream and as bit0-7 stream, and finding text with "strings" - no luck.)
    So these 1s before literals can be simply padding for byte alignment.

    2. Unfortunately the bootloader example is not any good - there're only 0xFF codes.
    But here I found something different:
    00010100 01011111 'DAT' 10111011 'A' 
    00001111 'RE' 11110111 'COVE' 01001000 10110111 01010100 00000110 01011111
    'AG' 01100000


    Compare:

    10111011 'A'
    10110011 't'


    I wonder if it uses some weird bit ordering, like bit3-0,bit7-4 - that'd explain why i didn't find unaligned text.

  10. #9
    Member
    Join Date
    Jun 2017
    Location
    United States
    Posts
    5
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Hmm, very interesting!

    Here's another interesting sample compressed file from a different chip on the same board that appears to use the same compression scheme. The interesting thing about this file is that the first compressed section has an uncompressed length of 0x80000, but a compressed length of 0x4CF. So there is probably a lot of repetition going on. Unfortunately I don't know what data it represents, but it's definitely very compressible data. Based on the structure of the .bin update files which similarly are very empty for approximately the first 0x80000 bytes, I have a feeling that it is zeros, but I can't confirm that for sure. The compressed data is very noticeable in a hex editor because there are a lot of 0xA and 0x5 nibbles showing up which are the same thing shifted by 1 bit.

    Starting at D04, we see:

    55 42 4A AA A1 25 55 50 92 AA A8 49 55 54 24 AA AA 12 55 55 09 2A AA 84 95 ....

    And then it repeats the 25-byte sequence over and over and over again, eventually reaching the end of the compressed section (which seems to be ended with a 0xFF byte). Then the next compressed section gets started with the repetition almost immediately again. I'm not sure if this actually tells us anything, but it's really the same 25-bit sequence being repeated over and over again:

    1010101010100001001001010
    Attached Files Attached Files

  11. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Yes, this confirms bitcode and bit7-0 order.
    Also, that series of matches ends with
    1010000100100101010101010
    1010000100100101010101010
    1010000100100101010101000

    So that's likely match length at the end; its different in last one because there's a fixed number of zeroes.
    Also, deflate match looks like this:
              uint lc = dec.len-3;
    code = _length_code[lc];
    putbits( lencode.code[code+257], lencode.lens[code+257] );

    extra = extra_lbits[code];
    if( extra!=0 ) putbits( lc-base_length[code], extra );

    uint dist = dec.dist-1;
    code = (dist<256) ? _dist_code[dist] : _dist_code[256+(dist>>7)];
    putbits( distcode.code[code], distcode.lens[code] );

    extra = extra_dbits[code];
    if( extra!=0 ) putbits( dist-base_dist[code], extra );

    And I think here we have something similar, just distance first, then length.
    I mean that only "010" and "000" at the end are uncompressed bits, while
    "101010101" is likely the base length code:

    10100001001 // distance code
    00 // extra bits of distance
    101010101 // length code
    010 // extra bits of length

  12. The Following User Says Thank You to Shelwien For This Useful Post:

    dougg3 (21st June 2017)

Similar Threads

  1. Identifying unknown compression
    By McGee in forum Data Compression
    Replies: 1
    Last Post: 28th October 2015, 22:18
  2. Identifying compression method
    By ngc.kor in forum Data Compression
    Replies: 5
    Last Post: 15th November 2014, 06:03
  3. Compressed Archive: Identifying compression method and Decompressing?
    By UnknownToaster in forum Data Compression
    Replies: 34
    Last Post: 24th September 2014, 21:42
  4. Is the BWT an FFT-style transform?
    By nburns in forum Data Compression
    Replies: 25
    Last Post: 22nd October 2013, 02:13
  5. Problems identifying file compression
    By Mexxi in forum Data Compression
    Replies: 11
    Last Post: 2nd February 2012, 00:03

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •