I am trying to figure out the compression algorithm being used on firmware for HDMI over Ethernet extenders that use video chips by ITE Tech. It seems to be similar to LZSS in that it uses a control byte that describes 8 chunks at a time, with a 1 bit meaning a literal and a 0 bit meaning a reference to previous data, but the reference seems to be more complicated than typical LZSS implementations (variable length?). I was hoping that someone with more experience than me might be able to recognize something that I'm missing.
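For reference, this is the textbook LZSS loop I'm comparing against. It's just a sketch: the 12-bit offset / 4-bit length packing below is the classic layout, not something I've confirmed in this firmware.

```python
def lzss_decompress(data: bytes) -> bytes:
    # Classic LZSS: a control byte covers the next 8 chunks, MSB first;
    # 1 = one literal byte, 0 = a 2-byte (offset, length) back-reference.
    # The 12/4-bit split below is the textbook packing, NOT necessarily
    # what this firmware uses.
    out = bytearray()
    i = 0
    while i < len(data):
        control = data[i]; i += 1
        for bit in range(7, -1, -1):
            if i >= len(data):
                break
            if (control >> bit) & 1:           # literal byte
                out.append(data[i]); i += 1
            else:                              # back-reference (2 bytes here)
                if i + 1 >= len(data):
                    break
                word = (data[i] << 8) | data[i + 1]; i += 2
                offset = word >> 4             # distance back into the output
                length = (word & 0xF) + 3      # minimum match length of 3
                for _ in range(length):
                    out.append(out[-offset])
    return bytes(out)
```

With this layout, a control byte of 0xE0 means three literals followed by a reference, which is the kind of pattern I expected to see in the samples.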
I have attached a zip file containing two sample binary files (bootloader, main firmware) that both appear to be compressed with the same scheme. Unfortunately I don't have any samples of the decompressed data, because decompression is handled internally by the chip, but by inspecting the compressed data you can clearly make out what several of the uncompressed strings are supposed to be. Here is what I have figured out so far about the file format:
- Everything is big-endian.
- The 32-bit word at 0x10 is the total length of compressed data, starting at (and including) the "SMAZ" identifier.
- At 0x70 there is a 32-bit word that I think is describing the length of some data immediately afterward. I'm unsure whether this data is relevant to the compression algorithm, or if it's something else to do with initializing the chip. It seems to be in units of 4-byte words.
- After a few more values I don't understand, the magic identifier "SMAZ" appears. I immediately did some searching for this to see if anyone else had already figured it out, and this is definitely not the Smaz small strings compression algorithm.
- After SMAZ, there are three 32-bit values. My best guess is that the first one is a decompression buffer size, the second one is the length of the uncompressed data, and the third one is the length of the compressed data. Then, the compressed data immediately follows. In the larger of my two sample binary files, several more groups of [uncompressed len, compressed len, data] repeat after the first group, indicating that decompression is done in chunks that each expand to 512 KB.
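To make the layout above concrete, here is a quick parser for the container as I currently understand it. Every field meaning is a guess from staring at the hex (the buffer-size interpretation especially), not documented fact:

```python
import struct

def parse_smaz_container(blob: bytes):
    # All values are big-endian. Field meanings are my guesses, not fact.
    total_len = struct.unpack_from(">I", blob, 0x10)[0]   # length incl. "SMAZ"
    smaz_off = blob.find(b"SMAZ")
    assert smaz_off != -1, "magic not found"
    pos = smaz_off + 4
    bufsize, = struct.unpack_from(">I", blob, pos)        # decompression buffer size?
    pos += 4
    chunks = []
    # Repeating [uncompressed len, compressed len, data] groups follow;
    # in my larger sample each group decompresses to 512 KB.
    while pos + 8 <= smaz_off + total_len:
        ulen, clen = struct.unpack_from(">II", blob, pos)
        pos += 8
        chunks.append((ulen, clen, blob[pos:pos + clen]))
        pos += clen
    return bufsize, chunks
```

This doesn't touch the compressed payload itself, but it at least lets me split the chunks out for separate analysis.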
When I analyze the compressed data, it seems LZSS-esque, at least as far as the bits in the control byte are concerned. There are many places where I can find larger chunks of plain text separated by 0xFF every 8 bytes. For example, in the smaller sample file starting at 0x21547, you can find several chunks of 8 plain text bytes prefixed with 0xFF.
Another good example is in the larger sample file starting around 0xE024B. If I start there and go forward, processing 0xFF control bytes followed by 8 bytes of uncompressed data, I eventually reach a control byte of 0xFC at 0xE0281. I think the control bits are consumed high bit first (MSB first), because that byte is followed by exactly 6 more uncompressed bytes. That's where my understanding begins to fall apart. The uncompressed data seems to start again at 0xE028D, before another control byte has even been reached (the next control byte is at 0xE0290). Also, the two 0 bits in 0xFC should correspond to two back-references, but together they only span 5 bytes, which seems odd if the references are a fixed size.
Yet another example in the larger file is just afterward at 0xE0337. This is clearly another control byte (0xF0), and clearly the next 4 bytes are uncompressed. But the four 0 bits should then mean four back-references, yet only 3 bytes appear before the uncompressed bytes resume, and the next control byte is at 0xE0346. I don't understand how four references could fit in 3 bytes.
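To poke at the two examples above, I've been using a little tracing helper like the one below. It assumes MSB-first control bits and a fixed-size back-reference, and just prints what falls where; when the literals re-appear "too early" (as at 0xE028D and 0xE0337 in my samples), re-running with a different `ref_size` shows whether any fixed width lines up, which is what makes me suspect the references are variable-length:

```python
def trace_control_bytes(data: bytes, start: int, ref_size: int = 2, count: int = 4):
    # Analysis helper, not a decoder. Assumes MSB-first control bits,
    # 1 = one literal byte, 0 = a back-reference of `ref_size` bytes.
    # Returns the offset reached, so misalignment is easy to spot.
    pos = start
    for _ in range(count):
        control = data[pos]
        print(f"{pos:#08x}: control {control:#04x}")
        pos += 1
        for bit in range(7, -1, -1):
            if (control >> bit) & 1:
                ch = chr(data[pos]) if 32 <= data[pos] < 127 else '.'
                print(f"  literal  {data[pos]:#04x} ({ch})")
                pos += 1
            else:
                print(f"  backref  {data[pos:pos + ref_size].hex()}")
                pos += ref_size
    return pos
```

For example, `trace_control_bytes(fw, 0xE024B, ref_size=2)` on the larger sample walks forward from the region described above; with `ref_size=2` it drifts out of alignment right after the 0xFC control byte, which is exactly the problem.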
I ran the QuickBMS comtype scanner on the first sample file, and none of the results seemed to decode properly, so I don't think it's anything QuickBMS knows about. Does anybody have any clues about what's going on here? Maybe I'm making a mistake comparing it to LZSS. The control bytes seem so similar at first glance, yet they are so different when you start analyzing the data. Is this even enough information to determine anything without having the original uncompressed data available?
Thanks for any insight!