
Thread: Reverse Engineering Custom LZ Variant

  1. #1
    Member
    Join Date
    Jul 2019
    Location
    Somewhere
    Posts
    4
    Thanks
    2
    Thanked 0 Times in 0 Posts

    Reverse Engineering Custom LZ Variant

    Hello,

    I am not very experienced with compression formats, but I am trying to improve. I am currently looking at some files from a game. They seem to use some kind of LZ variant, but I haven't spotted anything obvious such as flag bytes or literal counts, which led me to believe it is a custom variant. I might be wrong, and it might be a format that is already in use elsewhere.

    I have attached 2 file samples. I included compressed text data, since it is easier to guess the decompressed result. Here is an example screenshot.

    [Image: sample1-ss.png]

    The first 4 bytes are probably the decompressed size, and the compressed data starts after that. The decompressed output should look something like:

    "EXPSKIN X:\DATAS\_MASTER\G????\CHARACTERS\DEMONIC\PLAYER\MESHPWP..."

    In the data you can see CHARAC(00 10) and PLAY(00 20). I suspect 0x10 and 0x20 are offsets into the output/history buffer, used to copy "TER"/"ER" from "MASTER". I don't know how the copy count is specified, though, and I don't see any flags for literals either.

    Any help and idea is greatly appreciated. Thank you.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Both samples start with 26 literals, followed by 4 bytes of mask (big-endian; bit31 to bit0; 0=literal, 1=match).
    Matches look like 16-bit values: "00 10" would mean "length=4, distance=16", so maybe something like DDDDllll dddddddd layout (minlen=4).
    Needs further investigation though.
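
    A minimal Python sketch of the mask reading described above: a 4-byte big-endian mask, scanned from bit 31 down to bit 0, with 0 meaning literal and 1 meaning match. The function name `mask_flags` and the example mask value are made up for illustration; they are not taken from the samples.

```python
def mask_flags(mask_bytes: bytes) -> list[bool]:
    """Return 32 flags, bit 31 first; True marks a match, False a literal."""
    if len(mask_bytes) != 4:
        raise ValueError("expected exactly 4 mask bytes")
    mask = int.from_bytes(mask_bytes, "big")
    return [bool((mask >> bit) & 1) for bit in range(31, -1, -1)]


# Hypothetical mask: six literals, one match, then 25 more literals.
flags = mask_flags(bytes.fromhex("02000000"))
print("".join("M" if f else "L" for f in flags))
```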

  3. The Following User Says Thank You to Shelwien For This Useful Post:

    rhysling (18th July 2019)

  4. #3
    Member
    Thank you for your reply. It is indeed exactly as you said; I had more or less figured out the 4-byte mask while the forums were down.

    The only remaining problem seems to be the match length. I can't work out how it is coded.

    Using the mask and the given offsets, I have now decompressed part of the data more accurately. It goes like this:

    EXPSKIN X:\DATAS\_MASTER\G_DATA\CHARACTERS\DEMONIC\PLAYER\MESHPWP\PLAYER_SKIN.TSKIN

    I think the minlen should be 3 in this case, because "00 10" copies "TER" and "00 20" copies "ER\". Still, I have noticed some inconsistencies in the first byte (the assumed match length). Both "DATA" and "SKIN" are coded as matches. Written as (length, distance), the first is "40 0F" and the second is "08 45". The distance looks right for both, but the length bytes differ even though both matches are supposed to copy 4 bytes, and I don't see why.

    That is the only mystery left for me. I couldn't make much sense of that first byte. Thanks for your help.
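
    One way to cross-check the distance bytes is to measure, in the expected text, how far back each copied string actually lies. The Python sketch below does that for the four byte pairs quoted above. It assumes the expected output is right, that the space before MESHPWP is only a forum line-wrap artifact, and that each match copies from the nearest earlier occurrence. `EXPECTED`, `source_distance` and `matches` are names made up here.

```python
EXPECTED = (r"EXPSKIN X:\DATAS\_MASTER\G_DATA\CHARACTERS\DEMONIC"
            r"\PLAYER\MESHPWP\PLAYER_SKIN.TSKIN")


def source_distance(text: str, pos: int, length: int) -> int:
    """Distance from pos back to the nearest earlier copy of text[pos:pos+length]."""
    chunk = text[pos:pos + length]
    # Limiting the search end forces the found start index to be < pos.
    src = text.rfind(chunk, 0, pos + length - 1)
    if src < 0:
        raise ValueError(f"{chunk!r} does not occur before position {pos}")
    return pos - src


# (position of the copied text, copy length, second byte of the pair)
matches = [
    (EXPECTED.index("CHARAC") + 6, 3, 0x10),  # "TER"
    (EXPECTED.index("PLAY") + 4, 3, 0x20),    # "ER\"
    (EXPECTED.index("G_") + 2, 4, 0x0F),      # "DATA"
    (EXPECTED.index("_SKIN") + 1, 4, 0x45),   # "SKIN"
]
for pos, length, stored in matches:
    dist = source_distance(EXPECTED, pos, length)
    print(EXPECTED[pos:pos + length], "distance", dist, "stored", stored)
```

    With these inputs every measured distance comes out one larger than the stored byte (17 vs 0x10, 33 vs 0x20, 16 vs 0x0F, 70 vs 0x45), which would fit a stored value of distance minus 1. That is only a reading of this one sample.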

    Quote Originally Posted by Shelwien View Post
    DDDDllll dddddddd layout (minlen=4).
    I didn't get this part. Can you elaborate it a bit?
    Last edited by rhysling; 18th July 2019 at 19:00.

  5. #4
    Administrator Shelwien's Avatar
    Here's my attempt at a decoder (attached).
    I think it at least parses the masks correctly, and 5/11 split for len/dist also seems to work.

    > I didn't get this part. Can you elaborate it a bit?

    There are two bytes per match. I tried to visualize the bitfield layout.
    "l" is a length bit, "d" is a distance bit, "D" is a high distance bit.
    But actual layout seems to be more like lllllDDD dddddddd
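
    A sketch of the `lllllDDD dddddddd` layout in Python: the two match bytes are read as one big-endian 16-bit word, the top 5 bits are the length field and the low 11 bits the distance field. Reading the pair as big-endian and the name `split_5_11` are assumptions for illustration; any minimum length or distance bias would be added on top of the raw fields.

```python
def split_5_11(b0: int, b1: int) -> tuple[int, int]:
    """Split two match bytes into (length_field, distance_field), 5 and 11 bits."""
    word = (b0 << 8) | b1
    return word >> 11, word & 0x7FF


# Byte pairs quoted earlier in the thread.
for pair in ("00 10", "00 20", "08 45", "40 0F"):
    b0, b1 = bytes.fromhex(pair)
    print(pair, split_5_11(b0, b1))
```

    For these pairs the raw fields are (0, 16), (0, 32), (1, 69) and (8, 15). Since "08 45" and "40 0F" were both reported to copy 4 bytes, the length field of 8 for "40 0F" suggests that a fixed 5/11 split is not the whole story.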

    I wonder if there's still a chance that it's actually not a distance, but an absolute position with some offset, like in Okumura LZSS.
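
    For comparison, Okumura's LZSS stores an absolute position in a 4096-byte ring buffer rather than a distance, and its decoder starts writing at position N - F = 4078. Under that scheme a stored position can be turned into a distance as in the Python sketch below. The function name is made up, and nothing in this thread confirms that the game format works this way.

```python
N = 4096  # ring buffer size in Okumura's LZSS.C
F = 18    # upper limit for match length there; writing starts at N - F


def ring_pos_to_distance(stored_pos: int, bytes_written: int) -> int:
    """Distance back from the current write position to stored_pos."""
    write_pos = (N - F + bytes_written) % N
    return (write_pos - stored_pos) % N


# After 5 output bytes, ring position 4078 holds the first output byte,
# so it lies 5 bytes back.
print(ring_pos_to_distance(4078, 5))
```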

  6. The Following User Says Thank You to Shelwien For This Useful Post:

    rhysling (18th July 2019)

  7. #5
    Member
    Thank you for the program.

    It seems the data I posted was missing a bit at the start. What I took for the 4-byte uncompressed size is actually part of the compressed data, and there is a 4-byte mask before it too. I have attached 2 proper examples: a fixed sample 1 and a new file.

    It looks like this :

    [Image: sample1-fixed-ss.png]
    The first 4 bytes are the uncompressed size. The next 4 bytes are the compressed size (including the size bytes). These are not part of the compressed data. After that, the compressed data starts with a 4-byte mask. The rest is the same.
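
    The layout described here could be read as in the Python sketch below. The byte order of the two size fields is not stated in the thread, so the sketch takes it as a required argument. `Header`, `read_header` and the field names are made up for illustration, and the example bytes are synthetic, not from the samples.

```python
from typing import NamedTuple


class Header(NamedTuple):
    uncompressed_size: int
    compressed_size: int   # per the post, this count includes the size bytes
    payload: bytes         # starts with the first 4-byte mask


def read_header(data: bytes, *, byteorder: str) -> Header:
    """Split a file into its two 4-byte size fields and the compressed payload."""
    if len(data) < 8:
        raise ValueError("need at least 8 header bytes")
    usize = int.from_bytes(data[0:4], byteorder)
    csize = int.from_bytes(data[4:8], byteorder)
    return Header(usize, csize, data[8:])


# Synthetic example; little-endian is only a guess for the size fields.
hdr = read_header(bytes.fromhex("06000000 11000000 10000000 414243 0002"),
                  byteorder="little")
print(hdr.uncompressed_size, hdr.compressed_size, hdr.payload.hex())
```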

    Quote Originally Posted by Shelwien View Post
    and 5/11 split for len/dist also seems to work.
    It worked quite well for sample 2, but sample 1 didn't come out well.

    For sample 1 it produces :

    EXPSKIN X:\DATAS\_MASTER\G_DATAS\_MAST\CHARACATAS\DEMONIC\PLAYTASMESHPWP\PLAYTA_\DAT.T\DAT

    I am pretty sure the output should be like this :

    EXPSKIN X:\DATAS\_MASTER\G_DATA\CHARACTERS\DEMONIC\PLAYER\MESHPWP\PLAYER_SKIN.TSKIN

    It would be great if you could take a look at the proper data now. Maybe you could notice something new. In the meantime I will try to figure out the match bytes myself. Thank you very much for your time.

    Edit :

    I am starting to believe those 5 bits are not a length at all. I have noticed cases where those bits are the same for 2 matches, but the copy lengths should be different. Could it be something like an end offset, pointing to where the copying should end? I have not seen many compression variants, so forgive me if I am talking nonsense. It is just what came to mind.
    Last edited by rhysling; 19th July 2019 at 00:04.

  8. #6
    Member
    Never mind what I wrote; your code is actually correct. The only problem was the len/dist split in some cases. It looks like it depends on the last bits of the mask: if the mask ends with "11" the split is 5/11, and if it ends with "00" the split is 2/14. I got the correct output for sample 1 using that.

    There might be other variations, but I think I can figure that out as I see more data. At least I have the correct core concept/code now.

    Thank you very much for your help, Shelwien. I hope I can bother you again if I have any more problems.

    Edit :

    Tested with multiple files and it works like a charm. The last 2 bits of the mask are indeed what determine the number of length bits: it is 2 (the minimum) plus the value of the last 2 bits of the mask. So the len/dist split depends on the last bits as follows:

    "00" = 2 / 14
    "01" = 3 / 13
    "10" = 4 / 12
    "11" = 5 / 11
    Last edited by rhysling; 19th July 2019 at 14:37.
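
    Putting the split table from this post together with the mask layout from earlier in the thread, a decoder could look roughly like the Python sketch below. This is not Shelwien's attached code. Several things are assumptions rather than facts from the thread: that only the top 30 mask bits are literal/match flags (the low 2 being the split selector), that match words are big-endian with the length field on top, that the copy length is the length field plus 3 (the minlen suggested in post #3), and that the copy distance is the distance field plus 1. `split_match`, `decode` and their parameters are names made up for illustration.

```python
def split_match(word: int, mask: int) -> tuple[int, int]:
    """Split a 16-bit match word using the low 2 bits of the current mask."""
    len_bits = 2 + (mask & 3)   # "00" -> 2, "01" -> 3, "10" -> 4, "11" -> 5
    dist_bits = 16 - len_bits
    return word >> dist_bits, word & ((1 << dist_bits) - 1)


def decode(payload: bytes, out_size: int, *, min_len: int = 3,
           dist_bias: int = 1, flag_count: int = 30) -> bytes:
    """Decode a mask-driven stream; min_len, dist_bias and flag_count are guesses."""
    out = bytearray()
    pos = 0
    while len(out) < out_size:
        mask = int.from_bytes(payload[pos:pos + 4], "big")
        pos += 4
        for bit in range(31, 31 - flag_count, -1):
            if len(out) >= out_size:
                break
            if (mask >> bit) & 1:    # 1 = match
                word = int.from_bytes(payload[pos:pos + 2], "big")
                pos += 2
                length_field, dist_field = split_match(word, mask)
                distance = dist_field + dist_bias
                for _ in range(length_field + min_len):
                    out.append(out[-distance])   # byte by byte, so overlaps work
            else:                    # 0 = literal
                out.append(payload[pos])
                pos += 1
    return bytes(out[:out_size])


# Byte pairs from the thread, each paired with a guessed mask ending.
print(split_match(0x400F, 0b00))   # "40 0F" under a mask ending in "00"
print(split_match(0x0845, 0b11))   # "08 45" under a mask ending in "11"

# Synthetic stream: three literals "ABC", then one match copying them again.
demo = bytes.fromhex("10000000") + b"ABC" + bytes.fromhex("0002")
print(decode(demo, 6))
```

    The two `split_match` calls show that "40 0F" under a mask ending in "00" and "08 45" under a mask ending in "11" both give a length field of 1, which would explain why both copy the same number of bytes. Which mask each pair actually belonged to is not listed in the thread, so that pairing is a guess. The `decode` call runs on made-up bytes, not on the samples.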
