# Thread: Cracking an old MS-DOS game's compressed file.

1. Originally Posted by mpais
Done, see attachment. Also, the endianness is little-endian, not big.
Okay. I will have a look soon. Little endian? The rest of the program uses big endians. (EDIT: correction it does not) While this is the DOS version it was originally for the Apple IIgs. I will have to check. Thanks for pointing this out.

EDIT:
Indeed, little endian. And the decompressor works wonderfully.

2. Originally Posted by mpais
Done, see attachment. Also, the endianness is little-endian, not big.
@mpais:
Hi, I need a little help with the code you posted:

Code:
```int dump( int d ) {
char b[4096]={0};
int n=4095;
while (d>256 && n>=0){
b[n]=lzw[d][1];
n--;
d=lzw[d][0];
}
b[n]=d;
while (n<4096){
putc(b[n], g);
n++;
}
return d;
}```
-What type of values can the "d" parameter have when it is sent to the "dump" procedure?
-And when it is returned? What is the "b" array?
-What do 4095 and 4096 refer to?

Thanks.

3. > What type of values can the "d" parameter have when it is sent to the "dump" procedure?

Same as explained before, d is LZW symbol.
d=0..255 is a normal data byte
d=258..4095 is an extended LZW symbol, which is constructed by attaching a byte to a previous symbol

> And when it is returned?

First symbol of a byte string corresponding to LZW symbol d is returned

> What is the "b" array?

That array is used to unpack symbol d to byte string (it has to be done backwards because of symbol creation algorithm).

> What do 4095 and 4096 refer to?

4096=1<<12 is the limit of LZW alphabet in this algorithm, because it doesn't use more than 12 bits for codes.
You can only store numbers 0..4095 in 12 bits.
The algorithm also constructs a new symbol by attaching a single byte to already existing symbol,
so code 0x102 unpacks to 2 bytes, 0x103 to 3 bytes, 0xFFF to 3839 bytes,
but mpais reserved full 4096 bytes for unpacked symbol, so in dump(), 4096 is the size of b array
and 4095 is the last index in it.

4. ## The Following User Says Thank You to Shelwien For This Useful Post:

Peter Swinkels (22nd July 2018)

5. > The algorithm also constructs a new symbol by attaching a single byte to already existing symbol,
so code 0x102 unpacks to 2 bytes, 0x103 to 3 bytes, 0xFFF to 3839 bytes,

You're talking about the most extreme case with a single byte run in entire file? In LZW you don't have to attach symbol to longest existing dictionary entry, so maximum entry length doesn't have to grow linearly.

6. I was talking about buffer size for unpacking a symbol.

7. Hi,

During the last month I wrote a vb.net translation for the C decompression code kindly provided by other forum users, and have tried to write a compression algorithm to go with it. With partial success, small chunks of data compress and decompress fine, anything else fails. The compressor can be downloaded here: https://drive.google.com/open?id=11v...xBdry2KAuzw3lf. It's part of a small vb.net project that contains a few other things as well. All relevant compression code is contained in a single module. A ready to use/test executable (LZW.EXE) can be found in the ".\bin" folder.

Further details concerning the decompressor and what it's for can be found in this thread's previous posts.

So my question is:
Can any one help me get the compression to work together with decompression code?

I didn't provide much details, so if anyone reading this has questions: just ask and I'll do my best to answer.

Thanks.

8. Hi

Sorry, but I don't have time atm to write a compressor for you, and your VB code doesn't help, because I don't use VB.

But making a compressor in this case should be reasonably simple - just match following string with LZW symbols, keep longest match.
It should be also possible to reproduce the same compression algorithm that was originally used - it should help with debug too,
because if your output is different, you can check what you do at this step.

As to your problem with "small chunks of data", I think the likely issue is bitsize mismatch.
_After_ adding a new symbol, you should check next symbol index, and if its >=(1<<bitsize), increase bitsize.
Of course, there'd be a decoding error soon enough, if you don't follow decoding logic and switch bitsize differently.
1. after encoding a symbol, a new symbol should be always created
2. when symbol index reaches (1<<bitsize), increase bitsize, but only if symbol index is <4096.

9. ## The Following User Says Thank You to Shelwien For This Useful Post:

Peter Swinkels (24th August 2018)

10. I don't use VB, so I didn't check your code, but here's a C++ version with compression support. It creates a 10 byte dummy header, like on IEA files.
Even with working compression code, how do you know the software will accept your modified files? For all we know, the header contains some sort of checksum that you don't know how to calculate.

11. ## The Following 2 Users Say Thank You to mpais For This Useful Post:

Peter Swinkels (28th August 2018),Shelwien (23rd August 2018)

12. Originally Posted by Shelwien
Hi

Sorry, but I don't have time atm to write a compressor for you, and your VB code doesn't help, because I don't use VB.

But making a compressor in this case should be reasonably simple - just match following string with LZW symbols, keep longest match.
It should be also possible to reproduce the same compression algorithm that was originally used - it should help with debug too,
because if your output is different, you can check what you do at this step.

As to your problem with "small chunks of data", I think the likely issue is bitsize mismatch.
_After_ adding a new symbol, you should check next symbol index, and if its >=(1<<bitsize), increase bitsize.
Of course, there'd be a decoding error soon enough, if you don't follow decoding logic and switch bitsize differently.
1. after encoding a symbol, a new symbol should be always created
2. when symbol index reaches (1<<bitsize), increase bitsize, but only if symbol index is <4096.
Hi, I can understand that and I was just about to start converting my project to C when I read mpais' reply. Yeah, making a compressor shouldn't be that difficult and your suggestions make sense. Well, I'm going to have a look at mpais' compressor. Thanks anyway.

13. Originally Posted by mpais
I don't use VB, so I didn't check your code, but here's a C++ version with compression support. It creates a 10 byte dummy header, like on IEA files.
Even with working compression code, how do you know the software will accept your modified files? For all we know, the header contains some sort of checksum that you don't know how to calculate.
Hi mpais, that's quite alright. My C++ isn't great but I should be able to figure out how your code works. As for the 10 byte header: I have absolutely no idea as to why it is even there. I purposefully mangled it and ran the installer just to see what would happen. As near as I can tell it does nothing at all, as long as there are 10 bytes with no purpose things are fine. Otherwise not. Probably has to to with data alignment in memory or perhaps it's there to confuse would be software crackers. It could possibly be a timestamp that is otherwise ignored. I dunno. Anyway thanks for the compressor! I'm going to have look right now.

14. Originally Posted by mpais
I don't use VB, so I didn't check your code, but here's a C++ version with compression support. It creates a 10 byte dummy header, like on IEA files.
Even with working compression code, how do you know the software will accept your modified files? For all we know, the header contains some sort of checksum that you don't know how to calculate.
First of, thanks for the compressor. A few questions:

1. What does the "table" array do?
2. Why is it set to a size of 9221 (hashsize), what is the reason for this particular value?

15. I'm using a hash table to speed up the dictionary searches. The strange constant (9221) is a prime number that is over 2x as large as the maximum number of dictionary entries (4096), it is used to ensure the step size in probing is always coprime to the table size (see here) and that the load factor stays below 50%.

I guess you won't need that in VB.NET, but it was a simple enough little optimization. The hash function is probably no good (made it up on the spot), but I didn't want to confuse you with any more weird constants, and it still seems fast enough (about 28MB/s compressing enwik8 here). The rest of the code should be pretty self-explanatory, LZW is a very simple algorithm.

16. ## The Following User Says Thank You to mpais For This Useful Post:

Peter Swinkels (27th August 2018)

17. Hi MPais,

Thank you for the explanation. I have managed to extract the compression code from the code you provided and reduced it to its bare minimum (e.g., removed error handling, something that I will reimplement eventually. Also the ten byte header stuff has been removed as it is not relevant to the compression/decompression stuff.)

This is what I have so far:
Code:
```#include "stdafx.h"
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define EOF_CODE     0x101
#define HASH_SIZE   0x2405
#define RESET_CODE   0x100
#define SYMBOL_BASE  0x102

struct LZWEntry {
int16_t Prefix;
int16_t Suffix;
};

uint8_t Buffer[0x1000];
int32_t DictionaryIndex = 0x0;
LZWEntry DictionaryO[0x1000];
int16_t Table[HASH_SIZE];

void AddEntry(const int32_t Prefix, const int32_t Suffix, const int32_t Offset = -1) {
if (Prefix == -1 || Prefix >= DictionaryIndex || DictionaryIndex >= 0x1000 || Offset >= 0x0) return;
DictionaryO[DictionaryIndex].Prefix = Prefix;
DictionaryO[DictionaryIndex].Suffix = Suffix;
Table[-Offset - 0x1] = DictionaryIndex;
if (DictionaryIndex < 0x1000) DictionaryIndex += 0x1;
}
int32_t FindEntry(const int32_t Prefix, const int32_t Suffix) {
int32_t i = (Prefix & 0x3FF) + (Suffix << 0x4) + ((Suffix & 0xF) << 0x8);
int32_t Offset = (i > 0x0) ? HASH_SIZE - i : 0x1;
while (true) {
if (Table[i] < 0x0) { //FREE_SLOT?
return -i - 0x1;
} else if (DictionaryO[Table[i]].Prefix == Prefix && DictionaryO[Table[i]].Suffix == Suffix) {//DESIRED_ENTRY?
return Table[i];
}
i -= Offset;
if (i < 0x0) i += HASH_SIZE;

//return ...; // DEFAULT_RETURN_VALUE?
}
}
void Reset() {
memset(&DictionaryO, -1, sizeof(DictionaryO));
memset(&Table, -1, sizeof(Table));
for (int32_t i = 0x0; i < RESET_CODE; i++) {
Table[-FindEntry(-1, i) - 1] = (int16_t)i;
DictionaryO[i].Suffix = i;
}
DictionaryIndex = SYMBOL_BASE;
}

void WriteCode(FILE *Output, int32_t *Buffer, int32_t *BitsUsed, int32_t BitsPerCode, int32_t Code) {
(*Buffer) = (*Buffer) | (Code << (*BitsUsed));
(*BitsUsed) += BitsPerCode;
while ((*BitsUsed) > 0x7) {
putc((*Buffer) & 0xFF, Output);
(*Buffer) = (*Buffer) >> 0x8;
(*BitsUsed) -= 0x8;
}
}

int main(int argc, char** argv) {
int32_t BitsPerCode = 0x9;
int32_t BitsUsed = 0x0;
int32_t Buffer = 0x0;
int32_t Code = 0x0;
int32_t FoundIndex = -1;
FILE *Input = NULL;
FILE *Output = NULL;
int32_t Parent = -1;

printf("Compressing to file: %s\n", argv[2]);

Reset();

fopen_s(&Input, argv[1], "rb");
fopen_s(&Output, argv[2], "wb");
fseek(Input, 0, SEEK_SET);

WriteCode(Output, &Buffer, &BitsUsed, BitsPerCode, RESET_CODE);
while (true) {
Code = getc(Input);
if (Code < 0x0) break;

FoundIndex = FindEntry(Parent, Code);
if (FoundIndex < 0x0) { // NO_ENTRY
WriteCode(Output, &Buffer, &BitsUsed, BitsPerCode, Parent);
if (DictionaryIndex >= 0x1000) {
WriteCode(Output, &Buffer, &BitsUsed, BitsPerCode, RESET_CODE);
Reset();
BitsPerCode = 0x9;
} else {
if (DictionaryIndex > (0x1 << BitsPerCode)) BitsPerCode += 0x1;
}
Parent = Code;
} else {
Parent = FoundIndex;
}
}
if (Parent >= 0x0) WriteCode(Output, &Buffer, &BitsUsed, BitsPerCode, Parent);
WriteCode(Output, &Buffer, &BitsUsed, BitsPerCode, EOF_CODE);
if (BitsUsed > 0x0) putc(Buffer & 0xFF, Output); // FLUSH_BUFFER
fclose(Input);
fclose(Output);

return 0x0;
}```
I also made some minor adjustments that should make converting this to vb.net easier. As far I can tell the above code produces valid compressed data which I can decompress without issues. If I understand you correctly the the Table related code can be removed, right?

EDIT:
I see the "ReturnValue" function can exit without explicitly returning a value. What should this value be?

18. Originally Posted by Peter Swinkels
I also made some minor adjustments that should make converting this to vb.net easier. As far I can tell the above code produces valid compressed data which I can decompress without issues. If I understand you correctly the the Table related code can be removed, right?
I'm not familiar with VB.NET, but supposing you have some sort of generic dictionary class designed for fast retrieval, then you can probably just use that. Internally it will probably handle the hashing itself.

Originally Posted by Peter Swinkels
I see the "ReturnValue" function can exit without explicitly returning a value. What should this value be?
Do you mean the FindEntry() function? That function will always return a value, the loop either ends when the entry we requested is found, or when a free position in the dictionary is found for that entry.

19. Originally Posted by mpais
I'm not familiar with VB.NET, but supposing you have some sort of generic dictionary class designed for fast retrieval, then you can probably just use that. Internally it will probably handle the hashing itself.

Do you mean the FindEntry() function? That function will always return a value, the loop either ends when the entry we requested is found, or when a free position in the dictionary is found for that entry.
My mistake, I checked: it does always return a value.

Yes, Vb.net uses the .Net framework which provides a generic dictionary. Here is an example in C# which also uses the .Net framework but much more similar to C++ in syntax:

Code:
```using System;
using System.Collections.Generic;

namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
Dictionary<string, string> Example = new Dictionary<string, string>();

Console.WriteLine(Example["2"]); //Output "two".
}
}
}```

20. Hi,

I got the compressor and decompressor working: https://drive.google.com/open?id=1-k...JU0nxuR3GGd4Uk.

Thanks everyone.

21. Originally Posted by mpais
Done, see attachment. Also, the endianness is little-endian, not big.
mpais, do you still have the C++ code for decompressing *.pea files?

22. The version I posted above (attachment lzw.zip) does decompression also, both for PEA and IEA files.

23. ## The Following User Says Thank You to mpais For This Useful Post:

Peter Swinkels (30th August 2018)

Page 2 of 2 First 12