While working on a larger project, a piece of it mixed with the old idea of the xor compressor and gave me an idea to implement.
I'm sharing it here in the hope that someone wants to implement it. I don't think it would compress better than any existing compressor, but it would open the field for future specific improvements. That's why it's not my priority for now.
Here is, more or less, the first MB of enwik8 stripped down to just the lowercase words: https://mega.nz/#!lt1HgIbA!Onm7b8NRHICnOUkBW5PZnV4kxF7H0j927N69FfB0hjo
Getting a dictionary out of it should be as easy as an explode on the " " character, indexing the words as array keys and taking that array of keys. The dictionary includes each and every word.
I think the dictionary should be sorted and binary-searched for better timing. I haven't worked with hashing; if you think it would be better, please explain why.
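In case it helps, a minimal Python sketch of both steps (Python's split in place of explode; the filename and function names are placeholders of mine):

import bisect

def build_dict(text):
    # "Explode" on whitespace, deduplicate, and keep the result sorted
    # so it can be binary-searched.
    return sorted(set(text.split()))

def in_dict(words, candidate):
    # Binary search on the sorted dictionary.
    i = bisect.bisect_left(words, candidate)
    return i < len(words) and words[i] == candidate

text = open("enwik8_lowercase_words.txt").read()  # placeholder filename
dictionary = build_dict(text)
print(len(dictionary), "unique words")
print(in_dict(dictionary, "the"))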
A future improvement would be to treat the unique occurrences differently. If this is the only thing stopping you from implementing it, give me a shout and I will do it.
The logic is to trim the file to half its size by xoring each pair of bytes. So the compressed file would weigh half the original plus the dictionary size.
This can be done recursively, up to the point where the dictionary entries are no longer worth it; a dictionary has to be created for each recursive xor pass. There is room for improvement here as well.
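To make the forward direction concrete, a rough Python sketch under my reading: each output byte is the xor of two consecutive input bytes, and a dictionary is built from the stream before each pass. How the "words" of the stream are delimited after the first fold is an open question; splitting on spaces is only a placeholder here.

def xor_fold(data):
    # Pad with one zero byte if the length is odd, then xor byte pairs,
    # halving the stream.
    if len(data) % 2:
        data += b"\x00"
    return bytes(data[i] ^ data[i + 1] for i in range(0, len(data), 2))

def compress(data, passes):
    # One dictionary per recursive pass, built from the stream that pass
    # is about to fold; the decoder needs it to pick between preimages.
    dicts = []
    for _ in range(passes):
        dicts.append(sorted(set(data.split(b" "))))  # placeholder delimiting
        data = xor_fold(data)
    return data, dicts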
Decompressor logic (almost pseudocode; a code sketch follows after these steps):
Looping up to the length of the longest word, it builds all the candidate inputs by reversing the xor process.
A binary example:
If a xored bit is 1, the input was 01 or 10, so if one of them is in the dict we output that word and move on.
If not, we extend the previous candidates, so the next iteration gives us 0101, 1001, 0110, 1010. We check those against the dict: on a match we output, otherwise we keep going.
At some point we may have output a word that was in the dictionary but was not the correct one, because it should have been another option. The stream would then probably get stuck.
Here we move back one step at a time, up to the last choice point, pick the next option instead of the first one, and check whether we can now reach the end.
It is unlikely but possible that some other, wrong solution also gets us to the end of the output; then we would have to resort to flags at such critical points of the decoding, like "at position x take choice y".
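A rough Python sketch of that backtracking search, at the bit level of the binary example above (bit strings for readability; it assumes word boundaries fall on xor-pair boundaries and leaves the collision flags out; note the candidate set doubles per consumed bit, so only the dictionary keeps it in check):

def unfold_candidates(bit):
    # x = a ^ b, so each xored bit has exactly two preimage bit pairs.
    return ("01", "10") if bit == "1" else ("00", "11")

def decode(xored, dictionary, max_word_bits):
    # Returns a list of decoded words, or None if this branch is stuck
    # (the caller then backs up and tries its next option).
    if not xored:
        return []
    for take in range(1, max_word_bits // 2 + 1):
        if take > len(xored):
            break
        # Build every preimage of the first `take` xored bits
        # (two options per bit, as in the 0101/1001/0110/1010 example).
        candidates = [""]
        for bit in xored[:take]:
            candidates = [c + pair for c in candidates
                          for pair in unfold_candidates(bit)]
        for word in candidates:
            if word in dictionary:  # a known word: commit and recurse
                rest = decode(xored[take:], dictionary, max_word_bits)
                if rest is not None:  # this branch reached the end
                    return [word] + rest
    return None  # dead end: back up one step

# Tiny demo: both "01"+"0110" and "01"*3 xor to "111", which is exactly
# the kind of ambiguity where the flags above would be needed.
print(decode("111", {"01", "0110"}, 4))  # ['01', '01', '01']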
For this lowercase trim of one MB of enwik, the words fill about 800 KB. Xored down to half, that is 400 KB, and the dictionary is about 100 KB (uncompressed). Total of ~500 KB.
This xored output file should itself be further compressible too.
Now I'm guessing that a "recursive" or looped compression of 2 passes should trim the original from 800 KB to 1/4, i.e. 200 KB. The dictionaries would weigh 100 KB + ~150 KB (somewhere in 100-200). Total of ~450 KB.
For three passes... each iteration is more of a guess, so if someone could calculate it beforehand, that would be welcome:
100 KB xored + 250 KB (the 2 previous dicts) + a ~200 KB dict (100-300). So it's not worth it unless the dictionaries could be merged or otherwise improved.
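For what it's worth, the same guesswork tallied per pass in a few lines of Python, using the rough figures above (all KB, estimates rather than measurements):

xored = 800
dicts = []
for passno, dict_kb in enumerate([100, 150, 200], start=1):
    xored //= 2  # each pass halves the stream
    dicts.append(dict_kb)
    print(f"pass {passno}: xored {xored} + dicts {sum(dicts)}"
          f" = {xored + sum(dicts)} KB")
# pass 1: xored 400 + dicts 100 = 500 KB
# pass 2: xored 200 + dicts 250 = 450 KB
# pass 3: xored 100 + dicts 450 = 550 KB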
Any brave coders gonna prove me right or wrong? Or willing to work with me?