27th August 2012, 17:14
Sportman's thread reminded me of something -
one rather common idea of random compression people is that it's possible to
split random data into substrings without repeated symbols and then compress these.
Obviously it doesn't actually work.
But I kinda liked the idea, and sometimes use it as a "data compression problem" -
given a file of length N, and knowing that each byte value occurs only once within each
aligned 256-byte block, what's the guaranteed compression ratio for this file?
I even made a demo coder for this - http://nishi.dreamhosters.com/u/task1.rar
but it's a purely mathematical problem.
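For reference, the information-theoretic bound is easy to compute: a full block carries log2(256!) bits, and a trailing partial block of r bytes carries log2(256*255*...*(256-r+1)) bits, since the first byte of a block has 256 possible values, the second 255, and so on. A minimal Python sketch of the bound (my own illustration, not the task1 coder):

```python
import math

def bound_bits(n: int) -> float:
    """Minimum number of bits for an n-byte file in which every byte
    value occurs exactly once within each aligned 256-byte block."""
    full, rest = divmod(n, 256)
    bits = full * math.log2(math.factorial(256))
    # partial last block: 256 choices for its 1st byte, 255 for the 2nd, ...
    for i in range(rest):
        bits += math.log2(256 - i)
    return bits

def guaranteed_ratio(n: int) -> float:
    """Guaranteed compressed size as a fraction of the original size."""
    return bound_bits(n) / (8 * n)
```

For a whole block this gives log2(256!)/8 ≈ 210.5 bytes instead of 256, i.e. a guaranteed ratio of about 0.822.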
27th August 2012, 22:17
it's 256!/2^256, and it has an obvious implementation via arithmetic encoding, so why should it be interesting?
28th August 2012, 03:26
1. It's supposed to be something like an exercise in a data compression textbook.
In fact I'd like to find more reasonably easy problems like this - practical, but still solvable.
2. It's not 256!/2^256 and not even 256!/256^256 - see http://wolframalpha.com
3. The "obvious implementation" is not so obvious if it has to work for files of arbitrary length.
Sure, an implementation with a bytewise rc and linear scans would be rather simple, but it would
be slow, and would likely stop working after 16M.
And an implementation with a bitwise rc may be a little more interesting.
4. The "random" files generated in this task can be used for compressor benchmarks - most
compressors don't see any redundancy at all, while paq8 still manages to find some.
28th August 2012, 11:43
Maybe I misunderstand something, but:
1. you can compress each 256-byte block independently
2. for the first byte in a block you have 256 variants, for the second 255 variants, and so on
so it's easy to count and easy to implement
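The counting above is exactly permutation ranking. As a hypothetical sketch (not anyone's actual coder from the thread), one full block can be mapped to an integer in [0, 256!) via the Lehmer code / factorial number system, which could then be written out in a fixed 211 bytes:

```python
from math import factorial

def rank(block: bytes) -> int:
    """Map a permutation of 0..255 to its index in [0, 256!)."""
    assert sorted(block) == list(range(256))
    remaining = list(range(256))
    r = 0
    for b in block:
        i = remaining.index(b)          # position among still-unused symbols
        r = r * len(remaining) + i      # mixed radix: 256, 255, ..., 1
        remaining.pop(i)
    return r

def unrank(r: int) -> bytes:
    """Inverse of rank(): rebuild the permutation from its index."""
    digits = []
    for base in range(1, 257):          # peel off radices 1, 2, ..., 256
        r, d = divmod(r, base)
        digits.append(d)
    remaining = list(range(256))
    return bytes(remaining.pop(d) for d in reversed(digits))
```

Writing rank() as a 211-byte big-endian integer (and reading it back) gives an optimal fixed-size code for full blocks; the arithmetic-coder versions discussed in the thread reach the same bound without bignum arithmetic.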
28th August 2012, 13:57
> so it's easy to count and easy to implement
Yes, and your answer is still wrong. Even if you eventually remember about the missing logarithm(s),
there's still more to it, because I didn't say that N%256==0.
Also, if it's too easy for you, maybe you can suggest something better?
28th August 2012, 18:18
Originally Posted by Shelwien
(log(256!)/log(2))/8 = 210.5 bytes, so you can compress this to 211 bytes per block.
The easy way to do this is to mod ARB255: you have a tree of 255 binary cells with 256 leaves.
You start with the path to each leaf having a weight of one, so the top cell is 128 128; each time you
go down a level you halve the weights, so the last 128 binary cells each weigh 1 1. As you
go through the tree you subtract the weights until done.
I am pretty sure I did this mod years ago; it's not too hard. The only trick that most might miss
is that each cell uses the given weights as its probabilities. That means when you get to a cell
that has a weight like 20 0 (nonzero on one path and zero on the other) you have no output to
the file, since it would always take the path with nonzero weight.
What I am talking about is a file made only of 256-byte blocks where each character is used exactly once.
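As a sanity check on the tree construction described above (my own verification sketch, not ARB255 itself): a node whose subtrees hold a and b leaves sees a+b passes, a of them going left, so coding its left/right choices with exact counts costs log2(C(a+b, a)) bits. These per-node binomials multiply out to 256!, so the whole 256-leaf tree costs exactly log2(256!) bits per block:

```python
from math import comb, log2, factorial

def tree_cost_bits(n: int) -> float:
    """Bits to code one block through a balanced binary tree with n leaves,
    where every node codes its left/right choices with exact counts."""
    if n == 1:
        return 0.0  # a leaf makes no choices
    left = n // 2
    # this node's choice sequence: 'left' repeated `left` times,
    # 'right' repeated n-left times, in some order
    node = log2(comb(n, left))
    return node + tree_cost_bits(left) + tree_cost_bits(n - left)
```

This also shows where the "free" bits come from: once one side's weight hits zero, the remaining choices at that cell carry no information, which is exactly the 20 0 case described above.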
28th August 2012, 18:40
Ok, at first I was talking about files made only of 256-byte blocks. However, I see that you might have the last block shorter than 256 characters. That actually is no big deal.

I was basically getting the last character for free, since all the binary cells would have one weight of zero in them. What you do to allow a short ending is call the update on those last cells as for every other character; you end up only updating the free ends. Overall this would add less than half a byte for the whole file.

Also, ARB255 uses a 64-bit high and low; for this problem that's overkill, a 32-bit high and low would still work.

This mod would also make the output bijective to the set of all files of the form we are compressing. Just remember that if the file being compressed is not of this form, it fails.
Let me make it clear: when you run the uncompressor, it will map every possible bit pattern in a file to a file of the form we are talking about. It will not compress files of any other form to anything.