How to compress strings with equal number of 0s and 1s
I have several binary strings
S1 = ....
S2 = ....
S3 = ....
each string with these properties:
1. they are of arbitrary length
2. within each string the amount of 0 bits is the same amount of 1 bits (for example: 000111)
3. there does not exist any partial string (= a substring starting in position 0 and with length L), with the same amount of 0s and 1s other than the whole string itself. For example, 001101 is not a valid string because there is the partial string "0011" with the same number of 0s and 1s. On the other hand, 110100 is valid because I only have the same number of 0s and 1s at the end of the string
4. other than the properties above, bits can be considered randomly distributed
Because of property 3. I can concatenate all strings into a single string S1S2S3S4... and I can always retrieve them again (each string is equal to "read up until you get an equal amount of 0s and 1s").
My question is... what's the best way to compress this string S1S2S3S4...? I've tried to run some of them into bz2, lzma, xz, but they all seem to expand the original string rather than compressing it. Any idea? Hints? HALPPP
If you want to do it optimally (assuming uniform probability distribution among all possibilities), you need to solve recurrence for the number of combinations fulfilling your condition 3:
C(k,l) - the number of bit sequences with k zeros and l ones fulfilling your condition 3 (no prefix contains the same number of 0 and 1)
C(k,l) = 1 if k = l = 0 // initialization
C(k,l) = 0 if k = l > 0 // prefix condition
C(k,l) = C(k-1,l) + C(k,l-1) otherwise
You just build Pascal's triangle, but put zeros in the center.
There might be analytic solution, but you can just solve it iteratively and e.g. put into a table.
Then you can easily find the number of bits you optimally need: lg(C(n-1,n) + C(n,n-1)) for length 2n sequence.
To achieve it, you can encode bit by bit:
- sometimes your condition 3 enforces some bit value - just use it,
- otherwise use accurate binary entropy coder: if you have already written k zeros and l ones, the probability of 1 is C(k,l+1) / C(k+1,l+1).
Oh, this is Catalan number problem ( https://en.wikipedia.org/wiki/Catalan_number ) - you have to get from one to the opposite vertex of a n x n square, but cannot touch the diagonal on the way (condition 3).
C(n-1,n) = C(n,n-1) = binomial(2n,n) / (n+1)
So thanks to data compression you asymptotically can save only ~lg(n) bits per such 2n bits comparing to just directly writing this sequence - data compression makes sense only for small n here.
The Following User Says Thank You to Jarek For This Useful Post:
Well, from stirling approximation I've got C(n,n/2) =~ 2^n/Sqrt[Pi/2*n]. So we save ~ log2(n) bits per string.
But then you're concatenating multiple such strings, so its necessary to encode same ~log2(n) bits to define the length of each string.
So nothing to save, unless there's some side information about probability distribution of n values or something.
Would those arguments be exactly the same for the opposite case, that is for (even-length) strings with not an equal amount of 0s and 1s? In other words instead of "read up until you get an equal amount of 0s and 1s", for strings like "read up until you get a unequal amount of 0s and 1s (but still of even length such as 2, 4, 6, 8, 10... bits)". Or is there any difference with this approach?
I'd like to share a sample of the string that I'm trying to compress (attachment). The extension ".txt" has no meaning, I only appended it because the website wouldn't allow me to upload otherwise. It's just a binary file made up of strings as defined in the original post. Each string is defined as "read as many bits as possible until you have the same amount of 0s and 1s" (start from position 0 until the end of file). Because each string has the same amount of bits, the whole file also has the same amount of bits: 8672 zeroes and 8672 ones (unless I messed up something with the output, but I think it's fine).
I was curious to know/see if anybody could give a stab at this and try to compress it, and I'd love to read your results and comments. I've tried to run this exact file through bz2, gz, lzma, xz, but they all expand the output. I can try to upload a larger file if necessary.