Results 1 to 7 of 7

Thread: How to compress strings with equal number of 0s and 1s

  1. #1
    Member
    Join Date
    Jan 2017
    Location
    Spain
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    How to compress strings with equal number of 0s and 1s

    I have several binary strings

    S1 = ....
    S2 = ....
    S3 = ....
    etc...

    each string with these properties:

    1. they are of arbitrary length
    2. within each string the amount of 0 bits is the same amount of 1 bits (for example: 000111)
    3. there does not exist any partial string (= a substring starting in position 0 and with length L), with the same amount of 0s and 1s other than the whole string itself. For example, 001101 is not a valid string because there is the partial string "0011" with the same number of 0s and 1s. On the other hand, 110100 is valid because I only have the same number of 0s and 1s at the end of the string
    4. other than the properties above, bits can be considered randomly distributed

    Because of property 3. I can concatenate all strings into a single string S1S2S3S4... and I can always retrieve them again (each string is equal to "read up until you get an equal amount of 0s and 1s").
    My question is... what's the best way to compress this string S1S2S3S4...? I've tried to run some of them into bz2, lzma, xz, but they all seem to expand the original string rather than compressing it. Any idea? Hints? HALPPP

  2. #2
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    If you want to do it optimally (assuming uniform probability distribution among all possibilities), you need to solve recurrence for the number of combinations fulfilling your condition 3:
    C(k,l) - the number of bit sequences with k zeros and l ones fulfilling your condition 3 (no prefix contains the same number of 0 and 1)

    C(k,l) = 1 if k = l = 0 // initialization
    C(k,l) = 0 if k = l > 0 // prefix condition
    C(k,l) = C(k-1,l) + C(k,l-1) otherwise
    You just build Pascal's triangle, but put zeros in the center.

    There might be analytic solution, but you can just solve it iteratively and e.g. put into a table.
    Then you can easily find the number of bits you optimally need: lg(C(n-1,n) + C(n,n-1)) for length 2n sequence.

    To achieve it, you can encode bit by bit:
    - sometimes your condition 3 enforces some bit value - just use it,
    - otherwise use accurate binary entropy coder: if you have already written k zeros and l ones, the probability of 1 is C(k,l+1) / C(k+1,l+1).

  3. #3
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Oh, this is Catalan number problem ( https://en.wikipedia.org/wiki/Catalan_number ) - you have to get from one to the opposite vertex of a n x n square, but cannot touch the diagonal on the way (condition 3).
    C(n-1,n) = C(n,n-1) = binomial(2n,n) / (n+1)
    So thanks to data compression you asymptotically can save only ~lg(n) bits per such 2n bits comparing to just directly writing this sequence - data compression makes sense only for small n here.

  4. The Following User Says Thank You to Jarek For This Useful Post:

    nburns (2nd February 2017)

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, from stirling approximation I've got C(n,n/2) =~ 2^n/Sqrt[Pi/2*n]. So we save ~ log2(n) bits per string.
    But then you're concatenating multiple such strings, so its necessary to encode same ~log2(n) bits to define the length of each string.
    So nothing to save, unless there's some side information about probability distribution of n values or something.

  6. #5
    Member
    Join Date
    Jan 2017
    Location
    Spain
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Would those arguments be exactly the same for the opposite case, that is for (even-length) strings with not an equal amount of 0s and 1s? In other words instead of "read up until you get an equal amount of 0s and 1s", for strings like "read up until you get a unequal amount of 0s and 1s (but still of even length such as 2, 4, 6, 8, 10... bits)". Or is there any difference with this approach?

  7. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, the total number of permutations with k!=n from 2*n seems to be
    Sum[2*Binomial[2*n,i],{i,0,n-1}]

    Which Mathematica reduced to
    2*n + log2( 1 - (2*n)! / (n!)^2 / 4^n )

    Where second part looks like this for n=1..32 (this is savings, in bits, from knowing that k!=n):
    {-1., -0.678072, -0.540568, -0.460841, -0.407543, -0.368823, -0.339113, -0.315416, -0.295961, 
    -0.279629, -0.265671, -0.253567, -0.242944, -0.233525, -0.2251, -0.217507, -0.210619,
    -0.204334, -0.198569, -0.193256, -0.188341, -0.183776, -0.179521, -0.175544, -0.171816,
    -0.168311, -0.165009, -0.16189, -0.15894, -0.156142, -0.153485, -0.150956 }

    While you'd still have to store the n. So no, this is even worse.

  8. #7
    Member
    Join Date
    Jan 2017
    Location
    Spain
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Question

    I'd like to share a sample of the string that I'm trying to compress (attachment). The extension ".txt" has no meaning, I only appended it because the website wouldn't allow me to upload otherwise. It's just a binary file made up of strings as defined in the original post. Each string is defined as "read as many bits as possible until you have the same amount of 0s and 1s" (start from position 0 until the end of file). Because each string has the same amount of bits, the whole file also has the same amount of bits: 8672 zeroes and 8672 ones (unless I messed up something with the output, but I think it's fine).

    I was curious to know/see if anybody could give a stab at this and try to compress it, and I'd love to read your results and comments. I've tried to run this exact file through bz2, gz, lzma, xz, but they all expand the output. I can try to upload a larger file if necessary.

    Thanks a lot!
    Attached Files Attached Files

Similar Threads

  1. How do I dump the strings LZ77 matches?
    By SolidComp in forum Data Compression
    Replies: 4
    Last Post: 27th August 2016, 08:16
  2. Prime Number Benchmark
    By Matt Mahoney in forum Data Compression
    Replies: 17
    Last Post: 13th March 2016, 17:48
  3. Text strings coding chemical structures
    By FatBit in forum Data Compression
    Replies: 21
    Last Post: 19th February 2016, 21:16
  4. Concatenating strings
    By andromeda in forum Data Compression
    Replies: 29
    Last Post: 15th September 2014, 07:51
  5. Most efficient/practical compression method for short strings?
    By never frog in forum Data Compression
    Replies: 6
    Last Post: 1st September 2009, 04:05

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •