
Thread: What do you call this part of (lossless) data compression?

  1. #1
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts

    What do you call this part of (lossless) data compression?

    A lossless compression algorithm is ultimately just a shorter description of binary patterns that usually appear several times in the uncompressed data.

    To distinguish the incompressible data from the compressed data and mark it, you need something I call the "overtake" sign, or ovtk for short. Whenever something compressible appears, you write a sign indicating whether an ovtk block comes before it or not. If there is one, you also need to encode the length of the bytes that are not compressible and have to be copied unchanged during decompression.
    The way you mark an ovtk block and encode its length can have a big influence on the overall compression efficiency. Sometimes it is pointless to compress a 16-bit value if ovtk sign + ovtk length + compression sign + compression code together are bigger than those 16 bits, and some kind of skipping algorithm or efficiency analysis for this case gives better compression.
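
    For concreteness, a minimal sketch of such a flag + length scheme (all names and the 4-bit field size are made up for the example; the bit writer just prints bits so the sketch stays self-contained):

    #include <stdio.h>

    /* Stand-in bit writer that prints bits MSB-first, so the sketch is
     * self-contained. A real coder would pack bits into an output buffer. */
    static void put_bits(unsigned value, int nbits)
    {
        for (int i = nbits - 1; i >= 0; i--)
            putchar(((value >> i) & 1) ? '1' : '0');
    }

    /* One flag bit says whether an uncompressed ("ovtk") run precedes the
     * next compressed token; if it does, its length follows in a fixed
     * 4-bit field and the raw bytes are copied unchanged. */
    static void emit_token(int lit_len, const unsigned char *lits,
                           unsigned match_code, int match_bits)
    {
        if (lit_len > 0) {
            put_bits(1, 1);               /* ovtk flag set: literal run follows */
            put_bits(lit_len, 4);         /* run length, capped at 15 here      */
            for (int i = 0; i < lit_len; i++)
                put_bits(lits[i], 8);     /* the incompressible bytes, as-is    */
        } else {
            put_bits(0, 1);               /* no literal run before this token   */
        }
        put_bits(match_code, match_bits); /* the compressed token itself        */
    }

    int main(void)
    {
        const unsigned char raw[] = { 'a', 'b', 'c' };
        emit_token(3, raw, 0x5, 3);       /* 3 raw bytes, then a 3-bit code     */
        putchar('\n');
        return 0;
    }

    In this form the overhead per literal run is 1 flag bit plus 4 length bits, which is exactly the cost the efficiency question above is about.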

    Now my question: Is there another name for this "ovtk" marker, and what different kinds of length encoding for the incompressible blocks exist? Most compression discussions only cover the other part, the compression algorithm itself, but my interest is in learning more about the encoding of incompressible data. Are there any sources that cover this in more depth?

  2. #2
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    154
    Thanks
    20
    Thanked 66 Times in 37 Posts
    Sounds like you're talking about bit flags. Another, similar code called an "escape code" is used in worst-case scenarios and only appears as needed, unlike bit flags, which are placed at fixed intervals. But your example is for distinguishing incompressible blocks, right? Or bytes that weren't modeled? If by incompressible you mean the bytes that weren't modeled, those are called "literals".
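
    To illustrate the second convention, here is a rough sketch of an escape code in a byte-oriented RLE decoder (the ESC value and the run format are made up for the example, and the input is assumed well-formed):

    #include <stdio.h>

    #define ESC 0xFE  /* hypothetical reserved escape value */

    /* ESC only appears in the stream when needed; ordinary bytes pass
     * through as-is. ESC is followed by <count><value>, and a literal ESC
     * byte is escaped as ESC 0. Not any particular format, just the idea. */
    static void decode(const unsigned char *in, int n, FILE *out)
    {
        for (int i = 0; i < n; ) {
            if (in[i] != ESC) {
                fputc(in[i++], out);               /* plain literal byte   */
            } else if (in[i + 1] == 0) {
                fputc(ESC, out); i += 2;           /* escaped literal ESC  */
            } else {
                for (int k = 0; k < in[i + 1]; k++)
                    fputc(in[i + 2], out);         /* run of length count  */
                i += 3;
            }
        }
    }

    int main(void)
    {
        const unsigned char packed[] = { 'a', ESC, 5, 'b', 'c' };
        decode(packed, sizeof packed, stdout);     /* prints "abbbbbc"     */
        putchar('\n');
        return 0;
    }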

  3. #3
    Member just a worm's Avatar
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts
    A flag is a value that is 1 bit in size, so the term "bit flag" is redundant.

    I would probably also call it something along the lines of an "escape code" or "escape marker".

    You can also distinguish between payload and overhead: the flag that tells whether the input has been compressed is overhead, the length is overhead, and the compressed value is payload.

    I ran into a problem like the one you describe when I was working on a run-length encoding. Sometimes there are several bytes in a row with the same value and sometimes there are several bytes in a row with different values, so the algorithm needs to distinguish each block of bytes. In some cases it is not necessary to store this information because it can be determined by looking at the compressed data, but in some cases you do need to store it. Then there is the issue of storing how many bytes the block contains, which uses a variable-length encoding.
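
    As a sketch of the "can be determined by looking at the compressed data" case: one common RLE convention lets two equal bytes in a row imply that a count follows, so no explicit block flag is stored. This is a generic convention, not necessarily the encoding described above:

    #include <stdio.h>

    /* The decoder assumes a run whenever two equal bytes appear in a row,
     * and then reads one extra byte saying how many MORE copies follow.
     * Returns the number of output bytes. */
    static int rle_encode(const unsigned char *in, int n, unsigned char *out)
    {
        int o = 0;
        for (int i = 0; i < n; ) {
            unsigned char b = in[i];
            int run = 1;
            while (i + run < n && in[i + run] == b && run < 257)
                run++;
            out[o++] = b;                 /* first byte is always stored as-is   */
            if (run >= 2) {
                out[o++] = b;             /* repeated byte signals "run follows" */
                out[o++] = (unsigned char)(run - 2);  /* extra copies, 0..255    */
            }
            i += run;
        }
        return o;
    }

    int main(void)
    {
        const unsigned char data[] = "aaaaabcc";   /* 8 data bytes */
        unsigned char packed[32];
        int m = rle_encode(data, 8, packed);
        for (int i = 0; i < m; i++)
            printf("%02x ", packed[i]);   /* 61 61 03 62 63 63 00 */
        printf("\n");
        return 0;
    }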

    If you like, I can send you a German document about this run-length encoding together with a test program so you can see how efficient it is. I think it is the world's most efficient, but it's possible that there is a better one, of course.
    Last edited by just a worm; 16th September 2015 at 05:19.

  4. #4
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thanks for your answers!

    @just a worm: You can send me the source if you like (Crush@Bigfoot.de), but I'm actually only deeply interested in the different methods for managing incompressible data blocks.

    Here is a bit of personal history about how I got here, so you can better understand why I'm asking; feel free to skip it:
    I first came into contact with data crunchers more than 30 years ago on the C64, by improving an existing one, which started my interest. I think it was one of the first compressors done by Matt Mahoney, because that is the only name I can remember from that time. It was pure assembler-monitor hardware hacking and I learned a lot from it. About 5-8 years later I was forced to code my first own turbo compressor on the Amiga for a demo I made, because I couldn't find any Amiga compression source code I could use. Without knowing in detail how others did it, it was astonishingly better in most cases than the lha and lzx compressors, which were the top at that time. It was also used for some issues of a diskmag some friends made. Then a lot of time went by, but I always kept an eye on the compression scene, because so many compressors keep appearing and there still seem to be some crazy ways to get everything even smaller, like arithmetic coding or BWT.

    Back to my question:
    I only know the simple scheme from that C64 code, which I have used ever since: 1 bit before each compression block signals whether some incompressible bytes have to be read first or not. If so, read a 2-bit field (0-2 = length, 3 = read a bigger bit field), then a 4-bit field (3-18 = length plus the maximum of the previous field, 19 = read an even bigger field), and so on. These field sizes determine how large incompressible data gets in the compressed file, so they should be tested carefully.
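
    A decoder-side sketch of that escalating length code, using the 2-bit and 4-bit fields from the example (the exact value ranges and the wider field sizes are assumptions, and the bit reader is only a stand-in):

    #include <stdio.h>

    /* Minimal MSB-first bit reader, for illustration only. */
    struct bits { const unsigned char *p; int pos; };

    static unsigned get_bits(struct bits *b, int n)
    {
        unsigned v = 0;
        while (n--) {
            v = (v << 1) | ((b->p[b->pos >> 3] >> (7 - (b->pos & 7))) & 1);
            b->pos++;
        }
        return v;
    }

    /* Read a small field; if it holds the all-ones maximum, add that
     * maximum and read a wider field, and so on. Field widths beyond
     * 2 and 4 bits are assumptions. */
    static unsigned read_length(struct bits *b)
    {
        static const int widths[] = { 2, 4, 8, 16 };
        unsigned length = 0;
        for (int i = 0; i < 4; i++) {
            unsigned max = (1u << widths[i]) - 1;  /* all-ones = "keep going" */
            unsigned v = get_bits(b, widths[i]);
            length += v;
            if (v != max)
                break;
        }
        return length;
    }

    int main(void)
    {
        const unsigned char buf[] = { 0xD4 };  /* bits 11 (escape) then 0101 */
        struct bits b = { buf, 0 };
        printf("%u\n", read_length(&b));       /* prints 8 (3 + 5)           */
        return 0;
    }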

    This is also why I want to know whether there are any more efficient or different ways to do it that I haven't thought of before. The only idea I had was to check before compression which block sizes can occur and save them in a table. After some tests it turned out to be useless, because for standard files there was no real benefit, only for some special cases that almost never appear.

    On a bitwise-reading compressor this method gives nice results, but it can also be done differently:

    While coding the Amiga cruncher I wondered whether there is another way that avoids encoding the length at all, and I found a way to mark only the compressed data. I wanted this because the Amiga cruncher used 2 passes and an improved/extended RLE to shorten the distances for the length/distance block compression. I checked the uncompressed data for the rarest byte values and used those as markers for the RLE-compressed blocks. If such a byte started to appear more often in the stream, it was detected from that position on and the marker was switched to the next rarest byte. This had the advantage that I never had to blow up the output for incompressible data and needed only 8 bits to mark compressed data, no more. This worked fine, was quick, the bytes stayed in their original state and alignment, and it very often resulted in much better compression and of course also speed for the second pass. Using 8 bits sounds wasteful, because fewer would be better; in my tests I have seen cases where you need only 3-6 bits, but also often more than 8, so on average you end up with something like 4-12 (or more) bits. So the result is not that bad in comparison; in fact it was in many cases better than bitwise compression methods for RLE.
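
    A sketch of just the first step of that technique, picking the rarest byte value as the in-band marker; the dynamic switch to the next rarest byte once the marker starts appearing in the data is left out:

    #include <stdio.h>
    #include <stddef.h>

    /* Count byte frequencies and pick the rarest value as the in-band
     * marker for the RLE-compressed blocks. */
    static unsigned char rarest_byte(const unsigned char *data, size_t n)
    {
        size_t count[256] = { 0 };
        for (size_t i = 0; i < n; i++)
            count[data[i]]++;
        unsigned best = 0;
        for (unsigned v = 1; v < 256; v++)
            if (count[v] < count[best])
                best = v;                 /* rarest value so far */
        return (unsigned char)best;
    }

    int main(void)
    {
        const unsigned char data[] = "abracadabra";
        printf("marker byte: %u\n", rarest_byte(data, 11));  /* a value unused in the data */
        return 0;
    }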

    Since then it has been a hobby of mine, just for fun, to think about other solutions to this problem. A while ago I had a nice idea to reduce the markers to fewer bits than you would need for length encoding plus marking, and I have now written a test program to check its efficiency. It works rather well and I am still experimenting to optimize the idea. How much could compression be improved if you did not need to set additional markers and length fields for incompressible data? Only mark compressed code and that's it. I am also thinking about some improvements to the compression algorithms themselves, but that is absolutely secondary at the moment. If there are more efficient ways to skip the junk already out there, I would know better whether my work is useless, already beaten, or only a reconstruction of old, well-known methods I simply have not seen documented before. Most people don't think about the "junk" and concentrate mainly on the compression algorithm itself; the only thing coders look at is whether there is any way to compress this "junk" as well. I see some parallels to genome science, where people declared the junk DNA unnecessary and concentrated only on decoding the "content-holding" genes; now they know better, and it is not as useless and unimportant as they first thought.

    The reason I do this is that I hope to be the first to add a totally new idea to current compression and raise the bar. Perhaps I'm wrong, but that is why I don't just code away on my own and instead ask people out there who know much more than me whether this is useless work, or whether it can be done even better.
    Last edited by Crush; 16th September 2015 at 12:44.

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    In LZ77, the incompressible part of the data is coded as literals. You need a symbol to distinguish literals from matches, and you need to encode the literal length, match length, and match offset. The best way to do this is measure the distribution in the data and choose the appropriate code lengths using Huffman, arithmetic, or asymmetric coding. In developing zpaq, I found a fairly uniform distribution over log(literal length), log(match offset), and log(match length + 4), with about 50-75% matches, so I used a fixed code, coding the length of the number in a fixed number of bits, followed by the number without a leading 1 bit. Deflate actually measures the input and constructs an appropriate Huffman table, so it sometimes compresses a little better on small files. However it usually does poorly because offsets are limited to 32K. That was a design constraint decades ago when 32K was a lot of memory.
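
    As I read it, a sketch of that fixed code (an illustration only, not zpaq's actual source; the 5-bit width of the length field is an assumption): the bit length of the value goes into a fixed-width field, then the value follows without its leading 1 bit:

    #include <stdio.h>

    /* Stand-in bit writer that just prints bits MSB-first. */
    static void put_bits(unsigned value, int nbits)
    {
        for (int i = nbits - 1; i >= 0; i--)
            putchar(((value >> i) & 1) ? '1' : '0');
    }

    static void put_log_code(unsigned v)          /* v must be >= 1 */
    {
        int len = 0;
        for (unsigned t = v; t; t >>= 1)
            len++;                                /* significant bits of v      */
        put_bits(len, 5);                         /* 5 bits cover lengths 1..31 */
        put_bits(v & ((1u << (len - 1)) - 1), len - 1);  /* v without leading 1 */
    }

    int main(void)
    {
        put_log_code(13);   /* 13 = 1101b -> "00100" then "101" */
        putchar('\n');
        return 0;
    }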

    Maybe you will find http://mattmahoney.net/dc/dce.html interesting.

  6. #6
    Member
    Join Date
    Sep 2015
    Location
    germany
    Posts
    12
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thank you for your explanation, Matt. I already read a little bit of your really great document yesterday; it describes a lot of interesting things. Everybody who wants to get into compression coding should dive into it first.
    Yesterday I coded a first detection routine for my new idea, which works totally differently from anything I have seen done elsewhere. I described it a little in the previous post: my goal is to get rid of all literal markers and literal length encoding, so that you only need a unique marker for matches. And if there is no big error in my (rather simple) code, it seems to be possible, so I have to test it more intensively for correctness. The next question is how well it compares to standard literal coding, so next I have to write a simple cruncher to test it against the LZ77 method and see whether my method can beat the standard encoding.
