
Thread: Compressing small bits of data

  1. #1
    Member
    Join Date
    Apr 2009
    Location
    san rafael, ca
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Compressing small bits of data

    Many compression techniques I encounter seem designed for large amounts of data, and implementations are all about reading/writing files. I need to compress/decompress small data streams (200-1500 bytes) in memory. Looking for something simple to implement (or already implemented) in C or simple C++ (no STL, no namespaces, no C#). Ideas? Thanks.

    fred

  2. #2
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    From an AI point of view, compression is essentially learning, so every compressor needs a reasonably large amount of data to learn effectively. Bitwise models generally adapt faster and thus improve compression on small files, but this depends on the input, so it's best to test a very simple bitwise coder first. You can grab a simple one here: http://www.encode.ru/downloads/fpaq0p.zip
    It's a file-to-file compressor, but you can easily modify it into memory-to-memory. It's simple enough.
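    For reference, a memory-to-memory version of that kind of coder might look roughly like this (a minimal sketch in the spirit of fpaq0p, not the actual fpaq0p code; function names are made up, output-buffer bounds are not checked, and the 4-byte flush wastes a few bytes for the sake of simplicity):
    Code:
    /* Order-0 bitwise adaptive arithmetic coder, memory-to-memory sketch. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t x1, x2, x;       /* range bounds and decoder window      */
        uint8_t *buf; size_t pos; /* in-memory input/output buffer        */
        uint16_t p[256];          /* P(bit=1) per bit-context, 16-bit     */
    } Coder;

    static void coder_init(Coder *c, uint8_t *buf) {
        c->x1 = 0; c->x2 = 0xFFFFFFFFu; c->x = 0; c->buf = buf; c->pos = 0;
        for (int i = 0; i < 256; i++) c->p[i] = 1 << 15;   /* start at p = 0.5 */
    }

    static void update(Coder *c, int ctx, int bit) {        /* adapt toward bit */
        if (bit) c->p[ctx] += (65536 - c->p[ctx]) >> 5;
        else     c->p[ctx] -= c->p[ctx] >> 5;
    }

    static void encode_bit(Coder *c, int ctx, int bit) {
        uint32_t xmid = c->x1 + (uint32_t)(((uint64_t)(c->x2 - c->x1) * c->p[ctx]) >> 16);
        if (bit) c->x2 = xmid; else c->x1 = xmid + 1;
        update(c, ctx, bit);
        while ((c->x1 ^ c->x2) < (1u << 24)) {               /* top bytes equal: emit */
            c->buf[c->pos++] = (uint8_t)(c->x2 >> 24);
            c->x1 <<= 8; c->x2 = (c->x2 << 8) | 255;
        }
    }

    size_t ac_compress(const uint8_t *in, size_t n, uint8_t *out) {
        Coder c; coder_init(&c, out);
        for (size_t i = 0; i < n; i++) {
            int ctx = 1;                                     /* bits of current byte so far */
            for (int b = 7; b >= 0; b--) {
                int bit = (in[i] >> b) & 1;
                encode_bit(&c, ctx, bit);
                ctx = ctx * 2 + bit;
            }
        }
        for (int i = 0; i < 4; i++) {                        /* flush remaining state */
            c.buf[c.pos++] = (uint8_t)(c.x1 >> 24);
            c.x1 <<= 8;
        }
        return c.pos;                                        /* compressed size */
    }

    static int decode_bit(Coder *c, int ctx) {
        uint32_t xmid = c->x1 + (uint32_t)(((uint64_t)(c->x2 - c->x1) * c->p[ctx]) >> 16);
        int bit = (c->x <= xmid);
        if (bit) c->x2 = xmid; else c->x1 = xmid + 1;
        update(c, ctx, bit);
        while ((c->x1 ^ c->x2) < (1u << 24)) {
            c->x1 <<= 8; c->x2 = (c->x2 << 8) | 255;
            c->x = (c->x << 8) | c->buf[c->pos++];
        }
        return bit;
    }

    void ac_decompress(const uint8_t *in, uint8_t *out, size_t n_out) {
        Coder c; coder_init(&c, (uint8_t *)in);              /* decoder only reads buf */
        for (int i = 0; i < 4; i++) c.x = (c.x << 8) | c.buf[c.pos++];
        for (size_t i = 0; i < n_out; i++) {
            int ctx = 1;
            for (int b = 0; b < 8; b++) ctx = ctx * 2 + decode_bit(&c, ctx);
            out[i] = (uint8_t)(ctx & 0xFF);                  /* ctx = 256 + decoded byte */
        }
    }
    The decompressor needs to know the original size (store it separately, e.g. as a 2-byte header, since the messages are small).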
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    Tester
    Black_Fox's Avatar
    Join Date
    May 2008
    Location
    [CZE] Czechia
    Posts
    471
    Thanks
    26
    Thanked 9 Times in 8 Posts
    RLE and LZSS would be less effective, but simpler, alternatives.
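    For illustration, byte-oriented RLE can be this small (a toy sketch with an arbitrary convention: runs of four or more identical bytes are followed by a count of extra repeats; no bounds checking):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    size_t rle_encode(const uint8_t *in, size_t n, uint8_t *out) {
        size_t o = 0;
        for (size_t i = 0; i < n; ) {
            size_t run = 1;
            while (i + run < n && in[i + run] == in[i] && run < 255 + 4) run++;
            if (run >= 4) {
                out[o++] = in[i]; out[o++] = in[i]; out[o++] = in[i]; out[o++] = in[i];
                out[o++] = (uint8_t)(run - 4);          /* number of extra repeats */
            } else {
                for (size_t k = 0; k < run; k++) out[o++] = in[i];
            }
            i += run;
        }
        return o;
    }

    size_t rle_decode(const uint8_t *in, size_t n, uint8_t *out) {
        size_t o = 0, run = 0;
        uint8_t prev = 0;
        for (size_t i = 0; i < n; ) {
            uint8_t b = in[i++];
            out[o++] = b;
            run = (o > 1 && b == prev) ? run + 1 : 1;
            prev = b;
            if (run == 4) {                             /* a count byte follows */
                uint8_t extra = in[i++];
                for (int k = 0; k < extra; k++) out[o++] = b;
                run = 0;                                /* encoder restarts its run here too */
            }
        }
        return o;
    }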
    I am... Black_Fox... my discontinued benchmark
    "No one involved in computers would ever say that a certain amount of memory is enough for all time? I keep bumping into that silly quotation attributed to me that says 640K of memory is enough. There's never a citation; the quotation just floats like a rumor, repeated again and again." -- Bill Gates

  4. #4
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    I concur. LZSS with a very small buffer and a buffer-limited match search can give pretty good results. I wrote one for very similar needs recently, and it worked well, although don't expect stellar results: average savings were around 30%, with big variations depending on source redundancy.
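    Something along those lines might look like this (a toy sketch, not Cyan's code: greedy parsing, brute-force match search, 12-bit offsets and 4-bit lengths, flag bits grouped eight per control byte, no output bounds checking):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define MIN_MATCH  3
    #define MAX_MATCH  18
    #define MAX_OFFSET 4095        /* 12-bit offset covers the whole 1500-byte message */

    size_t lzss_encode(const uint8_t *in, size_t n, uint8_t *out) {
        size_t i = 0, o = 0, flag_pos = 0;
        int flag_bit = 8;                               /* force a new flag byte first */
        while (i < n) {
            if (flag_bit == 8) {                        /* start a new group of 8 items */
                flag_pos = o++;
                out[flag_pos] = 0;
                flag_bit = 0;
            }
            size_t best_len = 0, best_off = 0;
            size_t start = (i > MAX_OFFSET) ? i - MAX_OFFSET : 0;
            for (size_t j = start; j < i; j++) {        /* brute-force match search */
                size_t len = 0;
                while (len < MAX_MATCH && i + len < n && in[j + len] == in[i + len]) len++;
                if (len > best_len) { best_len = len; best_off = i - j; }
            }
            if (best_len >= MIN_MATCH) {                /* match: flag bit + 2 bytes */
                out[flag_pos] |= (uint8_t)(1 << flag_bit);
                out[o++] = (uint8_t)(best_off >> 4);
                out[o++] = (uint8_t)((best_off << 4) | (best_len - MIN_MATCH));
                i += best_len;
            } else {
                out[o++] = in[i++];                     /* literal byte */
            }
            flag_bit++;
        }
        return o;
    }

    size_t lzss_decode(const uint8_t *in, size_t n, uint8_t *out) {
        size_t i = 0, o = 0;
        while (i < n) {
            uint8_t flags = in[i++];
            for (int b = 0; b < 8 && i < n; b++) {
                if (flags & (1 << b)) {                 /* copy from already-decoded output */
                    size_t off = ((size_t)in[i] << 4) | (in[i + 1] >> 4);
                    size_t len = (in[i + 1] & 0x0F) + MIN_MATCH;
                    i += 2;
                    for (size_t k = 0; k < len; k++, o++) out[o] = out[o - off];
                } else {
                    out[o++] = in[i++];
                }
            }
        }
        return o;
    }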
    Last edited by Cyan; 25th April 2009 at 23:19.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    The question is, what do you want to compress there, and how much?
    If it's a sequence of messages, it can be treated as a single stream for compression;
    if random access is necessary, that's a different situation.
    Also, is it text, audio, or image data?
    Even if you need random access to your data chunks, it still might
    be possible to improve compression a lot by using a dictionary for text,
    and maybe something similar for other data types too.
    It's also really important what kind of speed you need.
    For text and some other data types LZSS would be a common practical
    solution, but it's far from good for audio and images.
    Another possible option is BWT, though the caveat about images still applies.

  6. #6
    Member
    Join Date
    Apr 2009
    Location
    san rafael, ca
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    It's binary data that's mostly a sequence of "messages" that are all of the same form, with some text-like content. No audio, no images. Random access is not necessary. 300-1500 bytes, tops. Right now I am getting on average a 40% reduction with ZIP.

    I would think that maybe a single dictionary could be developed by surveying a bunch of candidate data streams (which I have), and then not storing the dictionary in the compressed stream, but keeping it in the decompressor. I'll take a look at LZSS and BWT. Any pointers to source code for these?

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, for LZ there are the well-known zlib and LZO, and this: http://quicklz.com/
    And LZMA, of course. There are many, in fact.
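    For the in-memory requirement specifically, zlib already works on buffers directly; a minimal sketch (wrapper names are made up, error handling reduced to return codes):
    Code:
    #include <zlib.h>

    /* out capacity should be at least compressBound(msg_len);
       *out_len is the capacity on entry, the compressed size on return. */
    int pack(const unsigned char *msg, unsigned long msg_len,
             unsigned char *out, unsigned long *out_len)
    {
        return compress2(out, out_len, msg, msg_len, Z_BEST_COMPRESSION);  /* Z_OK on success */
    }

    /* *out_len must hold the original (uncompressed) size, stored separately. */
    int unpack(const unsigned char *packed, unsigned long packed_len,
               unsigned char *out, unsigned long *out_len)
    {
        return uncompress(out, out_len, packed, packed_len);
    }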
    Also, structural preprocessing may be effective (if there's a specific
    message format with fields of different types). Universal codecs don't
    understand such structure, so it can pay to subtract fields with
    similar values, encode ASCII numbers as binary, transform fixed-length
    binary numbers into variable-length ones, store floats in a terminated
    form instead of exponential notation, etc.
    In short, any simple and redundant parts of the data that are predictable
    once the message structure is known should be handled by a preprocessor,
    keeping in mind that most universal codecs only work with prefix contexts.
    Another idea is to use different alphabets for fields of different
    types, e.g. bytes 00-0F for nibble-encoded binary numbers and the
    rest of the alphabet for text.
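    As a toy illustration of that kind of preprocessing, assume a made-up message layout whose first four bytes are a little-endian 32-bit timestamp that grows slowly from message to message; one could delta-code that field and store the delta as a variable-length integer before handing the rest to the general-purpose compressor:
    Code:
    #include <stdint.h>
    #include <stddef.h>

    static size_t put_varint(uint32_t v, uint8_t *out) {    /* 7 bits per byte, MSB = "more" */
        size_t n = 0;
        while (v >= 0x80) { out[n++] = (uint8_t)(v | 0x80); v >>= 7; }
        out[n++] = (uint8_t)v;
        return n;
    }

    /* Hypothetical layout: msg[0..3] = little-endian timestamp, rest is payload.
       *prev_ts starts at 0 for the first message of the stream. */
    size_t preprocess_msg(const uint8_t *msg, size_t len, uint32_t *prev_ts, uint8_t *out) {
        uint32_t ts = (uint32_t)msg[0] | ((uint32_t)msg[1] << 8) |
                      ((uint32_t)msg[2] << 16) | ((uint32_t)msg[3] << 24);
        size_t o = put_varint(ts - *prev_ts, out);           /* small delta -> 1-2 bytes */
        *prev_ts = ts;
        for (size_t i = 4; i < len; i++) out[o++] = msg[i];  /* remaining fields unchanged */
        return o;
    }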

    As for BWT, there are many implementations, starting with bzip2,
    but I'm not aware of anything as convenient as the LZ libraries.
    Maybe this, though I didn't evaluate it myself:
    http://www.geocities.com/zxcb33/

    And then, if it's really a stream without random access, it'd be interesting
    if you could test this:
    http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
    It's a rewrite of ppmd (which is used in WinZip/RAR/7-Zip as the "text compression
    method") with the model encapsulated in a class, abstraction from file I/O, etc.

    And as to dictionaries, the simplest method, applicable with any
    compression algorithm, would be to encode a fixed block of "typical data"
    (and drop the encoded result) before encoding/decoding the actual data.
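    With zlib specifically, that trick exists in a built-in form: a preset dictionary known to both sides and never transmitted. A sketch of the encode side (typical_data and typical_len are hypothetical, a block of representative messages compiled offline):
    Code:
    #include <zlib.h>
    #include <string.h>

    extern const unsigned char typical_data[];   /* hypothetical: built offline from samples */
    extern const unsigned int  typical_len;

    int pack_with_dict(const unsigned char *msg, unsigned int msg_len,
                       unsigned char *out, unsigned long *out_len)
    {
        z_stream s;
        memset(&s, 0, sizeof(s));
        if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK) return -1;
        deflateSetDictionary(&s, typical_data, typical_len);  /* prime the LZ window */

        s.next_in  = (Bytef *)msg;  s.avail_in  = msg_len;
        s.next_out = out;           s.avail_out = (uInt)*out_len;
        int rc = deflate(&s, Z_FINISH);                       /* whole message in one call */
        *out_len = s.total_out;
        deflateEnd(&s);
        return (rc == Z_STREAM_END) ? 0 : -1;
    }
    On the decode side, inflate() returns Z_NEED_DICT, after which inflateSetDictionary() supplies the same block.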
    Last edited by Shelwien; 28th April 2009 at 22:18.

  8. #8
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by fredrp View Post
    it's binary data that's mostly a sequence of "messages" that are all of the same form, with some text-like content. No audio, no images. Random access not necessary. 300-1500 bytes, tops. Right now I am getting on average a 40% reduction with ZIP.

    I would think that maybe a single dictionary could be developed by surveying a bunch of candidate data streams (which I have), and then NOT storing the dictionary in the compressed stream, but have it stored in the decompressor. I'll take a look at LZSS and BWT. Any pointers to source code for these?
    If the messages are similar, as you said, then substring-dictionary coding makes sense. In other words, suppose you have a bunch of substrings (not necessarily printable characters) in a dictionary, sorted by frequency so that frequent substrings get shorter codes. You can then encode the messages as sequences of dictionary indexes. You can also improve compression further with entropy coding (Huffman, arithmetic, or whatever you favor). I think this scheme fits your needs.
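    A toy sketch of that scheme (the dictionary entries below are placeholders; a real table would be built offline from sample messages, shared by encoder and decoder, and the indexes would then be entropy-coded):
    Code:
    #include <string.h>
    #include <stddef.h>
    #include <stdint.h>

    static const char *dict[] = { "temperature=", "status=OK;", "id=", ";" };  /* placeholders */
    #define DICT_SIZE (sizeof(dict) / sizeof(dict[0]))        /* must stay below 255 */

    size_t dict_encode(const uint8_t *in, size_t n, uint8_t *out) {
        size_t i = 0, o = 0;
        while (i < n) {
            int best = -1;
            size_t best_len = 0;
            for (size_t d = 0; d < DICT_SIZE; d++) {          /* longest matching entry */
                size_t len = strlen(dict[d]);
                if (len > best_len && i + len <= n && memcmp(in + i, dict[d], len) == 0) {
                    best = (int)d;
                    best_len = len;
                }
            }
            if (best >= 0) {
                out[o++] = (uint8_t)best;                     /* dictionary index */
                i += best_len;
            } else {
                out[o++] = 0xFF;                              /* escape, then a literal byte */
                out[o++] = in[i++];
            }
        }
        return o;
    }
    The decoder is symmetric: an index below 0xFF copies the corresponding table entry, 0xFF copies the following literal byte.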
    BIT Archiver homepage: www.osmanturan.com

  9. #9
    Member
    Join Date
    Apr 2009
    Location
    san rafael, ca
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I did try your PPMd, and on these small data sets it performs better with default settings than the zlib I've been using at maximum compression.

    Quote Originally Posted by Shelwien View Post
    ...
    And then, if it's really a stream without random access, it'd be interesting
    if you could test this:
    http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
    It's a rewrite of ppmd (which is used in WinZip/RAR/7-Zip as the "text compression
    method") with the model encapsulated in a class, abstraction from file I/O, etc.

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Instead of the default settings, you should probably use a lower order
    (like /o6 or /o) and disable the x86 code filter (/e).

    Also, as I said, there's no point in compressing the messages
    separately if you only need sequential access; you should compress
    it all as a single stream instead. At least this ppmd outputs decoded
    data byte by byte, not in blocks, so you can use your message data
    immediately after decoding its bytes, without processing any more
    compressed data.

    And in fact LZ algorithms are not really different in this respect;
    they just may not provide a corresponding API.
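    For what it's worth, zlib can be used that way too: keep one long-lived z_stream and end each message with Z_SYNC_FLUSH, so each message's compressed bytes can be decoded as soon as they arrive while the history window keeps accumulating across messages (a minimal sketch; assumes out_cap is always large enough):
    Code:
    #include <zlib.h>
    #include <string.h>

    typedef struct { z_stream s; } MsgCompressor;

    int mc_init(MsgCompressor *mc) {
        memset(&mc->s, 0, sizeof(mc->s));
        return deflateInit(&mc->s, Z_DEFAULT_COMPRESSION);    /* Z_OK on success */
    }

    /* Compresses one message; returns the number of compressed bytes produced. */
    unsigned long mc_push(MsgCompressor *mc, const unsigned char *msg, unsigned int len,
                          unsigned char *out, unsigned int out_cap) {
        mc->s.next_in  = (Bytef *)msg;  mc->s.avail_in  = len;
        mc->s.next_out = out;           mc->s.avail_out = out_cap;
        deflate(&mc->s, Z_SYNC_FLUSH);  /* flush so the decoder can finish this message */
        return out_cap - mc->s.avail_out;
    }
    The decoder keeps a matching long-lived inflate stream and feeds it each chunk as it arrives.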
    Last edited by Shelwien; 28th April 2009 at 22:39.

