
Thread: Are most compression programs byte-oriented?

  1. #1
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    48
    Thanks
    25
    Thanked 10 Times in 7 Posts

    Are most compression programs byte-oriented?

    Hello to the expert members of this board.

    A question that nags me right now.
    Is it correct that most compression programs are optimized for sequences of bytes as input data?

    Are there compression programs that are optimized for other data sizes, e.g. sequences of integer (16 bit)?

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    It's rare, but there are some; for example, NNCP directly supports 16-bit symbols, as does lzma -lc8 -lp1.

    But it's inefficient, since most CPUs still have only 32 KB of L1 cache, so no lookup table
    for 16-bit symbols would fit in L1.
    So people usually prefer to deal with large alphabets by preprocessing, like utf8.
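For scale, the cache argument above can be sketched with some quick arithmetic (the 16-bit counter width and the 32 KB L1 size are assumptions, not measurements of any particular model):

```python
# Rough arithmetic behind the cache argument: an order-0 frequency
# table with one counter per symbol (16-bit counters assumed).
L1_BYTES = 32 * 1024                       # typical L1 data cache
COUNTER_BYTES = 2                          # assumed counter width

table_8bit = (1 << 8) * COUNTER_BYTES      # 512 bytes - fits easily
table_16bit = (1 << 16) * COUNTER_BYTES    # 128 KiB - 4x the whole L1

print(table_8bit <= L1_BYTES, table_16bit <= L1_BYTES)  # True False
```

And that is only an order-0 table; any context lookup multiplies these sizes further.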

  3. The Following User Says Thank You to Shelwien For This Useful Post:

    WinnieW (11th May 2019)

  4. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    There are plenty of programs oriented towards multimedia compression: 24-bit bitmaps (3 x 8-bit channels), 16-bit stereo sound, etc.
    Context mixing programs often have record models for records of different sizes, e.g. 8 bytes.
    There's also preprocessing: e.g. in the 7z format, executables are preprocessed so that the input data is split into multiple streams which are e.g. delta coded, and there's the E8/E9 transformation for x86 code. Some streams contain sequences of 32-bit offsets, so the 7z format has special modelling for the low offset bits.

    While compressors often recognize multi-byte structures in input data or encode data bit-by-bit, (almost?) all of them work on structures that start on a byte boundary. This means that e.g. if you have a context mixing compressor that compresses bit-by-bit, the contexts used for modelling all start on a byte boundary. Contexts that can start at any bit offset aren't usually useful, because (most?) uncompressed data is (multi-)byte oriented and it usually doesn't make sense to compress already-compressed data (without decompressing it first). JPEG compression in PAQ(-like) compressors is maybe an exception, but I'm not sure how it works.
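To illustrate the stream-splitting idea mentioned above, here is a minimal delta coder for a stream of little-endian 32-bit offsets. It is a sketch of the general technique only, not 7z's actual filter:

```python
import struct

def delta_u32(data: bytes) -> bytes:
    """Replace each little-endian 32-bit value with its difference
    from the previous one; regular strides become small deltas."""
    vals = struct.unpack("<%dI" % (len(data) // 4), data)
    prev, out = 0, []
    for v in vals:
        out.append((v - prev) & 0xFFFFFFFF)
        prev = v
    return struct.pack("<%dI" % len(out), *out)

offsets = struct.pack("<4I", 0x1000, 0x1010, 0x1020, 0x1030)
# after delta coding, three of the four values are the same small number
print(delta_u32(offsets) == struct.pack("<4I", 0x1000, 0x10, 0x10, 0x10))
```

A stream of mostly-identical small deltas is then far easier for the back-end entropy coder than the raw monotonically increasing offsets.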

  5. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    WinnieW (11th May 2019)

  6. #4
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    48
    Thanks
    25
    Thanked 10 Times in 7 Posts
    Well, thanks a lot for your useful responses. I'm at rookie level in data compression and it's only a spare-time interest for me.
    I asked myself why so many compression programs read input data as a stream of bytes.
    Ok, I realized that for LZ it doesn't make much of a difference whether the data stream is grouped into 8-bit or 16-bit units.
    But is there a reason why there is, e.g., a BWT rearranging 16-bit data instead of bytes?

    I guess it doesn't make much sense to optimize a compression algorithm for data types of arbitrary bit sizes, only for data types of 2^n bits (for integers n >= 2).

  7. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    But is there a reason why there is, e.g., a BWT rearranging 16-bit data instead of bytes?
    Is there such a thing? Post a link to it.

    I guess it doesn't make much sense to optimize a compression algorithm for data types of arbitrary bit sizes, only for data types of 2^n bits (for integers n >= 2).
    It does make sense to optimize a compression algorithm for any type of data, as long as it will be used to compress significant amounts of such data and such data can be compressed at all.

    For example, if you had a program A that outputs compressible data as 11-bit symbols stitched together into a continuous bit stream, then it would make sense to make a compressor B operating on 11-bit symbols to compress the output of program A.
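A quick sketch of the first thing such a compressor B would have to do - split the continuous bit stream back into aligned 11-bit symbols. Both the packer (a stand-in for the hypothetical program A) and the MSB-first bit order are assumptions for illustration:

```python
def pack_symbols(syms, bits=11):
    """Pack fixed-width symbols MSB-first into a byte stream
    (a stand-in for the hypothetical program A's output)."""
    acc = width = 0
    out = bytearray()
    for s in syms:
        acc = (acc << bits) | s
        width += bits
        while width >= 8:
            width -= 8
            out.append((acc >> width) & 0xFF)
    if width:
        out.append((acc << (8 - width)) & 0xFF)  # pad last byte with zeros
    return bytes(out)

def unpack_symbols(data, bits=11):
    """Split the continuous bit stream back into aligned symbols -
    the first step compressor B would take before modelling them."""
    acc = width = 0
    out = []
    for byte in data:
        acc = (acc << 8) | byte
        width += 8
        while width >= bits:
            width -= bits
            out.append((acc >> width) & ((1 << bits) - 1))
    return out

syms = [0x7FF, 0x000, 0x123]
print(unpack_symbols(pack_symbols(syms)) == syms)  # True
```

Once the symbols are aligned, any model over a 2048-symbol alphabet can be applied to them.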

    Look at https://github.com/JohannesBuchner/p...aq8l/paq8l.cpp
    PAQ8L models records of any byte length and does that dynamically (i.e. it adapts to the record lengths found in the input data):
    - Fixed record length. The record length is determined by searching for
    byte sequences with a uniform stride length. Once this is found, then
    the record length is combined with the context of the bytes immediately
    preceding it and the corresponding byte locations in the previous
    one or two records (as with formatted text).
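The stride-detection idea from the quoted comment can be illustrated with a toy heuristic - the most common distance between consecutive occurrences of the same byte value. This is a loose sketch of the idea only, not paq8l's actual algorithm:

```python
from collections import Counter

def guess_record_length(data: bytes, max_len: int = 64) -> int:
    """Guess a fixed record length: for every byte value, measure the
    distance to its previous occurrence, and pick the most common
    stride. A loose sketch, not paq8l's real heuristic."""
    last = {}
    strides = Counter()
    for i, b in enumerate(data):
        if b in last:
            d = i - last[b]
            if 2 <= d <= max_len:
                strides[d] += 1
        last[b] = i
    return strides.most_common(1)[0][0] if strides else 1

# rows of 8-byte records with a constant marker column
rows = b"".join(bytes([0xAA, i, 0, 0, 0, 0, 0, i % 3]) for i in range(32))
print(guess_record_length(rows))  # 8
```

Real record models then combine the detected length with the bytes at the corresponding positions in previous records, as the comment describes.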

  8. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    WinnieW (12th May 2019)

  9. #6
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    48
    Thanks
    25
    Thanked 10 Times in 7 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Is there such a thing? Post a link to it.
    No, sorry for the confusion, I got distracted while posting.
    There is no BWT of 16-bit data right now that I know of.

  10. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Actually there is: http://nishi.dreamhosters.com/u/BWT16.rar

    Code:
    // some japanese books: http://nishi.dreamhosters.com/u/test_bwt16.7z
    6,171,521 testu8.txt
    6,171,525 testu8.bwt
      975,482 testu8.bwt.paq8px178
    4,298,866 testu16be.txt
    4,298,870 testu16be.bwt
      963,802 testu16be.bwt.paq8px178
    4,298,864 testu16le.txt
    4,298,868 testu16le.bwt
      954,412 testu16le.bwt.paq8px178
    
    // enwik8
    100,000,000 enwik8
    100,000,004 enwik8.bwt
     21,614,894 enwik8.ari
    199,243,770 enwik8u16
    199,243,774 enwik8u16.bwt
    100,000,111 enwik8u16.bwt_u8
     21,718,209 enwik8u16.bwt_u8.ari
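For readers who want to play with the idea without the archive, here is a naive sort-based BWT over 16-bit symbols plus the textbook quadratic inverse. This is a sketch only; the sources in BWT16.rar are the real thing:

```python
def bwt16(symbols):
    """Sort all rotations of the symbol sequence; return the last
    column and the sorted position of the original rotation."""
    n = len(symbols)
    order = sorted(range(n),
                   key=lambda i: [symbols[(i + k) % n] for k in range(n)])
    primary = order.index(0)
    last = [symbols[(i - 1) % n] for i in order]
    return last, primary

def unbwt16(last, primary):
    """Textbook inverse: rebuild the sorted rotation table one column
    at a time, then read off the row holding the original sequence."""
    n = len(last)
    table = [[] for _ in range(n)]
    for _ in range(n):
        table = sorted([last[i]] + table[i] for i in range(n))
    return table[primary]

syms = [0x3042, 0x3044, 0x3042, 0x0041, 0x3042]  # toy 16-bit "text"
last, p = bwt16(syms)
print(unbwt16(last, p) == syms)  # round-trips losslessly
```

Both directions are quadratic or worse, so this is for experiments on small inputs only; practical BWTs use suffix sorting.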

  11. The Following User Says Thank You to Shelwien For This Useful Post:

    WinnieW (12th May 2019)

  12. #8
    Member
    Join Date
    Jan 2017
    Location
    Germany
    Posts
    48
    Thanks
    25
    Thanked 10 Times in 7 Posts
    Texts consisting of characters in ASCII format don't benefit from BWT16. I expected that, but 16-bit data formats should benefit somewhat, as your examples show.

  13. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    But why does converting enwik8 from UTF-8 to UTF-16 and doing BWT on 16-bit symbols hurt compression? The symbols are converted back to UTF-8 before arithmetic coding, so the whole process should just give more precise sorting. Why does more precise sorting lead to worse compression of enwik8?

    Eugene:
    IIUC your scheme (conversion to UTF-16, BWT'ing on 16-bit symbols and then converting back to UTF-8 ) should effectively change nothing if the input file contains only ASCII symbols. book1, english.dic and world95.txt should contain only ASCII symbols. Could you test your program on those files?

  14. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    The enwik8 test is actually different - I didn't want to wait for paq results, so I compressed the BWT output with my own BWT postcoder, tuned for book1bwt.
    Also, my postcoder didn't have anything for 16-bit symbols, so I converted the 16-bit BWT back to utf8 first.

    As to why it didn't improve enwik8 compression - well, utf8 characters in enwik mostly occur in links to articles in other languages, and these are special -
    it's probably really better to compress them as utf8.

    Also, I just compared BWT(book1) vs utf8(BWT16(utf16(book1))) and the only difference is the index value at the end.

    There's another problem though - BWT16(utf16(enwik8)) is not fully valid utf16, so it can't be losslessly converted to utf8 and back.
    But the diff would be small enough to not affect the results.

    P.S. BWT16.rar also contains sources for both 8-bit and 16-bit simple BWT/unBWT (BWT is qsort-based).

  15. #11
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Intermediate between bitwise and bytewise, some adaptive compressors are nibble-oriented (4 bits) as 16 symbol distribution is very convenient for SIMD vectorized adaptation - e.g. LZNA, AV1, Dropbox DivANS.
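A toy illustration of the nibble view (the order-0 counting here is a placeholder, not LZNA's, AV1's, or DivANS's actual adaptation rule):

```python
def nibbles(data: bytes):
    """Split each byte into its high and low 4-bit halves, giving a
    16-symbol alphabet - small enough that the whole counter table
    fits in one SIMD register for vectorized adaptation."""
    for b in data:
        yield b >> 4
        yield b & 0x0F

# toy order-0 model: 16 counters, one per nibble symbol
counts = [1] * 16
for s in nibbles(b"hello"):
    counts[s] += 1

print(list(nibbles(b"\x12\xAB")))  # [1, 2, 10, 11]
```

With only 16 symbols, all counters can be updated in a single vector instruction per coded nibble, which is the convenience Jarek is pointing at.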

  16. The Following User Says Thank You to Jarek For This Useful Post:

    Bulat Ziganshin (13th May 2019)

  17. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    16-bit text compression IMHO is a good idea, due to non-latin symbols and dictionary/LZP transformations
    Last edited by Bulat Ziganshin; 13th May 2019 at 13:51.

