
Thread: Intermediate representation for compression

  1. #1
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts

    Intermediate representation for compression

    Hi all – Has anyone worked on using intermediate representations or formats for compression? I'm thinking mostly of text compression of the sort that's needed on the web, which is currently handled by gzip or, much more expensively, by brotli.

    Do you think text could be represented in an intermediate form that would make a tailored gzipper significantly faster and/or lighter on CPU? Another way of saying this is: What's easier than what gzip/DEFLATE currently do to a text file? DEFLATE looks pretty light compared to most codecs, but I wouldn't assume it's the most efficient thing we could be doing. (I realize that we could think of LZ77 as the IR for DEFLATE, which then Huffman codes it.)

    One thing that has struck me is that our current compressors are completely naïve with respect to what they're compressing. zlib knows nothing about HTML, CSS, JS, and SVG. This is absurd, but it's been the status quo for 20 years. It knows nothing about the file it's about to compress, or anything about the sorts of strings it should look for, or ignore (e.g. DOCTYPE). So for example, one light intermediate representation beyond plain HTML would be HTML files that came with metadata that helped the compressor. It might include the longest repeated string, small strings that are worth counting, small repeated strings that should be ignored, etc.
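
    To make that concrete, the hint block could be as simple as something like this (a rough sketch; every name and field here is made up for illustration, not an existing format):

    Code:
    #include <stddef.h>

    /* Hypothetical "compression hints" shipped alongside an HTML file.
       Purely illustrative; not any existing format. */
    typedef struct {
        const char  *longest_repeat;   /* longest repeated substring in the file */
        const char **worth_counting;   /* short strings worth dedicated handling */
        size_t       worth_counting_n;
        const char **ignore;           /* strings the matcher can skip outright */
        size_t       ignore_n;
    } html_compress_hints;

    static const char *boilerplate[] = { "<!DOCTYPE html>", "<meta charset=\"utf-8\">" };

    static const html_compress_hints example = {
        .longest_repeat   = "<div class=\"product-card\">",
        .worth_counting   = NULL,
        .worth_counting_n = 0,
        .ignore           = boilerplate,
        .ignore_n         = 2,
    };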

    A more interesting intermediate representation would be a standardized minified format that would allow for interesting optimizations in a gzipper or brotli compressor. It might not have a huge payoff – it would take more work and thought to know – but if the gzipper could assume various things about the strings, the whitespace, the only possible line-ending character, etc., maybe some big wins would be possible.

    The most interesting intermediate representation would be a binary format for text. This could be for all text, or tailored just for web files. Text is already in a binary format – all formats are in truth binary – but a "text format" usually means an extremely inefficient binary representation, which in the web's case includes ponderous English-language code words with syntax that varies just enough to weaken the compression. That representation is then converted to a slightly more efficient binary form (gzip), sent, and then strangely reconverted back to its extremely inefficient prior form before a machine parses it (just brilliant).

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Quote Originally Posted by SolidComp View Post
    Do you think text could be represented in an intermediate form that would make a tailored gzipper significantly faster and/or lighter on CPU? Another way of saying this is: What's easier than what gzip/DEFLATE currently do to a text file? DEFLATE looks pretty light compared to most codecs, but I wouldn't assume it's the most efficient thing we could be doing. (I realize that we could think of LZ77 as the IR for DEFLATE, which then Huffman codes it.)
    Each LZ77 codec has a different cost model and different capabilities (e.g. repeated offsets), so a single intermediate representation would be inappropriate for most LZ77 codecs.
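
    For example (rough sketch; field names are mine and the ranges are simplified, not taken from any spec):

    Code:
    #include <stdint.h>

    /* DEFLATE-style match: small window, no repeated-offset feature,
       every distance is coded explicitly. */
    typedef struct {
        uint16_t length;    /* 3..258 */
        uint16_t distance;  /* 1..32768 */
    } deflate_match;

    /* zstd/LZMA-style sequence: a literal run plus a match, where small
       offset codes can mean "reuse a recent offset" - something a
       DEFLATE-shaped token simply cannot express. */
    typedef struct {
        uint32_t literal_len;  /* literals copied before the match */
        uint32_t match_len;
        uint32_t offset_code;  /* low values select repeated offsets */
    } lz_sequence;

    An optimal parser also prices these tokens with codec-specific cost models, so a token stream tuned for one codec is generally a poor fit for another.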

    Quote Originally Posted by SolidComp View Post
    One thing that has struck me is that our current compressors are completely naïve with respect to what they're compressing. zlib knows nothing about HTML, CSS, JS, and SVG. This is absurd, but it's been the status quo for 20 years.
    Nope. PAQ-like compressors have had special text modelling for a long time. Besides direct modelling, there are preprocessors with different preprocessing modes for LZ77-, PPM-, CM-, etc.-based compressors. Those preprocessors include drt and xwrt (they can be found at http://mattmahoney.net/dc/text.html ).

    Quote Originally Posted by SolidComp View Post
    The most interesting intermediate representation would be a binary format for text. This could be for all text, or tailored just for web files. Text is already in a binary format – all formats are in truth binary – but a "text format" usually means an extremely inefficient binary representation, which in the web's case includes ponderous English-language code words with syntax that varies just enough to weaken the compression. That representation is then converted to a slightly more efficient binary form (gzip), sent, and then strangely reconverted back to its extremely inefficient prior form before a machine parses it (just brilliant).
    Compressed formats usually have a very big disadvantage compared to uncompressed formats - they have very slow random access, not to mention the time it takes to edit compressed content.

  3. #3
    Member
    Join Date
    Jun 2017
    Location
    Salem Salem, NJ
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Unfortunately I haven't; I always use WinRAR and it's pretty much enough for me!

  4. #4
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Not sure this is what you are asking, but GLZA uses an intermediate format. The intermediate format contains a set of grammar production rules that can be used to recreate the file. Terminal symbols other than 0xFE and 0xFF remain as single bytes, 0xFE and 0xFF require two bytes, and non-terminals are 4 bytes. This could be more efficient, depending on the file. As is, this format can often be less than half the original file size for moderately large text files.
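
    Roughly speaking, writing one symbol of that intermediate format looks something like this (a simplified sketch of the general shape; the real GLZA layout differs in the details):

    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified sketch: terminals 0x00..0xFD are one byte, 0xFE/0xFF are
       escaped to two bytes, and a non-terminal (rule reference) takes four
       bytes. The escape/marker scheme here is illustrative only. */
    static void write_terminal(FILE *out, uint8_t byte) {
        if (byte >= 0xFE) {
            fputc(0xFE, out);   /* escape prefix */
            fputc(byte, out);
        } else {
            fputc(byte, out);
        }
    }

    static void write_nonterminal(FILE *out, uint32_t rule_id) {
        fputc(0xFF, out);                    /* non-terminal marker */
        fputc((rule_id >> 16) & 0xFF, out);  /* 24-bit rule index */
        fputc((rule_id >> 8) & 0xFF, out);
        fputc(rule_id & 0xFF, out);
    }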

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    I think that any LZ+EC library with block-static encoding has an intermediate format; at least, that's true for zlib and zstd. This format employs fixed-width encoding of the found matches prior to the entropy-coding stage.
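
    Something along these lines (names are mine; zlib and zstd each use their own internal record layouts):

    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* Fixed-width record emitted by the match finder for one literal run
       plus one match. The block-static entropy coder later walks the
       array twice: once to gather statistics and build the code tables,
       once to serialize the bitstream. */
    typedef struct {
        uint32_t literal_run;  /* number of literals before this match */
        uint32_t match_len;
        uint32_t offset;
    } lz_record;

    typedef struct {
        lz_record records[4096];  /* one block's worth of matches */
        size_t    count;
    } lz_block;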

  6. #6
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Each LZ77 codec has different cost model and different capabilities (eg repeated offsets), so single intermediate representation would be inappropriate for most LZ77 codecs.


    Nope. PAQ-like compressors had special text modelling for a long time. Other than direct modelling there were preprocessors with different preprocessing modes for LZ77, PPM, CM etc based compressors. Those preprocessors include drt and xwrt (can be found on: http://mattmahoney.net/dc/text.html ).


    Compressed formats usually have a very big disadvantage to uncompressed formats - they have very slow random access, not to mention the time taken for editing compressed content.
    Piotr, sure PAQ-like compressors have good tailored models, but I was just referring to web compression codecs and norms, mostly gzip, and to some extent brotli (which has been a massive disappointment for an all-new codec coming 20+ years after gzip).

    On your point about the disadvantage that compressed formats have, check out Rocco, Caverlee, and Liu (2005) and their XPack compressor: http://faculty.cse.tamu.edu/caverlee...cco05xpack.pdf

    Their format allows one to operate on the compressed data, and quite efficiently. (Their XPack has nothing to do with Eric Biggers' much newer xpack, which we sometimes see in benchmarks. He didn't know the name had been used already by other researchers.)

    XPack is awesome. It was introduced in 2005. That tree fell in the woods and no one heard it, apparently.

    That's what bothers me most about brotli. There's already a bunch of research on compressing XML efficiently, with nice advances like being able to operate on the graph/DOM data structure, and Google acts like it never happened, ignores all of it, and secretly develops a format that can't even replace gzip because it uses so much more CPU in the compression phase. Two hours later, Mozilla adopts it, with hardly any discussion, testing or formal comparison with other codecs. Mozilla could have saved the world massive bandwidth and energy over the last ten years just by adding JPEG-XR support, but no, they ignored it because it was from MS, let us waste all that energy, but then adopt brotli after quick non-public communications with Google. Compression on the web is mired in bullshit and politics, and is wildly anti-meritocratic and unscientific.

    Rant over. Tailored compression for the web has the potential to be awesome, and an intermediate data format (or metadata) might have an interesting role there.

  7. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I don't really see any advantage in XML compacting schemes. The first problem is that many HTML documents aren't valid XML documents, so XPack would fail on them. There was an attempt to push the XHTML standard, i.e. HTML documents that are also valid XML documents. That didn't really pan out and was abandoned.

    Second, I think that parsing HTML and/or XML constitutes only a negligible fraction of document processing time. It should be measurable. For example, take your favourite browser, record all XML-like traffic, measure the CPU time taken by the browser during browsing, then parse all the XML-like files and check how much CPU time that parsing took. Then you'll have two numbers for easy comparison. Firefox is still a single-process browser, so it should be easy to profile.
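
    As a smaller-scale version of that experiment, you could at least time a bare XML parse in isolation, e.g. with expat (rough sketch; error handling mostly omitted):

    Code:
    #include <expat.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Reads one file and measures how long a plain expat parse takes,
       to compare against total page-processing time measured elsewhere. */
    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f) return 1;
        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        fseek(f, 0, SEEK_SET);
        char *buf = malloc((size_t)len);
        fread(buf, 1, (size_t)len, f);
        fclose(f);

        clock_t start = clock();
        XML_Parser p = XML_ParserCreate(NULL);
        XML_Parse(p, buf, (int)len, 1);   /* whole buffer as one final chunk */
        XML_ParserFree(p);
        clock_t end = clock();

        printf("parse time: %.3f ms\n",
               1000.0 * (double)(end - start) / CLOCKS_PER_SEC);
        free(buf);
        return 0;
    }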

    Brotli (just like gzip) can be used to compress everything: HTML, CSS, JavaScript, JSON data, any kind of binary data, and so on.

    Probably most of the time a browser spends on parsing is spent on parsing JavaScript. There is a ton of JavaScript code today. The future is to gradually replace most of the heavy JavaScript code with easy-to-parse bytecode: http://webassembly.org/docs/binary-encoding/

    WASM is said to be 20x faster to parse than raw JavaScript, and that's a serious speed-up. Look at, e.g., my compression program written in Scala.js (with Akka.js included): https://tarsa.github.io/TarsaLZP-demo/ The amount of JS code is quite big (mainly because of Akka.js), about 1.4 MB of minified JavaScript, and on my desktop (i5-4670 @ 3.8 GHz) it takes about 1.5 s to show the initial UI. Speeding that up by 20x would make loading seem instantaneous.

  8. #8
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Firefox is still a single-process browser, so it should be easy to profile.
    Are you sure? I think that just changed this week: https://www.mozilla.org/en-US/firefox/54.0/releasenotes/

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    On my system Firefox 54 still runs as a single process. e10s is selectable in about:config, and Firefox automatically decided not to enable it.

  10. #10
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    On my system Firefox 54 still runs as a single process. e10s is selectable in about:config, and Firefox automatically decided not to enable it.
    According to the MozillaWiki page, e10s can be disabled by default because of "accessibility, add-ons". I had this too; about:support said "0/1 - deactivated". You can force e10s by adding the bool "browser.tabs.remote.force-enable" and setting it to true, but the better way is to find out which of the add-ons blocks e10s: the "Add-on Compatibility Reporter" is a Firefox add-on that shows the e10s blocking state of all add-ons in about:addons. In my case, it was the start-page modifier "Speed Dial".
    http://schnaader.info
    Damn kids. They're all alike.
