Thread: Optimize new markup language for compressibility

  1. #1
    Member SolidComp

    Hi all,

    I'm working on a new markup language. It's most similar to YAML and TOML, but better specified and more powerful.

    Have you thought about ways that textual markup languages could be designed for better compression? What can I do, in the design of a markup language, that would make it easier to compress? I've seen specialized compression codecs for XML, for example, but never anything about designing a markup language for compression.

    One thing I'm going to do is give it both a text and binary form. When humans are reading or writing it in a text editor, obviously that would be in its text format. But I want the saved and transmitted format to be binary, for efficiency.

    One problem is that this markup language will serve the same sorts of use cases as YAML, and so will have arbitrary, unpredictable text. The keys and values could be anything, so I don't see much potential for a static dictionary unless it replaced HTML. All we seem to know in advance are minor things, such as that there will be lots of colons followed by a single space, e.g.:

    Aircraft: F-22
    Price: $12.00

    Binary formats generally report the data type in advance of the value, and also give its length in advance. So if there are fewer than 16 data types, I could use four bits for type, followed by n bits for length, and so forth. Does having data structured this way help with compression? I'm only really familiar with DEFLATE, the LZ77 and Huffman procedures. I don't know much about how the newer codecs work.
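
    To make the type-and-length idea concrete, here is a minimal sketch in Python. The type codes, the 12-bit length field, and the function names are assumptions made up for illustration; nothing here is a defined format.

```python
import struct

# Hypothetical type codes; a real format would define its own table.
TYPE_STRING = 0x1
TYPE_UINT = 0x2


def encode_field(type_code: int, payload: bytes) -> bytes:
    """Pack a 4-bit type and a 12-bit length into two bytes, then the payload."""
    if not 0 <= type_code < 16:
        raise ValueError("type code must fit in 4 bits")
    if len(payload) >= 1 << 12:
        raise ValueError("payload too long for a 12-bit length")
    header = (type_code << 12) | len(payload)
    return struct.pack(">H", header) + payload


def decode_field(buf: bytes, offset: int = 0):
    """Return (type_code, payload, next_offset) for the field at offset."""
    (header,) = struct.unpack_from(">H", buf, offset)
    type_code, length = header >> 12, header & 0x0FFF
    start = offset + 2
    return type_code, buf[start:start + length], start + length


# Two string fields back to back: a key and a value.
record = encode_field(TYPE_STRING, b"Aircraft") + encode_field(TYPE_STRING, b"F-22")
```

    Because the length is known up front, a reader can skip a field without scanning for a delimiter. Whether the headers help or hurt a general-purpose compressor is a separate question that would need measuring on real data.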

    What about encoding integers and decimals as binary integers/decimals instead of text/UTF-8? Do you think the added complexity will be worth it in compression? (e.g. 237 can take up one byte as an unsigned integer vs. three bytes as text in UTF-8.)
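
    A quick Python check of the byte counts in that example. The helper name uint_size is hypothetical, and this only compares raw sizes before any compression is applied.

```python
value = 237

as_text = str(value).encode("utf-8")   # the three ASCII digits "2", "3", "7"
as_binary = value.to_bytes(1, "big")   # a single byte, 0xED


def uint_size(n: int) -> int:
    """Minimum number of whole bytes needed for a non-negative integer."""
    return max(1, (n.bit_length() + 7) // 8)
```

    Note that single-digit values take one byte either way, so the saving depends on how large the numbers in the data typically are.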

    I like the idea of using position in the data as information, usually instead of a key – defining a particular position as a particular key and type – but that seems like it would only be possible for the initial metadata. Like in HTML, instead of this:

    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="UTF-8">
    ....
    ....
    </head>

    I would do this:

    HTML 5.2
    UTF-8
    en

    Those first few lines would be preassigned for HTML version, character encoding, and language. There would be no need for the keys or tags, since their position carries that information. You could only do this for maybe the first ten lines, then you'd start to have arbitrary content. I'd also make it binary, so there would be maybe four bits for HTML version, four bits for encoding scheme, and maybe ten bits for language.

    Anyway, I'm just brainstorming here and wondering about ways to optimize a markup language for compression. I'm very interested in whether the binary practice of having the type and length in advance of each value would help.
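
    The positional header with 4 + 4 + 10 bits could be sketched like this in Python. The lookup tables and their index values are invented for the example; a real format would need fixed, published tables.

```python
# Hypothetical lookup tables; the indices are made up for this sketch.
VERSIONS = ["HTML 5.0", "HTML 5.1", "HTML 5.2"]
ENCODINGS = ["UTF-8", "UTF-16"]
LANGUAGES = ["en", "de", "fr"]


def pack_header(version: str, encoding: str, language: str) -> bytes:
    """Pack 4 + 4 + 10 bits into 3 bytes (the low 6 bits are padding)."""
    v = VERSIONS.index(version)
    e = ENCODINGS.index(encoding)
    l = LANGUAGES.index(language)
    assert v < 16 and e < 16 and l < 1024
    bits = (v << 20) | (e << 16) | (l << 6)
    return bits.to_bytes(3, "big")


def unpack_header(data: bytes):
    """Recover the three fields purely from their bit positions."""
    bits = int.from_bytes(data[:3], "big")
    v = (bits >> 20) & 0xF
    e = (bits >> 16) & 0xF
    l = (bits >> 6) & 0x3FF
    return VERSIONS[v], ENCODINGS[e], LANGUAGES[l]


header = pack_header("HTML 5.2", "UTF-8", "en")
text_form = "HTML 5.2\nUTF-8\nen\n".encode("utf-8")
```

    Here the 18-byte text form shrinks to 3 bytes, at the cost of both sides having to agree on the tables in advance.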

    Thanks!
    Last edited by SolidComp; 31st August 2018 at 05:56.

  2. #2
    Administrator Shelwien
    For LZ compression:
    - avoid terminator tags that differ from the opening tag at the start, like <html>,</html>.
    For lzma, syntax like <html/> would be better (it fits match-literal-rep coding), and
    /<html> or <html>/ even better;
    - use shorter tags, preferably of the same power-of-2 length (this would improve encoding of match lengths);
    - if a binary format is considered anyway, it could make sense to store markup separately from text
    (i.e. "<b>hello</b>, <i>world</i> !!!" ->
    text part: "hello, world !!!"
    tag part: "b5;2i7;")
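
    The last point, storing markup separately from text, can be sketched in Python as below. This is only an illustration under assumptions: it handles flat, non-nested <tag>text</tag> pairs, and it records each tag as a (name, start, length) tuple rather than reproducing the compact "b5;2i7;" notation above. The function names are hypothetical.

```python
import re

# Matches one flat element such as <b>hello</b>; nesting is not handled.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.S)


def split_markup(src: str):
    """Split markup into (text, tags), where tags holds (name, start, length)
    positions that refer to the text stream."""
    text_parts, tags = [], []
    pos = 0       # read position in src
    out_len = 0   # length of the text stream emitted so far
    for m in TAG_RE.finditer(src):
        plain = src[pos:m.start()]
        text_parts.append(plain)
        out_len += len(plain)
        inner = m.group(2)
        tags.append((m.group(1), out_len, len(inner)))
        text_parts.append(inner)
        out_len += len(inner)
        pos = m.end()
    text_parts.append(src[pos:])
    return "".join(text_parts), tags


def join_markup(text: str, tags) -> str:
    """Inverse of split_markup: reinsert the tags at their recorded positions."""
    out, pos = [], 0
    for name, start, length in tags:
        out.append(text[pos:start])
        out.append(f"<{name}>{text[start:start + length]}</{name}>")
        pos = start + length
    out.append(text[pos:])
    return "".join(out)


source = "<b>hello</b>, <i>world</i> !!!"
text, tags = split_markup(source)
```

    The two streams can then be fed to a compressor separately, so the text model is not disturbed by tag syntax.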

    For CM it's the same, but the choice of symbols for tags matters too:
    you have to be careful not to mess up frequent contexts (e.g. a tag "<the>" would force a high probability of ">" after "the"),
    and it should be a good idea to pad tags with spaces on both sides.

    But actually, if binary format is considered anyway, the right choice should be making a custom CM model for markup coding.
    (Only markup, text part can remain as plaintext).
    Such a model would be able to make use of markup structure (tag nesting etc), and also it would be possible to use corresponding text as context.

  4. #3
    Member SolidComp
    I think I need to learn context modeling then. Where can I learn it? Is there an intro on the web or in the forum? Wait, is CM context modeling or mixing? I see this paper from 1992 on context modeling: http://citeseerx.ist.psu.edu/viewdoc...10.1.1.49.1427

    But Wikipedia speaks only of context mixing.

    Then I found one of Matt's papers, on Adaptive Weighing of Context Models...: https://cs.fit.edu/~mmahoney/compression/cs200516.pdf

    Is there a seminal paper I should read?

    Next, there won't be tags per se in the text markup. It's not an SGML- or XML-based language. It will be more like YAML and TOML, but with objects like icons, images, and emoji. The color of text will also be meaningful data. Anyway, here's a sample of YAML:

    ---
    receipt: Oz-Ware Purchase Invoice
    date: 2012-08-06
    customer:
        first_name: Dorothy
        family_name: Gale

    items:
        - part_no: A4786
          descrip: Water Bucket (Filled)
          price: 1.47
          quantity: 4

        - part_no: E1628
          descrip: High Heeled "Ruby" Slippers
          size: 8
          price: 133.7
          quantity: 1


    It will use tabs instead of spaces for indentation (YAML forbids tabs), and I might be agnostic on whether values should be vertically aligned the way they are in the original of the example above (the alignment got broken by the quote feature – see the original here: https://en.wikipedia.org/wiki/YAML#Example). If they were aligned, that would create a lot of variable-length runs of spaces, which would be annoying for compression.

    Prices will probably come up a lot if the format is used for ecommerce and such. What are some of the best ways to encode decimal values? Floats seem like overkill, and inaccurate. BCD seems okay. Is there anything better? Ultimately though, this is just a markup language, text and such, and the applications and DBs that consume it will decide how to handle data types.

  5. #4
    Administrator Shelwien
    > Is there a seminal paper I should read?

    I meant mixing. In this case it would mean either something paq-like (see nesting model etc in paq), or PPM+SSE.
    http://mattmahoney.net/dc/dce.html#Section_43

    > Next, there won't be tags per se in the text markup.

    The question is, are they predictable from context?
    It's one thing if in most cases there'd be matches of tag+text.
    But if you can't predict a tag from previous symbols, and you can't predict text from a tag,
    then compressing it with normal LZ or CM won't really be a good idea.

    There's a good example with sorted dictionaries - simple context models (PPM and BWT, for example) fail on that kind of data.
    They try to predict a word's prefix from the previous word's suffix, while there's no dependency.
    I think the same applies to your example.

    > What are some of the best ways to encode decimal values?

    Integer + decimal exponent?
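
    As a rough sketch of what "integer + decimal exponent" could look like, here is a Python version built on the standard decimal module. The function names are hypothetical, and special values such as NaN or Infinity are not handled.

```python
from decimal import Decimal


def to_scaled(text: str):
    """Return (integer, exponent) so that value == integer * 10**exponent."""
    sign, digits, exponent = Decimal(text).as_tuple()
    integer = int("".join(map(str, digits)))
    return (-integer if sign else integer), exponent


def from_scaled(integer: int, exponent: int) -> Decimal:
    """Rebuild the exact decimal value, including trailing zeros."""
    return Decimal(integer).scaleb(exponent)


price = to_scaled("133.7")
```

    Unlike a binary float, this keeps values such as 12.00 exact, trailing zeros included, and the integer part can be stored with whatever integer encoding the format already uses.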
