I'm working on a new markup language. It's most similar to YAML and TOML, but better specified and more powerful.
Have you thought about ways that textual markup languages could be designed for better compression? What can I do, in the design of a markup language, that would make it easier to compress? I've seen specialized compression codecs for XML, for example, but never anything about designing a markup language for compression.
One thing I'm going to do is give it both a text and binary form. When humans are reading or writing it in a text editor, obviously that would be in its text format. But I want the saved and transmitted format to be binary, for efficiency.
One problem is that this markup language will serve the same sorts of use cases as YAML, and so will have arbitrary, unpredictable text. The keys and values could be anything, so I don't see much potential for a static dictionary unless it replaced HTML. All we seem to know in advance are minor things, such as that there will be lots of colons followed by a single space.
Binary formats generally report the data type in advance of the value, and also give its length in advance. So if there are fewer than 16 data types, I could use four bits for type, followed by n bits for length, and so forth. Does having data structured this way help with compression? I'm only really familiar with DEFLATE, the LZ77 and Huffman procedures. I don't know much about how the newer codecs work.
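To make the type-and-length idea concrete, here is a rough Python sketch of what I have in mind. The type codes, the 4-bit length field, and the function names are all made up for illustration, and it only handles non-negative integers and short strings:

```python
# Hypothetical type codes; a real format would define its own table.
TYPE_INT = 0x1
TYPE_STR = 0x2


def encode_value(value):
    """Encode one value as a 4-bit type, a 4-bit length, then the payload."""
    if isinstance(value, int):
        if value < 0:
            raise ValueError("this sketch only handles non-negative integers")
        # Use the smallest number of bytes that holds the integer.
        payload = value.to_bytes(max(1, (value.bit_length() + 7) // 8), "big")
        type_code = TYPE_INT
    elif isinstance(value, str):
        payload = value.encode("utf-8")
        type_code = TYPE_STR
    else:
        raise TypeError("unsupported type in this sketch")
    if len(payload) > 15:
        raise ValueError("this sketch only handles payloads up to 15 bytes")
    header = (type_code << 4) | len(payload)  # type in high nibble, length in low
    return bytes([header]) + payload


def decode_value(data, offset=0):
    """Decode one value at offset; return (value, offset of the next value)."""
    header = data[offset]
    type_code, length = header >> 4, header & 0x0F
    payload = data[offset + 1 : offset + 1 + length]
    if type_code == TYPE_INT:
        value = int.from_bytes(payload, "big")
    elif type_code == TYPE_STR:
        value = payload.decode("utf-8")
    else:
        raise ValueError("unknown type code")
    return value, offset + 1 + length
```

With this layout, `encode_value(237)` is two bytes (one header byte, one payload byte) and `encode_value("hi")` is three. Whether this structure actually helps a general-purpose compressor is exactly what I'm unsure about.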
What about encoding integers and decimals as binary integers/decimals instead of text/UTF-8? Do you think the added complexity will be worth it in compression? (e.g. 237 can take up one byte as an unsigned integer vs. three bytes as text in UTF-8.)
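As a quick illustration of the sizes involved before any compression, here is a small Python sketch. The helper names are mine, it only covers non-negative integers, and the binary size ignores any type or length prefix the format would add:

```python
def text_size(n):
    """Bytes needed to store a non-negative integer as decimal text in UTF-8."""
    return len(str(n).encode("utf-8"))


def binary_size(n):
    """Bytes needed to store a non-negative integer as a minimal unsigned integer."""
    return max(1, (n.bit_length() + 7) // 8)


for n in (7, 237, 65535):
    print(n, text_size(n), binary_size(n))
```

For 237 this gives three bytes as text against one byte as binary, but for a single-digit number like 7 the two are the same size, so the savings depend on what kinds of numbers the documents contain.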
I like the idea of using position in the data as information, usually instead of a key: a particular position is defined as a particular key and type. But that seems like it would only be possible for the initial metadata. Like in HTML, instead of this:
    <!DOCTYPE html>
I would do this:
    HTML 5.2
Those first few lines would be preassigned for HTML version, character encoding, and language. There would be no need for the keys or tags, since their position carries that information. You could only do this for maybe the first ten lines, then you'd start to have arbitrary content. I'd also make it binary, so there would be maybe four bits for HTML version, four bits for encoding scheme, and maybe ten bits for language. Anyway, I'm just brainstorming here and wondering about ways to optimize a markup language for compression. I'm very interested in whether the binary practice of having the type and length in advance of each value would help.
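Here is a rough Python sketch of the positional binary header I'm describing: four bits for version, four for encoding scheme, ten for language, packed into three bytes with six bits of padding. The function names and the numeric codes for each field are hypothetical; a real format would need registries mapping codes to versions, encodings, and languages.

```python
def pack_header(version, encoding, language):
    """Pack three positional fields (4 + 4 + 10 bits) into 3 bytes."""
    if not (0 <= version < 16 and 0 <= encoding < 16 and 0 <= language < 1024):
        raise ValueError("field out of range for its bit width")
    bits = (version << 14) | (encoding << 10) | language  # 18 bits total
    return (bits << 6).to_bytes(3, "big")  # left-align, 6 padding bits at the end


def unpack_header(data):
    """Recover (version, encoding, language) from the first 3 bytes."""
    bits = int.from_bytes(data[:3], "big") >> 6  # drop the padding bits
    return (bits >> 14) & 0xF, (bits >> 10) & 0xF, bits & 0x3FF
```

For example, `unpack_header(pack_header(5, 1, 42))` returns `(5, 1, 42)`. No keys or tags are stored; the meaning of each field comes entirely from its position.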