What I had in mind when I wrote about modelling XML was properly parsing XML tokens, i.e. tag names and attribute names, and then keeping a stack of open tags, recent sibling tags and recent attributes in the current tag.
The primary data structure would be a stack of stacks of hashes (one hash per token, because we're hashing contexts anyway). Let's call it 'tag_levels' and give it the (pseudocode) type Stack<Stack<Hash>>. One level contains the sibling tags (i.e. tags that share a parent tag) that were already seen. There are as many levels as the current nesting level. An additional structure is a stack of attributes, 'attr_levels', of type Stack<Hash>. The decision tree for each input byte would be as follows:
1. If we aren't in an XML tag then use the generic models.
2. Otherwise use only the XML models.
If we are in XML mode then:
1. If an opening tag name was finished (i.e. we read a space or closing bracket after the tag name) then add the hash of the tag name to the current stack level: tag_levels.top.push(hash(tag_name))
2. If the opening tag was finished together with its attributes (i.e. we read the closing bracket) then reset the attributes stack: attr_levels.clear(). We also go one nesting level deeper, so a new empty level is opened for the children: tag_levels.push(new Stack)
3. If a closing tag name was started then remove the current stack level: tag_levels.remove_top()
4. If an attribute name was finished then add it to the attributes stack: attr_levels.push(hash(attr_name))
5. If we're inside a closing tag name then we strongly predict the same name as the corresponding opening tag name - we have its hash at the top of the current stack level.
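A minimal sketch of this bookkeeping in Python (paq8* itself is C++; the hash function, class and method names here are made up for illustration, and the parsing events are assumed to be detected elsewhere):

```python
def hash_name(name):
    # Stand-in for whatever context hash the real model would use.
    h = 0
    for b in name.encode():
        h = (h * 271 + b) & 0xFFFFFFFF
    return h

class XmlState:
    """The two stacks described above; the event methods map to steps 1-5."""

    def __init__(self):
        self.tag_levels = [[]]  # Stack<Stack<Hash>>: one inner list per nesting level
        self.attr_levels = []   # Stack<Hash>: attributes seen in the current tag

    def open_tag_name_finished(self, tag_name):
        # Step 1: space or '>' read after an opening tag name.
        self.tag_levels[-1].append(hash_name(tag_name))

    def open_tag_finished(self):
        # Step 2: '>' read; reset attributes and open a level for the children.
        self.attr_levels.clear()
        self.tag_levels.append([])

    def closing_tag_started(self):
        # Step 3: '</' read; drop the children's level.
        self.tag_levels.pop()

    def attr_name_finished(self, attr_name):
        # Step 4: an attribute name ended.
        self.attr_levels.append(hash_name(attr_name))

    def expected_closing_hash(self):
        # Step 5: hash of the tag name we strongly predict inside a closing tag.
        return self.tag_levels[-1][-1]

# Driving it by hand for "<a><b>...</b>":
state = XmlState()
state.open_tag_name_finished("a"); state.open_tag_finished()
state.open_tag_name_finished("b"); state.open_tag_finished()
state.closing_tag_started()
assert state.expected_closing_hash() == hash_name("b")  # predict "b" in "</b>"
```

Note that step 2 leaves the parent's hash one level below the freshly pushed empty level, which is exactly what step 3 exposes again when the closing tag starts.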
The word model uses previous words to predict the next word. An analogous thing can be done for XML tags. We can use previous sibling tag names and also ancestor tag names as contexts:
1. Names of siblings, i.e. the top 1, 2 or 3 names from tag_levels.top
2. Names of ancestors, i.e. the top name from the 1, 2 or 3 previous levels
3. A combination of the above
4. For modelling attribute names we can use the previous attributes in the tag as context.
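These four context families could be sketched like this (the `combine` fold and its constants are arbitrary placeholders for a real context hash):

```python
def combine(hashes):
    # Fold several token hashes into one context hash (constants arbitrary).
    h = 0x9E3779B9
    for x in hashes:
        h = (h * 0x01000193 ^ x) & 0xFFFFFFFF
    return h

def xml_contexts(tag_levels, attr_levels):
    """Candidate context hashes built from the structures described above."""
    siblings = tag_levels[-1]                        # tags already seen on this level
    ancestors = [lvl[-1] for lvl in tag_levels[:-1] if lvl]
    ctxs = []
    for n in (1, 2, 3):                              # 1. last 1..3 sibling names
        if len(siblings) >= n:
            ctxs.append(combine(siblings[-n:]))
    for n in (1, 2, 3):                              # 2. top names of 1..3 enclosing levels
        if len(ancestors) >= n:
            ctxs.append(combine(ancestors[-n:]))
    if siblings and ancestors:                       # 3. combination of both
        ctxs.append(combine([siblings[-1], ancestors[-1]]))
    if attr_levels:                                  # 4. previous attributes in this tag
        ctxs.append(combine(attr_levels))
    return ctxs
```

Each returned hash would feed one context model, exactly as the word model feeds order-1, order-2, etc. word contexts.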
We can also have a model that predicts when parent tags are closed based on the child tag name. E.g. we may see in some XML document that after tag "d" was closed, the next tag was also always a closing tag.
If you want to conserve RAM then you can e.g. disable all non-XML models while inside an XML tag. You can also use a strategy similar to the word model - instead of adding contexts on byte boundaries you can add them on token boundaries. In the case of the word model a token is a single word; in the case of the XML model a token is a tag or attribute name.
Paq8* needs RAM for storing many contexts for many models. Storing everything (or anything) in XML would require far too much space, and parsing the intermediate XML format would require too much processing power. I see your point: defining and parsing XML is universal, but for Paq8* I think it would be impractical.
In the case of column-based data we can use different names for the tags representing columns and then use the model I've described above with an example, the one that predicts the moment of closing the parent tag based on the child tag name. Suppose we have a CSV file with three columns.
Like when you have a CSV, and *if* it is successfully encoded to XML, *then* you *know* that it is a valid CSV, so it *always* has a specific number of columns and each row *always* ends with the same line ending.
Something like that?
And we translate it to XML, with one tag per column (col1, col2, col3) nested inside a tag per row. Then we see that after the closing tag 'col3' there's always another closing tag, so we can predict it with high probability and we don't need to encode any extra information.
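A sketch of that translation and the resulting regularity (the `csv_to_xml` encoding and the tag names `table`/`row`/`colN` are hypothetical, just one way the CSV might be mapped to XML):

```python
import re

def csv_to_xml(text):
    # Hypothetical encoding: one <row> per CSV line, one <colN> per field.
    rows = []
    for line in text.strip().splitlines():
        cells = "".join(
            f"<col{i + 1}>{field}</col{i + 1}>"
            for i, field in enumerate(line.split(","))
        )
        rows.append(f"<row>{cells}</row>")
    return "<table>" + "".join(rows) + "</table>"

xml = csv_to_xml("a,b,c\nd,e,f")
tokens = re.findall(r"</?\w+>", xml)
after_col3 = [nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == "</col3>"]
# after_col3 == ["</row>", "</row>"]: the token after </col3> is always a
# closing tag, so the model can predict it without extra encoded information.
```

The prediction works because a valid CSV fixes the column count, so the last column's closing tag deterministically precedes the row's closing tag.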
I don't know if all of the above sounds sensible, but I'm going to implement and test it anyway. It will have to wait some time though, because I'm making a CM compressor from scratch.