Results 1 to 14 of 14

Thread: Fastest and smallest enwik8 skeleton (de)compression (for fun)

  1. #1
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts

    Fastest and smallest enwik8 skeleton (de)compression (for fun)

    I've decided to post the result of an experiment I made a couple of years ago. It is not useful at all (at this stage); it is only a proof of concept.
    I intended to experiment with data compression and see what manual tuning could do when compressing one of the better-known test files: enwik8.
    It does only one thing, and it does that extremely well. (I might be biased.)
    It decompresses the xml skeleton of enwik8 along with some selected data. That is:

    Code:
    The complete header (siteinfo)
    All the attribute names
    The content id
    The revision id
    The timestamp
    The indication of a redirection
    It uses a generative model, an order-0 semi-static probability model (no joke) and a 32-bit range coder I wrote from scratch.
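
    The actual coder is written in assembly; just to illustrate the idea, here is a rough C++ sketch (a simplified illustration, not the actual code) of a semi-static order-0 model driving a classic carry-less 32-bit range coder: the frequencies are counted once, stored in the archive, and never updated while coding.

    Code:
    #include <cstdint>
    #include <vector>

    // Carry-less 32-bit range coder (Subbotin style), encode side only.
    struct RangeEncoder {
        uint32_t low = 0, range = 0xFFFFFFFFu;
        std::vector<uint8_t> out;
        void encode(uint32_t cumFreq, uint32_t freq, uint32_t totFreq) {
            low   += cumFreq * (range /= totFreq);
            range *= freq;
            // renormalization: emit the top byte whenever it is settled
            while ((low ^ (low + range)) < (1u << 24) ||
                   (range < (1u << 16) && ((range = -low & ((1u << 16) - 1)), true)))
                out.push_back(low >> 24), low <<= 8, range <<= 8;
        }
        void flush() { for (int i = 0; i < 4; i++) { out.push_back(low >> 24); low <<= 8; } }
    };

    // Semi-static order-0 model: count the symbol frequencies in a first pass,
    // store the table in the archive, then encode with those fixed probabilities.
    // Note: totFreq must stay below 2^16 for this coder - scale the counts if needed.
    void encodeAll(const std::vector<uint8_t>& data, RangeEncoder& rc) {
        uint32_t freq[256] = {0}, cum[257] = {0};
        for (uint8_t b : data) freq[b]++;                        // first pass: histogram
        for (int i = 0; i < 256; i++) cum[i + 1] = cum[i] + freq[i];
        for (uint8_t b : data)                                   // second pass: fixed stats
            rc.encode(cum[b], freq[b], cum[256]);                // (freq[] goes into the archive header)
        rc.flush();
    }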

    Code:
    Input (archive) file:         54,996 bytes
    Output (enwik8 skeleton):  3,926,359 bytes
    Compression ratio:              1.40 %
    Decompressor (exe):            2,316 bytes (Windows Xp+, 32 bit, i386 assembly, packed by Crinkler)
    Decompression time:            0.281 seconds (on my old pc)
    Regarding the speed and the size of the decompressor, I think it's "tight".
    But regarding the compression ratio, probably nothing (not even cmix) can get even close to it.
    Attached Files
    Last edited by Gotty; 14th March 2018 at 19:11. Reason: semi-static

  2. The Following User Says Thank You to Gotty For This Useful Post:

    xinix (11th March 2018)

  3. #2
    Member
    Join Date
    Nov 2015
    Location
    -
    Posts
    46
    Thanks
    202
    Thanked 10 Times in 9 Posts
    Is this an asymmetric algorithm?
    ==
    CM?

  4. #3
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    It is "very asymmetric". No Context Mixing.

    The compression algorithm is not fully automated; it's half manual. I used the following tools for compression: Total Commander, C#, Excel, a range codec in an assembly DLL callable from C#, and last but not least my eyes to detect patterns.

    The compression algorithm:
    - you open enwik8 with the Total Commander viewer.
    - you realize the recurring pattern: there is an xml header followed by 12347 article "pages", each with a similar xml structure.
    - you write a parser in C# to walk through the xml structure of the article pages, eliminating the content but keeping the xml elements (page, title, (text)id, restrictions, revision, (revision)id, timestamp, contributor, (contributor)username, (contributor)id, (contributor)ip, minor, comment, text), and write the results to a file to further analyse it (a simplified sketch of this extraction step follows after this list).

    A sample of the structure:
    Code:
      <page>
        <title></title>
        <id></id>
        <revision>
          <id></id>
          <timestamp></timestamp>
          <contributor>
            <username></username>
            <id></id>
          </contributor>
          <minor />
          <text xml:space="preserve"></text>
        </revision>
      </page>
    - when you examine the results (your eyes come into play here) you realize that there are 5 binary variables that control the structure.
    - you also realize that there are fixed strings in the structure: for example, the "text" xml element always has the xml:space="preserve" attribute, the sub-elements are always indented by two spaces, and the "minor" tag is always empty and self-closing: <minor />.
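
    My parser was written in C# and it also extracted the selected data (ids, timestamps, etc.); a heavily simplified sketch of just the structure-stripping step, in C++ rather than C#, could look like this (line-based, so only an approximation of what the real parser did):

    Code:
    #include <fstream>
    #include <string>

    // Keep the xml tags of enwik8 but drop the textual content between them.
    // Naive line-based approximation: assumes a tag never spans two lines
    // (true for the skeleton tags) and drops pure article-text lines entirely.
    int main() {
        std::ifstream in("enwik8");
        std::ofstream out("enwik8_skeleton.txt");
        std::string line;
        while (std::getline(in, line)) {
            size_t open = line.find('<');
            if (open == std::string::npos) continue;       // article text only: drop
            size_t gt = line.find('>', open);
            if (gt == std::string::npos) continue;
            size_t close = line.rfind("</");
            if (close != std::string::npos && close > gt)
                out << line.substr(0, gt + 1)              // "  <tag ...>"
                    << line.substr(close) << '\n';         // "</tag>"
            else
                out << line.substr(0, gt + 1) << '\n';     // opening / empty / closing tag
        }
    }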

    The variables that control the structure of the article page header:
    - is there a restrictions element present,
    - is either a (contributor)username+(contributor)id element-pair or a (contributor)ip element present,
    - is there a minor element present,
    - is there a comment element present,
    - is the text element empty (<text xml:space="preserve" />).

    The structure of an article page header thus depends only on those 5 binary variables. By concatenating the variables (for example, the structure of the above sample can be represented by '00100'), counting the frequencies of these strings, calculating their probabilities ( p('00100') = 0.07 ) and then the total entropy, you'll find that you can easily compress the xml structure of the article page headers to 3855 bytes. I no longer have the pure original structure and can't remember its size, but I remember that this is already a really good compression ratio. (It's not the best possible, though - but we will keep it simple.)
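
    For reference, the entropy calculation itself is just the following (the counts here are made-up placeholders - the real frequencies are in the attached worksheet):

    Code:
    #include <cmath>
    #include <cstdio>
    #include <map>
    #include <string>

    // Order-0 entropy of the 5-bit structure strings over the article pages.
    // The counts are placeholders; the real ones are in the attached worksheet.
    int main() {
        std::map<std::string, long> count = { {"00100", 864}, {"01000", 512}, {"11001", 73} };
        long total = 0;
        for (const auto& kv : count) total += kv.second;
        double bits = 0;
        for (const auto& kv : count) {
            double p = double(kv.second) / total;   // e.g. p("00100") = 0.07
            bits += kv.second * -std::log2(p);      // ideal code length for all its occurrences
        }
        std::printf("%.0f bits = %.0f bytes\n", bits, bits / 8);   // ~3855 bytes with the real counts
    }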

    The result of the structural analysis is attached. It has two worksheets.

    [to be continued if anyone finds it useful so far]
    Attached Files

  5. The Following User Says Thank You to Gotty For This Useful Post:

    xinix (12th March 2018)

  6. #4
    Member
    Join Date
    Nov 2015
    Location
    -
    Posts
    46
    Thanks
    202
    Thanked 10 Times in 9 Posts

    Very cool!
    Thanks for the detailed analysis!
    It leads to thoughts. It will be useful!

  7. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I think proper XML modelling would help a lot in PAQ's case. Structured data is ubiquitous, so having support for multiple types of structures would improve compression. XML is a pretty generic format and it shouldn't be hard to define bijections between XML and other formats (like JSON, CSV, etc). Instead of modelling multiple types of data directly, PAQ could have a converter that converts many types of data into XML and then model only XML directly. In the case of enwik8/enwik9 we have XML already, but we can go further and convert the formatting from Wikicode to XML too. What do you think?

  8. #6
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Some time ago Shelwien started to create some kind of XML parser for the enwik8/9 xml - his program could rip article titles/definitions into a separate file, which helps enwik8 compression.
    Maybe he has made some progress since then.

  9. #7
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Structured data is ubiquitous so having support for multiple types of structures would improve compression.
    I completely agree. If the compressor/decompressor has a better understanding of the structure of the input, instead of reading the input solely as a stream of bytes, then compression can be significantly improved. Generally, all the models in paq8* (and lately Márcio's text-related improvements) show that.
    Preprocessing and transforming the input (Precomp, paq8* and now also Fairytale) is an important step to get better compression.
    Quote Originally Posted by Piotr Tarsa View Post
    XML is a pretty generic format and it shouldn't be hard to define bijections between XML and other formats (like JSON, CSV, etc). Instead of modelling multiple types of data directly, PAQ could have a converter that converts many types of data into XML and then directly model XML only.
    Paq8* needs RAM for storing many contexts for many models. Storing everything (or anything) in xml would require way too much space, and parsing the intermediate xml format would require too much processing power. I see your point: defining and parsing xml is universal, but for Paq8* I think it would be impractical.

    It is just a coincidence that I carried out a proof of concept on xml data. What I have in mind reaches further than that. I'll share my thoughts and results sometime in the near future. It's still an embryo - it hasn't been born yet.
    In short: it has to do with contexts and hierarchical decomposition; the probability model is semi-static; slower compression with faster decompression.

  10. The Following User Says Thank You to Gotty For This Useful Post:

    xinix (13th March 2018)

  11. #8
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    On second thought: do you mean that paq8* should have a "model" for parsing/processing the special xml container? That could work.
    I use "model" in quotes to differentiate it from the current models. Now all models emit probabilities for each input bit. And clearly we need something that can handle some parts of the input as "known", without predicting. Like when you have a csv, and *if* it is successfully encoded to xml, *then* you *know* that it is a valid csv, so it *always* has a specific number of columns and each row *always* ends with the same line ending.
    Something like that?
    Last edited by Gotty; 13th March 2018 at 22:51.

  12. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    What I had in mind when I wrote about modelling XML is properly parsing XML tokens, i.e. tag names and attribute names, and then keeping a stack of open tags, recent sibling tags and recent attributes in the current tag.

    The primary data structure would be a stack of stacks of hashes (one hash per token, because we're hashing contexts anyway). Let's call it 'tag_levels' and give it a type (in pseudocode) of Stack<Stack<Hash>>. One level contains the sibling tags (i.e. tags that share a parent tag) that were already seen. There are as many levels as the current nesting level. An additional structure is a stack of attributes, 'attr_levels', of type Stack<Hash>. The decision tree for each input byte would be as follows (a rough sketch of this bookkeeping is given after the lists below):
    1. If we aren't in an XML tag then use the generic models.
    2. Otherwise use only the XML models.

    If we are in the XML model then:
    1. If an opening tag name was finished (i.e. we read a space or closing bracket after the tag name) then add the hash of the tag name to the current stack: tag_levels.top.push(hash(tag_name))
    2. If the opening tag was finished together with its attributes (i.e. we read the closing bracket) then reset the attributes stack: attr_levels.clear()
    3. If a closing tag name was started then remove the current stack level: tag_levels.remove_top()
    4. If an attribute name was finished then add it to the attributes stack: attr_levels.push(hash(attr_name))
    5. If we're inside a closing tag name then we strongly predict the same name as the corresponding opening tag name - we have its hash at the top of the current stack.
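
    In rough C++ the bookkeeping could look like the sketch below (just a sketch, not an implementation; it also makes explicit one detail glossed over above, namely that a new sibling level is pushed when an opening tag name ends):

    Code:
    #include <cstdint>
    #include <string>
    #include <vector>

    // Sketch of the tag_levels / attr_levels bookkeeping described above.
    using Hash = uint64_t;

    static Hash hashToken(const std::string& s) {   // any context hash will do; FNV-1a here
        Hash h = 14695981039346656037ull;
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ull; }
        return h;
    }

    struct XmlState {
        std::vector<std::vector<Hash>> tag_levels{{}};  // stack of sibling stacks
        std::vector<Hash> attr_levels;                  // attributes of the current tag

        // 1. opening tag name finished: remember it as a sibling, descend one level
        void openTagName(const std::string& name) {
            tag_levels.back().push_back(hashToken(name));
            tag_levels.emplace_back();
        }
        // 2. opening tag finished (closing bracket read): reset the attribute stack
        void openTagEnd() { attr_levels.clear(); }
        // 3. closing tag started: leave the current nesting level
        void closeTagStart() { if (tag_levels.size() > 1) tag_levels.pop_back(); }
        // 4. attribute name finished
        void attrName(const std::string& name) { attr_levels.push_back(hashToken(name)); }
        // 5. inside a closing tag name: strongly predict the matching opening tag,
        //    whose hash is now on top of the current sibling stack
        Hash expectedClosingTag() const { return tag_levels.back().back(); }
    };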

    A word model uses previous words to predict the next word. An analogous thing can be done for XML tags. We can use previous sibling tag names and also ancestor tag names as contexts.

    Useful contexts:
    1. Names of siblings, i.e. the top 1, top 2, top 3 etc. names from tag_levels.top
    2. Names of ancestors, i.e. the top name from the 1, 2 or 3 previous levels
    3. A combination of the above
    4. For modelling attribute names we can use the previous attributes in the tag as context.

    Also, we can have a model that predicts when parent tags are closed based on the child tag name, e.g. in the case of the following XML:
    Code:
    <a>
      <b>
        <c></c>
        <d></d>
      </b>
      <b>
        <c></c>
        <d></d>
      </b>
    </a>
    We see that after tag "d" was closed, the next tag was always a closing tag as well.

    Quote Originally Posted by Gotty View Post
    Paq8* needs RAM for storing many contexts for many models. Storing everything (or anything) in xml would require way too much space, and parsing the intermediate xml format would require too much processing power. I see your point: defining and parsing xml is universal, but for Paq8* I think it would be impractical.
    If you want to conserve RAM then you can e.g. disable all non-XML models when you're inside an XML tag. Also, you can use a strategy similar to the word model - instead of adding contexts on byte boundaries you can add them on token boundaries. In the case of the word model a token is a single word; in the case of the XML model a token is a tag or attribute name.

    Quote Originally Posted by Gotty View Post
    Like when you have a csv, and *if* it is successfully encoded to xml, *then* you *know* that it is a valid csv, so it *always* has a specific number of columns and each row *always* ends with the same line ending.
    Something like that?
    In the case of column-based data we can use different names for the tags representing the columns and then use the model I described above with an example, the one that predicts the moment of closing the parent tag based on the child tag name. So if we have a CSV like:
    Code:
    aa,ab,ac
    ba,bb,bc
    ca,cb,cc
    And we translate it to XML:
    Code:
    <root>
      <row>
        <col1>aa</col1><col2>ab</col2><col3>ac</col3>
      </row>
      <row>
        <col1>ba</col1><col2>bb</col2><col3>bc</col3>
      </row>
      <row>
        <col1>ca</col1><col2>cb</col2><col3>cc</col3>
      </row>
    </root>
    Then we see that after the closing tag 'col3' there's always another closing tag, so we can predict it with high probability and we don't need to encode any extra information.
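
    The transform itself is trivial; a toy sketch (no quoting or escaping, just to show the shape of the output above):

    Code:
    #include <iostream>
    #include <sstream>
    #include <string>

    // Toy CSV -> XML transform (header-less CSV, no quoting/escaping).
    int main() {
        std::string csv = "aa,ab,ac\nba,bb,bc\nca,cb,cc";
        std::istringstream rows(csv);
        std::string row, cell;
        std::cout << "<root>\n";
        while (std::getline(rows, row)) {
            std::istringstream cells(row);
            int col = 1;
            std::cout << "  <row>\n    ";
            while (std::getline(cells, cell, ',')) {
                std::cout << "<col" << col << ">" << cell << "</col" << col << ">";
                ++col;
            }
            std::cout << "\n  </row>\n";
        }
        std::cout << "</root>\n";
    }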

    I don't know if all of the above sounds sensible, but I'm going to implement and test it anyway. It will have to wait some time though, because I'm making a CM compressor from scratch.

  13. #10
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    paq8 already has an XML model that does that, Piotr.

  14. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Then it should compress the enwik8 skeleton to tiny sizes. I haven't tested it, though.

  15. #12
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    520
    Thanks
    196
    Thanked 744 Times in 301 Posts
    It's nearly useless on enwik, and it's easy to see why. The XML structure there is very well defined, so even without much sophistication it's very easy to predict (case in point: this thread).
    On HTML files, on the other hand, where the structure is a lot more complex, it gave about a 1% gain.

  16. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    HTML is somewhat problematic because there's no rule that you have to close all tags, and that would damage contexts and reduce compression.

    XML as a transform target would mean that you don't need to e.g. model Wikicode directly in paq8, but you can have a Wikicode -> XML converter and then only XML modelling in paq8.

  17. #14
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts

    Compressing the revision identifiers

    As a response to https://encode.ru/threads/3082-how-t...numbers-better, let's talk about how we can compress the revision identifiers in the enwik8 skeleton.
    There is an identifier for each article. The first one is here:

    Code:
    <page>
        <title></title>
        <id>1</id>
        <revision>
          <id>32899315</id>
          <timestamp>2005-12-27T18:46:47Z</timestamp>
          <contributor>
            <username></username>
            <id></id>
          </contributor>
          <text xml:space="preserve">#REDIRECT [[]]</text>
        </revision>
      </page>
    There are 12347 articles, so there are 12347 unique integers to compress. Let's examine these integers and notice that there are two sub-sequences (see attachment):
    An *ordered* one with small gaps: 15898943..15913059 (count: 3220). The gaps (deltas) are in the range between 1 and 51, where delta=1 is very frequent and delta=51 is very rare. We will go with delta encoding, then.
    And an *unordered* sub-sequence with larger gaps: 17734157..42164216 (count: 9127). The gaps are large - we may need to find something better than delta encoding.
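
    To make the two sub-sequences concrete, this is all the preparation that is needed (a sketch; ids holds the 12347 revision ids in enwik8 order):

    Code:
    #include <cstdint>
    #include <map>
    #include <vector>

    // Split the revision ids into the two sub-sequences and build the delta
    // histogram of the first (ordered) one.
    void analyze(const std::vector<uint32_t>& ids) {
        std::vector<uint32_t> seq1, seq2;               // 3220 and 9127 elements
        for (uint32_t id : ids)
            (id <= 15913059u ? seq1 : seq2).push_back(id);

        std::map<uint32_t, uint32_t> deltaHist;         // 42 distinct deltas: 1..51
        uint32_t last = 15898942u;                      // 1 less than the minimum value
        for (uint32_t id : seq1) { deltaHist[id - last]++; last = id; }
        // seq2 goes through the key/value sorting and recursive range splitting below
    }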

    Encoding:

    1) Create an array of the 9127 values of the second sequence ("values").
    Create another array of integers from 0 to 9126 ("keys"). They indicate the original positions of the values of the second sequence in enwik8.
    Sort the two arrays together by values in ascending order, so that the keys move together with the values.
    The values are now sorted, and the keys can be used to sort the arrays back to their original order.

    2) Encode the 9127 keys (as they are ordered after the sort in step 1): they are unique, without gaps, but unordered.
    Use an arithmetic encoder to encode which key occupies which array position.
    The encoding cost of the key in the first position is -log2(1/9127), of the second -log2(1/9126), of the one before the last -log2(1/2), and the last one is then known (no need to encode it; -log2(1/1)=0).
    The encoding cost is the sum of the above, which is 13364 bytes.
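
    You can verify the 13364 bytes without any coder - it is simply the cost of the permutation:

    Code:
    #include <cmath>
    #include <cstdio>

    // Cost of encoding which of the 9127 keys sits in each position:
    // log2(9127) + log2(9126) + ... + log2(2) bits = log2(9127!) bits.
    int main() {
        double bits = 0;
        for (int k = 2; k <= 9127; k++) bits += std::log2(double(k));
        std::printf("%.0f bits = %.0f bytes\n", bits, bits / 8);   // ~13364 bytes
    }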

    3) Now encode the 9127 values in the following manner. Take the initial interval [17734157..42164216] and, using an arithmetic encoder, encode the value in the middle position (4563). The value in the middle is 41173824. The encoding cost is -log2(1/(42164216-17734157+1)) ≈ 24.5 bits.
    Notice that the interval contains unique integers, so the middle value is in the range [low+1 ... high-1]. Continue splitting the two new intervals recursively (17734157..41173823 and 41173825..42164215) and encode the middle value of each.
    Notice that the intervals get smaller and smaller, so the encoding cost of the middle values gets smaller and smaller. When you can't split any more intervals, you are done with this step.
    The total cost for this step is 12577 bytes.
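
    The recursive splitting is easiest to see in code. The sketch below just sums the ideal code lengths (a plain version of the idea; the exact byte count also depends on the bookkeeping details of the interval bounds):

    Code:
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Recursive range splitting: v[lo..hi] (0-based, inclusive) is sorted and all
    // of its values lie strictly inside the open interval (loVal .. hiVal).
    // Encode the middle element uniformly in that interval, then recurse.
    // Returns the ideal code length in bits.
    double splitCost(const std::vector<uint32_t>& v, int lo, int hi,
                     uint32_t loVal, uint32_t hiVal) {
        if (lo > hi) return 0;
        int mid = (lo + hi) / 2;
        double bits = std::log2(double(hiVal) - double(loVal) - 1.0);  // size of (loVal..hiVal)
        bits += splitCost(v, lo, mid - 1, loVal, v[mid]);
        bits += splitCost(v, mid + 1, hi, v[mid], hiVal);
        return bits;
    }
    // Usage on the sorted second sequence:
    //   splitCost(sortedSeq2, 0, 9126, 17734157 - 1, 42164216 + 1)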

    4) Let's prepare for delta encoding the first sequence: create a histogram of the deltas (there are 42 distinct delta values: 1, 2, 3 ... 34, 37, 51, with counts 1274, 439, 305 ... 1, 1, 1).
    Compress this histogram by range splitting.
    Initialize the "last seen value" (the base for the first delta) to 15898942 (notice: it is 1 less than the minimum value).
    The total cost for this step is 61 bytes.

    5) Finally, let's do the following for each revision id value in enwik8 (i.e. 12347 times):
    Using an arithmetic encoder, encode whether it is from the first sequence (15898943..15913059) or from the second sequence (17734157..42164216).
    In order to encode this, we need a tiny histogram with two values and the following counts: 3220 and 9127. Compress this histogram by range splitting.
    If the value is from the first sequence, encode its delta using the histogram from step 4). Otherwise (the value is from the second sequence) you don't need to do anything.
    The total cost for this step is 2563 bytes (sequence type + deltas).

    The total compressed size is: 28565 bytes.

    Decoding order:

    1) Decode the 9127 values by range splitting.
    2) Decode the 9127 indices (keys).
    3) Sort the two arrays together by keys. Now we have the revision ids in the "values" array in the original enwik8-order.
    4) Decode two histograms: one for the deltas (42 items) and one for determining the sequence type (2 items).
    5) Initialize the "last seen value" (the base for the first delta) to 15898942.
    6) Finally, do the following 12347 times: decode the sequence type; if it indicates that the value is from the first sequence, decode the delta and add it to the previous id to get the next revision id. Otherwise (the sequence type is the second one), just output the next revision id from the "values" array from step 3).
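
    Steps 3) and 6) in code, assuming the earlier steps have already produced the decoded arrays (a sketch, not the actual asm):

    Code:
    #include <cstdint>
    #include <vector>

    // Step 3: values[] is sorted ascending, keys[i] is the original enwik8 position
    // of values[i]; placing each value at its key restores the original order.
    std::vector<uint32_t> restoreOrder(const std::vector<uint32_t>& values,
                                       const std::vector<uint32_t>& keys) {
        std::vector<uint32_t> unordered(values.size());
        for (size_t i = 0; i < values.size(); i++) unordered[keys[i]] = values[i];
        return unordered;
    }

    // Step 6: interleave the two sequences back into the 12347 revision ids.
    // seqType[i] is 0 for the first (delta coded) sequence, 1 for the second.
    std::vector<uint32_t> rebuildIds(const std::vector<int>& seqType,
                                     const std::vector<uint32_t>& deltas,      // 3220
                                     const std::vector<uint32_t>& unordered) { // 9127
        std::vector<uint32_t> ids;
        uint32_t last = 15898942u;          // base for the first delta
        size_t d = 0, u = 0;
        for (int t : seqType)
            ids.push_back(t == 0 ? (last += deltas[d++]) : unordered[u++]);
        return ids;
    }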

    Remark: code for sorting the key-value pairs in asm: 117 bytes; code for range splitting and histogram compression: 710 bytes.

    Important hint: the timestamps belonging to the second revision id sequence are in the same order as the revision ids (with only one exception). If we encode/decode the order of the revision ids (the 9127 key-value pairs) then we have the order of the corresponding timestamps as well! That's a huge saving (~13K in compression).
    Attached Files
    Last edited by Gotty; 16th March 2019 at 04:12.

  18. The Following User Says Thank You to Gotty For This Useful Post:

    xinix (17th March 2019)

