Results 1 to 9 of 9

Thread: how to compress these numbers better?

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    how to compress these numbers better?

    "Source" is "revision.txt": https://pastebin.com/xDWLcefm
    Its revision-id field values from enwik8 xml index.

    Here's what I tried:
    Code:
    111,123 revision.txt
     49,388 revision.bin   // simply packed to 32-bit binary
     49,388 revision0.bin  // -15898943 from all numbers (min value)
     49,388 revision1.bin  // -(15898943+idx)
     49,388 revision1a.bin // -(15898943-3+idx) (prevent negative numbers)
     28,820 revision.txt.paq8px178 // paq8px -7
     29,650 revision.bin.paq8px178
     28,321 revision.rst.paq8px178 // "1a.bin" convered to decimal as is
     29,746 revision0.bin.paq8px178
     28,815 revision1.bin.paq8px178
     28,814 revision1a.bin.paq8px178
     30,599 revision1a.bin.lzma // -fb273 -mc9999 -lc3 -lp2 -pb2 -mt1
    Attached Files Attached Files

  2. #2
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    The plot of the sorted numbers in revisions.txt looks promising:

    Click image for larger version. 

Name:	plot_sorted.png 
Views:	47 
Size:	17.4 KB 
ID:	6514

    In the "curved part", fitting and subtracting a logarithmic curve might be possible so that all values are in a much smaller range. Additionally, the permutation to unsort has to be stored, of course.
    http://schnaader.info
    Damn kids. They're all alike.

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Code:
    111,123 revision.txt
    
     95,443 revision2.txt
     28,321 revision2.txt.paq8px178
    
    111,123 revision_s.txt // sort -nu
     14,682 revision_s.txt.paq8px178
    
     85,653 revision2_s.txt // sort -nu
     13,560 revision2_s.txt.paq8px178
    
     18,751 revision.txt.steg // stegdict.exe d revision.txt nul revision.txt.steg = "permutation to unsort has to be stored"
    
    28321-18751 = 9570 // max size of compressed sorted file to make it useful
    Thing is, there're fairly frequent runs of subsequent numbers in revision.txt

  4. The Following User Says Thank You to Shelwien For This Useful Post:

    schnaader (12th March 2019)

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Tried fitting it, script turned up quite scary, but no luck:

    if( $y<3221 ) {
    if( $y<549 ) {
    $x -= int(1.589933846573067083954811096E7 + 0.1199820652864920234*$y - 140.0904431842914448225*log($y) + 0.7245062460546726202*$y*log($y));
    } elsif( $y<1389 ) {
    $x -= int(1.58987848781941495835781097E7 + 4.1709471162017948132*$y);
    } elsif( $y<3139 ) {
    $x -= int(1.588208195236327126622200012E7+3.5405111894402621076*$y+2427.3304499257183124428*log($y));
    } elsif( $y<3221 ) {
    $x -= int(1.470368866811318695545196533E7-49.9302335691300243070*$y+176052.3532899742713198066*log($y-0.65265209586033812172)-6414.7628061393261305057*log($y));
    }
    } else {
    $y -= 3220;
    if( $y<4565 ) {
    $x -= int( 2**(24.1021 + 0.0017118*$y - 1.18426E-6*($y**2) + 4.35482E-10*($y**3) - 7.99101E-14*($y**4) + 5.72367E-18*($y**5)) );
    } else {
    $x -= int( 2**(24.0677 + 0.000839362*$y-2.3139E-7*($y**2) + 3.23133E-11*($y**3) - 2.25951E-15*($y**4) + 6.29656E-20*($y**5)) );
    }
    }

    Uncompressed size is 66k instead of 111k, but compressed size gets +1k.

  6. #5
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    As a response to this thread, see my current post in
    https://encode.ru/threads/2924-Faste...ll=1#post59465

    The compressed size of "revision.txt": 28565 bytes.

    My primary goal was to create a very small decompressor (keeping the Kolmogorov complexity in mind). I already know that this is *not* the best possible compression. But it is self-contained, no external compressor (like paq8px) is used. That means I answer the question: how to compress "revision.txt", not how to transform the revision ids to help a general compressor (like paq8px) to compress it better.

  7. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    My full script for enwik8 index processing is there - https://encode.ru/threads/2590-Some-...ll=1#post59399
    As to revision.txt, the best paq8px result for it here is 28,321 (which includes the archive header).

  8. #7
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Aha, I see.
    Then the only help I can give: the revision ids and the timestamps are *strongly* correlated. You can save ~13K, if you take advantage of this.

  9. The Following User Says Thank You to Gotty For This Useful Post:

    Shelwien (16th March 2019)

  10. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Not that strongly, I guess.
    But I agree that we can detect subsequent revision-ids by small ts delta.
    Click image for larger version. 

Name:	rev_ts.png 
Views:	19 
Size:	18.6 KB 
ID:	6519

    I tried sorting ts/rev pairs by time, and the first part is pretty random (3k out of 12k pairs),
    but later there's really a better correlation:
    Click image for larger version. 

Name:	ts_rev2.png 
Views:	16 
Size:	98.8 KB 
ID:	6520

    Still, in detail, it looks like this at best:
    Code:
    1135529171 -3695
    1135529409 -3601
    1135529511 -3542
    1135529922 -3346
    1135535618 -936
    1135536169 -679
    1135539334 494
    1135542626 1480
    1135543312 1607
    1135567018 8226
    1135571468 9306
    1135571773 9400
    So compression didn't really improve much:
    Code:
    25,400 ts_out_s.paq8px178
    24,102 ts_out_s2.paq8px178
    But maybe delta-coding of the sorted ts;rev list would help.

  11. #9
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Quote Originally Posted by Shelwien View Post
    Not that strongly, I guess.
    But I agree that we can detect subsequent revision-ids by small ts delta.
    This way:

    If you have the revision ids (decoded), sort them by value (and remember their position to unsort them at the end).
    The task is to find the timestamps for these revision ids.

    When the revision id is 15898943..15913059, there may be only 3 types of timestamps:
    2002-02-25T15:43:11Z (count: 437 )
    2002-02-25T15:51:15Z (count: 508 )
    2002-02-25T23:32:07Z .. 2005-06-27T02:54:12Z (count: 2275 )

    When the revision id is 17734157..42164216, the timestamps are:
    2005-06-28T04:28:28Z .. 2006-03-04T06:14:28Z (count: 9127)

    Nice, isn't it?

    The timestamps are strictly growing as the revision ids are growing in the last group. There is only one exception (you may need to simply patch it at the end: at Revision id=19120691 the timestamp=2005-07-19T01:39:52Z is off at least by 93 seconds).

    Find the timestamps: don't do it sequentially, but do it like quicksort: split the whole interval (9127), find the middle timestamp value (having the upper and lower timestamps as bounds). Do it recursively. Any timestamp to be decoded will be constrained by the already decoded timestamps.

    When the timestamps are all decoded, unsort the revision ids (now together with the timestamps).

    Using this method here I could compress the timestamps to 19336 bytes.

Similar Threads

  1. which compress algorithm is best for me?
    By skogkatt in forum Data Compression
    Replies: 19
    Last Post: 29th December 2016, 15:29
  2. Numbers vs text compression
    By irect in forum Data Compression
    Replies: 3
    Last Post: 7th March 2016, 02:24
  3. Random numbers
    By nburns in forum The Off-Topic Lounge
    Replies: 9
    Last Post: 9th September 2013, 05:53
  4. Compressing prime numbers
    By Matt Mahoney in forum Data Compression
    Replies: 14
    Last Post: 18th May 2013, 18:41
  5. Compress-LZF
    By spark in forum Data Compression
    Replies: 2
    Last Post: 16th October 2009, 00:08

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •