Thread: Some perl scripts for enwik8 parsing

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    Some perl scripts for enwik8 parsing

    http://nishi.dreamhosters.com/u/enwik8pl_v0.rar

    Code:
    0. text.pl (split the article text from xml index)
    // old script! No inverse script, and it's probably lossy
    
    Input: enwik8 (100,000,000)
    Output: enwik_text (86,961,331), enwik_idx (4,789,238), enwik_notes (385,235), IDX/000000..IDX/008581
    
    1. enwik1.pl (split the article text from xml index)
    
    Input: enwik8 (100,000,000)
    Output: enwik_text (98,131,052), enwik_idx (4,780,656)
    
    2. enwik1d.pl (restore #1)
    
    Input: enwik_text
    Output: enwik8out (should match enwik8)
    
    3. enwik2.pl (move <!-- ... --> html comments to a separate file)
    
    Input: enwik_text (98,131,052)
    Output: enwik_text2 (98,114,172), enwik_notes (376,319)
    
    4. enwik2d.pl (restore #3)
    
    Input: enwik_text2, enwik_notes
    Output: enwik8out2 (should match enwik_text)
    
    5. enwik3.pl (split &#xHHHH;, &#NNNN; etc codes to separate unicode files)
    
    Input: enwik_text2 (98,114,172)
    Output: enwik_text3 (97,817,017), enwik_text3_s1 (32,186), enwik_text3_s2 (0), enwik_text3_s3 (0), enwik_text3_s4 (200,981)
    
    6. enwik3b.pl (attach unicode streams to main file)
    
    Input: enwik_text3, enwik_text3_s4, enwik_text3_s1, enwik_text3_s2, enwik_text3_s3
    Output: enwik_text3a (98,056,487)
    
    7. enwik3b_d.pl (split the streams from main file)
    
    Input: enwik_text3a
    Output: enwik_text3d (=enwik_text3), enwik_text3_s1v, enwik_text3_s2v, enwik_text3_s3v, enwik_text3_s4v 
    
    8. enwik3_d.pl (reconstruct the file using the streams)
    
    Input: enwik_text3d, enwik_text3_s1v, enwik_text3_s2v, enwik_text3_s3v, enwik_text3_s4v
    Output: enwik_text3d1 (should match enwik_text2)
    
    
    9. enwik3a.pl (16bit unicode to utf8 conversion)
    
    Input: enwik_text3_s1, enwik_text3_s2, enwik_text3_s3
    Output: enwik_text3_s1u, enwik_text3_s2u, enwik_text3_s3u
    
    a. enwik3a_d.pl (utf8 back to utf16)
    
    Input: enwik_text3_s1u, enwik_text3_s2u, enwik_text3_s3u
    Output: enwik_text3_s1v (=enwik_text3_s1), enwik_text3_s2v (=enwik_text3_s2), enwik_text3_s3v (=enwik_text3_s3)
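
Steps 3-8 above all follow the same pattern: move one class of tokens out of the main file into a side stream, and provide an inverse that puts them back. Below is a minimal sketch of that pattern in Python (not the actual Perl scripts). The placeholder bytes, the function names `split_stream`/`join_stream` and both regexes are assumptions made for illustration; the real scripts store unicode code points rather than raw strings, and in raw enwik8 the comment delimiters may appear in escaped form, in which case the pattern would need adjusting.

```python
import re

def split_stream(text, pattern, mark):
    """Move every match of `pattern` to a side list, leaving `mark` behind."""
    assert mark not in text, "placeholder must not occur in the input"
    side = []

    def take(match):
        side.append(match.group(0))
        return mark

    return pattern.sub(take, text), side

def join_stream(body, side, mark):
    """Inverse of split_stream: put the side-list items back in order."""
    parts = body.split(mark)
    assert len(parts) == len(side) + 1, "stream length mismatch"
    out = [parts[0]]
    for item, rest in zip(side, parts[1:]):
        out += [item, rest]
    return "".join(out)

# Hypothetical patterns and placeholders, chosen for this sketch only.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
ENTITY = re.compile(r"&#(?:x[0-9A-Fa-f]+|[0-9]+);")
COMMENT_MARK = "\x00"
ENTITY_MARK = "\x01"

sample = "a<!-- note -->b &#233; c &#x4E2D;"
body, notes = split_stream(sample, COMMENT, COMMENT_MARK)
body2, codes = split_stream(body, ENTITY, ENTITY_MARK)

# Restore in the reverse order of splitting.
restored = join_stream(join_stream(body2, codes, ENTITY_MARK), notes, COMMENT_MARK)
```

Keeping the raw matched strings makes the round trip lossless without any special handling of leading zeros or hex case; converting to code points (as the scripts above do) is more compact but needs those cases handled separately.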

  2. The Following User Says Thank You to Shelwien For This Useful Post:

    kaitz (22nd February 2019)

  3. #2
    Administrator Shelwien's Avatar
    http://nishi.dreamhosters.com/u/e8_idx_reorder_v0.rar
    Code:
    Utils for enwik8 article permutation
    
    1. bld_idx
    
    Input: enwik_text2_drt (from enwik2.pl/enwik_text2 + drt; but plain enwik_text2 also would work)
    Output: enwik_art_idx (binary index of article locations and sizes)
    
    2. bld_idx10 (Build the index for text lines instead)
    
    3. bld_idx256 (Build the index for 512-byte (atm) blocks)
    
    4. use_idx (Reorder the file using idx)
    
    Input: enwik_text2_drt, enwik_art_idx
    Output: enwik_text2_drt_r (with parts reordered based on idx)
    
    5. use_idx_d (Inverse of use_idx)
    
    Input: enwik_text2_drt_r, enwik_art_idx
    Output: enwik_text2_drt_u
    
    *. Samples
    
    enwik_art_idx - simple demo, intended for renamed enwik8_text2
    
    enwik_art_idx1 - actual optimization attempt, intended for actual enwik8_text2_drt
    
    enwik_art_idx2 - actual optimization attempt #2, intended for actual enwik8_text2_drt
    There's also an MT optimizer based on the ppmd_sh and plzma codecs, but actual optimization attempts showed that their results don't really correlate with paq results.
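
For readers who want to see the idea behind the index/reorder/restore tools, here is a rough Python sketch. It is not the format used by bld_idx or use_idx: the index here is a plain list of (offset, length) pairs, the chunk separator is a newline (closer to bld_idx10), and the permutation is passed as a separate `order` list. All names are hypothetical.

```python
def build_index(data, sep=b"\n"):
    """Return an (offset, length) pair for each separator-terminated chunk."""
    idx = []
    pos = 0
    while pos < len(data):
        end = data.find(sep, pos)
        end = len(data) if end < 0 else end + len(sep)
        idx.append((pos, end - pos))
        pos = end
    return idx

def reorder(data, idx, order):
    """Emit the chunks in the sequence given by `order` (a permutation)."""
    return b"".join(data[o:o + n] for o, n in (idx[i] for i in order))

def restore(reordered, idx, order):
    """Inverse of reorder: copy each chunk back to its original offset."""
    out = bytearray(sum(n for _, n in idx))
    pos = 0
    for i in order:
        o, n = idx[i]
        out[o:o + n] = reordered[pos:pos + n]
        pos += n
    return bytes(out)

data = b"bb\na\nccc"
idx = build_index(data)
order = [2, 0, 1]
shuffled = reorder(data, idx, order)
```

The decoder needs both the index and the permutation, so their compressed size has to be counted against whatever the reordering gains.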

  4. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Hutter prize has been sitting idle for a while. Glad to see someone working on it again.

  5. #4
    Administrator Shelwien's Avatar
    I made some new scripts for parsing the enwik8 xml index.
    http://nishi.dreamhosters.com/u/enwik8pl_v1.rar

    Preprocessing is actually pretty complicated (separate handling for 5 types of data),
    and results are far from impressive (for paq8px178 it reduces enwik_idx size from 224k to 220k or so),
    but the effect should be more significant for simpler compressors (scripts reduce data size from 4,780,656 to 1,188,718).
    Code:
    // enwik1.pl
    227,854 enwik_idx.paq8px178     
    224,506 enwik_idx_drt.paq8px178 
    
    // enwik1.pl+repl.pl
    225,005 enwik_idx_dmp.paq8px178    
    220,113 enwik_idx_dmp_drt.paq8px178
      1,125 archive of enwik1d.pl+enwik2d.pl+replU.pl  
    
    // enwik1.pl+repl.pl+gfh.pl
    222,271 enwik8.paq8px0
    220,261 enwik8.paq8px1 // acij_dmp_drt
    220,683 enwik8.paq8px2 // acij_dmp_drt+gfh_dmp_drt
      1,608 archive of enwik1d.pl+enwik2d.pl+replU.pl+gfhU.pl
    paq8px is actually kinda strange. On one hand, I find it very hard to improve its results by preprocessing the text.
    For example, I tested binary/hex/base64 packing for timestamps, but simple decimal numbers won - and even that
    only saved like 200 bytes compared to the original timestamp syntax.
    But on the other hand, DRT still significantly improves its compression?
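
To make the timestamp experiment concrete, here is a small Python sketch of one possible "simple decimal" encoding and its inverse. It assumes the timestamps use the `YYYY-MM-DDTHH:MM:SSZ` form and that "decimal" means seconds since the Unix epoch; the post does not say which decimal representation was actually tested, so treat both as assumptions. The function names are made up.

```python
import calendar
import time

FMT = "%Y-%m-%dT%H:%M:%SZ"  # assumed timestamp syntax

def ts_to_decimal(ts):
    """Timestamp string -> decimal seconds since 1970-01-01 UTC."""
    return str(calendar.timegm(time.strptime(ts, FMT)))

def decimal_to_ts(num):
    """Inverse of ts_to_decimal."""
    return time.strftime(FMT, time.gmtime(int(num)))

encoded = ts_to_decimal("2001-09-09T01:46:40Z")
decoded = decimal_to_ts(encoded)
```
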
    Another example, IPs:
    Code:
    27,683 gfh_dmp_0.paq8px178 // 12.108.99.34 0 (original)
    27,487 gfh_dmp_1.paq8px178 // 1001916709 0
    27,784 gfh_dmp_2.paq8px178 // 00jpAg 0
    27,576 gfh_dmp_3.paq8px178 // 012108099034 0
    27,528 gfh_dmp_4.paq8px178 // 0C6C6322 0
    27,501 gfh_dmp_5.paq8px178 // 1001916709
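
The alternative IP encodings in the listing can be reproduced with a few lines of Python. This is only a sketch: big-endian byte order and unpadded base64 are assumptions, and `ip_forms` is a hypothetical helper. With these assumptions the zero-padded and hex forms of 12.108.99.34 match the samples above, while the decimal and base64 samples do not correspond to that address, so they presumably come from other lines of the file.

```python
import base64

def ip_forms(ip):
    """Return several textual encodings of a dotted-quad IPv4 address."""
    octets = [int(part) for part in ip.split(".")]
    assert len(octets) == 4 and all(0 <= o <= 255 for o in octets)
    raw = bytes(octets)
    return {
        "decimal": str(int.from_bytes(raw, "big")),              # one 32-bit number
        "base64": base64.b64encode(raw).decode().rstrip("="),    # 6 chars, no padding
        "padded": "".join("%03d" % o for o in octets),           # fixed 12 digits
        "hex": raw.hex().upper(),                                # fixed 8 hex digits
    }

forms = ip_forms("12.108.99.34")
```
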
    Anyway, some improvements are still possible:
    1) there's a sorted dictionary, and stegdict can save ~700 bytes there.
    2) the "revision id" field is left untouched, but it should be possible to transform it somehow, maybe at least to index+dictionary (and then +stegdict).
    The original idea was to use it as a sort key, then restore the order by article titles... but these turned out not to be sorted.
    3) some special syntax is not removed (like \x0A etc)
    And that should save 1-2k more.
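
As a rough way to estimate item 1: assuming stegdict works by hiding payload in the ordering of a list whose order can be restored by sorting, the capacity of n distinct entries is at most log2(n!) bits. The Python sketch below only computes that bound; it is not stegdict itself, and the function names are hypothetical.

```python
import math

def permutation_capacity_bits(n):
    """Upper bound on bits storable in the order of n distinct items: log2(n!)."""
    return math.lgamma(n + 1) / math.log(2)

def permutation_capacity_bytes(n):
    """Same bound, rounded down to whole bytes."""
    return int(permutation_capacity_bits(n) // 8)
```
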

    Btw, I also ported Alex' scripts: http://nishi.dreamhosters.com/u/AR_preproc_v0.rar
    But they are far from flexible (hardcoded input/output size etc), so I wasn't able to use them.
    In any case, the compression difference with the perl scripts here (not counting the new ones) is pretty small, 5k or so
    (I don't have an equivalent of the "footer" script atm), so there's no real need to use these.

  6. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    Darek (3rd March 2019),xinix (3rd March 2019)
