
Thread: Paq8pxd dict

  1. #1
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts

    Paq8pxd dict

    Dynamic record model
    Text/utf detection
    dynamic dict preprocess (modified version of XWRT)
    0xX0X0X0X0... to 0xXXXX... filter for text



    Code:
    enwik8 test 
    (option -3)
               compressed     time
    paq8pxd    20337801       1229
    paq8px_v69 20794944       1797
    
    
    (option -7)
               compressed     time
    paq8pxd    17596170       11464
    paq8px_v69 17939198       15363
    Last edited by kaitz; 22nd January 2012 at 12:42. Reason: test result
    KZo


  2. #2
    Member
    Join Date
    Jan 2012
    Location
    Sopianae
    Posts
    32
    Thanks
    9
    Thanked 0 Times in 0 Posts
    For texts? OK, thank you, I will give it a try!
    Thanks!!

  3. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,253
    Thanks
    304
    Thanked 768 Times in 482 Posts
    Calgary corpus results (14 files to 1 archive). Unfortunately, the results are worse than paq8px_v69 (626419 vs. 598550 bytes in total).

    Code:
    D:\>paq8px -7 calgary-7 c:\res\calgary\*
    Creating archive calgary-7.paq8px with 14 file(s)...
    
    1/14  Filename: c:/res/calgary/BIB (111261 bytes)
    Block segmentation:
     0           | default   |    111261 bytes [0 - 111260]
    Compressed from 111261 to 20635 bytes.
    
    2/14  Filename: c:/res/calgary/BOOK1 (768771 bytes)
    Block segmentation:
     0           | default   |    768771 bytes [0 - 768770]
    Compressed from 768771 to 191178 bytes.
    
    3/14  Filename: c:/res/calgary/BOOK2 (610856 bytes)
    Block segmentation:
     0           | default   |    610856 bytes [0 - 610855]
    Compressed from 610856 to 116216 bytes.
    
    4/14  Filename: c:/res/calgary/GEO (102400 bytes)
    Block segmentation:
     0           | default   |    102400 bytes [0 - 102399]
    Compressed from 102400 to 44094 bytes.
    
    5/14  Filename: c:/res/calgary/NEWS (377109 bytes)
    Block segmentation:
     0           | default   |    377109 bytes [0 - 377108]
    Compressed from 377109 to 82789 bytes.
    
    6/14  Filename: c:/res/calgary/OBJ1 (21504 bytes)
    Block segmentation:
     0           | default   |     21504 bytes [0 - 21503]
    Compressed from 21504 to 7280 bytes.
    
    7/14  Filename: c:/res/calgary/OBJ2 (246814 bytes)
    Block segmentation:
     0           | default   |    246814 bytes [0 - 246813]
    Compressed from 246814 to 44111 bytes.
    
    8/14  Filename: c:/res/calgary/PAPER1 (53161 bytes)
    Block segmentation:
     0           | default   |     53161 bytes [0 - 53160]
    Compressed from 53161 to 10389 bytes.
    
    9/14  Filename: c:/res/calgary/PAPER2 (82199 bytes)
    Block segmentation:
     0           | default   |     82199 bytes [0 - 82198]
    Compressed from 82199 to 16461 bytes.
    
    10/14  Filename: c:/res/calgary/PIC (513216 bytes)
    Block segmentation:
     0           | default   |    513216 bytes [0 - 513215]
    Compressed from 513216 to 30828 bytes.
    
    11/14  Filename: c:/res/calgary/PROGC (39611 bytes)
    Block segmentation:
     0           | default   |     39611 bytes [0 - 39610]
    Compressed from 39611 to 8218 bytes.
    
    12/14  Filename: c:/res/calgary/PROGL (71646 bytes)
    Block segmentation:
     0           | default   |     71646 bytes [0 - 71645]
    Compressed from 71646 to 9503 bytes.
    
    13/14  Filename: c:/res/calgary/PROGP (49379 bytes)
    Block segmentation:
     0           | default   |     49379 bytes [0 - 49378]
    Compressed from 49379 to 6688 bytes.
    
    14/14  Filename: c:/res/calgary/TRANS (93695 bytes)
    Block segmentation:
     0           | default   |     93695 bytes [0 - 93694]
    Compressed from 93695 to 9965 bytes.
    
    Total 3141622 bytes compressed to 598550 bytes.
    Time 458.69 sec, used 811717915 bytes of memory
    
    
    D:\>paq8pxd -7 calgary-7d c:\res\calgary\*
    Creating archive calgary-7d.paq8pxd with 14 file(s)...
    
    File list (169 bytes)
    Compressed from 169 to 100 bytes.
    
    1/14  Filename: c:/res/calgary/BIB (111261 bytes)
    Block segmentation:
     0           | text      |    111261 bytes [0 - 111260] (wrt: 85981)
    Compressed from 111261 to 20801 bytes.
    
    2/14  Filename: c:/res/calgary/BOOK1 (768771 bytes)
    Block segmentation:
     0           | text      |    173891 bytes [0 - 173890] (wrt: 131001)
     1           | default   |         1 bytes [173891 - 173891]
     2           | text      |    249971 bytes [173892 - 423862] (wrt: 183473)
     3           | default   |         1 bytes [423863 - 423863]
     4           | text      |    344907 bytes [423864 - 768770] (wrt: 241344)
    Compressed from 768771 to 197989 bytes.
    
    3/14  Filename: c:/res/calgary/BOOK2 (610856 bytes)
    Block segmentation:
     0           | text      |    610856 bytes [0 - 610855] (wrt: 375627)
    Compressed from 610856 to 119444 bytes.
    
    4/14  Filename: c:/res/calgary/GEO (102400 bytes)
    Block segmentation:
     0           | default   |    102400 bytes [0 - 102399]
    Compressed from 102400 to 44145 bytes.
    
    5/14  Filename: c:/res/calgary/NEWS (377109 bytes)
    Block segmentation:
     0           | text      |    314908 bytes [0 - 314907] (wrt: 238603)
     1           | default   |         1 bytes [314908 - 314908]
     2           | text      |      3959 bytes [314909 - 318867]
     3           | default   |         1 bytes [318868 - 318868]
     4           | text      |     58240 bytes [318869 - 377108] (wrt: 49491)
    Compressed from 377109 to 88660 bytes.
    
    6/14  Filename: c:/res/calgary/OBJ1 (21504 bytes)
    Block segmentation:
     0           | default   |     21504 bytes [0 - 21503]
    Compressed from 21504 to 7341 bytes.
    
    7/14  Filename: c:/res/calgary/OBJ2 (246814 bytes)
    Block segmentation:
     0           | default   |    246814 bytes [0 - 246813]
    Compressed from 246814 to 44003 bytes.
    
    8/14  Filename: c:/res/calgary/PAPER1 (53161 bytes)
    Block segmentation:
     0           | text      |     53161 bytes [0 - 53160] (wrt: 40392)
    Compressed from 53161 to 11592 bytes.
    
    9/14  Filename: c:/res/calgary/PAPER2 (82199 bytes)
    Block segmentation:
     0           | text      |     82199 bytes [0 - 82198] (wrt: 59795)
    Compressed from 82199 to 18096 bytes.
    
    10/14  Filename: c:/res/calgary/PIC (513216 bytes)
    Block segmentation:
     0           | default   |    513216 bytes [0 - 513215]
    Compressed from 513216 to 38731 bytes.
    
    11/14  Filename: c:/res/calgary/PROGC (39611 bytes)
    Block segmentation:
     0           | text      |     39611 bytes [0 - 39610] (wrt: 31184)
    Compressed from 39611 to 8748 bytes.
    
    12/14  Filename: c:/res/calgary/PROGL (71646 bytes)
    Block segmentation:
     0           | text      |     71646 bytes [0 - 71645] (wrt: 52840)
    Compressed from 71646 to 9693 bytes.
    
    13/14  Filename: c:/res/calgary/PROGP (49379 bytes)
    Block segmentation:
     0           | text      |     49379 bytes [0 - 49378] (wrt: 36331)
    Compressed from 49379 to 6828 bytes.
    
    14/14  Filename: c:/res/calgary/TRANS (93695 bytes)
    Block segmentation:
     0           | default   |     93695 bytes [0 - 93694]
    Compressed from 93695 to 10238 bytes.
    
    Total 3141622 bytes compressed to 626419 bytes.
    Time 442.42 sec, used 812319808 bytes of memory

  4. #4
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    860
    Thanks
    440
    Thanked 169 Times in 80 Posts
    Hi there,

    I am currently running paq8pxd -7 on my testsets. So far, the results are not better on textual data.
    Here is the output for the wikipedia testset:

    1/1 Filename: wiki.tar (1000009216 bytes)
    Block segmentation:
    0 | default | 1024 bytes [0 - 1023]
    1 | utf-8 | 100000000 bytes [1024 - 100001023] (wrt: 75673565)
    2 | default | 768 bytes [100001024 - 100001791]
    3 | utf-8 | 100000000 bytes [100001792 - 200001791] (wrt: 67339901)
    4 | default | 768 bytes [200001792 - 200002559]
    5 | utf-8 | 100000000 bytes [200002560 - 300002559] (wrt: 61950819)
    6 | default | 768 bytes [300002560 - 300003327]
    7 | utf-8 | 100000000 bytes [300003328 - 400003327] (wrt: 65501064)
    8 | default | 768 bytes [400003328 - 400004095]
    9 | utf-8 | 100000000 bytes [400004096 - 500004095] (wrt: 65785770)
    10 | default | 768 bytes [500004096 - 500004863]
    11 | utf-8 | 99999999 bytes [500004864 - 600004862] (wrt: 6164441
    12 | default | 769 bytes [600004863 - 600005631]
    13 | utf-8 | 100000000 bytes [600005632 - 700005631] (wrt: 66960493)
    14 | default | 768 bytes [700005632 - 700006399]
    15 | utf-8 | 100000000 bytes [700006400 - 800006399] (wrt: 85100616)
    16 | default | 768 bytes [800006400 - 800007167]
    17 | utf-8 | 99999998 bytes [800007168 - 900007165] (wrt: 70613645)
    18 | default | 770 bytes [900007166 - 900007935]
    19 | utf-8 | 100000000 bytes [900007936 - 1000007935]
    20 | default | 1280 bytes [1000007936 - 1000009215]
    Compressed from 1000009216 to 143474038 bytes.

    Total 1000009216 bytes compressed to 143474071 bytes.
    Time 99227.93 sec, used 812320083 bytes of memory

  5. #5
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    @Stephan
    Each wrt block has its own dict. That is probably the cause.


    Updated version. The program name has not changed. The previous attempt is obsolete.
    On enwik8, compression time is about the same as in paq8p3.
    Code:
    opt -7          Compression                        Time        
                px_v69        pxd        diff      px_v69     pxd
    enwik6      207610        206343     1267      155        101
    world95.txt 351923        350288     1635      451        224
    calgary.tar 598118        607317     -9199     457        371
    enwik8      17939198      17511910   427288    15363      8238
    vlcfile     1634624       1632802    1822      3004       1676
    The attachment has some test results. And yes, drt+px_v69 has better results on enwik8 than pxd.

    EDIT:
    Tested on another PC (Core2Duo T8300 2.4GHz, 2GB RAM).
    Code:
    paq8pxd -7 enwik9 144773408 63302
    Code:
    paq8pxd -8 enwik8 17300285    8137(sec) 1626035957(mem)
    Last edited by kaitz; 1st February 2012 at 16:05. Reason: more test results
    KZo


  6. #6
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    860
    Thanks
    440
    Thanked 169 Times in 80 Posts
    Hi Kaido,

    v1 compresses slightly better.

    1/1 Filename: wiki.tar (1000009216 bytes)
    Block segmentation:
    0 | default | 1024 bytes [0 - 1023]
    1 | utf-8 | 100000000 bytes [1024 - 100001023] (wrt: 77295703)
    2 | default | 768 bytes [100001024 - 100001791]
    3 | utf-8 | 100000000 bytes [100001792 - 200001791] (wrt: 69018975)
    4 | default | 768 bytes [200001792 - 200002559]
    5 | utf-8 | 100000000 bytes [200002560 - 300002559] (wrt: 64253847)
    6 | default | 768 bytes [300002560 - 300003327]
    7 | utf-8 | 100000000 bytes [300003328 - 400003327] (wrt: 67024667)
    8 | default | 768 bytes [400003328 - 400004095]
    9 | utf-8 | 100000000 bytes [400004096 - 500004095] (wrt: 67571240)
    10 | default | 768 bytes [500004096 - 500004863]
    11 | utf-8 | 99999999 bytes [500004864 - 600004862] (wrt: 63552163)
    12 | default | 769 bytes [600004863 - 600005631]
    13 | utf-8 | 100000000 bytes [600005632 - 700005631] (wrt: 6880379
    14 | default | 768 bytes [700005632 - 700006399]
    15 | utf-8 | 100000000 bytes [700006400 - 800006399] (wrt: 86301047)
    16 | default | 768 bytes [800006400 - 800007167]
    17 | utf-8 | 99999998 bytes [800007168 - 900007165] (wrt: 72774055)
    18 | default | 770 bytes [900007166 - 900007935]
    19 | utf-8 | 100000000 bytes [900007936 - 1000007935]
    20 | default | 1280 bytes [1000007936 - 1000009215]
    Compressed from 1000009216 to 143021889 bytes.

    But it doesn't seem to detect 24-bit images; only 8-bit images seem to be detected:

    1/1 Filename: bmp2.tar (633510400 bytes)
    Block segmentation:
    0 | default | 1024 bytes [0 - 1023]
    1 | hdr | 17 bytes [1024 - 1040]
    2 | 8b-image | 6291456 bytes [1041 - 6292496] (width: 3072)
    3 | default | 18876399 bytes [6292497 - 25168895]
    4 | hdr | 17 bytes [25168896 - 25168912]
    5 | 8b-image | 39052992 bytes [25168913 - 64221904] (width: 7216)
    6 | default | 117160751 bytes [64221905 - 181382655]
    7 | hdr | 17 bytes [181382656 - 181382672]
    8 | 8b-image | 27700400 bytes [181382673 - 209083072] (width: 608
    9 | default | 83103039 bytes [209083073 - 292186111]
    Compressing... 41.54%

  7. #7
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    // Detect .pbm .pgm .ppm image // fails on enwik9 at offset 435132165 (24-bit header)
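
    As a rough illustration of what telling 8-bit and 24-bit Netpbm headers apart involves, here is a minimal sketch. It is not the paq8pxd detector; `parse_pnm_header` and `pnm_payload_size` are hypothetical names, and only the binary P5 (greyscale) and P6 (RGB) variants are handled. A stricter check like this (non-zero dimensions, sane maxval, a single whitespace byte before the pixel data) is one way a detector might reject text that merely happens to contain "P6".

```python
def parse_pnm_header(data: bytes):
    """Parse a binary Netpbm header (P5 = greyscale, P6 = RGB).

    Returns (channels, width, height, maxval, header_len), or None if
    the bytes do not look like a valid header.
    """
    if data[:2] not in (b"P5", b"P6"):
        return None
    channels = 1 if data[:2] == b"P5" else 3
    pos, fields = 2, []
    while len(fields) < 3:
        # Skip whitespace and '#' comment lines between header fields.
        while pos < len(data) and (data[pos:pos + 1].isspace()
                                   or data[pos:pos + 1] == b"#"):
            if data[pos:pos + 1] == b"#":
                while pos < len(data) and data[pos:pos + 1] != b"\n":
                    pos += 1
            else:
                pos += 1
        start = pos
        while pos < len(data) and data[pos:pos + 1].isdigit():
            pos += 1
        if pos == start:
            return None  # expected a decimal number here
        fields.append(int(data[start:pos]))
    # Exactly one whitespace byte separates maxval from the pixel data.
    if pos >= len(data) or not data[pos:pos + 1].isspace():
        return None
    width, height, maxval = fields
    if width == 0 or height == 0 or not 0 < maxval < 65536:
        return None
    return channels, width, height, maxval, pos + 1


def pnm_payload_size(channels, width, height, maxval):
    """Number of pixel bytes that should follow the header."""
    bytes_per_sample = 1 if maxval < 256 else 2
    return channels * width * height * bytes_per_sample
```

    For a made-up header such as `b"P6\n4 2\n255\n"` this gives 3 channels and a 24-byte payload, so the whole image can be handed to a 24-bit model instead of treating only the first third of it as an 8-bit image.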
    KZo


  8. #8
    Member BetaTester's Avatar
    Join Date
    Dec 2010
    Location
    Brazil
    Posts
    43
    Thanks
    0
    Thanked 3 Times in 3 Posts
    In my tests I found that PAQ compresses better when:

    - Files with the same extension are compressed together, so the compressor moves on to another extension only after compressing all the files with the current one.

    - Within the list of files with the same extension, the files are in size order: the largest file first and the smallest file last.

    The latest versions of PAQ give no option to change the order of the input files; they simply take any folder and compress its files in alphabetical order.
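
    The ordering described above can be prepared outside the compressor. A minimal sketch, assuming the file list is available as (path, size) pairs; `paq_friendly_order` is a hypothetical helper, not something PAQ provides:

```python
import os


def paq_friendly_order(files):
    """Sort (path, size) pairs: group by extension, largest file first.

    The path is used as a final tie-breaker so the order is deterministic.
    """
    def key(item):
        path, size = item
        ext = os.path.splitext(path)[1].lower()
        return (ext, -size, path)

    return sorted(files, key=key)
```

    The sorted list could then be packed into a tar in that order and the tar given to PAQ as a single input.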

  9. #9
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    860
    Thanks
    440
    Thanked 169 Times in 80 Posts
    PAQ8pxd_v1 compresses the bitmap testset about 30 MB worse and does not detect all 24-bit images.
    There is no default data in this testset.

    1/1 Filename: bmp2.tar (633510400 bytes)
    Block segmentation:
    0 | default | 1024 bytes [0 - 1023]
    1 | hdr | 17 bytes [1024 - 1040]
    2 | 8b-image | 6291456 bytes [1041 - 6292496] (width: 3072)
    3 | default | 18876399 bytes [6292497 - 25168895]
    4 | hdr | 17 bytes [25168896 - 25168912]
    5 | 8b-image | 39052992 bytes [25168913 - 64221904] (width: 7216)
    6 | default | 117160751 bytes [64221905 - 181382655]
    7 | hdr | 17 bytes [181382656 - 181382672]
    8 | 8b-image | 27700400 bytes [181382673 - 209083072] (width: 608
    9 | default | 83103039 bytes [209083073 - 292186111]
    10 | hdr | 17 bytes [292186112 - 292186128]
    11 | 8b-image | 11130701 bytes [292186129 - 303316829] (width: 2749)
    12 | default | 33393314 bytes [303316830 - 336710143]
    13 | hdr | 17 bytes [336710144 - 336710160]
    14 | 8b-image | 6016000 bytes [336710161 - 342726160] (width: 2000)
    15 | default | 15999209 bytes [342726161 - 358725369]
    16 | hdr | 18 bytes [358725370 - 358725387]
    17 | 8b-image | 2048 bytes [358725388 - 358727435] (width: 512)
    18 | default | 498916 bytes [358727436 - 359226351]
    19 | hdr | 18 bytes [359226352 - 359226369]
    20 | 8b-image | 5658 bytes [359226370 - 359232027] (width: 1)
    21 | default | 518977 bytes [359232028 - 359751004]
    22 | hdr | 18 bytes [359751005 - 359751022]
    23 | 24b-image | 57600 bytes [359751023 - 359808622] (width: 15)
    24 | default | 263779 bytes [359808623 - 360072401]
    25 | hdr | 18 bytes [360072402 - 360072419]
    26 | 24b-image | 30957768 bytes [360072420 - 391030187] (width: 3897)
    27 | default | 12457556 bytes [391030188 - 403487743]
    28 | hdr | 17 bytes [403487744 - 403487760]
    29 | 8b-image | 7375872 bytes [403487761 - 410863632] (width: 3136)
    30 | default | 22129647 bytes [410863633 - 432993279]
    31 | hdr | 17 bytes [432993280 - 432993296]
    32 | 8b-image | 3429216 bytes [432993297 - 436422512] (width: 226
    33 | default | 10289295 bytes [436422513 - 446711807]
    34 | hdr | 17 bytes [446711808 - 446711824]
    35 | 8b-image | 6291456 bytes [446711825 - 453003280] (width: 3072)
    36 | default | 18876399 bytes [453003281 - 471879679]
    37 | hdr | 17 bytes [471879680 - 471879696]
    38 | 8b-image | 6016000 bytes [471879697 - 477895696] (width: 300
    39 | default | 18050031 bytes [477895697 - 495945727]
    40 | hdr | 17 bytes [495945728 - 495945744]
    41 | 8b-image | 6016000 bytes [495945745 - 501961744] (width: 300
    42 | default | 18050031 bytes [501961745 - 520011775]
    43 | hdr | 17 bytes [520011776 - 520011792]
    44 | 8b-image | 7375872 bytes [520011793 - 527387664] (width: 3136)
    45 | default | 22129647 bytes [527387665 - 549517311]
    46 | hdr | 17 bytes [549517312 - 549517328]
    47 | 8b-image | 7375872 bytes [549517329 - 556893200] (width: 3136)
    48 | default | 11772012 bytes [556893201 - 568665212]
    49 | hdr | 18 bytes [568665213 - 568665230]
    50 | 24b-image | 9984 bytes [568665231 - 568675214] (width: 9984)
    51 | default | 10347633 bytes [568675215 - 579022847]
    52 | hdr | 17 bytes [579022848 - 579022864]
    53 | 8b-image | 12121088 bytes [579022865 - 591143952] (width: 4256)
    54 | default | 36365295 bytes [591143953 - 627509247]
    55 | hdr | 17 bytes [627509248 - 627509264]
    56 | 8b-image | 6000000 bytes [627509265 - 633509264] (width: 3000)
    57 | default | 1135 bytes [633509265 - 633510399]
    Compressed from 633510400 to 269355793 bytes.

    Total 633510400 bytes compressed to 269355825 bytes.
    Time 64796.59 sec, used 881831523 bytes of memory

  10. #10
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    Originally posted by BetaTester:
    In my tests I found that PAQ compresses better when:

    - Files with the same extension are compressed together, so the compressor moves on to another extension only after compressing all the files with the current one.
    This brings up a question I have: if the detection code fails to match any known file stream (PGM, BMP, etc.), would anyone want a mode that, instead of using default mode, falls back to treating the data as an uncompressed image format like PPM or BMP? This may not be possible, since a stream of odd length could not be a valid bitmap stream.

  11. #11
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    Code:
    enwik8
    -7 17045653 Time 9428.17 sec, used 853029829 bytes of memory
    -8 16848214 Time 9535.25 sec, used 1658336197 bytes of memory
    Image detection is back, so do not try to compress enwik9.
    KZo


  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,253
    Thanks
    304
    Thanked 768 Times in 482 Posts
    I posted your results. http://mattmahoney.net/dc/text.html#1448
    I wonder if with -8 you might be able to move up to the #3 spot.

  13. #13
    Member
    Join Date
    May 2010
    Location
    France
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    For detection, why not use the file extension first?

    This image won't be seen as a JPEG:

    [Attached image: img.jpg, 2.6 KB]

    But this image would be seen as a JPEG image:

    [Attached image: img (1).jpg, 2.6 KB]

    And if you want to compress these images:

    "
    Files list <14 bytes>
    Compressed from 14 to 17 bytes.
    "

    Maybe it would be possible to skip compression when the compressed output is not smaller?
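
    The "store it raw if compression does not help" idea can be sketched with one flag byte per block. This is only an illustration: zlib stands in for the real coder, and `pack_block` / `unpack_block` are hypothetical names, not part of paq8pxd.

```python
import zlib


def pack_block(data: bytes) -> bytes:
    """Prefix one flag byte: 0 = stored raw, 1 = compressed.

    zlib is only a stand-in for the real compressor here.
    """
    packed = zlib.compress(data, 9)
    if len(packed) < len(data):
        return b"\x01" + packed
    return b"\x00" + data


def unpack_block(blob: bytes) -> bytes:
    """Invert pack_block by looking at the flag byte."""
    flag, body = blob[:1], blob[1:]
    return zlib.decompress(body) if flag == b"\x01" else body
```

    With this layout the worst case is one extra byte per block, which still costs something on a 14-byte file list but bounds the expansion.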

  14. #14
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    New:
    • Modified im8model (faster and slightly better in my tests)
    • base64 in e-mails (with recursion; yes, the transform can fail)
    • fixed enwik9 img problem (I hope)
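
    One way a base64 transform can "fail" is that the decoded data, when re-encoded, does not reproduce the original bytes exactly (different line length, line endings, or stray characters). A minimal sketch of such a round-trip check follows. It is not the paq8pxd code: `try_base64_transform` is a hypothetical name, and the 76-character lines with CRLF endings are an assumption about how the e-mail body was wrapped.

```python
import base64


def try_base64_transform(encoded: bytes, line_len: int = 76):
    """Return the decoded bytes if re-encoding reproduces `encoded` exactly.

    Returns None when the block is not valid base64 or does not round-trip,
    in which case the caller should leave the block untransformed.
    """
    try:
        raw = base64.b64decode(b"".join(encoded.split()), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    b64 = base64.b64encode(raw)
    lines = [b64[i:i + line_len] for i in range(0, len(b64), line_len)]
    rebuilt = b"\r\n".join(lines) + b"\r\n"
    return raw if rebuilt == encoded else None
```

    For nested ("multi-base64") data, the returned bytes could be passed through the same check again until it returns None.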
    Code:
    option -7
    
    
    zone_plate.pgm 6000017
    
    
               Compressed   Time
    paq8px_v69 404964       238 
    paq8pxd_v3 355834       225
    
    
    hdr.pgm 6291473
    paq8px_v69 1556167      178
    paq8pxd_v3 1553288      173
    
    
    bridge.pgm was about 1700 bytes larger with paq8pxd_v3
    Code:
    Thunderbird inbox 30118877 bytes
    option -7
    paq8pxd_v3 17364195 bytes, time 4207.49 sec, 827511670 mem
    paq8px_v69 17677583 bytes, time 5001.42 sec, 811820566 mem
    Code:
    paq8pxd_v3  -8 enwik8 16847903  bytes Time 8300 sec,  used 1658336197 mem
    paq8pxd_v3  -8 enwik9 136777893 bytes Time 82822 sec, used 1658336197 mem
    
    
    paq8pxd_v3  -7 enwik8 17045354  bytes Time 8023 sec,  used 853029829 mem
    paq8pxd_v3  -7 enwik9 140110094 bytes Time 80069 sec, used 853029829 mem
    Last edited by kaitz; 26th February 2012 at 20:31. Reason: base64 test & enwik8/9 tests
    KZo


  15. #15
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    • Added 4bit bmp
    • base64 fixes
    • other fixes
    • combined wrt files to one
    • etc.
    The sample file has multi-base64 encoded data, taken from the web.
    KZo


  16. #16
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    Is it possible to add 32-bit image filters?

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,253
    Thanks
    304
    Thanked 768 Times in 482 Posts
    v3 results are posted to http://mattmahoney.net/dc/text.html#1368
    Somehow this escaped my attention when you released it. It is now #3, beating lpaq9m.
    Anyway, if you want to test v4 on enwik9 I will post it too. I am testing on silesia. So far it is 1958K on dickens, beating paq8px_v69.

    Edit: paq8pxd_v4 -8 takes the top position on the silesia benchmark. http://mattmahoney.net/dc/silesia.html
    Compression took 15 hours on a 2 GHz T3200. Testing decompression now.

    Compression was better on most files but somewhat worse on samba. Here is the output.

    Code:
    D:\silesia>for %i in (*.) do paq8pxd_v4 -8 %i
    
    D:\silesia>paq8pxd_v4 -8 dickens
    Creating archive dickens.paq8pxd with 1 file(s)...
    
    File list (18 bytes)
    Compressed from 18 to 20 bytes.
    
    1/1  Filename: dickens (10192446 bytes)
    Block segmentation:
     0           | text      |  10192446 bytes [0 - 10192445] (wrt: 6006941)
    Compressed from 10192446 to 1958629 bytes.
    
    Total 10192446 bytes compressed to 1958659 bytes.
    Time 1124.14 sec, used 1633424260 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 mozilla
    Creating archive mozilla.paq8pxd with 1 file(s)...
    
    File list (18 bytes)
    Compressed from 18 to 20 bytes.
    
    1/1  Filename: mozilla (51220480 bytes)
    Block segmentation:
     0           | default   |  16003634 bytes [0 - 16003633]
     1           | text      |    565152 bytes [16003634 - 16568785] (wrt: 392572)
     2           | default   |  33443184 bytes [16568786 - 50011969]
     3           | utf-8     |    575462 bytes [50011970 - 50587431] (wrt: 468965)
     4           | default   |     51416 bytes [50587432 - 50638847]
     5           | jpeg      |      9407 bytes [50638848 - 50648254]
     6           | default   |       833 bytes [50648255 - 50649087]
     7           | jpeg      |     49629 bytes [50649088 - 50698716]
     8           | default   |       547 bytes [50698717 - 50699263]
     9           | hdr       |        44 bytes [50699264 - 50699307]
     10          | audio     |     27760 bytes [50699308 - 50727067] (8b mono)
     11          | default   |    493412 bytes [50727068 - 51220479]
    Compressed from 51220480 to 10229462 bytes.
    
    Total 51220480 bytes compressed to 10229492 bytes.
    Time 9270.58 sec, used 1862708696 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 mr
    Creating archive mr.paq8pxd with 1 file(s)...
    
    File list (12 bytes)
    Compressed from 12 to 15 bytes.
    
    1/1  Filename: mr (9970564 bytes)
    Block segmentation:
     0           | default   |   9970564 bytes [0 - 9970563]
    Compressed from 9970564 to 2060422 bytes.
    
    Total 9970564 bytes compressed to 2060447 bytes.
    Time 1349.45 sec, used 1565701897 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 nci
    Creating archive nci.paq8pxd with 1 file(s)...
    
    File list (14 bytes)
    Compressed from 14 to 16 bytes.
    
    1/1  Filename: nci (33553445 bytes)
    Block segmentation:
     0           | text      |  33553445 bytes [0 - 33553444]
    Compressed from 33553445 to 923150 bytes.
    
    Total 33553445 bytes compressed to 923176 bytes.
    Time 4439.04 sec, used 1633424264 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 ooffice
    Creating archive ooffice.paq8pxd with 1 file(s)...
    
    File list (17 bytes)
    Compressed from 17 to 19 bytes.
    
    1/1  Filename: ooffice (6152192 bytes)
    Block segmentation:
     0           | default   |      4228 bytes [0 - 4227]
     1           | exe       |   5012819 bytes [4228 - 5017046]
     2           | default   |     26830 bytes [5017047 - 5043876]
     3           | exe       |    253183 bytes [5043877 - 5297059]
     4           | default   |    855132 bytes [5297060 - 6152191]
    Compressed from 6152192 to 1418239 bytes.
    
    Total 6152192 bytes compressed to 1418268 bytes.
    Time 1118.72 sec, used 1582557220 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 osdb
    Creating archive osdb.paq8pxd with 1 file(s)...
    
    File list (15 bytes)
    Compressed from 15 to 17 bytes.
    
    1/1  Filename: osdb (10085684 bytes)
    Block segmentation:
     0           | default   |  10085684 bytes [0 - 10085683]
    Compressed from 10085684 to 2069934 bytes.
    
    Total 10085684 bytes compressed to 2069961 bytes.
    Time 1776.86 sec, used 1565701895 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 reymont
    Creating archive reymont.paq8pxd with 1 file(s)...
    
    File list (17 bytes)
    Compressed from 17 to 19 bytes.
    
    1/1  Filename: reymont (6627202 bytes)
    Block segmentation:
     0           | text      |   6501239 bytes [0 - 6501238]
     1           | default   |    125963 bytes [6501239 - 6627201]
    Compressed from 6627202 to 812189 bytes.
    
    Total 6627202 bytes compressed to 812218 bytes.
    Time 1064.92 sec, used 1633426308 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 samba
    Creating archive samba.paq8pxd with 1 file(s)...
    
    File list (16 bytes)
    Compressed from 16 to 18 bytes.
    
    1/1  Filename: samba (21606400 bytes)
    Block segmentation:
     0           | default   |    279004 bytes [0 - 279003]
     1           | text      |   1658664 bytes [279004 - 1937667] (wrt: 1040945)
     2           | default   |    131757 bytes [1937668 - 2069424]
     3           | text      |   2661772 bytes [2069425 - 4731196] (wrt: 1699529)
     4           | default   |   1092855 bytes [4731197 - 5824051]
     5           | text      |    725004 bytes [5824052 - 6549055] (wrt: 562098)
     6           | default   |    420432 bytes [6549056 - 6969487]
     7           | jpeg      |      8020 bytes [6969488 - 6977507]
     8           | default   |    461300 bytes [6977508 - 7438807]
     9           | text      |    678792 bytes [7438808 - 8117599] (wrt: 554030)
     10          | default   |      9673 bytes [8117600 - 8127272]
     11          | text      |  13132289 bytes [8127273 - 21259561] (wrt: 9307955)
     12          | default   |    346838 bytes [21259562 - 21606399]
    Compressed from 21606400 to 2853155 bytes.
    
    Total 21606400 bytes compressed to 2853183 bytes.
    Time 2701.86 sec, used 1794596602 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 sao
    Creating archive sao.paq8pxd with 1 file(s)...
    
    File list (13 bytes)
    Compressed from 13 to 16 bytes.
    
    1/1  Filename: sao (7251944 bytes)
    Block segmentation:
     0           | default   |   7251944 bytes [0 - 7251943]
    Compressed from 7251944 to 3776301 bytes.
    
    Total 7251944 bytes compressed to 3776327 bytes.
    Time 1378.77 sec, used 1565701896 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 webster
    Creating archive webster.paq8pxd with 1 file(s)...
    
    File list (18 bytes)
    Compressed from 18 to 21 bytes.
    
    1/1  Filename: webster (41458703 bytes)
    Block segmentation:
     0           | text      |  41458703 bytes [0 - 41458702] (wrt: 29889928)
    Compressed from 41458703 to 4907154 bytes.
    
    Total 41458703 bytes compressed to 4907185 bytes.
    Time 5363.06 sec, used 1633424260 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 x-ray
    Creating archive x-ray.paq8pxd with 1 file(s)...
    
    File list (15 bytes)
    Compressed from 15 to 17 bytes.
    
    1/1  Filename: x-ray (8474240 bytes)
    Block segmentation:
     0           | default   |   8474240 bytes [0 - 8474239]
    Compressed from 8474240 to 3587948 bytes.
    
    Total 8474240 bytes compressed to 3587975 bytes.
    Time 1407.96 sec, used 1565701894 bytes of memory
    
    D:\silesia>paq8pxd_v4 -8 xml
    Creating archive xml.paq8pxd with 1 file(s)...
    
    File list (13 bytes)
    Compressed from 13 to 15 bytes.
    
    1/1  Filename: xml (5345280 bytes)
    Block segmentation:
     0           | text      |   5345279 bytes [0 - 5345278] (wrt: 3560922)
     1           | default   |         1 bytes [5345279 - 5345279]
    Compressed from 5345280 to 264663 bytes.
    
    Total 5345280 bytes compressed to 264688 bytes.
    Time 609.45 sec, used 1633426312 bytes of memory
    
    D:\silesia>dir
     Volume in drive D is DATA
     Volume Serial Number is 5CE8-C77D
    
     Directory of D:\silesia
    
    04/20/2012  04:23 AM    <DIR>          .
    04/20/2012  04:23 AM    <DIR>          ..
    04/12/2002  01:21 PM        10,192,446 dickens
    04/19/2012  08:05 PM         1,958,659 dickens.paq8pxd
    05/31/2002  07:50 PM        51,220,480 mozilla
    04/19/2012  10:40 PM        10,229,492 mozilla.paq8pxd
    03/20/2003  11:12 AM         9,970,564 mr
    04/19/2012  11:02 PM         2,060,447 mr.paq8pxd
    04/02/2002  10:21 PM        33,553,445 nci
    04/20/2012  12:16 AM           923,176 nci.paq8pxd
    07/04/2002  05:00 AM         6,152,192 ooffice
    04/20/2012  12:35 AM         1,418,268 ooffice.paq8pxd
    04/11/2002  06:56 PM        10,085,684 osdb
    04/20/2012  01:05 AM         2,069,961 osdb.paq8pxd
    04/02/2002  11:40 PM         6,627,202 reymont
    04/20/2012  01:22 AM           812,218 reymont.paq8pxd
    03/25/2002  02:34 PM        21,606,400 samba
    04/20/2012  02:07 AM         2,853,183 samba.paq8pxd
    03/24/2002  01:38 AM         7,251,944 sao
    04/20/2012  02:30 AM         3,776,327 sao.paq8pxd
    03/25/2002  10:39 AM        41,458,703 webster
    04/20/2012  04:00 AM         4,907,185 webster.paq8pxd
    04/04/2002  02:00 PM         8,474,240 x-ray
    04/20/2012  04:23 AM         3,587,975 x-ray.paq8pxd
    12/01/2000  12:54 AM         5,345,280 xml
    04/20/2012  04:33 AM           264,688 xml.paq8pxd
                  24 File(s)    246,800,159 bytes
                   2 Dir(s)  39,701,053,440 bytes free
    Edit: decompression checks OK. Decompression took 9 hours. Looking back at compression times, it was also 9 hours, not 15. My bad.
    Last edited by Matt Mahoney; 21st April 2012 at 02:46.

  18. #18
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    Code:
    paq8pxd_v4 -8     enwik9
     
    Total 1000000000 bytes compressed to 135027170 bytes.
    Time 88409.59 sec, used 1633424261 bytes of memory
    
    
    paq8pxd_v4 -8     enwik8
    
    Total 100000000 bytes compressed to 16642941 bytes.
    Time 8395.20 sec, used 1633424261 bytes of memory
    KZo


  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,253
    Thanks
    304
    Thanked 768 Times in 482 Posts
    Nice improvement on LTCB. http://mattmahoney.net/dc/text.html#1350

    I also tested on the Silesia benchmark with -5 through -8. http://mattmahoney.net/dc/silesia.html

  20. #20
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    It seems that having multiple wrt blocks is messing up the modeling in 'samba'. Building one dict for the whole file might improve results,
    i.e. building the dict from all text blocks and then using that same dict for each text block.
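
A rough sketch of that idea in Python (not paq8pxd or XWRT code): count words over all text blocks first, then keep the frequent ones as a single shared dict. The word pattern, `min_count` and `min_len` are assumptions made up for this illustration and do not reflect how the wrt preprocessor actually picks its words.

```python
import re
from collections import Counter

# Hypothetical word definition for this sketch; XWRT's real rules differ.
WORD = re.compile(rb"[A-Za-z]+")

def build_shared_dict(blocks, min_count=2, min_len=3):
    """Count words over all text blocks and keep the frequent ones."""
    counts = Counter()
    for block in blocks:
        counts.update(w for w in WORD.findall(block) if len(w) >= min_len)
    frequent = [(w, n) for w, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically for a stable order.
    frequent.sort(key=lambda wn: (-wn[1], wn[0]))
    return [w for w, _ in frequent]

blocks = [b"the cat and the dog", b"the dog barks"]
print(build_shared_dict(blocks))       # dict built from both blocks
print(build_shared_dict(blocks[1:]))   # second block alone has no repeats
```

With per-block dicts a word that occurs once in each block never becomes frequent enough to be replaced; counting over all blocks is what lets it into the dict.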
    KZo


  21. #21
    Member Anitatoom's Avatar
    Join Date
    Nov 2012
    Location
    Thailand
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Please help me: I can compile paq8px, but I can't compile paq8pxd.

    When I compile it, I get these errors:

    g++ paq8pxd_v4.cpp -DUNIX -DNOASM -O3 -s -march=nocona -O2 -pipe -o paq8pxd_v4
    In file included from paq8pxd_v4.cpp:4393:
    wrtpre.cpp: In function ‘int min(int, int)’:
    wrtpre.cpp:39: error: redefinition of ‘int min(int, int)’
    paq8pxd_v4.cpp:547: error: ‘int min(int, int)’ previously defined here
    wrtpre.cpp: In function ‘int max(int, int)’:
    wrtpre.cpp:40: error: redefinition of ‘int max(int, int)’
    paq8pxd_v4.cpp:548: error: ‘int max(int, int)’ previously defined here
    paq8pxd_v4.cpp: In function ‘void compressRecursive(FILE*, long int, Encoder&, char*, int, int, int)’:
    paq8pxd_v4.cpp:4645: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘long int’
    paq8pxd_v4.cpp:4645: warning: format ‘%d’ expects type ‘int’, but argument 2 has type ‘long int’
    paq8pxd_v4.cpp: In function ‘int main(int, char**)’:
    paq8pxd_v4.cpp:5066: warning: format ‘%ld’ expects type ‘long int’, but argument 2 has type ‘int’
    paq8pxd_v4.cpp:5066: warning: format ‘%ld’ expects type ‘long int’, but argument 2 has type ‘int’
    paq8pxd_v4.cpp:5072: warning: format ‘%ld’ expects type ‘long int’, but argument 2 has type ‘int’
    paq8pxd_v4.cpp:5072: warning: format ‘%ld’ expects type ‘long int’, but argument 2 has type ‘int’
    Last edited by Anitatoom; 1st November 2012 at 09:32.

  22. #22
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,934
    Thanks
    325
    Thanked 305 Times in 124 Posts
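    Remove the min()/max() definitions from wrtpre.cpp (lines 39-40). That file is included into paq8pxd_v4.cpp, which already defines both functions at lines 547-548, hence the redefinition errors.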

  23. #23
    Member Anitatoom's Avatar
    Join Date
    Nov 2012
    Location
    Thailand
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thank you

  24. #24
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    I am working on a newer version.
    In wrt preprocessing, digits are now treated like the letters a-z.

    nci, xml from Silesia corpus
    Code:
    D:\test>paq8pxd_v6.exe -7 nci
    Creating archive nci.paq8pxd with 1 file(s)...
    
    
    File list (14 bytes)
    Compressed from 14 to 16 bytes.
    
    
    1/1  Filename: nci (33553445 bytes)
    Block segmentation:
     0           | text      |  33553445 bytes [0 - 33553444]
    Compressed from 33553445 to 850808 bytes.
    
    
    Total 33553445 bytes compressed to 850834 bytes.
    Time 4182.96 sec, used 844903304 bytes of memory
    
    
    D:\test>paq8pxd_v6.exe -7 xml
    Creating archive xml.paq8pxd with 1 file(s)...
    
    
    File list (13 bytes)
    Compressed from 13 to 15 bytes.
    
    
    1/1  Filename: xml (5345280 bytes)
    Block segmentation:
     0           | text      |   5345279 bytes [0 - 5345278] (wrt: 3546174)
     1           | default   |         1 bytes [5345279 - 5345279]
    Compressed from 5345280 to 264647 bytes.
    
    
    Total 5345280 bytes compressed to 264672 bytes.
    Time 689.62 sec, used 844905352 bytes of memory
    There is a mistake in how the output displays whether a text block was wrt processed.
    Some text files suffer a slight loss of compression.
    KZo


  25. #25
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,253
    Thanks
    304
    Thanked 768 Times in 482 Posts
    That's a huge improvement on nci.

  26. #26
    Member
    Join Date
    May 2008
    Location
    Estonia
    Posts
    372
    Thanks
    134
    Thanked 188 Times in 103 Posts
    Quote Originally Posted by Matt Mahoney
    That's a huge improvement on nci.
    The dict mostly contains numbers. I limited numwords to a minimum of 3 bytes (excluding 34, 3g etc., allowing f4, k2 etc.).
    world95.txt suffers a little less and nci loses about 6 KB.
    I am no expert, so I can only guess how much nci can actually be compressed.
    KZo


  27. #27
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,253
    Thanks
    304
    Thanked 768 Times in 482 Posts
    I have so far not found any documentation on the nci file format but I did find http://cactus.nci.nih.gov/ncidb2.1/ which apparently lets you query the same database to show chemical structures by NSC number. For example, the first record in the file is:
    Code:
    155542
    ROtclserve11150011212D 0   0.00000     0.000001049521
    
     31 32  0  0  0  0  0  0  0  0  1 V2000
        2.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        3.0000    0.0000    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
        3.0000    1.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
        3.8660    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        4.7321    1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        5.5981    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        5.5981    2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        4.7321    3.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        3.8660    2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        4.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        5.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
        3.0000   -1.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        2.1340   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        2.1340   -2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        3.0000   -3.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        3.8660   -2.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        3.8660   -1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        2.0000    0.6200    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        1.3800    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        2.0000   -0.6200    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        2.4631    1.3100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        4.7321    0.3800    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        6.1350    1.1900    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        6.1350    2.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        4.7321    3.6200    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        3.3291    2.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        1.5970   -1.1900    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        1.5970   -2.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        3.0000   -3.6200    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        4.4030   -2.8100    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
        4.4030   -1.1900    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
      1  2  1  0  0  0  0
      2  3  1  0  0  0  0
      3  4  1  0  0  0  0
      4  5  2  0  0  0  0
      5  6  1  0  0  0  0
      6  7  2  0  0  0  0
      7  8  1  0  0  0  0
      8  9  2  0  0  0  0
      4  9  1  0  0  0  0
      2 10  1  0  0  0  0
     10 11  3  0  0  0  0
      2 12  1  0  0  0  0
     12 13  2  0  0  0  0
     13 14  1  0  0  0  0
     14 15  2  0  0  0  0
     15 16  1  0  0  0  0
     16 17  2  0  0  0  0
     12 17  1  0  0  0  0
      1 18  1  0  0  0  0
      1 19  1  0  0  0  0
      1 20  1  0  0  0  0
      3 21  1  0  0  0  0
      5 22  1  0  0  0  0
      6 23  1  0  0  0  0
      7 24  1  0  0  0  0
      8 25  1  0  0  0  0
      9 26  1  0  0  0  0
     13 27  1  0  0  0  0
     14 28  1  0  0  0  0
     15 29  1  0  0  0  0
     16 30  1  0  0  0  0
     17 31  1  0  0  0  0
    M  END
    > <NSC>
    155542
    
    > <CAS_RN>
    17424-68-9
    
    > <SMILES>
    C[C](NC1=CC=CC=C1)(C#N)C2=CC=CC=C2
    
    > <HASH>
    4d5775e8d9fc4fd3
    
    $$$$
    The first number (155542) is the NSC number, which you can search for to show the chemical structure.

    The numbers 31 32 are the number of atoms and the number of chemical bonds. The next 2 sections have 31 and 32 lines respectively. I don't know what the rest of the line means.

    Each line shows (I guess) the x,y,z coordinates of the atom for drawing a 3-D model, followed by the chemical symbol such as C for carbon. I don't know what the rest of the numbers are. I don't know why the z coordinate is always 0 because a carbon atom with 4 single bonds has a tetrahedral structure which should mean that attached atoms would have nonzero z values.

    The next 32 lines show the chemical bonds. The first 2 numbers are the numbers of the atoms (from 1 to 31) and the next number is the type of bond (1 for single, 2 for double, etc). These should be predictable by the x,y,z coordinates because they should be at a characteristic distance depending on the atoms, and also by the fact that each atom has a characteristic valence or total number of bonds (4 for C, 3 for N, 2 for O and S, 1 for H, Cl, Br). I don't know what the other numbers are.
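
A minimal Python sketch of the valence argument, using the atoms and bonds of the record quoted above: sum the bond orders at each atom and compare with the usual valence of its element. The names `ATOMS`, `BONDS` and `valence_mismatches` are made up for this illustration, and the check assumes neutral atoms, so records carrying charges would need extra handling.

```python
# Usual valences as listed in the post (neutral atoms only).
EXPECTED_VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "H": 1, "Cl": 1, "Br": 1}

# Element symbols of atoms 1..31 of record 155542, in file order.
ATOMS = ["C", "C", "N"] + ["C"] * 7 + ["N"] + ["C"] * 6 + ["H"] * 14

# (first atom, second atom, bond type) for the 32 bond lines of that record.
BONDS = [
    (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 7, 2),
    (7, 8, 1), (8, 9, 2), (4, 9, 1), (2, 10, 1), (10, 11, 3), (2, 12, 1),
    (12, 13, 2), (13, 14, 1), (14, 15, 2), (15, 16, 1), (16, 17, 2), (12, 17, 1),
    (1, 18, 1), (1, 19, 1), (1, 20, 1), (3, 21, 1), (5, 22, 1), (6, 23, 1),
    (7, 24, 1), (8, 25, 1), (9, 26, 1), (13, 27, 1), (14, 28, 1), (15, 29, 1),
    (16, 30, 1), (17, 31, 1),
]

def bond_order_sums(bonds):
    """Total bond order attached to each atom number."""
    totals = {}
    for a, b, order in bonds:
        totals[a] = totals.get(a, 0) + order
        totals[b] = totals.get(b, 0) + order
    return totals

def valence_mismatches(atoms, bonds):
    """Atoms whose summed bond order differs from the usual valence."""
    totals = bond_order_sums(bonds)
    return [(i, sym, totals.get(i, 0))
            for i, sym in enumerate(atoms, start=1)
            if totals.get(i, 0) != EXPECTED_VALENCE[sym]]

print(valence_mismatches(ATOMS, BONDS))
```

For this record every atom comes out at its usual valence, which is the kind of redundancy a specialised model could exploit.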

    Note that the NSC number appears twice. It looks like they are numbered consecutively as well.

    SMILES is a string that gives the chemical structure in a canonical form. Since the previous data gives the chemical structure, this string should be predictable. The string omits H and uses = and # to indicate double and triple bonds and numbers to indicate bonds between distant parts of the molecule (forming rings). See http://en.wikipedia.org/wiki/Simplif...e-entry_system

    Edit: Notice that the atoms in the SMILES string appear in the same order as in the list of atoms above. SMILES omits H, so these appear last.
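
The ordering claim can be checked with a few lines of Python. This is only a crude tokenizer sketch, not a SMILES parser: the regular expression handles bracket atoms and the common organic subset and simply skips bonds, branches and ring labels. `HEAVY_ATOMS` lists the non-hydrogen symbols of the atom block above; both names are made up for this illustration.

```python
import re

# Bracket atom such as [C], [N+], [Cl-], or a bare atom of the organic subset.
ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z])[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])")

def smiles_atoms(smiles):
    """Element symbols in the order they appear in a SMILES string."""
    return [(m.group(1) or m.group(2)).capitalize()
            for m in ATOM.finditer(smiles)]

SMILES = "C[C](NC1=CC=CC=C1)(C#N)C2=CC=CC=C2"

# Non-hydrogen atoms 1..17 of record 155542, in atom-block order.
HEAVY_ATOMS = ["C", "C", "N"] + ["C"] * 7 + ["N"] + ["C"] * 6

print(smiles_atoms(SMILES) == HEAVY_ATOMS)
```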

    I don't know how the hash is computed, but obviously this information is redundant.
    Last edited by Matt Mahoney; 24th December 2012 at 05:11.

  28. #28
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    503
    Thanks
    176
    Thanked 156 Times in 68 Posts
    Good analysis so far, though there seems to be more - there are some "surprising" records with slightly unexpected variations and SMILES strings, e.g. #156316:

    Code:
    156316
    ROtclserve11150011212D 0   0.00000     0.00000115469
    999-99-9
     66100  0  0  0  0  0  0  0  0  7 V2000
      (snip)table 1(snip)
      (snip)table 2(snip)
    M  CHG  2  37  -1  40   1
    M  RAD  8   1   2   2   2   3   2   4   2   5   2   6   2   7   2   8   2
    M  RAD  8   9   2  10   2  11   2  12   2  13   2  14   2  15   2  16   2
    M  RAD  8  17   2  18   2  19   2  20   2  21   2  22   2  23   2  24   2
    M  RAD  8  25   2  26   2  27   2  28   2  29   2  30   2  31   2  32   2
    M  RAD  4  33   2  34   2  35   2  36   2
    M  END
    > <NSC>
    156316
    
    > <CAS_RN>
    999-99-9
    
    > <SMILES>
    [Cl-].CC[N+](CC)(CC)CC.[Cl]1[Nb]234567[Cl][Nb]289%10%11%12[Cl][Nb]138%13%14%15[Cl][Nb]4%13%16%17%18([Cl]5)[Cl][Nb]69%16%19([Cl]7)([Cl]%10)[Cl][Nb]%11%14%17%19([Cl]%15)([Cl]%12)[Cl]%18.[Cl]%20[Nb]%21%22%23%24%25%26[Cl][Nb]%21%27%28%29%30%31[Cl][Nb]%20%22%27%32%33%34[Cl][Nb]%23%32%35%36%37([Cl]%24)[Cl][Nb]%25%28%35%38([Cl]%26)([Cl]%29)[Cl][Nb]%30%33%36%38([Cl]%34)([Cl]%31)[Cl]%37
    
    > <HASH>
    dd377c9f0d5029e0
    
    $$$$
    Note the fixed 3-digit layout, so that the two "line counts" are not separated by a space ("66100" instead of "66 100"); the "M" table, which usually is "M END" only but is longer here; and the huge SMILES string containing some dots (I don't have a clue what these are for) and coded numbers ("%10"), where % seems to initiate a two-digit number. EDIT: OK, '%' preceding a label above 9 is described on the Wikipedia page. There is also an example containing a dot, though I haven't found an explanation yet.
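
Because of that fixed-width layout the counts have to be read by column, not by splitting on whitespace. A small Python sketch, assuming the usual V2000 counts line where the first three characters are the atom count and the next three the bond count (`parse_counts_line` is a made-up name):

```python
def parse_counts_line(line):
    """Return (atom_count, bond_count) from a fixed-width counts line."""
    # Characters 0-2 hold the atom count, characters 3-5 the bond count.
    return int(line[0:3]), int(line[3:6])

first = " 31 32  0  0  0  0  0  0  0  0  1 V2000"
second = " 66100  0  0  0  0  0  0  0  0  7 V2000"

print(parse_counts_line(first))
print(parse_counts_line(second))
# A naive whitespace split merges the two counts of the second record.
print(second.split()[:2])
```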


    Quote Originally Posted by Matt Mahoney
    I don't know how the hash is computed, but obviously this information is redundant.
    It's 64 bits long, but at least it doesn't seem to be a CRC-64 of obvious parts: I tested the first record up to the hash line, without the last linefeed, without the last two linefeeds, with the first NSC line removed, and combinations of those.

    EDIT: There is some documentation available that describes 64-bit hashcodes named HASHISY and mentions a web resolving service that seems to calculate those hashcodes based on a simplified chemical formula, e.g. "http://cactus.nci.nih.gov/chemical/s...ccccc1/hashisy" returns the hash "3DB0124A3ECF5ECE" as plain MIME text.

    This works for simple SMILES strings like #155653, http://cactus.nci.nih.gov/chemical/s...CCOCCO/hashisy and there really seem to be some calculations involved (instead of a simple file/directory lookup) as e.g. the lower case version http://cactus.nci.nih.gov/chemical/s...ccocco/hashisy doesn't return the same hash - on the other hand, HTTP 404 codes are returned in invalid cases like http://cactus.nci.nih.gov/chemical/s...c1cccc/hashisy or http://cactus.nci.nih.gov/chemical/s...nvalid/hashisy

    EDIT: To get less off-topic: a dictionary approach seems to help because the numbers show little variety, which comes from their meaning (the graphical representation of a structural formula). Fractional parts of numbers have a geometric meaning, like .8660 (cosine of 30 degrees). This knowledge could also be used to improve compression some more: there are pairs of fractional parts that add up to 1 (e.g. .1340 and .8660), so only one of them has to be in the dictionary and the other could be generated automatically.
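
A quick Python sketch of the complement idea, using the fractional parts that occur in the coordinates of record 155542 quoted earlier in the thread. `FRACTIONS` and `complement_pairs` are names made up for this illustration; the values are handled as integers in units of 0.0001 to avoid floating-point rounding.

```python
import math

# 4-digit fractional parts seen in the atom coordinates of record 155542.
FRACTIONS = ["0000", "5000", "8660", "7321", "5981", "1340", "6200", "3800",
             "4631", "3100", "1350", "1900", "8100", "3291", "5970", "4030"]

def complement_pairs(fractions):
    """Pairs (a, b) with a < b whose fractional parts add up to 1.0000."""
    values = {int(f) for f in fractions}
    return sorted((v, 10000 - v) for v in values
                  if v < 10000 - v and 10000 - v in values)

print(complement_pairs(FRACTIONS))
# The .8660 fraction is cos(30 degrees) rounded to four places.
print(f"{math.cos(math.radians(30)):.4f}")
```

In this one record four such pairs turn up, so a dict could store one member of each pair and derive the other.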
    Last edited by schnaader; 25th December 2012 at 23:28.
    http://schnaader.info
    Damn kids. They're all alike.

  29. #29
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Did you contact Sebastian Deorowicz? He's the creator of the Silesia corpus (SCC) and may be able to get you documentation.

  30. #30
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,253
    Thanks
    304
    Thanked 768 Times in 482 Posts
    A . means no chemical bond. http://www.daylight.com/dayhtml/doc/...ry.smiles.html gives more details on the SMILES format.
