Results 1 to 9 of 9

Thread: a very simple transform for english.dic

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    Smile a very simple transform for english.dic

    Regarding http://www.maximumcompression.com/data/dict.php

    Edited: I originally claimed that best results were achieved by reversing the dictionary; this was a bug; my code makes the assumption that the word-list is sorted in ascending order. I've fixed the claimed results.

    This transformer simply encodes deletes as <

    I'd be most interested in any other results and any improvements!

    This literally has not been thought out, I just wrote it in five minutes because I thought it'd work.

    input:

    Code:
    1080
    10th
    1st
    2
    2nd
    output:

    Code:
    1080<<th<<<st<<<2>nd
    ('>' denotes same stem as previous word)

    Curious fact: had to special-case that "animate" appears twice in the file..

    Program usage (Linux; reports of line-ending problems on Windows):

    Code:
    g++ -o dic1 dic1.cpp
    ./dic1 c < english.dic > english.dic1
    ./dic1 d < english.dic1 > english.dic2
    cmp english.dic english.dic2
    english.dic is 4,067,439 bytes.
    dic1 transform of english.dic is 2,055,914 bytes.

    But the point is not to use it as a stand-alone compressor (although it might well be the very fastest one), but rather to compress its output with a mainstream algorithm:

    Results with some compressors I had laying around:

    Code:
    bbb before: 1,170,539 after: 431,644
    paq8l after: 418,086
    zip (Info-ZIP 3.0 as on Ubuntu) before: 1,049,959 after: 629,219
    paq8hp12any after: 416,783
    And the source:

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <assert.h>
    
    static int common_prefix(const char* a,const char* b) {
        int common = 0;
        while(*a++==*b++)
            common++;
        return common;
    }
    
    int main(int argc,char** args) {
        
        enum {
            LONGEST_LINE = 35, // wc -L says 31
        };
        
        if(2!=argc) {
            fprintf(stderr,"Usage: dic1 [c|d]\n");
            return -1;
        }
        
        if(!strcmp(args[1],"c")) {
            char prev[LONGEST_LINE] = {0};
            int prev_len = 0;
            char word[LONGEST_LINE];
            while(fgets(word,sizeof(word),stdin)) {
                const int len = strlen(word);
                if(!strcmp(word,prev)) {
                    fprintf(stderr,"Oh! %s",word);
                    printf(">");
                    continue;
                }
                const int common = common_prefix(word,prev);
                if(common == (prev_len-2))
                    putchar('>');
                for(int i=prev_len-2; i>common; i--)
                    putchar('<');
                for(int i=common; i<(len-2); i++)
                    putchar(word[i]);
                prev_len = len;
                strcpy(prev,word);
            }
        } else if(!strcmp(args[1],"d")) {
            char word[LONGEST_LINE] = {0};
            int ofs = 0;
            char prev = 0;
            for(int ch = getchar(); ch != -1; prev = ch, ch = getchar()) {
                if('>'==ch) {
                    printf("\r\n");
                    for(int i=0; i<ofs; i++)
                        putchar(word[i]);
                } else if('<'==ch) {
                    if('<'!=prev)
                        printf("\r\n");
                    ofs--;
                    assert(ofs >= 0);
                } else {
                    if('<'==prev)
                        for(int i=0; i<ofs; i++)
                            putchar(word[i]);
                    word[ofs++] = ch;
                    putchar(ch);
                }
            }
            printf("\r\n");
        } else {
            fprintf(stderr,"Error: command \"%s\" not supported\n",args[1]);
        }
        return 0;
    }
    Last edited by willvarfar; 1st March 2010 at 11:48. Reason: fixed bug in code I pasted

  2. #2
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Interesting, a simple transform, good results.

    If i do understand, this transform specifically targets english.dic file structure ?

  3. #3
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    my implementation particularly expects the terms to be \r\n terminated, as english.dic is. That's just an implementation detail, and not part of the algorithm.

    It targets any long list of sorted words.

    Given that it took 5 minutes to write, and that none of the compressors take any advantage of the fact that its encoding a sorted word list, I should imagine it can be substantially improved upon, or even a compressor that internalises that its encoding backwards and forwards commands in addition to characters...

  4. #4
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    872
    Thanks
    457
    Thanked 175 Times in 85 Posts
    Is there a compiled version so we can test it on other word lists?

    This would be very interesting to test.

    Regardds,

    Stephan

  5. #5
    Programmer Jan Ondrus's Avatar
    Join Date
    Sep 2008
    Location
    Rychnov nad Kněžnou, Czech Republic
    Posts
    278
    Thanks
    33
    Thanked 137 Times in 49 Posts
    Quote Originally Posted by Stephan Busch View Post
    Is there a compiled version so we can test it on other word lists?

    This would be very interesting to test.

    Regardds,

    Stephan
    dic1.exe c > output < english.dic
    dic1.exe d > english.dic2 < output
    Attached Files Attached Files

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I'm a bit surprised that this gets any attention, but whatever
    http://nishi.dreamhosters.com/u/dictpack3a.rar
    - dic1.cpp modified for windows compilers
    - english.dic inside
    - BWTmix
    - ancient wordlist filter from dictpack3 kit (see http://compression.ru/sh/2003.htm, in russian)

    Unpack and run the test.bat
    It should print the following:
    Code:
    Results:
    
    4067439 - original wordlist size
    1984035 - preprocessed with dictpack3 filter
    2055914 - preprocessed with willvarfar's dic1 filter
    1156560 - original, compressed with BWTmix v1
     426636 - dictpack3 + BWTmix
     426901 - dic1 + BWTmix
    Note that dictpack3 contains some additional transforms
    which improve compression a bit more (426082 with bwtmix)

  7. #7
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    Smile

    Here's an alternative encoding that uses a high-bit flag to mark the < and > instead of emitting them.

    It doesn't make much difference to BWT compressors and such - still around 428KB.

    I had claimed it broke records with paq8hp12any. I was wrong. I had not appreciated that if I compress more than one file to an archive, paq8hp12any would use the model for one to compress the other and so on; when english.dic2 is the only file in an archive, it is around 412KB.

    The source:

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <assert.h>
    
    static int common_prefix(const char* a,const char* b) {
        int common = 0;
        while((*a!='\r') && (*a==*b))
            a++, b++, common++;
        return common;
    }
    
    int main(int argc,char** args) {
        
        enum {
            LONGEST_LINE = 35, // wc -L says 31
        };
        
        if(2!=argc) {
            fprintf(stderr,"Usage: dic2 [c|d]\n");
            return -1;
        }
        
        if(!strcmp(args[1],"c")) {
            char prev[LONGEST_LINE] = {0};
            int prev_len = 0;
            char word[LONGEST_LINE];
            while(fgets(word,sizeof(word),stdin)) {
                const int len = strlen(word)-2; // strip \r\n line-ending
                assert(len > 0);
                for(int i=0; i<len; i++)
                    assert(0x80>(unsigned char)word[i]);
                const int common = common_prefix(word,prev);
                if(prev_len)
                    putchar(0x80|(prev_len-common));
                for(int i=common; i<len; i++)
                    putchar(word[i]);
                prev_len = len;
                strcpy(prev,word);
            }
        } else if(!strcmp(args[1],"d")) {
            char word[LONGEST_LINE] = {0};
            int ofs = 0;
            char prev = 0;
            for(int ch = getchar(); ch != -1; prev = ch, ch = getchar()) {
                if(0x80&ch) {
                    printf("\r\n");
                    ch &= ~0x80;
                    ofs -= ch;
                    assert(ofs >= 0);
                    for(int i=0; i<ofs; i++)
                        putchar(word[i]);
                } else {
                    word[ofs++] = ch;
                    assert(ofs < (int)sizeof(word));
                    putchar(ch);
                }
            }
            printf("\r\n");
        } else {
            fprintf(stderr,"Error: command \"%s\" not supported\n",args[1]);
        }
        return 0;
    }
    Last edited by willvarfar; 1st March 2010 at 15:42. Reason: Made a fundemental error ... again.

  8. #8
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Here's a compiled Win32 version of dic2.

    EDIT: Seems there's something wrong with the source. Running on english.dic leads to

    Code:
    Assertion failed: len, file dic2.cpp, line 29
    and the output file is empty. Seems to be something platform-specific, one-character words are stored in the array as "x\n" instead of "x\r\n" and so the length check fails as it assumes 2 trailing characters.

    Apart from that, this G++ warning seems appropriate:

    Code:
    dic2.cpp:31: warning: comparison is always true due to limited range of data type
    In this case, it helps to change this line to

    Code:
    assert(0x80>(unsigned char)word[i]);
    Attached Files Attached Files
    Last edited by schnaader; 1st March 2010 at 14:49.
    http://schnaader.info
    Damn kids. They're all alike.

  9. #9
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    thx schnaader, I've fixed the bug and also, on retesting it, discovered something most obvious. I was mistaken to claim that dic2 helps much.

Similar Threads

  1. Simple bytewise context mixing demo
    By Shelwien in forum Data Compression
    Replies: 11
    Last Post: 27th January 2010, 04:12
  2. Kwc – very simple keyword compressor
    By Sportman in forum Data Compression
    Replies: 10
    Last Post: 20th January 2010, 17:06
  3. Simple encryption (RC4 like)
    By encode in forum Forum Archive
    Replies: 37
    Last Post: 26th January 2008, 04:05
  4. rnd - simple pseudo random number generator
    By encode in forum Forum Archive
    Replies: 21
    Last Post: 14th January 2008, 03:41

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •