
Thread: Sometimes data look like random... here's an interesting file:

  1. #1
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    232
    Thanks
    38
    Thanked 80 Times in 43 Posts

    Sometimes data look like random... here's an interesting file:

    http://www.imagecompression.info/gralic/somedata.zip

    ZIP, BZIP2 and RAR fail to compress this file, i.e. the compressed size is bigger than the uncompressed size.
    paq8px_v69 -6 packs it to 99.23%,
    paq8px_v69 -8 packs it to 99.17%,
    thus we can see that the data isn't absolutely random.
    Do you know a program able to compress it to less than 98%?
    I'm sure some future PAQ version will be able to pack it to less than 90%.
    I wonder if a ZPAQ model can be written to compress such files (including this file) to less than 90%.
    Last edited by Alexander Rhatushnyak; 23rd June 2010 at 03:19.

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Well, just for the fun of it I used arb255, which is designed for IID files based on byte symbols.
    It compressed the file to 994,844 bytes, so it's likely not random.

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    990880 for now, with tuned mix_test

  4. #4
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    232
    Thanks
    38
    Thanked 80 Times in 43 Posts
    Quote Originally Posted by Shelwien View Post
    990880 for now, with tuned mix_test
    Less than 890880* with an appropriate model (a rather simple model).
    PAQ8px has a model of this type.
    ---
    *including decompressor size
    Last edited by Alexander Rhatushnyak; 24th June 2010 at 22:12.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, it's nothing surprising with recompression.
    I was only able to see that there are no long bitwise matches,
    and no interesting bytewise context masks (my model ended
    up with 0000 and 00FF and FF00 masks mostly).
    So it's likely not raw data or even a static Huffman code,
    and reconstructing unknown data from an unknown arithmetic code isn't really possible.

    I guess it's MRP output or something, and "PAQ8px has a model of this type" means the jpeg model?

  6. #6
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    232
    Thanks
    38
    Thanked 80 Times in 43 Posts
    You'll need no decompression and no decryption;
    PAQ8px* compressors minus the JpegModel still have a model of this type.

  7. #7
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    232
    Thanks
    38
    Thanked 80 Times in 43 Posts


    Did everyone give up?

    Quote Originally Posted by Shelwien View Post
    990880 for now, with tuned mix_test
    The compressed file size is less than 890000 bytes after I modify a couple of bytes in paq8px_v69.cpp, compile it, and run it with '-6'.
    This takes 96 seconds.
    After I disable all the other models (insert three pairs of #if 0 + #endif, plus an #else + two lines before one of the three #endifs), the compressed size is only 0.23% bigger, but compression now takes 5.5 seconds, and memory usage is less than 20% of the regular usage with '-6'.

    A week ago I didn't know if zpaq plus a special model could compress the file to less than 900000, but since it can run an external program from a cfg file, then yes, it definitely can!
    Last edited by Alexander Rhatushnyak; 30th June 2010 at 02:25.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, I tried a few more things, like splitting it into nibbles and bits, and checking for fixed records.
    Also tried mtf, bwt, and some deltas.
    Can't say that I did all I could, as quite a few parameters have to be bruteforced there, and
    I don't have a pre-built kit for that.
    Also its possible that only some part of the file is actually compressible, and I just didn't see it.

    Still, I suspect that you used more info to choose that model

  9. #9
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    The compressed file size is less than 890000 bytes after I modify a couple of bytes in paq8px_v69.cpp, compile it, and run it with '-6'.
    This takes 96 seconds.
    After I disable all the other models (insert three pairs of #if 0 + #endif, plus an #else + two lines before one of the three #endifs), the compressed size is only 0.23% bigger, but compression now takes 5.5 seconds, and memory usage is less than 20% of the regular usage with '-6'.
    Where is the proof?

  10. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    fv showed a record length of 17. Try this transform for better compression.

    Code:
    #include <stdio.h>
    #include <fcntl.h>
    #include <io.h>   /* setmode() on Windows compilers */
    int main() {
      setmode(0, O_BINARY);  /* binary stdin */
      setmode(1, O_BINARY);  /* binary stdout */
      static unsigned char buf[1000000];
      fread(buf, 1, 1000000, stdin);
      const int N=17; // N=15000001/17 for inverse
      for (int i=0, j=0; i<1000000; ++i) {
        putchar(buf[j]);     /* emit each channel's bytes consecutively */
        j=(j+N)%1000000;     /* gcd(N,1000000)=1, so this is a permutation */
      }
      return 0;
    }
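    To invert the transform, the same program works with the inverse step: 17 × 882353 = 15000001 ≡ 1 (mod 1000000), so N = 15000001/17 = 882353 steps through the bytes in exactly the inverse order of stepping by 17.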

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Also, the more general problem is not computable. For a harder problem, try to compress sharnd_challenge.dat from http://mattmahoney.net/dc/#sharnd
    It was created by a small program, so it is not random.

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    The following ZPAQ model compresses somedata.dat to 877401 in 0.4 seconds. It is a stationary order 0 model that uses offset mod 17 as context.

    Code:
    comp 0 0 0 0 1 (hh hm ph pm n)
      0 cm 14 255 (one context model: 2^14 contexts, limit 255)
    hcomp
      b++ a=b a%= 17 b=a (B cycles through the byte offset mod 17)
      *d=0 hashd (context for the cm = offset mod 17)
      halt
    post
      0
    end
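    Assuming the model is saved as mod17.cfg (an illustrative name), compression would be along the lines of the zpaq command syntax used elsewhere in this thread:

    zpaq cmod17.cfg archive somedata.dat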

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Thanks, I was too lazy to do that

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Another example. The following simple program creates a pseudo-random 1 MB file.

    Code:
    #include <stdio.h>
    #include <fcntl.h>
    #include <io.h>   /* setmode() on Windows compilers */
    int main() {
      setmode(1, O_BINARY);  /* binary stdout */
      static unsigned char buf[1000000];
      for (int i=0; i<55; ++i)        /* seed the first 55 bytes */
        buf[i]=1000000000/(i+3);
      for (int i=55; i<1000000; ++i)  /* xor of the bytes 24 and 55 back */
        buf[i]=buf[i-24]^buf[i-55];
      fwrite(buf, 1, 1000000, stdout);
      return 0;
    }
    I redirect the output to a file named "x" and get the following results:

    Code:
    1,000,000 x
    1,000,021 x.pmm (default options)
    1,000,392 x.zip (-9)
    1,000,813 x.paq8px (v67 -6)
        1,254 x.zpaq
    Can anyone guess the model or find a better one? I know better solutions exist because the above program is 294 bytes.

  15. #15
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    232
    Thanks
    38
    Thanked 80 Times in 43 Posts
    Quote Originally Posted by Matt Mahoney View Post
    fv showed a record length of 17. Try this transform for better compression.
    Or just replace "rlen=2" with "rlen=17" in paq8px_v69.cpp. The XLS file in the Canterbury Corpus has a record length around 16 (see http://corpus.canterbury.ac.nz/descr...rbry/Excl.html ), and one of the images in the Silesia Corpus has 19 planes, so 17 is not so big.
    Quote Originally Posted by Matt Mahoney View Post
    The following ZPAQ model compresses somedata.dat to 877401 in 0.4 seconds
    One more file for the repository of ZPAQ config files. I guess the record length detection problem won't be so easy to solve.

    Quote Originally Posted by Matt Mahoney
    It was created by a small program, so it is not random.
    A relatively small program can receive input data from a microphone or web camera, or collect timestamps when keys are pressed, and use the least significant bits.
    Last edited by Alexander Rhatushnyak; 1st July 2010 at 18:49.

  16. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    A relatively small program can receive input data from a microphone or web camera, or collect timestamps when keys are pressed, and use the least significant bits.
    You know the definition of Kolmogorov complexity is the smallest program that outputs the data but takes no input.
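    (Formally, K(x) = min{ |p| : U(p) = x } for a fixed universal machine U, where the program p takes no input. A program that reads a microphone doesn't describe any particular string by itself, so it gives no bound on the complexity of one recording.)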

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Here's a 57-byte version of that generator.
    Use as "w.com >sample".
    Code:
    00000000: 8CD8                         mov         ax,ds
    00000002: 03C0                         add         ax,ax
    00000004: 8ED8                         mov         ds,ax
    00000006: BF0300                       mov         di,3
    00000009: 66B800CA9A3B                 mov         eax,03B9ACA00
    0000000F: 6699                         cdq
    00000011: 66F7F7                       div         edi
    00000014: 6683FF3A                     cmp         edi,00000003A
    00000018: 7206                         jb          000000020 --↓1
    0000001A: 8A45E8                       mov         al,[di][-18]
    0000001D: 3245C9                       xor         al,[di][-37]
    00000020: 8805                         mov         [di],al
    00000022: B440                         mov         ah,040 ;'@'
    00000024: BB0100                       mov         bx,1
    00000027: 8BCB                         mov         cx,bx
    00000029: 8BD7                         mov         dx,di
    0000002B: CD21                         int         021
    0000002D: 6647                         inc         edi
    0000002F: 6681FF43420F00               cmp         edi,0000F4243
    00000036: 72D1                         jb          000000009 --↑2
    00000038: C3                           retn
    Attached Files
    • File Type: zip w.zip (692 Bytes, 192 views)

  18. #18
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    232
    Thanks
    38
    Thanked 80 Times in 43 Posts
    Quote Originally Posted by Shelwien View Post
    Thanks, I was too lazy to do that
    But not lazy enough to skip translating Matt's C program to assembler.
    To the best of your knowledge, which compressors can detect and use rlen=17 in simpler cases, when the data in each channel are closer to RLE-compressible than to 'looks incompressible'?

    Quote Originally Posted by Matt Mahoney View Post
    You know the definition of Kolmgorov complexity is the smallest program that outputs the data but takes no input.
    I know this definition, and I see what you mean: your program doesn't take input from a microphone, camera, etc. My comment was about the compressed statement "created by a small program is not random".
    BTW, at least one ZPAQ config file has a link to this forum:
    http://www.encode.ru/forum/showthread.php?t=461&page=2 in max_o2_msr.cfg,
    and this link doesn't work, so the config file could be compressed by removing it.
    I don't remember where I downloaded max_o2_msr.cfg from; today I can't find a link to it on http://mattmahoney.net/dc/

    And BTW,
    Quote Originally Posted by Matt Mahoney View Post
    The following ZPAQ model compresses somedata.dat to 877401 in 0.4 seconds.
    this is good, but after detecting the most appropriate model, compression can be as fast as 100 MB/s, i.e. 0.01 seconds for this 1 MB file (if all inappropriate models are switched off... hopefully zpaq will be able to do so, sooner or later)
    Last edited by Alexander Rhatushnyak; 2nd July 2010 at 07:44.

  19. #19
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    And BTW, this is good, but after detecting the most appropriate model, compression can be as fast as 100 MB/s, i.e. 0.01 seconds for this 1 MB file (if all inappropriate models are switched off... hopefully zpaq will be able to do so, sooner or later)
    Optimized compression with switch o is about 4 times faster here (0.69 s instead of 2.58 s; 3.36 s instead of 13.28 s for 5 MB of data), so 10 MB/s and more on faster PCs seems possible with zpaq. Considering how customizable zpaq is and how efficient it is with customized config files, this seems a fair price to pay.

    About random-appearing data that could be compressed better: there's a very simple case, and I'm still surprised that no compressor seems to detect it: reverse matches. For example, generate 500 KB of random data and repeat it reversed.
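    A minimal sketch of such a test file (using C's rand() with a fixed seed purely for illustration; any byte source works):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <io.h>   /* setmode() on Windows compilers */
    int main() {
      setmode(1, O_BINARY);             /* binary stdout */
      static unsigned char buf[500000];
      srand(1);                         /* fixed seed: reproducible pseudo-random half */
      for (int i=0; i<500000; ++i)
        buf[i] = rand() & 255;
      fwrite(buf, 1, 500000, stdout);   /* 500 KB forward */
      for (int i=499999; i>=0; --i)
        putchar(buf[i]);                /* the same 500 KB reversed */
      return 0;
    }
    An LZ matcher that also scanned its window backwards could collapse the second half to almost nothing; a forward-only matcher sees two incompressible halves.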
    Last edited by schnaader; 2nd July 2010 at 13:06.

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I tried to tune the sequential o2-max CM to Matt's data and unexpectedly got some compression.
    It's at 999579 currently, so the gain is clearly greater than the parameter profile size.
    Any ideas on how this could be used for further encryption cracking?

    Btw, this is the mentioned parameter profile.
    There's some interesting symmetry in the mask nibbles.
    Code:
    Index f0x
     f0mask2: pp,   &01000100
     f0mask1: p,    &01100100
     f0mask0: ctx, b&01011110
    
    Index f1x
     f1mask2: pp,   &00000000
     f1mask1: p,    &11111111
     f1mask0: ctx, b&10001100
    
    Index f2x
     f2mask2: pp,   &11111111
     f2mask1: p,    &01110011
     f2mask0: ctx, b&00000000
    
    Index f3x
     f3mask2: pp,   &00000001
     f3mask1: p,    &11000100
     f3mask0: ctx, b&11111110
    
    Index g0x
     g0mask2: pp,   &11111111
     g0mask1: p,    &11111111
     g0mask0: ctx, b&00000000
    
    Index g1x
     g1mask2: pp,   &00000000
     g1mask1: p,    &00000000
     g1mask0: ctx, b&00000000
    
    Index g2x
     g2mask2: pp,   &00111111
     g2mask1: p,    &01110111
     g2mask0: ctx, b&01110110
    
    Index g3x
     g3mask2: pp,   &10110100
     g3mask1: p,    &10000011
     g3mask0: ctx, b&00010000

  21. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > But not lazy enough to skip translation of Matt's C program
    > to assembler

    Yeah. Actually I did check for fixed records there visually,
    and simply didn't see it, so I didn't bother to bruteforce the record length.
    Also, after Matt's post I checked again using his fv, and I don't think
    I'd have noticed it without knowing where to look.
    Unfortunately I don't have a good record size detector either,
    so size optimization of Matt's algo really seemed more interesting.

    > To the best of your knowledge, which compressors can detect
    > and use rlen=17 in simpler cases, when data in each channel
    > are closer to RLE-compressible than to 'looks like
    > incompressible' ?

    I space-padded english.dic to 17-byte records and checked a few compressors.
    Nanozip and slim seemingly detected it, and ccm surprisingly didn't.
    Also bsc found it (so there are likely more bwt coders to check), and
    then there are winrk and freearc and durilca.

    It looks like all the working implementations are based on
    entropy estimation with some simple model (basically bruteforce),
    but imho match structure analysis (like fv) would be better,
    as it allows processing all the data (instead of some small blocks) and
    checking a larger set of record configurations (I mean record sizes
    and fields).
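    For what it's worth, a minimal sketch of the entropy-estimation approach (my own illustration, not taken from any of the compressors above): for each candidate record length, sum the order-0 entropies of the channels and report the estimated size. A real detector would also have to charge for model cost, since splitting into more channels always lowers the measured entropy.

    Code:
    #include <stdio.h>
    #include <math.h>
    #include <fcntl.h>
    #include <io.h>   /* setmode() on Windows compilers */

    /* order-0 entropy in bits/byte of the channel buf[ch], buf[ch+rlen], ... */
    static double channel_entropy(const unsigned char *buf, int n, int rlen, int ch) {
      int cnt[256] = {0}, total = 0;
      for (int i = ch; i < n; i += rlen) { cnt[buf[i]]++; total++; }
      double h = 0;
      for (int s = 0; s < 256; s++)
        if (cnt[s]) { double p = (double)cnt[s]/total; h -= p*log2(p); }
      return h;
    }

    int main() {
      setmode(0, O_BINARY);  /* binary stdin */
      static unsigned char buf[1000000];
      int n = fread(buf, 1, sizeof(buf), stdin);
      for (int rlen = 1; rlen <= 64; rlen++) {   /* candidate record sizes */
        double bits = 0;
        for (int ch = 0; ch < rlen; ch++)        /* bits/byte times bytes in channel */
          bits += channel_entropy(buf, n, rlen, ch) * ((n-ch+rlen-1)/rlen);
        printf("rlen %2d: %9.0f bytes estimated\n", rlen, bits/8);
      }
      return 0;
    }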

  22. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    When I looked at somedata.dat, I ran fv, got the first image below, and saw that the data looked random except for some dark horizontal lines at 17, 34, 51... so I tried the interleave program I posted earlier. It is hard to tell the exact value from the image; I tried 16 first, which didn't work.

    After running the interleave and then fv, I got the second image, which is more interesting. It shows that the file has 17 distinct segments and that 2-grams (red) that occur in any segment do not occur in the segment next to it.

    I still did not know exactly the characteristics of the data so I tried different zpaq models on the interleaved data and found order 0 worked as well as higher orders. So then I created the order 0 + offset mod 17 model for the original file.

    I should have looked at the byte distribution of the interleaved file. If I had, I would have gotten the third image here. The horizontal axis shows position in the file as usual. The vertical axis shows the byte value, with 0 at the bottom and 255 at the top. It shows that in each segment only about half of the alphabet is used (dark pixels). So each byte could be encoded using 7 bits, which is about the compression ratio we got.
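    (Half the alphabet corresponds to log2(128) = 7 bits per byte, i.e. about 875000 bytes for the 1 MB file, close to the 877401 the mod-17 model achieved.)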
    Attached Thumbnails: fv.png (150.3 KB), fv17.png (158.8 KB), vdist.png (65.0 KB)

  23. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > 1,254 x.zpaq
    > Can anyone guess the model or find a better one?

    Btw, was that an o2 cm with buf[i-24] and buf[i-55]
    as context?

  24. #24
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Actually, a stationary order 1 model with buf[i-24]^buf[i-55] as context.

    Code:
    comp 0 6 0 0 1 (hh hm ph pm n)
      0 cm 20 255
    hcomp
      *b=a b++ (store the current byte in the 64-byte buffer M, advance B)
      a=b a-= 24 b=a a-= 31 c=a a=*b a^=*c (A = M[B-24] xor M[B-55])
      *d=0 hashd (context = that xor byte)
      a=b a+= 24 b=a (restore B)
      halt
    post
      0
    end
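    (This context works because, for every position past the 55-byte seed, the next byte equals buf[i-24]^buf[i-55] exactly, so the stationary cm quickly learns to predict it with near-certainty; the 1,254 bytes are mostly the seed bytes, the headers, and early learning overhead.)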

  25. #25
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Here is a zpaq model r.cfg that compresses my pseudo-random 1 MB file to 75 bytes. This is not quite as small as Shelwien's 57 byte program.

    Code:
    (r.cfg)
    comp 0 0 0 6 1
      0 icm 3 (size 2^(3+6))
    hcomp
      halt
    pcomp empty.bat ;
      a= 100 a*= 100 a*= 100 r=a 0 (r0=1000000)
      a*= 100 a*= 10 r=a 1 (r1=1000000000)
      do
        a=b a< 55 if
          c=r 1 a++ a++ a++ c<>a a/=c
        else
          a-= 24 c=a d=*c a-= 31 c=a a=*c a^=d
        endif
        *b=a out
        b++ a=r 0 a==b
      until
      halt
    end
    This uses a transform that is specific to this one file. It uses the preprocessor empty.bat which takes any input file as its first argument and outputs a file of size 0 as its second argument.

    Code:
    copy nul: %2
    The PCOMP section is the same program I posted earlier rewritten in ZPAQL. You can create the file with either of these commands:

    a.exe > x
    zpaq prr.cfg nul: x

    where a.exe is my compiled first program. In the zpaq command, prr.cfg says to run the PCOMP (p) section of r.cfg with nul: as input and x as output. The PCOMP program is run once for each byte of input with that byte in the A register, and then once more with EOF (-1) in the A register. So it is run once because nul: is empty. x should be 1 MB (1000000 bytes).

    To compress:

    zpaq nsicr.cfg x.zpaq x

    The option n says don't store the file name (saves 1 byte), s says don't store the checksum (saves 20 bytes), i says don't store the size as a comment (saves 7 bytes), c says create, r.cfg is the config file above.

    You can decompress with any zpaq compatible program:

    zpaq x x.zpaq out
    unzpaq x x.zpaq out
    zpipe d < x.zpaq > out
    zp x x.zpaq out

    The PCOMP program does something like:

    Code:
      byte M[64] (rotating buffer; indexes wrap mod 64)
      R0 = 1000000 (file size)
      R1 = R0 * 1000 (= 1000000000)
      B = 0
      do
        if B < 55
          M[B] = R1/(B+3)
        else
          M[B] = M[B-24] xor M[B-55]
        output M[B]
        B++
      until B == R0
    The COMP/HCOMP section describes an indirect order 0 model. HCOMP is empty because there is no context to compute. It compresses the PCOMP section (and the empty file that follows) saving 5 bytes. The COMP/HCOMP section is not compressed.

    I fooled around with the code to make it more compressible. For example I used a++ a++ a++ instead of a+= 3 because it compresses 1 byte smaller, even though the program is 1 byte longer before compression.

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    That certainly seems like it took more time than my assembly translation :)
    Anyway, here's a 55-byte version.
    Btw, note that x86 code already has some built-in "compression" - like
    compact representations for constants and addresses, or shortcuts like "cdq"
    (which means edx=(eax<0)?-1:0;)

    Code:
    0000: 43                           inc         bx
    0001: 8CD8                         mov         ax,ds
    0003: 03C0                         add         ax,ax
    0005: 8ED8                         mov         ds,ax
    0007: BF0300                       mov         di,3
    000A: 66B800CA9A3B                 mov         eax,03B9ACA00
    0010: 6699                         cdq
    0012: 66F7F7                       div         edi
    0015: 6683FF3A                     cmp         edi,00000003A
    0019: 7206                         jb          000000021 --↓1
    001B: 8A45E8                       mov         al,[di][-18]
    001E: 3245C9                       xor         al,[di][-37]
    0021: 8805                         mov         [di],al
    0023: B440                         mov         ah,040 ;'@'
    0025: 8BCB                         mov         cx,bx
    0027: 8BD7                         mov         dx,di
    0029: CD21                         int         021
    002B: 6647                         inc         edi
    002D: 6681FF43420F00               cmp         edi,0000F4243
    0034: 72D4                         jb          00000000A --↑2
    0036: C3                           retn
    Attached Files
    • File Type: zip w.zip (691 Bytes, 177 views)

  27. #27
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    A couple of hours to write, I guess. ZPAQ has a lot of header information that takes up extra space. It's not like .com files, where you can write a working 1-byte program. The minimum overhead is 31 bytes. The 75 bytes for the 1 MB pseudo-random file break down as:

    Block header: 17 bytes
    - magic number "zPQ", 3 bytes
    - level, 1 byte
    - version, 1 byte
    - comp/hcomp size, 2 bytes
    - comp (model description), 8 bytes with trailing 0
    - hcomp (context code), 2 bytes with trailing 0
    Segment header: 4 bytes
    - version, 1 byte
    - empty filename, 1 byte for trailing 0
    - empty comment, 1 byte for trailing 0
    - reserved, 1 byte
    Compressed pcomp code, 48 bytes
    End of compressed data tag (0 0 0 0), 4 bytes
    End of segment (no SHA-1), 1 byte
    End of block, 1 byte
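    (Total: 17 + 4 + 48 + 4 + 1 + 1 = 75 bytes.)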

    The compressed pcomp code consists of a 1-byte flag, a 2-byte length, and the 46-byte program, for a total of 49 bytes before compression. It would compress to 44 bytes, except that encoding the EOF bit with probability 2^-32 flushes 4 bytes from the arithmetic coder. The (0 0 0 0) that follows is so that the ends of segments can be found without decompressing (by the l command or single-block extraction); the arithmetic coder never outputs 4 consecutive 0 bytes. The end-of-segment byte indicates whether there is a 20-byte checksum.

  28. #28
    Member
    Join Date
    May 2009
    Location
    CA USA
    Posts
    22
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Can anyone guess the model or find a better one?

    It's a lagged Fibonacci generator. It forms a pseudorandom sequence of period 2^55 - 1.
    That specific generator was proposed by George Marsaglia in 1984, published in Comp. Sci. and Statistics: Symposium on the Interface 16, pages 3-10.

    Knuth gives it as a good example of a lagged Fibonacci generator in TAOCP 3.2.2.
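    (Each bit position independently follows b[n] = b[n-24] xor b[n-55], i.e. an LFSR with the trinomial x^55 + x^24 + 1; since that trinomial is primitive over GF(2), any nonzero bit column has period 2^55 - 1.)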

  29. #29
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Yeah, I got it from Knuth. I also use it for the rnd() function in PAQ.

  30. #30
    Member
    Join Date
    Jul 2009
    Location
    Moscow
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Open somedata in CoolEdit as raw 8-bit PCM,
    and zoom in till you can see the byte values as a "function".
    You can easily see a pattern: the points are quite near each other; most of the time every 2nd (sometimes 3rd) byte is close, forming a noisy curve.
    Last edited by exact; 25th December 2010 at 04:21.

