
Thread: Compressing continuous data stream

  #1 - CompressMaster (Member, Slovakia)

    Hello users, please give me a detailed explanation...

    I am looking for the BEST algorithm for compressing a continuous data stream.
    To be exact, it's not a pure continuous stream, since "Q" characters are randomly mixed with the letter "q" (yes, a specific pattern can be seen there, but only at long distances). But I suppose that capitalization will be irrelevant to compression algorithms, and therefore I will be able to compress my file with a high ratio.
    My sample is attached below.

    Additional details:
    Only the letter "q" can be replaced with something else (yes, I've also tried spaces and special characters)... but not with "ABCDEF0123456789" or with "Q", since the aforementioned letter acts as a separator and the capital letter represents... I don't want to say what.

    It's not complicated to count all the characters in a text document and express them in a much simpler form, like 1320(Q), where the number is the repetition count of the given character... the real problem is how to properly distinguish uppercase characters from lowercase ones.

    I don't think any further simplification, and thus better compression, is possible... I expect at least 98% space saving.
    As far as I know, RLE itself is limited to 255 repetitions, since each run length is stored in a single byte.
    I also know about more sophisticated neural-network-powered compressors like PAQ, CMIX, EMMA and their derivatives (PAQ8PXPRE etc.).
    But I am afraid there isn't a compression algorithm that doesn't distinguish between "Q" and "q"; each is handled as a separate character no matter whether it's a capital letter or not.
    Furthermore, I am unable to use the characters 0-9, A-F, or Q/q at all for the simplification.

    After decompression, the output must be EXACTLY the same as the input, of course.

    Please DO NOT OPEN THE FILE IN NOTEPAD! It takes 20 minutes or so. Use another editor instead.
    So, is there any algorithm or, better, software (a Windows GUI preferably) that's capable of achieving what I want?

    Thanks a lot for your help.

    Best regards,
    CompressMaster
    Attached Files
    Last edited by CompressMaster; 20th July 2018 at 10:30.

  #2 - Shelwien (Administrator, Kharkov, Ukraine)
    Here's what I've got:
    Code:
    3,539,138 txt
      442,393 bin           // bit.exe c only-O.txt only-O.bin
      222,989 txt.7z        // 7z a -mx=9 -myx=9 1 only-O.txt
      203,392 bin.7z        // 7z a -mx=9 -myx=9 2 only-O.bin
      185,541 txt.paq8px149 // paq8px_v149.exe -7 only-O.txt
      180,506 bin.paq8px149 // paq8px_v149.exe -7 only-O.bin
      179,928 bin.cmix      // cmix_v15 -c only-O.bin only-O.cmix
      202,074 txt.pa        // 7z.exe a -m0=bwt3h:mt4:c=4M -m1=cdm 1.pa only-O.txt
      179,703 bin.pa        // 7z.exe a -m0=bwt3h:mt4:c=4M -m1=cdm 1.pa only-O.bin
    More like 95% though, not 98. But at least cdm beats paq by an impressive margin :)
    Attached Files

  #3 - Member, Estonia
    Code:
    176,671  only-O.txt.paq8pxd50 //paq8pxd_v50 -s7
    181,769  1.paq8pxd50          //bit.exe -> paq8pxd_v50 -s7
    KZo


  #4 - Shelwien (Administrator, Kharkov, Ukraine)
    Nice, thanks.
    Though still (1-176671/3539138)*100=95.008%

  #5 - Gonzalo (Member, Argentina)
    I couldn't beat paq8pxd myself, but I got pretty close using much faster programs. See below (only results under 200,000 bytes included).

    Code:
    94.95%        178745        only-O.txt.bbb
    94.94%        178957        only-O.txt.bwtmix
    94.94%        179100        only-O.txt.bcm
    94.92%        179633        only-O.txt.bssc
    94.87%        181553        only-O.txt.nz.cdm
    94.86%        181918        only-O.txt.mcompw.cdm
    94.85%        182136        only-O.txt.nz
    94.84%        182480        only-O.txt.mcompw
    94.83%        183060        only-O.txt.fazip (grzip) 
    94.70%        187467        only-O.txt.ppmonstr
    94.67%        188546        only-O.txt.ccmx
    94.67%        188793        only-O.txt.cmm
    94.65%        189398        only-O.txt.bee
    94.42%        197456        only-O.txt.glza
    94.38%        199055        only-O.txt.7z (ppmd)
    94.35%        199860        only-O.txt.fazip (dict+ppmd)
    About the algorithms, BWT is the shining jewel here, by far.

    A few unrelated observations:
    * bit.exe only helps the weaker algorithms, never the top entries.
    * cdm also improves the output of just a few of the programs. Of this list, only NanoZip's defaults and mcomp -mw.
    Last edited by Gonzalo; 20th July 2018 at 00:05.

  #6 - CompressMaster (Member, Slovakia)
    Ok, and what about compressing the following data?

    QQQQQQQQ
    QQQQ
    QQQQQ
    QQQQQQ
    QQ
    Q
    Q
    QQQQQ
    QQQQ
    QQQ

    Instead of the "q" separator, the separation will be handled line by line with newlines.
    But I suppose a newline is still a character, like blank spaces, that I need to compress...

    In that form, is it possible to express this as a continuous stream of data?
    Can I compress that data to a form such as "39Q", regardless of the newline separation? I need the smallest possible output I can get.

    So, is there a way to achieve this?

    Thanks a lot.

    Best regards,
    CompressMaster

  #7 - Gonzalo (Member, Argentina)
    No, there is not. Blank spaces might appear as if they don't exist, but they are information just like any other character, and a lossless compressor always needs to store them. Otherwise, the decompressed output is not the same as the input.
    Your example can be expressed as 39Q, but somewhere the other characters, blank or not, must be stored too.
    Otherwise, how would you reconstruct your paragraph from 39Q alone?

  #8 - Member, Italy
    These are the best results I found for cmv 0.2.0:
    179932 only-O.txt.cmv-m2,3,0x30a07112
    179780 only-O.bin.cmv-m1,0,0x38a201d0

  #9 - Member, Worldwide
    You can build "TurboBench Compression Benchmark" and test different compressors and algorithms with your own files.

    Results for the file only-O.txt, on a Skylake CPU at 3.4 GHz:

    Code:
          C Size  ratio%     C MB/s     D MB/s   Name                          (bold = pareto) MB = 1,000,000
          179082     5.1      22.65      25.56   bsc 0e2         bwt  
          179093     5.1      11.69       7.93   bcm             bwt
          183230     5.2      31.84      85.35   lzturbo 59      bwt
          185298     5.2       0.46       0.46   zpaq 5          context-mixing
          189047     5.3      11.35      73.54   bzip2           bwt
          202161     5.7       0.78     421.08   lzturbo 49      lz77
          204150     5.8       0.89    3429.40   lzturbo 39      lz77 
          216187     6.1       0.33    1471.58   brotli 11       lz77
          217636     6.1       1.94    2187.35   zstd 22         lz77  
          223650     6.3       1.46     286.71   lzma 9          lz77
          253230     7.2       1.03     717.15   zlib 9          lz77
         3539142   100.0   27867.23   28541.44   memcpy          

  #10 - Kennon Conrad (Member, Bothell, Washington, USA)
    Since the only-O.txt file contains only runs of 1 to 16 O's, each followed by a single o, you can do slightly better by converting the file to a list of run lengths minus 1, written in hex.

    Code:
    95.06%        174780        only-O.runs.cmix (version 15d)
    The conversion program is attached: "runs c only-O.txt only-O.runs" to convert, "runs d only-O.runs only-O.txt" to revert. only-O.runs is 400,838 hex symbols (it could be 200,419 bytes at 2 hex symbols per byte).
    Attached Files

  #11 - cfeck (Member, Germany)
    It's a JPEG image...

  #12 - Shelwien (Administrator, Kharkov, Ukraine)
    Huh, it really is...
    Attached Files

  #13 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Shelwien View Post
    Huh, it really is...
    Yeah, and it's much easier to see the patterns in the file. It seems to be mostly random run lengths, except for a few long stretches of run length 1 toward the beginning, and a fairly big block of alternating runs of length 1 and 3, with an occasional run length of 11 instead of 3. I tried a little with LZ77 but didn't do any better than 200K; it helped GLZA a lot though, to within a few KB of cmix with the transformation. LZ77 might do better if runs wrote two run lengths per byte; I think it could then see a lot of repeating bytes where the alternating runs of length 1 and 3 are. I can't imagine it would save much more than 1-2 KB though. We're never going to reach a 98% reduction with this much random-ish data, unless there's an unknown pattern that becomes known, plus a custom compressor. I see a few other patterns, but nothing generally obvious.

  #14 - Gotty (Member, Hungary)
    Well done, Kennon Conrad.

  #15 - Gotty (Member, Hungary)
    Quote Originally Posted by CompressMaster View Post
    Only letter "q" can be replaced with something else (yes, I´ve also tried with spaces and special characters)... but not with "ABCDEF0123456789" nor with "Q", since the aforementioned letter acts as a separator and the big letter represents... I don´t want to say what.
    CompressMaster, your secret is revealed. So it was a test: how long would it take for us to find the photo? Nice.

    Now, tell us: is the photo a random choice, or do you have a message for us?
    Last edited by Gotty; 28th July 2018 at 15:09.

  #16 - Gotty (Member, Hungary)
    Code:
    cmd> paq8px_v151.exe -9 2.jpg
    Filename: 2.jpg (200419 bytes)
    Block segmentation:
     0           | jpeg             |    200419 bytes [0 - 200418]
    -----------------------
    Total input size     : 200419
    Total archive size   : 132827
    Quote Originally Posted by CompressMaster View Post
    I expect at least 98% space saving.
    It's now 96.24%.
    Last edited by Gotty; 28th July 2018 at 13:40.

  #17 - Gonzalo (Member, Argentina)
    Quote Originally Posted by Shelwien View Post
    Huh, it really is...
    Sweet! Now, I guess somehow runs.exe doesn't work.
    The inverse command
    Code:
    runs d 1 Only-O.txt_
    produces a file 46,847,552 bytes long (should be 3,539,138 )

    Am I doing something wrong, or is it a bug?


    The funny thing is that RLE actually helped after all... This is an updated table:


    Code:
    177693        only-O.txt.mrle.bcm
    177726        only-O.txt.mrle.bwt
    178585        only-O.txt.mrle.bbb
    178745        only-O.txt.bbb
    178957        only-O.txt.bwt
    179100        only-O.txt.bcm
    179633        only-O.txt.bssc
    180294        only-O.txt.mrle.nz
    180907        only-O.txt.mrle.bssc
    181553        only-O.txt.nz.cdm
    181918        only-O.txt.mcompw.cdm
    182136        only-O.txt.nz
    182480        only-O.txt.mcompw
    183060        only-O.txt.grzip
    187467        only-O.pmm
    188546        only-O.txt.ccmx
    188793        only-O.txt.cmm
    189398        only-O.txt.bee
    197456        only-O.txt.glza
    199055        only-O.txt.7zx
    199860        only-O.txt.dict+ppmd
    199895        only-O.txt.dict+lzp+ppmd

    I suspect Turbo RLE should do even better, at least according to the in-memory test.

    I wonder if this transform could serve other purposes, like helping to compress multimedia or other high-entropy files. Remember mp3tobmp?
    Using only runs and bcm, I managed to beat even paq8 (without its JPEG awareness, obviously) with far fewer resources.

    Code:
    176,605  1.bcm
    176,671  only-O.txt.paq8pxd50 //paq8pxd_v50 -s7
    Last edited by Gonzalo; 29th July 2018 at 19:08.

  #18 - Kennon Conrad (Member, Bothell, Washington, USA)
    Quote Originally Posted by Gonzalo View Post
    Sweet! Now, I guess somehow runs.exe doesn't work.
    The inverse command
    Code:
    runs d 1 Only-O.txt_
    produces a file 46,847,552 bytes long (should be 3,539,138 )

    Am I ding something wrong, or is it a bug?
    It is a bug. Shelwien changed the output format of runs for the conversion compared to what I posted, but not for the deconversion. The version of runs I posted wrote one byte per run, with the run length minus 1 printed in hex. Shelwien changed it to 4 bytes per run length, essentially encoding one bit per byte as 'o' or 'O', which is what his bit program expects. Where the deconversion code is this:

    Code:
        while (i < insize) {
          uint8_t j, run_length;
          if (in_buf[i] < 'a')
            run_length = in_buf[i] - '0' + 1;
          else
            run_length = in_buf[i] - 'a' + 11;
          for (j = 0 ; j < run_length ; j++)
            fputc('O', fd_out);
          fputc('o', fd_out);
          i++;
        }
    it should be more like this to support the new format:

    Code:
        while (i < insize) {
          uint8_t run_length = 0;
          if (in_buf[i] == 'O')
            run_length += 8;
          if (in_buf[i + 1] == 'O')
            run_length += 4;
          if (in_buf[i + 2] == 'O')
            run_length += 2;
          if (in_buf[i + 3] == 'O')
            run_length += 1;
          uint8_t j;
          for (j = 0 ; j < run_length ; j++)
            fputc('O', fd_out);
          fputc('o', fd_out);
          i += 4;
        }
    I don't have a C# compiler, so I can't compile it though.

    I think you may get better compression with the original runs program than with Shelwien's runs program. With GLZA, I get 180,282 bytes when I compress Shelwien's "1" file and 178,420 bytes with only-O.txt.runs (my version).

  #19 - Sportman (Member, Planet Earth)

  #20 - Gotty (Member, Hungary)
    Shelwien's code is C++, not C#.

  #21 - Shelwien (Administrator, Kharkov, Ukraine)
    Ok, here's a complete version:

    Code:
    #include <stdio.h>

    int main( int argc, char** argv ) {
      if( argc<4 ) return 1;
      int c,d,f_DEC = argv[1][0]=='d';

      FILE* f = fopen( argv[2], "rb" ); if( f==0 ) return 1;
      FILE* g = fopen( argv[3], "wb" ); if( g==0 ) return 1;

      if( f_DEC ) {
        while((c=getc(f))>=0) for( d=16879+511*(c/16)-32*c; d>16; d/=32 ) fputs("OOOOOOOOOOOOOOOOo"+d%16,g);
      } else {
        for( d=16; !feof(f); d=(d-1)*(1+15*(getc(f)=='o'))+2 ) if( d>999 ) putc(d/16-2,g),d=16;
      }

      return 0;
    }
    Attached Files

  #22 - Gonzalo (Member, Argentina)
    I was thinking about why those transforms helped so much with this specific file. For example, jpg>bit>bcm is 10.3% smaller than jpg>bcm. A nice boost, if only it could be reproduced.
    So I tried the same with at least two dozen files, and guess what? Nada. Always smaller without preprocessing.

    Any clues as to why? Why does it seem like the only file that benefits from the transforms is the image in this thread?

  #23 - Bulat Ziganshin (Programmer, Uzbekistan)
    The reason is how this file was made. Probably BlackBerry doesn't have good compression for JPEGs. Try converting the file to BMP and then saving it as JPEG again, to make sure it gets recompressed by your software.

  #24 - Gonzalo (Member, Argentina)
    Quote Originally Posted by Bulat Ziganshin View Post
    The reason is how this file was made. Probably, BlackBerry doesn't have a good compression for jpegs. Try to convert this file to bmp and then save as JPG again to ensure that it will be recompressed by your software.
    I'm not sure I follow. Bad encoding should mean easier compressibility with most algorithms, yes, but why did this particular transform help on this particular file and no other?
    In particular, I'm talking about this:

    jpg>bit>bcm is 10.3% smaller than jpg>bcm
    So, yes, bcm compressed the jpg, and that's because the image was saved with not much optimization, maybe by a quick-and-dirty hardware coder. OK. But why did it compress even better after Shelwien's bit.exe, while no other file did? That is my question.

  #25 - Bulat Ziganshin (Programmer, Uzbekistan)
    1. BCM is essentially a bit-level coder, predicting each bit based on the ~16 previous ones. With the original data, BCM was limited to predicting each bit based on only the last 1-2 bits.
    2. For the other files, there is nothing to predict, since the data are incompressible. So BCM essentially ends up converting each byte into one bit, and that's all.

    You can check this idea by converting any compressible file (e.g. enwik6) into a sequence of 0/1 bytes and then compressing it with BCM, comparing the compression results for the original and the converted data. Then try the same with an incompressible file.

  #26 - CompressMaster (Member, Slovakia)
    Let me elaborate a little...

    Well, first of all, these "Q"s truly are a hexadecimal representation. It has nothing to do with binary AND other files, of course. They're pure text data...

    Yes, I chose the range 0-16 because I supposed that for computers this hex representation would be much more compressible. I could use much longer patterns, of course, but for compressors in general I chose a maximum length of only 16 characters. I could also have used a decimal representation (0-9) with far fewer repetitions, but I used the hexadecimal charset instead. Or must I significantly increase the repetitions of these Q's, as stated in the attached file? That's not a problem, of course...


    But now, back to DEFLATE.
    I have looked at the LZ77+Huffman explanation at https://zlib.net/feldspar.html
    But the LZ77 algorithm itself uses a very small sliding-window buffer: 32K (only the previous 32,768 characters are known).

    My questions:
    1. Is it possible to directly alter the sliding-window buffer size and thus increase the known match distance in LZ77? I mean 8192K or more...
    2. If yes, are there Windows implementations (GUI preferably) with advanced settings?
    3. If not, is there a more sophisticated algorithm?
    4. I've read the passage about the LZ77 compressor - "Blah blah blah blah blah!" - What if my input text data consists of "Aaaa aaaa Aaa Aaaa AAAAAaaa AAAAAaaa AAAAAAAAAAaaa AAaaaaa"? Is it possible to compress it to something like "Aaaa a[D=58,L=58]" regardless of capitalization? And what if the input data contained only "aaaaAaaaaAaaaAaaaaAaaaaaaaaAaaaaaaaaAaaaaaaaaaaaaaAaaaaaaaA"? If not, is there an algorithm that can do that?
    5. Now back to RLE... if I have the string "aaaaAaaaaAaaaAaaaaAaaaaaaaaAaaaaaaaaAaaaaaaaaaaaaaAaaaaaaaA", is RLE (or some more sophisticated algorithm out there) able to express this stream as 59'A? I'm afraid the positions of these A's would most likely be lost. Is there a way to preserve the exact position of this letter? The decompressed output must exactly match the input, of course...

    Many thanks.
    Attached Files

  #27 - Shelwien (Administrator, Kharkov, Ukraine)

  #28 - Gonzalo (Member, Argentina)
    You have very good answers to all these questions in our previous posts, "CompressMaster". Have you read them? I mean, really read them? If you did, you clearly skipped a very good part of the content.
    Please read the answers, and pay close attention, because they cover even more than what you asked for.

    Have a nice day!

