
Thread: Integer compression

  1. #1
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post

    Integer compression

    Hi,

    I need help compressing big files containing integers. Values range from 20 to 230. I have tried many compressors, but none is good enough for me (including TurboPFor, but I need an open-source solution). I am thinking of writing my own compressor. Can you tell me which techniques would be best in this case?

  2. #2
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    408
    Thanks
    35
    Thanked 60 Times in 37 Posts
    "compressing big files contains integers. Values are from 20 to 230"

    Are you looking for fast compression, fast decompression,
    or do you want good compression?

    What about FastPFor? https://github.com/lemire/FastPFor

    What about lzo? http://www.oberhumer.com/opensource/lzo/
    What about bsc? http://libbsc.com/
    What about zpaq? http://mattmahoney.net/dc/zpaq.html

    Have you tested these compression programs? (All are open source.)
    Results?
    Example file?

    best regards

  3. #3
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    Thank you for your reply. I tried FastPFor (its Java port); the results were quite good, until I found out that the compressed data is stored in an integer array. Each integer is 32-bit (depending on the system), so when I want to write the result to a file, its size is 4x bigger.

  4. #4
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    591
    Thanks
    186
    Thanked 183 Times in 111 Posts
    Are these 20-230 numbers correlated?
    If not, and each of them uses 1 byte, FSE would be perfect:
    https://github.com/Cyan4973/FiniteStateEntropy
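    For instance, a minimal one-shot sketch (assuming the FSE_compressBound(), FSE_compress() and FSE_isError() entry points from the library's fse.h; the file name and the skipped error handling are placeholders):
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include "fse.h"   /* from https://github.com/Cyan4973/FiniteStateEntropy */

    int main(void) {
        FILE *fin = fopen("sample.bin", "rb");   /* placeholder file name */
        fseek(fin, 0, SEEK_END);
        long n = ftell(fin);
        rewind(fin);

        unsigned char *src = malloc(n);
        unsigned char *dst = malloc(FSE_compressBound(n));
        fread(src, 1, n, fin);
        fclose(fin);

        /* one-shot order-0 entropy coding of the whole byte buffer */
        size_t clen = FSE_compress(dst, FSE_compressBound(n), src, n);
        if (FSE_isError(clen)) return 1;
        /* clen == 0 reports "not compressible", clen == 1 a single repeated byte */

        printf("%ld -> %zu bytes\n", n, clen);
        free(src); free(dst);
        return 0;
    }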

  5. #5
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    413
    Thanks
    42
    Thanked 141 Times in 100 Posts
    You can test different file types and integer compression schemes with TurboPFor.

    You can use icbench to test different generated Zipfian distributions.
    Ex:

    "./icbench -a1.5 -m20 -M230 -b64K -n100m"

    Use "-b64K" instead of the standard block size = 128 integers.

    Please upload some samples and provide the corresponding link.
    Last edited by dnd; 25th October 2015 at 22:18.

  6. #6
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    Thanks for replies,

    Here is one sample of the files I need to compress. dnd, great work with TurboPFor, it's a really great compressor, but unfortunately I cannot compile it, I can only run the binary, so if somebody could run the test for me, that would be great. I also want to ask whether it is possible to write a better compressor than the others if we use it just for this concrete case. Thanks.
    Attached Files

  7. #7
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    413
    Thanks
    42
    Thanked 141 Times in 100 Posts
    Thanks for your nice words.

    Quote Originally Posted by Senoble View Post
    Values range from 20 to 230
    What is the format of the sample file?
    Is your sample file delta coded? There are a lot of "0x7f" bytes in your file, and the values are not in your specified range.
    If you are storing 1 byte per integer, you can subtract "0x7f" from each value, then apply integer compression (see the sketch below).
    "Integer compression" schemes are better suited to lists where most of the integer values are small (small integers or sorted arrays).

    Otherwise,
    You can test almost all popular or fast compressors with TurboBench for Windows and Linux (see also TurboBench).
    Ex.
    "turbobench -ebrotli,11/lzma,9/lzham,4/lzturbo,39/lzturbo,49"

    Test with different file sizes.


    Note that, when applicable, "integer compression" like TurboPFor can be a lot faster (>10 times) than general-purpose compressors.

    Last edited by dnd; 26th October 2015 at 12:48.

  8. #8
    Member
    Join Date
    Dec 2012
    Location
    Russia
    Posts
    11
    Thanks
    25
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Senoble View Post
    Here is one sample of the files I need to compress. Thanks
    I think we can use a simple delta between the current and previous value here, plus integer compression for the values that are close to zero (as sketched below).
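    A minimal sketch of such a delta transform (my illustration, independent of any particular codec; the byte arithmetic wraps mod 256, so decoding is exact):
    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Replace each byte by its difference from the previous one: values that
       hover around a common level turn into deltas clustered near zero. */
    void delta_encode(uint8_t *buf, size_t n) {
        uint8_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            uint8_t cur = buf[i];
            buf[i] = (uint8_t)(cur - prev);   /* wraps mod 256 */
            prev = cur;
        }
    }

    void delta_decode(uint8_t *buf, size_t n) {
        uint8_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            prev = (uint8_t)(prev + buf[i]);  /* running sum restores the input */
            buf[i] = prev;
        }
    }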

  9. #9
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    Sorry for the confusion. The data is stored in the file in hexadecimal coding. This is just one example out of many files. All of them are in the range 20 to 230.

    you can subtract "0x7f" from each value, then apply integer compression


    Value 127 is the most common one, so yes, I was thinking about that, but what about negative values? Aren't they bad for integer compression?

  10. #10
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    413
    Thanks
    42
    Thanked 141 Times in 100 Posts
    Well, your file is already transformed and the integers are stored as 1 byte (instead of 4 bytes as expected).
    You can't do much better using integer compression.
    For more compression, it is better to use an LZ77 algorithm with entropy coding.


    Your sample file benchmarked with TurboBench:
    Code:
                        size        ratio%     C MB/s     D MB/s  
         3670016        1273689     34.7        1.49       48.71    lzturbo 49      
         3670016        1297988     35.4        0.29      155.91    brotli 11        
         3670016        1322150     36.0        2.18       53.18    lzma 9           
         3670016        1425407     38.8        1.84      140.79    lzham 4          
         3670016        1539641     42.0        1.95      788.09    lzturbo 39
    The best "integer compression" ratio is obtained with the codec "TurboPFor" in the "TurboPFor" package.
    When integers are stored as 32-bits (with sub 127)

    blocksize = 64K
    Code:
                 size    ratio%   Bits/I      C MB/s     D MB/s
              2969003     20.22     6.47     1335.04     5567.12    TurboPFor
             14680064    100.00    32.00     5960.54     6006.56    Copy (32 bits integers. 14680064=3670016*4 bytes)
    "2969003" is corresponding to only 80,9% compression ratio relative to the file size "3670016" of the "sample.txt" file
    Last edited by dnd; 26th October 2015 at 21:01.

  11. The Following User Says Thank You to dnd For This Useful Post:

    Senoble (26th October 2015)

  12. #11
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    151
    Thanks
    19
    Thanked 66 Times in 37 Posts
    Does the order of each integer matter in your files?

  13. #12
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    dnd, thank you for your time. If I understand it correctly, I can compress a few integers into one 32-bit integer, but when I want to write it out, it will be 4 times bigger anyway.

    Does the order of each integer matter in your files?
    Yes, it does matter. Are you thinking about putting it in order? I tried compressing this file in blocks which are in order, but it wasn't a good "model", so the compression ratio was poor.

  14. #13
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    151
    Thanks
    19
    Thanked 66 Times in 37 Posts
    Your file does not appear to have many predictable sequences. If you want better compression (e.g. under 1 MB), you would need to categorize each integer by its source. By utilizing knowledge of the system that generates the integers, you can outperform standard/generic models.

  15. #14
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    413
    Thanks
    42
    Thanked 141 Times in 100 Posts
    Quote Originally Posted by Senoble View Post
    dnd, thank you for your time. If I understand it correctly, I can compress a few integers into one 32-bit integer, but when I want to write it out, it will be 4 times bigger anyway...
    No, you simply store the output BYTE buffer using the compressed length (the return value, outlen).


    Integer compression example:
    Code:
  #include <stdio.h>
  #include <stdint.h>

  char bytebuf[N]; unsigned char outbuf[N];  /* N = maximum input size */
  uint32_t intbuf[N];
  FILE *fin  = fopen("sample.txt",  "rb");
  FILE *fout = fopen("sample.txtz", "wb");
  size_t i, n = fread(bytebuf, 1, N, fin);
  for(i = 0; i < n; i++) intbuf[i] = (uint8_t)bytebuf[i]; /* widen each byte to 32 bits */
  int outlen = icompress(intbuf, n, outbuf);              /* returns the compressed length in BYTES */
  fwrite(outbuf, 1, outlen, fout);                        /* write outlen bytes, not 4*n */
  fclose(fin); fclose(fout);
    other compression libraries (ex. brotli):
    Code:
  #include <stdio.h>

  char bytebuf[N], outbuf[N];
  FILE *fin  = fopen("sample.txt",  "rb");
  FILE *fout = fopen("sample.txtz", "wb");
  size_t n = fread(bytebuf, 1, N, fin);
  int outlen = compress(bytebuf, n, outbuf);  /* the library's compress call */
  fwrite(outbuf, 1, outlen, fout);            /* again, only outlen bytes are written */
  fclose(fin); fclose(fout);
    The output file length is exactly the compressed length, not 4 times bigger.
    Last edited by dnd; 26th October 2015 at 22:57.

  16. #15
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Senoble, did you create the example.txt file yourself?

  17. #16
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    Hi Skymmer. What's your point? I don't know why it matters, but in theory I can say yes.

  18. #17
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    I'm not implying anything, but your file looks already preprocessed to me. I could be wrong, but its structure somehow reminds me of some sort of RLE.

    EDIT: Just out of curiosity, I packed example.txt with cmix-v6 and the result is 1 214 872 bytes (with dictionary), 1 214 899 bytes (without dictionary).
    Last edited by Skymmer; 27th October 2015 at 00:58. Reason: cmix results added

  19. #18
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    Definitely not. I chose this file on purpose because it's one of those that are harder to compress.

    EDIT: Here's another example, really easy to compress. Do you have any ideas about which techniques I should use to compress this kind of file? Thanks.
    Attached Files
    Last edited by Senoble; 27th October 2015 at 01:17.

  20. #19
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    There are a lot of options for this file. For example, symbol ranking will give you 26 bytes (depending) for this file (example2.txt).

    EDIT: 11 bytes is possible with RLEC cx
    Last edited by Skymmer; 27th October 2015 at 03:34. Reason: added rlec cx

  21. #20
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    178
    Thanks
    84
    Thanked 101 Times in 73 Posts
    example2.txt has 3,017,568 values of 128 followed by 62,624 values of 0: 0 isn't in the range 20-230.
    Code:
    1.163.244 cmv c -m2,3,0x00ab65ff example.txt example.txt.cmv
          159 cmv c -m1,3 example2.txt example2.txt.cmv

  22. #21
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    413
    Thanks
    42
    Thanked 141 Times in 100 Posts
    What is your use case?
    Are general-purpose LZ77 compression libraries not appropriate for it?
    Are your files block-based? If yes, what is the block size?
    If you still want to use "integer compression", you can sort the symbols by frequency and use that mapping in the "integer compression".
    This will work well for low-cardinality files, as in the sketch below.
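    A sketch of that frequency-sorted remapping (my own illustration, not code from the TurboPFor package): count each symbol, give the smallest codes to the most frequent symbols, and store the 256-byte inverse table alongside the compressed data.
    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Remap bytes so the most frequent symbol becomes 0, the next 1, ...
       The 256-entry inverse table must be stored with the output. */
    void freq_remap(const uint8_t *in, uint8_t *out, size_t n, uint8_t inverse[256]) {
        size_t count[256] = {0};
        uint8_t order[256], map[256];
        for (size_t i = 0; i < n; i++) count[in[i]]++;
        for (int s = 0; s < 256; s++) order[s] = (uint8_t)s;
        /* selection sort of the 256 symbols by descending frequency */
        for (int a = 0; a < 256; a++) {
            int best = a;
            for (int b = a + 1; b < 256; b++)
                if (count[order[b]] > count[order[best]]) best = b;
            uint8_t t = order[a]; order[a] = order[best]; order[best] = t;
        }
        for (int r = 0; r < 256; r++) { map[order[r]] = (uint8_t)r; inverse[r] = order[r]; }
        for (size_t i = 0; i < n; i++) out[i] = map[in[i]];  /* apply the mapping */
    }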
    Last edited by dnd; 27th October 2015 at 12:39.

  23. #22
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    Mauro, you're right, sorry. The correct file ends where the 0s start, so it contains only the value 128, which is really easy to compress. dnd, I just need better compression. The files I have uploaded are quite easy to compress, but I have bigger files where I don't achieve good compression.

    I have an idea to make the array ordered by adding some coefficient to it. For example, we can sort the array [2,3,4,1,2,6] by adding the coefficient i*10, where i is the array index: [2,13,24,31,42,56]. I just need a formula for the lowest possible coefficient. I came up with the arithmetic mean rounded up, but it's not very efficient. Any suggestions?
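    For what it's worth, the smallest such coefficient can be computed directly rather than estimated: a[i] + i*c is non-decreasing exactly when c >= a[i-1] - a[i] for every i, so the minimal c is the largest drop between consecutive elements (a sketch, assuming integer input; for [2,3,4,1,2,6] it gives c = 3, i.e. [2,6,10,10,14,21]):
    Code:
    #include <stddef.h>

    /* Smallest non-negative c such that a[i] + i*c is non-decreasing:
       a[i] + i*c >= a[i-1] + (i-1)*c  <=>  c >= a[i-1] - a[i],
       so c is the largest drop between consecutive elements. */
    int min_coefficient(const int *a, size_t n) {
        int c = 0;
        for (size_t i = 1; i < n; i++)
            if (a[i-1] - a[i] > c) c = a[i-1] - a[i];
        return c;
    }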

  24. #23
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    I want to ask whether Elias gamma, variable-byte, Rice encoding, etc. are used only for "in-memory" compression, or whether they can also be efficient for compression "to file". Thanks.

  25. #24
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    274
    Thanks
    4
    Thanked 23 Times in 16 Posts
    Variable-length codes are optimal prefix codes if the numbers are uncorrelated and follow a certain distribution. Golomb codes, for example, are optimal in the case of a geometric distribution and are often used in image/audio compression coders.
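    For illustration, a minimal bit-level sketch of a Rice coder (the power-of-two special case of a Golomb code: the quotient v >> k is written in unary, followed by the k low remainder bits; my own sketch, not taken from any particular codec, and the final byte flush is omitted for brevity):
    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Minimal bit writer: accumulates bits MSB-first, flushes whole bytes. */
    typedef struct { FILE *f; uint32_t acc; int nbits; } BitWriter;

    static void put_bit(BitWriter *w, int bit) {
        w->acc = (w->acc << 1) | (bit & 1);
        if (++w->nbits == 8) { fputc((int)w->acc, w->f); w->acc = 0; w->nbits = 0; }
    }

    /* Rice code, parameter k: quotient in unary (q ones and a zero),
       then the k low bits of v. Optimal for a matching geometric distribution. */
    static void rice_encode(BitWriter *w, uint32_t v, int k) {
        for (uint32_t q = v >> k; q > 0; q--) put_bit(w, 1);       /* unary quotient */
        put_bit(w, 0);                                             /* terminator */
        for (int b = k - 1; b >= 0; b--) put_bit(w, (v >> b) & 1); /* remainder */
    }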
