
Thread: Best compression algorithm for a sequence of incremental integers

  1. #1
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts

    Best compression algorithm for a sequence of incremental integers

    I have a huge array of incremental integers and I need to compress them.

    All numbers are unique and progressively increasing, but some numbers are cut off from the sample, so it's impossible to express them simply as the range 1 - 6000.

    I've read a post at stackoverflow.com that says:
    First, preprocess your list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). This should in your case give mostly a sequence of ones, which can be compressed much more easily by most compression algorithms.
    And what if I kept a separate text document containing the missing numbers? I don't think this approach could lead to a 100% compression ratio.

    I expect at least 99.95% space savings - or 100% if it's possible, which I doubt.

    Thanks a lot.
    CompressMaster
    Attached Files

  2. #2
    Member JamesWasil's Avatar
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    33
    Thanks
    29
    Thanked 6 Times in 6 Posts
    If I'm understanding what you're asking, you want to compress the integers in the text file, but it's not possible to represent them as an incremental loop from 0 to 6000 because there are numbers missing?

    What I would probably do is get the missing numbers first and use a 1-bit flag to represent the end markers between the digits, while representing the rest as 3.32 bits per digit. The total would be 4.32 bits per digit, but that way the special decompressor only needs to see the flags and read the numbers it has to add back (and know where to add them), rather than storing all of the sequential numbers it doesn't need. If you already know which numbers aren't there and the range, you can make a really small footprint to model around it.

    The 1 bit flag can be 0 if you're at the last digit of a number, or 1 if it is a digit that follows another. That way you can separate and have it know the difference between 20, 200, or 2000 if you have a series of digits that are 2020020200020220. When these digits are merged together, there'd be no way to tell where one starts and the other ended unless you used a 1 bit flag or had another method to differentiate them.
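
    A minimal sketch of that flag idea (an illustration only, not necessarily the exact layout meant above): each decimal digit is written as a plain 4-bit nibble plus a 1-bit "another digit follows" flag. The ~3.32 bits per digit mentioned above would come from entropy-coding the digits instead of spending a whole nibble; the nibble just keeps the sketch short. The put_bits helper is hypothetical.

    Code:
    #include <stdint.h>
    #include <stdio.h>
    
    /* Hypothetical helper: append the low 'bits' bits of 'value' to a
       zero-initialized byte buffer, MSB-first. */
    static void put_bits(uint8_t *buf, size_t *bitpos, unsigned value, int bits)
    {
        for (int i = bits - 1; i >= 0; i--) {
            size_t p = (*bitpos)++;
            if ((value >> i) & 1)
                buf[p / 8] |= (uint8_t)(0x80 >> (p % 8));
        }
    }
    
    /* One number = its decimal digits, each as a 4-bit nibble followed by a
       1-bit flag: 1 = more digits follow, 0 = this was the last digit. */
    static void encode_number(uint8_t *buf, size_t *bitpos, unsigned n)
    {
        char digits[16];
        int len = snprintf(digits, sizeof digits, "%u", n);
        for (int i = 0; i < len; i++) {
            put_bits(buf, bitpos, (unsigned)(digits[i] - '0'), 4);
            put_bits(buf, bitpos, i + 1 < len ? 1u : 0u, 1);
        }
    }

    With this framing the concatenation of 20, 200 and 2000 decodes unambiguously, because the 0-flag marks the end of each number.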

    Alternatively, you can keep the 1 bit flags in a separate variable (string variable for example) and convert them to binary when done. That may be more advantageous because you can represent the missing digits as 3.32 bits per symbol still, but you can then statistically arrange them or use BWT and then MTF (move to front) on the digits to assign the smallest binary codes to those most frequently occurring.

    That would introduce more complexity than you may want to do, because you'd have to move the binary flag and digit together as a pair every time you did a sort (same with the MTF of the digit afterward; you'd have to move the binary flag to the front along with the digit each time to make sure it stays together).

    Initially, I'd opt for storing the 1 bit flags in string 1, use a dictionary compression method on the missing digits that are grouped together as string 2, and then for the final stage use an arithmetic encoder on the result to see what I got.

    I haven't passed over integers.txt, but do you have a list of the integers which aren't there to attach instead? If you do, it'll be faster to analyze than writing a program to look for the ones that are missing.

  3. #3
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I'd recommend writing the data as a series of 0/1 bits representing the presence of each value. Then the whole file would be a series of 751 0xDD bytes, and you can just run-length encode it and achieve the 99.95% compression ratio.
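
    A rough sketch of that bitmap approach (an illustration only; it assumes the values are non-negative and the maximum is known, and leaves the RLE/entropy stage to whatever compressor you like):

    Code:
    #include <stdint.h>
    #include <stdlib.h>
    
    /* Build a presence bitmap: bit i is set iff value i occurs in the input.
       For the first sample (every 4th value missing) the bytes come out as a
       long run of 0xDD, which run-length or entropy coding squeezes away. */
    static uint8_t *make_bitmap(const unsigned *values, size_t count, unsigned max_value)
    {
        size_t nbytes = max_value / 8 + 1;
        uint8_t *bitmap = calloc(nbytes, 1);
        if (!bitmap)
            return NULL;
        for (size_t i = 0; i < count; i++)
            bitmap[values[i] / 8] |= (uint8_t)(0x80 >> (values[i] % 8));
        return bitmap;   /* feed this to an RLE or general-purpose compressor */
    }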

  4. #4
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Is the attached sample a true representation of what is to be compressed?
    Every 4th integer is missing (missing: 2, 6, 10, 14, ...). You don't need to compress anything - it can be generated:
    Code:
    for (int i = 0; i <= 6000; ++i) if ((i % 4) != 2) printf("%d\n", i);

  5. #5
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
    @JamesWasil

    Unfortunately, I don't have the list of missing numbers.

    @Kennon Conrad

    You mentioned that my data should be written in the form of 0/1 bits. But my current sample does not contain a "simple" pattern that can be easily followed and thus reconstructed. Would it still be possible to compress my sample by 99.95% regardless of that?
    Could I ask you to do that? I don't know how that data can be compressed down.

    @Gotty

    Of course, that was an oversimplified sample. The correct sample is attached below (see much_more_integers.txt).
    I noticed earlier that there was a pattern that could easily be followed and thus reversed, as you described. But, unfortunately, my new sample is very different from the first one.
    Attached Files

  6. #6
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by CompressMaster View Post
    @JamesWasil
    @Kennon Conrad

    You mentioned that my data should be written in the form of 0/1 bits. But my current sample does not contain a "simple" pattern that can be easily followed and thus reconstructed. Would it still be possible to compress my sample by 99.95% regardless of that?
    Could I ask you to do that? I don't know how that data can be compressed down.
    Your data does not need to be written in any particular format; you can do whatever you want with it. Usually the simplest representation compresses best, so I'd say you are making things much more difficult for the compressor to recognize the data pattern by writing the sequence as a series of text integers rather than just writing a 0 or 1 for each possible value. In general you will not be able to reduce the file by 99.95%; that is only possible when there is a simple pattern like in your first sample. For random data, it would take on the order of 1 bit per possible integer value to represent the data.

  7. #7
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    @Compressmaster:
    Sooo, what are these integers? What do they represent?
    Give us more insight and we can help you better.

  8. #8
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Without knowing more, other than that some values are missing but most are not:

    1. Turn it around - record missing ones and drop the observed ones.
    2. Replace N by delta of previous N.
    3. Compress the numbers using a basic entropy encoder, maybe with zig-zag encoding for variable sized integers first if the scale is too large.

    It's possible the missing values aren't random though - eg clustered in certain regions - in which case more advanced modelling will help.

    Edit: actually, with a good enough entropy encoder, you can ditch the inversion in step 1 too. All those "1" deltas will compress away just as well.

    I get your example file to 2729 bytes.
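
    For reference, a minimal sketch of steps 2-3 (an illustration only, assuming 32-bit values; a LEB128-style varint stands in for the entropy coder, which would do better):

    Code:
    #include <stdint.h>
    #include <stdio.h>
    
    /* Zig-zag maps signed deltas to unsigned: 0,-1,1,-2,2,... -> 0,1,2,3,4,... */
    static uint32_t zigzag(int32_t v) { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
    
    /* Step 3 stand-in: write one value as a varint, 7 data bits per byte. */
    static void put_varint(FILE *out, uint32_t v)
    {
        while (v >= 0x80) { fputc((int)(v & 0x7F) | 0x80, out); v >>= 7; }
        fputc((int)v, out);
    }
    
    static void encode(FILE *out, const int32_t *values, size_t count)
    {
        int32_t prev = 0;
        for (size_t i = 0; i < count; i++) {
            put_varint(out, zigzag(values[i] - prev));   /* step 2: delta */
            prev = values[i];
        }
    }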

  9. #9
    Member cfeck's Avatar
    Join Date
    Jan 2012
    Location
    Germany
    Posts
    50
    Thanks
    0
    Thanked 17 Times in 9 Posts
    There are 8130 numbers present out of 23001. Assuming a random distribution with the computed probability that a number is present, this needs 21555 bits (2695 bytes) according to Shannon. Add four bytes for the counts and you could get this test file down to 2699 bytes. Assuming a non-random distribution requires one to model the data, and since the result in this case is more than 2700 bytes (tested using paq8px), I am pretty sure the test file is either randomly generated or is the result of already-compressed data.
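
    For anyone who wants to reproduce that bound, here is the quick calculation (assuming, as above, that each position is present independently with the observed probability):

    Code:
    #include <math.h>
    #include <stdio.h>
    
    int main(void)
    {
        double present = 8130.0, total = 23001.0;
        double p = present / total;
        /* Shannon: each of the 23001 positions is a biased coin flip */
        double h = -p * log2(p) - (1.0 - p) * log2(1.0 - p);   /* bits per position */
        printf("%.0f bits = %.0f bytes\n", h * total, ceil(h * total / 8.0));
        return 0;   /* prints roughly 21555 bits, 2695 bytes */
    }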

  10. The Following User Says Thank You to cfeck For This Useful Post:

    Gotty (12th October 2018)

  11. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    2709 with cdm.
    I kept intermediate files (bitmaps etc), for reference.
    http://nishi.dreamhosters.com/u/integers_0.rar

  12. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    JamesWasil (12th October 2018),xinix (12th October 2018)

  13. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I created GearEnc for this type of data; it's not the best, but it can be better than RAR and 7-Zip:
    https://encode.ru/threads/545-GearEn...hlight=gearenc

    It's best to use the command line with Delta enabled.
    Attached Files

  14. #12
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    Quote Originally Posted by CompressMaster View Post
    I have a huge array of incremental integers and I need to compress them.

    All numbers are unique and progressively increasing, but some numbers are cut off from the sample, so it's impossible to express them simply as the range 1 - 6000.

    I've read a post at stackoverflow.com that says:


    And what if I kept a separate text document containing the missing numbers? I don't think this approach could lead to a 100% compression ratio.

    I expect at least 99.95% space savings - or 100% if it's possible, which I doubt.

    Thanks a lot.
    CompressMaster
    If you want performance, look into Daniel Lemire's integer compression libraries, like this one: https://github.com/lemire/streamvbyte

    Depending on your use case, you might not need compression if you can generate the data algorithmically. That's what Gotty was getting at, and why people want to know what the data represents. As a simple example of what I mean, if your data was the Fibonacci sequence, you wouldn't need compression – you'd just use code to generate it.
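
    If you do go the library route, usage is roughly like the sketch below (based on a reading of the streamvbyte README; double-check the exact entry points against the current header). Sorted, increasing integers are normally delta-encoded first, since small gaps are what StreamVByte compresses well:

    Code:
    #include <stdint.h>
    #include <stdlib.h>
    #include "streamvbyte.h"   /* https://github.com/lemire/streamvbyte */
    
    size_t compress_sorted(const uint32_t *values, uint32_t n, uint8_t **out)
    {
        /* Delta-encode first: sorted input -> small gaps -> short byte codes. */
        uint32_t *deltas = malloc(n * sizeof *deltas);
        uint32_t prev = 0;
        for (uint32_t i = 0; i < n; i++) { deltas[i] = values[i] - prev; prev = values[i]; }
    
        /* Conservative output bound: at most 4 data bytes per value plus one
           control byte per 4 values, so 5*n (+ slack) always fits. */
        *out = malloc((size_t)n * 5 + 16);
        size_t bytes = streamvbyte_encode(deltas, n, *out);   /* compressed size */
        free(deltas);
        return bytes;
    }

    Decoding is the mirror image: streamvbyte_decode() back into the delta array, then a running sum restores the original values.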

  15. #13
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
    As for the integer generation process, the numbers were first generated as an incremental loop from 0 to 25,000. Then I used an online service to get (randomness) - I'm using this term carefully because, in general, it is of course not possible to compress random data.

    Secondly, I used a sorting algorithm to sort all the numbers in ascending order.

    Lastly, I got all the missing numbers using an online service (see much_more_integers_2.txt).

    @JamesB

    1. Turn it around - record missing ones and drop the observed ones.
    see "much_more_integers_2.txt"

    "I get your example file to 2729 bytes."
    Could you upload your sample?

    Thanks.
    Attached Files
    Last edited by CompressMaster; 14th October 2018 at 19:46.

  16. #14
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    856
    Thanks
    45
    Thanked 104 Times in 82 Posts
    I am by no means an expert in compression.

    But I'm not sure why the delta preprocessing suggested on Stack Overflow can't be used at the integer level to begin with.

    - All I see is a list of ever-increasing integer samples, with only small jumps and no falls - perfect for delta preprocessing, in my opinion.
    - Then there is some human formatting I don't know if you want to retain (newlines etc.), but let's assume you do for this simple test. The table below lists each value followed by its delta from the previous one:

    Code:
    4	4
    6	2
    7	1
    8	1
    9	1
    10	1
    13	3
    15	2
    17	2
    18	1
    22	4
    24	2
    28	4
    29	1
    30	1
    31	1
    34	3
    35	2
    37	2
    38	1
    40	2
    41	1
    42	1
    45	3
    46	1
    47	1
    50	3
    51	1
    53	2
    54	1
    55	1
    56	1
    58	2
    59	1
    60	1
    62	2
    63	1
    64	1
    65	1
    66	1
    67	1
    69	2
    As you can see, each of your integer samples is now condensed into a single byte.
    Heck, if the increments stay in the 1-4 range, you only need 2 bits to encode each sample, and you don't need a sample separator either (i.e. you can remove the newline bytes as well), since you know every 2 bits is one sample.
    This should get you beyond a 75% reduction compared to the raw text file.
    Then look into some simple Huffman coding if you need more, and you should be home free with a simple/fast/pretty efficient compressor (thanks to the nature of the data).
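
    A rough sketch of that packing (an illustration only; it works only if every delta really is 1-4, so anything larger needs an escape code or a wider field):

    Code:
    #include <stdint.h>
    #include <stddef.h>
    
    /* Pack deltas of 1..4 into 2 bits each (stored as delta-1, i.e. 0..3),
       four samples per byte. The first value is stored separately, 'out'
       must start zeroed, and the caller has verified all deltas are 1..4. */
    static size_t pack_deltas(const unsigned *values, size_t count, uint8_t *out)
    {
        unsigned prev = values[0];
        size_t bits = 0;
        for (size_t i = 1; i < count; i++) {
            unsigned d = values[i] - prev - 1;          /* 1..4 -> 0..3 */
            out[bits / 8] |= (uint8_t)(d << (6 - bits % 8));
            bits += 2;
            prev = values[i];
        }
        return (bits + 7) / 8;                          /* bytes used */
    }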


    P.S.
    You cannot compress something by 100%.
    Something cannot come from nothing.

    P.P.S.
    The reason you can compress this very easily even though you started with "random" input is that once you sorted it, you threw away data and removed some of the randomness (the order of the samples).
    You are no longer trying to recreate the random data.

    P.P.P.S
    If you need fast seeking into this data, you can make chunks of delta-compressed blocks of around 20-50 integers each,
    so that if you need, say, integer number 207 in the list, you go to the start of chunk 4 and then read and add the delta values until you hit the 7th sample in that block.

    P.P.P.P.S
    Your integer sample would be down to only 3717.75 bytes from just the delta filtering and bit reduction.
    That's 96% compression, and you haven't even applied any real compression yet.
    Last edited by SvenBent; 16th October 2018 at 11:47.

  17. #15
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    @CompressMaster

    Your posts from your different threads.
    (Emphasis by me.)

    Quote Originally Posted by CompressMaster View Post
    3.1 Both questions are correct. I started with random hexadecimal characters generated by myself written in one text file
    Quote Originally Posted by CompressMaster View Post
    This number has been generated randomly by myself.
    Quote Originally Posted by CompressMaster View Post
    By typing RANDOMLY numbers, of course.
    Quote Originally Posted by CompressMaster View Post
    Description: Each of text file consist of the same character (in this case it's "A", but it can be replaced with any other symbols from ASCII table) that has been generated RANDOMLY by myself, but it this case randomness will be irrelevant, because it contains only one letter.
    Quote Originally Posted by CompressMaster View Post
    Then I used an online service to get (randomness) - I'm using this term carefully because, in general, it is of course not possible to compress random data.
    And our answers:

    Quote Originally Posted by Gotty View Post
    Did you really generate random content? Random content is incompressible, you probably know that.
    Quote Originally Posted by Bulat Ziganshin View Post
    I bet that all these approaches are going around random data compression
    Quote Originally Posted by Gotty View Post
    If you generate something random, please don't ask us how to compress it.
    Quote Originally Posted by cfeck View Post
    I am pretty sure the test file is either randomly generated, or is the result of already compressed data
    Can you draw a conclusion?

  18. #16
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts
    Hey all,

    How about compressing an array of 256 Very Large Ints (BigNums or infinite-precision integers), i.e. a frequency table for 8-bit symbols?

    This is hard to compress, I think, since said frequency table is already the output of a compression algorithm (which was fascinating at first).
    Last edited by compgt; 6th February 2019 at 15:17.

  19. #17
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts
    Anyway, the compression algorithm is best understood if you "visualize" a bar chart or histogram where new symbol frequencies are always trying to become greater than the current highest frequency, which we increment by its delta with the new symbol's frequency. The new highest frequency becomes the new symbol's frequency; or, put simply, the new symbol must always have the highest frequency. So at most, the new highest frequency can only "double", i.e. grow by 1 bit in length. (In decoding, the symbol with the highest frequency is the symbol to decode; this means it is stack-based. We add the delta to the highest frequency during encoding so we can preserve, i.e. get back to, the previous frequency of the symbol when decoding.) The output is actually the frequency table, which is easy to compress or generate?

    Algorithm pseudo-code:

    Code:
    #include <stdio.h>
    
    /* Frequencies are conceptually arbitrary-precision integers (BigNums);
       a fixed-width type is used here only to keep the sketch compilable. */
    unsigned long long freq[256], max;
    
    void build_table(FILE *infile)
    {
        int i, c;
        /* initialize frequency table */
        for (i = 0; i < 256; i++) freq[i] = i + 1;
        max = freq[255];
    
        while ((c = getc(infile)) != EOF) {
            /* push the new symbol's frequency above the current maximum by
               the same distance it was below it, so it can be undone later */
            freq[c] = max + (max - freq[c]);
            max = freq[c];
        }
    }

    No "runs" of a single character allowed in the input, as much as possible. "Random data" indeed.

    New or recalled observations:

    1. This algorithm ironically "expands" the frequencies at first? LOL

    We're back to the early days of information theory or data compression history!

    2. The bombshell: it takes more than 1 added bit to encode very small frequencies that suddenly must become the maximum. The solution might be to "swap" them, but this requires new information or codes. This is back to delta coding. (sigh)

    3. But a total cycling of the frequency table might work...

    4. Instead of 8-bit bytes, use 4-bit symbols;

    ***

    This is similar, I think, to WEB Technologies' algorithm as featured in BYTE magazine in 1992 and noted by the comp.compression FAQ:
    "WEB, in fact, says that virtually any amount of data can be
    squeezed to under 1024 bytes by using DataFiles/16 to compress
    its own output multiple times."

    I think they were using or playing with a frequency table too, 256 32-bit frequencies = 1K.

    They might have had to output the MSBit of the highest frequency, the result of which may equal another byte's frequency (or frequencies)?

    That's why they had the problem of 4 numbers in a matrix being equal, a rare case in their algorithm.

    Just maybe.

    (Ideally, there would be at most a 1-bit increase in the frequency of the output or new symbol, but the bombshell precludes that. If they are of the same bit size, then there is only a 1-bit increase in the new maximum frequency.

    The current symbol always has the highest frequency.

    You decode backwards, from last symbol to first; the symbol with the highest frequency is the current symbol.

    One parameter in decoding is the famed file_size().
    )

    The problem with the algorithm is that the emitted frequency table could be very large, due to the very large frequencies, if you implement it by really using BigNums or BigInts;
    you then have to compress that very large frequency table.

    Maybe, to achieve compression, you can just consider the MSBit after the arithmetic (addition) operation.
    Or the solution is nearly just MTF (you have to output the character that *doubled*, i.e. whose MSBit was activated).

    WEB Technologies' DataFiles/16 algorithm is clearly designed for compression of *random* data, and recursively at that, which is futile indeed.
    Last edited by compgt; 17th May 2019 at 10:58. Reason: Added Algorithm pseudo-code.

  20. #18
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts
    > 4. Instead of 8-bit bytes, use 4-bit symbols;

    How about 2-bit (base-4) symbols? Or, maybe even better, a data source of base-3 symbols?

  21. #19
    Member
    Join Date
    Sep 2018
    Location
    Philippines
    Posts
    29
    Thanks
    12
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by compgt
    Ideally, at most 1 bit increase in frequency of output or new symbol, but the bombshell precludes that. If they are of the same bitsize, then only 1 bit increase in the new max frequency.
    To solve the bombshell, i.e. the presence of "anomalous" symbols, you must have a smooth frequency distribution, or the frequencies must be of the same bit size as much as possible. There are many ways to do this, but they must be reversible. For example, you could maybe XOR the bytes of the data source first (this is reversible), pre-whitening it before encoding. Am I correct here?

    And, like in Huffman coding, you can contrive or generate a file that is exactly perfect or suitable for this algorithm. What would be interesting is that the generated file would be "random-appearing" data, perhaps indeed incompressible by known compressors.

    (See the post above; it now has pseudo-code for easier understanding.)
    Last edited by compgt; 18th May 2019 at 08:55. Reason: To contrive or generate a file
