Results 1 to 22 of 22

Thread: Can this be compressed better?

  1. #1
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Question Can this be compressed better?

    Hey guys, so I have an array of unsorted unsigned integers with 1000 unique items between 0 and 1000. (Basically a shuffled array from 0-1000)
    I've tried many ways to compress it, so far the best I've got is 1524 bytes with lz4 compression.
    I've compared to lz4hc, zstd0-22, brotli0-11, rANS and fastpfor (compressed int array)


    The point of this is to send an array of unique ids not greater than 1000 in preserved order across the network.


    Can this be compressed better? I appreciate the help guys!

    edit:
    One more thing I should add, the client on the other side already has the exact same values in an array just in a different order, could I possibly just inform them of the order that they are in? Possibly with a hash and have the client sort the array accordingly? I was also trying out cuckoo filters but they are lossy and I'd prefer lossless techniques.
    Last edited by Warvstar; 15th December 2016 at 16:56.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Try this - http://encode.ru/threads/2597-Stegan...y-permutations
    You'd be able to generate the sorted array on decoding side, so stegdict can be used as a permutation coder.

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    do it's always a permutation of a range, or it may have less elements than the range size?

  4. #4
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    0-1000 values can be encoded in 10 bits.
    Sending 1000 value of 10 bits gives you 10000 bits, hence 1250 bytes.
    If everything else is random, that's likely the simplest way to reach this performance.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    @Cyan: stegdict actually encodes the permutation, so result should be around Log[256.,1000!] = 1067 bytes

  6. #6
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Cyan View Post
    0-1000 values can be encoded in 10 bits.
    Sending 1000 value of 10 bits gives you 10000 bits, hence 1250 bytes.
    If everything else is random, that's likely the simplest way to reach this performance.
    Doh! I can't believe I didn't think of that, it's actually how I was packing some other data.

    Quote Originally Posted by Shelwien View Post
    @Cyan: stegdict actually encodes the permutation, so result should be around Log[256.,1000!] = 1067 bytes
    Interesting! I'm going to look into this more in the morning.

    One more thing I should add, the client on the other side already has the exact same values in an array just in a different order, could I possibly just inform them of the order that they are in? Possibly with a hash and have the client sort the array accordingly? I was also trying out cuckoo filters but they are lossy and I'd prefer lossless techniques.

    This forum is awesome, thanks for the quick replies guys!

  7. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    then you need to send only permutation, it seems exactly what Eugene utility does

  8. #8
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    do it's always a permutation of a range, or it may have less elements than the range size?
    It could have less than the full range, anywhere from 0-1000 values. Also I made an addition to the original post, the client shares the same list of possible items/values, they just needs to know in what order the server has the items.

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Here's an example - http://nishi.dreamhosters.com/u/stegdict_demo_v0.rar; 1067 bytes as i said.

  10. #10
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    then you need to send only permutation, it seems exactly what Eugene utility does
    I'm having trouble finding this, can you tell me where I can find Eugene utility?

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    TLDR syndrome I guess. Eugene is my name.
    https://github.com/Shelwien/stegdict = source
    http://nishi.dreamhosters.com/u/stegdict_demo_v0.rar = demo specifically made for you

  12. #12
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    The arrays are known to both the client and the server, so you can send the differences between the two arrays by using gamma ZigZag encoding.
    In the ideal case (no differences) you will send 1000/8 = 125 bytes(1 bit per item). It is also possible to combine this with Run Length Encoding.
    You can use a flag to switch between this encoding and full array transfer.

  13. #13
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    TLDR syndrome I guess. Eugene is my name.
    https://github.com/Shelwien/stegdict = source
    http://nishi.dreamhosters.com/u/stegdict_demo_v0.rar = demo specifically made for you
    Oi haha, oops. I was thinking that might be the case. Your demo does appear to have better compression than what I was able to achieve; however, I'm hesitant to use stegdict (GPL) because I'm not sure the license I will release my project under yet, it may very well be closed source as I'm developing a PC game.

    Impressive though, as it would shave 16kbps off my network data, topping my dataoutput for 1000 units to 114kbps.

    Edit: A little bit extra details wont hurt I guess, the array are the id's, I send the update 10 times a second (100kbps packing the bits, 84kbps if I can get the permutations working). All the other unit data (deltas) I can fit into 33kbps.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. Its my own project so license is not really an issue

    2. You probably don't need the whole of it for your task, as stegdict purpose was a bit more complicated, especially v1 one.
    Basically you just need a rangecoder (simpler one than stegdict's rc128 would do if you don't need to support permutations of millions of items),
    then you can sort the list and encode the indexes with rc - with STL it probably can be done in like 5 lines of code.
    (Encode the position of first sorted list item in unsorted list (0..N-1), delete it from unsorted list,
    encode the position of 2nd item in unsorted list (0..N-2), delete it, etc)

    3. Its possible to add extra compression, as stegdict considers all permutations to have equal probability.
    The most obvious idea is to encode the indexes with a bitwise model, like in lzma etc.

    4. Actually you can also try writing item indexes to a file as 16-bit binary numbers, then compress the file with lzma lp1 pb1

  15. #15
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by dnd View Post
    The arrays are known to both the client and the server, so you can send the differences between the two arrays by using gamma ZigZag encoding.
    In the ideal case (no differences) you will send 1000/8 = 125 bytes(1 bit per item). It is also possible to combine this with Run Length Encoding.
    You can use a flag to switch between this encoding and full array transfer.
    That sounds like somethig I'vee tried before but I was not able to do it efficiently; however, now that I know what its called I might have a better chance, I'll look into this again. Thanks

  16. #16
    Member
    Join Date
    Aug 2016
    Location
    Paris
    Posts
    14
    Thanks
    5
    Thanked 28 Times in 6 Posts
    Just to make it simple, consider that in order to encode your list, you first have an array of 1001 unique values 0..1000, and that for each position you permute it with one of the next entries that contain the value you want. You just have to remember which permutations you perform, for each position N you only have 1001-N possible permutations. The position you permute with requires less bits as the position progresses, so if you only encode the positions, ideally you'd encode 1001*1000*999*998...*2*1 possibilities = 1001, hence the log256(1001!) size max. For ease of implementation you could decide that you use 10 bits for the first half, then 9, then 8, then 7 etc.. If your values are often partially sorted, you may end up with some constant position offsets and it could make sense to use a variable-length encoding for the positions so that the most common ones take less space.

  17. #17
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks for the help guys, the best I've gotten so far is 1,193 bytes, I'm going to get back to this after the weekend.

  18. #18
    Member cfeck's Avatar
    Join Date
    Jan 2012
    Location
    Germany
    Posts
    50
    Thanks
    0
    Thanked 17 Times in 9 Posts
    With a truncated binary encoding of the positions I got 1072...1077 bytes (less than 1% over entropy). Decoder is left as an exercise. Here, 'bc' is any bit encoder class that implements https://en.wikipedia.org/wiki/Truncated_binary_encoding

    #define COUNT 1000
    #define MIN_VAL 1
    #define MAX_VAL 1000

    int array[COUNT]; // initialize with random permutation of values 1...1000 somewhere

    int available = COUNT;
    for (int val = MIN_VAL; val <= MAX_VAL; ++val) {
    int i;
    int marked = 0;
    for (i = 0; i < COUNT; ++i) {
    if (array[i] < val) ++marked;
    if (array[i] == val) break;
    }
    if (available > 1) {
    bc.writeTruncated(i - marked, available);
    }
    --available;
    }

  19. #19
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by cfeck View Post
    With a truncated binary encoding of the positions I got 1072...1077 bytes (less than 1% over entropy). Decoder is left as an exercise. Here, 'bc' is any bit encoder class that implements https://en.wikipedia.org/wiki/Truncated_binary_encoding

    #define COUNT 1000
    #define MIN_VAL 1
    #define MAX_VAL 1000

    int array[COUNT]; // initialize with random permutation of values 1...1000 somewhere

    int available = COUNT;
    for (int val = MIN_VAL; val <= MAX_VAL; ++val) {
    int i;
    int marked = 0;
    for (i = 0; i < COUNT; ++i) {
    if (array[i] < val) ++marked;
    if (array[i] == val) break;
    }
    if (available > 1) {
    bc.writeTruncated(i - marked, available);
    }
    --available;
    }
    Nice thanks for this! I'm going to loom into this when I get back home tonight[.

  20. #20
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by dnd View Post
    The arrays are known to both the client and the server, so you can send the differences between the two arrays by using gamma ZigZag encoding.
    In the ideal case (no differences) you will send 1000/8 = 125 bytes(1 bit per item). It is also possible to combine this with Run Length Encoding.
    You can use a flag to switch between this encoding and full array transfer.
    Can you tell me what I might be doing wrong/overlooking here. Using this code below my array compresses to 1600 bytes. Maybe this technique is not good for completely random orders?

    //First create a sorted array with 0-1000 (the same array as on the client)
    unsigned int value = 1000;
    std::vector<uint16_t> unitdatasorted(value);
    for (unsigned int i = 0; i < value; ++i) {
    unitdatasorted[i] = i;
    }


    //Then create a shuffled array with 0-1000 (the one we want to send to client)
    std::vector<uint16_t> unitdata(value);
    unitdata = Shuffled(unitdatasorted); //This just returns a shuffled array
    //Then find differences between the original / perform delta encoding
    std::vector<int16_t> unitdatadelta(value);
    for (unsigned int i = 0; i < value; ++i) {
    unitdatadelta[i] = unitdata[i] - unitdatasorted[i];
    }
    //Then use zigzag encoding to turn signed to unsigned


    std::vector<uint16_t> unitdataencoded(value);
    for (unsigned int n = 0; n < value; ++n) {
    unitdataencoded[n] = (unitdatadelta[n] << 1) ^ (unitdatadelta[n] >> 15);
    }

  21. #21
    Member
    Join Date
    Dec 2016
    Location
    Canada
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by cfeck View Post
    With a truncated binary encoding of the positions I got 1072...1077 bytes (less than 1% over entropy). Decoder is left as an exercise. Here, 'bc' is any bit encoder class that implements https://en.wikipedia.org/wiki/Truncated_binary_encoding

    #define COUNT 1000
    #define MIN_VAL 1
    #define MAX_VAL 1000

    int array[COUNT]; // initialize with random permutation of values 1...1000 somewhere

    int available = COUNT;
    for (int val = MIN_VAL; val <= MAX_VAL; ++val) {
    int i;
    int marked = 0;
    for (i = 0; i < COUNT; ++i) {
    if (array[i] < val) ++marked;
    if (array[i] == val) break;
    }
    if (available > 1) {
    bc.writeTruncated(i - marked, available);
    }
    --available;
    }
    Nice! I got this working! I'm not sure it's possible to get it much lower than this.

  22. #22
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts

    To Compress or Not To Compress!

    You must also encode the zigzag array using codecs suitable for small values like gamma coding or variable nibble coding.
    You must compare the compressed length of this method and your main method and choose the best.
    Use the first 1 bit flag to indicate the method used.
    In the ideal case, when there is no difference between the two arrays, you'll encode only 1 bit per element.
    This can be exploited only If your arrays are not random, otherwise you'll see no benefits (because of network latency) compressing and sending the arrays through a network.
    Without compression we have already 10 bits * 1000 = 1250 bytes. saving only ~100 bytes is not worth the effort.

Similar Threads

  1. reproductible (compressed) archives
    By sebbu in forum Data Compression
    Replies: 6
    Last Post: 8th December 2016, 00:09
  2. PLEASE HELP ME ABOUT THIS COMPRESSED ARCHIVE !!!
    By CoreGames in forum Data Compression
    Replies: 8
    Last Post: 22nd September 2014, 18:20
  3. Compressed Pixel Formats
    By gregkwaste in forum Data Compression
    Replies: 20
    Last Post: 1st April 2014, 03:33
  4. Re: Useful compressed streaming properties
    By Shelwien in forum Data Compression
    Replies: 2
    Last Post: 2nd June 2012, 15:29
  5. BWT with compressed input data
    By Shelwien in forum Data Compression
    Replies: 3
    Last Post: 29th May 2009, 15:16

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •