
Thread: LIBSSC : SHARC's algorithms unleashed

  1. #1
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts

    LIBSSC : SHARC's algorithms unleashed

    Hello everyone,

    Following this much-debated thread http://encode.ru/threads/1804-Compre...-BSD-or-others (thanks to everyone who gave input), I've decided to release the SHARC compression algorithms as a BSD-licensed library: it's very nicely (!) called libssc.
    You can download it / try it out here: http://github.com/centaurean/libssc. Alternatively, for a quick test you can get the latest repository version of SHARC, which is now a command line client of this library: http://github.com/centaurean/sharc.

    It is currently working, in version 0.9.11 beta, with lots of improvements expected in the coming weeks (input hashsum checking for integrity, mode reversion to keep the size overhead to a minimum for incompressible files, a new higher-compression-ratio algorithm...).

    This is a snapshot of the latest tests on my laptop, against LZ4, which is really quick as everyone knows (the dataset is way too small, I know, but it's just preliminary testing). Mode c2 is used as its resulting ratio is similar to LZ4's default mode; c1 is actually 50% faster in both compression and decompression, with a compression ratio of around 60%:

    The tests include warmup and are run on a ramdisk, with both compressors built using their provided makefiles:


    Compression


    $ time ./sharc -n -c2 enwik7
    Compressed enwik7 (10,000,000 bytes) to enwik7.sharc (5,725,434 bytes), Ratio out / in = 57.3%, Time = 0.031 s, Speed = 305 MB/s


    real 0m0.036s
    user 0m0.022s
    sys 0m0.013s


    $ time ./lz4c -y enwik7
    Compressed filename will be : enwik7.lz4
    Compressed 10000000 bytes into 5750468 bytes ==> 57.50%


    real 0m0.058s
    user 0m0.040s
    sys 0m0.017s


    $ time ./sharc -n -c2 enwik8
    Compressed enwik8 (100,000,000 bytes) to enwik8.sharc (57,031,786 bytes), Ratio out / in = 57.0%, Time = 0.301 s, Speed = 317 MB/s


    real 0m0.311s
    user 0m0.206s
    sys 0m0.102s


    $ time ./lz4c -y enwik8
    Compressed filename will be : enwik8.lz4
    Compressed 100000000 bytes into 56995506 bytes ==> 57.00%


    real 0m0.485s
    user 0m0.372s
    sys 0m0.110s


    Decompression


    $ time ./sharc -n -d enwik7.sharc
    Decompressed enwik7.sharc (5,725,434 bytes) to enwik7 (10,000,000 bytes), Time = 0.036 s, Speed = 262 MB/s


    real 0m0.043s
    user 0m0.023s
    sys 0m0.018s


    $ time ./lz4c -d -y enwik7.lz4
    Decoding file enwik7
    Successfully decoded 10000000 bytes


    real 0m0.033s
    user 0m0.012s
    sys 0m0.020s


    $ time ./sharc -n -d enwik8.sharc
    Decompressed enwik8.sharc (57,031,786 bytes) to enwik8 (100,000,000 bytes), Time = 0.353 s, Speed = 270 MB/s


    real 0m0.367s
    user 0m0.215s
    sys 0m0.148s


    $ time ./lz4c -d -y enwik8.lz4
    Decoding file enwik8
    Successfully decoded 100000000 bytes


    real 0m0.264s
    user 0m0.097s
    sys 0m0.162s

    It seems that LIBSSC is ahead on compression speed, and LZ4 is ahead on decompression speed. But again, the benchmark is neither varied nor rigorous enough to draw any firm conclusions.
    Last edited by gpnuma; 13th December 2013 at 19:11.


  3. #2
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Wow, that's impressive.

    Is there a description of the algorithm somewhere, besides the source itself? (The source is uncommented as far as I can see.) What basic algorithm are you using, what are the most important enhancements, and how did you get it to go so fast?

  4. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I've read through the source code - it was tough, as there's a hell of a lot of boilerplate code.

    Basically, from what I understood from the code, the codec works like this (see the sketch below):
    - 8-byte signatures consist of 64 flags each; these flags work like the flags in LZSS, i.e. bit 1 means match, bit 0 means miss,
    - on a miss, copy 4 bytes verbatim and update the dictionary; nothing is written to the compressed stream except the flag in the signature,
    - on a match, copy 4 bytes from the dictionary; a 2-byte dictionary index is written to the compressed stream.
    The 2-byte index is a hash of the coded 4-byte chunk of data (i.e. it's not derived from previous data).
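    A minimal sketch of this scheme (illustrative names, not libssc's actual API; the hash mirrors the one in Piotr's Scala port further down the thread): per 4-byte chunk it emits one flag bit into a signature, then either a 2-byte index on a match or a 4-byte literal plus a dictionary update on a miss.
    Code:
    object ChameleonSketch {
      // 4-byte chunk -> 16-bit dictionary index (same mixing as the Scala port below)
      def makeHash(chunk: Int): Int = {
        val weak = ((2885564586L ^ chunk) * 16777619).toInt
        ((weak >>> 16) & 0xffff) ^ (weak & 0xffff)
      }
    
      // Returns (signatures, body): `signatures` packs 64 flag bits per Long,
      // `body` holds 2-byte indexes for matches and 4-byte literals for misses.
      // Trailing bytes that don't fill a 4-byte chunk are ignored in this sketch.
      def encode(input: Array[Byte]): (Array[Long], Array[Byte]) = {
        val dict = Array.ofDim[Int](1 << 16)
        val chunks = input.length / 4
        val signatures = Array.ofDim[Long]((chunks + 63) / 64)
        val body = scala.collection.mutable.ArrayBuffer[Byte]()
        for (i <- 0 until chunks) {
          val chunk = java.nio.ByteBuffer.wrap(input, i * 4, 4)
            .order(java.nio.ByteOrder.LITTLE_ENDIAN).getInt
          val h = makeHash(chunk)
          if (dict(h) == chunk) {             // match: flag = 1, emit the 2-byte index
            signatures(i / 64) |= 1L << (i % 64)
            body += (h >>> 8).toByte
            body += h.toByte
          } else {                            // miss: flag stays 0, emit literal, update dictionary
            dict(h) = chunk
            for (j <- 0 until 4) body += input(i * 4 + j)
          }
        }
        (signatures, body.toArray)
      }
    }
    A matching decoder would walk the signature bits and, per bit, either read a 2-byte index and copy the chunk from the dictionary, or read a 4-byte literal and update the dictionary the same way.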

    There is a precomputed dictionary, so it's a little like the cheating in PAQ with text dictionaries.

    There's no way for sharc/libssc to compress to half the original size or better in one pass - at best it compresses to a little worse than half (e.g. 64 fully matched 4-byte chunks, i.e. 256 input bytes, still cost an 8-byte signature plus 64 two-byte indexes = 136 bytes, about 53%). With more passes it is possible to improve the result, but more than 2 passes probably don't pay off, compared to more flexible compression formats.


    Overall, lz4 should get better compared to sharc/libssc as the compressibility of the input rises. And vice versa - on files with medium to high entropy, sharc/libssc should be more efficient.
    Last edited by Piotr Tarsa; 6th November 2013 at 22:18.


  6. #4
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Thanks Piotr that's exactly it !

    Although I have to add that the precomputed dictionary has almost no impact on the resulting ratio - maybe 1% at best; the hash function's distribution quality is much more important.
    That's because the dictionary, if empty at the start, is usually filled after reading somewhere between 256K and 1-4M of data depending on entropy, so its initial state only helps on that initial data.

    You're also absolutely right about the number of passes: the compression ratio for the faster kernel (codename chameleon) is capped at 50% in a single pass and 25% in a dual pass (modes c1 and c2 respectively). It becomes inefficient beyond two passes.

    I'm also working on a new kernel (codename argonaut) which has a 10% better ratio than dual pass chameleon (it brings enwik8's compressed ratio down to about 47%), but is still very fast : somewhere around 170MB/s on my platform or half of the c2 speed.

  7. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    For multi-megabyte files a small dictionary doesn't matter much, but for short streams it can have a great impact. m^2 (forum member) has made a benchmark that tests algorithms' efficiency as disk sector compressors, and in such cases the dictionary could have a significant effect. The same goes for e.g. independent compression of forum posts, short articles, comments and so on. OTOH, dictionary reloading could negate the efficiency advantage.

    I have a small idea for an improvement, but it would break the "streamable nature" of the codec:
    Instead of interleaving signatures (flags), indexes (hashes) and literals in a single stream, you could split them into three streams (or, more realistically, subblocks) and on dual-pass compression re-compress only the first two types (i.e. don't recompress literals). That could improve the ratio and speed of two-pass compression, I think.
    Last edited by Piotr Tarsa; 7th November 2013 at 02:32.

  8. #6
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    For multi-megabyte files a small dictionary doesn't matter much, but for short streams it can have a great impact. m^2 (forum member) has made a benchmark that tests algorithms' efficiency as disk sector compressors, and in such cases the dictionary could have a significant effect. The same goes for e.g. independent compression of forum posts, short articles, comments and so on. OTOH, dictionary reloading could negate the efficiency advantage.

    I have a small idea for an improvement, but it would break the "streamable nature" of the codec:
    Instead of interleaving signatures (flags), indexes (hashes) and literals in a single stream, you could split them into three streams (or, more realistically, subblocks) and on dual-pass compression re-compress only the first two types (i.e. don't recompress literals). That could improve the ratio and speed of two-pass compression, I think.
    It's interesting, yes, but I can see a few issues apart from streaming.
    First, how do you reorder the blocks on decompression? You will need to store this "sequencing" information and thus lose some space.
    Also, from single pass to dual pass you change the data alignment, which means it is possible the uncompressed literals after pass one could be better compressed on pass two.
    For example, let's say AAAA BBBB CCCC gets compressed to AB BBBB CCCC; on the second pass the algorithm will attempt compression on ABBB BBCC CC.. and thus might generate completely different results.

  9. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    First, how do you reorder the blocks on decompression? You will need to store this "sequencing" information and thus lose some space.
    Depending on the block size that can be small, eg 2 x 3 = 6 bytes per block.

    Also, from single pass to dual pass you change the data alignment, which means it is possible the uncompressed literals after pass one could be better compressed on pass two.
    That's why I said "could". Also, you don't only change the data alignment, you also mix all three types of information (flags, indexes and literals). And because everything has a power-of-two size (signatures - 8 bytes, literal runs - 4 bytes, indexes - 2 bytes), you never find a match that is spaced by an odd offset, no matter how many passes you do. That's probably why more than 2 passes don't bring significant improvement. You could alleviate that, e.g. by updating two dictionary entries on a miss - but only in the second pass.
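    To illustrate, one possible way to spend those 6 bytes of sequencing overhead (my sketch, assuming three little-endian 16-bit lengths and subblocks below 64 KiB; this is not a format libssc or the Scala experiments define):
    Code:
    import java.nio.{ByteBuffer, ByteOrder}
    
    object DeinterleavedBlock {
      // Per-block header: three 2-byte lengths (the 6 bytes of "sequencing" information),
      // followed by the signatures, hashes and literals subblocks in that fixed order.
      def packBlock(signatures: Array[Byte], hashes: Array[Byte], literals: Array[Byte]): Array[Byte] = {
        val buf = ByteBuffer
          .allocate(6 + signatures.length + hashes.length + literals.length)
          .order(ByteOrder.LITTLE_ENDIAN)
        buf.putShort(signatures.length.toShort)
        buf.putShort(hashes.length.toShort)
        buf.putShort(literals.length.toShort)
        buf.put(signatures).put(hashes).put(literals)
        buf.array()
      }
    }
    The decoder reads the three lengths first and then knows exactly where each subblock starts, so no reordering information beyond this small fixed header is needed.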
    Last edited by Piotr Tarsa; 7th November 2013 at 13:11.

  10. #8
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Depending on the block size that can be small, eg 2 x 3 = 6 bytes per block.
    That's why I said "could". Also, you don't only change the data alignment, you also mix all three types of information (flags, indexes and literals). And because everything has a power-of-two size (signatures - 8 bytes, literal runs - 4 bytes, indexes - 2 bytes), you never find a match that is spaced by an odd offset, no matter how many passes you do.
    Yes, absolutely; in fact I designed it that way because having even offsets allows skipping some extra checks and enables other optimizations, and thus improves speed, which is the number 1 concern in this algorithm: its c1 mode speed is actually close to memcpy's speed! There is for example a padding byte in the efficiency check block to maintain that even offset at all costs.

    Quote Originally Posted by Piotr Tarsa View Post
    That's probably why more than 2 passes don't bring significant improvement. You could alleviate that, e.g. by updating two dictionary entries on a miss - but only in the second pass.
    What do you mean by "updating two dictionary entries on a miss"? Do you mean having two separate dictionaries, or updating a single dictionary twice? The latter might not be acceptable in terms of speed (it requires two hash function calculations, and that's where the time is spent, so the algorithm would get too slow in my opinion).

    Thanks for your input btw great stuff ^^

  11. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I meant updating the dictionary twice.

    Almost all current processors are superscalar and perform well when there are few serial dependencies. So it could turn out that (with careful ordering of instructions) computing two hashes and updating the dictionary twice will not be much slower than computing one hash.

  12. #10
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I meant updating the dictionary twice.

    Almost all current processors are superscalar and perform well when there are few serial dependencies. So it could turn out that (with careful ordering of instructions) computing two hashes and updating the dictionary twice will not be much slower than computing one hash.
    So if I got your point correctly, your idea was: if we have ADFG ATFT DEFG to compress, first try ADFG; if there's no match, write 1 byte (A) to the output, offset by 1 and try DFGA.
    If there's no match again, continue with ATFT, otherwise TFTD is next. Is that right?

    I gave the doubled hashing + dictionary updating a try (better safe than sorry!) and it ends up taking quite a hit on performance.
    It is not twice as slow, thanks to processor optimizations as you said, but still not far from 20% slower... that's not negligible. And that was without the additional instructions required to handle the even/odd offset cases.

  13. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    No, I was thinking about something else.

    Suppose we have input: ABCD EFGH IJKL MNOP

    Now the updates go like that:
    Code:
    if (dict[hash(ABCD)] != ABCD) {
      dict[hash(ABCD)] := ABCD
    }
    if (dict[hash(EFGH)] != EFGH) {
      dict[hash(EFGH)] := EFGH
    }
    if (dict[hash(IJKL)] != IJKL) {
      dict[hash(IJKL)] := IJKL
    }
    if (dict[hash(MNOP)] != MNOP) {
      dict[hash(MNOP)] := MNOP
    }
    I'm suggesting something like:
    Code:
    if (dict[hash(ABCD)] != ABCD) {
      dict[hash(ABCD)] := ABCD
    }
    if (dict[hash(EFGH)] != EFGH) {
      dict[hash(EFGH)] := EFGH
      dict[hash(BCDE)] := BCDE
    }
    if (dict[hash(IJKL)] != IJKL) {
      dict[hash(IJKL)] := IJKL
      dict[hash(GHIJ)] := GHIJ
    }
    if (dict[hash(MNOP)] != MNOP) {
      dict[hash(MNOP)] := MNOP
      dict[hash(LMNO)] := LMNO
    }
    And do that only for second pass.

  14. #12
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    No, I was thinking about something else.

    Suppose we have input: ABCD EFGH IJKL MNOP

    Now the updates go like that:
    Code:
    if (dict[hash(ABCD)] != ABCD) {
      dict[hash(ABCD)] := ABCD
    }
    if (dict[hash(EFGH)] != EFGH) {
      dict[hash(EFGH)] := EFGH
    }
    if (dict[hash(IJKL)] != IJKL) {
      dict[hash(IJKL)] := IJKL
    }
    if (dict[hash(MNOP)] != MNOP) {
      dict[hash(MNOP)] := MNOP
    }
    I'm suggesting something like:
    Code:
    if (dict[hash(ABCD)] != ABCD) {
      dict[hash(ABCD)] := ABCD
    }
    if (dict[hash(EFGH)] != EFGH) {
      dict[hash(EFGH)] := EFGH
      dict[hash(BCDE)] := BCDE
    }
    if (dict[hash(IJKL)] != IJKL) {
      dict[hash(IJKL)] := IJKL
      dict[hash(GHIJ)] := GHIJ
    }
    if (dict[hash(MNOP)] != MNOP) {
      dict[hash(MNOP)] := MNOP
      dict[hash(LMNO)] := LMNO
    }
    And do that only for second pass.
    OK, I'll give it a try and see what the gain/loss in ratio and performance is, with no precomputed dictionary for a fairer comparison with the existing c2 mode.

    BTW feel free to fork the github project (http://github.com/centaurean/libssc) if you want to test anything ! That's the beauty of open source after all

  15. #13
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Ok Piotr, I gave your suggestion a try, but sadly it doesn't seem to help...

    Using enwik8:
    * compression ratio worse, from 57.1% (current -c2 with no precomputed dictionary) to 58.2% (modified -c2 with your suggestion and no precomputed dictionary)
    * speed down, from 330 MB/s (current -c2 with no precomputed dictionary) to 300 MB/s (modified -c2 with your suggestion and no precomputed dictionary)

    It actually reminded me that I had tried some "offset entries" experiments during the algorithm design, and I think the main problem is that you get a dictionary "overfill" by including these offset entries. You multiply hash collisions, and although that might be beneficial in some cases, overall it isn't, because some good entries (i.e. ones repeated many times) get eclipsed (through collisions) by some of the new entries.

    The obvious option would then be to increase dictionary size, but I don't want to do that for performance/optimization reasons, on the chameleon kernel at least.

    But again thanks for your interesting input !


  17. #14
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Thanks for the test.

    Can you post the modified sources? I would play with it a little.

  18. #15
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Sure. Start by cloning https://github.com/centaurean/sharc.git, then cd into the directory and do:

    git submodule init
    git submodule update
    cd modules/libssc
    git checkout 6f11fdb63632d9ffe39a4fa1336cf53b3fb842b0
    Then apply the following patch : https://gist.github.com/gpnuma/bc3a13b9b2ce7913ba47 and there you go !
    Last edited by gpnuma; 8th November 2013 at 16:46.

  19. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Thanks. I tried it and I think that actually you've made the shifts in the wrong direction. After reversing the direction of the shifts the ratio is better, but still worse than the original on the second pass. I tried it on the first pass on enwik8 and there was a negligible gain in compression ratio.

    Maybe with non-interleaved streams there would be some noticeable improvement in the second pass (but not by much, perhaps). With non-interleaved streams there would be little sense in doing the standard dictionary update in the second pass while recompressing literals, as the dictionary would be the same as in the first pass (thus there would be no improvement), so only shifted updates could be used. The code seems too complicated for a quick hack, though.

  20. #17
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Thanks. I tried it and I think that actually you've made the shifts in the wrong direction. After reversing the direction of the shifts the ratio is better, but still worse than the original on the second pass. I tried it on the first pass on enwik8 and there was a negligible gain in compression ratio.
    Ah, probably because of little-endianness, yes - I implemented it (too) quickly!

    Maybe with non-interleaved streams there would be some noticeable improvement in the second pass (but not by much, perhaps). With non-interleaved streams there would be little sense in doing the standard dictionary update in the second pass while recompressing literals, as the dictionary would be the same as in the first pass (thus there would be no improvement), so only shifted updates could be used. The code seems too complicated for a quick hack, though.
    The thing is, libssc is designed to be parallelizable, although this isn't fully implemented yet since the list of cue points isn't written to the footer at this stage. To enable multithreading, you must have certain points in the file where the dictionary is reset, and these points are the starting points for the different threads. That's the reason behind the dictionary resets; otherwise it is of course more efficient to keep the same dictionary for the whole compression phase, as you skip the "learning process" that happens at the beginning while the dictionary is filling.
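    A sketch of why those reset points matter (illustrative only, not libssc's actual API; the kernel call is a placeholder): every block starts from the same empty dictionary, so blocks can be handed to independent workers and their outputs concatenated afterwards.
    Code:
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    
    object ParallelBlocks {
      // Placeholder for the real kernel: compress `block` starting from an empty dictionary.
      def compressBlock(block: Array[Byte]): Array[Byte] = {
        val dictionary = Array.ofDim[Int](256 * 256) // the per-block dictionary reset
        // ... run the chameleon kernel over `block` with `dictionary` here ...
        block                                        // identity stand-in for this sketch
      }
    
      // Because no block depends on another block's dictionary state, the blocks can be
      // compressed concurrently and simply concatenated in order.
      def compressParallel(input: Array[Byte], blockSize: Int): Seq[Array[Byte]] = {
        val blocks = input.grouped(blockSize).toSeq
        val futures = blocks.map(b => Future(compressBlock(b)))
        Await.result(Future.sequence(futures), Duration.Inf)
      }
    }
    The trade-off is exactly the one described above: each worker pays the dictionary "learning" time again at the start of its block.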

  21. #18
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Hey gpnuma, I saw the announcements a couple of days ago, but haven't found time to comment until now:
    I find it great news that you started the library. I used to view sharc as an interesting but useless project, because GPL3 is a show stopper for me. Still, I wanted to add it to fsbench and even spent an hour or so trying to do it, though that turned out to be far too little.
    Now that you've provided the algorithms under more liberal terms, they've lost the hugely important 'useless' tag and libssc looks like a very promising project. Sadly, working 12+ hours a day as I have been recently, I can't add it to fsbench, let alone learn how it works deeply enough to hack on it a bit. But it's high on my priority list.
    And: I wish you luck.


  23. #19
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I don't see how fixed-block-scoped deinterleaving would make parallelization harder. Workers (in multithreaded processing) must work on separate blocks of data and the data in those blocks can be deinterleaved (deinterleaved only on 1st pass).

    I've done a very quick port to Scala to play with the ideas. First I modified the sharc/libssc sources to disable dictionary preinitialization and dictionary resets, then measured the results.
    Second, I made my own implementation that doesn't interleave the output. I made two types of coding: one like in your libssc, and a second one with shifted offsets. The one with shifted offsets I used when recompressing the literals.

    Code:
    -r--r--r-- 1 piotrek piotrek  100000000 cze  1  2011 enwik8
    -rw-rw-r-- 1 piotrek piotrek    3125000 lis  9 00:28 enwik8.ca1
    -rw-rw-r-- 1 piotrek piotrek      97656 lis  9 00:30 enwik8.ca1.ca1
    -rw-rw-r-- 1 piotrek piotrek     442420 lis  9 00:30 enwik8.ca1.ca2
    -rw-rw-r-- 1 piotrek piotrek    2240160 lis  9 00:30 enwik8.ca1.ca3
    -rw-rw-r-- 1 piotrek piotrek   41335102 lis  9 00:28 enwik8.ca2
    -rw-rw-r-- 1 piotrek piotrek    1291722 lis  9 00:30 enwik8.ca2.ca1
    -rw-rw-r-- 1 piotrek piotrek    5572612 lis  9 00:30 enwik8.ca2.ca2
    -rw-rw-r-- 1 piotrek piotrek   30189880 lis  9 00:30 enwik8.ca2.ca3
    -rw-rw-r-- 1 piotrek piotrek   17329796 lis  9 00:28 enwik8.ca3
    -rw-rw-r-- 1 piotrek piotrek     541556 lis  9 00:49 enwik8.ca3.cb1
    -rw-rw-r-- 1 piotrek piotrek    1648618 lis  9 00:49 enwik8.ca3.cb2
    -rw-rw-r-- 1 piotrek piotrek   14032560 lis  9 00:49 enwik8.ca3.cb3
    -rw-rw-r-- 1 piotrek piotrek  100000784 lis  9 00:27 enwik8.sharc.c0
    -rw-rw-r-- 1 piotrek piotrek   61791064 lis  9 00:27 enwik8.sharc.c1
    -rw-rw-r-- 1 piotrek piotrek   57352258 lis  9 00:27 enwik8.sharc.c1.c1
    -rw-rw-r-- 1 piotrek piotrek   57058846 lis  9 00:28 enwik8.sharc.c2
    The files produced on the first pass are enwik8.ca?, while the files produced on the second pass are enwik8.ca?.c??

    After summing up the sizes of the files from the second pass (i.e. enwik8.ca?.c??) and comparing with the size of enwik8.sharc.c2, there's almost 1 MB of difference in favor of my deinterleaved approach.


    Full code in Scala:
    Code:
    import java.io.BufferedInputStream
    import java.io.FileInputStream
    import java.lang.System.err
    import scala.collection.mutable.MutableList
    import java.io.File
    import java.io.FileOutputStream
    
    object Hello {
      def makeChunk(array: Array[Byte], pos: Int) =
        ((array(pos).toLong & 0xffl) << 0) + ((array(pos + 1).toLong & 0xffl) << 8) +
          ((array(pos + 2).toLong & 0xffl) << 16) + ((array(pos + 3).toLong & 0xffl) << 24)
    
      def makeHash(chunk: Long) = {
        val weak = (2885564586l ^ chunk) * 16777619
        (((weak >>> 16) & 0xffff) ^ (weak & 0xffff)).toInt
      }
    
      def main(args: Array[String]): Unit = {
        if (args.length != 5 || !Set("ca", "cb", "da", "db").contains(args(0))) {
          err.println("Usage:");
          err.println("\tscala <program.jar> c(a|b) in out1 out2 out3");
          err.println("\tscala <program.jar> d(a|b) in1 in2 in3 out");
          System.exit(0);
        }
        if (args(0) == "ca" || args(0) == "cb") {
          val modeA = args(0) == "ca"
          val in = Array.ofDim[Byte](new File(args(1)).length().toInt + 3)
          new FileInputStream(args(1)).read(in)
          val out1 = Array.ofDim[Byte](in.length / 8)
          var out1p = 0
          val out2 = Array.ofDim[Byte](in.length / 2)
          var out2p = 0
          val out3 = Array.ofDim[Byte](in.length)
          var out3p = 0
    
          val items = in.length / 4
          val dictionary = Array.ofDim[Int](256 * 256)
          var signature = 1
          var matched = 0
          var missed = 0
          for (itemNr <- 0 until items) {
            val item = makeChunk(in, itemNr * 4)
            val hash = makeHash(item)
            val matches = dictionary(hash) == item.toInt
            signature *= 2
            if (matches) {
              matched += 1
              signature += 1
              out2(out2p) = (hash >>> 8).toByte
              out2p += 1
              out2(out2p) = hash.toByte
              out2p += 1
            } else {
              missed += 1
              if (modeA)
            	dictionary(hash) = item.toInt
              else {
                val newChunk = makeChunk(in, itemNr * 4 + 1 + (itemNr & 2))
                dictionary(makeHash(newChunk)) = newChunk.toInt
              }
              for (i <- 0 until 4) { 
            	  out3(out3p) = in(itemNr * 4 + i)
            	  out3p += 1
              }
            }
            if (signature >= 256) {
              out1(out1p) = (signature & 0xff).toByte
              out1p += 1
              signature = 1
            }
          }
          println("Matched: " + matched)
          println("Missed: " + missed)
          println("Estimated size: " + (items / 8 + matched * 2 + missed * 4))
          val out1s = new FileOutputStream(args(2))
          out1s.write(out1, 0, out1p)
          out1s.close()
          val out2s = new FileOutputStream(args(3))
          out2s.write(out2, 0, out2p)
          out2s.close()
          val out3s = new FileOutputStream(args(4))
          out3s.write(out3, 0, out3p)
          out3s.close()
        }
      }
    }
    There's no flushing of state at the end, so the files aren't completely decodable (a few bytes at the end wouldn't be decodable), and I also haven't implemented any of the features of sharc's format. I'll try tomorrow to develop the code further to the point where the algorithm is reversible.
    Last edited by Piotr Tarsa; 9th November 2013 at 12:24.

  24. #20
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    OK. I've made a reversible algorithm.

    Code:
    import java.io.BufferedInputStream
    import java.io.FileInputStream
    import java.lang.System.err
    import scala.collection.mutable.MutableList
    import java.io.File
    import java.io.FileOutputStream
    
    object Hello {
      def makeChunk(array: Array[Byte], pos: Int) =
        ((array(pos) & 0xff) << 0) + ((array(pos + 1) & 0xff) << 8) +
          ((array(pos + 2) & 0xff) << 16) + ((array(pos + 3) & 0xff) << 24)
    
      def makeHash(chunk: Int) = {
        val weak = ((2885564586l ^ chunk) * 16777619).toInt
        ((weak >>> 16) & 0xffff) ^ (weak & 0xffff)
      }
    
      def writeChunk(array: Array[Byte], pos: Int, chunk: Int) {
        array(pos) = (chunk & 0xff).toByte
        array(pos + 1) = ((chunk >> 8) & 0xff).toByte
        array(pos + 2) = ((chunk >> 16) & 0xff).toByte
        array(pos + 3) = ((chunk >> 24) & 0xff).toByte
      }
    
      def main(args: Array[String]): Unit = {
        if (args.length != 6 || !Set("ca", "cb", "da", "db").contains(args(0))) {
          err.println("Usage:");
          err.println("\tscala <program.jar> c(a|b) in out1 out2 out3 out4");
          err.println("\tscala <program.jar> d(a|b) in1 in2 in3 in4 out");
          System.exit(0);
        }
        val modeA = args(0).charAt(1) == 'a'
        val dictionary = Array.ofDim[Int](256 * 256)
        if (args(0) == "ca" || args(0) == "cb") {
          val in = Array.ofDim[Byte](new File(args(1)).length().toInt + 3)
          new FileInputStream(args(1)).read(in, 3, in.length - 3)
          val out1 = Array.ofDim[Byte]((in.length - 3) / 32)
          var out1p = 0
          val out2 = Array.ofDim[Byte](out1.length * 16)
          var out2p = 0
          val out3 = Array.ofDim[Byte](out1.length * 32)
          var out3p = 0
          val items = out1.length * 8
          var signature = 1
          var matched = 0
          var missed = 0
          for (itemNr <- 0 until items) {
            val item = makeChunk(in, 3 + itemNr * 4)
            val hash = makeHash(item)
            val matches = dictionary(hash) == item
            signature *= 2
            if (matches) {
              matched += 1
              signature += 1
              out2(out2p) = (hash >>> 8).toByte
              out2p += 1
              out2(out2p) = hash.toByte
              out2p += 1
            } else {
              missed += 1
              if (modeA)
                dictionary(hash) = item
              else {
                val newChunk = makeChunk(in, 3 + itemNr * 4 - 1 - (itemNr & 2))
                dictionary(makeHash(newChunk)) = newChunk
              }
              for (i <- 0 until 4) {
                out3(out3p) = in(3 + itemNr * 4 + i)
                out3p += 1
              }
            }
            if (signature >= 256) {
              out1(out1p) = (signature & 0xff).toByte
              out1p += 1
              signature = 1
            }
          }
          println("Matched: " + matched)
          println("Missed: " + missed)
          println("Size: " + (out1p + out2p + out3p + -3 + in.length - out1p * 32))
          val out1s = new FileOutputStream(args(2))
          out1s.write(out1, 0, out1p)
          out1s.close()
          val out2s = new FileOutputStream(args(3))
          out2s.write(out2, 0, out2p)
          out2s.close()
          val out3s = new FileOutputStream(args(4))
          out3s.write(out3, 0, out3p)
          out3s.close()
          val out4s = new FileOutputStream(args(5))
          out4s.write(in, 3 + out1p * 32, -3 + in.length - out1p * 32)
          out4s.close()
        } else {
          val in1 = Array.ofDim[Byte](new File(args(1)).length().toInt)
          new FileInputStream(args(1)).read(in1)
          var in1p = 0
          val in2 = Array.ofDim[Byte](new File(args(2)).length().toInt)
          new FileInputStream(args(2)).read(in2)
          var in2p = 0
          val in3 = Array.ofDim[Byte](new File(args(3)).length().toInt)
          new FileInputStream(args(3)).read(in3)
          var in3p = 0
          val in4 = Array.ofDim[Byte](new File(args(4)).length().toInt)
          new FileInputStream(args(4)).read(in4)
          val out = Array.ofDim[Byte](in2.length * 2 + in3.length + in4.length + 3)
          val items = in1.length * 8
          var signature = 1 << 16
          for (itemNr <- 0 until items) {
            if (signature >= (1 << 16)) {
              signature = (1 << 8) + (in1(in1p) & 0xff)
              in1p += 1
            }
            val matches = (signature & 0x80) != 0
            signature *= 2
            if (matches) {
              val hash = ((in2(in2p) & 0xff) << 8) + (in2(in2p + 1) & 0xff)
              in2p += 2
              val chunk = dictionary(hash)
              writeChunk(out, 3 + itemNr * 4, chunk)
            } else {
              val chunk = makeChunk(in3, in3p)
              writeChunk(out, 3 + itemNr * 4, chunk)
              if (modeA) {
                dictionary(makeHash(chunk)) = chunk
              } else {
                val newChunk = makeChunk(out, 3 + itemNr * 4 - 1 - (itemNr & 2))
                dictionary(makeHash(newChunk)) = newChunk
              }
              in3p += 4
            }
          }
          for (i <- 0 until in4.length) {
            out(3 + items * 4 + i) = in4(i)
          }
          val outs = new FileOutputStream(args(5))
          outs.write(out, 3, -3 + out.length)
          outs.close()
        }
      }
    }
    Results:
    Code:
    -r--r--r-- 1 piotrek piotrek  100000000 cze  1  2011 enwik8
    -rw-rw-r-- 1 piotrek piotrek    3125000 lis  9 23:02 enwik8.ca1
    -rw-rw-r-- 1 piotrek piotrek      97656 lis  9 23:02 enwik8.ca1.ca1
    -rw-rw-r-- 1 piotrek piotrek     442418 lis  9 23:02 enwik8.ca1.ca2
    -rw-rw-r-- 1 piotrek piotrek    2240156 lis  9 23:02 enwik8.ca1.ca3
    -rw-rw-r-- 1 piotrek piotrek          8 lis  9 23:02 enwik8.ca1.ca4
    -rw-rw-r-- 1 piotrek piotrek   41335102 lis  9 23:02 enwik8.ca2
    -rw-rw-r-- 1 piotrek piotrek    1291721 lis  9 23:04 enwik8.ca2.ca1
    -rw-rw-r-- 1 piotrek piotrek    5572608 lis  9 23:04 enwik8.ca2.ca2
    -rw-rw-r-- 1 piotrek piotrek   30189856 lis  9 23:04 enwik8.ca2.ca3
    -rw-rw-r-- 1 piotrek piotrek         30 lis  9 23:04 enwik8.ca2.ca4
    -rw-rw-r-- 1 piotrek piotrek   17329796 lis  9 23:02 enwik8.ca3
    -rw-rw-r-- 1 piotrek piotrek     541556 lis  9 23:05 enwik8.ca3.cb1
    -rw-rw-r-- 1 piotrek piotrek    1652428 lis  9 23:05 enwik8.ca3.cb2
    -rw-rw-r-- 1 piotrek piotrek   14024936 lis  9 23:05 enwik8.ca3.cb3
    -rw-rw-r-- 1 piotrek piotrek          4 lis  9 23:05 enwik8.ca3.cb4
    -rw-rw-r-- 1 piotrek piotrek          0 lis  9 23:02 enwik8.ca4
    -rw-rw-r-- 1 piotrek piotrek  100000000 lis  9 23:09 enwik8.da
    -rw-rw-r-- 1 piotrek piotrek    3125000 lis  9 23:08 enwik8.da1
    -rw-rw-r-- 1 piotrek piotrek   41335102 lis  9 23:08 enwik8.da2
    -rw-rw-r-- 1 piotrek piotrek   17329796 lis  9 23:08 enwik8.da3
    -rw-rw-r-- 1 piotrek piotrek  100000784 lis  9 00:27 enwik8.sharc.c0
    -rw-rw-r-- 1 piotrek piotrek   61791064 lis  9 00:27 enwik8.sharc.c1
    -rw-rw-r-- 1 piotrek piotrek   57352258 lis  9 00:27 enwik8.sharc.c1.c1
    -rw-rw-r-- 1 piotrek piotrek   57058846 lis  9 00:28 enwik8.sharc.c2
    The program takes six arguments:
    - mode (compression/decompression, typeA/typeB, i.e. non-shifted/shifted updates)
    - in compression mode: one input and four output files
    - in decompression mode: four input files and one output file
    - those four files have to be in the same order

    The total size of the files enwik8.ca?.c?? and enwik8.ca4 (only those files are required for successful decompression) is 56,053,377 bytes, which is about one megabyte smaller than enwik8.sharc.c2.
    The overhead of the sharc container shouldn't be more than a dozen kilobytes I think, so the difference is significant anyway.


  26. #21
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Hey Piotr !

    Thanks for that, I'll study your code as soon as I can to determine the overall gain/cost in ratio/speed. Did you try it out with other file types (like text logs, etc.)? I suspect it could do even better with more compressible data.

  27. #22
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Ok I had a quick look. First things first :

    I don't see how fixed-block-scoped deinterleaving would make parallelization harder. Workers (in multithreaded processing) must work on separate blocks of data and the data in those blocks can be deinterleaved (deinterleaved only on 1st pass).
    Well, you're absolutely right: it doesn't make it any harder. I initially understood your method to require suppressing the dictionary resets (which are mandatory for parallelization), but obviously that was only for testing purposes and it's not necessary.

    On the other hand, there is a performance penalty (granted it's not dramatic but libssc is all about speed) and more importantly a problem with streaming. Your approach will make the algorithm definitely less streamable due to the deinterleaving. You could try to have some way of implementing frequent "context switching" from one type of stream to another (i.e. from the literals to the signatures to the hashes etc.) so decompression can still operate in a streaming fashion, but that means extra transmitted data will probably be required and the deinterleaving advantage will be at least partly lost.

    Are these problems offset by a 1% gain in compression ratio? At this stage I would say no. But 5%? Well, that's another story

    As for the code itself, I can see a few issues after reading through it (or maybe it's just me misinterpreting?):
    def makeHash(chunk: Int) = {
    val weak = ((2885564586l ^ chunk) * 16777619).toInt
    ((weak >>> 16) & 0xffff) ^ (weak & 0xffff)
    }
    If I remember well, Java ints are signed 32-bit, whereas the libssc calculations use unsigned 32-bit ints. Maybe it doesn't make any difference, but it would be worth double-checking that the above calculation produces the same values as the unsigned 32-bit one.

    val newChunk = makeChunk(out, 3 + itemNr * 4 - 1 - (itemNr & 2))
    dictionary(makeHash(newChunk)) = newChunk
    I think the itemNr & 2 is probably a mistake there, as it looks like you're trying to compute an itemNr modulo. The problem is that "a modulo b" is equal to "a & (b - 1)" only when b is a power of 2; in this case b = 3, so it's not.

    Thanks for your time.

  28. #23
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    On the other hand, there is a performance penalty (granted it's not dramatic but libssc is all about speed) and more importantly a problem with streaming. Your approach will make the algorithm definitely less streamable due to the deinterleaving. You could try to have some way of implementing frequent "context switching" from one type of stream to another (i.e. from the literals to the signatures to the hashes etc.) so decompression can still operate in a streaming fashion, but that means extra transmitted data will probably be required and the deinterleaving advantage will be at least partly lost.
    The performance penalty should depend on the block size - the bigger the block, the less frequent the switching between stream types, so the lower the performance overhead.
    The amount of additional transmitted data also depends on the block size - the bigger the block, the less overhead.

    Are these problems offset by a 1% gain in compression ratio? At this stage I would say no. But 5%? Well, that's another story
    Two-pass is much slower than single-pass, yet it doesn't gain a huge advantage: on enwik8 it's an 8% improvement, but two passes probably take 2/3 times longer or so. The deinterleaving performance penalty should be low, I think < 10% for a 1.5% improvement. So it would improve the perceived efficiency of dual-pass algorithms.

    If I remember well, Java ints are signed 32-bit, whereas the libssc calculations use unsigned 32-bit ints. Maybe it doesn't make any difference, but it would be worth double-checking that the above calculation produces the same values as the unsigned 32-bit one.
    I've checked and it seems correct.
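    For reference, a quick brute-force way to convince oneself of this (my own sketch, not necessarily the check Piotr ran): compare the signed-Int computation from the Scala port with an unsigned 32-bit computation emulated with a Long.
    Code:
    import scala.util.Random
    
    object HashEquivalenceCheck {
      // Signed 32-bit arithmetic, as in the Scala port above.
      def makeHashSigned(chunk: Int): Int = {
        val weak = ((2885564586L ^ chunk) * 16777619).toInt
        ((weak >>> 16) & 0xffff) ^ (weak & 0xffff)
      }
    
      // Unsigned 32-bit arithmetic emulated with a Long, as the C code would do it.
      def makeHashUnsigned(chunk: Long): Int = {
        val weak = ((2885564586L ^ chunk) * 16777619) & 0xffffffffL
        (((weak >>> 16) & 0xffff) ^ (weak & 0xffff)).toInt
      }
    
      def main(args: Array[String]): Unit = {
        val rnd = new Random(0)
        val allEqual = (0 until 1000000).forall { _ =>
          val chunk = rnd.nextInt()
          makeHashSigned(chunk) == makeHashUnsigned(chunk & 0xffffffffL)
        }
        println(if (allEqual) "signed and unsigned hashes agree" else "mismatch found")
      }
    }
    Both variants only ever use the low 32 bits of the product and the low/high 16-bit halves of that, so they produce identical values for every chunk.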

    I think the itemNr & 2 is probably a mistake there, as it looks like you're trying to compute an itemNr modulo. The problem is that "a modulo b" is equal to "a & (b - 1)" only when b is a power of 2; in this case b = 3, so it's not.
    No, it's not a mistake. I wanted the offset to alternate between 1 and 3 (itemNr & 2 is 0, 0, 2, 2, 0, 0, ..., so the offset 1 + (itemNr & 2) alternates between 1 and 3 in pairs). OTOH, maybe it's not optimal. I haven't checked thoroughly whether it is.




    I have another idea though: integrate some kind of LZP. On redundant data separate LZP preprocessing helps greatly, but on dense data it causes overhead, so integrating it within a single pass should be more efficient than doing preprocessing and then compression.

    I therefore propose a change in the meaning of the flags. Suppose there's a run of matches. Instead of writing all the hashes/indexes, we can temporarily change the meaning of the flags, so that after a run of e.g. 3 matches a flag means an LZP match that doesn't require a hash/index. On the first mismatch in LZP mode, we switch back to normal mode and encode matches/literals the old way.

    With that change the theoretical maximum compression ratio would be close to 32:1 instead of 2:1 in a single pass (because a single flag bit could potentially stand for a whole 4-byte, i.e. 32-bit, match).

    There's no need to store offsets (in the dictionary) either - only 16 bits of additional data per dictionary entry, i.e. the hash/index of the predicted next entry, would be needed. That would increase the dictionary size by half (i.e. from 4 bytes up to 6 bytes per entry).

    Maybe I'll check that idea sometime.



    Edit:
    I was too curious and implemented my idea already. Unfortunately, decompression is not implemented yet so the results aren't certain. But if they are correct then LZP-enhanced SHARC would gain a lot. The enhanced single pass scores somewhat (200 KB or so) better than the normal dual pass on enwik8. On fp.log from maximumcompression.com the difference is more dramatic - the enhanced single pass produces a file that's over two times smaller than the normal dual pass. And I haven't yet tested dual pass on my LZP-enhanced SHARC - though I think the improvement over single pass won't be as significant as in normal SHARC.

    And to spice up your appetite - I think that with redundant data and my LZP-enhanced scheme there would actually be a chance to improve compression speed, as there is no need to compute hashes for LZP matches: they are obtained from the LZP tables (which contain the hashes of the next predicted chunks, as I said).
    Last edited by Piotr Tarsa; 21st November 2013 at 02:29.

  29. #24
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    OK, I've managed to make a decodable version. The results are as impressive as I reported in the post above.

    On enwik8 and enwik9 the LZP-enhanced single pass compresses better than the non-LZP-enhanced dual pass. It's not an impressive improvement, but neither is it negligible.
    On fp.log (from maximumcompression.com) the LZP-enhanced single pass compresses about 2.356 times smaller than the non-LZP-enhanced dual pass. That's a huge improvement IMO.

    Source code:
    Code:
    import java.io.BufferedInputStream
    import java.io.FileInputStream
    import java.lang.System.err
    import scala.collection.mutable.MutableList
    import java.io.File
    import java.io.FileOutputStream
    
    object Hello {
      def makeChunk(array: Array[Byte], pos: Int) =
        ((array(pos) & 0xff) << 0) + ((array(pos + 1) & 0xff) << 8) +
          ((array(pos + 2) & 0xff) << 16) + ((array(pos + 3) & 0xff) << 24)
    
      def makeHash(chunk: Int) = {
        val weak = ((2885564586l ^ chunk) * 16777619).toInt
        ((weak >>> 16) & 0xffff) ^ (weak & 0xffff)
      }
    
      def writeChunk(array: Array[Byte], pos: Int, chunk: Int) {
        array(pos) = (chunk & 0xff).toByte
        array(pos + 1) = ((chunk >> 8) & 0xff).toByte
        array(pos + 2) = ((chunk >> 16) & 0xff).toByte
        array(pos + 3) = ((chunk >> 24) & 0xff).toByte
      }
    
      def outFlag(array: Array[Byte], pos: Int, signature: Int, flag: Boolean): Int = {
        val newSignature = (signature * 2) + (if (flag) 1 else 0)
        if (newSignature >= 256) {
          array(pos) = (newSignature & 0xff).asInstanceOf[Byte]
          1
        } else {
          newSignature
        }
      }
    
      def adjustment(signature: Int) = {
        if (signature == 1) {
          1
        } else {
          0
        }
      }
    
      def main(args: Array[String]): Unit = {
        if (args.length != 6 || !Set("c", "d").contains(args(0))) {
          err.println("Usage:");
          err.println("\tscala <program.jar> c in out1 out2 out3 out4");
          err.println("\tscala <program.jar> d in1 in2 in3 in4 out");
          System.exit(0);
        }
        val dictionary = Array.ofDim[Int](256 * 256)
        val lzpTransitions = Array.ofDim[Short](256 * 256)
        if (args(0) == "c") {
          val in = Array.ofDim[Byte](new File(args(1)).length().toInt + 3)
          new FileInputStream(args(1)).read(in, 3, in.length - 3)
          val out1 = Array.ofDim[Byte]((in.length - 3) / 16 + 1)
          var out1p = 0
          val out2 = Array.ofDim[Byte](out1.length * 8)
          var out2p = 0
          val out3 = Array.ofDim[Byte](out1.length * 16)
          var out3p = 0
          val items = (in.length - 3) / 4
          var signature = 1
          var lzpMatched = 0
          var dictMatched = 0
          var literals = 0
          var lastMatch = false
          var lastHash = 0
          for (itemNr <- 0 until items) {
            val item = makeChunk(in, 3 + itemNr * 4)
            val predictedHash = 0xffff & lzpTransitions(lastHash)
            val lzpMode = lastMatch
            if (lzpMode && dictionary(predictedHash) == item) {
              lzpMatched += 1
              signature = outFlag(out1, out1p, signature, true)
              out1p += adjustment(signature)
              lastHash = predictedHash
            } else {
              if (lzpMode) {
                signature = outFlag(out1, out1p, signature, false)
                out1p += adjustment(signature)
              }
              val hash = makeHash(item)
              val matches = dictionary(hash) == item
              signature = outFlag(out1, out1p, signature, matches)
              out1p += adjustment(signature)
              if (matches) {
                dictMatched += 1
                out2(out2p) = (hash >>> 8).toByte
                out2p += 1
                out2(out2p) = hash.toByte
                out2p += 1
              } else {
                literals += 1
                dictionary(hash) = item
                for (i <- 0 until 4) {
                  out3(out3p) = in(3 + itemNr * 4 + i)
                  out3p += 1
                }
              }
              lzpTransitions(lastHash) = hash.asInstanceOf[Short]
              lastMatch = matches
              lastHash = hash
            }
          }
          if (signature != 1) {
            while (signature < 256) {
              signature *= 2
            }
            out1(out1p) = signature.asInstanceOf[Byte]
            out1p += 1
          }
          println("LZP matched:        " + (lzpMatched * 4))
          println("Dictionary matched: " + (dictMatched * 4))
          println("Literals:           " + (literals * 4))
          println("Size:               " + (out1p + out2p + out3p + -3 + in.length - items * 4))
          val out1s = new FileOutputStream(args(2))
          out1s.write(out1, 0, out1p)
          out1s.close()
          val out2s = new FileOutputStream(args(3))
          out2s.write(out2, 0, out2p)
          out2s.close()
          val out3s = new FileOutputStream(args(4))
          out3s.write(out3, 0, out3p)
          out3s.close()
          val out4s = new FileOutputStream(args(5))
          out4s.write(in, 3 + items * 4, -3 + in.length - items * 4)
          out4s.close()
        } else {
          val in1 = Array.ofDim[Byte](new File(args(1)).length().toInt)
          new FileInputStream(args(1)).read(in1)
          var in1p = 0
          val in2 = Array.ofDim[Byte](new File(args(2)).length().toInt)
          new FileInputStream(args(2)).read(in2)
          var in2p = 0
          val in3 = Array.ofDim[Byte](new File(args(3)).length().toInt)
          new FileInputStream(args(3)).read(in3)
          var in3p = 0
          val in4 = Array.ofDim[Byte](new File(args(4)).length().toInt)
          new FileInputStream(args(4)).read(in4)
          val out = Array.ofDim[Byte](in2.length * 32 + in4.length + 3)
          val items = in1.length * 8
          var signature = 1 << 16
          var lastMatch = false
          var lastHash = 0
          var itemNr = 0
          while (in1p < in1.length || in2p < in2.length || in3p < in3.length ||
            (signature & 0xff) != 0) {
            if (signature >= (1 << 16)) {
              signature = (1 << 8) + (in1(in1p) & 0xff)
              in1p += 1
            }
            val predictedHash = 0xffff & lzpTransitions(lastHash)
            val lzpMode = lastMatch
            var matches = false
            if (lzpMode) {
              matches = (signature & 0x80) != 0
              signature *= 2
              if (matches) {
                val chunk = dictionary(predictedHash)
                writeChunk(out, 3 + itemNr * 4, chunk)
                lastHash = predictedHash
              }
            }
            if (!matches) {
              if (signature >= (1 << 16)) {
                signature = (1 << 8) + (in1(in1p) & 0xff)
                in1p += 1
              }
              matches = (signature & 0x80) != 0
              signature *= 2
              var hash = 0
              if (matches) {
                hash = ((in2(in2p) & 0xff) << 8) + (in2(in2p + 1) & 0xff)
                in2p += 2
                val chunk = dictionary(hash)
                writeChunk(out, 3 + itemNr * 4, chunk)
              } else {
                val chunk = makeChunk(in3, in3p)
                writeChunk(out, 3 + itemNr * 4, chunk)
                hash = makeHash(chunk)
                dictionary(hash) = chunk
                in3p += 4
              }
              lzpTransitions(lastHash) = hash.asInstanceOf[Short]
              lastMatch = matches
              lastHash = hash
            }
            itemNr += 1
          }
          for (i <- 0 until in4.length) {
            out(3 + itemNr * 4 + i) = in4(i)
          }
          val outs = new FileOutputStream(args(5))
          outs.write(out, 3, itemNr * 4 + in4.length)
          outs.close()
        }
      }
    }
    Console session logs:
    Code:
    piotrek@piotrek-desktop:~/corpora/enwik$ /devel/scala/scala-2.10.3/bin/scala -J-Xmx30G ~/Projekty/sharcy/target/scala-2.10/sharcy_2.10-1.0.jar c enwik8 enwik8.c1 enwik8.c2 enwik8.c3 enwik8.c4
    LZP matched:        14245080
    Dictionary matched: 68425124
    Literals:           17329796
    Size:               56805643
    piotrek@piotrek-desktop:~/corpora/enwik$ /devel/scala/scala-2.10.3/bin/scala -J-Xmx30G ~/Projekty/sharcy/target/scala-2.10/sharcy_2.10-1.0.jar c enwik9 enwik9.c1 enwik9.c2 enwik9.c3 enwik9.c4
    LZP matched:        206769828
    Dictionary matched: 635651476
    Literals:           157578696
    Size:               526518543
    piotrek@piotrek-desktop:~/corpora/enwik$ /devel/scala/scala-2.10.3/bin/scala -J-Xmx30G ~/Projekty/sharcy/target/scala-2.10/sharcy_2.10-1.0.jar d enwik8.c1 enwik8.c2 enwik8.c3 enwik8.c4 enwik8.d
    piotrek@piotrek-desktop:~/corpora/enwik$ /devel/scala/scala-2.10.3/bin/scala -J-Xmx30G ~/Projekty/sharcy/target/scala-2.10/sharcy_2.10-1.0.jar d enwik9.c1 enwik9.c2 enwik9.c3 enwik9.c4 enwik9.d
    piotrek@piotrek-desktop:~/corpora/enwik$ ~/Projekty/sharc/src/sharc -c1 -i -o < enwik9 > enwik9.sharc.c1
    piotrek@piotrek-desktop:~/corpora/enwik$ ~/Projekty/sharc/src/sharc -c2 -i -o < enwik9 > enwik9.sharc.c2
    piotrek@piotrek-desktop:~/corpora/enwik$ ~/Projekty/sharc/src/sharc -c1 -i -o < enwik8.c1 > enwik8.c1.sharc.c1
    piotrek@piotrek-desktop:~/corpora/enwik$ ~/Projekty/sharc/src/sharc -c1 -i -o < enwik8.c2 > enwik8.c2.sharc.c1
    piotrek@piotrek-desktop:~/corpora/enwik$ ls -al
    razem 3956704
    drwxrwxr-x 2 piotrek piotrek       4096 lis 21 22:09 .
    drwxrwxr-x 4 piotrek piotrek       4096 lis 20 23:32 ..
    -r--r--r-- 1 piotrek piotrek  100000000 cze  1  2011 enwik8
    -rw-rw-r-- 1 piotrek piotrek    5263285 lis 21 22:04 enwik8.c1
    -rw-rw-r-- 1 piotrek piotrek   34212562 lis 21 22:04 enwik8.c2
    -rw-rw-r-- 1 piotrek piotrek   17329796 lis 21 22:04 enwik8.c3
    -rw-rw-r-- 1 piotrek piotrek          0 lis 21 22:04 enwik8.c4
    -rw-rw-r-- 1 piotrek piotrek  100000000 lis 21 22:06 enwik8.d
    -rw-rw-r-- 1 piotrek piotrek   61798546 lis 21 22:05 enwik8.sharc.c1
    -rw-rw-r-- 1 piotrek piotrek   57031758 lis 21 22:05 enwik8.sharc.c2
    -r--r--r-- 1 piotrek piotrek 1000000000 cze  1  2011 enwik9
    -rw-rw-r-- 1 piotrek piotrek   51114109 lis 21 22:05 enwik9.c1
    -rw-rw-r-- 1 piotrek piotrek  317825738 lis 21 22:05 enwik9.c2
    -rw-rw-r-- 1 piotrek piotrek  157578696 lis 21 22:05 enwik9.c3
    -rw-rw-r-- 1 piotrek piotrek          0 lis 21 22:05 enwik9.c4
    -rw-rw-r-- 1 piotrek piotrek 1000000000 lis 21 22:07 enwik9.d
    -rw-rw-r-- 1 piotrek piotrek  610691872 lis 21 22:07 enwik9.sharc.c1
    -rw-rw-r-- 1 piotrek piotrek  538757708 lis 21 22:07 enwik9.sharc.c2
    -r--r--r-- 1 piotrek piotrek         82 lis  4 14:50 _md5sums
    piotrek@piotrek-desktop:~/corpora/enwik$ cd ../max-comp/
    piotrek@piotrek-desktop:~/corpora/max-comp$ ~/Projekty/sharc/src/sharc -c1 -i -o < fp.log > fp.log.sharc.c1
    piotrek@piotrek-desktop:~/corpora/max-comp$ ~/Projekty/sharc/src/sharc -c2 -i -o < fp.log > fp.log.sharc.c2
    piotrek@piotrek-desktop:~/corpora/max-comp$ /devel/scala/scala-2.10.3/bin/scala -J-Xmx30G ~/Projekty/sharcy/target/scala-2.10/sharcy_2.10-1.0.jar c fp.log fp.log.c1 fp.log.c2 fp.log.c3 fp.log.c4
    LZP matched:        16986360
    Dictionary matched: 3396856
    Literals:           233852
    Size:               2682718
    piotrek@piotrek-desktop:~/corpora/max-comp$ /devel/scala/scala-2.10.3/bin/scala -J-Xmx30G ~/Projekty/sharcy/target/scala-2.10/sharcy_2.10-1.0.jar d fp.log.c1 fp.log.c2 fp.log.c3 fp.log.c4 fp.log.d
    piotrek@piotrek-desktop:~/corpora/max-comp$ ls -al
    razem 59896
    drwxrwxr-x 2 piotrek piotrek     4096 lis 21 22:10 .
    drwxrwxr-x 4 piotrek piotrek     4096 lis 20 23:32 ..
    -r--r--r-- 1 piotrek piotrek 20617071 lut  4  2003 fp.log
    -rw-rw-r-- 1 piotrek piotrek   750435 lis 21 22:10 fp.log.c1
    -rw-rw-r-- 1 piotrek piotrek  1698428 lis 21 22:10 fp.log.c2
    -rw-rw-r-- 1 piotrek piotrek   233852 lis 21 22:10 fp.log.c3
    -rw-rw-r-- 1 piotrek piotrek        3 lis 21 22:10 fp.log.c4
    -rw-rw-r-- 1 piotrek piotrek 20617071 lis 21 22:10 fp.log.d
    -rw-rw-r-- 1 piotrek piotrek 11066015 lis 21 22:10 fp.log.sharc.c1
    -rw-rw-r-- 1 piotrek piotrek  6321021 lis 21 22:10 fp.log.sharc.c2
    "Size" message tells what's the size of all produced outputs. Other messages tell how many bytes were matched or non-matched.


  31. #25
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Hey Piotr,

    That really sounds impressive; if the native speed is on par with your ratio results then it's a definite winner! Nice one!
    The good thing is, I think, after looking at your code, that I can actually find a way to properly optimize your modified algorithm in C, and I also have another optimization in progress with the new kernel I'm working on which would probably benefit that same algorithm as well!
    Is there any way to exchange emails via this forum so I can PM you ?

  32. #26
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Thanks!

    Is there any way to exchange emails via this forum so I can PM you ?
    There's "Notifications" menu in the upper right corner. Actually, I've PM'd you so it should be highlighted for you.



    PS:
    Another idea came to my mind: in LZP mode, when there's no match, we can try to reduce the number of output literals. Simply take the predicted chunk, check which of the 4 bytes differ from the actual chunk, write only the differing bytes to the output stream, and write 4 flags into the signature indicating which bytes matched.
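    Here is a rough sketch of how that could look on the encoder side (hypothetical names and buffer layout, not libssc's actual API): only the mispredicted bytes are written out, plus a 4-bit mask destined for the signature.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative only: encode one 4-byte chunk against the LZP-predicted chunk.
       Returns a 4-bit mask (bit i set = byte i matched the prediction) to be put
       into the signature, and appends only the mispredicted bytes to 'out'.
       *out_len receives the number of literal bytes written (0..4). */
    static uint8_t encode_partial_match(const uint8_t actual[4],
                                        const uint8_t predicted[4],
                                        uint8_t *out, size_t *out_len) {
        uint8_t mask = 0;
        size_t n = 0;
        for (int i = 0; i < 4; i++) {
            if (actual[i] == predicted[i]) {
                mask |= (uint8_t)(1u << i);   /* correctly predicted: costs 1 flag bit */
            } else {
                out[n++] = actual[i];         /* only mispredicted bytes hit the stream */
            }
        }
        *out_len = n;
        return mask;
    }

    The decoder would read the 4 flag bits, copy the matching bytes from the predicted chunk, and fill the remaining positions from the literal stream.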
    Last edited by Piotr Tarsa; 22nd November 2013 at 02:10.

  33. #27
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    "Another idea came to my mind: in LZP mode, when there's no match, we can try to reduce the number of output literals. Simply take the predicted chunk, check which of the 4 bytes differ from the actual chunk, write only the differing bytes to the output stream, and write 4 flags into the signature indicating which bytes matched."

    Yes. I have an algorithm that does something like this, which can be very useful for "binary" data that aren't text-like and tend to have most of their information in the low bytes of words, the low bytes of byte pairs, or the low bits of bytes. It can also help with common "in-memory" text data formats like UTF-16 or UTF-32 Unicode, and RISC machine code.

    It's harder to do that in simple "dictionary" mode than in LZP "predicted" mode, because whatever bits/bytes your hash function uses will generally already match before you get to that sort of partial-match comparison; otherwise you wouldn't have found the match in the first place. You can use screwy hash functions that only depend on the parts you're trying to match, and not differentially encode the rest, but it's tricky.

  34. #28
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Paul W:
    Hashing is currently the most expensive operation in SHARC, so adding more hashing is prohibitive. What I'm suggesting is some sort of partial LZP match: matches currently have a granularity of 4-byte chunks, and we can make that granularity finer.


    gpnuma:
    Maybe I'll explain the decoding process.


    The original format works as follows:

    At every position:

    bit sequence: 0
    action: copy 4 bytes from input to output, do dictionary[hash(chunk)] = chunk

    bit sequence: 1
    action: take 2-byte hash from input and copy a 4-byte chunk from dictionary[hash] to output



    After my modification:


    In non-LZP mode:

    bit sequence: 0
    action: copy 4 bytes from input to output, do dictionary[hash(chunk)] = chunk

    bit sequence: 1
    action: take 2-byte hash from input and copy a 4-byte chunk from dictionary[hash] to output, enable LZP mode


    In LZP mode:

    bit sequence: 1
    action: get predictedHash from lzpTransitions[lastHash], copy a 4-byte chunk from dictionary[predictedHash] to output

    bit sequence: 00
    action: copy 4 bytes from input to output, do dictionary[hash(chunk)] = chunk, disable LZP mode

    bit sequence: 01
    action: take 2-byte hash from input and copy a 4-byte chunk from dictionary[hash] to output

    Make sure that, after every processed position (in every mode), you do lzpTransitions[lastHash] = hash and then lastHash = hash.
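    To make that concrete, here is a rough C sketch of the modified decoding loop as described above (the names, the 64K table sizes, the placeholder hash, the signature bit order and the flat-buffer readers are all assumptions for illustration; this is neither Piotr's Scala prototype nor libssc code):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative stand-ins for the real readers: signature bits, 2-byte hashes
       and 4-byte literal chunks are pulled from flat in-memory buffers here. */
    typedef struct {
        const uint8_t *sig;  size_t sig_bit;  /* signature bit cursor  */
        const uint8_t *in;   size_t in_pos;   /* hash / literal stream */
        uint8_t       *out;  size_t out_pos;  /* decoded output        */
    } streams_t;

    static uint32_t read_bit(streams_t *s) {             /* MSB-first, an assumption */
        uint32_t b = (s->sig[s->sig_bit >> 3] >> (7 - (s->sig_bit & 7))) & 1u;
        s->sig_bit++;
        return b;
    }
    static uint16_t read_hash(streams_t *s) {
        uint16_t h;
        memcpy(&h, s->in + s->in_pos, 2);
        s->in_pos += 2;
        return h;
    }
    static uint32_t read_chunk(streams_t *s) {
        uint32_t c;
        memcpy(&c, s->in + s->in_pos, 4);
        s->in_pos += 4;
        return c;
    }
    static void write_chunk(streams_t *s, uint32_t c) {
        memcpy(s->out + s->out_pos, &c, 4);
        s->out_pos += 4;
    }
    /* Placeholder 16-bit chunk hash; the real kernel's hash function differs. */
    static uint16_t hash_chunk(uint32_t c) {
        return (uint16_t)((c * 2654435761u) >> 16);
    }

    static uint32_t dictionary[1 << 16];      /* hash -> last 4-byte chunk seen  */
    static uint16_t lzpTransitions[1 << 16];  /* lastHash -> predicted next hash */

    /* Decode 'chunks' 4-byte chunks following the modified format above. */
    void decode(streams_t *s, size_t chunks) {
        bool lzp = false;
        uint16_t lastHash = 0;

        for (size_t i = 0; i < chunks; i++) {
            uint32_t chunk;
            uint16_t hash;

            if (!lzp) {
                if (read_bit(s) == 0) {                /* 0: literal chunk          */
                    chunk = read_chunk(s);
                    hash = hash_chunk(chunk);
                    dictionary[hash] = chunk;
                } else {                               /* 1: dictionary hit,        */
                    hash = read_hash(s);               /*    switches LZP mode on   */
                    chunk = dictionary[hash];
                    lzp = true;
                }
            } else if (read_bit(s) == 1) {             /* 1: LZP prediction hit,    */
                hash = lzpTransitions[lastHash];       /*    nothing read from 'in' */
                chunk = dictionary[hash];
            } else if (read_bit(s) == 0) {             /* 00: literal chunk,        */
                chunk = read_chunk(s);                 /*     leaves LZP mode       */
                hash = hash_chunk(chunk);
                dictionary[hash] = chunk;
                lzp = false;
            } else {                                   /* 01: explicit dictionary   */
                hash = read_hash(s);                   /*     hit, stays in LZP     */
                chunk = dictionary[hash];
            }

            write_chunk(s, chunk);
            lzpTransitions[lastHash] = hash;           /* updated at every position */
            lastHash = hash;
        }
    }

    The interesting branch is the LZP prediction hit: nothing at all is read from the byte stream there, only a single signature bit.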

  35. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    gpnuma (22nd November 2013)

  36. #29
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts

    mostly parallel hashes for overlapping contexts

    Piotr,

    I'm not sure what you think I think, or why it's a misunderstanding of what you're saying.

    What I thought I was agreeing with is that

    1) LZP-ish prediction can be an improvement over plain dictionary lookups, because it lets you encode matches without explicitly recording the hashes in the compressed output (when you do have a successful prediction), and you can fall back to spitting out hashes for dictionary hits when you don't have a successful prediction. When you mostly do have successful predictions, it's a big space win (as in LZP) because you save 2 bytes by not having to output the hashes, letting the decompressor reconstruct them instead.

    2) The time cost can be lower than you'd think, because the case where you have a match can be pretty fast. (You do have to generate the hash for the predicted stuff, though, so it can serve as context for the next prediction.)

    3) Another reason the cost can be lower than you might think is that you can do this at 4-byte granularity, too: the same hash value you use for probing the dictionary can be the context for the next iteration (see the sketch after this list). You already need to maintain the hash table in the decompressor anyway, so you're paying that price (over and above, say, LZ77-style compression) and might as well use it for prediction, too.

    4) The predictions don't have to be exact to be useful. They may only predict some of the bytes, and you can still take advantage of them with a little more complexity: a code for partial matches, a mask saying which bytes in a 4-byte chunk were incorrectly predicted, and the mispredicted bytes themselves. (The decompressor must decode all of that before it can generate the hash that will serve as the next context, though.)
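    To make points 1) to 3) concrete, a compression-side sketch of the classification step might look like this (hypothetical names, mirroring the decoding loop sketched earlier, not WKdm or libssc code): a PREDICTED chunk costs a single signature bit, a DICT_HIT costs a flag plus the 2-byte hash, and only a LITERAL costs the 4 raw bytes.

    #include <stdint.h>

    typedef enum { PREDICTED, DICT_HIT, LITERAL } chunk_code_t;

    static uint32_t dictionary[1 << 16];      /* hash -> last 4-byte chunk seen */
    static uint16_t predictions[1 << 16];     /* lastHash -> expected next hash */

    /* Classify one 4-byte chunk; 'hash' is whatever chunk hash the kernel uses. */
    static chunk_code_t classify_chunk(uint32_t chunk, uint16_t hash, uint16_t *lastHash) {
        chunk_code_t code;

        if (dictionary[hash] == chunk) {
            /* A prediction hit needs no hash bytes in the output at all. */
            code = (predictions[*lastHash] == hash) ? PREDICTED : DICT_HIT;
        } else {
            dictionary[hash] = chunk;         /* misses refresh the dictionary */
            code = LITERAL;
        }

        predictions[*lastHash] = hash;        /* the probe hash doubles as context */
        *lastHash = hash;
        return code;
    }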

    I'm using all those ideas and several related ones in what I hope will be the successor algorithm for WKdm.

    One thing to notice is that generating multiple hashes isn't necessarily very expensive, if you can arrange to have most of the work shared between several consecutive and/or overlapping contexts.

    If you use Bloom's LZP1 hash for 3-byte contexts, for example, you shift all the bits 11 places and XOR them together, then extract the bits you want. You don't process each byte separately, and you just need a shift and a mask to extract the right bits---or just a mask, depending on which bits you care about.

    It's easier to visualize if you pretend we'll shift by 12 bits, and can see how the nybbles overlap.

    For convenience, suppose that our four-byte sequence looks like A1B2C3D4, which shifts down 12 to become 000A1B2C.

    If we just select the bottom 12 bits of the XOR, that is the XOR of the original bottom 3 nybbles with the 3 nybbles above them.
    That's our hash of the bottom 3 bytes.

    But we can also shift the pattern down 12 bits to discard those, and take the new bottom 12 bits (3 nybbles), and that is the hash of the TOP 3 bytes.

    That allows us to efficiently compute hashes for two overlapping contexts, one even-aligned and the other odd.
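    Sticking with the 12-bit-shift simplification, that trick is essentially the following (illustrative only; Bloom's actual LZP1 hash shifts by 11, and the widths and masks here are assumptions):

    #include <stdint.h>

    /* One shift and one XOR mix every nybble with the nybble 3 positions above it;
       the low 12 bits then hash the low 3 bytes, and the next 12 bits hash the
       overlapping upper nybbles, as in the A1B2C3D4 walkthrough above. */
    static void overlapping_hashes(uint32_t word, uint16_t *low12, uint16_t *high12) {
        uint32_t x = word ^ (word >> 12);
        *low12  = (uint16_t)(x & 0xFFFu);           /* hash of the low 3 bytes      */
        *high12 = (uint16_t)((x >> 12) & 0xFFFu);   /* hash drawn from the top part */
    }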

    If we do this with 64- or 128-bit values, we can compute a whole bunch of those at a whack. (Any Intel chip with MMX and any ARM chip with NEON can operate on at least 64 bits at a whack, even in 32-bit mode. Any Intel chip with SSE2 or better can do 128, as can any recent ARM with NEON. Intel AVX chips can do 256.)
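    As a sketch of the wide version with SSE2 intrinsics, four independent 32-bit words can be mixed per instruction sequence (again just an illustration with an assumed 12-bit hash width, not libssc code):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Hash four 32-bit words at once; each 32-bit lane ends up holding
       a 12-bit shift-XOR hash in its low bits. */
    static __m128i hash4_shift_xor(const uint32_t words[4]) {
        __m128i v = _mm_loadu_si128((const __m128i *)words);
        __m128i x = _mm_xor_si128(v, _mm_srli_epi32(v, 12));
        return _mm_and_si128(x, _mm_set1_epi32(0xFFF));
    }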

    Hashing can be pretty cheap, at least on the compression side. On the decompression side, you can't do all the hashes in parallel, because you don't have all the data available yet. In the general case, you have to compute the hashes consecutively as you reconstruct the data, at least the ones you actually end up using for a successful prediction.

  37. The Following User Says Thank You to Paul W. For This Useful Post:

    gpnuma (22nd November 2013)

  38. #30
    Member
    Join Date
    Aug 2013
    Location
    France
    Posts
    77
    Thanks
    27
    Thanked 26 Times in 11 Posts
    Hello Paul,

    Are you sure that this shift-XOR method with only a few shifted bits produces good hashes, distribution-wise?
    I know it's the basic method used with FNV-1 to reduce a 64- or 32-bit hash to a smaller number of bits (partial XOR-folding, although the shift is generally by more than 12 bits there), but I'd be happy to see whether it produces nicely distributed hashes.
    Maybe a test with SMHasher could validate that?
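    For reference, the XOR-folding in question is essentially the following (a generic illustration, not anything from libssc or the FNV reference code); SMHasher would indeed be the natural way to judge the resulting distribution:

    #include <stdint.h>

    /* Reduce a 32-bit hash to n bits (0 < n < 32) by folding the high part onto
       the low part, instead of plain truncation, so every input bit still
       affects the result. */
    static uint32_t xor_fold(uint32_t h, unsigned n) {
        return ((h >> n) ^ h) & ((1u << n) - 1u);
    }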

    Also, please correct me if I'm wrong, but doesn't using SSE2 or AVX registers carry a cost of its own? I'm not sure it would bring a great performance benefit there.
