
Thread: My concurrent block sorter (QRotSort)

  1. #1
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts

    My concurrent block sorter (QRotSort)

    I've created my version of qsufsort. This one uses 14N memory:
    - N for input;
    - N for output;
    - 12N for auxiliary tables (3 int32s per byte of input data).

    Currently it's much slower than state-of-the-art SACAs (i.e. far behind libdivsufsort).

    I've parallelized it using OpenMP, but unfortunately it doesn't scale well. For example, on my Core 2 Duo there's only about 150% CPU usage (and that includes some overhead from parallelization!). If someone has experience with OpenMP, please share some advice.

    Basically, my algorithm is based on the doubling technique also used in qsufsort. The difference is that I precompute the keys in a separate step, so they aren't modified concurrently while running a sort pass that doubles the radix.

    I've attached the current version of my program - it's somewhat buggy, but for some inputs and parameter combinations it outputs fully correct data.

    I've also abandoned my plans for an OpenCL block sorter - if QRotSort scales badly on two CPU cores, it will scale tremendously badly on hundreds of stream cores.
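    The key-precomputation idea can be sketched roughly like this - a plain single-threaded Python reconstruction of the prefix-doubling scheme for rotation sorting, my own illustration rather than QRotSort's actual code:

```python
# Illustrative sketch of prefix doubling for sorting rotations: before each
# pass the composite keys are precomputed into a separate table, so the sort
# itself never reads ranks that are being updated - which is the property
# that makes the passes safe to run concurrently.

def rotation_sort_doubling(data: bytes) -> list[int]:
    n = len(data)
    if n < 2:
        return list(range(n))
    sa = list(range(n))
    rank = list(data)                      # initial keys: single bytes
    k = 1
    while True:
        # separate key-precomputation step (order k -> order 2k)
        key = [(rank[i], rank[(i + k) % n]) for i in range(n)]
        sa.sort(key=lambda i: key[i])      # stand-in for the parallel sort
        # rebuild ranks from the sorted order
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key[sa[j]] != key[sa[j - 1]])
        rank = new_rank
        # done when all rotations are distinguished (or the input is periodic)
        if rank[sa[-1]] == n - 1 or k >= n:
            return sa
        k *= 2
```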
    Attached Files

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    150% may not be bad. Don't forget that a second core doesn't double memory speed, and memory is the limiting factor for most compression algorithms, including BWT.

    If you are serious about MP experiments, it's better to upgrade to at least 4 cores.

  3. #3
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @Piotr: Bulat may be right. Your bottleneck could well be memory access - 14N memory really is a lot. If there is a free server with multiple cores at your university, you should give it a try. That would give you a good idea. Good luck.
    BIT Archiver homepage: www.osmanturan.com

  4. #4
    Member
    Thanks for tips.

    I'm basically concerned by the fact that I get only 150% CPU usage instead of 200% (i.e. saturating both cores). The speedup is about 30% on enwik8, so it's not all that bad. Perhaps I'm missing something important about OpenMP, or there are some serial bottlenecks in my code.


    Update:
    I've sorted out some bugs. Now on enwik8 there's:

    real 0m20.221s
    user 0m33.480s
    sys 0m0.750s

    Where the non-parallelized pre-sorting takes 3s.

    Parallel work CPU time = 33.48s + .75s - 3s = 31.23s
    Real parallel work time = 20.22s - 3s = 17.22s

    That's 181% scaling now for the parallelized part.

    Total scaling is (33.48 + .75) / 20.22s * 100 % = 169 %

    When run with one thread, the total time is: 27.79s

    So total speedup is 27.79 / 20.22 * 100 % = 137 %

    I'll try to make the chunk size computation parallel to speed up the parallel work.
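    The scaling arithmetic above can be restated in a few lines of Python (the numbers are just the measured values quoted in this post, nothing more):

```python
# Restating the scaling arithmetic above; times are the measured values
# quoted in the post (`time` output with 2 threads, plus the 1-thread run).
real, user, sys_t = 20.221, 33.480, 0.750
serial = 3.0                         # non-parallelized pre-sorting
single_thread_total = 27.79

parallel_cpu  = user + sys_t - serial    # CPU time spent in the parallel part
parallel_real = real - serial            # wall time of the parallel part

parallel_scaling = parallel_cpu / parallel_real   # ~1.81, i.e. 181%
total_scaling    = (user + sys_t) / real          # ~1.69, i.e. 169%
total_speedup    = single_thread_total / real     # ~1.37, i.e. 137%
```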
    Last edited by Piotr Tarsa; 24th May 2011 at 20:00.

  5. #5
    Member
    I have made some good progress, although it's still far from usable. Basically, I've switched from a shallow radix sort + many doubling stages scheme to shallow sorting + few doubling stages. I've also implemented sorting of compressed streams (using many algorithms: merge sort, shell sort, insertion sort, radix sort; quick sort is left to do). As of now I'm decoding some of the compressed streams only to determine the length of the decoded strings. Shallow sorting has a depth limit of some fixed number of uncompressed bytes, and the length of a compressed substring is often completely disproportionate to the length of the uncompressed string - so I have to decode to be sure.

    I have made some tests on Manzini's corpus, which consists almost entirely of natural text data (with some formatting, of course). I preprocessed the files with SREP 2.99 (with default parameters) and compared the compression of bsc on the precompressed files and the original files. The results are surprising - usually the final compression ratio is better with SREP preprocessing, even when the block size is big enough to compress all data in one block. I've found SREP preprocessing to offer a better final compression ratio (i.e. after bsc) than the LZP that is built into bsc.

    As the doubling stage is still slow (and I don't have a complete idea right now how to speed it up while retaining good scaling) and SREP-like preprocessing is very fast, I am thinking about including an SREP-like algorithm in my program. SREP preprocessing generally removes the need for the radix doubling technique, and it also basically removes the need for special handling of tandem repeats. Also, without the doubling stage, the shallow sort could be set to unlimited depth (in practice the depth would be bounded thanks to SREP preprocessing) and I could completely remove the decoding from the algorithms that sort on compressed streams.

    I'm seeking information about how SREP works and/or what SREP's license is, in case I want to include it. Inclusion wouldn't be the best option IMO, as I don't need any special I/O routines from SREP right now. Some simple memory-to-memory algorithm with SREP-like behavior and performance would be sufficient. If it's possible to make such an algorithm parallelizable, that would be even better. (BTW: I've sent a PM to Bulat, but haven't received an answer yet - maybe he doesn't see the notification?)
    Attached Files

  6. #6
    Programmer Bulat Ziganshin's Avatar
    I haven't received the notification email yet.

    SREP is GPL, like all my stuff.

    SREP has a compress() function that is called multiple times with the same HashTable.

    SREP's main idea is described in Tridgell's thesis. If you want to understand the algorithm, you may start with the earliest program versions (available via SVN).

    But I can't believe that SREP outperformed LZP and REP. Try with a larger buffer and hash.
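    For reference, the core trick from Tridgell's thesis is a weak checksum that can be rolled forward one byte at a time in O(1); a minimal Python sketch (illustrative only, not SREP's actual implementation) might look like:

```python
# Minimal rsync-style rolling checksum (after Tridgell's thesis).
# Illustrative sketch only - not SREP's actual code.
M = 1 << 16

def weak_checksum(block: bytes) -> tuple[int, int]:
    # a: plain byte sum; b: position-weighted sum
    a = sum(block) % M
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out: int, inp: int, blocklen: int) -> tuple[int, int]:
    # slide the window one byte: drop `out`, append `inp`, in O(1)
    a = (a - out + inp) % M
    b = (b - blocklen * out + a) % M
    return a, b
```

    This is what lets a matchfinder scan every byte offset cheaply: the checksum of `data[i+1:i+L+1]` is derived from that of `data[i:i+L]` without rereading the block.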

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    LZP probably interleaves flags or other such stuff with the literals, which is not very compatible with contextual prediction, while SREP probably keeps literals and match structures separate. Also, a BWT compressor has to encode every symbol in a long match, and there are some precision limits, so for long matches it may be more efficient to describe them as external structures. This depends on postcoder tuning though, i.e. I'd say that a fully tuned postcoder would show better results for file.bwt than for file.srep.bwt.

  8. #8
    Programmer Bulat Ziganshin's Avatar
    There are two LZP versions (one by Ilya, used in grzip/bsc, and one by Dmitry, used in FreeArc); one keeps the indexes separate and the other interleaves them.

  9. #9
    Member
    GPL is too restrictive for my taste. I would like to use BSD or something similar. BTW: if it's GPL, then you should bundle the license file with the source code.


    I haven't tested "lzp and rep", only SREP and bsc's own LZP.

    Here is a listing from my Manzini's corpus directory (Ubuntu 11.04 64-bit, Intel Core 2 Duo E8400 CPU, 8 GiB DDR2-800 RAM). (Sorry for some useless columns; I don't know how to format that output using command-line switches.)
    Code:
    drwxr-xr-x 2 piotrek piotrek      4096 2011-08-18 18:24 ./
    drwxr-xr-x 5 piotrek piotrek      4096 2011-05-22 23:24 ../
    -rwxr-xr-x 1 piotrek piotrek    761820 2011-08-17 23:13 bsc*
    -rw-r--r-- 1 piotrek piotrek  33470341 2011-08-18 18:15 chr22.256
    -rw-r--r-- 1 piotrek piotrek   7286311 2011-08-18 18:19 chr22.256.bsc
    -rw-r--r-- 1 piotrek piotrek  34553758 2011-05-22 21:43 chr22.dna
    -rw-r--r-- 1 piotrek piotrek   7287552 2011-08-17 23:19 chr22.dna.bsc
    -rw-r--r-- 1 piotrek piotrek   8441825 2011-05-22 21:37 chr22.dna.bz2
    -rw-r--r-- 1 piotrek piotrek   7287148 2011-08-17 23:29 chr22.dna.lzpbsc
    -rw-r--r-- 1 piotrek piotrek  33411632 2011-08-17 22:49 chr22.dna.srep
    -rw-r--r-- 1 piotrek piotrek   7275464 2011-08-17 23:20 chr22.dna.srep.bsc
    -rw-r--r-- 1 piotrek piotrek 105277340 2011-05-22 21:44 etext99
    -rw-r--r-- 1 piotrek piotrek  96445448 2011-08-18 18:16 etext99.256
    -rw-r--r-- 1 piotrek piotrek  21350500 2011-08-18 18:20 etext99.256.bsc
    -rw-r--r-- 1 piotrek piotrek  22150002 2011-08-17 23:23 etext99.bsc
    -rw-r--r-- 1 piotrek piotrek  29032096 2011-05-22 21:44 etext99.bz2
    -rw-r--r-- 1 piotrek piotrek  21524735 2011-08-17 23:30 etext99.lzpbsc
    -rw-r--r-- 1 piotrek piotrek  96753497 2011-08-17 22:49 etext99.srep
    -rw-r--r-- 1 piotrek piotrek  21399690 2011-08-17 23:23 etext99.srep.bsc
    -rw-r--r-- 1 piotrek piotrek  69476582 2011-08-18 18:16 gcc-3.0.256
    -rw-r--r-- 1 piotrek piotrek  10471170 2011-08-18 18:21 gcc-3.0.256.bsc
    -rw-r--r-- 1 piotrek piotrek  86630400 2011-05-22 21:40 gcc-3.0.tar
    -rw-r--r-- 1 piotrek piotrek  10662376 2011-08-17 23:26 gcc-3.0.tar.bsc
    -rw-r--r-- 1 piotrek piotrek  13724292 2011-05-22 21:40 gcc-3.0.tar.bz2
    -rw-r--r-- 1 piotrek piotrek  10533354 2011-08-17 23:31 gcc-3.0.tar.lzpbsc
    -rw-r--r-- 1 piotrek piotrek  76152530 2011-08-17 22:49 gcc-3.0.tar.srep
    -rw-r--r-- 1 piotrek piotrek  10498110 2011-08-17 23:26 gcc-3.0.tar.srep.bsc
    -rw-r--r-- 1 piotrek piotrek  39422105 2011-05-22 21:39 howto
    -rw-r--r-- 1 piotrek piotrek  37543665 2011-08-18 18:16 howto.256
    -rw-r--r-- 1 piotrek piotrek   7711142 2011-08-18 18:21 howto.256.bsc
    -rw-r--r-- 1 piotrek piotrek   7836895 2011-08-17 23:27 howto.bsc
    -rw-r--r-- 1 piotrek piotrek  10193169 2011-05-22 21:39 howto.bz2
    -rw-r--r-- 1 piotrek piotrek   7772359 2011-08-17 23:29 howto.lzpbsc
    -rw-r--r-- 1 piotrek piotrek  37347567 2011-08-17 22:50 howto.srep
    -rw-r--r-- 1 piotrek piotrek   7696758 2011-08-17 23:27 howto.srep.bsc
    -rw-r--r-- 1 piotrek piotrek  69728899 2011-05-22 21:37 jdk13c
    -rw-r--r-- 1 piotrek piotrek  41623262 2011-08-18 18:16 jdk13c.256
    -rw-r--r-- 1 piotrek piotrek   2887471 2011-08-18 18:21 jdk13c.256.bsc
    -rw-r--r-- 1 piotrek piotrek   2845564 2011-08-17 23:34 jdk13c.bsc
    -rw-r--r-- 1 piotrek piotrek   4903326 2011-05-22 21:37 jdk13c.bz2
    -rw-r--r-- 1 piotrek piotrek   2875159 2011-08-17 23:31 jdk13c.lzpbsc
    -rw-r--r-- 1 piotrek piotrek  44713291 2011-08-17 22:50 jdk13c.srep
    -rw-r--r-- 1 piotrek piotrek   2940270 2011-08-17 23:35 jdk13c.srep.bsc
    -rw-r--r-- 1 piotrek piotrek 101046571 2011-08-18 18:16 linux-2.4.5.256
    -rw-r--r-- 1 piotrek piotrek  17077992 2011-08-18 18:22 linux-2.4.5.256.bsc
    -rw-r--r-- 1 piotrek piotrek 116254720 2011-05-22 21:41 linux-2.4.5.tar
    -rw-r--r-- 1 piotrek piotrek  17405026 2011-08-17 23:36 linux-2.4.5.tar.bsc
    -rw-r--r-- 1 piotrek piotrek  21508430 2011-05-22 21:41 linux-2.4.5.tar.bz2
    -rw-r--r-- 1 piotrek piotrek  17144412 2011-08-17 23:32 linux-2.4.5.tar.lzpbsc
    -rw-r--r-- 1 piotrek piotrek 107242988 2011-08-17 22:51 linux-2.4.5.tar.srep
    -rw-r--r-- 1 piotrek piotrek  17119707 2011-08-17 23:37 linux-2.4.5.tar.srep.bsc
    -rwxr-xr-x 1 piotrek piotrek     35840 2005-05-19 16:35 LZP.exe*
    -rw-r--r-- 1 piotrek piotrek       429 2011-05-22 21:42 md5sums.txt
    -rw-r--r-- 1 piotrek piotrek 114711151 2011-05-22 21:48 rctail96
    -rw-r--r-- 1 piotrek piotrek  74014726 2011-08-18 18:17 rctail96.256
    -rw-r--r-- 1 piotrek piotrek  10076545 2011-08-18 18:23 rctail96.256.bsc
    -rw-r--r-- 1 piotrek piotrek  10319949 2011-08-17 23:38 rctail96.bsc
    -rw-r--r-- 1 piotrek piotrek  17108374 2011-05-22 21:48 rctail96.bz2
    -rw-r--r-- 1 piotrek piotrek  10119475 2011-08-17 23:32 rctail96.lzpbsc
    -rw-r--r-- 1 piotrek piotrek  77047551 2011-08-17 22:52 rctail96.srep
    -rw-r--r-- 1 piotrek piotrek  10209432 2011-08-17 23:38 rctail96.srep.bsc
    -rw-r--r-- 1 piotrek piotrek 116421901 2011-05-22 21:49 rfc
    -rw-r--r-- 1 piotrek piotrek 106253533 2011-08-18 18:17 rfc.256
    -rw-r--r-- 1 piotrek piotrek  15304363 2011-08-18 18:23 rfc.256.bsc
    -rw-r--r-- 1 piotrek piotrek  15607394 2011-08-17 23:39 rfc.bsc
    -rw-r--r-- 1 piotrek piotrek  21767711 2011-05-22 21:49 rfc.bz2
    -rw-r--r-- 1 piotrek piotrek  15396738 2011-08-17 23:33 rfc.lzpbsc
    -rw-r--r-- 1 piotrek piotrek 106486484 2011-08-17 22:52 rfc.srep
    -rw-r--r-- 1 piotrek piotrek  15261078 2011-08-17 23:39 rfc.srep.bsc
    -rw-r--r-- 1 piotrek piotrek  90681139 2011-08-18 18:17 sprot34.256
    -rw-r--r-- 1 piotrek piotrek  17560730 2011-08-18 18:23 sprot34.256.bsc
    -rw-r--r-- 1 piotrek piotrek 109617186 2011-05-22 21:50 sprot34.dat
    -rw-r--r-- 1 piotrek piotrek  17871491 2011-08-17 23:40 sprot34.dat.bsc
    -rw-r--r-- 1 piotrek piotrek  22744683 2011-05-22 21:50 sprot34.dat.bz2
    -rw-r--r-- 1 piotrek piotrek  17585786 2011-08-17 23:33 sprot34.dat.lzpbsc
    -rw-r--r-- 1 piotrek piotrek 100056302 2011-08-17 22:54 sprot34.dat.srep
    -rw-r--r-- 1 piotrek piotrek  17790912 2011-08-17 23:41 sprot34.dat.srep.bsc
    -rwxr-xr-x 1 piotrek piotrek    107520 2011-08-10 20:08 srep.exe*
    -rw-r--r-- 1 piotrek piotrek 104201579 2011-05-22 21:47 w3c2
    -rw-r--r-- 1 piotrek piotrek  60692125 2011-08-18 18:17 w3c2.256
    -rw-r--r-- 1 piotrek piotrek   5665312 2011-08-18 18:24 w3c2.256.bsc
    -rw-r--r-- 1 piotrek piotrek   6150548 2011-08-17 23:41 w3c2.bsc
    -rw-r--r-- 1 piotrek piotrek  10258708 2011-05-22 21:47 w3c2.bz2
    -rw-r--r-- 1 piotrek piotrek   5764561 2011-08-17 23:34 w3c2.lzpbsc
    -rw-r--r-- 1 piotrek piotrek  66709456 2011-08-17 22:54 w3c2.srep
    -rw-r--r-- 1 piotrek piotrek   5709304 2011-08-17 23:42 w3c2.srep.bsc
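    (As an aside, the extra columns in a listing like the one above can be dropped with a small awk pipe - standard coreutils/awk, nothing project-specific:

```shell
# Keep only the size and name columns of `ls -l` output.
# NR > 1 skips the leading "total ..." line; assumes no spaces in names.
ls -l | awk 'NR > 1 { print $5, $NF }'
```

    which prints just the byte size and filename per line.)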
    SREP was run with options -a1 -f (-a1 was the fastest mode, despite the information in the manual).

    .bsc and .srep.bsc files are produced by bsc -b128p
    .lzpbsc are produced by bsc -b128H28M255

    Overall, the cumulative sizes of the files are:
    - uncompressed: 875799 kB
    - SREPped: 728438 kB
    - preprocessed with Shkarin's LZP: 694577 kB
    - compressed without preprocessing: 115367 kB
    - compressed with bsc's LZP: 113284 kB
    - compressed with SREP preprocessing: 113184 kB
    - compressed with Shkarin's LZP: 112587 kB

    I'll try with REP or LZP.

    Update:
    Shkarin's LZP didn't give much of an improvement, and it generally takes noticeably longer than SREP. Also, it seems possible for me to parallelize SREP, while LZP is inherently serial.
    Last edited by Piotr Tarsa; 18th August 2011 at 20:02. Reason: Added LZP results

  10. #10
    Programmer Bulat Ziganshin's Avatar
    >.lzpbsc are produced by bsc -b128H28M255

    Have you tried smaller -M values? 32..128 was optimal when I tested grzip years ago.

    I spent a lot of time optimizing SREP, but the same may be tried for LZP too. Moreover, (S)REP's matchfinder can be combined with LZP's match coder.

    In particular, both are sequential, and both can be made multi-threaded using, for example, the ICU technique.

    >GPL is too restrictive for my taste

    That's intentional - if you want to use my sources, you should publish your own under the same conditions. I think it's fair.

  11. #11
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    >GPL is too restrictive for my taste

    That's intentional - if you want to use my sources, you should publish your own under the same conditions. I think it's fair.
    Replying for myself:
    I too want people who use my sources to share theirs. Yet I prefer non-copyleft licenses, because I don't feel entitled to force people to give back.
    If people want to contribute, they will contribute.
    If they don't, they will either accept being forced, go away, or violate the license. None of these options is good IMO.

  12. #12
    Member
    m^2:
    The BSD license (at least some of its variations), AFAIK, requires giving the author credit in any derivative work. It doesn't require releasing changes as open source, but still, the author of the original code is credited. I think BSD is good if we want widespread success. bzip2 is released under a BSD license, and zlib has IMO an even more permissive license (depending on which version of the BSD license you are comparing to).

    Bulat:
    What's that ICU technique? Have you tried a Bloom filter to filter out false positives (on short signatures) in SREP?
    Would you mind if I reimplemented some of your ideas in BSD-licensed code?

  13. #13
    Member m^2's Avatar
    Quote Originally Posted by Piotr Tarsa View Post
    m^2:
    The BSD license (at least some of its variations), AFAIK, requires giving the author credit in any derivative work. It doesn't require releasing changes as open source, but still, the author of the original code is credited. I think BSD is good if we want widespread success. bzip2 is released under a BSD license, and zlib has IMO an even more permissive license (depending on which version of the BSD license you are comparing to).
    It depends on what you mean by 'credit'. The most recent version (2-clause) requires passing along the license file (which contains your name) and nothing more. The first BSD license (4-clause) required that you be mentioned in all promotional materials too.
    I haven't seen anybody using the 4-clause version these days, because large projects couldn't be effectively promoted; contributor mentions would take too much time/space.
    https://secure.wikimedia.org/wikiped...ki/Bsd_license

  14. #14
    Programmer Bulat Ziganshin's Avatar
    ICU = Index/Compress/Update

    I didn't use a Bloom filter. In -m1 mode there are almost no false positives, although for -m3 it's a nightmare (that's why it's so slow). Can you describe how a Bloom filter could be used here?

    You can reimplement my ideas, but don't copy my code. Briefly speaking: if you can do it without looking into my code, you can do it.

    You forgot to mention in QRotSort.tgz that this code has a BSD license.

  15. #15
    Member
    A Bloom filter requires many hash functions, but as Wikipedia points out, we can use double hashing (or even better, triple hashing) to produce many hash functions from a few. So with three rolling hashes we can produce as many hash functions as we need for a Bloom filter to become useful. Or, if our rolling hash function produces a big result, we can slice it into three values instead of computing three hashes.
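    For concreteness, the double-hashing construction (deriving all k Bloom indices from two base hash values, as in the Kirsch-Mitzenmacher result that the Wikipedia article refers to) can be sketched like this; the choice of SHA-256 sliced into two 64-bit halves is purely illustrative:

```python
# Sketch of a Bloom filter using the double-hashing construction:
# all k indices are derived from two base hashes as h1 + i*h2.
# Hash choice here is illustrative, not a recommendation.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int, num_hashes: int):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _indices(self, item: bytes):
        # slice one big hash into two base values instead of hashing twice
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:8], 'little')
        h2 = int.from_bytes(d[8:16], 'little') | 1   # odd stride covers the table
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: bytes) -> None:
        for idx in self._indices(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def __contains__(self, item: bytes) -> bool:
        # may give false positives, never false negatives
        return all(self.bits[idx >> 3] & (1 << (idx & 7))
                   for idx in self._indices(item))
```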

    I am now thinking about a two-stage rzip/srep-like preprocessing. As I am using 14N or 9N bytes for block-sorting, I can reuse that memory to speed up preprocessing. For example, in the first pass I could compute a 64-bit hash for every position (that would eat 8N of memory), then sort the hashes, keep only the values that appear more than once, and remove duplicates. The resulting values I could put into a Bloom filter. Then I could compute SHA-512 hashes and 64-bit rolling hashes for every 256th position (but compute SHA-512 only for blocks whose 64-bit rolling hash is in the Bloom filter). The file would be divided into segments (their number a few times higher than the number of processor cores, to provide good load balancing), and any SHA-512 duplicates within one segment would be removed. After such preprocessing I could safely do LZ77 parsing and model updating for every segment in parallel.

    That's only a quick and not well-thought-out idea, but anyway, if I optimize QRotSort for highly repetitive data, I'll probably opt for SREP-like preprocessing. I just have no idea how to speed up QRotSort on highly repetitive data. divsufsort's ideas probably won't work, because they probably cannot be parallelized.

    QRotSort is still unofficial, so there's no license file inside the package.

  16. #16
    Programmer Bulat Ziganshin's Avatar
    Anyway, it's great that you open its sources at least. I think that a BWT sorting procedure doesn't need to be good at highly repetitive data, since that can be solved with LZ preprocessing.

    Your scheme is unnecessarily overcomplicated.

    1) SHA-1 is useful only when you need to be 100% sure that the data are the same because you can't reread them. I don't use SHA-1 in the -m2/-m3 modes, since even my simple rolling hash is sufficient on 512-byte blocks - it had only 40 false positives on a gigabyte file.

    2) You have plenty of RAM, so it won't be hard to build an efficient scheme. The main thing to count is memory accesses - currently SREP makes less than one memory access per byte checked.

  17. #17
    Member
    Bulat:
    Thanks for info.

    Quote Originally Posted by Shelwien View Post
    Also, a BWT compressor has to encode every symbol in a long match, and there are some precision limits, so for long matches it may be more efficient to describe them as external structures. This depends on postcoder tuning though, i.e. I'd say that a fully tuned postcoder would show better results for file.bwt than for file.srep.bwt.
    Well, I must disagree here. BWT-based compression works well only on uniform data; that's why there are many segmenting techniques and such. Adding a few super-long repetitions to a fairly uniform file with short repetitions will make any (realistic) postcoder inefficient. Adding a repetition of half of the previously read data, for example, will increase the run lengths by one for half of the contexts in the BWT output. And that effect won't be local - it will be dispersed in a random-like way throughout the entire file.

    I've made an experiment. Here are the logs; I think they are obvious enough that I can omit explanations:
    Code:
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ base64 --wrap=0 chr22.dna.bsc > _base.dna
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ base64 --wrap=0 howto.bsc > _base.how
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ base64 --wrap=0 w3c2.bsc > _base.w3c
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ cat _base.dna _base.how _base.w3c > _bad1
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ cat _base.dna _base.how _base.how _base.w3c > _bad2
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ cat _base.dna _base.how _base.how _base.how _base.w3c > _bad3
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ ./bsc e _bad1 _bad1.bsc -b500p
    This is bsc, Block Sorting Compressor. Version 2.8.0. 8 August 2011.
    Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@gmail.com>.
    
    _bad1 compressed 28366664 into 21322156 in 8.291 seconds.
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ ./bsc e _bad2 _bad2.bsc -b500p
    This is bsc, Block Sorting Compressor. Version 2.8.0. 8 August 2011.
    Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@gmail.com>.
    
    _bad2 compressed 38815860 into 24519117 in 11.551 seconds.
    piotrek@p5q-pro:~/Pobrane/corpora/manzini$ ./bsc e _bad3 _bad3.bsc -b500p
    This is bsc, Block Sorting Compressor. Version 2.8.0. 8 August 2011.
    Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@gmail.com>.
    
    _bad3 compressed 49265056 into 24686756 in 13.142 seconds.
    Ideally, adding one repetition should result in an increase of only a few bytes in the compressed file. As you can see, it's the opposite: there's over 3 MB of overhead from adding a repetition about 10 MB long. Adding a second repetition, on the other hand, increases the compressed file size only a little (well, at least it's negligible compared to the previous increase in size).


    Summing up, I think SREP improves compression not due to imprecise postcoders (I think bsc's is a pretty efficient one, and differences of < 1% of compressed file size don't count as serious inefficiencies), but because it makes files more uniform.

    Quote Originally Posted by Shelwien View Post
    LZP probably interleaves flags or other such stuff with the literals, which is not very compatible with contextual prediction, while SREP probably keeps literals and match structures separate.
    Well, as of now, current SACAs aren't aware of flags (or anything like them), i.e. of treating flags as additional symbols in the alphabet. They treat all bytes the same, so matches add more noise than they should.

    As I'm the author of a SACA, I can add SREP-like preprocessing and extend the alphabet to accommodate flags, but that could take a long time. For now, some author of a BWT-based compressor could talk to e.g. Yuta Mori about making his libdivsufsort aware of long-match flags. Maybe someone (Gribok?) has already spoken to him about that?

