
Thread: pcompress, a deduplication/compression utility

  1. #1
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts

    pcompress, a deduplication/compression utility

    I'm new to this forum and recently found it via Google. I'm developing an open-source utility that combines data deduplication and compression using approaches similar to what is being discussed in this thread. I'm using content-based chunking with a Rabin-style rolling hash approach (the prime multiplier value from srep worked best). Apart from that, my approach has been to take various existing compression algorithms and combine them in interesting ways to achieve better results. For example, the LZP implementation from libbsc appears to benefit many other compression schemes when applied first in sequence.

    It also tries to parallelize by breaking data into segments, which of course affects the final ratio slightly. Deduplication is currently local to a segment; I'm still working on global dedupe. My intent is to make this a scalable deduplicated archival store similar to products from major storage vendors. I have some ideas around how to scale with 4KB chunks.

    Source: https://github.com/moinakg/pcompress/
    Website: http://moinakg.github.com/pcompress/

    I have done some comparisons and analysis on my blog:
    http://moinakg.wordpress.com/tag/pcompress/

  2. #2
    Member
    Join Date
    Apr 2012
    Location
    Denmark
    Posts
    65
    Thanks
    19
    Thanked 16 Times in 14 Posts
    I might be missing it, but is there a Windows binary somewhere?

  3. #3
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    It is Linux only at the moment. I am working on a Windows port.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,254
    Thanks
    305
    Thanked 774 Times in 484 Posts
    Very interesting. I need to add it to my benchmark.

    I'm currently developing a journaling archiver for incremental backups. http://mattmahoney.net/dc/zpaq.html

    "Journaling" means that it is append only. When you do an incremental backup, it saves both the old and new versions of the files. You can revert to an earlier version of the archive to extract the old files if you want. Incremental backups are fast because it compares the dates and only adds files if they change. Typically it takes a couple of hours to back up 100 GB the first time and then a couple of minutes for daily incremental backups.

    For deduplication, I use an average fragment size of 64 KB. You might be familiar with this study by Microsoft: http://static.usenix.org/event/fast1...pers/Meyer.pdf

    Most duplication occurs at the whole file level. I used a large fragment size to reduce memory requirements. I use a rolling hash to find fragment boundaries, then compare the 20 byte SHA-1 hash of the fragment to those stored in the archive. This imposes a theoretical memory requirement of 0.4 GB per TB of data (in reality, 2-3x this), and ought to scale as long as memory and disk grow at the same rate. The format uses 32 bit fragment indexes, which imposes an archive limit of about 250 TB or 4 billion files, whichever comes first.
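
    To see roughly where the 0.4 GB figure comes from, here is a back-of-envelope check assuming about 24 bytes of index state per 64 KB fragment (a 20-byte SHA-1 plus a 32-bit fragment number; the exact bookkeeping may differ):

    Code:
    /* Rough arithmetic only; the 24 bytes/fragment accounting is an assumption. */
    #include <stdio.h>

    int main(void) {
        double data_bytes   = 1e12;            /* 1 TB of input data */
        double avg_fragment = 64 * 1024.0;     /* 64 KB average fragment size */
        double per_fragment = 20 + 4;          /* 20-byte SHA-1 + 32-bit index (assumed) */
        double fragments    = data_bytes / avg_fragment;
        printf("fragments: %.0f  index: %.2f GB per TB\n",
               fragments, fragments * per_fragment / 1e9);
        /* ~15.6 million fragments, ~0.37 GB, in line with 0.4 GB per TB. */
        return 0;
    }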

    Rather than a Rabin hash, I use a rolling 32 bit hash like h=(h+getchar()+1)*M;, where M is an odd number if the current byte is predicted by an order 1 context (a 256 byte table), and an even number not a multiple of 4 otherwise. This makes the hash dependent on the last 32 high-entropy bytes. There is no reason to make it dependent on the whole fragment. If the high 16 bits of h are all 0, I mark a fragment boundary. I also bound fragment sizes to between 4K (except at EOF) and 508K. I assume that if the SHA-1 hashes match then the fragments are identical without comparing the contents.
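
    In outline, the boundary test looks roughly like this; the multiplier constants and the order-1 table handling here are illustrative, not the exact ones in zpaq:

    Code:
    #include <stdint.h>
    #include <stddef.h>

    #define MIN_FRAG (4 * 1024)
    #define MAX_FRAG (508 * 1024)

    /* Feed one byte; returns nonzero when a fragment boundary should be marked.
     * o1[] is the 256-byte order-1 table, *prev the previous byte, *h the hash. */
    static int rolling_boundary(uint32_t *h, uint8_t o1[256], uint8_t *prev,
                                uint8_t c, size_t frag_len)
    {
        /* Odd multiplier when the order-1 prediction hits, an even multiplier
         * that is not a multiple of 4 otherwise. */
        uint32_t M = (o1[*prev] == c) ? 0x9E3779B1u : 0x9E3779B2u;
        o1[*prev] = c;
        *prev = c;
        *h = (*h + c + 1) * M;

        if (frag_len < MIN_FRAG)  return 0;           /* enforce the minimum size */
        if (frag_len >= MAX_FRAG) return 1;           /* enforce the maximum size */
        return (*h & 0xFFFF0000u) == 0;               /* high 16 bits all zero    */
    }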

    To compress, I sort by filename extension and pack the fragments into 16 MB blocks and compress the blocks in parallel. I use the previously mentioned order 1 model as a heuristic to detect already compressed data, and if so I simply store the block with no further compression. From the above study, plus looking at my own computers and reports by a few others ( http://encode.ru/threads/1626-Practi...ion-benchmarks ), it seems that the most common file types are either already compressed or are x86 (exe, dll, etc). There is typically very little text, unlike many benchmarks. Thus, I use an E8E9 filter for all compression modes. The filter searches for 5 byte patterns of the form (E8|E9 xx xx xx 00|FF) and replaces the 3 middle bytes by adding the little-endian offset within the block. The pattern occurs rarely enough in random data (p = 2^-14), and never in ASCII or UTF-8, so that I just leave it on all the time rather than try to detect the file type.
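
    A sketch of the forward E8E9 transform as described above (the exact byte arithmetic in zpaq may differ in detail):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* For each (E8|E9 xx xx xx 00|FF) pattern, add the pattern's offset within
     * the block to the 3 middle bytes, treated as a little-endian value. */
    static void e8e9_forward(uint8_t *blk, size_t n)
    {
        for (size_t i = 0; i + 4 < n; i++) {
            if ((blk[i] == 0xE8 || blk[i] == 0xE9) &&
                (blk[i + 4] == 0x00 || blk[i + 4] == 0xFF)) {
                uint32_t v = blk[i + 1] | ((uint32_t)blk[i + 2] << 8) |
                             ((uint32_t)blk[i + 3] << 16);
                v += (uint32_t)i;                    /* make the call target absolute */
                blk[i + 1] = v & 0xFF;
                blk[i + 2] = (v >> 8) & 0xFF;
                blk[i + 3] = (v >> 16) & 0xFF;
                i += 4;                              /* skip the rewritten address bytes */
            }
        }
    }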

    There are 4 compression levels (-method 1..4). The default is method 1, which is faster than zip and compresses smaller. The higher methods compress better but are slower. These are as follows:

    1. LZ77 with a 16 MB window using interleaved Elias Gamma coding of lengths and offsets. Literals are not compressed.
    2. LZ77 with byte-aligned match and literal codes, context modeled by parse state (order 1 for literals) and arithmetic coded.
    3. BWT (divsufsort) modeled with an order 0-1 ISSE chain.
    4. Context mixing with an order 0-5 ISSE chain, order 7 matcher, and order 1 mixer.

    Each element of an ISSE chain maps a progressively higher context hash order to a bit history and then to a pair of weights that mix the previous bit prediction with constant 1 in the logistic (log(p(1)/p(0))) domain. A matcher predicts whatever bit occurred in the previous occurrence of the current context. A mixer is a neural network that combines bit predictions by logistic weighted averaging and adjusts the weights to favor the best predictions. The ZPAQ documentation describes this in more detail.

    ZPAQ stores the decompression algorithm in the archive block headers in an interpreted language called ZPAQL. This allows me to update the archive format with better algorithms in the future without breaking compatibility. libzpaq translates ZPAQL into x86 32 or 64 bit code at run time as an optimization. It runs in Windows, Linux, and Mac, and was reported to work on an ARM with the x86 JIT optimization turned off.

  5. #5
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Zpaq is quite interesting. Among other things, the ability to embed the decompression logic has implications for really long-term archival storage and being able to decompress the data correctly at a future date.
    With respect to the Rabin-based hashing, I have done some statistical analysis using a 360GB mixed dataset. I have posted the results to my blog here and here. Also, in terms of dedupe ratio I have observed 1%-2% better results with my approach compared to a reference Rabin implementation, for example from this study: http://sc12.supercomputing.org/sched...hp?evid=pap173 (code: http://code.google.com/p/fs-c/)

    I have seen the Microsoft paper, but my ideas have been inspired by this paper from HP: http://static.usenix.org/event/fast0...illibridge.pdf. Their approach is based on sparse indexing. Essentially they group chunks into segments to exploit locality and do a form of stratified sampling to extract a subset of chunk hashes as representative hashes for the segment. I have given this some thought and have figured out another approach where approximate matching techniques can be used. I already have a Delta Compression implementation in pcompress that computes minhashes of chunks at 40% and 60% significance levels and uses bsdiff to encode matching chunks.

    This approach can be used to compute multiple match significance levels for a segment. As opposed to sampling, each significance level indicates a cumulative match up to that extent. In this way a 5% significance interval, for example, gives rise to 18 minhashes (up to 95%) per segment. I am thinking of using Tiger128 for the hashes since Tiger is fast to compute and a 16-byte hash key is enough for this approximate matching. So if we do the math with an 8MB segment size and 18 hashes of 16 bytes each per segment, then a 40GB index (approximately, with index metadata) is sufficient to handle up to 1 Petabyte of data. We can happily use 4KB chunks within the segments.
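
    Working that arithmetic out explicitly (a quick check using power-of-two sizes):

    Code:
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t data      = 1ULL << 50;           /* 1 PiB of data            */
        uint64_t seg_size  = 8ULL << 20;           /* 8 MiB segments           */
        uint64_t segments  = data / seg_size;      /* 134,217,728 segments     */
        uint64_t raw_index = segments * 18 * 16;   /* 18 minhashes x 16 bytes  */
        printf("segments: %llu  raw index: %.1f GB\n",
               (unsigned long long)segments, raw_index / 1e9);
        /* ~38.7 GB of raw hash entries; roughly 40 GB once index metadata is added. */
        return 0;
    }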

    The minhashes above will actually be computed on the SHA256 (or optionally SKEIN256) hashes of the chunks, not on the chunk data themselves. On x86 platforms Intel has provided probably the fastest implementation of SHA256 to date: http://download.intel.com/embedded/p...per/327457.pdf. I am using this in Pcompress and it works successfully on AMD Bulldozer and later processors as well (tested on a Bulldozer laptop). If using SKEIN, its tree hashing can be leveraged. Once a match with another segment is found we load its SHA256 table and then actually do the dedupe by comparing these very strong 32-byte hashes. Even after matching at one confidence level we can match at additional lower levels as long as the incoming segment has unmatched chunks. This approach should be very fast, requiring very few lookups and disk reads per segment of 2048 chunks (4KB chunk size). On a good server the entire index can be resident in memory. Since hashes are used to match, the entire 8MB segment data (not the SHA table) can be compressed.

    There are other data layout optimizations possible, like forward referencing, that can be done offline. It is very hard to do inline forward referencing and still remain performant. AFAIK in the industry only Sepaton is providing forward referencing technology; their dedupe solution is entirely offline.

  6. #6
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    I added a new feature today that I am calling Adaptive Delta Encoding. The algorithm scans the buffer for spans of values increasing in a fixed arithmetic progression. These values are not single bytes but rather a stride of bytes packed into a big-endian integer representation. Multiple stride lengths of 3, 5, 7 and 8 are tried to find the one that gives the maximum benefit. The algorithm also accepts a minimum span length in bytes; an arithmetic sequence shorter than this minimum is ignored.

    After this the encoder does a run length encoding of the spans. A flag, length, starting integer and delta are stored for spans that form a series. Literal byte spans have a flag and length in addition to the original bytes. This algorithm appears to benefit all compression algorithms in Pcompress, especially if used along with LZP. The overhead is of course 5 additional buffer scans. I have experimentally determined good minimum sequence lengths for use with each compression algorithm. The code is currently in the github repo in the delta2 directory.
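
    To illustrate the detection step, here is a rough sketch of finding one arithmetic run for a single stride. The actual delta2 code in the repo tries multiple strides, enforces the minimum span length and emits the flag/length/start/delta records described above; the names here are illustrative only:

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Pack 'stride' bytes starting at p into a big-endian integer. */
    static uint64_t be_val(const uint8_t *p, int stride)
    {
        uint64_t v = 0;
        for (int i = 0; i < stride; i++)
            v = (v << 8) | p[i];
        return v;
    }

    /* Length in bytes of the arithmetic run starting at buf: consecutive
     * stride-sized big-endian values separated by a constant delta. */
    static size_t run_length(const uint8_t *buf, size_t n, int stride)
    {
        if (n < (size_t)(2 * stride)) return 0;
        uint64_t prev  = be_val(buf, stride);
        uint64_t delta = be_val(buf + stride, stride) - prev;
        size_t len = 2 * (size_t)stride;
        prev += delta;
        while (len + stride <= n && be_val(buf + len, stride) - prev == delta) {
            prev += delta;
            len  += stride;
        }
        return len;
    }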

  7. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    looks like my Delta algorithm: http://freearc.org/Research.aspx

  8. #8
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Yes, it does look similar. Thanks for the interesting link.

  9. #9
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Pcompress 1.2 is released. Download: https://code.google.com/p/pcompress/downloads/list. Many bugfixes and improvements are in. I have added Keccak and Delta Encoding for embedded numeric tables. I also added some simple timing measurements to measure throughput. Output from a build with statistics enabled:

    Code:
    ## CPU: Core i5 430M
    ## RAM: 8GB
    ## Compiler: Gcc 4.7.2
    ##
    ## Compression
    ## ./pcompress -D -c lz4 -l1 -L -P -s200m silesia.tar
    Scaling to 1 thread
    Checksum computed at 241.321 MB/s
    Original size: 206612480, blknum: 46913
    Number of maxlen blocks: 0
    Total Hashtable bucket collisions: 17225
    Merge count: 46750
    Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
    Chunking speed 112.189 MB/s, Overall Dedupe speed 100.880 MB/s
    LZP: Insize: 206196371, Outsize: 192127556
    LZP: Processed at 55.839 MB/s
    DELTA2: srclen: 192127556, dstlen: 191899643
    DELTA2: header overhead: 50800
    DELTA2: Processed at 382.530 MB/s
    Chunk compression speed 207.908 MB/s
    
    ##
    ## Decompression
    ## 
    ./pcompress -d silesia.tar.pz silesia.tar.1
    Scaling to 4 threads 
    Chunk decompression speed 383.488 MB/s 
    DELTA2: Decoded at 3030.724 MB/s 
    Checksum computed at 244.235 MB/s
    
    ls -l silesia.tar.pz  
    -rw-rw-r--. 1 testuser testuser 99115899 Jan  5 21:36 silesia.tar.pz
    Another test with a Geographical Dataset from USGS (Global Topographic Elevation map data) having lots of embedded numeric data:

    Code:
    # 
    # Compression 
    # 
    ./pcompress -c lz4 -l1 -P -s100m e020n40.tar  
    Scaling to 1 thread 
    Checksum computed at 237.584 MB/s
    DELTA2: srclen: 86599680, dstlen: 43707440
    DELTA2: header overhead: 2024320
    DELTA2: Processed at 279.484 MB/s 
    Chunk compression speed 211.112 MB/s 
     
    ls -l e020n40.tar.pz  
    -rw-rw-r--. 1 testuser testuser 35360062 Jan  5 21:46 e020n40.tar.pz 
    # 
    # Decompression
    #
    ./pcompress -d e020n40.tar.pz e020n40.tar.1 
    Scaling to 4 threads 
    Chunk decompression speed 622.394 MB/s 
    DELTA2: Decoded at 1971.282 MB/s 
    Checksum computed at 246.015 MB/s
    
    
    Default checksum is Skein 256. This Delta implementation also uses RLE to collapse the detected arithmetic sequences. So it produces a smaller output than the original.
    Last edited by moinakg; 5th January 2013 at 20:04. Reason: Formatting

  10. #10
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts


    New release of Pcompress is now available: https://moinakg.wordpress.com/2013/0...-1-3-released/. This release was primarily focussed on performance along with a couple of features and bug fixes.

    I have now added BLAKE2 as the default checksum. It matches or exceeds MD5 in performance and yet is cryptographically strong. Other checksums have been parallelized when compressing an entire file in a single chunk without splitting. For the SHA2 stuff I am using Intel's optimized code that works very well on x86-64. I use xxHash internally for the deduplication hash table. I have spent some time vectorizing xxHash for a 15% performance improvement using SSE4: https://moinakg.wordpress.com/2013/0...un-and-profit/. In fact the new approach of processing data in 32-byte blocks in the xxHash core loop shows a 3% speed improvement even without any SSE optimizations.


    Segment-level deduplication performance has been improved quite a bit, with more changes upcoming in a subsequent release. The current code in Github (not in the release) can do sliding-window dedupe at 360 MB/s per core. However it remains to be seen how this fares when I eventually implement global deduplication.
    Delta Compression via bsdiff has been improved so that it no longer hurts compression algorithm effectiveness in some cases. Delta Encoding is improved as well.

    LZMA performance has been slightly tweaked. Zlib is also improved since I now use raw Zlib streams and avoid the Adler32 overhead. LZ4HC has a slight performance tweak using SSE2 optimizations. AES CTR mode is sped up using SSE2 parallel XOR. Runtime CPU capability detection is used to invoke the appropriate compiled variant.

    Some sample runs on a Core i5, 2.27 GHz are shown below:

    Libbsc + LZP + Delta Encoding on enwik9:
    Code:
    ~/pcompress $ ./pcompress -c libbsc -l14 -L -P -s1g enwik9 
    Scaling to 4 threads
    Checksum computed at 701.394 MB/s
    LZP: Insize: 1000000000, Outsize: 971356366
    LZP: Processed at 75.616 MB/s
    DELTA2: srclen: 971356366, dstlen: 971247231
    DELTA2: header overhead: 21728
    DELTA2: Processed at 391.858 MB/s
    Chunk compression speed 3.347 MB/s
    
    ~/pcompress $ ls -l enwik9.pz 
    -rw-rw-r-- 1 testuser testuser 163839474 Jan 31 17:14 enwik9.pz
    Deduplication, Delta Encoding, LZP, SHA256, LZ4 on the silesia test data:
    Code:
    ~/pcompress $ ./pcompress -D -c lz4 -l1 -L -P -S SHA256 -s200m silesia.tar 
    Scaling to 1 thread
    Checksum computed at 363.008 MB/s
    Original size: 206612480, blknum: 46913
    Number of maxlen blocks: 0
    Total Hashtable bucket collisions: 9997
    Merge count: 46750
    Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
    Chunking speed 631.876 MB/s, Overall Dedupe speed 362.743 MB/s
    LZP: Insize: 206196371, Outsize: 192127556
    LZP: Processed at 51.688 MB/s
    DELTA2: srclen: 192127556, dstlen: 191899643
    DELTA2: header overhead: 50800
    DELTA2: Processed at 388.613 MB/s
    Chunk compression speed 212.364 MB/s
    
    ~/pcompress $ ls -l silesia.tar.pz 
    -rw-rw-r-- 1 testuser testuser 99115899 Jan 31 18:07 silesia.tar.pz
    Now using Delta Compression in addition to the above (Delta Compression is slower than normal identity deduplication):
    Code:
    ~/pcompress $ ./pcompress -D -E -c lz4 -l1 -L -P -S SHA256 -s200m silesia.tar 
    Scaling to 1 thread
    Checksum computed at 362.242 MB/s
    Original size: 206612480, blknum: 46913
    Number of maxlen blocks: 0
    Total Hashtable bucket collisions: 10067
    Merge count: 46748
    Deduped size: 206193051, blknum: 244, delta_calls: 1, delta_fails: 0
    Chunking speed 266.688 MB/s, Overall Dedupe speed 202.856 MB/s
    LZP: Insize: 206192039, Outsize: 192127599
    LZP: Processed at 52.045 MB/s
    DELTA2: srclen: 192127599, dstlen: 191898878
    DELTA2: header overhead: 51056
    DELTA2: Processed at 388.370 MB/s
    Chunk compression speed 213.979 MB/s
    
    ~/pcompress $ ls -l silesia.tar.pz 
    -rw-rw-r-- 1 testuser testuser 99115607 Jan 31 18:08 silesia.tar.pz
    Only a single delta chunk was detected. Similar chunks are only delta-processed if they are at least a minimum distance apart, depending on the compression algo selected.

    LZMA2 Ultra compression, KECCAK256 on silesia along with Dedupe and other preprocessing except LZP:
    Code:
    ~/pcompress $ ./pcompress -D -c lzmaMt -l14 -P -S KECCAK256 -s200m silesia.tar 
    Scaling to 2 threads
    Checksum computed at 276.071 MB/s
    Original size: 206612480, blknum: 46913
    Number of maxlen blocks: 0
    Total Hashtable bucket collisions: 9997
    Merge count: 46750
    Deduped size: 206197375, blknum: 242, delta_calls: 0, delta_fails: 0
    Chunking speed 633.779 MB/s, Overall Dedupe speed 364.145 MB/s
    DELTA2: srclen: 206196371, dstlen: 201348940
    DELTA2: header overhead: 497504
    DELTA2: Processed at 373.984 MB/s
    Chunk compression speed 1.447 MB/s
    
    ~/pcompress $ ls -l silesia.tar.pz 
    -rw-rw-r-- 1 testuser testuser 48246418 Jan 31 17:56 silesia.tar.pz
    Overall time taken for silesia compression with full file in single chunk:
    Code:
    time ./pcompress -D -c lz4 -l1 -L -P -S SHA256 -s200m silesia.tar 2> /dev/null
    
    real    0m6.928s
    user    0m6.868s
    sys     0m0.488s
    
    ~/pcompress $ ls -l silesia.tar.pz 
    -rw-rw-r-- 1 testuser testuser 99115899 Jan 31 18:15 silesia.tar.pz
    Overall time taken for silesia compression in parallel mode with 20MB chunks (note 0.26% compression degradation but 2X speed improvement):
    Code:
    time ./pcompress -D -c lz4 -l1 -L -P -S SHA256 -s20m silesia.tar 2> /dev/null
    
    real    0m3.825s
    user    0m10.373s
    sys     0m0.756s
    
    ~/pcompress $ ls -l silesia.tar.pz 
    -rw-rw-r-- 1 testuser testuser 99377419 Jan 31 18:16 silesia.tar.pz
    Another run showing AES encryption performance (encryption is done after compression):
    Code:
    ~/pcompress $ ./pcompress -c lz4 -l1 -S BLAKE512 -e -s200m silesia.tar 
    Please enter encryption password: 
    Please enter encryption password (once more): 
    Scaling to 1 thread
    Chunk compression speed 102.466 MB/s
    Encryption speed 60.639 MB/s
    HMAC Computation speed 363.345 MB/s
    Last edited by moinakg; 31st January 2013 at 15:53. Reason: Add AES data

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    about XOR: something like

    for (i = 0; i < bufsize; i += 16) {
        *(int32_t*)(buf+i)    ^= *(int32_t*)(text+i);
        *(int32_t*)(buf+i+4)  ^= *(int32_t*)(text+i+4);
        *(int32_t*)(buf+i+8)  ^= *(int32_t*)(text+i+8);
        *(int32_t*)(buf+i+12) ^= *(int32_t*)(text+i+12);
    }

    should be enough

  12. #12
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Yes, but AES_encrypt has to be called once for every 16 bytes. It is not a simple XOR through the buffer. The pseudocode goes like this:

    Code:
    char buf[16];
    uint64_t blockctr = 0;

    for (i = 0; i < length; i += 16) {
        *((uint64_t *)buf)       = htonll(nonce);     /* high 64 bits: fixed nonce   */
        *((uint64_t *)(buf + 8)) = htonll(blockctr);  /* low 64 bits: block counter  */
        AES_Encrypt(buf);                             /* encrypt counter block in place */
        for (j = 0; j < 16; j++)                      /* ciphertext = plaintext XOR keystream */
            ciphertext[j] = plaintext[j] ^ buf[j];
        plaintext  += 16;
        ciphertext += 16;
        blockctr++;
    }

  13. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    then it doesn't need a `for` loop. btw, you can remove the htonll calls. also, i recommend using a full 128-bit nonce for increased security

  14. #14
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Actually it is a 128-bit nonce per block: the upper 64 bits are a pre-computed nonce and the remaining 64 bits are a counter to ensure per-block uniqueness.

  15. #15
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    yes, i mean the precomputed one. correct me if i'm wrong, but this "blockctr = 0" effectively means that you are using 64-bit encryption - in order to break any encrypted block with a known blockctr one needs to search only among 2^64 possible keys

  16. #16
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    There are various ways of generating a counter.
    AFAIK this concatenation of a 64-bit random value and a 64-bit incrementing integer is a common method. The Key and Nonce/IV pair must not repeat, and the counter wrap-around interval must be very long. Since the counter is a 16-byte block number, wrap-around will happen only after 2^68 bytes, which is a very large value. Check:
    http://crypto.stackexchange.com/questions/3203/counter-mode-in-advanced-encryption-standardaes-algorithm

    This is also one of the counter block techniques described in the NIST document:
    http://csrc.nist.gov/publications/nistpubs/800-38a/sp800-38a.pdf

    It generalizes the counter block into a concatenation (m bits) || (b bits) where 'm' can be a random value and b is incrementing.

  17. #17
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    is a common method
    what software uses it?

    i've added the same comment at stackexchange. NIST paper doesn't say anything about m and b. sure, for 448-bit encryption concatenation is acceptable

    do you agree that to break your encryption scheme it's enough to enumerate 2^64 16-byte blocks?

  18. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    oh, sorry, i was wrong. again forgot that we have separate key value

  19. #19
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Here is the snippet I was referring to on page 18:

    The standard incrementing function can apply either to an entire block or to a part of a block. Let m be the number of bits in the specific part of the block to be incremented; thus, m is a positive integer such that m <= b. Any string of m bits can be regarded as the binary representation of a non-negative integer x that is strictly less than 2^m. The standard incrementing function takes [x]_m and returns [x+1 mod 2^m]_m.

    For example, let the standard incrementing function apply to the five least significant bits of eight-bit blocks, so that b=8 and m=5 (unrealistically small values); let * represent each unknown
    bit in this example, and let ***11110 represent a block to be incremented. The following sequence of blocks results from four applications of the standard incrementing function: ***11111, ***00000, ***00001, ***00010.
    BTW the AES CTR implementation is not my code. I am just using the code straight from Colin Percival's Tarsnap client. If you think there is a problem, then that means the Tarsnap client is broken. Check: tarsnap-autoconf-1.0.33/lib/crypto/crypto_aesctr.c
    It is widely used and has been peer-reviewed. The only change I made was to use SSE2 intrinsics to do the XORs.

  20. #20
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    oh, sorry, i was wrong. again forgot that we have separate key value
    Ah okay, thanks. I was a little confused by your questions.

  21. #21
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    i took a look at the asm code generated for xxHash/SSE4. the operations aren't fully interleaved. it's either because the compiler is too smart or too dumb. you can try to interleave the computations yourself and check whether it improves speed. something like this, of course with the variables renamed:

    do {
        __m128i mem  = _mm_loadu_si128((__m128i *)p);
        p += 16;
        __m128i mem1 = _mm_loadu_si128((__m128i *)p);
        p += 16;

        mem  = _mm_mullo_epi32(mem,  prime2);
        mem1 = _mm_mullo_epi32(mem1, prime2);

        accum  = _mm_add_epi32(accum,  mem);
        accum1 = _mm_add_epi32(accum1, mem1);

        accum  = _x_mm_rotl_epi32(accum,  13);
        accum1 = _x_mm_rotl_epi32(accum1, 13);

        accum  = _mm_mullo_epi32(accum,  prime1);
        accum1 = _mm_mullo_epi32(accum1, prime1);
    } while (p <= limit);


    Last edited by Bulat Ziganshin; 14th February 2013 at 14:39.

  22. #22
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    I tried your suggestion but it did not make any difference, at least not on my CPU. I checked the generated binaries and the interleaving is not exactly 100% in either case. The compiler will optimize intrinsics in its own way. To have this fine-grained control, inline assembly would be needed.

    However I doubt that 100% interleaving would yield better results. Different instructions have different latencies and throughputs, which also vary from CPU model to CPU model. I think the compiler does some general weighting based on the selected target and interleaves accordingly. A higher-latency instruction will potentially be bunched with two lower-latency independent ones, if possible. After that, the instruction pipeline also does its own thing.

  23. #23
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    So I figured out that calling AES_encrypt() in OpenSSL just uses the standard x86_64 assembly implementation. The latest OpenSSL has 2 additional variants that are used based on cpuid, but only if the EVP* routines are used. However, I cannot assume the latest OpenSSL everywhere, and a bug was filed earlier requesting support for older OpenSSL versions. In addition, some OS environments appear to be too smart and use linker mapfiles to export only a predefined set of OpenSSL functions from the crypto library. Ubuntu, for example, does not expose the vpaes_encrypt() and aesni_encrypt() functions in the shared library's exports section. So I pulled the VPAES and AES-NI implementations into pcompress, using the cpuid detection I already have.

    VPAES works on SSSE3 CPUs: http://crypto.stanford.edu/vpaes/
    With VPAES I am now getting an encryption throughput of 170 MB/s. I do not have a Sandy Bridge box, so I can't test the AES-NI part. However, I've asked a friend with an Ivy Bridge motherboard to try it out.
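
    For anyone curious, the runtime selection looks roughly like the sketch below, here written with GCC's CPU-feature builtins. Pcompress has its own cpuid detection, and the encrypt-function names are placeholders standing in for the AES-NI, VPAES and portable code paths:

    Code:
    #include <stdint.h>

    /* Placeholder single-block encrypt routines (not the real OpenSSL symbols). */
    void my_aes_ref_encrypt(const uint8_t *in, uint8_t *out, const void *key);
    void my_vpaes_encrypt(const uint8_t *in, uint8_t *out, const void *key);   /* SSSE3  */
    void my_aesni_encrypt(const uint8_t *in, uint8_t *out, const void *key);   /* AES-NI */

    typedef void (*aes_block_fn)(const uint8_t *, uint8_t *, const void *);

    static aes_block_fn select_aes_impl(void)
    {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("aes"))
            return my_aesni_encrypt;      /* hardware AES instructions        */
        if (__builtin_cpu_supports("ssse3"))
            return my_vpaes_encrypt;      /* vector-permute (VPAES) code path */
        return my_aes_ref_encrypt;        /* portable fallback                */
    }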

  24. #24
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    New release (1.4) of Pcompress is now available from here: http://code.google.com/p/pcompress/downloads/list

    This adds the following capabilities:


    1. AES is better optimized using VPAES and AES-NI with CPU auto-detection. Since I had to support linking with older versions of OpenSSL, I have copied the relevant implementation files into the Pcompress repo.
    2. Added XSalsa20 from Dan Bernstein's NaCl library (SSE2 and reference code). It is extremely fast and has excellent security properties, as far as I could glean from reading various articles and posts.
    3. Deduplication performance has been improved by 95% by optimizing the Rabin sliding-window chunking algorithm and doubling the chunk hash-table size.
    4. From version 1.3 onwards, Delta Compression now actually benefits the overall compression ratio by using a TOO-NEAR metric: it will only delta-compress similar chunks if they are at least TOO-NEAR bytes away from each other. This value depends on the compression algorithm's average dictionary size.
    5. All hashes have been parallelized better by using Merkle-tree-style hashing. This happens only in solid mode, when the entire file is being compressed in a single chunk.
    6. Encryption key length is no longer a compile-time constant. It can be selected at runtime.
    7. The header HMAC now includes the Nonce, Salt and Key length properties as well. This was an oversight in the earlier release.
    8. Better cleanup of temporary variables on the stack in various crypto functions.

  25. #25
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,483
    Thanks
    719
    Thanked 653 Times in 349 Posts
    you may be interested in looking into the Botan crypto library, which contains several asm/sse optimized hashes and ciphers

  26. #26
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    Thanks for the info. It is indeed quite an interesting library. I also just noticed that the AES-SSSE3 implementation in Botan is the same VPAES version that I am using, but converted into intrinsic form for easier maintenance. From a SHA2 performance point of view, no one can beat Intel's code on x86 as yet (AFAIK).

  27. #27
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    I made two updates to Pcompress in short succession. One introduced new features like Global Deduplication, and the second one is mainly performance tweaks and bug fixes.
    Version 2.1: http://unix.freecode.com/projects/pc...eleases/354552
    Version 2.0: http://unix.freecode.com/projects/pc...eleases/354208

    Global Deduplication adds the capability to detect duplicate blocks across the entire dataset, as opposed to the earlier capability which detected duplicates only within a segment. It is of course slower, since it requires coordinated access to a single index. Global Deduplication uses one of two algorithms depending on the size of the dataset and the available RAM. If enough memory is available, a simple index is used that stores 32-byte or 64-byte cryptographic checksums for each block. This is a full block index. With a 4KB average block size and 32-byte checksums, the index will require a little more than 8MB of RAM per GB of unique data.

    If the required index size is greater than the available memory (up to 75% of the free RAM can be used for the index), then a Segmented Dedupe mode is used. In this case the data is split into blocks and the blocks are collected into segments of 8MB average size, so each segment gets about 2048 blocks of 4KB each. The user-specified compression segment size is rounded up to a multiple of 8MB. As usual, a 32-byte cryptographic checksum is computed for each block. All the checksums for an 8MB segment are then aggregated and 25 similarity indicators are computed for the segment (based on min-values sampling). Each similarity indicator is a 64-bit integer value and is stored in the index, and the segment's list of block hashes is appended to a metadata file. If at least one similarity indicator for an incoming segment matches an entry in the index, then the matching segment is considered approximately similar and its block checksum table is loaded from the metafile. The block checksums of the two segments are then compared to do the exact deduplication. I have tested this approach with different kinds of data and found it to provide approximately 95% of the deduplication ratio of a full block index. This approach requires one disk read, and potentially one write, for every 2048 blocks. Since the metafile is append-only, its I/O pattern is SSD-friendly, so locating it on an SSD should provide a big speed boost.
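
    As an illustration of the min-values sampling step, one way to derive the 25 similarity indicators from a segment's block checksums is to keep the 25 smallest 64-bit keys taken from the block hashes. This is a sketch under that interpretation; the exact sampling in pcompress may differ:

    Code:
    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    #define K_INDICATORS 25

    /* block_hashes: per-block cryptographic checksums (>= 8 bytes each).
     * out[] receives the K smallest 64-bit keys, sorted ascending; these are
     * the segment's similarity indicators, to be looked up in the index. */
    static size_t similarity_indicators(const uint8_t *const *block_hashes,
                                        size_t nblocks, uint64_t out[K_INDICATORS])
    {
        size_t filled = 0;
        for (size_t i = 0; i < nblocks; i++) {
            uint64_t key;
            memcpy(&key, block_hashes[i], sizeof(key));   /* first 8 hash bytes */

            size_t pos = filled;
            while (pos > 0 && out[pos - 1] > key) pos--;  /* insertion point   */
            if (pos >= K_INDICATORS) continue;            /* larger than all kept */

            size_t end = (filled < K_INDICATORS) ? filled : K_INDICATORS - 1;
            memmove(&out[pos + 1], &out[pos], (end - pos) * sizeof(uint64_t));
            out[pos] = key;
            if (filled < K_INDICATORS) filled++;
        }
        return filled;    /* number of indicators actually produced */
    }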

    This approach requires a relatively tiny index. In the worst case of 100% unique data it will require slightly more than a 25GB index for 1 Petabyte of data, so the index can be resident in RAM. However, if memory is still too small for the index, then the oldest entries are re-used per hashtable slot as memory fills up. The default block checksum is Intel's optimized SHA512-256, with the option of using BLAKE2, SKEIN or SHA3.

    I have also optimized the sliding-window block splitting to an extent that brings it close to fixed-block splitting performance. The key is to avoid scanning most of the data and to start scanning only from the minimum allowed block size within each chunk. In addition, I keep the sliding window bytes in a single SSE register.
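
    A rough sketch of that skip-ahead idea (the rolling-hash helpers here are placeholders, not the actual pcompress routines):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    #define WIN 16   /* sliding window width in bytes (placeholder value) */

    /* Placeholder rolling-hash helpers. */
    uint64_t roll_init(const uint8_t *win);                      /* hash WIN bytes    */
    uint64_t roll_update(uint64_t h, uint8_t out, uint8_t in);   /* slide by one byte */
    int      is_breakpoint(uint64_t h);                          /* hash-mask test    */

    /* Return the length of the next chunk starting at buf. Nothing before
     * min_block is ever hashed, which is where most of the speedup comes from. */
    static size_t next_cut(const uint8_t *buf, size_t n,
                           size_t min_block, size_t max_block)
    {
        if (n <= min_block) return n;

        size_t i = min_block;                        /* skip the guaranteed-minimum region */
        uint64_t h = roll_init(buf + i - WIN);       /* window covers buf[i-WIN .. i-1]    */
        if (is_breakpoint(h)) return i;

        while (i < n && i < max_block) {
            h = roll_update(h, buf[i - WIN], buf[i]);
            i++;
            if (is_breakpoint(h)) return i;
        }
        return i;                                    /* hit max size or end of buffer */
    }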

    Pcompress supports various streaming modes so this dedupe and compression can be used to optimize network file transfer with help from utilities like Ncat.
    Last edited by moinakg; 14th May 2013 at 20:40. Reason: Fix typo

  28. The Following 2 Users Say Thank You to moinakg For This Useful Post:

    avitar (17th August 2013),Bulat Ziganshin (14th May 2013)

  29. #28
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    I did some testing sometime back, comparing Pcompress with several other utilities including Lrzip and eXdupe. The results and analysis are here:
    https://moinakg.wordpress.com/2013/06/01/updated-compression-benchmarks-3/

    My next work is to look at multi-file deduplication for an archival store that can be used for backups for example. It will be using a persistent global index with chunk references across files.

  30. #29
    Member
    Join Date
    Jul 2013
    Location
    Stanford, California
    Posts
    24
    Thanks
    7
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by moinakg View Post
    I did some testing sometime back, comparing Pcompress with several other utilities including Lrzip and eXdupe. The results and analysis are here:
    https://moinakg.wordpress.com/2013/06/01/updated-compression-benchmarks-3/

    My next work is to look at multi-file deduplication for an archival store that can be used for backups for example. It will be using a persistent global index with chunk references across files.
    Hi! Thanks for your pcompress contribution. I have some questions/ideas and thought I'd share here. Your feedback would be most appreciated.

    Is the chunk size specified by -s applied to both the underlying compression algorithm and the de-duplication layers? I'm wondering how this parameter best interoperates with -B, and what relative properties are important for those two values. Is -B used as a kind of heuristic or starting point for a variable block in deduplication? What are the tradeoffs for this one and when should the various settings be chosen? And for the -r option, when is that helpful to include? Is it expected to always give best ratio in the end? Is there an advantage not to using it?

    I'm wondering also if you could elaborate on -G and -EE, including their differences and advantages. Does -G never invoke a delta comparison step like the bsdiff variant you've included? When should -EE or -G be used? Does use of -E or -EE ever make a post-pcompress compression pass (with another compressor, for example) less efficient, because even a non-bz2'd bsdiff output makes it harder for external compressors' pattern matching algorithms to work effectively? And would -E, or its absence altogether, ever be better than -EE, or is a check made to see whether there would be an end savings benefit (as long as the final-pass compressor is being invoked within pcompress)? It seems that your local bsdiff fork avoids a bzip2 pass so that the resultant data can be compressed better by another compressor afterwards, but are there possible alignment issues that could come up if the "uncompressed" bsdiff output pushed common patterns to different offsets? How much would be needed to integrate Courgette into the delta differencing phase? Also, given an asymmetry assumption (more time taken to compress) and a good measure of similarity, could a tree structure be formed that relates the most similar blocks to one another (leaf nodes), up the tree to the threshold point of similarity as a starting point (a root node)? This would work well with a multidimensional measure of similarity; a kind of Hamming distance could be used as a metric of overall proximity or similarity. Maybe a few combinations could be tried, but I'm interested to know when that kind of tree-based structure may be helpful.

    What about having a resumable stream with the ability to flush output and to flush intermediate caches and data structures to disk? It's cool that pcompress can be used in a stream, but if a similar stream is to be sent or resumed in the future, it would be nice to keep track of the state to allow subsequent data with similar characteristics to be sent more efficiently. Maybe it could be like a big sliding window or cache to allow for a kind of incremental refinement. And this caching idea would be useful even if it represents an imperfect but best-guess representation of the data (still good enough to be a starting point). One example I can think of is a series of incremental "zfs send" streams for the same data set. As I understand it, even if the local ZFS data blocks are compressed, zfs send will still provide an uncompressed stream, allowing for de-duplication or compression later in the pipeline. Also, the use case is not too dissimilar to implementing a versioning or incremental archiver or backup program, but in the latter case some formal metadata describing the formerly archived versions might be more accessible (rather than a sliding window or dictionary of common patterns or blocks). I'm interested to hear more about your ideas around the persistent global index for an archival store.

    Along those lines, I'm wondering what might be some acceptable preprocessors to include with pcompress, either as plugins or as processors that are independently applied to the data before or after pcompress. Would any be best to run before versus after, and vice versa? It seems possible that the LZP phase may affect the ability of another compressor to compress well, but is Delta usually safer to try? What about those used in FreeARC (including srep)? Finally, I'm wondering to what extent a subset of the pcompressed data can be selectively extracted (needle in a haystack at a known fixed offset within the original stream)? Ideally the overhead of extracting the entire set could be avoided, but this requirement to be able to efficiently extract a subset might place limitations on which filters or combinations would be set. I also believe that putting common information across a large data set in an initial place could allow for more parallelism in extraction without sacrificing compression ratio by cutting the window size too small. The other side of this has to do with providing a whole-input perspective on the data when subsets of it are processed in parallel. So, the compression phase could scan the data in parallel to arrive at global patterns or common blocks, synchronise this information amongst threads, and then this master dictionary or block set can be a basis for local modifications. Let me know what you think. I look forward to further pcompress developments.

  31. #30
    Member
    Join Date
    Nov 2012
    Location
    Bangalore
    Posts
    114
    Thanks
    9
    Thanked 37 Times in 22 Posts
    That is a lot of questions in one post! Anyway, below are the clarifications:
    • -s : Since Pcompress is multithreaded data needs to be split between threads in large buffers. Each buffer is then independently chunked and compressed. This parameter indicates how large the buffer size should be. Total memory requirement for buffers can be calculated as: buffer_size + buffer_size X 2
    • -B : This indicates the average chunk size for deduplication: 1 = 4KB, 2 = 8KB and so on up to 5 = 64KB. For '-F', which is fixed-block deduplication, this is the exact chunk size. For variable-block dedupe this is an indicative average; the real chunk sizes will vary but the average will be close to the indicated value. I have blogged in detail on the chunking approach here: https://moinakg.wordpress.com/2013/0...ined-chunking/
    • -r : This should typically not be used. For variable-block deduplication Pcompress uses a heuristic to partition buffers per thread in order to minimize undersized trailing chunks at the buffer end. This option disables that heuristic. I made a mistake including this confusing option.
    • -G,-E,-EE : Currently, Global Deduplication does not provide the capability of Delta Compression. I plan to implement this in the future; it would require re-reading old data to reconstruct chunks that were delta-compressed. No special checks are made during the delta phase, and yes, it does impact the end compression ratio of standard compression algorithms due to alignment and the other issues you mentioned. It is possible to refine the similarity algorithm by using Jaccard/Hamming distance measures and to create a tree structure that combines the most similar chunks. In addition, I am using a min-heap to create a count-min sketch to avoid sorting; this is not terribly accurate. Courgette could be used, but it is tuned for executables and uses disassembly techniques, so it does not seem useful for generic data. Maybe it can be detected whether the data is potentially executable and Courgette applied only in that case.
    • Each buffer in Pcompress is independently compressed and does not use pattern indexes of other buffers. Each buffer is written out with its own header, so the file format is such that each buffer can be independently decompressed as well. If I understand correctly, you are asking for a persistent global index for the deduplication case. I am currently working on this feature since I want to develop Pcompress into a deduplicated archival store. So there will be an on-disk archive format with data and metadata containers and a persistent similarity-based global index. The similarity-based index can be held entirely in memory and avoids a full index of cryptographic chunk hashes, so it is a near-exact deduplication approach providing up to 98% of exact deduplication efficiency. See: https://moinakg.wordpress.com/2013/0...-store-part-1/ and https://moinakg.wordpress.com/2013/0...-store-part-2/
    • This in-memory similarity index can potentially scale to handle a few petabytes of data in a single node.
    • Both LZP and Delta2 (-P) usually benefit. Only with the LZMA extreme compression levels (-l 13, -l 14) do they tend to have a marginal negative impact. See some details here: http://moinakg.wordpress.com/2013/05...hmarks-part-2/
    • For other preprocessing scenarios, things like Lrzip and Srep should work well since these are technically deduplication but with very short chunks. Since the chunking approach in Pcompress can produce very short chunks on the order of 1KB, I am thinking of including an option to do deduplication with short chunks to get an effect similar to rzip.
    • As mentioned before, Pcompress is parallelized and individual buffers do not share pattern information, except when using Global Deduplication. Since each entry in the compressed file is independent, it is easy to create an index on the fly and provide random access to the compressed elements. With Global Deduplication this is tricky at present, since the global index is not stored anywhere. Global Deduplication synchronizes among threads to access an in-memory index for chunk lookup. Decompression expects previous buffers to have been decompressed already, and accesses that data to rebuild deduplicated chunks later in the stream. It should be possible to have indexed access to the compressed file if the global index is persisted to disk. You can read more about the lock-free index synchronization and parallelism here: https://moinakg.wordpress.com/2013/03/26/coordinated-parallelism-using-semaphores/

  32. The Following 2 Users Say Thank You to moinakg For This Useful Post:

    avitar (17th August 2013),Intensity (17th August 2013)


