
Thread: Dedup collisions in obnam

  1. #1
    Matt Mahoney (Expert)

    Dedup collisions in obnam

    After a little investigation, I discover that obnam (versioning incremental backup utility for Linux) uses md5 to test for identical files. Of course md5 is insecure. You can download a collision generator from http://www.win.tue.nl/hashclash/

    As a test, I run the program and in about 1 second it generates 2 files, each 128 bytes. md5sum gives the same hash, but cmp and diff say they are different. I make a backup with obnam and restore. The restored files are identical to each other and to the second of the original files.

    This also works if you append a copy of an arbitrary file to each of the two generated files. The md5 hashes will still be identical.
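
    For anyone who wants to reproduce the check, a minimal Python sketch; the names msg1.bin, msg2.bin and suffix.bin are just placeholders for the generator's two colliding outputs and an arbitrary file, not the tool's actual file names:

    Code:
    import hashlib

    a = open("msg1.bin", "rb").read()   # first 128-byte colliding message
    b = open("msg2.bin", "rb").read()   # second 128-byte colliding message

    assert a != b                                                    # cmp/diff disagree
    assert hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest()  # md5sum agrees

    # MD5 is a Merkle-Damgard construction and both messages have the same
    # length, so appending any identical suffix preserves the collision.
    suffix = open("suffix.bin", "rb").read()
    assert hashlib.md5(a + suffix).hexdigest() == hashlib.md5(b + suffix).hexdigest()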

    There are also programs for generating collisions with a chosen prefix, and for generating X.509 certificates with different identities by exploiting MD5 collisions.

  2. #2
    msmaniac (Member)
    If I remember correctly, I tested obnam a while back and its default mode is to NOT verify blocks with the same hash, for performance reasons.

    With --deduplicate=verify it does a byte comparison for every deduplicated block, but performance suffers. As it stands, the default mode is 'fatalist', i.e. no collision checks.
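
    To illustrate the difference between the two modes, a toy dedup store in Python (this is not obnam's code, just the general idea):

    Code:
    import hashlib

    class DedupStore:
        def __init__(self, verify):
            self.verify = verify      # True = 'verify', False = 'fatalist'
            self.blocks = {}          # md5 digest -> stored block

        def put(self, block):
            key = hashlib.md5(block).hexdigest()
            old = self.blocks.get(key)
            if old is None:
                self.blocks[key] = block
            elif self.verify and old != block:
                # real collision caught; a real tool would store the new
                # block separately instead of deduplicating it
                raise ValueError("md5 collision: blocks differ")
            # in 'fatalist' mode a colliding block is silently deduplicated,
            # which is how the restore in post #1 produced identical files
            return key

    The byte comparison in 'verify' mode is what costs the extra I/O and time.
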
    Last edited by msmaniac; 22nd March 2014 at 22:42.

  3. #3
    Matt Mahoney (Expert)
    True, it has that option. It wouldn't be necessary if they used a secure hash. Of course I guess MD5 was secure when it was written.

    I'm a bit nervous about using SHA-1 in zpaq. It wouldn't be hard to change the hash, but it would break archive compatibility.

  4. #4
    msmaniac (Member)
    I think obnam is OK for aggregating file-level backups from many networked sources as long as each source is relatively small (like clients' home folders), mostly because of the dedup block verification speed issues. Also, its dedup uses a fixed chunk size, which limits its overall effectiveness on changing files.

    To put hashes and dedup in perspective:
    - Opendedup uses the non-cryptographic 128-bit MurmurHash3 for both its fixed and variable blocks. It allows changing the hash algorithm only when fixed blocks are used. I could not find which other hashes it supports or whether it verifies deduplicated blocks.
    - Lessfs is a filesystem with fixed blocks only; it uses the 192-bit Tiger hash and optionally SHA-256. No verification on dedup.
    - ZFS uses SHA-256, with an option to enable verification on dedup. Fixed blocks only.
    - Windows Server 2012 NTFS offline dedup: I could not find which hash it uses, but from the stated 2 TB/day I assume it does block verification on top of a low-memory (350 MB) hash lookup. It uses Rabin fingerprinting for sub-file variable-size blocking (see the chunking sketch after this list).
    - EMC VNX storage uses either SHA-1 or byte-level block comparison for dedup. Fixed blocks only.
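
    A toy content-defined chunker showing why variable (Rabin-style) blocking handles changing files better than fixed blocks; this uses a crude rolling hash, not a real Rabin fingerprint:

    Code:
    def cdc_chunks(data, mask=0x1FFF, min_size=2048, max_size=65536):
        # Content-defined chunking: cut wherever the low bits of a crude
        # rolling hash are zero, so boundaries follow the content and an
        # insertion early in a file only disturbs nearby chunks.
        # (A real implementation would use a proper Rabin fingerprint.)
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF
            length = i - start + 1
            if length >= max_size or (length >= min_size and (h & mask) == 0):
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    With fixed blocks, inserting one byte near the start of a file shifts every later block and kills dedup; with content-defined boundaries only the chunks around the edit change.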

    Nevertheless, for zpaq I am slightly nervous too, but not overly worried. For now I schedule a compare against the sources every month, on Sunday after the backup, to verify that everything is relatively OK.

    Perhaps for the next zpaq version (regardless of backwards compatibility):
    - flexible hashing algorithm selection sha1/sha256/512/vmac....
    - general allowances to the archive format for future extensions by breaking only forward compatibility (feature IDs).
    - use compressed block hashes
    - persistent archive creation options are stored in the header and reused on subsequent operations (hash type, fragment size, etc).
    - per block, store the compression method used, so that block-level operations like purge can optionally re-compress.
    - advanced file update decisions ala 7-zip's -u option with one or two common presets.
    - extend the compressibility data analysis part in order to set "delta" flags to generate appropriate ZPAQL PCOMP delta pre-filters per block depending on method (a toy sketch of the delta idea follows this list).
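
    On that last item, the delta idea in its simplest form, written as plain Python rather than ZPAQL, just to show what such a pre-filter does:

    Code:
    def delta_encode(data, stride=1):
        # Replace each byte with its difference (mod 256) from the byte
        # 'stride' positions earlier; for audio/image data the deltas
        # cluster near zero and compress better.
        out = bytearray(data)
        for i in range(len(data) - 1, stride - 1, -1):
            out[i] = (data[i] - data[i - stride]) & 0xFF
        return bytes(out)

    def delta_decode(data, stride=1):
        out = bytearray(data)
        for i in range(stride, len(data)):
            out[i] = (out[i] + out[i - stride]) & 0xFF
        return bytes(out)

    assert delta_decode(delta_encode(b"ABCDABCE", 4), 4) == b"ABCDABCE"

    The compressibility analysis would pick the stride (e.g. bytes per sample times channels for WAV, bytes per pixel for BMP) and set the flag per block.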

  5. #5
    Bulat Ziganshin (Programmer)
    >vmac

    useless for hashes stored in the archive

    >- use compressed block hashes

    What do you mean?

  6. #6
    msmaniac (Member)
    >vmac >> useless for hashes stored in the archive
    Ok, my mistake.
    >- use compressed block hashes
    Right now zpaq stores, with each compressed segment/block (a collection of fragments), a SHA-1 of the uncompressed data. To check a segment/block's validity you have to decompress it first, and then either check the decompressed data against that hash and/or check the individual fragment hashes contained within.
    A SHA-1 of the actual compressed bytes of the segment/block would allow fast checks without decompression (sketch below).
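
    What I mean, as a sketch (zlib stands in for the real codec; this is not zpaq's actual block format):

    Code:
    import hashlib, zlib

    def pack_block(raw):
        comp = zlib.compress(raw)
        return {"data": comp,
                "sha1_uncompressed": hashlib.sha1(raw).hexdigest(),
                "sha1_compressed": hashlib.sha1(comp).hexdigest()}

    def quick_check(block):
        # fast integrity pass: no decompression needed
        return hashlib.sha1(block["data"]).hexdigest() == block["sha1_compressed"]

    def full_check(block):
        # slow pass: decompress and verify the original data too
        raw = zlib.decompress(block["data"])
        return hashlib.sha1(raw).hexdigest() == block["sha1_uncompressed"]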

    I just realized that the zpaq compare command does not actually decompress the data to perform the comparison; it only compares the hashes. I may have to change strategy and do monthly extracts, then byte-compare against the actual sources (only files with the same modification date) or against the snapshots the backup was based on.

    EDIT: Also, at the scale zpaq is used (a few or at most tens of TB in controlled environments), I would think the problem of colliding fragment SHA-1 hashes will almost certainly never present itself. The monthly comparisons above are just for my own peace of mind.

    EDIT2: It is also argued that a SHA-1 collision is less likely than an undetected hardware CRC/ECC failure on a disk block (rough numbers below).
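
    Rough birthday-bound arithmetic behind that feeling; 10 TB of data and 64 KiB average fragments are just assumed figures, and this only bounds accidental collisions, not deliberate attacks:

    Code:
    # probability that any two of n random 160-bit hashes collide: ~ n*(n-1)/2 / 2^160
    n = (10 * 2**40) // (64 * 2**10)   # ~1.7e8 fragments for 10 TB at 64 KiB
    p = n * (n - 1) / 2 / 2**160
    print(n, p)                        # p is on the order of 1e-32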

    Also adding to the previous list:
    Symantec backup deduplication uses MD5 for dedup segments in conjunction with a fingerprinting algorithm to prevent collisions.
    Last edited by msmaniac; 23rd March 2014 at 16:56. Reason: added EDITs

  7. #7
    Matt Mahoney (Expert)
    Interesting about the different dedup strategies.

    > - flexible hashing algorithm selection sha1/sha256/512/vmac....

    I am considering sha256. I already have the code and it seems about as fast as sha1. blake2 would probably be faster on 64-bit but slower on 32-bit.

    - general allowances to the archive format for future extensions by breaking only forward compatibility (feature IDs).

    There are some hooks to allow for future extensions to the compression format like alternatives to ZPAQL.

    - use compressed block hashes

    The hashes and fragment sizes (h blocks) can't be compressed (except the leading bits of the size) so they are stored uncompressed.

    The test command decompresses, verifies the fragment hashes, and discards the output. The purpose is mainly to test for bugs in the decompression algorithm and measure decompression speed without the extra disk writing. The test -all command also verifies the data block hashes (also of the uncompressed data). Normally, extract works like test and verifies only the fragment hashes and only decompresses up to the last fragment needed. You can use the -fragile option to skip the hash verification and extract a little faster.

    - persistent archive creation options are stored in the header and reused on subsequent operations (hash type, fragment size, etc).

    I will have to consider methods that break forward compatibility.

    - per block, store the compression method used, so that block-level operations like purge can optionally re-compress.

    It already stores the decompression code in the block header.

    - advanced file update decisions ala 7-zip's -u option with one or two common presets.

    I'm not sure which of the other -u options are useful. With zpaq you can use -force and it will add files even if the date is unchanged. Of course if it is really identical then it will deduplicate.

    I wonder if it would be worthwhile to use the Windows archive bit as originally intended. Add files if the bit is set and then clear it.
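
    A sketch of how the archive bit could be read and cleared from Python via the Win32 API (GetFileAttributesW / SetFileAttributesW); Windows only, and just an illustration of the idea, not a planned zpaq feature:

    Code:
    import ctypes

    FILE_ATTRIBUTE_ARCHIVE = 0x20
    kernel32 = ctypes.windll.kernel32
    kernel32.GetFileAttributesW.restype = ctypes.c_uint32

    def needs_backup(path):
        # the archive bit is set by Windows whenever the file is modified
        attrs = kernel32.GetFileAttributesW(path)
        if attrs == 0xFFFFFFFF:
            raise OSError("cannot read attributes of " + path)
        return bool(attrs & FILE_ATTRIBUTE_ARCHIVE)

    def clear_archive_bit(path):
        # clear the bit after the file has been added to the archive
        attrs = kernel32.GetFileAttributesW(path)
        if not kernel32.SetFileAttributesW(path, attrs & ~FILE_ATTRIBUTE_ARCHIVE):
            raise OSError("cannot update attributes of " + path)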

    - extend the compressibility data analysis part in order to set "delta" flags to generate appropriate ZPAQL PCOMP delta pre-filters per block depending on method.

    I plan to look into this for better compression of bmp, tiff, wav, etc.
