
Thread: A Microsoft study on deduplication

  1. #1
    Member m^2's Avatar

    A Microsoft study on deduplication

    Thought you might find this interesting.
    It also contains statistics on which file types consume the most space on Windows workstations and servers.

    http://www.usenix.org/events/fast11/...pers/Meyer.pdf

  2. #2
    Expert
    Matt Mahoney's Avatar
    Very interesting. Some conclusions:

    1. Very little of what is on the average disk is text and images, unlike most benchmarks.
    2. Lots of x86 (exe, dll, lib), database (pdb), virtual disk (vhd, iso), and compressed data (cab, pdf, msi, chm, wma, mp3). But the biggest category is "other".
    3. The average disk is a constant 40% full in spite of capacity doubling each year.
    4. Whole file deduplication saves 22%. 8KB Rabin segmentation saves another 9%.
    5. Deduplication across file systems saves another 3% for each doubling of the number of disks.
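To make the difference between points 4 and 5 concrete, here is a rough sketch of the two measurements: whole-file deduplication versus content-defined chunking. It is not the code used in the study. The rolling hash is a plain polynomial hash standing in for a true Rabin fingerprint, and the names (`whole_file_savings`, `chunks`, `chunk_savings`) and parameters (48-byte window, 13-bit mask for roughly 8KB chunks, 2KB minimum, 64KB maximum) are assumptions chosen for illustration.

```python
import hashlib

BASE = 257
MOD = (1 << 31) - 1  # prime modulus; a stand-in for a real Rabin polynomial


def whole_file_savings(files):
    """Fraction of bytes saved by storing each distinct file only once."""
    total = sum(len(f) for f in files)
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return 1 - sum(unique.values()) / total if total else 0.0


def chunks(data, window=48, mask=0x1FFF, min_size=2048, max_size=65536):
    """Split data where a rolling hash of the last `window` bytes hits a
    fixed pattern, so boundaries follow content rather than offsets."""
    out, start, h = [], 0, 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        pos = i - start  # offset of this byte inside the current chunk
        if pos >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD  # add the newest byte
        size = pos + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])  # trailing partial chunk
    return out


def chunk_savings(files):
    """Fraction of bytes saved by storing each distinct chunk only once."""
    total = sum(len(f) for f in files)
    unique = {}
    for f in files:
        for c in chunks(f):
            unique[hashlib.sha256(c).digest()] = len(c)
    return 1 - sum(unique.values()) / total if total else 0.0
```

Two files that differ only by a few bytes inserted at the front share nothing at the whole-file level, but content-defined boundaries usually realign after the insertion, so most chunks still match. Fixed-offset blocks would not have that property.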

    This is different from the types of extremely large data sets I've come across at Ocarina/Dell, such as:

    - Seismic data from thousands of sensors by oil companies.
    - DNA sequencer raw data (huge TIFF images of sensor arrays) and intermediate analysis.
    - Medical images (CAT scans, MRI).
    - High resolution video from TV and movie studios.
    - High resolution animation from movie studios.
    - Satellite imagery.
    - Huge image databases from photo sharing websites.

    These all require specialized compressors, often one of a kind. In some cases, lossy compression is possible in theory but only by working closely with those doing the analysis, who may not know which data is safe to discard, so usually we do lossless. The models are often simple but have to be fast because we have petabytes of data. This contrasts with a typical PC, which requires complicated delayering of nested formats, like JPEG inside of PDF inside of ZIP inside of ISO.
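As a minimal illustration of what delayering involves, the sketch below identifies a few container formats by their magic bytes and recurses into ZIP archives using Python's standard `zipfile` module. The function names `sniff` and `layers` are hypothetical, and only the ZIP layer is actually unpacked here. A real delayering tool would also need parsers for PDF streams and ISO file systems, which this sketch only detects.

```python
import io
import zipfile


def sniff(data):
    """Guess a format from well-known signature bytes."""
    if data[:4] == b"PK\x03\x04":
        return "zip"
    if data[:5] == b"%PDF-":
        return "pdf"
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[0x8001:0x8006] == b"CD001":  # ISO 9660 volume descriptor
        return "iso"
    return "other"


def layers(data, path=""):
    """Return (path, format) pairs, descending into nested ZIP archives."""
    kind = sniff(data)
    found = [(path or "/", kind)]
    if kind == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                if name.endswith("/"):
                    continue  # skip directory entries
                found += layers(zf.read(name), path + "/" + name)
    return found
```

Each layer that is found would then be handed to a compressor suited to its format, which is the part that makes the typical PC case complicated.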
