A Microsoft study on deduplication
Thought you might find it interesting.
It also contains statistics on which file types consume the most space on Windows workstations and servers.
Very interesting. Some conclusions:
1. Very little of what is on the average disk is text or images, unlike what most benchmarks assume.
2. Lots of x86 (exe, dll, lib), database (pdb), virtual disk (vhd, iso), and compressed data (cab, pdf, msi, chm, wma, mp3). But the biggest category is "other".
3. The average disk is a constant 40% full in spite of capacity doubling each year.
4. Whole-file deduplication saves 22%. 8KB Rabin segmentation saves another 9%.
5. Deduplication across file systems saves another 3% for each doubling of the number of disks.
This is different from the extremely large data sets I've come across at Ocarina/Dell, like:
- Seismic data from thousands of sensors, collected by oil companies.
- DNA sequencer raw data (huge TIFF images of sensor arrays) and intermediate analysis.
- Medical images (CAT scans, MRI).
- High resolution video from TV and movie studios.
- High resolution animation from movie studios.
- Satellite imagery.
- Huge image databases from photo sharing websites.
These all require specialized compressors, often one of a kind. In some cases, lossy compression is possible in theory, but only by working closely with the people doing the analysis, who may not know which data is safe to discard, so usually we do lossless. Models are often simple but must run at high speed because we have petabytes. This contrasts with a typical PC, which requires complicated delayering, like a JPEG inside a PDF inside a ZIP inside an ISO.
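To make the delayering idea concrete, here is a hypothetical sketch that identifies container formats by their magic bytes and recurses into them, so the innermost payloads (say, a JPEG) can be handed to a format-specific model. Only ZIP is actually unpacked here; PDF and ISO handling would need their own parsers, and the function names are mine, not from any real tool:

```python
import io
import zipfile

def detect(data: bytes) -> str:
    """Guess a format from leading magic bytes (a tiny, illustrative subset)."""
    if data[:4] == b"PK\x03\x04":
        return "zip"
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    if data[:5] == b"%PDF-":
        return "pdf"
    return "other"

def delayer(data: bytes, path: str = "top"):
    """Yield (path, format, payload) for every leaf found inside data,
    recursing through any ZIP containers along the way."""
    kind = detect(data)
    if kind == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as z:
            for name in z.namelist():
                yield from delayer(z.read(name), f"{path}/{name}")
    else:
        yield path, kind, data
```

A real delayering compressor would go further: parse the PDF object streams, undo the Deflate layers, and recompress the recovered JPEGs with a stronger model before re-wrapping everything losslessly.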