Thread: Duplicate File Finder Engine

  1. #1
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    64

    Duplicate File Finder Engine

    I'm not sure that this is the right place to put this, but I made a fast duplicate file finder and wanted to share it. It takes options from a configuration file and writes a report of all duplicates found. The different configuration commands are as follows:


    Code:
    verbose={0,1,2}      :  Sets the level of console verbosity
    recurse={true,false} :  Sets the option to recurse into subdirectories
    report={filename}    :  Sets the output filename for the report
    filter={filter}      :  Sets the filter for finding files
    alldirs={true,false} :  Sets the option to search all directories, not just the ones that match the search filter
    AddPath {path}       :  Add a path to search for duplicates.  Multiple paths may be specified in the same configuration file
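A minimal sketch of how a configuration file in this format could be read, written in Python rather than VB.net. It assumes the braces in the listing are placeholders (so real lines look like `verbose=1` or `AddPath C:\cygwin`), that keys are case-insensitive, and that the defaults shown are reasonable; none of this is confirmed by the post, and `parse_config` is a hypothetical name.

```python
def parse_config(text):
    """Parse the option lines listed above into a dict (illustrative sketch only)."""
    # Default values are assumptions; the post does not state them.
    config = {"verbose": 0, "recurse": True, "report": "report.txt",
              "filter": "*", "alldirs": False, "paths": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # "AddPath {path}" may appear several times, so paths accumulate.
        if line.lower().startswith("addpath "):
            config["paths"].append(line[len("addpath "):].strip())
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip().lower(), value.strip()
        if not sep or key not in config or key == "paths":
            raise ValueError(f"unrecognized line: {raw!r}")
        if key == "verbose":
            config[key] = int(value)
        elif key in ("recurse", "alldirs"):
            config[key] = value.lower() == "true"
        else:
            config[key] = value
    return config


sample = """verbose=1
recurse=true
report=dupes.txt
filter=*.dll
AddPath C:\\cygwin
AddPath D:\\backup
"""
cfg = parse_config(sample)
```

Unknown keys raise an error here instead of being ignored, so a typo in the file is noticed early; the real program may behave differently.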
    The report format should be fairly simple. Duplicates are found in several stages. First, all files under the given paths are enumerated. They are then sorted by size and divided into groups of equal size; groups with only one item are discarded. Next, a 16-byte sample is taken from the middle of every file over 256 KB, and the files are sorted and divided by sample in the same way as they were by size. Finally, all remaining files are hashed with SHA-1 and the hashes are sorted to find duplicates.
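The stages described above can be sketched roughly as follows. This is a Python illustration, not the program's actual VB.net code. It groups with a dictionary instead of sorting, reads "256kb" as 256 KiB, and centres the sample on the file midpoint; those details, the function names, and the lack of error handling for unreadable files are all assumptions.

```python
import hashlib
import os
from collections import defaultdict

SAMPLE_THRESHOLD = 256 * 1024  # "256kb" from the post, read here as 256 KiB
SAMPLE_SIZE = 16               # bytes sampled from the middle of the file


def split_groups(paths, key):
    """Group paths by key(path) and drop groups with only one member."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [g for g in buckets.values() if len(g) > 1]


def middle_sample(path):
    """Return 16 bytes from the middle of a large file, or b"" for small files."""
    size = os.path.getsize(path)
    if size <= SAMPLE_THRESHOLD:
        return b""  # small files skip sampling and go straight to hashing
    with open(path, "rb") as f:
        f.seek(size // 2 - SAMPLE_SIZE // 2)
        return f.read(SAMPLE_SIZE)


def sha1_of(path):
    """Hash the whole file with SHA-1, reading it in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()


def find_duplicates(roots):
    """Return lists of paths that share size, middle sample and SHA-1."""
    files = [os.path.join(d, name)
             for root in roots
             for d, _, names in os.walk(root)
             for name in names]
    groups = split_groups(files, os.path.getsize)
    # Each later stage only re-splits the groups that survived the previous one.
    for key in (middle_sample, sha1_of):
        groups = [sub for g in groups for sub in split_groups(g, key)]
    return groups
```

Because every stage discards singleton groups, the expensive SHA-1 pass only touches files that already agree on size and, for large files, on the sampled bytes.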

    The program is written in VB.net, so the .NET framework is required. File enumeration occurs through native API instead of using the framework classes to increase speed. If there are any errors, please let me know.

  2. #2
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    86
    I have one written in Perl/PHP.
    http://rtfreesoft.blogspot.com/2010/...e-initial.html

  3. #3
    Member chornobyl's Avatar
    Join Date
    May 2008
    Location
    ua/kiev
    Posts
    144

  4. #4
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    64
    Good work roytam1! It's definitely a lot more portable than mine.

    Thanks for the link for finddupe. I may also work on a utility that removes duplicates or replaces them with hardlinks given the report as input. The problem that I run into is choosing which file to keep from the list of duplicates. A GUI might be better suited for selecting which file to keep, but I don't think that most people have the time to look over a list of several thousand duplicates.
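As a rough illustration of the hardlink idea, here is a Python sketch that keeps one file from a group of duplicates and replaces the others with hard links to it. The function name and the default keep-policy (the lexicographically first path) are hypothetical, the report parsing is left out, and the sketch trusts that the files in the group really are identical. Hard links also only work within a single volume.

```python
import os


def replace_with_hardlinks(group, choose_keeper=min):
    """Keep one file from a duplicate group and hard-link the rest to it."""
    keeper = choose_keeper(group)  # keep-policy is pluggable; min() is just a placeholder
    for path in group:
        # Skip the keeper itself and anything already linked to it.
        if path == keeper or os.path.samefile(path, keeper):
            continue
        tmp = path + ".dedup-tmp"   # hypothetical temporary name
        os.link(keeper, tmp)        # create the new link first...
        os.replace(tmp, path)       # ...then swap it over the duplicate
    return keeper
```

Passing a different `choose_keeper`, for example `lambda g: min(g, key=os.path.getmtime)` to keep the oldest copy, is one way to avoid asking the user about every group.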

    I also benchmarked all three programs on the 'C:\cygwin' directory on my laptop, which contains 21,832 files and 5,192 directories totalling 463,815,672 bytes. I ran 'dir /s C:\cygwin' before the benchmark so that all engines started on an equal footing. The tests were run in the following order:
    1. FileSystemDedup.exe
    2. finddupe.exe
    3. finddup.pl

    All times are the global times as measured by Timer 11.00. The results are as follows:
    Code:
    FileSystemDedup.exe      119 seconds
    finddupe.exe             168 seconds
    finddup.pl               245 seconds
    I also timed a second, consecutive run of each program:
    Code:
    FileSystemDedup.exe      10 seconds
    finddupe.exe             235 seconds
    finddup.pl               34 seconds
    However, as the author of one of the programs in the benchmark, I suggest that others run their own benchmarks to confirm these results.
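For anyone who wants to repeat the first-run/second-run comparison, a small timing harness along these lines may help. It is a Python sketch that measures wall-clock time only and is not a substitute for Timer 11.00. The command shown is a placeholder, since the thread does not give the command lines used for the three tools.

```python
import subprocess
import sys
import time


def time_runs(command, runs=2):
    """Run command several times back to back; return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


# Placeholder command so the sketch runs anywhere; substitute a real tool invocation.
first, second = time_runs([sys.executable, "-c", "pass"])
print(f"first run: {first:.2f}s, second run: {second:.2f}s")
```

A large gap between the two numbers usually points to file-system caching rather than to the program itself.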

  5. #5
    Member
    Join Date
    Jan 2012
    Location
    cluj
    Posts
    1
    I've been using this so far

    http://duplicatefilesdeleter.com/


    but I guess I'll try yours too!

  6. #6
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    54
    I use http://www.joerg-rosenthal.com/en/an.../download.html on my Win32 box. It's slow, but more useful for finding pictures with duplicate content. I don't have access to a Win64 box; anyone who runs such an OS might like to try it.

  7. #7
    Member
    Join Date
    Aug 2011
    Location
    Canada
    Posts
    64
    Thanks for the links. Both AntiTwin and DuplicateFilesDeleter have more robust multimedia deduplication engines, but are slower because of this. They're also more user friendly. I guess that the main difference between most deduplication programs is how they handle the tradeoff between features/user friendliness/speed.

    It's difficult to benchmark GUI programs, so I'm not including them in the benchmark as of yet. However, if I find an accurate way to automate GUI benchmarking, I will include them.

  8. #8
    Member
    Join Date
    Jun 2009
    Location
    Cracov, Poland
    Posts
    711
    Look at AutoHotKey.

  9. #9
    Member Menno de Ruiter's Avatar
    Join Date
    Mar 2012
    Location
    Amsterdam, The Netherlands
    Posts
    6
    I wrote one too, in C#.

    It uses all 8 CPU cores in parallel. The input files are ordered by file size so that each parallel invocation takes about the same time. CRCs are taken at 4 different offsets, and the 4 results together form a highly distinctive 16-byte signature.


    Parallel.Invoke
    (
    () =>
    {
    etc.

    It also takes ownership of a file when NTFS security blocks reading it:


    using (new ProcessPrivileges.PrivilegeEnabler(Process.GetCurrentProcess(), Privilege.TakeOwnership))
    {
        FileSecurity fSecurity = File.GetAccessControl(file);
        fSecurity.AddAccessRule(new FileSystemAccessRule(@"WIN-EMU5IJJMFO6\Administrator", FileSystemRights.FullControl, AccessControlType.Allow));
        File.SetAccessControl(file, fSecurity);
        File.SetAttributes(file, fileAttributes);
        System.Security.Principal.SecurityIdentifier sid = new System.Security.Principal.SecurityIdentifier(System.Security.Principal.WellKnownSidType.LocalSystemSid, null);
        System.Security.Principal.NTAccount acct = sid.Translate(typeof(System.Security.Principal.NTAccount)) as System.Security.Principal.NTAccount;
        string strEveryoneAccount = acct.ToString();
        try
        {
            System.Security.AccessControl.FileSecurity sec = System.IO.File.GetAccessControl(file);
            sec.AddAccessRule(new System.Security.AccessControl.FileSystemAccessRule(
                sid,
                System.Security.AccessControl.FileSystemRights.FullControl,
                System.Security.AccessControl.AccessControlType.Allow));
            File.SetAccessControl(file, sec);
        }
        catch (UnauthorizedAccessException)
        {
            // handle permissions problem
        }
        sid = new System.Security.Principal.SecurityIdentifier(System.Security.Principal.WellKnownSidType.BuiltinUsersSid, null);
        acct = sid.Translate(typeof(System.Security.Principal.NTAccount)) as System.Security.Principal.NTAccount;
        strEveryoneAccount = acct.ToString();
        try
        {
            System.Security.AccessControl.FileSecurity sec = System.IO.File.GetAccessControl(file);
            sec.AddAccessRule(new System.Security.AccessControl.FileSystemAccessRule(
                strEveryoneAccount,
                System.Security.AccessControl.FileSystemRights.ReadAndExecute,
                System.Security.AccessControl.AccessControlType.Allow));
            File.SetAccessControl(file, sec);
            ok = true;
            Released = Released + 1;
        }
        catch (UnauthorizedAccessException)
        {
            // handle permissions problem
        }
    }
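The size-ordered, multi-offset CRC idea from the top of this post could look something like the Python sketch below. It assumes CRC-32 (four 4-byte values would account for the 16 bytes mentioned), offsets at 0, 1/4, 1/2 and 3/4 of the file, a 4096-byte block per offset, and a thread pool standing in for `Parallel.Invoke`. None of these specifics come from the post, and the function names are hypothetical.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # bytes read at each offset; an assumed value


def crc_signature(path):
    """Return 16 bytes: four CRC-32 values taken at four offsets in the file."""
    size = os.path.getsize(path)
    parts = []
    with open(path, "rb") as f:
        for i in range(4):
            f.seek(size * i // 4)  # offsets at 0, 1/4, 1/2, 3/4 of the file (assumed)
            parts.append(zlib.crc32(f.read(BLOCK)).to_bytes(4, "big"))
    return b"".join(parts)


def signatures(paths, workers=8):
    """Compute signatures in parallel, submitting files in order of size."""
    # Ordering by size keeps neighbouring tasks at a similar cost, as the post suggests.
    ordered = sorted(paths, key=os.path.getsize)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ordered, pool.map(crc_signature, ordered)))
```

Matching signatures only mark candidates; a full comparison or a hash of the whole file is still needed before treating two files as duplicates.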
