
Thread: FastBackup: yet another deduplication engine

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    FastBackup: yet another deduplication engine

i've started development of one more zpaq/exdupe clone. the very first version, which just deduplicates data in memory and doesn't write any archive, is available at http://freearc.org/download/research/fb001.zip

but i need your help. i'm bad at inventing good program names. can you suggest a good one? generally speaking, i plan to make an all-in-one incremental backup tool, with dedup, compression, encryption, recovery records and so on

  2. #2
    Member FatBit's Avatar
    Join Date
    Jan 2012
    Location
    Prague, CZ
    Posts
    189
    Thanks
    0
    Thanked 36 Times in 27 Posts
    Dear Mr. Ziganshin,

Would "Data Squeezer" be suitable?

    FatBit

  3. #3
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
the program is really very fast - congratulations!

- as a name for the program i would suggest "fastback" or "fastbak"
- a recovery record as in rar sounds wonderful
- maybe a special variant of a well-proven archive format like 7z, but with only 1 or 2 compression methods?
    best regards

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
Incremental backups (at least in the zpaq/exdupe style) require a new archive format, which is why i can't just add new features to freearc/7-zip. Also, the name "Data Squeezer" seems more appropriate for an archiver than for a backup tool

Now i have the names fastback and sciback (both "scientific" and "sky"-like, suggesting "cloud backup"). more ideas please!

  5. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    772
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    more ideas please!
It takes the first part of your family name, "zigzags" files together at fast speed, and the name is not in use except for one gmail account.

    Zigzagfast
    Last edited by Sportman; 16th May 2014 at 17:19. Reason: Added where the name idea came from

  6. #6
    Member Bloax's Avatar
    Join Date
    Feb 2013
    Location
    Dreamland
    Posts
    52
    Thanks
    11
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Sportman View Post
    Zigzagfast
    ZippyDeDup ~ Use it for backup :^)
    Last edited by Bloax; 16th May 2014 at 17:18.

  7. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    How would you explain the difference between deduplication and compression?

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
a dedup engine splits the input data into content-defined chunks and stores the SHA hash of every chunk in the archive. then new data is checked against the existing chunks, and only new chunks are added to the archive
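
Something like this toy Python sketch (the rolling-hash base, window, mask and chunk-size limits are invented for illustration; they are not FastBackup's or zpaq's real parameters):

[code]
import hashlib

B = 31                        # rolling-hash base (arbitrary choice)
WINDOW = 48                   # sliding window, in bytes
MASK = (1 << 12) - 1          # boundary test => ~4 KiB average chunks
MIN_CHUNK, MAX_CHUNK = 1024, 64 * 1024
B_OUT = pow(B, WINDOW, 1 << 32)   # factor for the byte leaving the window

def chunk_boundaries(data: bytes):
    """Yield (start, end) of chunks. A boundary is placed where the
    polynomial hash of the last WINDOW bytes matches MASK, so cut points
    depend only on local content and survive edits elsewhere in the file."""
    h, start = 0, 0
    for i in range(len(data)):
        h = (h * B + data[i]) & 0xFFFFFFFF
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * B_OUT) & 0xFFFFFFFF
        size = i + 1 - start
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

def dedup_add(data: bytes, store: dict) -> int:
    """Store only chunks whose SHA-256 is new; return bytes actually added."""
    new_bytes = 0
    for lo, hi in chunk_boundaries(data):
        chunk = data[lo:hi]
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:
            store[digest] = chunk
            new_bytes += len(chunk)
    return new_bytes
[/code]

Feeding the same data in twice adds 0 new bytes the second time; that is all deduplication does at this level.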

  9. #9
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    FreeArc ... FreeBackup? Or BulletBackup (Backup from Bulat)

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    BulatBackup looks interesting..

  11. #11
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Quote Originally Posted by Bloax View Post
    ZippyDeDup ~ Use it for backup :^)
I kinda like this one, but a lot of non-native English speakers probably won't get it. (It's a punny reference to "Zip-a-Dee-Doo-Dah", made famous by a Disney movie.) I don't know if kids these days watch "Song of the South", either.

    https://www.youtube.com/watch?v=LcxYwwIL5zQ

    The reference would be clearer if it was "ZipADeDup," but probably still not recognizable enough.

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I would just call it freearc and add it as a new format.

  13. #13
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    sr (sleepright)
    ---
I suggest some ideas for a "definitive" backup: create two files,
one index, one (or more) data.
The data file always grows but never changes (i.e., only new data is appended).
This is great for rsync (with or without --append).

The index file allows very fast extraction of files (the main "limitation" of zpaq).
Insert a recovery header (for the chunks) into the data file, just in case the index is lost
(or you don't need fast extraction).
---
Optionally, split (based on the SHAs) into more than one data file.

    Example: when adding
    file1.txt hash aba12345...
    file2.txt hash aa0...
    file3.txt hash aa3...

    add the first in
    ab_backup.sr

    second and third in
    aa_backup.sr

If you have a very big backup, you can easily split it (sharding) without needing to "choose" where to add.
Using 0, 1, 2 or 3 characters per level (in hex) you get 1, 16, 256 or 4096 sets.
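
A minimal sketch of that routing rule, assuming SHA-256 (the function name is made up for the example):

[code]
import hashlib

def shard_file(data: bytes, level: int = 2) -> str:
    """Route a file (or chunk) to a datafile by the first `level` hex
    chars of its SHA-256: level 0, 1, 2 or 3 gives 1, 16, 256 or 4096
    shards, as described above."""
    prefix = hashlib.sha256(data).hexdigest()[:level]
    return f"{prefix}_backup.sr" if level else "backup.sr"

# with level=2, data hashing to "aba12345..." lands in "ab_backup.sr";
# data hashing to "aa0..." or "aa3..." lands in "aa_backup.sr"
[/code]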
    Last edited by fcorbelli; 16th May 2014 at 21:20.

  14. The Following User Says Thank You to fcorbelli For This Useful Post:

    Bulat Ziganshin (23rd May 2014)

  15. #14
    Member
    Join Date
    Oct 2009
    Location
    usa
    Posts
    56
    Thanks
    1
    Thanked 9 Times in 6 Posts
I also think you should add it to freearc as a new format, independent of .arc, and even compile it as an option into the arc.exe file.

  16. #15
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
Here's a very early test on my source tree:
    [code]
    F:\zarc>z:\fb64.exe -t4 f:\zarc
    4+2 threads * 8mb buffers, 48b..4kb.. chunks
    Scanning: 6,238,754,776 bytes in 32,976 files
    100.0% deduplicated: 6,238,754,776 => 3,952,103,894 bytes

    F:\zarc>zpaq64 a r:\temp\kao f:\zarc\*.* -method 0
    zpaq v6.51 journaling archiver, compiled May 7 2014
    ....
    0 + (6238754776 -> 4164812809) = 4164812809
93.907 seconds (all OK)
[/code]
As you can see, the deduplication seems good (better than zpaq's).

  17. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Quote Originally Posted by nburns View Post
    How would you explain the difference between deduplication and compression?
Deduplication only looks for repeated sequences above some fairly large length threshold. Compression usually also involves gathering statistics and/or looking for short repeated patterns.

  18. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
zpaq uses a 64 KB average fragment size by default. To test 4 KB fragments, use the option -fragment 2.
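
For instance, the test above rerun with 4 KB average fragments (same command, just adding the option):

[code]
zpaq64 a r:\temp\kao f:\zarc\*.* -method 0 -fragment 2
[/code]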

    Also, there is already a compressor named sr (symbol ranking).

  19. #18
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Also, there is already a compressor named sr (symbol ranking).
    If sr is taken, the logical second choice is jr.

  20. The Following User Says Thank You to nburns For This Useful Post:

    PSHUFB (14th July 2014)

  21. #19
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    dedup engines splits input data into content-defined chunks and stores SHA hash of every chunk in the archive. then new data are checked against existing chunks and only new chunks are added to the archive
    That's pretty much what I thought, but that's just a kind of dictionary compression.

    Are you familiar with git's object database? The applications probably don't overlap perfectly, but it's probably worth studying git's design decisions and how they turned out. Backup and revision control are not thought of as the same, but there is a fair amount of common ground IMO.

  22. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by fcorbelli View Post
    As you can see deduplication seems good (better than zpaq's)
they are exactly the same, since i'm using zpaq's chunking algo. the difference is due to the smaller default chunk size and the lack of an archive index

i plan to beat zpaq on speed as well as flexibility, but deduplication efficiency will probably remain the same
    Last edited by Bulat Ziganshin; 16th May 2014 at 23:00.

  23. #21
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    372
    Thanks
    26
    Thanked 22 Times in 15 Posts
    Backup ULtimATive

  24. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    why not just Backup ULtimATe?

  25. #23
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    372
    Thanks
    26
    Thanked 22 Times in 15 Posts
Do you also want to define a decompression language like zpaql, so it's possible to create your own custom compression algorithms that everyone can still decode?

  26. #24
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    372
    Thanks
    26
    Thanked 22 Times in 15 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    why not just Backup ULtimATe?
if it's more correct in english then ok; i thought "ultimative" would sound nicer

  27. #25
    Member Bloax's Avatar
    Join Date
    Feb 2013
    Location
    Dreamland
    Posts
    52
    Thanks
    11
    Thanked 2 Times in 2 Posts
    The least cheesy solution would be to integrate it into FreeArc, yes.

  28. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
Using a fixed format like tornado should be faster than a decompression language like zpaq. The disadvantage is losing compatibility when you want to improve the compression. I think this is more of a problem for high-end compression like CM, where you are more likely to make changes.

  29. #27
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    For me ..
    FLASHDUP
    or
    STORMDUP

  30. #28
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    FatBackup?

  31. #29
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
bzig2 -- from your first initial and the first three letters of your last name, and "2" because it's your second backup program. Do you think it sounds original enough?

  32. #30
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Maybe something Hollywood sounding, like: The Synchronizer.

