24x24 version. Probably it's too big, anyway, check it out!
Seriously optimized MD5. Note that MD5 will be the default hash, because:
- MD5 is the most popular hash
- MD5 is FAST (Faster than SHA-1)
- MD5 hash is shorter than SHA-1 thus it is more readable
Some timings on ENWIK9 (including I/O):
Indy10 -> 18 sec (Shame!)
RFC implementation -> 7 sec
My MD5 -> 4 sec
If the Indy10 component IdHashMessageDigest worked at a decent speed (at most 1.5x slower than the reference implementation), I would never have considered writing my own MD5 (SHA-1, SHA-256) implementation. But since it is nearly 5x slower, I can't afford such a price... So I worked seriously to hand-optimize the C++ code and got something interesting. Now CHK is MUCH faster than many hash tools!
afair, sha1 in srep runs at 300mb/s and md5 is 500mb/s. it's a code from the LibTomCrypt library, w/o any modifications
Some timings on ENWIK9 (including I/O)
Where'd the timings go? :-)
I am... Black_Fox... my discontinued benchmark
MD5 is broken. http://en.wikipedia.org/wiki/MD5#Security
MD5 is broken, I know. I even posted a ZIP file containing two different files sharing the same MD5! But browsing the web I see that MD5 is the most common hash.
About speed. I can't claim that I have the fastest MD5, since I'm using Borland C++ Builder - a pretty slow compiler - and it's a GUI program, but... I squeezed everything out of this compiler. I tested a bunch of GUI MD5 tools and CHK is the fastest! I'm sure that if I compile my code as a single-threaded command-line tool, it will be slightly faster.
What I've done: I just collected all the optimization ideas for MD5 from the fastest MD5 implementations - OpenSSL, HashCat, MD5 papers... and tested them. Some of them worked, some of them didn't. Simple.
A few examples:
Use "union" instead of translating an array of bytes to an array of integers.
Code:
union { Byte Buf[64]; UInt X[16]; };
Cheapest trick, but many MD5 implementations still load the array of ints using many operations like:
Code:
X[i]=Buf[j]|(Buf[j+1]<<8)|(Buf[j+2]<<16)|(Buf[j+3]<<24);
Yes, it's needed for Big-Endian compatibility, but in most cases this is not needed, I guess. Even Macs are Little-Endian now (same Intel CPU as in PCs).
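To make the trade-off above concrete, here is a minimal sketch of both load styles. It is not the CHK code: the function names are made up for illustration, and the fast path uses memcpy as a stand-in for the union, since reading the inactive member of a union is formally undefined in standard C++ even though common compilers accept it. The fast path falls back to the portable load on a Big-Endian host.

```cpp
#include <cstdint>
#include <cstring>

// Portable load: builds each 32-bit word from four bytes, least significant first.
// Works on any host, but costs several shifts and ORs per word.
inline void LoadWordsPortable(const std::uint8_t buf[64], std::uint32_t x[16]) {
    for (int i = 0, j = 0; i < 16; ++i, j += 4) {
        x[i] = std::uint32_t(buf[j]) | (std::uint32_t(buf[j + 1]) << 8) |
               (std::uint32_t(buf[j + 2]) << 16) | (std::uint32_t(buf[j + 3]) << 24);
    }
}

// Returns true when the host stores the least significant byte first.
inline bool IsLittleEndianHost() {
    const std::uint32_t probe = 1;
    std::uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Fast load (hypothetical helper): one block copy on Little-Endian hosts,
// portable fallback everywhere else.
inline void LoadWordsFast(const std::uint8_t buf[64], std::uint32_t x[16]) {
    if (IsLittleEndianHost()) {
        std::memcpy(x, buf, 64);
    } else {
        LoadWordsPortable(buf, x);
    }
}
```

Both functions fill X with the same values; only the amount of work per block differs.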
Some things did not work as they should. Many implementations use the "optimized" function from Wei Dai:
Code:
(((c^d)&b)^d) // Instead of ((b&c)|(~b&d))
The original code runs faster somehow.
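For anyone who wants to check that the two forms really are interchangeable, a small sketch follows. The function names are made up here. Since both expressions work bit by bit, testing the 8 combinations of all-zero/all-one words covers every case.

```cpp
#include <cstdint>

// Reference form of the MD5 round 1 function F: "if b then c else d", bit by bit.
inline std::uint32_t FOriginal(std::uint32_t b, std::uint32_t c, std::uint32_t d) {
    return (b & c) | (~b & d);
}

// Rewritten form: three operations instead of four.
inline std::uint32_t FRewritten(std::uint32_t b, std::uint32_t c, std::uint32_t d) {
    return ((c ^ d) & b) ^ d;
}

// Both forms are purely bitwise, so the 8 combinations of
// all-zero / all-one inputs cover every possible bit pattern.
inline bool FormsAgree() {
    for (std::uint32_t m = 0; m < 8; ++m) {
        const std::uint32_t b = (m & 1u) ? 0xFFFFFFFFu : 0u;
        const std::uint32_t c = (m & 2u) ? 0xFFFFFFFFu : 0u;
        const std::uint32_t d = (m & 4u) ? 0xFFFFFFFFu : 0u;
        if (FOriginal(b, c, d) != FRewritten(b, c, d)) return false;
    }
    return true;
}
```

Which form is faster depends on the compiler and CPU, so it has to be measured, as noted above.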
I like the Round 3 optimization idea from the author of HashCat, which did not work for me, though:
Code:
a+=(b^c^d)+X[5]+0xfffa3942UL; a=Rol(a, 4)+b;
d+=(a^b^c)+X[8]+0x8771f681UL; d=Rol(d, 11)+a;
// These b^c (and later d^a) can be precomputed:
UInt t=b^c;
a+=(t^d)+X[5]+0xfffa3942UL; a=Rol(a, 4)+b;
d+=(a^t)+X[8]+0x8771f681UL; d=Rol(d, 11)+a;
t=d^a;
c+=(t^b)+X[11]+0x6d9d6122UL; c=Rol(c, 16)+d;
b+=(c^t)+X[14]+0xfde5380cUL; b=Rol(b, 23)+c;
// and so on
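A quick way to convince yourself that the precomputed form gives the same result is to run the same four Round 3 steps both ways. This is only a sketch: the State struct and the function names are made up for illustration, while UInt and Rol follow the snippet above.

```cpp
#include <cstdint>

typedef std::uint32_t UInt;

// Rotate left by s bits (s must be in 1..31).
inline UInt Rol(UInt v, int s) { return (v << s) | (v >> (32 - s)); }

struct State { UInt a, b, c, d; };

// Four Round 3 steps, each computing the full three-way XOR.
inline State StepsPlain(State s, const UInt X[16]) {
    UInt a = s.a, b = s.b, c = s.c, d = s.d;
    a += (b ^ c ^ d) + X[5]  + 0xfffa3942u; a = Rol(a, 4)  + b;
    d += (a ^ b ^ c) + X[8]  + 0x8771f681u; d = Rol(d, 11) + a;
    c += (d ^ a ^ b) + X[11] + 0x6d9d6122u; c = Rol(c, 16) + d;
    b += (c ^ d ^ a) + X[14] + 0xfde5380cu; b = Rol(b, 23) + c;
    State r = {a, b, c, d};
    return r;
}

// Same four steps, reusing t for each pair of neighbouring steps.
inline State StepsPrecomputed(State s, const UInt X[16]) {
    UInt a = s.a, b = s.b, c = s.c, d = s.d;
    UInt t = b ^ c;
    a += (t ^ d) + X[5]  + 0xfffa3942u; a = Rol(a, 4)  + b;
    d += (a ^ t) + X[8]  + 0x8771f681u; d = Rol(d, 11) + a;
    t = d ^ a;
    c += (t ^ b) + X[11] + 0x6d9d6122u; c = Rol(c, 16) + d;
    b += (c ^ t) + X[14] + 0xfde5380cu; b = Rol(b, 23) + c;
    State r = {a, b, c, d};
    return r;
}

inline bool Same(const State& x, const State& y) {
    return x.a == y.a && x.b == y.b && x.c == y.c && x.d == y.d;
}
```

The second version saves one XOR per step on paper; whether that shows up in the timings depends on what the compiler already does with the plain form.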
Added CRC16, because LHA/LZH uses it.
Now you can copy the hashes of multiple files. The result is:
Automatic Backup[10].rbk, MD5: 6084AF67AC459E171654F2C5B3C9296B
Automatic Backup[11].rbk, MD5: 2DD4750866DA3666A386BFD0BA511FF3
PM-RegScan.bmp, MD5: C552AAE37D439096077085E223F5B186
RMScrn.exe, MD5: B3A026B8D5DFBA292187576C95B511AF
XRegistry.bin, MD5: 77D7200CC17366DF5A04248CF0FF2C4C
nu.exe, MD5: 0769E2260F1F29CA92872EFA59000EB2
etc.
New Refresh command updates hashes of all files.
Checked the "Magnet Links" concept. Well, it's yet another field where CHK can be used. So I will add MD4, since it's not really dead - the eD2K network (eMule and others) uses it. Thus you will be able to create eD2K Magnet Links for files that are 9500 KB or smaller. For larger files (more than one chunk) the ED2K hash is an MD4 of the hash list.
Another idea is to add a hash list feature. For each file, instead of generating just one hash, we will generate a hash list - a list of hashes, one for each file chunk/part. Say, as with ED2K, we will generate an MD4 for each 9500 KB chunk and output these hashes as hash1:hash2:hash3:hash4 and so on. So, having the complete hash list, you will be able to detect files with the same beginning or detect which part was downloaded incorrectly.
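The hash list idea can be sketched roughly as follows. This is not CHK code: to keep the example short, a 32-bit FNV-1a stands in for MD4, and the function names and the chunk size parameter are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Stand-in chunk hash (32-bit FNV-1a). A real tool would compute MD4 here.
inline std::uint32_t StandInHash(const unsigned char* p, std::size_t n) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

// Hashes the data chunk by chunk and joins the hex digests with ':',
// giving hash1:hash2:hash3 and so on. The last chunk may be shorter.
inline std::string HashList(const std::string& data, std::size_t chunkSize) {
    std::string out;
    if (chunkSize == 0) return out;
    for (std::size_t pos = 0; pos < data.size(); pos += chunkSize) {
        const std::size_t n = std::min(chunkSize, data.size() - pos);
        const unsigned char* p =
            reinterpret_cast<const unsigned char*>(data.data()) + pos;
        char hex[9];
        std::snprintf(hex, sizeof hex, "%08X", static_cast<unsigned>(StandInHash(p, n)));
        if (!out.empty()) out += ':';
        out += hex;
    }
    return out;
}
```

Comparing two such lists entry by entry shows which chunk differs: a damaged byte in the second chunk changes only the second entry.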
Checked the KaZaa hash - worst idea - MD5 with data skipping. We hash 300 KB of data, then skip 300 KB; next, we hash 600 KB and again skip the same 600 KB; next, read 1200 KB, skip 1200 KB, and so on. An attacker can modify a huge part of a file unnoticed - the hash check will be unable to detect the changes...
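To see how much of a file such a scheme leaves unchecked, here is a small sketch that lists the byte ranges that would be hashed. It follows only the description above (hash a block, skip a block of the same size, double the block size), assumes 1 KB = 1024 bytes, and is not taken from any KaZaa source; all names are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// A half-open byte range [first, second).
typedef std::pair<std::uint64_t, std::uint64_t> Range;

// Lists the ranges that get hashed: 300 KB hashed, 300 KB skipped,
// 600 KB hashed, 600 KB skipped, and so on.
inline std::vector<Range> HashedRanges(std::uint64_t fileSize) {
    std::vector<Range> ranges;
    std::uint64_t pos = 0;
    std::uint64_t block = 300u * 1024u;
    while (pos < fileSize) {
        const std::uint64_t end = std::min(pos + block, fileSize);
        ranges.push_back(Range(pos, end));
        pos = end + block;  // skip as many bytes as the block size
        block *= 2;
    }
    return ranges;
}

// Total number of bytes that the hash actually covers.
inline std::uint64_t HashedBytes(std::uint64_t fileSize) {
    std::uint64_t total = 0;
    const std::vector<Range> ranges = HashedRanges(fileSize);
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        total += ranges[i].second - ranges[i].first;
    }
    return total;
}
```

For a 2100 KB file this gives three hashed ranges covering 1200 KB in total, so 900 KB can be changed without affecting the hash.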
Yet another idea is Base32 encoding for hashes. SHA1-Base32 is frequently used in P2P networks, being extremely readable. The SHA1-Base32 length (32 characters) is equal to that of MD5 in hex!
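The length claim is easy to check: a SHA-1 digest is 160 bits, and 160 / 5 = 32 Base32 characters, the same as 128 / 4 = 32 hex characters for MD5. Below is a minimal encoder sketch using the RFC 4648 alphabet without '=' padding (a 20-byte digest never needs padding). The function name is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encodes bytes with the RFC 4648 Base32 alphabet, without '=' padding.
inline std::string Base32(const unsigned char* data, std::size_t size) {
    static const char kAlphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";
    std::string out;
    std::uint32_t acc = 0;  // bit accumulator, only the low `bits` bits are pending
    int bits = 0;           // number of pending bits in acc
    for (std::size_t i = 0; i < size; ++i) {
        acc = (acc << 8) | data[i];
        bits += 8;
        while (bits >= 5) {
            bits -= 5;
            out += kAlphabet[(acc >> bits) & 31u];
        }
    }
    if (bits > 0) {
        // Left-align the remaining bits in the last 5-bit group.
        out += kAlphabet[(acc << (5 - bits)) & 31u];
    }
    return out;
}
```

Feeding it a 20-byte SHA-1 digest gives a 32-character string.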
There is a prospect of Pirate Bay tracker using magnet links, will those be the same ED2K hash or something else?
Wikipedia states the following about MD4: "generating a collision is now as cheap as verifying it (a few microseconds)".
Magnet links are very flexible. TPB can use any hash algorithm to generate hashes, but I hope they'll not opt for something already broken.
Yes, in the original paper from China that simultaneously broke MD4, MD5, HAVAL-128, and RIPEMD, they said that finding collisions for MD4 could be done by "hand calculation". But I suppose CRC-32 has its uses too.
AFAIK, there are no known secure 128 bit hashes. But I suppose you could throw away half of the output of SHA-256 and be safe.
The development goes, but slooowly, due to my extreme busyness at my main job.
Anyway, I added MD4. Thinking about SHA1-Base32 - a nice, secure, readable, and not too long hash. Probably this one should be the default - too many users complain about MD5 weaknesses. But I guess SHA1-Base32 should get a more compact name, say, G2 (Gnutella 2 uses it) or Magnet, or even CHK.
Thinking about new program name. HashFrog? Or some name that will be unique and hash-related. Anyway, CHK is okay too, I guess.
I really like the thing that people visit my homepage even without news from my side! Even if I'm far from home (I've just returned from Siberia super tour) I can find motivation to improve or create things... Thank you!
CryptHashData() from win32 api seems to be fastest: 636 mb/s for md5 and 474 mb/s for sha1 on 2600k@4.6GHz
Thanks for the link, anyway.
Added "Save as..." command to save/dump the file list to TXT file. The layout is subject to change, however it looks like:
// Generated by CHK v1.03 (Optional)
// 05.03.2012 21:15 or 03/05/2012
// Generated by CHK v1.03 on 03/05/2012
// Generated on Sunday March 4th, 2012 at 11:36:43
C:\test.txt, MD5: xxxxxxxxxxxxxxxx
Probably I should dump filesize as well:
C:\test.txt (3,735 bytes), MD5: xxxxxxxxxxxxxxxx
Timestamp is useful here. Not sure about program and its version, but it can be useful too.
Maybe a tab (\t) would be more useful as a separator between filename, hash and file size, because a script can easily process such dump files.
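As a rough illustration of why a tab-separated line is easy for scripts, here is a sketch that writes and reads one dump line. The field order (file name, hash, size) and all names are assumptions for this example, not the actual CHK layout, which is still subject to change.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// One line of a hypothetical dump file.
struct Entry {
    std::string fileName;
    std::string hash;
    std::uint64_t size;
};

// Writes "fileName<TAB>hash<TAB>size".
inline std::string FormatEntry(const Entry& e) {
    std::ostringstream os;
    os << e.fileName << '\t' << e.hash << '\t' << e.size;
    return os.str();
}

// Splits on the first two tabs. A tab cannot appear in a Windows file name,
// so commas, spaces and parentheses in the name cause no trouble.
inline bool ParseEntry(const std::string& line, Entry& e) {
    const std::size_t t1 = line.find('\t');
    if (t1 == std::string::npos) return false;
    const std::size_t t2 = line.find('\t', t1 + 1);
    if (t2 == std::string::npos) return false;
    e.fileName = line.substr(0, t1);
    e.hash = line.substr(t1 + 1, t2 - t1 - 1);
    std::istringstream is(line.substr(t2 + 1));
    return static_cast<bool>(is >> e.size);
}
```

A name like "C:\My (old), files\test.txt" survives the round trip unchanged, which is exactly where the comma-and-parentheses layout gets awkward.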
BIT Archiver homepage: www.osmanturan.com
Can you describe what the binaries do, do they add a hash to the data buffer? Is there a better example for newbies for a console stdin to stdout i.e. "flush" the buffer to stdout ?
Not really understood the question...
Now CHK dumps the hash list in Unicode. Please check the attached file. If you have any issues with this file - let me know!
It's a bit hard to parse such notation with regular expressions, because parentheses can be used in file names too. Why don't you use \t or similar instead of cosmetic stuff that makes it harder to parse?
sha*sum, md5sum and probably other *sum programs, if they exist, use a format like:
<hash> <filename>
e.g.
Code:
piotrek@p5q-pro:~/Pobrane/corpora/enwik$ sha1sum *
88e00330c706b76aac18e7c71e36b06578943d60 0enwik8
d31fb012d587941d7d0c76eece89967455490be9 0enwik8.xwrt
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8.dec
0c65b4c7314408d0cefb6d28f62bdb5adf49dcbe enwik8.lzp
ff04a4d8231bd89a59893e50cf686155cee19ed8 enwik8.lzpccm
7fff4c0cd40db0b0e6974e571f43f11b6d46aa7a enwik8.lzpccm2
46e10cd26cbcae8dfbe741cb0995e1efebfb2216 enwik8.sel
fc4ca3271ee798c7b4c17cf4a93f594eae5334dd enwik8.swi
27fe85921f14de8a959bcc23b0da0e68a8726ab2 enwik8.xwrt
2996e86fb978f93cca8f566cc56998923e7fe581 enwik9
2996e86fb978f93cca8f566cc56998923e7fe581 enwik9.dec
54191e1331c75ef0edeb50841e43dda96d8b60c5 enwik9.lzp
1e43f981aa7e355dec89653a8a0a50696573a46b enwik9.lzpccm
d6ee95bc29c8be4cbe606f696f4018a6f825b1b6 table
19790d9fa9bcc2a207126caa3da8f27ea2cf648b _tables.out
sha1sum: xwrt: Jest katalogiem
piotrek@p5q-pro:~/Pobrane/corpora/enwik$ sha2
I've checked sha1sum and it silently ignores any invalid line when checking, only reporting that all lines are invalid if that is the case.
<Hash> <FileName>
or
<Hash> *<FileName>
has one disadvantage - I can't see the hash type from such listing!
Anyway, I have an idea - add a comment with the hash type as a header:
Code:
# SHA1
88e00330c706b76aac18e7c71e36b06578943d60 0enwik8
d31fb012d587941d7d0c76eece89967455490be9 0enwik8.xwrt
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8.dec
0c65b4c7314408d0cefb6d28f62bdb5adf49dcbe enwik8.lzp
ff04a4d8231bd89a59893e50cf686155cee19ed8 enwik8.lzpccm
7fff4c0cd40db0b0e6974e571f43f11b6d46aa7a enwik8.lzpccm2
46e10cd26cbcae8dfbe741cb0995e1efebfb2216 enwik8.sel
fc4ca3271ee798c7b4c17cf4a93f594eae5334dd enwik8.swi
27fe85921f14de8a959bcc23b0da0e68a8726ab2 enwik8.xwrt
2996e86fb978f93cca8f566cc56998923e7fe581 enwik9
2996e86fb978f93cca8f566cc56998923e7fe581 enwik9.dec
54191e1331c75ef0edeb50841e43dda96d8b60c5 enwik9.lzp
1e43f981aa7e355dec89653a8a0a50696573a46b enwik9.lzpccm
d6ee95bc29c8be4cbe606f696f4018a6f825b1b6 table
19790d9fa9bcc2a207126caa3da8f27ea2cf648b _tables.out
I guess I will add CRC64, since 7-Zip already uses it. Probably I shouldn't keep ED2K, since it's basically just MD4 (for files that are 9500 KB or smaller; for larger files it's the MD4 of all the MD4 sums of each 9500 KB block).
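Coming back to the listing format with a hash-type header comment: reading such a file could look roughly like the sketch below. It is an illustration under assumptions, not CHK or sha1sum code. The first '#' line is taken as the hash type, an optional '*' before the file name is dropped, and lines that don't fit are skipped silently, similar to the sha1sum behaviour described above. All names are made up.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct SumFile {
    std::string hashType;  // taken from the first "# TYPE" line, empty if none
    std::vector<std::pair<std::string, std::string> > entries;  // (hash, file name)
};

inline SumFile ParseSumFile(const std::string& text) {
    SumFile result;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line[0] == '#') {
            // Only the first comment line names the hash type.
            if (result.hashType.empty()) {
                const std::size_t start = line.find_first_not_of("# ");
                if (start != std::string::npos) result.hashType = line.substr(start);
            }
            continue;
        }
        const std::size_t sp = line.find(' ');
        if (sp == std::string::npos) continue;  // invalid line, skip silently
        std::size_t nameStart = line.find_first_not_of(' ', sp);
        if (nameStart == std::string::npos) continue;
        if (line[nameStart] == '*') ++nameStart;  // optional binary-mode marker
        if (nameStart >= line.size()) continue;
        result.entries.push_back(
            std::make_pair(line.substr(0, sp), line.substr(nameStart)));
    }
    return result;
}
```

With the header present, a checker knows which algorithm to run without guessing from the hash length.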
Probably I shouldn't include CRC16 - it produces too many collisions. The only reason to include it is that LZH/LHA use CRC16 for file integrity checking.
Tested CRC64 - really like it! It's probably the future standard. Although using 64-bit arithmetic is somewhat slow on a 32-bit machine/code. Will rewrite it using 32-bit arithmetic only. With CRC this is easily possible.
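Here is a minimal sketch of what "32-bit arithmetic only" can look like. It assumes the ECMA-182 polynomial in reflected form (0xC96C5795D7870F42) with all-ones initial value and final inversion, which is the variant used by the xz format; whether CHK uses exactly these parameters is an assumption. The code is bitwise for clarity, so a real implementation would use lookup tables, and the function names are made up.

```cpp
#include <cstddef>
#include <cstdint>

// Reference version using 64-bit arithmetic.
inline std::uint64_t Crc64(const unsigned char* p, std::size_t n) {
    std::uint64_t crc = ~std::uint64_t(0);
    for (std::size_t i = 0; i < n; ++i) {
        crc ^= p[i];
        for (int k = 0; k < 8; ++k) {
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xC96C5795D7870F42ull : 0ull);
        }
    }
    return ~crc;
}

// Same CRC with the state kept in two 32-bit halves. The 64-bit type is
// used only to hand back the final result.
inline std::uint64_t Crc64Halves(const unsigned char* p, std::size_t n) {
    std::uint32_t lo = 0xFFFFFFFFu;
    std::uint32_t hi = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < n; ++i) {
        lo ^= p[i];
        for (int k = 0; k < 8; ++k) {
            const std::uint32_t carry = lo & 1u;
            lo = (lo >> 1) | (hi << 31);  // lowest bit of hi moves into lo
            hi >>= 1;
            if (carry) {
                lo ^= 0xD7870F42u;  // low half of the polynomial
                hi ^= 0xC96C5795u;  // high half of the polynomial
            }
        }
    }
    return (std::uint64_t(~hi) << 32) | std::uint32_t(~lo);
}
```

Both functions return the same value for any input; for the standard check string "123456789" that value is 0x995DC9BBDF1939FA.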
Have you considered implementing hardware-accelerated CRC32? IIRC, your CPU already supports it. But its output is slightly different from the well-known variant (it uses the 0x1EDC6F41 polynomial instead of 0x04C11DB7). I have to note that it's really fast.
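The difference between the two variants can be shown in software with a small sketch. It is a plain bitwise CRC that takes the polynomial in reflected form as a parameter, so it demonstrates only that the outputs differ, not the speed of the hardware instruction (the SSE4.2 crc32 instruction on x86 computes the 0x1EDC6F41 variant, usually called CRC32C). Function and constant names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// 0x04C11DB7 bit-reversed: the well-known CRC32 (ZIP, PNG, and so on).
const std::uint32_t kCrc32Poly = 0xEDB88320u;
// 0x1EDC6F41 bit-reversed: CRC32C, the variant the hardware instruction computes.
const std::uint32_t kCrc32cPoly = 0x82F63B78u;

// Bitwise reflected CRC-32 with all-ones initial value and final inversion.
inline std::uint32_t Crc32Bitwise(const unsigned char* p, std::size_t n,
                                  std::uint32_t reflectedPoly) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < n; ++i) {
        crc ^= p[i];
        for (int k = 0; k < 8; ++k) {
            crc = (crc >> 1) ^ ((crc & 1u) ? reflectedPoly : 0u);
        }
    }
    return ~crc;
}
```

For the check string "123456789" the well-known CRC32 is 0xCBF43926 and CRC32C is 0xE3069283, so the hardware result cannot replace the classic CRC32 where compatibility with ZIP-style checksums matters.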