Too bad, it's slow.
I wished for Skein or Blake...
Yeah, I'm looking for a SHA-1 replacement in ZPAQ, just in case. I'll have to do some performance testing. SHA-1 is probably OK for integrity checking and key strengthening, but collisions could be a problem for deduplication.
I am not sure I trust these so-called open contests. Years past, I assumed that the ones the US blessed were hard to break, yet I am sure the NSA kept them weak enough that they could break them. However, today I don't think that is true. I saw a video on the net last year that talked about the Chinese breaking one such method. However, someone pointed out that they were using a version that was wrong. So they sent the corrected code back to China, and in the rump session the next day they presented a break for the corrected version. So it's quite possible the Chinese are keeping very quiet about this method. It's also possible the Chinese may even have directed the winner; after all, the math skills and spying skills of the Chinese have passed those of the US in the last few years. I hope someone who has a pointer to the video that was on the net can post it. I don't remember the main speaker; it might have been Zimmerman. Please, someone post the video, it's worth seeing again.
At least that is my view of watching from the sidelines these last few years.
I would rather have an open, international contest (developed mostly outside the U.S.) than a secret one, such as was used to design DES and Skipjack. Still, you never know. The best argument for security is that a lot of people have tried to attack it and failed.
Anyway, I am concerned about SHA-1 in zpaq, not so much for error detection, as originally used, or even for key strengthening for encrypted archives, but for collision detection for deduplication. For example, I have two files on my computer with identical MD5 hashes. Such pairs are easy to produce. If I had used MD5 for deduplication, then one of them would be extracted incorrectly as the other. So far, the best attack on SHA-1 is 2^63 hashes to find a collision. That is still strong enough, but just barely. My bigger concern is that there will be a discovery like the Chinese group that simultaneously broke MD4, MD5, TIGER, and RIPEMD-128. SHA-1 and SHA-2 have similar structures to these except for a larger state. SHA-3/Keccak has a very different structure, but less time to analyze its security.
Many archives use non-cryptographic checksums like CRC-32 or even nothing to detect errors. So SHA-1 is probably overkill. Also, I used it for key strengthening, where I only need pre-image resistance, not collision resistance. For encryption I use RC4-drop-768 which has known weaknesses but is fast. To strengthen the key I use key1 = SHA-1(password, 0, salt, 0, password, 1, salt, 0, password, 2, salt, 0,...) repeated out to 10 MB in order to slow down brute force key search. For this purpose, even a weak hash like MD4 would work, as long as there is not a fast inversion. Also, RC4 is vulnerable to related key attacks, so I use SHA-1(key1, offset) as the RC4 key in 4 KB blocks to allow random access. Again, a weak hash is sufficient. The only purpose is so that consecutive keys differ in more than just a few bits.
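The scheme can be sketched in a few lines of Python (my reconstruction from the description above, not the actual zpaq code; the exact counter encoding, byte layout, and 4 KB block indexing are assumptions):

```python
import hashlib

def strengthen_key(password: bytes, salt: bytes,
                   target: int = 10 * 1024 * 1024) -> bytes:
    # key1 = SHA-1(password, 0, salt, 0, password, 1, salt, 0, ...)
    # with the input repeated out to ~10 MB to slow brute-force search.
    h = hashlib.sha1()
    fed, counter = 0, 0
    while fed < target:
        chunk = password + bytes([counter & 0xFF]) + salt + bytes([0])
        h.update(chunk)
        fed += len(chunk)
        counter += 1
    return h.digest()  # 20-byte key1

def block_key(key1: bytes, offset: int) -> bytes:
    # Per-block RC4 key: SHA-1(key1, offset), one key per 4 KB block,
    # so random access works and consecutive keys differ in more than
    # just a few bits (avoiding RC4 related-key problems).
    block = offset // 4096
    return hashlib.sha1(key1 + block.to_bytes(8, 'little')).digest()
```

The point is only that the per-block keys look unrelated; as noted above, even a weak hash suffices here as long as it has no fast inversion.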
At some point I will probably use either SHA-2 or SHA-3 for deduplication, but I'll wait for now.
Sorry, had too much beer.
Last edited by biject.bwts; 18th October 2012 at 23:52. Reason: just because
Unicity distance makes it hard to detect when the correct key is guessed in a ciphertext-only attack because the plaintext still looks random. ZPAQ is of course vulnerable because it uses predictable headers. But a bijective compressor does not protect you in general either. Even though any file is valid input to a bijective decompressor, you have to assume that an attacker can detect valid decompressed output, such as text in some natural language. For the compressor to protect you, it has to be optimal (which implies it is bijective), but that problem is not computable.
Instead, I used an encryption format that is believed to be secure against chosen plaintext attacks.
My reply was just wasted space
Last edited by biject.bwts; 18th October 2012 at 23:53. Reason: just because
I noticed the answers just now...
But I think that for almost all dedupe uses you don't need crypto strength, you need something that mixes things well, so they don't collide on their own. So I'd use SpookyHash with modified mixing to have a fast, 256-bit checksum. And just maybe add a crypto-secure option.
Is there a reason why you don't use PBKDF2 or scrypt for password strengthening?
And xsalsa / snow2 for encryption?
I think it is important to use a cryptographic hash for deduplication. If it is possible to produce collisions, then there is a small but not negligible chance that they could be produced deliberately. For example, I have two files on my computer with the same MD5 checksums. This should not cause deduplication to fail. In any case, SHA-1 is fast enough. My implementation adds about 10 sec/GB on my laptop, which adds 10% to the fastest compression mode. Slavasoft fsum -sha1 is about twice as fast, though I haven't figured out how.
Also, I think a 160 bit hash is about right. It is long enough to make collisions infeasible. Longer hashes would just eat memory, which already exceeds 2 GB when reading multi-TB archives. If I switched to SHA-2 or SHA-3 I would probably throw away the extra bits.
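A back-of-the-envelope check of "long enough" (my numbers, not from the post): the birthday/union bound gives the collision probability for k random hashes of n bits as at most C(k,2)/2^n, assuming an ideal hash:

```python
from fractions import Fraction

def birthday_collision_bound(k: int, bits: int) -> Fraction:
    # Union bound on the probability that any pair among k uniformly
    # random n-bit hashes collides: C(k, 2) / 2^n.
    return Fraction(k * (k - 1), 2) / Fraction(2 ** bits)

# A huge archive deduplicated into 2^32 (~4 billion) fragments:
p = birthday_collision_bound(2 ** 32, 160)
# p is on the order of 2^-97: negligible for an ideal 160-bit hash.
```

This is why 160 bits is plenty against accidental collisions; the whole argument for a cryptographic hash is about deliberate ones.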
And I am embarrassed to say I did not know about PBKDF2 or scrypt, although I'm doing the same thing. I got the idea from crypt() but wanted to avoid the 56 bit password size limit.
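For reference, PBKDF2 is already in many standard libraries; a minimal Python example (the iteration count and salt here are illustrative, not a recommendation):

```python
import hashlib

# Derive a 160-bit key from a password and salt with 100,000 iterations
# of HMAC-SHA1. The iteration count is the cost knob, playing the same
# role as the 10 MB of hashing in zpaq's home-grown scheme.
key = hashlib.pbkdf2_hmac('sha1', b'password', b'example-salt-123',
                          100_000, dklen=20)
```

Unlike an ad-hoc construction, PBKDF2's security reduction to the underlying HMAC has been analyzed and standardized (RFC 2898).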
Encryption has to be fast. The default compression mode for zpaq is faster than zip for compression and about the same as zip for decompression. With RC4 it is 10% slower. I initially tried using SHA-1 as a keystream generator but it is 10 times slower than RC4. Logically I would use a well tested encryption algorithm like AES in CTR mode. I need to test its speed first.
Of course I was reluctant to add encryption at all, because it is so hard to get it right. The problem is that a separate program would have to decrypt the whole archive, then update, and encrypt it again. This would be too slow for incremental updates of huge archives. Having the encryption integrated means that only the index needs to be decrypted and only the appended data needs to be encrypted.
Matt, I can assure you I won't be screwing myself by deliberately corrupting my data and if I did, I would find simpler methods.
Not everybody can say it, but many do. For them crypto is a waste of time. 10% is not a terrible waste, but big enough to be uncomfortable.
As to size, I did the math at some point, but I've already forgotten the results; I don't claim 256 bits is needed.
As to PBKDF2, it's a well tested standard, which makes it more trustworthy than ad-hoc solutions even if the solutions are actually OK. I believe you'd do best by switching either to it or to scrypt (which is less tested than PBKDF2, but still much more than your scheme and is technically better than either).
"Logically I would use a well tested encryption algorithm like AES in CTR mode. I need to test its speed first."
Then why do you use RC4? I think at the time you started, it was already broken.
Because RC4 is fast and just a few lines of code, and not broken the way I am using it. I do like scrypt because it requires a lot of memory, but key strengthening is not a critical part of the encryption. I don't want to break compatibility with earlier versions without a good reason. I did break it once when I found a bug in my code that had nothing to do with my choice of crypto primitives.
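RC4-drop-768 really is just a few lines; here is a textbook sketch (standard RC4 key scheduling and PRGA, with the first 768 keystream bytes discarded, matching the drop count mentioned earlier in the thread):

```python
def rc4_drop768_keystream(key: bytes, n: int) -> bytes:
    # Key scheduling algorithm (KSA).
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # PRGA, discarding the first 768 output bytes, which carry the
    # best-known statistical biases in RC4's keystream.
    out = bytearray()
    i = j = 0
    for k in range(768 + n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        if k >= 768:
            out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)
```

Encryption and decryption are both just XOR of the data with this keystream.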
But I disagree that you don't need crypto hashes for deduplication. You can no longer claim 2^-n probability of collision for an n bit hash. Li and Vitanyi show in http://oai.cwi.nl/oai/asset/1991/1991A.pdf that average case complexity over a universal distribution of inputs is the same as worst case complexity for all algorithms. Their proof is for time and space complexity, but you can apply a similar argument to finding hash collisions. Collision probability is higher than 2^-n if the input is compressible.
Matt, I encourage you to read something like "Applied Cryptography". Another good idea is to use the same library as WinZip - it's free and OSS.
I have read some books on crypto, which is why I was reluctant to add it to zpaq. I may add AES in CTR mode if it is fast enough.
However, you could have a mode for the truly paranoid, where you do AES in CTR mode, then a BWTS pass, then another pass of AES in CTR mode, followed by an UNBWTS pass, followed by a third pass of AES in CTR mode, using a different key for each of the 3 encryption passes. Of course it would be slow, but it would only be for those who have my amount of faith in the Chinese. That is, until they get large quantum computers.
Nobody knows which crypto is secure because you can't prove security. If P = NP then all crypto is broken, but nobody has been able to prove P != NP. But I doubt that anyone has secretly broken AES. AES was developed by an open, international process. I would trust it more than something I developed myself. Of course you can never refute conspiracy theories.
Quantum computers cannot break symmetric key crypto by any known algorithm. They can break all public key crypto, including RSA, Diffie-Hellman key exchange, and elliptic curves, using Shor's algorithm. Currently no such computer is known to exist. The largest quantum computers are made by D-Wave with 512 qubits, but they are not capable of running Shor's algorithm. Their architecture can only solve discrete optimization problems, not the quantum Fourier transform.
I think that if AES could be broken, then so could 3 pass AES. You propose AES + BWTS + AES + UNBWTS + AES. Since the input to BWTS and UNBWTS is effectively random, it would leave the data size (almost) unchanged, and you could replace this with any non-crypto permutation or just take it out with no change in security because AES is also a permutation.
I think that the main sources of possible collision problems for deduplication are the collision demos from the corresponding cryptography articles, while an accidental match is still unlikely.
Thus, my workaround is to use a slightly modified version of md5/sha1 (e.g., add/remove a round or change a constant).
I'm quite certain that md5 with an extra round won't be any less secure, but it's much less likely that you'll encounter a collision for it.
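A variant with the same intent that doesn't touch the compression function's internals: prefix a fixed private constant. Published MD5 collision pairs are computed against MD5's standard initial state, so with a different internal state in front of the colliding blocks they no longer collide. The salt value here is of course just an illustration:

```python
import hashlib

PRIVATE_SALT = b'my-archiver-v1'  # hypothetical per-application constant

def dedup_hash(fragment: bytes) -> bytes:
    # Published MD5 collision demos assume hashing starts from MD5's
    # standard IV; a fixed prefix changes the chaining state that the
    # colliding blocks see, so those demo pairs no longer collide here.
    return hashlib.md5(PRIVATE_SALT + fragment).digest()
```

This defeats the public demos, though like the extra-round idea it does nothing against an attacker who targets your specific variant.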
Public key crypto such as RSA relies on zero knowledge, meaning you can tell immediately if you have the correct key.
The fact is, take any file several megabytes long. Encrypt it in the forward direction with one key and then in the reverse direction with another key, using CBC mode both times. Modify one byte near the middle of the file, then try to decrypt it with the two keys. You get back the whole file, exactly the same as the original except for a few bytes of corrupted data near the center. I tested this years ago. So I am sure that if you did three forward passes with AES in CTR mode and changed the middle of the file, you would have the same problem.
The reason is that the AES block is so small and the data is not really spread through the whole file. It might appear to be, since in CBC mode the whole file changes. But the unicity distance is extremely short, and with any small segment of only a few bytes of ciphertext and a matching piece of plaintext, the attacker has all the information needed for a break. BWTS and UNBWTS spread the data throughout the file, making for a much larger unicity distance. So to say AES and BWTS are both permutations is misleading, since one is applied to a very small number of bytes while the other is a permutation of the whole file; note that BWTS can easily handle 1 gigabyte files as a single block.
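The locality claim is easy to demonstrate. Below is a toy 16-byte Feistel cipher (a stand-in for AES, emphatically not a real cipher) run in CBC mode; flipping one ciphertext byte corrupts exactly two plaintext blocks on decryption and leaves the rest of the file intact:

```python
import hashlib

BS = 16  # block size in bytes

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _f(key, r, half):
    # Toy round function: hash of key, round number, and half-block.
    return hashlib.sha256(key + bytes([r]) + half).digest()[:8]

def enc_block(key, b):
    L, R = b[:8], b[8:]
    for r in range(4):                 # 4 Feistel rounds
        L, R = R, _xor(L, _f(key, r, R))
    return L + R

def dec_block(key, b):
    L, R = b[:8], b[8:]
    for r in reversed(range(4)):       # rounds undone in reverse order
        L, R = _xor(R, _f(key, r, L)), L
    return L + R

def cbc_encrypt(key, iv, data):
    out, prev = bytearray(), iv
    for i in range(0, len(data), BS):
        c = enc_block(key, _xor(data[i:i+BS], prev))
        out += c
        prev = c
    return bytes(out)

def cbc_decrypt(key, iv, data):
    out, prev = bytearray(), iv
    for i in range(0, len(data), BS):
        c = data[i:i+BS]
        out += _xor(dec_block(key, c), prev)
        prev = c
    return bytes(out)
```

Encrypt 64 blocks, flip one byte in ciphertext block 32, decrypt: blocks 32 and 33 differ from the original and the other 62 come back untouched, which is exactly the locality being argued about here.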
I have collision demos for MD5. I don't know of any for SHA-1. But if it becomes a possibility then I should probably use SHA-3 instead of something I design myself. Adding another round to MD5 may not make it any stronger, and if it ever became popular then collisions for it would start to appear too.
"Quantum computers cannot break symmetric key crypto; that is based on the fact that there might be more than one key that leads to the same ciphertext."
No, that has nothing to do with it. Shor's algorithm solves discrete logarithms, which are central to all public key cryptography, but not to any symmetric key algorithms like AES.
Anyway, I'm only considering worst case security against chosen plaintext attacks. For fast random access I use XOR with a keystream such as RC4 or (I would consider) AES in CTR mode. Knowing the plaintext, the problem is only to find the key from the keystream or predict more of the keystream. What you do with the plaintext is irrelevant. Maybe you could apply BWTS to the keystream, but I don't see the point. You have to assume that the attacker knows everything but the key, and can just do UNBWTS to recover the keystream.
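The random-access property of a CTR-style keystream can be sketched as follows. Python's standard library has no AES, so SHA-256 stands in as the PRF here; this illustrates the access pattern, not the zpaq format:

```python
import hashlib

BLOCK = 32  # SHA-256 output size in bytes

def keystream_block(key: bytes, counter: int) -> bytes:
    # CTR idea: keystream block i depends only on (key, i), so any
    # block can be generated without computing the preceding ones.
    return hashlib.sha256(key + counter.to_bytes(16, 'big')).digest()

def xor_at(key: bytes, offset: int, data: bytes) -> bytes:
    # Encrypt or decrypt `data` located at byte `offset` in the stream;
    # XOR makes the operation its own inverse.
    out = bytearray()
    for i, b in enumerate(data):
        pos = offset + i
        ks = keystream_block(key, pos // BLOCK)
        out.append(b ^ ks[pos % BLOCK])
    return bytes(out)
```

Because each keystream block is independent, a middle slice of a large archive can be decrypted knowing only its byte offset, which is what incremental updates need.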
" A conventional computer encodes a bit, a 1 or 0, as the presence or absence of a voltage. A quantum computer, in contrast, uses a quantum-physical property of a particle to encode a 1 or 0, for example the spin axis of an electron (which is either "up" or "down" but nothing in between) or the polarization of a photon. In either case, calculations will be performed on a group of bits, say 32 bits, at one time. 32 bits can be arranged in a total of 4,294,967,296 different ways, allowing them to encode that number of different numbers. Suppose a conventional computer were to perform a calculation requiring a test of all possible values that can be represented by 32 bits. It would have to loop through every single value, one at a time."
or from http://www.springerlink.com/content/u4877618u916720g/
"Applying Quantum Search to a Known-Plaintext Attack on Two-Key Triple Encryption"
I can't read the article since it's only for elites like you and not me. Maybe you have access to such stuff.
A quick Google search finds the abstract. http://cs.anu.edu.au/iojs/index.php/...cle/view/11557
I don't want to spend $29.95 to be elite, but I guess they propose using Grover's algorithm to attack 3DES in O(2^(n/2)) = O(2^56) time. A more informative paper may be http://www.kurims.kyoto-u.ac.jp/~kyo...df/1166-29.pdf
Quantum computers don't break symmetric key cryptography, as far as anyone knows, but you do have to double your key sizes.
One common definition is that a cryptosystem is broken when you can search for the key in less time than a random blind search through the key space.
The paper implies a tremendous speedup in getting the key. Suppose there is a keyspace size, say 9 bits, at which a random blind search matches this quantum method. If you increase the keyspace to 17 bits, the blind search takes 256 times longer than it did at the point where the two methods matched, while the quantum computer takes only 16 times as long as it did for 9 bits. So at 17 bits of keyspace the conventional computer takes 16 times longer than the quantum one, which means yes, it's a break for the plaintext attack on a symmetric cipher.
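The arithmetic above checks out: blind search scales as 2^n while Grover-style search scales as 2^(n/2), so the gap grows as the square root of the keyspace increase:

```python
import math

# Increasing the keyspace from 9 bits to 17 bits (8 extra bits):
blind_factor = 2 ** (17 - 9)               # classical brute force: 256x slower
grover_factor = math.isqrt(2 ** (17 - 9))  # sqrt scaling: only 16x slower
```

This is the same square-root relationship behind the usual advice, repeated later in the thread, to double symmetric key sizes against quantum search.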
Last edited by biject.bwts; 6th November 2012 at 22:15. Reason: fixed errors
Yes, that is a common definition but not a very useful one. For example, RSA is not considered broken even though you need several hundred bits of key because there are factoring algorithms faster than trial division. It is more useful to specify the strength and declare it broken if a faster attack is found. For example, declare AES-256 has 128 bits of security against Grover's algorithm, which is still plenty.
To many, AES is as close as you can get to provably secure. But looking at Shannon's unicity distance concerns, it's not very secure, especially against a plaintext attack. Using something like BWTS and UNBWTS between three passes of whatever version of AES one thinks is the latest provably secure crypto can only help, by increasing the unicity distance.
Heck, let me throw all the cards on the table. You really don't even want the same text to be encrypted the same way each time, even if you use the same key. You could add random amounts of data in each pass, using code and principles from these old routines of mine.
These routines, written in the 90's and last updated in 2000, predate BWTS, so the exes will not run; you need to recompile them.
I invented a way to bijectively add the index to a normal BWT in my search for BWTS.
The concept of bijectively adding an index to a data block was not well understood by the masses. Many thought to just add a universal number in front or back of the block, missing the concept that any file could be thought of as either a file plus index or a file alone. Since the number can only be in the range 0 to N-1, it was bijective. Anyway, if you understand these old routines and methods, you can easily add random data and such, making even a plaintext attack harder.
"In order to be sure you need research, and without knowing the details of zpaq's encryption architecture, you may only believe. Not knowing about an attack possibility isn't the same as knowing that an attack is impossible."
Nope. In my case, no research is needed. I know that any sensible function will be OK, and that took much less time than using SHA1 to hash a gigabyte.
Knowing all the users, as well as knowing myself, is quite sufficient to correctly estimate the attack probability. And in my case - as in the case of almost any individual using the code only for their own needs (I guess anyone but those who play with collisions) - the chance is 0.