# Thread: Lightweight compression with encryption based on Asymmetric Numeral Systems

1. ## Lightweight compression with encryption based on Asymmetric Numeral Systems

2. Thanks, Hamid.
tANS of a given size has similar (e.g. HW) cost as S-block (permutation), but mixes information in unimaginable more complex way: in chunks of bits for which even lengths are unpredictable, depend not only on the processed symbol but also the internal state - it would be a waste not to use it also in cryptographic applications.
It is a good exercise to try to attack even a single pure tANS of reasonable parameters ... but sure in practice, at least at the beginning, it should be used together with a trusted encryption - if, as usual, you need to both compress and encrypt the data - put some encryption in tANS (nearly for free) and reduce the number of e.g. AES cycles from 10 to 5.
For e.g. remote sensors, a cheap delta filter + tANS can do sufficient compression+encryption for many applications (e.g. a thermometer).

The question is how to make the cryptographic society to really look at it?

3. I would not recommend this. Encrypting twice with two weak methods is still weak. Stick with well tested encryption methods like normal AES and don't try to create your own.

4. Matt,
in most IoT situations, like humidity sensor in flower pot, the "AES or none" philosophy leads in practice to none encryption - to maintain let say 5cent cost of the chip, for sensor which should gather (standard example): on average 1 bit/s for 10 years from 1 AAA battery.
Hence, there have been born large field of so called lightweight cryptography - which is cheap from both HW and energy perspectives:

Using tANS, we could have cheap compression+encryption, e.g. by just putting into the sensor:

s = val - prev_val; prev_val = val; // delta filter ... or use some other simple filter
nbBits = (x + ns[s]) >> r;
writeBits(x, nbBits); // write nbBits youngest bits in buffer
x = encodingTable[start[s] + (x >> nbBits)]

Where tables are chosen for a specific probability distribution and crypto-key, in many situations can be fixed in ROM of a given sensor. Coding tables are chosen using CSPRNG (outside) - even if someone would read the tables, he gets absolutely no information about the used cryptokey.
https://en.wikipedia.org/wiki/Crypto...mber_generator

Having cheap compression+encryption, producers could decide to use it for situations when encryption was not planned at all as being costly (like flower pot) - motivated by savings offered by compression: reduced required buffer and transmission cost.

5. If an adversary has the encoding tables, then what stops them from decoding the message?

The sci.crypt FAQ is over 20 years old, but 2.3 is still applicable. http://www.contrib.andrew.cmu.edu/~shadow/crypt.html

Specifically, write some code. When we break it, don't just keep patching it. Offer some theoretical reasons why your method is better (more secure, faster, etc) than existing methods.

I suggest reading a book on cryptography before you try to design your own algorithms.

6. The coding tables are unique for a given device and made such that even if someone would get the tables, he gets absolutely no information about the cryptokey.
One scheme is like in the diagram from the first post above: start with generating tANS symbol spread (determining table) for a given probability distribution, then perturb it (scrambler) using a (considered safe) CSPRNG seeded with hash of concatenated a few values:
- cryptokey,
- device ID number - tables are unique for this device, in basic situations (like flower pot) generated outside the device and just written in its (programmable) ROM,
- for more costly scenarios: also data block number - making the perturbation depends on the block number (getting coding table for one block gives no information about coding tables for other data blocks),
- salt for extra security - a random number written in the file (/message/block) header - to prevent adaptive plaintext type of attacks: when attacker uses a slightly modified messages - thanks to salt, every time he will get a completely different ciphertext.

Using CSPRNG means that a tiniest change of one of the above values leads to a completely different (uncorrelated) perturbation, and that getting the tables gives no information about these values (cryptokey).

Regarding safeness, there is a huge space of possibilities to choose the perturbation from (10^100 .. 10^1000), and evolution of tANS state (x) is chaotic: if the attacker lacks some information e.g. about the tables, he will quickly loose any information.
Specifically, the evolution of internal state while processing a symbol of probability p is
lg(x) -> ~ lg(x) + lg(1/p) modulo 1
leading to 3 sources of chaocity:
1) asymmetry - small change of the state takes us to a different symbol and so a different lg(1/p) shift,
2) ergodicity - lg(1/p) is usually irrational, making that even a single shift wants to cover the entire space of states,
3) diffusion - above rule is approximate - even knowing the probabilities, the attacker would loose information about the exact state.

For more discussion you can look to the arxiv paper, written with cryptography specialist: https://arxiv.org/pdf/1612.04662
And generally, please try to attack even pure tANS - you will see that it is extremely difficult. Its degenerated case (much simpler) is Huffman, for which we can read in Rivest et. al. paper: ”We find that a Huffman code can be surprisingly difficult to cryptanalyze”.
The main problem is lack of synchronization - the attacker has no idea how to divide the bit sequence into blocks corresponding to successive symbols. For Huffman their lengths depend on the symbol, for tANS it additionally depends on the hidden chaotic internal state - which determines both bits to produce and even length of the current bit block (nbBits).
And until getting trust, tANS can be used only to shift some security, like instead of using standard compression + 10 rounds of AES, use cryptANS + 5 rounds of AES.

7. For more discussion you can look to the arxiv paper, written with cryptography specialist: https://arxiv.org/pdf/1612.04662
it doesn't include any attacks, checking only statistic properties of algo. you will have the same properties with any good prng with 32-bit seed, though. ask your colleague what this work really proves and what it doesn't prove

8. It tests the basic statistical properties required for cryptosystem (balancing, avalanche, nonlinearity, diffusion and completeness).
Testing against attacks requires objective people being aware and actually trying to attack it - I encourage for years here and would gladly discuss.
Are there some standard attacks you could directly apply here?
Remember that this is a highly nonstandard situation: in standard cryptosystem the attacker knows from the beginning how to split the bit sequence into bit blocks. In contrast, for tANS the number of bits to use for given symbol (and their values) depend on the hidden state behaving in a chaotic way.

And I am definitely not claiming that pure tANS is already sufficiently safe for all purposes:
- it can be used for situations where encryption would usually just not be used because of HW and energy cost, like humidity senor in flower pot - just as a free bonus for including compression to get saving in hardware (buffer) and transmission,
- to shift some encryption for savings, e.g. instead of standard compression + 10 rounds of AES, use cryptANS + 5 rounds of AES.
However, deciding the details requires real discussion - especially with the cryptographic society.

There are many papers about including encryption in Huffman, BWT, LZ, or AC, e.g.: https://www.google.pl/search?q=arith...ure+encryption
tANS has much better behavior for including encryption than
- Huffman, which is much simpler degenerated case of tANS (symbols have fixed bit sequences) or
- AC, which is relatively costly and only allows to modulate probabilities (tANS has additional huge freedom of choosing exact symbol spread) and it's behavior is not chaotic - the final value determines the splits.
But again it's not about claiming that pure tANS is already safe, only to consider shifting some of security to it - the details obviously have to be discussed, analyzed.

9. It tests the basic statistical properties required for cryptosystem (balancing, avalanche, nonlinearity, diffusion and completeness).
as i already said, good prng with 32-bit seed wil have the same properties. so it "must have" for any good prng, but doesn't even prove that you have a good prng, not even saying about cprng, which should be checked with cryptoanalysis. i don't found "cryptoanalysis" section in the paper, though

Testing against attacks requires objective people being aware and actually trying to attack it - I encourage for years here and would gladly discuss.
do you tried to read papers related to AES contest? some algos were published, other cryptographers truied to perfrom known attacks on simplified algo and revealed how they can be breaked. this work wasn't done for your algo, and i doubt that top cryptographers will ever spend time on you algorithm (like you don't spend time on analyzing all those "absolutte compression" schemes published here or in comp.compression). your coauthor doesn't provided any attack analysis too. so all we know at this time is that your algo passes some prng tests. nothing more

compare that to xorshift algo - it generates prng at the speed about 10 GB/s per core, has excellent proved statistical properties and noone still published a way to restore seed / forecast next data from preceding ones. it looks superior in any proven aspect to ans, but noone tries to claim that it can be used as encryption algo. try to attack xorshift1024 considered as encryption algo. if you cannot do it, then this means that all your work doesn't cost much since it can be applied to any good prng and give the same conclusion - "we can't break this algo so we believe it's secure"

Are there some standard attacks you could directly apply here?
i'm not cryptographer, so i have no idea of these attacks. the real quesion is "which arracks were tried by authors of the paper [and what are their results]?", the answer is None

Remember that this is a highly nonstandard situation: in standard cryptosystem the attacker knows from the beginning how to split the bit sequence into bit blocks. In contrast, for tANS the number of bits to use for given symbol (and their values) depend on the hidden state behaving in a chaotic way.
yeah, people seeking for inventing compression algos from scratch also believes that unusual operations make their compressors stand out. you know that's not true. existing encryption algos stay out of such ideas exactly for the rreason you menationed - it makes harder to analyze the, but doesn't improve their encryption strength. like making encryption algo closed-source, it's a security by obscurity. it may increase cryptoanalysis costs, but once the cryptoanalysis done, such algorithm will be breaked like any with fixed-size blocks

- it can be used for situations where encryption would usually just not be used because of HW and energy cost, like humidity senor in flower pot - just as a free bonus for including compression to get saving in hardware (buffer) and transmission,
it's well-known debatable question. some may believe that using Ceasar cipher still improve security over plain text, other will argue that it will give false sense of security. anyway, xorshift outperfroms cryptANS for such cases

- to shift some encryption for savings, e.g. instead of standard compression + 10 rounds of AES, use cryptANS + 5 rounds of AES.
this is absolutely false. 5-round AES is breakable, so you actually propose to replace well-known AES with xorshift-like prng. if it ever had any meaning, it was be done much earlier

There are many papers about including encryption in Huffman, BWT, LZ, or AC, e.g.: https://www.google.pl/search?q=arith...ure+encryption
and people still use well-analyzed algorithms rather than ad-hoc solutions for a reason. using unproven encryption algo is just like repalcing lzma with random bit manipulation algo, hoping that large amount of bit operations will outperform carefully-thought algorithm

10. Uh, the "standard attack" can be done pretty easily here: http://www.cprover.org/cbmc/
Just write the decoding algo in C, and then add asserts that mean "decoded bytes can't match the known plaintext".

But in general I agree with Jarek on this - better to have some encryption, than not to have anything.
At least it would protect the device from children or something.

Because with Matt's approach we can say that app signing can be broken (there're known examples with botnets etc),
so it shouldn't be used at all - while ignoring the cost of this crack, which is \$100k or more.

11. Bulat,
Regarding CSPRNG, I though about using some generally considered as safe, like Mersenne twister or many others.
For pure tANS in flower-pot like remote sensor, it has to be used outside - once per device.
For standard compressors, it has to be used once per data frame: tANS usually uses 2048 states, making scrambler in 8-symbol blocks, we need 256 blocks * 3 bits for cyclic shift in every block = 96 bytes generated from CSPRNG (2^768 possibilities) ... per let say 30kB data frame - such cost of additional scrambler is practically negligible, even including dependence on the number of data frame.

I have tried to attack pure tANS and it seems extremely difficult, standard cryptoanalysis methods don't work as this is a very nonstandard situation (bit blocks of varying unknown length).
The only way I could find is trying to retrace the process from the initial state ... but there is not enough information, and there are simple ways to prevent such attacks, like just encrypting the initial state.
But sure, I am not an expert in cryptoanalysis - to get trust, many experienced people should try to attack it.
And again, I am not claiming that single pure tANS is completely safe, only that it has perfect properties (much better than Huffman and AC) to consider as one of tools for the common purpose of combined compression+encryption.

Shelwien,
Uh, the "standard attack" can be done pretty easily here: http://www.cprover.org/cbmc/
Just write the decoding algo in C, and then add asserts that mean "decoded bytes can't match the known plaintext".
This is entropy coder - it removes redundancy which could be used for attacks.
For flower-pot-like situation you observe encrypted tANS initial state and further bits for tANS decoder with symbol spread perturbed in an unknown way (e.g. 2^768 possibilities) - how exactly would like to read such ciphertext?

12. Of course there's no way to attack it if we have just the tANS stream and nothing else.
But if we precisely know the decoding algorithm and its output (the decoded data), and
just some parameters are missing (the whole state table or whatever), then we'd likely
be able to find these parameters with CBMC, if enough plaintext is available.
https://en.wikipedia.org/wiki/Boolea...bility_problem

13. Ok, assume you know both the transmitted ciphertext and its plaintext (it would require stealing e.g. the flower-pot to get the exact plaintext) - can you tell anything about e.g. the coding tables it uses?
Let say it uses the simplest encryption for the transmitted initial tANS decoder state, even just xor with a fixed mask (generated by the CSPRNG).
So you don't know the initial state, nor the coding tables (2^768 possibilites) - how would you like to get any information e.g. about the tables?
Even if you could, due to CSPRNG you still would get no information about the cryptokey - you only could get the tables for this e.g. flower pot ...

And this is the simplest flower-pot-like situation, now imagine there are additionally e.g. 5 AES rounds after it, instead of standard 10 rounds ...
Compression itself means reducing redundancy - getting closer to pure noise - removing the anchor for cryptoanalysis ... now add chaotic tANS behavior to it ...

14. If there's AES after it, then tANS can use static tables - it won't really change anything.

The problem with lightweight schemes (like crc, or stdlib rand, or rc/ANS) is that there would be simple correlations between
data bits and unknown parameter values. Cryptoalgorithms are specifically designed to deal with it, by mixing up all the bits.

And as I said, I actually agree that breakable encryption is better than no encryption at all.

15. Instead of reducing the number of AES rounds after tANS (how much?), the best would be to completely replace it with a cheap dedicated single layer ... but it seems a difficult process to choose a considered safe one ...

Regarding correlations, the CSPRNG completely removes any of these with the cryptokey.
There are also no correlations between different ciphertexts for a given table - a specific choice are just small perturbations of the coding tables, it changes state->symbol relation while decoding, and chaosity of state evolution removes any correlations.

16. Its impossible to have no correlations at all - you'd have them even with cryptohashes, just that they would involve most of hash bits.
And in tANS case, the number of bits in "equations" won't be that big, so it would be possible to enumerate all the combinations in memory
and build an inverse function. Sure, there would be still some bruteforce search involved, but probably not as much as you think (if any).
Variable-length codes are certainly annoying for cryptoanalysis, but much less annoying than SHA rounds or something.

17. If there remain some correlation after pure tANS, the question is if they can be used ... for flower-pot-like situations?

Regarding checking all possibilities - lack of the initial state makes that attacker starts with 2048 of them, for 8-symbol block scrambler, every symbol grows the number of possibilities to consider ~8 times ... the number of possibilities grows exponentially, some verification can give the final state, but it is thousands of symbols away ...

Honestly, attacking even pure tANS seems extremely tough (try yourself) - I think it's impossible for ciphertext-only like the flower-pot-like situations. Stronger attacks would require stealing the flower pot ... in which case someone might physically obtain the coding tables from ROM (not cryptokey) ... but there are simpler/cheaper ways like just planting a bug.

18. Well, here's what we'd need to extract a bit from a byte, by its index - http://ideone.com/r9GQUW (I mean the boolean function there).
For tANS we'd need 8x of something similar to extract a compressed data byte by its index, and that would be all of the actual logic -
most of other things are simply unknown variable bits.

And then the main algo is several iterations of this:
```
nbBits = (x + ns[s]) >> r;  // # of bits for renormalization
writeBits(x, nbBits);  // send youngest bits to bitstream
x = encodingTable[start[s] + (x >> nbBits)];```

Here we know "s" and the writebits result, and unknown are:
x: 8? bits
ns[256]: 3*256=768 bits, but only 3 bits involved here
start[s]: 8? bits here
encodingTable[]: 8? bits here
bit pointer: 10-15 bits

So 37-42 bits total involved in each step, with known code bits, which would reduce the uncertainty.
Sure, there're many variables, so it would require a few kbs of data to accumulate the dependencies.
But I don't see a problem with building a boolean system and then solving it with some existing software.

However, the cryptographic strength of this would be determined by these 42 bits (actually likely 32 or less).

19. EncodingTable is larger: for 2048 states it is 2048*11 bits, we need about 4kB for this number of states (states x: 11 bits). Could be halved to 1024 state, 2kB.
However, if it provides let say 2x compression, for 1 transmission/day and 1bit/s, it means 12kB -> 6kB buffer reduction, so still savings.
Plus the most important - energy savings for transmission.

Tables are determined by symbol spread: state -> symbol assignment: it is first chosen for a given probability distribution (as proportions of occurrences), than perturbed with scrambler using CSPRNG seeded with both cryptokey and device ID.
For decryption it's better to look from the point of view of decoding step:

```t = decodingTable[x];
x = t.newX + readBits(t.nbBits); //state transition
writeSymbol(t.symbol); //decoded symbol```

If you don't know the exact state or table, you don't know the produced symbol (decodingTable[x].symbol which is the symbol spread), nor the number of bits to use (decodingTable[x].nbBits, for symbol of probability p it is floor or ceiling of lg(1/p)) - you don't know how to split the bit sequence into bit blocks corresponding to successive symbols.
You could try all the possibilities, but their number grows exponentially and there is practically nothing to anchor to.

20. In a way, its similar to xor'ing the data with a 2k random string - you can't crack it with less than 2k data and without plaintext.
But it doesn't mean that its equivalent to 16kbit encryption :)

Sure, higher state precision would increase complexity, but only by that much.

> you don't know how to split the bit sequence into bit blocks corresponding to successive symbols

That's why I posted that ideone link - there's no need to split anything, it simply becomes a boolean function.
Which is also not of cryptographic quality, so there'd be some patterns.

And yes, I guess there's symmetry, so we could add equations based on decoding too - that would reduce the amount of necessary plaintext at least.

21. Oh tANS behavior is muuuch more complex than just xor'ing with a random sting - which is a linear transform (and symmetric - in contrast to entropy coding).
In contrast, here we have a very nonlinear chaotic transformation - the further behavior is strongly dependent on the entire past (state and position).
While xor'ing, there is only unknown bit to xor with. Here for a given bit, the attacker doesn't know to which symbol it corresponds to, on which position for this symbol, and from which state.

See example for 16 states and 3 size alphabet (in practice e.g. 2048 states and 256 size alphabet).
At the bottom there is a bit sequence and its splitting into bit blocks while decoding - strongly dependent on the history and tables:

Attachment 4788

22. Its more complex, yes, but same as xor string, the dependencies here are mostly local - the tANS step doesn't depend on all previous bits or anything like that,
but just state and bit pos.
For example, when symbol and state have given specific values, it would write the same specific bits to output, so finding some matching bitstrings
in the code and pairing them up with matching substrings of uncompressed input would already give us some anchor points for analysis.

In any case, we can try checking this easily enough - you just have to provide a simple tANS enc/dec C/C++ source (and input data for it, if necessary).
Then we can feed it to CBMC and see if it can compute the missing tables from the data.

23. Originally Posted by Shelwien
For example, when symbol and state have given specific values, it would write the same specific bits to output
Indeed, for the same symbol sequence and starting state, the produced sequence of bits will be the same (strongly dependent on the unknown coding table), but to use it you need a relatively long repeating sequences and there is only ~1/2048 probability that they will start with the same state - it needs relatively long data.
A simple practically cost-free protection (not perfect) is using also frame number in the seed of CSPRNG, enforcing independent tables every frame (e.g. 30kB).
Adding salt (random number written in header and also used in seed) makes that every time you get completely different tables - getting one of them gives absolutely zero information about other files.

But for many situations like flower pot, it seems sufficient to protect against ciphertext-only attacks, others would need stealing the pot (you require knowing plaintext).
Your attack, especially with tables changing every 30kB data frame, would require the strongest attack: adaptive chosen plaintext (free access to encoder) to really reconstruct the coding tables (only for this flower-pot, or a single 30kB data frame - what could you do with this information?).
And generally a simple additional encryption layer, like just 1 round of AES (a complex transformation in e.g. 64 bit blocks depending on key generated with the CSPRNG), would ultimately protect against such attacks.

24. Reminds me of Maraca...
No defended attacks = no strength. You need to get some cryptographers on it or else it's useless. Bob Jenkins found a way, you may try something similar.

25. So pure tANS is rather not supposed to be the entire cryptosytem, only one of the building blocks for compression+encryption situations - the details and additions for various applications (protection level) have yet to be established.
Sure, there are needed cryptographers to polish these details by trying to attack various possibilities, but it's extremely difficult to make them even seriously look at it.
However, I believe it is only a matter of time now (e.g. there are dozens of papers about secure AC, which is terrible at it comparing with tANS) ... finally someone will just massively use e.g. pure tANS for compression+encryption in flower-pot like remote sensors which usually have no encryption, getting a focus from the cryptographic society ...

26. Actually I think that arithmetic coding should be cryptographically stronger than ANS (and especially tANS) - simply because there're more bits in coder state.

27. Ok, let's compare them:
- the AC renormalization bits directly return position of symbol in the probability distribution. The final value directly returns sequence of divisions - it's behavior is very predictable. In contrast, ANS evolution is chaotic (three sources: asymmetry, ergodicity and diffusion),
- cryptokey-based perturbation of AC is just modulation of probabilities. In tANS, beside the probability distribution, there is still huge (>10^100) freedom to choose the details (symbol spread),
- tANS is cheaper from hardware and energy perspectives (no multiplication).

If you believe that the number of states is so essential,
1) in tANS it can artificially increased - I thought about it in 2009 paper, also for including error correction (add a forbidden symbol and rescale the remaining ones, it is not used while encoding, its appearance while decoding allows to localize errors).
For example, take some additional register and in every step switch its youngest bit with the youngest bit of state x, than make cyclic shift of this register.
2) You could use rANS instead - needs multiplication, but is cheaper and much more chaotic than AC.
However, like AC it has smaller than tANS freedom to perturb the coding process (basing on cryptokey).
3) One could make a hybrid to repair this lack of freedom - divide the natural numbers into identical ranges like in rANS, but inside these ranges don't just divide into subranges like in standard rANS, but use a more complex tANS-like symbol spread ... in analogy to Fabian's alias rANS.

The only advantage of AC is forward decoding, but for remote sensor you just encode forward and decode backward.

28. I think the issue with using compression to encrypt is that the encryption is only as good as the model. The goal of encryption AFAIK is to produce bits that look random. With a less than perfect model, you will not get every bit sequence with equal probability.

PS. WB Shelwien!

29. Thanks, though I've been around for half of the year at least.

And unfortunately the model won't help in this case - its just a generator of input data for entropy coder,
while the equivalent of encryption key is the permutation of tANS lookup tables.
You are correct that the model can significantly improve the strength of such encryption (by randomly rearranging symbol intervals or something),
but it would be normally still weaker than real encryption, and also would add complexity to the scheme.

Afaik, the original idea was that tANS adds an option to ("lightly") encrypt data without adding any overhead -
simply by lookup table permutations.

30. Originally Posted by Shelwien
Thanks, though I've been around for half of the year at least.

And unfortunately the model won't help in this case - its just a generator of input data for entropy coder,
while the equivalent of encryption key is the permutation of tANS lookup tables.
You are correct that the model can significantly improve the strength of such encryption (by randomly rearranging symbol intervals or something),
but it would be normally still weaker than real encryption, and also would add complexity to the scheme.

Afaik, the original idea was that tANS adds an option to ("lightly") encrypt data without adding any overhead -
simply by lookup table permutations.
That's correct. I'm trying to shed light on the weaknesses (Jarek has been talking about this for years).

When you have a stretch of data that doesn't compress well, it makes the output less random.

Suppose tANS is being used with a static model and you know that 'x' is a rare symbol. You also know that the data starts with "xxxxxxxxxxx". tANS is going to keep hitting the same state(s) repeatedly, so you can start to work out which state(s) represent 'x'.

Page 1 of 2 12 Last

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•