# Thread: Compression via Number Factorisation

## 1. Compression via Number Factorisation

Been thinking about this lately. A file can be represented as a single large number (as with Base64), right? With that in mind, I have two ideas that might be used for compression.

1) Say a large number X is made up of smaller numbers x1, x2, x3, …, xn
e.g. X = 2510202550100, where x1 = 2, x2 = 5, x3 = 10, x4 = 20, x5 = 25, x6 = 50, x7 = 100

Now, all of those smaller numbers are factors of 100, so X may then simply be represented as 100. Of course, issues arise when the smaller numbers are not in ascending/descending order, or when not all of the factors of 100 are included. But other than that, what do you guys think of this idea?
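The idea can be sketched in a few lines of Python (the helper name here is hypothetical, purely for illustration). It also makes the stated catch visible: 100 has divisors, such as 1 and 4, that are not in the example list, so the decoder would need to know exactly which divisors were present and in what order.

```python
# A minimal sketch of idea 1 (the helper name is hypothetical):
# if every number in the list divides m, represent the whole list by m alone.
def common_multiple_representation(xs, m):
    """Return m if every x in xs divides m, else None."""
    return m if all(m % x == 0 for x in xs) else None

print(common_multiple_representation([2, 5, 10, 20, 25, 50, 100], 100))  # 100
print(common_multiple_representation([2, 5, 7], 100))                    # None
```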

2) Say a number Y is a perfect power of another number.
e.g. Y = 10460353203; since Y = 3^21, Y could simply be represented as 3^21, or as the pair (3, 21).
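To make idea 2 concrete, here is a small sketch (the function is hypothetical, not an existing library call) that searches for such a (base, exponent) pair by trying each candidate exponent:

```python
# Sketch of idea 2: represent Y as (base, exponent) when Y is a perfect power.
def as_perfect_power(y):
    """Return (b, k) with b**k == y and k as large as possible, or None."""
    if y < 2:
        return None
    best = None
    k = 2
    while 2 ** k <= y:          # the exponent can be at most log2(y)
        b = round(y ** (1.0 / k))
        # float roots are approximate, so verify neighbours with exact ints
        for cand in (b - 1, b, b + 1):
            if cand >= 2 and cand ** k == y:
                best = (cand, k)
        k += 1
    return best

print(as_perfect_power(10460353203))  # the thread's example: (3, 21)
```

Note the float k-th root is only a starting guess; the exact integer check is what decides, which keeps the sketch correct for numbers of this size.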

The main issue with both of these ideas is that there is no known algorithm that can factorise an arbitrary number almost instantly, so factorising a huge number can take a huge amount of time. Now, I've been thinking about this and I came up with another idea. Until someone comes up with a super-fast factorising algorithm, why not use a predefined list of factorisations that a compression algorithm can look up? That way the factors are already known, so all the algorithm needs to do is look through the list and choose the appropriate factors.
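The "predefined list" can be sketched with a standard smallest-prime-factor sieve (the names below are mine, not from any library). The catch is that the table must cover every number you might meet, which is fine for small numbers but hopeless for file-sized ones, since a file of n bytes corresponds to a number on the order of 256^n.

```python
# Sketch of the "predefined list" idea: precompute factorisations once,
# then look them up instead of factorising at compression time.
def build_factor_table(limit):
    """Map each n in 2..limit to its smallest prime factor (a sieve)."""
    spf = list(range(limit + 1))
    for p in range(2, int(limit ** 0.5) + 1):
        if spf[p] == p:  # p is prime
            for m in range(p * p, limit + 1, p):
                if spf[m] == m:
                    spf[m] = p
    return spf

def factorise(n, spf):
    """Read the prime factors of n out of the precomputed table."""
    factors = []
    while n > 1:
        factors.append(spf[n])
        n //= spf[n]
    return factors

spf = build_factor_table(10_000)
print(factorise(9240, spf))  # [2, 2, 2, 3, 5, 7, 11]
```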

So, what do you guys think? Is it doable using current processing technologies?

I'm busy doing some informal research to see if a quick factorising algorithm exists. I like to think of it as a fun puzzle that I must solve in my lifetime. I'm strange in that regard - I find unsolved physics/maths problems and try to solve them. If I can solve at least one of those then I will have left my mark on humanity.

## 2.

## 3.

The problem is this: it's far better for the result of a general file compressor to be a single number than a set of more than one number.

The reason is this: you need to represent the result as a string of bits, and a string of more than one number needs to be written as a list of universal numbers. See

http://en.wikipedia.org/wiki/Univers...a_compression)

The limit of this is the binary coding. Note that binary coding is not a universal numbering system for multiple numbers. Note we are looking at the numbers from 1 to N where N gets very large.
Notice it starts like this:
0 is 1
1 is 2
00 is 3
01 is 4
10 is 5
11 is 6
000 is 7

To create the string, you add 1 to the number, look at its binary representation, and drop the leading 1.
example:

9: add 1 to get ten, which is 1010 in binary; drop the always-present leading 1 and you get 010, which is the coding
needed. If you can reduce compression to a single number, you can always bijectively transform it
to bytes or whatever, no problem.
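The "add 1, drop the leading 1" bijection described above fits in two lines of Python (a sketch; the function names are mine):

```python
# Bijection between positive integers and bit strings:
# add 1, write in binary, drop the always-present leading 1.
def int_to_code(n):
    return bin(n + 1)[3:]  # bin() yields '0b1...'; skip '0b' and the leading 1

def code_to_int(s):
    return int('1' + s, 2) - 1

print(int_to_code(9))      # '010', the worked example above
print(code_to_int('010'))  # 9
```

Running `int_to_code` on 1, 2, 3, … reproduces the table above: '0', '1', '00', '01', and so on.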

Now let's look at what happens if you are forced to use exactly two numbers for all compression, and somehow
you know that the product of the two represents the file you are compressing. Assume that you get a
2 and a 3; well, that is 6 items, so you could have used the codes 0 through 11 for the items, since those cover all
6 values. Instead you need 1 to stand for 2 and 00 to stand for 3; that's already 3 bits, while the
one-number value is 2 bits max. Worse yet, a concatenation like 100 has no internal divisions, so one does
not know how to decode it. When you need the string to hold more than one number, you have to
use a universal code in which each number's end stands on its own.
example using Fibonacci coding, but the same applies to Elias codes or whatever.
6 is 10011 for 5 bits
2 is 011 for 3 bits
3 is 0011 for 4 bits
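The three code lengths quoted above can be checked with a short Fibonacci (Zeckendorf) coding sketch:

```python
# A sketch of Fibonacci coding, to check the code lengths quoted above.
def fib_code(n):
    """Fibonacci (Zeckendorf) code of n >= 1 as a bit string."""
    fibs = [1, 2]                      # Fibonacci numbers used for coding
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    while fibs[-1] > n:                # drop terms too large for this n
        fibs.pop()
    bits = ['0'] * len(fibs)
    for i in range(len(fibs) - 1, -1, -1):  # greedy, largest term first
        if fibs[i] <= n:
            bits[i] = '1'
            n -= fibs[i]
    return ''.join(bits) + '1'         # trailing 1 delimits the codeword

for n in (6, 2, 3):
    print(n, fib_code(n))  # 6 -> 10011, 2 -> 011, 3 -> 0011
```

Because every codeword ends in the only place "11" can occur, a decoder can find each number's end on its own, which is exactly the self-delimiting property discussed above.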

1 1 in the two-number state means item 1
1 2 above means 2
1 3 above means 3
2 1 above means 4
2 2 above means 5
2 3 above means 6

So converting to the other universal number system makes the length grow even more. In short: stick with one number when possible.

I used to argue with people for years on other compression sites. Some guy says he has the best compression possible, and then states that the first number is the number of ones, the next is the number of zeroes, and the third is the number representing, in smallest form, the index of the permutation. Well, that can't beat simple bijective arithmetic coding, since that uses only a single number. But it seems most people have trouble grasping this. I hope this helps.

## 4.

It's called the counting argument: http://www.mattmahoney.net/dc/dce.html#Section_11

## 5.

Besides the counting argument, there is another simple reason why this won't work. A number X takes log(X) bits to write. If you factor X = x1 x2 x3..., then you need log(x1) + log(x2) + log(x3) + ... = log(x1 x2 x3...) = log(X) bits to write all the factors, so there is no savings.

## 6.

> **Originally Posted by Matt Mahoney**
> Besides the counting argument, there is another simple reason why this won't work. A number X takes log(X) bits to write. If you factor X = x1 x2 x3..., then you need log(x1) + log(x2) + log(x3) + ... = log(x1 x2 x3...) = log(X) bits to write all the factors, so there is no savings.
Matt, yes, the math equation above is correct. However, it breaks down when you write a series of numbers, even if they are combined arithmetically using fractional bits.

The number of bits used for writing X1, X2, X3 is in general longer than the number of bits used for writing X1*X2*X3, because of the need to know where log(X1) ends and log(X2) starts. So the math equation above is only valid when writing a single number, NOT A SERIES OF NUMBERS.
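The log identity itself is easy to check numerically; this quick sketch (standard library only) confirms the bit counts match before any delimiting overhead is paid:

```python
# Check: writing the factors costs the same number of bits as writing the
# product -- before any cost of marking where one factor ends.
import math

factors = [3, 7, 11, 13]
product = math.prod(factors)                      # 3003
bits_product = math.log2(product)
bits_factors = sum(math.log2(x) for x in factors)
print(round(bits_product, 6), round(bits_factors, 6))  # equal (to rounding)
```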

However, for a series of numbers, say 3, 4, 5, where it's known that the three items represent three finite sets, the end is built in. Say 3 represents 3 of 8, 4 represents 4 of 16, and 5 represents 5 of 8; then the length is the same, since the ends and starts are predetermined. Yes, in this case it's a whole number of bits, but fractional is almost as easy. I think my starting statements in this thread are still the best example. But that is just how my mind sees it.

## 7.

log(1) = 0, so when writing a series of numbers, log(x + 1) bits are used, where x can be 1 to infinity and log is base 2.

thus log(X1 + 1) + log(X2 + 1) + ... = log((X1 + 1)(X2 + 1)...)
and log(1 + X1*X2*...) is less than log((X1 + 1)(X2 + 1)...)

## 8.

To make the codes self-delimiting you actually need log x + log log x + log log log x + ... to code x, the length of x, the length of the length of x, and so on. Or, of course, you could use some different probability model to code them. You could also argue that to code a list of factors you can sort them, discard composite numbers, and delta-code the remainder to save some space, but ultimately the counting argument still wins.

## 9.

Hold on. You guys are talking in Greek. All right, let me see if I understand this.

For my first proposal (i.e. representing a list of factors by their common multiple), are you saying, for example, that the current binary system has a limitation that ensures that the common multiple 100 would actually contain more 1s and 0s than its string of factors 2, 5, 10, 20, 25, 50? And all of that maths you're throwing at me is aimed at proving this?

Also, just scanning the above posts I picked out something:

"Besides the counting argument, there is another simple reason why this won't work. A number X takes log(X) bits to write. If you factor X = x1 x2 x3..., then you need log(x1) + log(x2) + log(x3) + ... = log(x1 x2 x3...) = log(X) bits to write all the factors, so there is no savings."

This only applies to factors which are multiplied together to form the number X. However, I'm referring to a common multiple, so X does not equal x1 x2 x3 ... xn; rather, X is simply a multiple of x1, and x2, and x3, ..., and xn. What you are saying is that 2*3*4 = 24, so the string 2,3,4 has the same number of bits as the number 24, which may be true; but what I'm saying is that 24 also has the factors 1, 6, 8, 12, and 24, so all of those factors could also be represented by 24.

I don't know. Currently there isn't any algorithm that can predict exactly what factors a number has, nor, I believe, is there one that shows how many factors a number has. Is there a theorem that predicts the maximum number of factors a number can have? As such, isn't it possible that there could be a number (or numbers) with so many factors that representing the number alone actually requires fewer bits than representing its string of factors? Or is the above maths aimed at rejecting this possibility as well?
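As a quick experiment on that question (a brute-force sketch, only practical for small n), compare the bits in a number against the bits in its full divisor list:

```python
# Brute-force divisor list, then compare raw bit lengths.
def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

n = 24
divs = divisors(n)
print(divs)                               # [1, 2, 3, 4, 6, 8, 12, 24]
print(n.bit_length())                     # 5
print(sum(d.bit_length() for d in divs))  # 24
```

So 24 alone is far cheaper than its divisor list, but only because the decoder is assumed to know the list is exactly "all divisors of n in ascending order". For any other list, n alone is ambiguous, and that restriction is where the counting argument discussed in this thread bites.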

You guys may be right, but how does this apply to my second proposal (i.e. representing a large number as a smaller number raised to a power)?

Anyway, I'll give that link (http://www.mattmahoney.net/dc/dce.html) a read. Maybe it will clarify things for me.

Now, until I can read that link, can someone please tell me in plain English what this "counting argument" is?

## 10.

yes, we are talking greek because all authors of such proposals are "writers" rather than "readers". the counting argument is 5 lines long and if you understand math it's all you need to realize why all such schemes can't work

## 11.

> **Originally Posted by Bulat Ziganshin**
> yes, we are talking greek because all authors of such proposals are "writers" rather than "readers". the counting argument is 5 lines long and if you understand math it's all you need to realize why all such schemes can't work
I agree the counting argument can be used to totally counter the forms of compression the first author of the thread was attempting. However, I am not sure that those who try what he is doing see how it applies to what they are doing. To me, the counting argument works best when you are trying to show that not all files can be compressed smaller. But the author is not saying that, so I am attempting to show by simple means why his methods fail. I have often seen articles by people who understand the basic counting argument, yet they go on to propose what they claim is optimal compression. They usually have several fields where numbers are used that can be of any size, at least for the first few. Example:
Someone who knows the counting theorem proposes to compress an arbitrary string of ones and zeroes to a shorter string of ones and zeroes. They might say: use 3 numbers, the first being the number of zeroes, the second being the number of ones, and the last
being the number representing the combination, which is a limited number based on the number of ones and zeroes. This method
is claimed by the person to be optimal. And the person realizes that some strings compress smaller while others compress longer.
But this is false: it can't be optimal, and if the person understood the counting theorem as well as some of the experts at this site, they would see that. But stating the counting theorem over and over as proof is not going to help them see why their method is wrong.

For the example above you can get an optimal compression of strings to strings using those 3 numbers, but not by encoding the 3 numbers separately in the string. And of course, to be optimal when compressing all strings to all strings, it has to be bijective.

## 12.

The counting argument is like this. There is no algorithm that will compress all 1000-byte files to 999 bytes or less, because there are 256^1000 possible inputs but only 256^999 different ways to encode them. If two inputs produce the same output, then you can't know which one to decompress to, no matter what algorithm you use.
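The pigeonhole count behind the counting argument can be verified directly; a sketch:

```python
# There are more 1000-byte files than there are distinct outputs of
# 0..999 bytes, so any compressor mapping the former into the latter
# must send two inputs to the same output.
inputs = 256 ** 1000
outputs = sum(256 ** k for k in range(1000))  # all files of 0..999 bytes
print(outputs < inputs)  # True
```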

People who believe in universal compression seem to be universally unable to understand this argument.

## 13.

> **Originally Posted by biject.bwts**
> I agree the counting argument can be used to totally counter the forms of compression the first author of the thread was attempting. However I am not sure that those that try what he is doing see how it applies to what they are doing.
The counting argument surely does *not* say that there are no sources you could compress this way. It just says that an i.i.d. random source on the source interval (whatever that may be) cannot be compressed by this method (nor any other, of course). Thus, the question is rather this: what would the random source need to look like to allow compression by this method, and is that a good model for any natural source? To answer the first question: apparently, the distribution of numbers must be different from the distribution of the natural numbers, and specifically of their prime factors, for this to pay off. To answer the second question: no - I'm not aware of any natural source for which prime factorization would allow one to build a good model. The third question might have been: is the OP actually aware of this, or is this just the typical compression scam? I'm afraid the answer to this is rather obvious.

## 14.

people are usually just too lazy to read the counting argument theorem. as said in one Russian anecdote, "i'm not the reader, i'm the Writer!"

## 15.

If I remember correctly, the Enigma machine used in WWII had the flaw that no letter could encode itself, and that weakness was exploited to find the daily keys to the radio traffic.
