
Thread: Compression via Number Factorisation

  1. #1
    Member
    Join Date
    Jun 2012
    Location
    South Africa
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Post Compression via Number Factorisation

    Been thinking about this lately. A file may be represented using a large number (as with Base64), right? So considering this, I have two ideas which may be used in compression.

    1) Say a large number X is made up of smaller numbers x1, x2, x3, ..., xn.
    e.g. X = 2510202550100, built from x1=2, x2=5, x3=10, x4=20, x5=25, x6=50, x7=100.

    Now, all of those smaller numbers are factors of 100, so X may then simply be represented as 100. Of course, issues arise when the smaller numbers are not in ascending/descending order, or when not all of the factors of 100 are included. But other than that, what do you guys think of this idea?
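
    A rough Python sketch of that first idea (an illustration only, with made-up helper names); it also shows why a decoder would need to know exactly which factors were included and in what order:

    Code:
    def divisors(n):
        """All positive divisors of n, in ascending order."""
        return [d for d in range(1, n + 1) if n % d == 0]

    print(divisors(100))                     # [1, 2, 4, 5, 10, 20, 25, 50, 100]

    # The example X uses only some of them (no 1, no 4), so "100" alone does not
    # tell a decoder which subset was written down, or in which order.
    used = [2, 5, 10, 20, 25, 50, 100]
    X = int(''.join(str(d) for d in used))
    print(X)                                 # 2510202550100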

    2) What if a number Y is a perfect power of another number?
    e.g. Y = 10460353203 = 3^21, so Y may simply be represented as 3^21, or as the pair 3,21.
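
    A quick sketch of how idea 2 could be checked (toy Python, a brute-force search over exponents, not a serious algorithm):

    Code:
    def as_perfect_power(y):
        """Return (base, exponent) with base ** exponent == y, largest exponent first, or None."""
        if y < 2:
            return None
        for exp in range(y.bit_length(), 1, -1):    # try big exponents first
            base = round(y ** (1.0 / exp))
            for b in (base - 1, base, base + 1):    # guard against float rounding
                if b >= 2 and b ** exp == y:
                    return b, exp
        return None

    print(as_perfect_power(10460353203))            # (3, 21)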

    The main issue with both of these ideas is that there isn't yet an algorithm that can factorise any number almost instantly, so factorising a huge number can take a huge amount of time. Now, I've been thinking about this and I came up with another idea. Until someone can come up with a super fast factorising algorithm, why not use a predefined list of factors that a compression algorithm can use to look up factors? This way, the factors are already known, so all that needs doing is for the compression algorithm to look through the list and choose the appropriate factors.
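
    A minimal sketch of that lookup idea, assuming a small hard-coded factor list (toy values, nothing standard):

    Code:
    SMALL_FACTORS = [2, 3, 5, 7, 11, 13]    # the predefined list the compressor would consult

    def divide_out(n):
        """Pull out factors from the fixed list; whatever is left over stays as a remainder."""
        found = []
        for f in SMALL_FACTORS:
            while n % f == 0:
                found.append(f)
                n //= f
        return found, n

    print(divide_out(10460353203))          # ([3, 3, ..., 3], 1) -- twenty-one 3s, nothing left over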

    So, what do you guys think? Is it doable using current processing technologies?

    I'm busy doing some informal research to see if a quick factorising algorithm exists. I like to think of it as a fun puzzle that I must solve in my lifetime. I'm strange in that regard - I find unsolved physics/maths problems and try to solve them. If I can solve at least one of those then I would have left my mark on humanity.

  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts

  3. #3
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    The problem is this: it's far better for the result of a general file compressor to be a single number than a set of more than one number.

    The reason is that you need to represent the number as a string of bits, and a string holding more than one number has to be written as a list of universal codes. See

    http://en.wikipedia.org/wiki/Univers...a_compression)

    The limit of this is the plain binary coding below. Note that this binary coding is not a universal code for multiple numbers. We are looking at numbers from 1 to N, where N gets very large;
    notice it starts like this:
    0 is 1
    1 is 2
    00 is 3
    01 is 4
    10 is 5
    11 is 6
    000 is 7

    To create that string, you add 1 to the number, look at its binary representation, and drop the leading 1.
    Example:

    9: add 1 to get ten, which is 1010; drop the always-present leading 1 and you get 010, which is the coding needed. If you can reduce compression to a single number, you can always bijectively transform this to bytes or whatever, no problem.
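
    A minimal Python sketch of that add-one-and-drop-the-leading-1 coding and its inverse, just to make the rule concrete:

    Code:
    def to_code(n):
        """Code a positive integer: add 1, write it in binary, drop the always-present leading 1."""
        assert n >= 1
        return bin(n + 1)[3:]              # bin() gives '0b1...', so slicing removes '0b1'

    def from_code(s):
        """Inverse: put the leading 1 back and subtract 1."""
        return int('1' + s, 2) - 1

    print(to_code(9))                      # '010', as in the example above
    print(from_code('010'))                # 9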

    Now let's look at what happens if you are forced to use just two numbers for all compression, and somehow you know that the product of the two represents the file you are compressing. Assume that you get a 2 and a 3; that is 6 possible items, so a single number could have used the codes 0 through 11, which cover all 6 values. Instead you need 1 to stand for 2 and 00 to stand for 3, which is already 3 bits, while the single-number code is 2 bits at most. Worse yet, a concatenated string like 100 has no internal divisions, so one does not know how to decode it. When the string has to hold more than one number, you have to use a universal code in which each number's end stands on its own.
    Example using Fibonacci numbers, but the same applies to Elias codes or whatever:
    6 is 10011 for 5 bits
    2 is 011 for 3 bits
    3 is 0011 for 4 bits

    1 1 in the two-number scheme means item 1
    1 2 above means 2
    1 3 above means 3
    2 1 above means 4
    2 2 above means 5
    2 3 above means 6

    So converting to another universal number system makes the length grow even more. In short, stick with one number when possible.
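
    To make the length comparison concrete, here is a small Fibonacci coder (a sketch of the usual Zeckendorf-based code): coding 2 and 3 separately costs 3 + 4 = 7 bits, against 5 bits for the single number 6.

    Code:
    def fib_encode(n):
        """Fibonacci code of a positive integer: Zeckendorf bits, lowest term first, then a closing 1."""
        assert n >= 1
        fibs = [1, 2]
        while fibs[-1] + fibs[-2] <= n:
            fibs.append(fibs[-1] + fibs[-2])
        bits = [0] * len(fibs)
        for i in range(len(fibs) - 1, -1, -1):
            if fibs[i] <= n:
                bits[i] = 1
                n -= fibs[i]
        return ''.join(map(str, bits)) + '1'

    print(fib_encode(6), fib_encode(2), fib_encode(3))                    # 10011 011 0011
    print(len(fib_encode(6)), len(fib_encode(2)) + len(fib_encode(3)))    # 5 vs 7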

    I used to argue with people for years on other compression sites, whenever some guy said he had the best compression possible and then stated that the first number is the count of ones, the next is the count of zeroes, and the third is the number identifying, in smallest form, which permutation it is. Well, it can't be optimal compared to a simple bijective arithmetic coder, since that uses only a single number. But it seems most people have trouble grasping this. I hope this helps.

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    It's called the counting argument: http://www.mattmahoney.net/dc/dce.html#Section_11

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Besides the counting argument, there is another simple reason why this won't work. A number X takes log(X) bits to write. If you factor X = x1 x2 x3..., then you need log(x1) + log(x2) + log(x3) + ... = log(x1 x2 x3...) = log(X) bits to write all the factors, so there is no savings.
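
    A quick numerical check of that identity (a sketch with arbitrary factors):

    Code:
    import math

    factors = [3, 7, 11]
    product = math.prod(factors)                 # 231
    print(sum(math.log2(f) for f in factors))    # ~7.85 bits for the three factors together
    print(math.log2(product))                    # ~7.85 bits for the product: no savings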

  6. #6
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Besides the counting argument, there is another simple reason why this won't work. A number X takes log(X) bits to write. If you factor X = x1 x2 x3..., then you need log(x1) + log(x2) + log(x3) + ... = log(x1 x2 x3...) = log(X) bits to write all the factors, so there is no savings.
    Matt, yes, the math equation above is correct. However, it stops holding when you write a series of numbers, even if they are combined arithmetically using fractional bits.

    The number of bits used for writing X1, X2, X3 is in general longer than the number of bits used for writing the single number X1*X2*X3, because of the need to know where the code for X1 ends and the code for X2 starts. So the math equation above is only valid when writing a single number, NOT A SERIES OF NUMBERS.

    However, for a series of numbers, say 3, 4, 5, where it's known that the three items come from three finite sets, the ends are built in. For example, say 3 represents 3 out of 8, 4 out of 16, and 5 out of 8; then the length is the same, since the starts and ends are predetermined. Yes, in this case it's a whole number of bits, but fractional bits are almost as easy. I think my starting statements in this thread are still the best example, but that is just how my mind sees it.
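
    A tiny sketch of that fixed-sets case, with the sizes 8, 16, 8 from the example above (a hypothetical packing, whole bits only):

    Code:
    # Encoder and decoder both know the set sizes 8, 16 and 8, so no delimiters are needed.
    a, b, c = 3, 4, 5
    packed = (a * 16 + b) * 8 + c      # 3 + 4 + 3 = 10 bits, the same as one number in 0..1023
    print(packed)                      # 421

    # Decoding just peels the fields back off in reverse order.
    print(packed // (16 * 8), (packed // 8) % 16, packed % 8)   # 3 4 5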

  7. #7
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    ln(1) = 0, so when writing a series of numbers, about ln(x+1) bits are used for each number x, where x can be 1 to infinity and ln here is the log base 2.

    Thus ln(X1 + 1) + ln(X2 + 1) + ... = ln((X1+1)(X2+1)...),
    and ln(1 + X1*X2*...) is less than ln((X1+1)(X2+1)...).
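
    A quick check of that inequality with the numbers from the first post (a sketch; ln here is the log base 2, as above):

    Code:
    import math

    xs = [2, 5, 10, 20, 25, 50, 100]
    single = math.log2(1 + math.prod(xs))            # bits if you send one combined number
    separate = sum(math.log2(x + 1) for x in xs)     # bits if you send each number on its own
    print(single, separate, single < separate)       # ~27.9 vs ~29.1 bits: the single number wins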

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    To make the codes self-delimiting you actually need log x + log log x + log log log x + ... bits: to code x, the length of x, the length of the length of x, and so on. Or, of course, you could use some different probability model to code them. You could also argue that to code a list of factors you can sort them, discard composite numbers, and delta code the remainder to save some space, but ultimately the counting argument still wins.
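
    One concrete self-delimiting code is Elias gamma, which spends about 2*log(x) + 1 bits (Elias omega gets closer to the log x + log log x + ... bound described above). A minimal sketch:

    Code:
    def elias_gamma(n):
        """Elias gamma code: floor(log2 n) zeros, then n in binary; the zeros announce the length."""
        assert n >= 1
        body = bin(n)[2:]
        return '0' * (len(body) - 1) + body

    def elias_gamma_decode(bits):
        """Read one gamma-coded number off the front of a bit string; return (number, rest)."""
        zeros = len(bits) - len(bits.lstrip('0'))
        return int(bits[zeros:2 * zeros + 1], 2), bits[2 * zeros + 1:]

    print(elias_gamma(9))                        # '0001001'
    print(elias_gamma_decode('0001001' + '11'))  # (9, '11') -- the decoder knows where 9 ends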

  9. #9
    Member
    Join Date
    Jun 2012
    Location
    South Africa
    Posts
    15
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hold on. You guys are talking in Greek. All right, let me see if I understand this.

    For my first proposal (i.e. representing a list of factors using its lowest common factor), are you saying, for example, that the current binary system has a limitation which ensures that the lowest common factor 100 would actually contain more 1s and 0s than its string of factors 2, 5, 10, 20, 25, 50? And all of that maths you're throwing at me is aimed at proving this?

    Also, just scanning the above posts I picked out something:

    "Besides the counting argument, there is another simple reason why this won't work. A number X takes log(X) bits to write. If you factor X = x1 x2 x3..., then the you need log(x1) + log(x2) + log(x3) +... = log(x1 x2 x3...) = log(X) bits to write all the factors, so there is no savings."

    This only applies to factors which are multiplied together to form the number X. However, I'm referring to the "lowest common factor", so X does not equal x1 x2 x3 ... xn; rather, X is simply a multiple of x1, and of x2, and of x3, ..., and of xn. What you are saying is that 2*3*4 = 24, so the string 2,3,4 has the same number of bits as the number 24, which may be true; but what I'm saying is that 24 also has the factors 1, 6, 12, and 24, so those factors could also be represented by 24.

    I don't know. Currently there isn't any fast algorithm that can tell exactly what factors a number has, nor, I believe, is there one that shows how many factors a number has. Is there a theorem that predicts the maximum number of factors a number can have? As such, isn't it possible that there could be a number (or numbers) with so many factors that representing the number alone actually requires fewer bits than representing its string of factors? Or is the above "maths" actually aimed at rejecting this possibility as well?
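
    A quick back-of-the-envelope check of that in Python (a sketch, nothing rigorous):

    Code:
    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    n = 100
    print(n.bit_length())                               # 7 bits for the number 100 itself
    print(sum(d.bit_length() for d in divisors(n)))     # 36 bits for the whole divisor list 1, 2, 4, ..., 100

    So at least for 100, the number on its own takes far fewer bits than its full list of factors.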

    You guys may be right, but how does this apply to my second proposal (i.e. representing a large number as a smaller number raised to a power)?

    Anyway, I'll give that link (http://www.mattmahoney.net/dc/dce.html) a read. Maybe it will clarify things for me.

    Now, until I can read that above link, can someone please tell me in plain English what this "counting argument" is?
    Last edited by MiY4Gi; 16th July 2012 at 18:52.

  10. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Yes, we are talking Greek, because all authors of such proposals are "writers" rather than "readers". The counting argument is only five lines long, and if you understand the math, it's all you need to realize why all such schemes can't work.

  11. #11
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    Yes, we are talking Greek, because all authors of such proposals are "writers" rather than "readers". The counting argument is only five lines long, and if you understand the math, it's all you need to realize why all such schemes can't work.
    I agree the counting argument can be used to totally counter the forms of compression the first author of the thread was attempting. However, I am not sure that those who try what he is doing see how it applies to what they are doing. To me, the counting argument works best when you try to show that not all files can be compressed smaller. But the author is not saying that, so I am attempting to show by simple means why his methods fail. I have often seen articles by people who understand the basic counting argument yet go on to propose what they claim is optimal compression. They usually have several fields holding numbers that can be of any size, at least for the first few. Example:
    Someone who knows the counting theorem proposes to compress an arbitrary string of ones and zeroes to a shorter string of ones and zeroes. They might say: use three numbers, the first being the number of zeroes, the second being the number of ones, and the last being the number identifying the particular permutation, which is a limited number determined by the counts of ones and zeroes. This method is claimed by the person to be optimal, and the person realizes that some strings compress smaller while others compress longer. But this is false; it can't be optimal, and if the person understood the counting theorem as well as some of the experts at this site, they would see that. But stating the counting theorem over and over as proof is not going to help them see why their method is wrong.

    For the example above, you can get an optimal compression of strings to strings using those three numbers, but not by encoding the three numbers separately in the string. And of course, to be optimal when compressing all strings to all strings, it has to be bijective.

  12. #12
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    The counting argument is like this. There is no algorithm that will compress all 1000 byte files to 999 bytes or less, because there are 256^1000 possible inputs but only about 256^999 possible outputs of 999 bytes or less. If two inputs produce the same output then you can't know which one to decompress it to, no matter what algorithm you use.
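
    A toy version of that pigeonhole count, scaled down to 3-bit files (a sketch, not part of any compressor):

    Code:
    from itertools import product

    inputs = [''.join(p) for p in product('01', repeat=3)]                      # 8 possible 3-bit files
    outputs = [''.join(p) for r in range(3) for p in product('01', repeat=r)]   # 1 + 2 + 4 = 7 shorter strings
    print(len(inputs), len(outputs))     # 8 > 7, so any "compressor" must map two inputs to one output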

    People who believe in universal compression seem to be universally unable to understand this argument.

  13. #13
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    437
    Thanks
    1
    Thanked 96 Times in 57 Posts
    Quote Originally Posted by biject.bwts View Post
    I agree the counting argument can be used to totally counter the forms of compression the first author of the thread was attempting. However, I am not sure that those who try what he is doing see how it applies to what they are doing.
    The counting argument surely does *not* say that there are no sources you could compress this way. It just says that an iid random source on the source interval (whatever that may be) cannot be compressed by this method (nor by any other, of course). Thus, the question is rather this: what would the random source need to look like to allow compression by this argument, and is that a good model for any natural source? To answer the first question: apparently, the distribution of the numbers must differ from the distribution of the natural numbers, and specifically of their prime factors, for this to pay off. To answer the second question: no - I'm not aware of any natural source for which prime factorization would allow building a good model. The third question might have been: is the OP actually aware of this, or is this just the typical compression scam? I'm afraid the answer to this is rather obvious.

  14. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    People are usually just too lazy to read the counting argument theorem. As the Russian anecdote goes, "I'm not the reader, I'm the Writer!"

  15. #15
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    If I remember correctly, the Enigma machine used in WWII had the flaw that no letter could encode to itself, and that weakness was exploited to find the daily keys for the radio traffic.
