Results 1 to 30 of 42

Thread: xtremecompression

  1. #1
    Member BetaTester's Avatar
    Join Date
    Dec 2010
    Location
    Brazil
    Posts
    43
    Thanks
    0
    Thanked 3 Times in 3 Posts

    xtremecompression

    Beats the best of ALL the PAQ versions:
    http://www.xtremecompression.com/compare.html

    Xtremecompression methods

    ====================
    Attribute Vector Coding

    Attribute vector coding is the first vector transform method for compressing multidimensional database tables. It is tuple-oriented and semantic. It breaks new ground by capturing and exploiting the innumerable interrelationships, functional dependencies, and statistical correlations in data without having to solve the intractable problem of explicitly identifying and defining them. It achieves unequaled compression because it systematically models data at the highest levels of abstraction, across dimensions and data types. That makes it far less subject than conventional methods to compressibility limits imposed by information theory.

    Attribute vector coding recovers an empirical understanding of the processes that created the data and then strives to emulate those processes through functions embodied in software and codified in data. Those functions, together with block-decomposable degenerated canonical Huffman codes that encode them into symbols, decorrelate across dimensions and data types simultaneously.

    Beyond Entropy

    Beyond entropy is a compound lexical method founded on two new design principles. It can compress length-constrained text fields below the entropy of the original data model. The technique synthesizes a data model that is representationally equivalent to the original but operates at lower entropy. It builds lexicons from the input symbols, warms one by exploiting latent symbol substructure, and assigns the symbols entropy codes using the revised statistics.

    Beyond entropy reveals the presence of a technology gap between the art of data modeling and the science of information theory, the value of new principles that bridge that gap, and the benefit of exploiting latent substructure in complex data symbols. It also provides new insight into how symbols can represent objects, and into how entropy differentials can exist between the processes that generate real-world data and the generative data models that emulate them.

    Repopulation

    Repopulation is a structural method for compressing the common configuration of access data structures consisting of separate hash and collision tables, and handling collisions (sequences of database record numbers) through external chaining. It populates hash table locations that would otherwise be empty with parts of collision strings that would otherwise occupy memory.

    Unlike almost every other compression method, repopulation is not a replacement scheme. Instead, repopulation is permutational and mechanistic; it works like a chess-playing automaton. It draws on no information-theoretic concepts. Repopulation simultaneously achieves the access speed of a low hash table load factor and the table compactness of a high one, thus avoiding that historical compromise.
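
    For readers who do not work with these structures every day, the following is a minimal Python sketch of the conventional configuration described above: a hash table whose slots point into a separate collision table of externally chained record numbers. It shows only the baseline layout that repopulation compresses, not the (unpublished) method itself; the names and sizes are invented for illustration.

    Code:
    # Baseline structure only: a hash table plus a separate collision table that
    # chains database record numbers externally. Repopulation (not shown) would fill
    # otherwise-empty hash slots with pieces of these chains.

    NUM_SLOTS = 8                      # tiny on purpose; a low load factor leaves slots empty

    hash_table = [None] * NUM_SLOTS    # slot -> index of first chain node, or None
    collision_table = []               # list of (key, record_number, next_index) chain nodes

    def toy_hash(key):
        return sum(key.encode()) % NUM_SLOTS   # deterministic toy hash, for illustration only

    def insert(key, record_number):
        slot = toy_hash(key)
        # prepend a node to this slot's external chain
        collision_table.append((key, record_number, hash_table[slot]))
        hash_table[slot] = len(collision_table) - 1

    def lookup(key):
        """Walk the external chain for key's slot and collect matching record numbers."""
        records = []
        node = hash_table[toy_hash(key)]
        while node is not None:
            k, rec, node = collision_table[node]
            if k == key:
                records.append(rec)
        return records

    if __name__ == "__main__":
        for rec, key in enumerate(["alpha", "beta", "gamma", "delta"]):
            insert(key, rec)
        print(lookup("gamma"))                         # -> [2]
        empty = sum(s is None for s in hash_table)
        print(empty, "of", NUM_SLOTS, "slots empty")   # the space a low load factor wastes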

    Superpopulation

    Superpopulation is a variable-to-variable-length algorithm that compresses index tables, lists, arrays, zerotrees, and similar data. It systematically accommodates wide local variations in data statistics. Superpopulation may be used by itself or in conjunction with repopulation.

    Superpopulation recognizes that distributions of values in access data structures are often far from random. They can be highly correlated due to the nature of the processes that generated the data, and often have areas of high and low correlation. In a typical index table application, superpopulation decomposes each collision string into a sequence of adjacent substrings, each classified as one of two distinct target types or as an incompressible substring. Each target is then compressed using one of two target type-specific encoding methods.
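
    As a very loose sketch of the variable-to-variable idea (the real target types and their codes are not public; the two substring classes below are invented purely to illustrate decomposing a collision string into differently behaved pieces):

    Code:
    def segment(values):
        """Split a sequence into (kind, start, length) pieces: 'const' for runs of a
        repeated value, 'step1' for runs increasing by 1, 'literal' for everything else.
        A stand-in for classifying substrings into type-specific targets."""
        pieces, i, n = [], 0, len(values)
        while i < n:
            j = i + 1
            if j < n and values[j] == values[i]:
                while j < n and values[j] == values[i]:
                    j += 1
                pieces.append(("const", i, j - i))
            elif j < n and values[j] == values[i] + 1:
                while j < n and values[j] == values[j - 1] + 1:
                    j += 1
                pieces.append(("step1", i, j - i))
            else:
                pieces.append(("literal", i, 1))
            i = j
        return pieces

    if __name__ == "__main__":
        collision_string = [7, 7, 7, 12, 13, 14, 15, 40, 9, 9]
        for kind, start, length in segment(collision_string):
            print(kind, collision_string[start:start + length])
        # const [7, 7, 7] / step1 [12, 13, 14, 15] / literal [40] / const [9, 9]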

    Wordencoding

    Wordencoding is a 0-order (context-independent) variable-to-variable-length algorithm for compressing text strings, hereinafter called words, in database table record fields. It achieves compression close to the 0-order source entropy without sacrificing speed. It does that by providing an efficient way to maximize effective combined data locality over three areas: the compressed record fields, the lexicons holding the words, and their access data structures. Wordencoding deals explicitly with the structure and statistics of the data by recognizing that redundancy in text strings exists in different forms at different levels of granularity.
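
    The actual wordencoding algorithm is not published, but a rough Python illustration of plain 0-order word-level coding, the yardstick it is being measured against, could look like this (the lexicon and text are invented; the point is only that whole words are the symbols and the coded size approaches the 0-order word entropy):

    Code:
    import heapq, math
    from collections import Counter

    def huffman_code_lengths(freqs):
        """Return {word: code length in bits} for a Huffman code over the given counts."""
        heap = [(f, i, (w,)) for i, (w, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        lengths = {w: 0 for w in freqs}
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, words1 = heapq.heappop(heap)
            f2, _, words2 = heapq.heappop(heap)
            for w in words1 + words2:
                lengths[w] += 1                  # every merge adds one bit to each member
            heapq.heappush(heap, (f1 + f2, tiebreak, words1 + words2))
            tiebreak += 1
        return lengths

    text = ("the quick brown fox jumps over the lazy dog the fox the dog " * 50).split()
    freqs = Counter(text)
    n = len(text)

    lengths = huffman_code_lengths(freqs)
    coded_bits = sum(freqs[w] * lengths[w] for w in freqs)
    entropy_bits = -sum(f * math.log2(f / n) for f in freqs.values())

    print("raw ASCII size:      ", sum(len(w) + 1 for w in text), "bytes")
    print("0-order word entropy:", round(entropy_bits / 8, 1), "bytes")
    print("word-level Huffman:  ", round(coded_bits / 8, 1), "bytes")
    # A real coder must also store the lexicon and its access structures, which is
    # exactly the locality problem the paragraph above is about.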

    Last edited by BetaTester; 10th February 2012 at 01:50. Reason: include graphic

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    The test is on synthetic data. The best compression is the original program that generated it. http://www.tpc.org/tpch/spec/tpch_2_14_3.tgz is 21901 KB, which would give a compression ratio of 45.66 to 1 (I assume it's the 1 GB lineitem table). I've had discussions with the author. I'm just too lazy to write a table oriented compressor for this file.

  3. #3
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Actually it was 7.8 GB. FYI, the TPC is about to replace (or augment) the TPC-H I used with TPC-DS. Nobody really liked TPC-H but they all used it!

  4. #4
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    What does it do on enwik8 and enwik9?

  5. #5
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    No idea. Code is not released.

  6. #6
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I suppose it would crash. It's domain-specific and it shows the effectiveness of domain specificity. Some users benefit from that. They're my target.

  7. #7
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Nice name for the project. Probably a whole bunch of marketers worked hard on it.
    Is there some kind of demo binary, so we can test it?

  8. #8
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Not right now due to concerns about reverse engineering.

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    What about compressing on the server side? I.e., ask the user to upload a (test) file, then compress it on the server and provide a link to the resulting self-extracting archive?

  10. #10
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Primarily because it doesn't work on Lena! These days unless your compressor works on porn it's considered no good.

  11. #11
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    If the compressor is not generic, i.e. if it is for relational databases or the like, wouldn't a description of the data format be illustrative? After all, if I take an arbitrary string and compress it with PackBits or RLE, most compressors won't be able to compress the data much further. But if I know the data is indexed per a particular entity-relationship schema, or I have data from many examples of an arcane database format, might I have insight on how to make a transform, i.e. a codec, to make it more compressible? Basically what I am asking is: if this data is encoded some weird way, couldn't a tool like Precomp do the same job, assuming one used deflate afterwards (or paq8o8pre or lprepaq)?

  12. #12
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    There are two file formats of interest here (see http://www.tpc.org/tpch/spec/tpch2.14.3.doc for the schema), called lineitem table records and orders table records. The files consist of variable-length ASCII character-coded records. Lineitem table records have 16 delimited fields and vary in length between about 100 and 160 bytes. Orders table records have 9 delimited fields and vary in length between about 80 and 160 bytes.

    I chose those because both Oracle and IBM have published their compression ratios for them. Oracle measured about 1.5:1, IBM got about 2.5:1, and my compressor got about 8.6:1 for the lineitem table and 7.6:1 for the orders table.
    Last edited by Glenn Davis; 15th February 2012 at 06:08.

  13. #13
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Quote Originally Posted by BetaTester View Post
    It can compress length-constrained text fields below the entropy of the original data model.
    When I read this I stopped reading further

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    So I guess your test file was the TPC-H lineitem table with a scale factor of 10?

  15. #15
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Yes, scale factor 10. You can generate it using the TPC's population generator and you should be able to duplicate it exactly. The scale factor 10 lineitem file measures 7,835,713,740 bytes and has 59,986,052 records. I also ran scale factor 1 as a check (765,864,502 bytes and 6,001,215 records). CR was 8.4:1 on that. And I also ran the orders table at the same scale factors. So in all I measured 4 compression ratios, 4 compression times, and 4 decompression times for each of 140 compressors. If you want I'll send you the spreadsheet with all the numbers and compressor settings.

  16. #16
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Sebastian, people working with word-based text compression and certain types of greedy parsing have probably been coding below entropy for decades without knowing it. Part of the reason is that when the symbols themselves have substructure, as words do, conventional entropy calculations underestimate compressibility. That's because there's no way to consistently factor into entropy calculations the distributed mutual information due to symbol self-similarity. But it's there.
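
    One toy way to see that effect (invented example, not the data under discussion): an order-0 character-level entropy estimate of a word stream ignores the structure inside the words, so it reports far more bits than a word-level model of the very same stream needs.

    Code:
    import math
    from collections import Counter

    # A stream of words drawn from a tiny lexicon: symbols with internal substructure.
    words = ["alpha", "beta", "gamma", "beta", "alpha", "alpha", "gamma", "beta"] * 1000
    stream = " ".join(words) + " "

    def order0_entropy_bits(symbols):
        counts = Counter(symbols)
        n = sum(counts.values())
        return -sum(c * math.log2(c / n) for c in counts.values())

    char_bits = order0_entropy_bits(stream)   # character-level estimate of the data
    word_bits = order0_entropy_bits(words)    # word-level estimate of the same data

    print("order-0 character model:", round(char_bits / 8), "bytes")
    print("order-0 word model:     ", round(word_bits / 8), "bytes")
    # The character-level figure is several times larger: it cannot see that the letters
    # inside each word are fully determined once the word is known. (A real coder must
    # also pay for the lexicon, which this toy ignores.)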

  17. #17
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    By definition you cannot reliably encode a signal to smaller than its true entropy; however, this is perhaps a semantic difference. What you're saying is that we simply miscalculate entropy, which is likely true unless you wish to restrict entropy to everything outside of the modelling.

    I suspect the objections are simply over a confusing use of the term Beyond Entropy.

  18. #18
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    There's a principle at work here called representational equivalence. For some sets of transparently decomposable symbols (words in this case), multiple generative probabilistic models can exist that generate identical data but differ in calculated entropy. Among all representationally equivalent models that one can imagine for the same data, the original may not happen to be the one with the lowest entropy.

    That tells us to try to find a different process for the same data with lower entropy, and then code to a level just above that process's entropy.

    In this case I was able to exploit latent symbol substructure in the words to synthesize a representationally equivalent model whose entropy was a few percent lower than that of the original process. I then designed a compressor for that, ignoring the actual process I had used to generate the data. http://www.xtremecompression.com/beyondentropy.html describes this.

    Doing that let me code to just above the lower entropy and below the higher one. I'm sure I'm not the first to do this although I can't recall ever seeing it mentioned in the literature.

    Has anyone seen a discussion of this?
    Last edited by Glenn Davis; 16th February 2012 at 09:03.

  19. #19
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Quote Originally Posted by Glenn Davis View Post
    There's a principle at work here called representational equivalence. For some sets of transparently decomposable symbols (words in this case), multiple generative probabilistic models can exist that generate identical data but differ in calculated entropy.
    Entropy is the expected information content of an i.i.d. stochastic process X over a finite alphabet of symbols with an assigned probability distribution over this alphabet.

    1. The unit of entropy is usually [Bits per Symbol], not [Bits]!
    2. When you calculate the entropy you assume a source X with certain properties (because you need the (joint) PDF), and those assumptions are mostly not met. For example: I can easily calculate different entropies of a source by assuming it is a Markov source of finite order n, for some fixed n. Commonly the calculated entropy gets lower as the order increases, but it is mostly not the "true" entropy, since you don't know the source's properties or whether the i.i.d. assumptions are met (see the sketch after this list).

    3. If you would like to calculate the entropy of a whole message of length n, you have to know the probability of every possible message of that length, which you certainly do not know as long as you don't know the source's properties!

    4. A stochastic process does not generate ONE message. It can generate every possible message over the alphabet with some probability.

    5. Two processes X and Y over the same alphabet with different PDFs have different entropies. If their PDFs are identical then X=Y.
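
    Here is the sketch promised in item 2 (toy string invented for illustration; the only point is that the assumed source model changes the entropy figure you calculate):

    Code:
    import math
    from collections import Counter, defaultdict

    data = "abababababababababababcabababababababab"   # deliberately correlated toy string

    def order0_bits_per_symbol(s):
        counts = Counter(s)
        n = len(s)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def order1_bits_per_symbol(s):
        """Conditional entropy estimate H(X_i | X_{i-1}) from bigram counts."""
        ctx = defaultdict(Counter)
        for a, b in zip(s, s[1:]):
            ctx[a][b] += 1
        total = len(s) - 1
        h = 0.0
        for a, nexts in ctx.items():
            n_a = sum(nexts.values())
            for c in nexts.values():
                h -= (c / total) * math.log2(c / n_a)
        return h

    print("order-0 estimate:", round(order0_bits_per_symbol(data), 3), "bits/symbol")
    print("order-1 estimate:", round(order1_bits_per_symbol(data), 3), "bits/symbol")
    # The order-1 figure is lower, but neither is the "true" entropy unless the assumed
    # model actually matches the process that produced the data.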
    Last edited by Sebastian; 16th February 2012 at 10:29.

  20. #20
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Suppose you are given an arbitrary string x = x1 x2 ... xn over some alphabet S={1, 2, ..., N} with cardinality N. You can consider two situations:

    I. The string x is arbitrary and there is no data generating process in the information theoretical sense. To compress this string sequentially (symbol by symbol, whatever you consider symbols to be) you have to estimate a sequence of N-dimensional vectors p1, p2, ..., pn; where for each vector pi(1) + pi(2) + ... + pi(N) = 1, pi(j)>0 for 1<=i<=n, 1<=j<=N.

    Fact: The optimal code length w.r.t. the sequence of estimated vectors p1, p2, ..., pn is -[log(p1(x1)) + log(p2(x2)) + ... + log(pn(xn))].

    So when you compress such a string using algorithm/model A and algorithm/model B, you will most likely end up with a shorter and a longer encoding. That means one of these compresses the sequence better than the other, or, in the terminology of statistical compression, assigns a higher probability to this particular string. The situation can be different for other strings. This is why we benchmark different algorithms on different data.

    II. The string is generated by some stochastic process, which assigns a probability p(x) to this string. Suppose that there is a compression algorithm B which compresses this particular string x to fewer than -log(p(x)) bits. Can this happen? Yes, indeed. That just means that B assigns a higher probability to this particular string x than the process actually does. But this also means that B assigns lower probabilities to other strings y than the process does (and thus produces longer code lengths for those strings). On average (if you consider all strings the process might generate) the code length of B's encoding will be longer. Suppose q(x) is the distribution that algorithm B assigns to a string x and p(x) is the probability that the process emits x.

    The expected redundancy of q w.r.t. p for a string of length n (or KL-divergence) is:

    D(p||q) := sum over all strings x of length n over S: p(x) [-log(q(x)) - (-log(p(x)))].

    Fact: We have D(p||q)>=0, with equality for p=q, i.e., the average redundancy is guaranteed to be non-negative and is zero, if and only if you use a model q, which exactly mimics the probability assignment p of the data-generating process.
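
    For concreteness, the same definitions in a few lines of Python, with purely illustrative values for p and q:

    Code:
    import math

    def kl_divergence_bits(p, q):
        """D(p||q): the expected redundancy of coding with model q when the data follow p."""
        return sum(px * (math.log2(px) - math.log2(qx)) for px, qx in zip(p, q) if px > 0)

    p = [0.51, 0.49]   # data-generating process
    q = [0.50, 0.50]   # model used by the coder

    L_star = -sum(px * math.log2(px) for px in p)                # entropy of p
    L = -sum(px * math.log2(qx) for px, qx in zip(p, q))         # expected code length under q

    print("L* (entropy of p):               ", round(L_star, 6), "bits/symbol")
    print("L  (expected code length with q):", round(L, 6), "bits/symbol")
    print("D(p||q) = L - L*:                ", round(kl_divergence_bits(p, q), 6), ">= 0")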

    Summary:
    - Deterministic setting (I): The compression is determined by the quality of the model, which has to be tested empirically. Don't estimate the entropy using model A, then compress using model B, and claim that you compressed below the entropy because the compression is better with model B than with model A. That just means that for that particular sequence B works better than A.
    - Stochastic setting (II): You can compress particular sequences better, others worse than the data generating process. You cannot compress below entropy (which is the average/expected code length).
    Last edited by toffer; 16th February 2012 at 12:50.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  21. #21
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    Thanks Toffer, but for those of us who are more mathematically challenged, can we have a weaker proof, like showing that this coding scheme is injective but not surjective, i.e. you can compress many strings or tuples or whatever, but there are tuples without relationships? Perhaps a pathological test case? Incidentally, I suggest an "undergraduate" topic forum other than off-topic for questions posed by myself and others unable to solve strong induction. I am a simple librarian, and have the task of cataloging all the books that don't list themselves as a reference. But when I am finished, do I list this catalog in itself? It is, after all, a book. If I don't, my list will not be complete, but if I do, it won't be a book of books that don't list themselves.

  22. #22
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    toffer, my sample was 1,000,000 strings of randomly selected words. That's a large enough sample for an alphabet of 5000, and I used 32-bit Mersenne Twister random number generators, which are known to be very good. So the data are representative.

    But you missed an important point. The data-generating process is complex. It consists of two very different kinds of random processes that are independent. The drawing on the web page I referenced in post #18 shows that the output of the first feeds the second, and information is lost between them in the process due to their independence (and also, I suppose, because they operate in different domains). In this case (and this is certainly NOT ALWAYS true for complex sources and perhaps NOT USUALLY the case) it is possible for a person to imagine a data-generating mechanism (the new model) that is different in a logical sense but always generates identical data, and then assume that was the actual source of the data when designing the compressor.

    That's all I did. I was lucky -- the new model has slightly lower entropy than the original. It could just as easily have been higher, as it would have been had the models been interchanged.
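
    A toy illustration of the general two-stage possibility being discussed (invented stages and alphabet, not the actual generator, which is unreleased): when the second stage throws away part of what the first stage draws, adding up the information content of every internal random choice overstates the entropy of what finally comes out.

    Code:
    import math, random
    from collections import Counter

    random.seed(1)

    # Stage 1: pick one of four equally likely internal states (2 bits of internal choice).
    # Stage 2: emit only the first letter, so two internal states map to each output.
    STATES = ["aa", "ab", "ba", "bb"]

    samples = [random.choice(STATES)[0] for _ in range(100_000)]

    internal_bits = math.log2(len(STATES))    # 2.0 bits actually drawn per sample
    counts = Counter(samples)
    n = len(samples)
    output_entropy = -sum(c / n * math.log2(c / n) for c in counts.values())

    print("internal randomness consumed:", internal_bits, "bits per sample")
    print("empirical entropy of output: ", round(output_entropy, 3), "bits per sample")
    # The output needs only about 1 bit per sample: half of the internal information was
    # discarded, so an accounting that sums -log2(p) over every internal choice overstates it.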

  23. #23
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Actually you can compress below entropy.
    For example, if you compress to strings and not bytes, it is easy to see.
    Suppose you have a perfect IID source of ones and zeros where "1" occurs 51% of the time and "0" occurs 49% of the time.
    If "1" occurs you should use 0.971430848 bits;
    if "0" occurs, 1.02914635 bits.

    below sort by entropy and assign to next available short string
    1 = 0.971430848 compresses to 0
    0 = 1.02914635 to 1
    11 = 1.9428617 to 00
    01 = 2.00057719 to 00
    10 = 2.00057719 to 01
    00 = 2.05829269 to 11

    and so on. Notice that above, 0, 01, 10, and 00 all compress to below their entropy,
    while 1 and 11 compress to larger than their entropy.

    Entropy is a statistical thing, about averages; a bijective arithmetic coder will round up or down.
    Last edited by biject.bwts; 16th February 2012 at 19:27. Reason: highlight to match previous post

  24. #24
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Here are the kinds of things that occupy my mind. Suppose you are interested in compressing from the space of all byte files to files. If you have a halfway decent compressor that works on all files (that doesn't mean it makes all files shorter), how many bytes are needed to compress a file of the form:
    ABCCDDDDDDDD or ABBCDDDDDDDD or AABCDDDDDDDD, or any version of those truncated so that it is 1 to 12 bytes,

    where A, B, C, D are any bytes such that:
    A can be any of 256 values (0x00 to 0xFF)
    B can be any of 256 values
    C can be any of 255 values: it cannot be the same as A, unless B is the same as A, in which case it can be any of 256 values
    D can be any of 254 values: it cannot be the same as A or B, unless:
    1) A and B are the same, in which case it can be any of 256 values;
    2) B and C are the same, in which case it can be any of 256 values.

    I would like someone to calculate the entropy of the above, assuming any possible file could be the result of the above,
    and how many bytes are needed to code it.


    Hint: I have a bijective DC that does this in 5 bytes or less. I can explain the simple bijective DC here if anyone is interested.

  25. #25
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Quote Originally Posted by biject.bwts View Post
    below sort by entropy and assign to next available short string
    1 = 0.971430848 compresses to 0
    0 = 1.02914635 to 1
    11 = 1.9428617 to 00
    01 = 2.00057719 to 00
    10 = 2.00057719 to 01
    00 = 2.05829269 to 11
    this is not a prefix-free code

  26. #26
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @Karhunen:
    I have no idea what you're talking about (English is not my mother tongue).

    @David:
    Your code is not uniquely decodable. How do you want to distinguish the code words 0 and 00? Or the codeword 00, which maps back to both 11 and 01?

    Statements like "I compressed the string x to l(x) < -log(p(x))" are valid. But that doesn't mean you compressed below the entropy. Since the entropy is the average code length, you have to take into account the code lengths of all strings (e.g., of a fixed length), weighted by their appearance probability (given by p(x)).

    @Glenn:
    First, I don't exactly know what you did there, since everything is vaguely described. What you claim is most likely an instance of case I.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  27. #27
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by biject.bwts View Post
    Actually you can compress below entropy.
    For example, if you compress to strings and not bytes, it is easy to see.
    Suppose you have a perfect IID source of ones and zeros where "1" occurs 51% of the time and "0" occurs 49% of the time.
    If "1" occurs you should use 0.971430848 bits;
    if "0" occurs, 1.02914635 bits.

    below sort by entropy and assign to next available short string
    1 = 0.971430848 compresses to 0
    0 = 1.02914635 to 1
    11 = 1.9428617 to 00
    01 = 2.00057719 to 00
    10 = 2.00057719 to 01
    00 = 2.05829269 to 11

    and so on. Notice that above, 0, 01, 10, and 00 all compress to below their entropy,
    while 1 and 11 compress to larger than their entropy.

    Entropy is a statistical thing, about averages; a bijective arithmetic coder will round up or down.

    Look, I tried to type it in carefully:
    11 goes to 00
    01 goes to 10; that was a typo. There should be 4 strings: 00, 01, 10, 11. Likewise, when I go to three-bit files there are 8 bit patterns.


    Quote Originally Posted by toffer View Post

    @David:
    You code is not uniqely decodeable. How do you want to distinguish the code words 0, 00? Or 00, which maps to 11 and 01.


    Yes (assuming you saw the typo where the second 00 should have been 10), it is uniquely decodable. To simplify things I am talking about bit files, not byte files.

    You don't think in terms of code words when you code arithmetically, since you carry the fractional bits in a state register. There is a unique coding for each string.
    On decompression:
    0 goes to 1
    1 goes to 0
    00 goes to 11. They are all unique strings; it's not fixed-word Huffman.

    I could have used 8-bit bytes and forced the output file to be 8, 16, 24, ... bits, but then the tables would be too long to look at, so I did the example with bit files instead of byte files to make it easier to understand. But the comment is the same.

    Here is another view that may help with normal files. Suppose you have a source that is IID over 3 symbols, each with probability 1/3 (I wrote code for this very example before), and suppose you have 3000 characters in the file. It should compress
    to (ln(3) / ln(2)) * 3000 / 8 = 594.360938 bytes, but my bijective arithmetic code would compress some such files to 594 bytes and some to 595 bytes.

    However, a bijective optimal Huffman coder would compress such a file from the same source to anywhere from 3000/8 = 375 bytes up to 750 bytes, but on average, since the source is IID, over many files you would get 625 bytes. The average for arithmetic coding is lower than for Huffman, yet there is always some Huffman-compressed file in this set that is smaller, no matter how you chop up the frequencies for the IID source.
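
    For anyone checking the arithmetic, the figures in the last two paragraphs come out as follows (nothing new, just redoing the calculation):

    Code:
    import math

    n = 3000                          # symbols, each of 3 values with probability 1/3

    ideal_bits = n * math.log2(3)     # what arithmetic coding approaches
    huffman_bits = n * (5 / 3)        # Huffman code lengths 1, 2, 2 -> average 5/3 bits/symbol

    print("ideal / arithmetic:", round(ideal_bits / 8, 6), "bytes")    # 594.360938
    print("Huffman, expected: ", huffman_bits / 8, "bytes")            # 625.0
    print("Huffman, extremes: ", n * 1 / 8, "to", n * 2 / 8, "bytes")  # 375.0 to 750.0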

  28. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    > http://www.xtremecompression.com/beyondentropy.html

    According to my calculations, the entropy of the test file is less than 9,350,000 bytes, which is below the compression achieved by any program.

    If I understand correctly, you first generate independent random words from a probability distribution that gives an average word length of 5.258 characters and 1.75 bits per character, both including the trailing space. Then you generate a stream from which you randomly sample 1,000,000 40 byte strings (I assume no overlap) and add a newline (CR LF) for a total of 42 bytes.

    Since the newlines are predictable by their location, they carry no information. So this would give 8,750,000 bytes (1.75 x 40 x 1,000,000 / 8), except for the fact that the first and last words might be truncated. To account for this effect, note that chopping the first word is equivalent to adding all the suffixes of each word in the vocabulary to the lexicon. This increases the lexicon by a factor of at most the average word length, or actually less because of common suffixes. So this adds at most log2(5.258) ~ 2.4 bits of entropy.

    There is a similar effect on the truncated last word. The characters at the beginning of the word have more effective entropy than those at the end because there is less context. Again, the exact amount depends on the lexicon, but we may apply the same reasoning to put an upper bound of 2.4 bits. So we have ((40 x 1.75) + 2.4 + 2.4) x 1,000,000 / 8 = 9,350,000 bytes.
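
    Redoing that arithmetic in a couple of lines, with the numbers taken straight from the reasoning above:

    Code:
    bits_per_char = 1.75         # including the trailing space
    chars_per_line = 40
    lines = 1_000_000
    truncation_bits = 2.4 + 2.4  # upper bounds for the chopped first and last words

    baseline = bits_per_char * chars_per_line * lines / 8
    bound = (bits_per_char * chars_per_line + truncation_bits) * lines / 8

    print("baseline (no truncation effects):", f"{baseline:,.0f}", "bytes")  # 8,750,000
    print("upper bound with truncation:     ", f"{bound:,.0f}", "bytes")     # 9,350,000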

  29. #29
    Member
    Join Date
    Feb 2012
    Location
    Northern Virginia
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts
    During the run, 8,512,430 words are selected by the first random process (that's an actual count, not an estimate). Each time a word is randomly selected, a log2 p term is computed using the selected word's probability, and that is added to a running total.

    Sampling is the second random process. Each time the sampling 'turns on', a separate log2 p term is computed and added to a second running total; this time p is the reciprocal of the current word's length, since all positions are equally likely to be the first one sampled. When the sampling 'turns off' after 40 characters, there is no entropy contribution, because once we know the word identities and the 'backset' position of the first word we know the current length of the string being made.

    The two totals are added after the run terminates. The accumulated sum of the 8,512,430 log2 p terms for the words is 78,322,877 bits, which is 9,790,360 bytes. The accumulated sum of the 1,000,000 log2 p terms for the backsets is 2,269,684 bits, which is 283,710 bytes. The sum of those is 80,592,560 bits, or 10,074,070 bytes.
    Last edited by Glenn Davis; 17th February 2012 at 15:02.

  30. #30
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @David:

    You didn't understand what I'm talking about. I will now quote myself:
    - Stochastic setting (II): You can compress particular sequences better, others worse than the data generating process. You cannot compress below entropy (which is the average/expected code length).
    I'll take your first example and fix a string length of n=1, where p(1) = 0.51. The entropy ("expected code length") is

    L* := -p(0) log(p(0)) - p(1) log(p(1)) = 0.999711442 bit/symbol.

    Whereas your encoding assigns a code length of 1 bit to either symbol, i.e., you have q(1) = 0.5 in your model, so the average code length (when you compress strings of length n=1 drawn at random from the process where p(1) = 0.51) is

    L = -p(0) log(q(0)) - p(1) log(q(1)) = 1 bit/symbol.

    And in general L - L* = D(p||q) >= 0, so ideally you can have L = L*, iff q = p.


    @Glenn:

    To end this discussion, why don't you upload the program which generates your data strings? You can do a simple experiment to see that your initial entropy calculation is wrong. Let the second process generate, say, 4-byte strings (the length has to be small, since there are exponentially many). Each string is an instance of the outcome of your stochastic process. Now generate a few million such strings (the number of samples should be a lot larger than the outcome space; e.g., when there are 30 characters you have at most 30^4 possible distinct outcomes, so maybe 10^5 * 30^4 samples would be OK; I'm not a statistician, but I guess you get the point), calculate the empirical PDF, and compute its entropy. It will be lower than your estimate.
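
    In case it helps, here is roughly what that experiment could look like in Python; generate_string is only a placeholder for the unreleased generator, with a deliberately correlated toy process standing in for the real one:

    Code:
    import math, random
    from collections import Counter

    random.seed(0)

    ALPHABET = "abcdefghij"   # placeholder alphabet
    LENGTH = 4                # short fixed-length samples, as suggested above

    def generate_string():
        """Placeholder for the real generator: one LENGTH-character sample, made
        deliberately correlated so the characters are far from independent."""
        first = random.choice(ALPHABET)
        start = ALPHABET.index(first)
        return "".join(ALPHABET[(start + i) % len(ALPHABET)] for i in range(LENGTH))

    SAMPLES = 1_000_000
    counts = Counter(generate_string() for _ in range(SAMPLES))

    # Empirical entropy of the whole LENGTH-character message, in bits.
    empirical_bits = -sum(c / SAMPLES * math.log2(c / SAMPLES) for c in counts.values())
    naive_bits = LENGTH * math.log2(len(ALPHABET))   # an i.i.d.-per-character estimate

    print("naive per-character estimate:", round(naive_bits, 3), "bits per message")
    print("empirical message entropy:   ", round(empirical_bits, 3), "bits per message")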



    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

