# Thread: Rationale for Text Compression

1. ## Rationale for Text Compression

I have read http://mattmahoney.net/dc/rationale.html and have two questions, especially about the "How Much Compression can be Achieved?" section. Sorry if these have been discussed earlier.

First, I am not sure my math is right. The rationale says that no program to date reaches about 1 bit per character, but durilca reaches a compressed size of 127,784,888 bytes (including code and dictionary) vs. 1,000,000,000 bytes uncompressed for enwik9. At 8 bits per byte, that works out to a rate of about 1.02 bits per character, doesn't it?
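The arithmetic is easy to check; a quick sanity calculation, assuming one character per byte:

```python
# Bits per character implied by the numbers quoted above,
# assuming one character per byte.
compressed_bytes = 127_784_888    # durilca, including code + dictionary
original_bytes = 1_000_000_000    # enwik9

bpc = compressed_bytes * 8 / original_bytes
print(f"{bpc:.3f} bits per character")   # about 1.02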

The second question concerns the fact that you (the "tester") must of course count the dictionary size as part of the decompression program. On the other hand, in the tests that were done with humans (to measure their ability to predict the next character), you did _not_ add any dictionary size. If you ran the test with a first grader, you would get a worse prediction rate. With adults, the rate may approach the cited 0.6 - 1.2 bits per character, but wouldn't you need to add something to account for the information that is "stored" in the brain?

2. First, your math is right. Of course the entropy of written English depends on the test data and who is taking the test, so it is really a rough estimate. I haven't done human guessing experiments on enwik9, like the ones Shannon did, to measure its entropy. His methods only give broad bounds because different probability distributions can result in the same guess sequences. Cover and King attempted to overcome this shortcoming by having people explicitly assign probabilities in a gambling game. But this is really slow, and people are not really good at assigning odds rationally. So they ended up with higher estimates, like 1.3 to 1.7 bpc for individuals and 1.3 bpc when the results are combined. It took them about 5 hours to measure one sentence.
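Shannon's setup can be sketched as follows: the guesser proposes letters in a fixed order (the frequency order here is an illustrative assumption, not Shannon's actual tables), and only the rank of the correct letter is recorded.

```python
# Sketch of Shannon's guessing game with a fixed-order guesser.
# The frequency order below is a rough illustration, not Shannon's data.
GUESS_ORDER = " etaoinshrdlcumwfgypbvkjxqz"

def guess_ranks(text: str) -> list[int]:
    """For each character, the number of guesses needed by a guesser
    that always proposes letters in GUESS_ORDER."""
    return [GUESS_ORDER.index(c) + 1 for c in text.lower() if c in GUESS_ORDER]

# The ranks, not the letters, are what the experimenter records;
# low ranks dominate for predictable text, which is the basis
# of the entropy estimate.
print(guess_ranks("the cat sat"))
```

A real subject is of course context-sensitive rather than fixed-order, which is why the bounds from the guess sequences are broad.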

Second, we can't make an artificial distinction between code and data, so we have to include both when considering bounds on Kolmogorov complexity. The goal is to predict as well as humans. When we test humans, it is on text they have not seen before. We could do the same test on machines by concealing the test data like some benchmarks do, and then you would be allowed to include all the domain knowledge you wanted with no penalty. I am not convinced this is the right approach because it is not possible for people to independently verify the results. I think that the cost of including the decompressor exactly balances the benefit of knowing the test data.

3. Thanks for your response, Matt. You wrote:

> When we test humans, it is on text they have not seen before.

But it is/was English text for English speakers, right? If I give you a text in a language you don't know and cannot deduce (i.e., not Indo-European), let's say some Korean text, you will have a hard time predicting.

In other words, if we want to measure the ability of a machine to predict from previous input, without any knowledge of that input (except that it is a stream of 0 and 1 symbols), and then compare it to human ability, we should give humans the same stream of bits, without any knowledge of its meaning. That would be "fair".

In other words, what I am wondering is: even if we are still far from the goal of "understanding", haven't we long since reached the goal of a machine that predicts unknown data from previous data much better (and much faster) than us mere humans? Do we label a machine as inferior (measured in "bpc") just because we don't give it the information we have in our brains about grammar, words, and anything else that we "understand" and can use to predict letters of unknown texts?

4. i always thought that to make the comparison fair, we should train the program on the first 1024 mb of data and then predict the next 1 mb. repeat it for every other 1 mb in the second gb, add the results together, and include the decompressor size as Matt does. that is directly comparable to a human trained on 1 gb of natural text/speech and trying to predict random english text. in particular, it will show consistent results on the first and last bytes of this 1 mb corpus because the program has already learned, like the human. such a measurement will probably show that computers outperform humans here, like they already have in math and chess
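The protocol described above might be sketched like this, with sizes scaled way down and a toy order-0 byte model standing in for a real compressor (the model is illustrative; the train-then-code loop is the point):

```python
import math
from collections import Counter

def evaluate(data: bytes, train_size: int, block: int) -> float:
    """Total code length in bits for each block after `train_size`,
    always conditioning on everything seen so far. The order-0 byte
    model is a toy stand-in for a real compressor."""
    total_bits = 0.0
    pos = train_size
    while pos + block <= len(data):
        counts = Counter(data[:pos])            # "train" on the full prefix
        for b in data[pos:pos + block]:         # then code the next block
            p = (counts[b] + 1) / (pos + 256)   # Laplace-smoothed probability
            total_bits += -math.log2(p)
        pos += block
    return total_bits

# Toy corpus and sizes, scaled down from the 1 GB / 1 MB described above.
bits = evaluate(b"abracadabra" * 100, train_size=500, block=100)
print(f"{bits:.0f} bits to code 600 bytes")
```

Because the model sees a long prefix before every block, the per-block results stay consistent across the corpus, which is the property the post is after.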

and if you are going to test an untrained program, please compare it to a newborn child - both will learn from the first data they receive and improve prediction toward the end of the corpus

5. Keep in mind that these "machines" are highly specialized for a single task.
It's not even remotely fair to compare their "intelligence" based on their performance at their specialized task.
Ask Durilca for a cup of tea for instance, and measure its "intelligence" on that one...

Of course software is better than humans at its task. That's why we no longer have big rooms full of clerks.

Don't take the 0.6 - 1.2 bits indication at face value, given the method by which it was estimated, it's just a vague approximation.
It's merely helpful at providing a goal to reach, which is more fun.

6. yes, yes! we use computers for tasks they perform better/cheaper than humans, as we did with horses and automobiles. people are also specialized in running, swimming, talking, but this doesn't necessarily mean these things cannot be performed better. my favourite example - human output performance is just about 1 byte per second (either speaking or typing), despite it being one of our natural built-in abilities. just stop thinking that God created Human and Human cannot create something better than God's creatures

7. I think there are two different models of compression in play. One model is the Kolmogorov model, where the compressor/decompressor do not have a priori knowledge of the distribution and the compressed output has to be self-contained. The other model allows a priori knowledge. Shannon's experiment depends on a priori knowledge of English.

8. Originally Posted by cfeck
The rationale says, that no program to date reaches about 1 bit per character, but durilca reaches a compressed size of 127,784,888 bytes (including code+dictionary) vs. 1,000,000,000 bytes uncompressed for enwik9. If I assume 8 bits per byte, that should be a rate of 1.02 bits per character?
Enwik9 is not plain English; the file is actually in UTF-8, and it probably holds words like place or person names written in foreign languages and even in different scripts (non-Latin), thus it holds fewer than 1,000,000,000 characters. Since UTF-8 continuation bytes are in the 80-BF range, removing those would give the exact number of Unicode code points in the file.
Some advanced typography marks like “ ” — … or currency signs like the euro € may also require more than a byte.
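That counting trick is easy to demonstrate; a small sketch:

```python
def count_code_points(data: bytes) -> int:
    """Count Unicode code points in UTF-8 data by skipping
    continuation bytes, which are all in the 0x80-0xBF range."""
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)

text = "café — 10 €"               # é is 2 bytes; the dash and € are 3 each
encoded = text.encode("utf-8")
print(len(encoded), count_code_points(encoded))   # 16 bytes, 11 code points
```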

9. I did experiments comparing enwik8/9 with plain text from a 27-character alphabet (space and a-z only) in http://mattmahoney.net/dc/textdata.html, in the section "Relationship of Wikipedia Text to Clean Text". The clean text compresses about 5% smaller with most good compressors.

Anyway, the goal is to study language modeling, which is just a part of the AI problem. For example, here we find much better text compression using recurrent neural networks than with just a simple character n-gram model, so it is evidently learning some high-level language constructs like semantics and grammar. http://encode.ru/threads/1825-Compre...eural-networks

10. Originally Posted by caveman
Enwik9 is not plain English; the file is actually in UTF-8, and it probably holds words like place or person names written in foreign languages and even in different scripts (non-Latin), thus it holds fewer than 1,000,000,000 characters. Since UTF-8 continuation bytes are in the 80-BF range, removing those would give the exact number of Unicode code points in the file.
Some advanced typography marks like “ ” — … or currency signs like the euro € may also require more than a byte.
The biggest reason enwik9 would compress better than text would likely be the XML markup. It's very repetitive, and so it inflates data like a balloon, while contributing little actual entropy.
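A rough way to see the effect with a stock compressor (zlib here; the toy "articles" and tags are made up for illustration, not taken from enwik9):

```python
import zlib

# 200 short "articles": varied content wrapped in identical XML-style markup.
articles = [f"<page><title>Town {i}</title><text>population {i * 37}</text></page>"
            for i in range(200)]
wrapped = "".join(articles).encode()
content = "".join(f"Town {i} population {i * 37} " for i in range(200)).encode()

ratio_wrapped = len(zlib.compress(wrapped)) / len(wrapped)
ratio_content = len(zlib.compress(content)) / len(content)
print(f"with markup: {ratio_wrapped:.3f}  without: {ratio_content:.3f}")
```

The repeated tags inflate the raw size but cost almost nothing once compressed, so the file with markup shows a better overall ratio than the bare content.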

11. There is a big section in the middle of enwik9 of articles about towns automatically generated from census data that is highly compressible. This is a big reason enwik9 compresses better than enwik8.

12. I think the community needs to move on from enwik9. It really isn't representative of "text" because it cheats, for the reasons given above: it's UTF-8, has XML markup, special symbols, repetition, etc. It seems to be a very poor example, and I think the text-compression community can and should prepare a better "ASCII-text sample set".

Sure, somebody can compile (i.e., concatenate) 1 GB of *ASCII* text from more representative sources, e.g. public domain novels, nonfiction works, etc. The main 1 GB file should be in English, of course, but other languages should be represented by smaller datasets as well.

Why hasn't this been done? Why do we still retain the "crutch" of enwik9?

13. The only reason that compressors get so close to (or surpass) human abilities is our limited capacity to remember things. If we had the same number of gigabytes of RAM that computers have, computers would never outperform us, as it's us who created the computers. But in our current form we cannot simulate the algorithms we create in our own brains.

Thus, I predict that given enough memory and big enough textual input, computers will always win, if we're constrained to only use our brains of course (i.e., no helper sheets of paper). Beating humans will be a very satisfying milestone for compressor developers, but otherwise I think it will become the norm within a few dozen years.

But we (humans) can do better than Shannon predicted, I think. AFAIU, the experiment measured how an average individual predicts text. If we allowed collective thinking - letting people discuss what the probability distribution of the next symbol should be - it would improve the results, just as model mixing improves compression ratio over using a single model.
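The mixing analogy can be made concrete. A toy sketch with two bit predictors (both invented for illustration): by convexity of the log loss, the uniform mixture is guaranteed to cost no more than the average of its components.

```python
import math, random

def code_length(bits, predict):
    """Total bits to code `bits` under `predict`, which returns
    P(next bit = 1) given the history so far."""
    total = 0.0
    for i, b in enumerate(bits):
        p1 = predict(bits[:i])
        total += -math.log2(p1 if b == 1 else 1.0 - p1)
    return total

def order0(hist):        # global frequency of ones, Laplace-smoothed
    return (sum(hist) + 1) / (len(hist) + 2)

def repeat_last(hist):   # bets that the last bit repeats
    return 0.9 if hist and hist[-1] == 1 else 0.1

def mixed(hist):         # uniform mixture of the two models
    return (order0(hist) + repeat_last(hist)) / 2

random.seed(0)
bits = [1] * 30 + [0] * 30 + [random.getrandbits(1) for _ in range(40)]
l0, l1, lm = (code_length(bits, m) for m in (order0, repeat_last, mixed))
# -log is convex, so the mixture never costs more than the average:
assert lm <= (l0 + l1) / 2 + 1e-9
print(f"order0: {l0:.1f}  repeat_last: {l1:.1f}  mixed: {lm:.1f} bits")
```

Real context-mixing compressors weight the models adaptively rather than uniformly, which does even better; this is just the simplest version of the argument.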

14. Originally Posted by zyzzle
I think the community needs to move on from enwik9. It really isn't representative of "text" because it cheats, for the reasons given above: it's UTF-8, has XML markup, special symbols, repetition, etc. It seems to be a very poor example, and I think the text-compression community can and should prepare a better "ASCII-text sample set".
Getting rid of the XML markup is a piece of cake. If you remove the XML from enwik, you won't have to track down a whole new sample of text. I don't see what's wrong with UTF-8. There's nothing unrealistic about text in UTF-8.

I have a feeling that if you tried to pick a better text sample, you'd probably end up liking enwik more. It has a credible case for being representative of lots of kinds of text, it's available in large quantities, and it's free.

15. "I don't see what's wrong with UTF-8. There's nothing unrealistic about text in UTF-8."

It seems to me it's the usual thing these days for text in files in Western languages. Am I wrong?

It's especially interesting because you can convert it to UTF-16 or UTF-32, which are common in-memory text representations. (And UTF-16 is common for files too.) Then you can compress those to something better than compressed UTF-8, rather than converting to UTF-8 for file I/O and compressing that.

My understanding is that people often convert from UTF-16 to UTF-8 for file I/O on the assumption that it will "compress" the UTF-16, but for some languages (Chinese and Japanese?) that's wrong because they're not basically 8-bit codes; you want to compress the 16-bit or 32-bit strings directly---that should give you better compression, faster.
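The size trade-off is easy to check (the sample strings are just illustrations):

```python
# UTF-8 vs UTF-16 sizes for Latin-script and CJK text.
english = "compression " * 100
chinese = "数据压缩" * 100   # "data compression" in Chinese

for name, text in (("english", english), ("chinese", chinese)):
    u8 = text.encode("utf-8")
    u16 = text.encode("utf-16-le")
    print(name, len(u8), len(u16))
```

ASCII text doubles in size going to UTF-16, but BMP CJK characters shrink from 3 bytes each in UTF-8 to 2 in UTF-16, which is the case the post is talking about.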

16. > I think the community needs to move on from enwik9.

There are other large text benchmarks like http://pizzachili.dcc.uchile.cl/texts.html
The problem is that it is a lot of work to test hundreds of compressors. Any volunteers?

> Thus, I predict that given enough amount of memory and big enough textual input computers will always win, if we're constrained to only use our brains of course (ie no helper sheets of paper).

That might already be true. In Shannon's 1950 experiments, he allowed subjects to use tables of n-gram frequencies and a dictionary to help them guess the next letter.

17. Originally Posted by Matt Mahoney
> I think the community needs to move on from enwik9.

There are other large text benchmarks like http://pizzachili.dcc.uchile.cl/texts.html
The problem is that it is a lot of work to test hundreds of compressors. Any volunteers?
It ought to be scripted.


19. More on the Micron "automata processor" is at http://www.micron.com/about/innovati...sing?source=mb

Includes a FAQ and links to an IEEE TPDS paper, some supplementary info, and simulation videos.

"Unlike a conventional CPU, the AP is a scalable, two-dimensional fabric comprised of thousands to millions of interconnected processing elements, each programmed to perform a targeted task or operation. Whereas conventional parallelism consists of a single instruction applied to many chunks of data, the AP focuses a vast number of instructions at a targeted problem, thereby delivering unprecedented performance."

"The AP is not a memory device, but it is memory based. Unlike a conventional CPU, the AP is a scalable two-dimensional fabric comprised of thousands of processing elements each programmed to perform a targeted task or operation, ultimately delivering unprecedented performance. Additionally, the AP is massively parallel. Whereas conventional CPU architectures can have anywhere from 2 to 64 processors, an AP can encompass hundreds of thousands or even millions of tiny processors."

"The AP uses a DDR3-like memory interface chosen to simplify the physical design-in process for system integrators. The AP will be made available as single components or as DIMM modules, enhancing the integration process. A PCIe board populated with AP DIMMs will be available to early access application developers to jump-start plug-in development of AP applications."

"Many of today’s most challenging computer science problems require highly parallel methods to solve. In conventional computing, parallelism involves the same operation being computed on many chunks of data at once and can be cumbersome and complex, often requiring significant effort on the part of programmers and system designers. The Automata Processor’s parallelism exploits the very high and natural parallelism found in Micron’s semiconductor devices—a different kind of parallelism that is more appropriate than conventional CPUs for the class of problems the AP targets. By answering thousands or even millions of different questions about data as it is being streamed across the chip, the AP provides an architecture that delivers the parallelism required to address problems in an efficient, manageable method."

20. A bit of a rant, sorry...

I've been wondering for a long time why DRAM chips don't all contain a few simple-ish, slow-ish ~100 K transistor DRAM-process RISC cores per gigabit, so you only use a few percent of your die area on processors and have 100 processors per GB to play with, on the far side of the main memory bus.

(I realize that transistors in the RISC cores have to be somewhat bigger and less efficiently laid out than DRAM transistors, but even granting a seemingly generous factor of 10 for EACH of those things---a factor of 100 overall---you should be able to put 800 pipelined RISC cores in the space on a 1 GByte DRAM chip, or 80 if you only use 10 percent of the die area, so it's almost all DRAM and you get scores of cores "almost for free.")
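Spelling out the back-of-envelope arithmetic above:

```python
# The back-of-envelope arithmetic above, made explicit.
dram_bits = 8 * 2**30            # 1 GByte of DRAM = 8 Gbit, ~1 transistor per bit
core_transistors = 100_000       # a simple pipelined RISC core
process_penalty = 100            # 10x transistor size * 10x layout inefficiency

effective_core_cost = core_transistors * process_penalty   # 10 M transistor-equivalents
cores_full_die = dram_bits // effective_core_cost
cores_ten_percent = cores_full_die // 10
print(cores_full_die, cores_ten_percent)   # roughly 800 and 80, as claimed
```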

It seems to me that we should have scores if not hundreds of RISC cores out in memory-land per GB of DRAM. Anything less is a basic failure to scale.

(Dave Patterson was pushing this idea about 15 years ago... what happened to it?)

It shocks me that over the last few decades, I've used computers with memories that have grown from 16 kilobytes through 16 megabytes to 16 gigabytes. Holy cow, that's a whole lot of factors of two, and a couple of factors of a thousand. But the number of CPU cores has grown not at all for almost all of that time, and finally is up to four. I guess that on average, each time I increase memory size by a factor of a thousand, I get a factor of two more CPU cores, if I'm lucky. Whee.

Of course I'm happy that the cores I do have are so superscalar and superpipelined and fast, but now I can't use IF statements anymore without a huge hit if the branch is not very predictable---you know, like if my IF is actually asking a question I actually don't already know the answer to. :-/ I really miss my 3-stage pipelined RISC machine, and my 6502, where I had some freaking idea what was actually going on on my computer. :-/

Is this situation as crazy as it seems to me?

21. Matt, yes, I was thinking Connection Machine. (Like maybe a CM-2, with fairly normal-looking processors, but with a CM-1-style 2D grid interconnect rather than a Banyan network.)

I want my CM-2's in my memory modules. I thought I'd have that by now, if I had billions and billions of transistors to play with. I studied Connection Machines back in the Stone Age, not long after I moved up from punching holes in cards. (With flint implements.) This stuff is REALLY, REALLY old in dog years.

22. Yes, but the Connection Machine was never successful. The problem is that it was really hard to program. It is easy to convert a parallel algorithm to a serial one and run it at 100% CPU, but very hard to go the other way. Moving data and code to available processors eats up much more hardware and power than actually doing the work.

Look at the most successful example of parallel processing, the human brain. A neural network of the same size would need about 1 Pflop. But look at how slow we are at simple arithmetic.

23. Out-of-order CPUs are, internally, perfect Connection Machines

24. Originally Posted by Paul W.
Of course I'm happy that the cores I do have are so superscalar and superpipelined and fast, but now I can't use IF statements anymore without a huge hit if the branch is not very predictable---you know, like if my IF is actually asking a question I actually don't already know the answer to. :-/ I really miss my 3-stage pipelined RISC machine, and my 6502, where I had some freaking idea what was actually going on on my computer. :-/

Is this situation as crazy as it seems to me?
CPU designers took the things that could be made faster and made them faster. Conditionals probably only seem slow because they didn't speed up as much as some other things. But conditionals are probably as fast as they could be.

25. Originally Posted by Matt Mahoney
Yes, but the Connection Machine was never successful. The problem is that it was really hard to program. It is easy to convert a parallel algorithm to a serial one and run it at 100% CPU, but very hard to go the other way. Moving data and code to available processors eats up much more hardware and power than actually doing the work.
It sounds a lot different than large numbers of CPUs running Lisp. It's an intriguing idea. They seem to be targeting it at computational biology, so I wonder if this will be in PCs or if it will be a specialized tool for academic and industry research labs.

Check out the technical details: http://www.micron.com/~/media/Docume...ical_paper.pdf

Look at the most successful example of parallel processing, the human brain. A neural network of the same size would need about 1 Pflop. But look at how slow we are at simple arithmetic.
That's why we invented computers.

26. Matt:

"Yes, but the Connection Machine was never successful. The problem is that it was really hard to program. It is easy to convert a parallel algorithm to a serial one and run it at 100% CPU, but very hard to go the other way. Moving data and code to available processors eats up much more hardware and power than actually doing the work."

Of course, I know those things and I'm not assuming that programming parallel computers will be anything like easy in anything remotely like the general case---that's why I stressed using only 10 percent or less of the die area. 99 percent of your code may not run on the hundreds of processors that you scalably get almost-for-free for a small percentage of your RAM dies. Those processors out there in DRAM-land would just be for the subset of stuff for which you CAN figure out how to leverage them, e.g. bottom-upping some stuff into a friendlier form for your conventional cores to chew on with fewer pipeline-killing conditionals.

But some stuff IS just embarrassingly parallel--or maybe just easier to figure out how to do fast enough in a brutal, embarrassingly parallel way than it is to come up with a brilliantly efficient and serial algorithm for.

For example, lately I'm really, really concerned with what amounts to something like image segmentation for lossless data compression---not of actual image data, mostly, but trying to recognize segments of data to be compressed as being likely machine code in one machine language or another, images of one kind or another in one format or another, human-readable text in one language or another and one character/ideogram encoding or another, etc., etc., etc.

Right now I'm working really hard at doing this cleverly on normal CPUs using a variety of tricks and heuristics, and I think I'll do okay, but this is the kind of thing where "if brute force isn't working, you're not using enough." There are things that I could check for analytically with a bunch of cheap parallel processors that I have to do heuristically on a very few, very expensive serial processors, and I have to think really, really hard about the likelihood and consequences of wrong guesses---and I could always be wrong if my test data are not sufficiently representative. That's just a fact---if I had a thousand processors, I'd know certain things for sure, but I don't, so I won't, and have to rely on inspired guesswork.

Even accepting that I have to do a lot of heuristic guesswork on a few extremely expensive processors, it's just basically hard to figure out how to do it, precisely because it's fundamentally guesswork---the most useful and important conditionals, which most usefully discriminate between likely categories of things I might be dealing with, are precisely the least predictable ones. I'm essentially compressing my code down to a set of nested conditionals where all the common branches are as unpredictable as possible, precisely because they're the most informative, probabilistically.

That seems like a basic problem that will never go away. No matter how smart and knowledgeable I am, there's an information limit, and the most useful conditionals will tend to be the most costly, precisely because they yield the most information.

"Look at the most successful example of parallel processing, the human brain. A neural network of the same size would need about 1 Pflop. But look at how slow we are at simple arithmetic."

Sure. I'm not sure what that's supposed to tell me.

27. Hmmm... it seems like there's got to be a fundamental theorem about that. If your compression algorithm is fast because its branches are predictable, that's got to be because the data are redundant in ways corresponding to the branches going the predicted ways.

Compression can only be fast on a serial processor to the extent that the problem is easy.

Efficient compression ought to be slow on branch-predicting uniprocessors, and only fast on parallel slow processors, where you can tolerate chronically unpredictable branches.

(I have no idea if that actually makes sense.)

28. Originally Posted by Paul W.
Hmmm... it seems like there's got to be a fundamental theorem about that. If your compression algorithm is fast because its branches are predictable, that's got to be because the data are redundant in ways corresponding to the branches going the predicted ways.
If it doesn't work out that way naturally, you could probably code it in such a way that predictable bits lead to predictable branches. There's a natural correspondence between the two, because they're both making predictions whose outcomes depend on the data being compressed.

29. ## An argument for data parallel compression

nburns, I do that sort of thing.

But it's like the 90/10 rule... once you've unrolled your inner loops, the frequency distribution flattens out a whole lot and the next level of optimization gets a whole lot harder.

Similarly, once you've accounted for your usual low-information common cases---which branch prediction may mostly handle for you automatically---the remaining high-information discriminations are exactly where the compression action is. And they could go any old way, because they're actually answering good questions rather than confirming strong prior suspicions.

But those things would often be no problem for data parallelism.

For example, suppose I expect that the high bytes of machine words are

1. fairly likely to be zero, but probably not... and if not, they're
2. fairly likely to contain repeats of some other common value, but probably not... and if not,
3. the high bytes of aligned byte pairs are likely to be zero... or if not,
4. the high bytes of aligned byte pairs are likely to be nonzero but identical, or if not,
5. n high bytes are fairly likely to be identical to the n corresponding high bytes of one of the m previous machine words, with the probability going down but the benefit going up with n, and decreasing with m but not negligible for several values of m; if not...
6. one of several less likely but not rare cases is likely to be true...

OR, if none of those things is true...

7. LZ will work fine for the next machine word or two with a texty preloaded dictionary, or
8. LZ will work fine for the next machine word or two with a dictionary preloaded with common machine code riffs, or
9. LZ will work fine for the next machine word or two if you mask out the low 2 bits of bytes and pack those into literals, or if not
10. LZ will be better than nothing if you mask out the low 4 bits of bytes and pack those into literals, or
11. the data is random so far as you can reasonably tell, so you should punt and encode it as a literal.
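The first few word-local cases could be tested with cheap byte comparisons; a sketch (the case names are ad hoc, and only a few of the regularities listed above are covered):

```python
import struct

def classify_word(word: bytes) -> str:
    """Classify one little-endian 32-bit word by a few of the cheap,
    word-local regularities listed above."""
    if word[3] == 0:
        return "high-byte-zero"         # case 1: top byte of the word is zero
    if word[1] == word[2] == word[3]:
        return "high-bytes-repeat"      # case 2: repeats of a common filler value
    if word[1] == word[3]:
        return "pair-highs-identical"   # case 4: e.g. CJK text in UTF-16
    return "other"

words = [struct.pack("<I", v) for v in (0x00001234, 0xFFFFFF01, 0x4E2D4E45)]
print([classify_word(w) for w in words])
```

Since each test only looks inside one word, thousands of such classifications could run independently and in parallel, which is exactly the data-parallel point of the post.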

All of these cases are pretty common for in memory data that may have UTF-8 text, UTF-16 text, UTF-32 text, (in any language, alphabetic or pictographic), Intel machine code, RISC machine code, vectors of numbers, pointer-linked heap data structures, malloc/gc headers, Java object headers, C++ virtual function pointers and alignment padding, numeric tuples, arrays of any damned thing, etc.

Each is very common for some kinds of data I regularly encounter, and I could encounter any of them at any moment, a thousand times a second.

There is just no way to order the tests for all these common-but-not-usual things without usually taking multiple pipeline hits. You have to pick which few regularities you're going to detect and exploit, or be very good at guessing which to try to exploit, using some cheaper discriminator/predictor ahead of time---a big subject unto itself, and you can always be wrong because these kinds of data get interspersed with each other in memory, potentially at almost any granularity. (Think binary buddy allocator.) I can do clever things with multiple bit masks and hash functions and Bloom filters and whatnot, but it's still not going to be really FAST.

But look at the first several common-but-not-usual things. They're all very local properties---the first four are regularities within a single machine word, and the next few are regularities across a few machine words.

If I had a few hundred or thousand slow processors (100x slower than my main processor), I could be checking for all those things exactly and in parallel, and categorizing and preprocessing my data just fine, ahead of my main compression stream and before my main CPU gets to most of them, and it would avoid taking most of those nasty deep pipeline hits. It would rock through the data, consistently handling all the common-but-not-usual cases well.

Like Matt says, what we have here is basically an AI problem of detecting and exploiting regularities, and many common regularities are just trivial or easy to detect bottom-up and in parallel, slowly per unit but efficiently and very fast in bulk. (As in the first few layers of the primary visual cortex.)

Or so it seems to me. But then, I've never actually coded that sort of data-parallel thing, and I expect Bulat to disabuse me. :-/

