# Thread: Compression and mining for Multiple Long Common Subsequences

1. ## Compression and mining for Multiple Long Common Subsequences

I've bugged Matt Mahoney on Quora enough to make him send me here. I'm not exactly
a compression guy, more AI person, so please, forgive me some illiteracy.

Some compression and AI foundations intersect though, and I have some basic questions,
if you guys willing to help me out. I'll try to keep the message/question concise.

Both compression and AI are looking for repetitive patterns for slightly different purposes [simplified].
Matt gave me an example of BANANA sequence with repetitive "NA" pattern, which could be exploited.
I enhanced the BANANA by breaking "NA" with random symbols, like "BANcANdA". Now, there are
no easily discoverable continuous patterns-substrings, but still discoverable subsequences
"N*A", with * being a wildcard, which might be exploited.

Obviously, on mutated but very short BANANA level, those subsequences only make compression worse,
because of handling logistics. On the long texts it might make sense.

Another example is a string like "a--b++c--d++e ", which is compressed to "aAbBcAdBe" with 2 entries dictionary "—" = A and "++" = B.
Now compressing that string again with 1 entry dictionary "A*B" = C (* is a wildcard) we can get:
a) aCbcCde (13 symbols --> 7 symbols) losslessy
b) CC (13 symbols --> 2 symbols) lossy

The number of possible combinations-patterns of length M made from N possible symbols is N^M.
A wild card introduction might be treated as extra symbol "*" and number of possible patterns
becomes (N + 1) ^ M, which grows faster then N^M.

The task of mining for "wildcarded" patterns is close to Multiple Longest Common Subsequences (MLCS) search.
I'm experimenting with "probabilistic MLCS miner", which processes an arbitrary texts corpora for MCS (cannot claim longest).
Here are some results:

It sure makes no sense in practical compression, because it takes days to mine and approach is explicitly corpus specific.
There are unexpected useful consequences from theoretical and AI viewpoint though, which is of interest I can get into later.

Now comes the first question: are there compression methods based on search of MCS-like patterns?
And the second one: I like my long 16-symbols long patterns, do you guys?

Matt nudged me in the "DNA compressors" directions (whatever they are), but brief papers reading shows: people shy away from dealing
with ALL encountered skip-patterns, and tend to minimize the wildcards number [to one?].

Thank you for reading. Nice to meet you.

2. > It sure makes no sense in practical compression,
> because it takes days to mine and approach is explicitly corpus specific.

Sometimes there're contests where that is applicable.
http://prize.hutter1.net/
http://mailcom.com/challenge/

> Now comes the first question: are there compression methods based on search of MCS-like patterns?

Actually all modern LZ77 compressors (with rep-matches and parsing optimization; eg. LZMA)
make use of some wildcard patterns.
You can see an example here: https://encode.ru/threads/1288-LZMA-markup-tool

> Matt nudged me in the "DNA compressors" directions (whatever they are), but
> brief papers reading shows: people shy away from dealing with ALL
> encountered skip-patterns, and tend to minimize the wildcards number [to one?].

That's mostly because there're TBs of DNA data and its impractical to perform
that kind of analysis just to compress a file.

But you may be able to find some specific patterns that can be used
in an actual compressor.

An example of DNA data:
http://corpus.canterbury.ac.nz/descr...ge/E.coli.html
http://corpus.canterbury.ac.nz/descriptions/#large

3. ## The Following User Says Thank You to Shelwien For This Useful Post:

Bullbash (22nd March 2019)

4. > Sometimes there're contests where that is applicable.
Oh man, not much competitive (sad sigh...)

> Actually all modern LZ77 compressors ... make use of some wildcard patterns
My staring at 1288-LZMA-markup-tool PNGs yielded approximate max size of the patterns in the dictionary, like 14 symbols.
There are "some wildcards", which answer my question. But... any 14-symbols sequence produce 2^14 - 1 of "wildcard sub-patterns",
and this is what I'm after.
So, if you allow me another question - how big are the typical/max dictionaries when max sequence length <= 14?

You do not obviously keep all the patterns, right? What are the criteria to discard low utility patterns?

My approach: I create a *full* dictionary of short patterns <= 2 (3), encode the input in terms of that dictionary and
build another "dictionary layer" upon the new input, again with lengths <=2 (3). Now my second level dictionary keep patterns 4(9) symbols length.
And so on. Each next dictionary ~x10 bigger then previous one, so at some level I must get rid of "low utility" patterns.

I mean, I ask my algo : find me some looong patterns in some looong texts, and it does (no dynamic programming, tables r too big).
I mine those patterns for text comprehension like, wondered if you guys use something like that for compression?

Another question: are there "hierarchical dictionaries" at all?

I've red some papers on DNA patterns discovery, state-of-the-art claims 10K symbols length sources with 4-symbols alphabets processable only... I do a few millions with 100 symbols alphabets... what to think?

5. > My staring at 1288-LZMA-markup-tool PNGs yielded approximate max size
> of the patterns in the dictionary, like 14 symbols.

Its not really maximum... just whatever minimizes the compressed size of the file.
Actual max match length is 273 for lzma (=2+16+255)
There's also no explicit dictionary, it just can reference any substring

> There are "some wildcards", which answer my question.
> But... any 14-symbols sequence produce 2^14 - 1 of "wildcard sub-patterns",
> and this is what I'm after.

Yes, but there's no point in considering patterns with less than 2 instances.
So, rather than counting all possible subsets, LZMA encoder would compare
current data with known similar strings (having the same prefix of 2+ symbols).
If it improves compression even with some differences, LZMA would use a partial match too.

> So, if you allow me another question - how big are the typical/max dictionaries when max sequence length <= 14?

LZMA effectively can use window size up to 2G, actual memory use is window_size*11.5.
With that it would be able to find any substring of any size (well, 2+ symbols) within the "window".

> You do not obviously keep all the patterns, right?
> What are the criteria to discard low utility patterns?

Their codelengths are longer?
but in the end its the same Shannon's entropy.

> And so on. Each next dictionary ~x10 bigger then previous one, so at some
> level I must get rid of "low utility" patterns.

Once upon a time, I tried writing a tool that worked like this:
0) expand bytes of original data to 32-bits per symbol
1) scan the data, find the pattern <s1><gap><s2> with max occurrence count
2) replace all <s1><gap><s2> in data with a new symbol (actually: delete <s2>, replace <s1> with new symbol)
3) goto 1

> I mean, I ask my algo : find me some looong patterns in some looong texts,
> and it does (no dynamic programming, tables r too big).
> I mine those patterns for text comprehension like,
> wondered if you guys use something like that for compression?

In a way, that's why only actual compression algorithms are only developed with C/C++.
The more popular languages are just too inefficient for this.

Its a convenient tool for finding string matches.
Only needs 5N memory for full index where LZMA's bt4 tree needs 11.5N.

> Another question: are there "hierarchical dictionaries" at all?

Maybe? For example, PPM/CM papers would frequently talk about "order-N models" and such.

> I've red some papers on DNA patterns discovery, state-of-the-art claims 10K
> symbols length sources with 4-symbols alphabets processable only...
> I do a few millions with 100 symbols alphabets... what to think?

Might be not that kind of processing.
I mean, scanning of actual DNA molecules is still done in chunks.

Btw, I actually tried looking for patterns with gaps in LZMA output.
Found stuff like this (world192.txt):
Code:
```<,000 kW capacity; ??,>
< billion, per capita \$??0>
< (1988), ??.>
<a, ?a>
<] ?2>
<; ?,???,>```
But nothing really interesting.

6. ## The Following User Says Thank You to Shelwien For This Useful Post:

Bullbash (23rd March 2019)

7. Oh man, I'm not prepared, need time to digest.

Look, the very general view of AI (AGI flavor) is a big dictionary of input subsequences (as opposed to substrings) loosely associated with a big dictionary of output subsequences-patterns. The input dictionary is an Environment model.

Everybody tries to avoid subsequences for the sake of efficiency (C... blah - I do respect C and you guys, myself started with Assembly). I mean, you speak in terms of substrings (for the sake of efficiency), I speak in terms of subsequences (for the sake of generality). I enjoy your input and do not want to over use your time. Can argue almost on every paragraph of yours though. And that is not the best approach.

If we could agree on one single piece of ground truth per answer, we might slowly progress... if you willing to. Hutter and Mahony state that proper compression is an AI foundation - so maybe it is worth it, the dialog I mean. There might be something interesting for both communities.
So, I suggest - reglament first.
Let me think on the first statement.

Ok, do we agree that "no explicit dictionary",(references, hashes, whatever - any form) is still a dictionary with a defined entry size and defined dictionary size?

8. > Hutter and Mahony state that proper compression is an AI foundation -
> so maybe it is worth it, the dialog I mean.

https://en.wikipedia.org/wiki/Maximu...ood_estimation
https://en.wikipedia.org/wiki/Minimu...ription_length
https://en.wikipedia.org/wiki/Minimum_message_length
https://en.wikipedia.org/wiki/Kolmogorov_complexity

All these things are equivalent to compression.
So yes, I agree that compression is an AI foundation.
And in addition, I have an opinion that
(1) Lossless data compression is the only approach with objective model quality metric;
(2) Data compression still has many techniques which are not duplicated in AI research;
As a practical proof of (2), there's no record-breaking compressor based on AI research,
even despite contests with money prizes.
Mahoney himself started his project with a NN predictor,
which didn't have any success: http://mattmahoney.net/dc/nnword.html
There's also a newer attempt to use LSTM for compression: http://mattmahoney.net/dc/text.html#1750
but on its own its also far from impressive.

I do think that the main reason for this is actually the fact that you can't
compress a file of reasonable size with a python script, simply due to speed/memory reasons.
But still, its not like people didn't try.

> Ok, do we agree that "no explicit dictionary", (references, hashes, whatever - any form)
> is still a dictionary with a defined entry size and defined dictionary size?

Sure, I can work with any well-defined terminology.
Just keep in mind, that in data compression the term "dictionary" is kinda overused.

On one hand, the normal english definition as "a set of words" is widely accepted in compression.
For example, there's a well-known preprocessing method, which improves text compression
using a dictionary: https://github.com/inikep/XWRT

Also, there's a historical distinction between compression methods which used an explicit
dictionary (LZ78/LZW) and ones with references simply to "already processed data".
https://en.wikipedia.org/wiki/LZ77_and_LZ78#LZ78

But in practice, some people (eg. 7-zip/lzma documentation) would call
the data window which can be used for reference as a dictionary. https://en.wikipedia.org/wiki/Lempel...ata_structures
And recently it became popular to provide an option to "train"
a compressor with precomputed reference data... which also gets called a dictionary. https://github.com/facebook/zstd
So yeah...

9. ## The Following User Says Thank You to Shelwien For This Useful Post:

Bullbash (23rd March 2019)

10. > Sure, I can work with any well-defined terminology.
Good. I'll try to get the next statement tomorrow... sorry, too many beers tonight.

> Lossless data compression is the only approach with objective model quality metric;
lossy is the right answer, but we get to this a bit later.

Have some rest, I appreciate curious men. We get to this later, promise. We speak about complexity and NN application to compressing things.
I did try communicate some stuff on Quora, pretty messy though. If you really want to, you might check that https://masterdata.quora.com...
But we might get it better structured as we speak. I'm not looking for money or glory any more, just tryin' to shed some light... if people willing to listen.
Thank you for conversation. Later.

11. > lossy is the right answer, but we get to this a bit later.

For content digitized from analog sources, sure.
But you can't really compare two lossy image compression algorithms and say which is better.
Not what I said, anyway.

Btw, in practical data compression, lossy methods are frequently not an option, simply because of external dependencies.
For example, we can apply lossless compression to jpeg images found in any data,
but lossy approach would require implementing 100% stable parsers/generators for all formats with embedded jpegs (pdf,docx etc).

12. Let's set the stage up. We start from very origins of memory concept and quickly traverse to compression.

Point A). Consider a trivial imaginary creature with 3 binary sensors (8 potential input situations) and 2 binary motors-effectors 4 possible responses), trying to adapt to an Environment.

The best adaptation strategy is to memorize all 8x4 = 24 input-output combinations along with associated rewards. The strategy is universal to any Environment and reward nature. Humans might keep the records in a
contingency table, creature's cells might develop connections to form a graph (what we call a NN).

Point B). When sensors/motors complexity grows - the number of "spatial" perceived observations grows. There will be "temporal" patterns as well, holding time-sequences of "spatial" observations.

No memory capacity could keep up with exponential growth of the contingency table (a.k.a. dictionary a.k.a. NN). There is need for compression of said dictionary. Ideally it is lossless compression, but it is not enough.
A complex entity must resort to lossy one.

If one wants to compress files - s/he applies lossless or lossy wizardry, if another one tries to build AI - then only lossy will do.

------------------
Can we agree on points A) and B) ? If not - let's discuss, if yes - I'll get next statements to advance.

An absence note: I'll be away for a week, not sure if be able to attend, please excuse. I'll be bakh (c).

13. > If one wants to compress files - s/he applies lossless or lossy wizardry,
> if another one tries to build AI - then only lossy will do.

Lossy memory is ok.
I only said that "Lossless data compression is the only approach with objective model quality metric".
That is, we can compress a file and see its compressed size - which is the model quality metric
(although sometimes it may be necessary to also include compressed decoder size).
And then we can take decoder and compressed file to another machine and run the decoder.
If decoder output matches original file, we have a proof that
(1) this model really can compress this file to given size;
(2) there're no errors (like implicit use of "future information") or "human factor" in
compression estimation.

This can't be said about lossy compression (where instead of original file,
only something "similar" is decoded), and can't be said about most of NN models,
where its always easy to overtune the model (ie store half of the data in the model coefs),
or simply measure its performance with a specially designed method,
which would make it better than competition.

> An absence note: I'll be away for a week, not sure if be able to attend, please excuse. I'll be bakh (c).

Sure, I'd likely still be around in a week.

14. ## The Following User Says Thank You to Shelwien For This Useful Post:

Bullbash (24th March 2019)

15. Originally Posted by Shelwien
>So yes, I agree that compression is an AI foundation.
And in addition, I have an opinion that
(1) Lossless data compression is the only approach with objective model quality metric;
Originally Posted by Shelwien
But you can't really compare two lossy image compression algorithms and say which is better.
I second that. I completely agree.

16. Originally Posted by Shelwien
> Ok, do we agree that "no explicit dictionary", (references, hashes, whatever - any form)
> is still a dictionary with a defined entry size and defined dictionary size?

Sure, I can work with any well-defined terminology.
Just keep in mind, that in data compression the term "dictionary" is kinda overused.

On one hand, the normal english definition as "a set of words" is widely accepted in compression.
For example, there's a well-known preprocessing method, which improves text compression
using a dictionary: https://github.com/inikep/XWRT

Also, there's a historical distinction between compression methods which used an explicit
dictionary (LZ78/LZW) and ones with references simply to "already processed data".
https://en.wikipedia.org/wiki/LZ77_and_LZ78#LZ78

But in practice, some people (eg. 7-zip/lzma documentation) would call
the data window which can be used for reference as a dictionary. https://en.wikipedia.org/wiki/Lempel...ata_structures
And recently it became popular to provide an option to "train"
a compressor with precomputed reference data... which also gets called a dictionary. https://github.com/facebook/zstd
So yeah...
@Bullbash, reading the posts I think your "dictionary" = the *information* that a data compression model uses. (In AI: environment representation.)
Let's not call it a "dictionary". It really makes a reader from the data compression community narrow it down to a list of words or list of strings - which is not what you mean.

17. Originally Posted by Bullbash
Point A). Consider a trivial imaginary creature with 3 binary sensors (8 potential input situations) and 2 binary motors-effectors 4 possible responses), trying to adapt to an Environment.
The best adaptation strategy is to memorize all 8x4 = 24 input-output combinations along with associated rewards. The strategy is universal to any Environment and reward nature.
I suppose your definition of this trivial "Point A)" setup is the following: there is no spatiality nor temporality (as opposed to "Point B"), and there is a direct connection between the inputs+responses and the rewards. I agree: in this case memorization would help the most in modeling the environment.

A small correction: memorization may be a part of a model/strategy, but itself is not a "strategy". For a working model you need to define how to calculate the best output (response) based on the current inputs and previous success/failure events. In the case of "Point A)" there is a direct mapping. This is what you mean - if I understand you correctly.

Originally Posted by Bullbash
Point B). When sensors/motors complexity grows - the number of "spatial" perceived observations grows. There will be "temporal" patterns as well, holding time-sequences of "spatial" observations.
Originally Posted by Bullbash
There is need for compression of said dictionary. Ideally it is lossless compression
I think I know what you mean. But "lossless compression" may not be the best choice of words here.

a) A creature with perfect memory: a creature living in a static maze (or in any other environment with a static well defined structure) it performs the best when it perfectly remembers the maze it traversed. The model represents the environment in a "lossless" way.

b) A creature that forgets: the larger the maze and/or the smaller is the memory of the creature, or when the maze changes frequently, it would be acceptable when our creature forgets maze parts it visited longer time ago and are farer from its current location in favor to newly visited parts. It's not connected to lossy compression, however! The model simply does not have "perfect" memory.

c) When the environment is too fluid/fuzzy or too complex so that it is not worthy to remember it perfectly, a neural network would work the best. We, humans are such creatures. I still would *not* say that there is "lossy compression" here. The model does represent the environment in a "lossy" way, but there is no "compression" involved. To be on the same page, let's then call it lossy memory, or lossy representation of the environment.

Originally Posted by Bullbash
If one wants to compress files - s/he applies lossless or lossy wizardry, if another one tries to build AI - then only lossy will do.
"wizardry" - good word!
Please note, that what you mean is not "lossy compression". What you mean is: the model represents/remembers the environment in a lossy/fuzzy way.
When we talk about "lossy compression" in this forum we mean this: when compressing images/audio or video data, for the sake of a higher compression rate we accept that the uncompressed file won't be identical to the original file and our eyes and ears and brain will do their best to recover and perceive the intended information properly.

"Lossy compression" is about the result, not about the internal representation of the environment.

Please note, that there are lossless compression programs that have lossy "memory" in a "fuzzy" model.

I hope I caught a possible source of misunderstanding.

18. Have you looked at GLZA? It is a dictionary compressor that iteratively finds and replaces common sequences with dictionary symbols. It has Pareto Frontier decompression speed and better compression ratios than other LZ style algorithms for many text and DNA files.

19. I do know of it. I don't think it supports fuzzy matches though?
Btw, is there a way to dump the dictionary / file parsing results out of GLZA?

20. Originally Posted by Shelwien
I do know of it. I don't think it supports fuzzy matches though?
Btw, is there a way to dump the dictionary / file parsing results out of GLZA?
No, it does not support fuzzy matches (yet?). There has been some thought and a little discussion, but nothing implemented.

Command line option -v1 (or -v2) prints the dictionary to stdout if used during compression. It doesn't show the hierarchy though.

21. ## The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

Mike (26th March 2019),Shelwien (26th March 2019)

22. I made a modified lzma decoder and a perl script to sort its output, it was used to generate patterns in attach above.
That is, I consider lzma's match-repmatch patterns to be equivalent to fuzzy matches.

http://nishi.dreamhosters.com/u/lzmapat_v0.rar

23. I'm confused. In classic BANANA, what is recurring "NA" ? Entry in a dictionary? "the *information*? "environment representation ?

To me it looks like an entry of a dictionary. It might be expressed as an element in a "list of strings" or as a list of references or any other way.

So, if I go thru some text and collect recurring substrings and memorize it in any form - would we call it a dictionary?

Or what term we might use for a container of repetitive substrings?

24. I must put it vague: instead of
> Point A). Consider a trivial imaginary creature with 3 binary sensors
Point A). Consider a trivial imaginary creature with 3 ordered binary sensors.

There is explicit spatiality now, residing in a table-dictionary comprising 8 immutable "strings". All "strings" (000, 001, ..., 111) hold distinct spatial patterns, or I understand "spatial" wrong.

I agree. memorization is not a strategy, it is a strategy foundation.

I feel like there is no big misunderstanding here: I'm trying to say, that in case of simple "source strings" (compression) or simple observable Environment (AI) - they are the same -
the best way to prepare to compress or comprehend is to memorize all subpatterns in bulk - all of them. And only after decide, which one to drop out - to forget (if any).

When "source strings" aka Environment get more complex (in spatial and/or temporal way), memorization of all substrings/subsequences is impossible and we must start
discarding some elements of a Dictionary(?) as we go, on the fly. As soon as we start forgetting substrings/subsequences - we might lose ability to
a) decompress back exactly to source
b) model exact Environment.
That is why I tried to use term "lossy". If there is a better term, tell me, i will gladly follow.

Ideally, we'd find some agreeable statements, instead of my clumsy Point A,B). That is why I'm here, bugging you guys, compression profies.

if source/Environment is simple (from memory capacity standpoint?), it is better to memorize all substrings/subsequences/observations/experiences to whatever processing we 'r going to subject it (compress, comprehend).
when total memorization is impossible - because of the source/Environment complexity, we must start losing some pieces, and lose ability to restore source/Environment exactly.

Complexity is the reason of forgetting some information.

If you can put it better, please do.

25. >No, it does not support fuzzy matches

are those considered "fuzzy matches"?

or

26. Originally Posted by Bullbash
I'm confused. In classic BANANA, what is recurring "NA" ? Entry in a dictionary? "the *information*? "environment representation ?

To me it looks like an entry of a dictionary. It might be expressed as an element in a "list of strings" or as a list of references or any other way.

So, if I go thru some text and collect recurring substrings and memorize it in any form - would we call it a dictionary?

Or what term we might use for a container of repetitive substrings?
Dictionary seems as good a term as any to me. Others may feel differently.

GLZA creates a context-free grammar. For GLZA, the "dictionary" is essentially a container for the the grammar rules that it creates.

27. > I'm confused. In classic BANANA, what is recurring "NA" ?
> Entry in a dictionary? "the *information*? "environment representation ?

It may be a dictionary entry, if there's an explicit dictionary.
That approach _is_ used in some compression methods, for example LZ78/LZW and GLZA mentioned above.

Normally its just reference data though.
LZ77 would more likely compress "BANANA" as 'B','A','N',{-2,3}
where "-2" is distance from offset 3 to offset 1, and "3" is match length.
Note that match is overlapped - you can't do that with a dictionary.

> To me it looks like an entry of a dictionary.
> It might be expressed as an element in a "list of strings"
> or as a list of references or any other way.

As I said, even some compression developers do call any reference data as dictionary.

> So, if I go thru some text and collect recurring substrings and memorize it in any form -
> would we call it a dictionary?
> Or what term we might use for a container of repetitive substrings?

You can use the term, if you like it.

In practice though, modern computers don't have any hardware support for text strings,
so "a container of substrings" might not physically exist.
Its important, because it directly affects the algorithm's complexity.
https://en.wikipedia.org/wiki/Analysis_of_algorithms

> Complexity is the reason of forgetting some information.

As was said, there's a difference between lossy memory and lossy compression.
A compression algorithm might not remember everything, but still output
enough information to losslessly reconstruct input information.

Also, this lossless property of compression algorithm is important,
because it provides an easy way of objective comparison of different algorithms/models.

> memorization of all substrings/subsequences is impossible and we must start discarding

Sure, compression algorithms tend to always update their reference data.
They usually would attempt to remember a limited amount of actual input data
(called "data window"), but derivative data, such as statistics,
would be usually discarded all the time.

Still, if its substrings of input, we can actually maintain access to all of them.
Computers today have plenty of storage (TBs at least), so it would be hard to
find enough relevant+unique information to fill all of that.

Keep in mind that compression algorithms can be used to remove any duplicate information
from the "memory".

> are those considered "fuzzy matches"?

I don't think so. Something like "says ~ had" would be, but your examples seems to be plain substrings -
wildcards on the sides don't matter for the search algorithm.

28. Originally Posted by Bullbash
I'm confused. In classic BANANA, what is recurring "NA" ? Entry in a dictionary? "the *information*? "environment representation ? To me it looks like an entry of a dictionary. It might be expressed as an element in a "list of strings" or as a list of references or any other way. So, if I go thru some text and collect recurring substrings and memorize it in any form - would we call it a dictionary? Or what term we might use for a container of repetitive substrings?
In your first post, and some others when you talk about BANANA and subsequences (substrings), the term "dictionary" is a good term.
But when you talk about AI, sensors, reward and the representation of the environment is a general way, the term "dictionary" is confusing (at least to me).

29. Originally Posted by Bullbash
we must start discarding some elements of a Dictionary(?) as we go, on the fly. As soon as we start forgetting substrings/subsequences - we might lose ability to a) decompress back exactly to source
No-no. We don't lose that ability.
I'm trying to find the source of the confusion. I had a quick glance at your quora link, too.

Is this is what you have in mind: find useful substrings, make a dictionary from them, compress it, then compress the data. When uncompressing, uncompress the dictionary first and restore the data using the dictionary?

When you say "we might lose ability to decompress back exactly to source" what do you mean? You know, you won't have all the substrings in that dictionary, otherwise the compressed dictionary size + compressed data size would be too large (because of the too large dictionary).
Generally when you want to compress the data better you'll need a larger dictionary. But having a larger dictionary will cost you bytes.
You will need to find a good balance/tradeoff.

Another method would be (as you wrote) "on the fly". The compression program would use only substrings it encountered in the input so far to encode the next sequence. So as compression progresses the dictionary grows. When the computer's memory is exhausted it needs to forget sequences that are small and/or encountered a long time ago and/or encountered too few times. Or it may use a sliding window (see in Shelwien's post). In any case the decompressor will do the same - so at any byte position the compressor and the decompressor have the same dictionary and they both expand it as compression/decompression progresses.

I have the feeling that you know all this. But if you do, where the confusion comes from?

30. > LZ77 would more likely compress "BANANA" as 'B','A','N',{-2,3}

I'd called that a different form of a dictionary presentation.

But we are drifting away from some simple points I was trying to check with you step by step.
I'll try to produce the list of those points/observations, and lets see where it will take us:

a) collecting recurrent substrings (as opposed to subsequences) are the good [but not only] way to prepare to compress *text* (as opposed to tensors/images)
b) number of all possible substrings of length L in text of length M is (M - L)/L - it is not that big, thus usually manageable, could be kept in some kind of a dictionary
c) number of possible subsequences derived from each substring of length L is 2^L, so in worst case scenario a text of length M contains 2^L * (M - L)/L, any dictionary capacity is not enough
d) a subsequence is generalization of a substring and it might hold invisible for substrings high utility patterns. Example: BANANA (NA - recurrent), BAN+AN-A (N*A - recurrent) and it is invisible for substrings searcher.
In AI, subsequence-like structures called patterns. They are important, and I'm working on general patterns searchers (might be interesting for compression too, thus I'm here).
When working with patterns, the best strategy is to memorize patterns to capacity and then drop out irrelevant ones. In compression, relevancy defined by frequency of occurrence.
Following that recipe, I mined some patterns of length 18 symbols on 20GB of RAM. Example: "the************the" in 3MGb of War and Piece. That is I searched in the
2^18 * (3,000,000 - 18 ) / 18 dimensional space.
e) to search for patterns in high dimensional search space there must be a special dictionary employed, I'd call it hierarchical. An attempt to collect 18-symbols patterns directly exhausts any RAM in first KBs of source (call it an attention span). If patterns dispersed beyond this attention span (and they do), there will be no recurrence and no clue what to drop out.
So, one must collect smaller size patterns of say of size 2 or 3 symbols in a rather small first-level dictionary and substitute those patterns with dictionary indexes in a sliding window, then slide along the text further.
Now, that encoded piece of source gets examined for again short 2 or 3 symbols patterns, which creates a second layer of the dictionary, which memorizes now patterns of length 2*2 or 2*3 or 3*3, parameters dependent. And that layering process could be extended indefinitely. Experiments show, that without droping out, layers of dictionary grow like x12 time bigger if each layers patterns size is 2. Not exponentially.
And "attention span" is much much longer, so we might store a lot of long subsequences-patterns along with younger sisters - substrings, enough to get more or less feasible stats to start intelligent cleanup.
So, to look for 18-symbols long patterns I defined dictionary layers with patterns sizes like 3*3*2 = 18. And I found some full substrings and some subsequences: "said Prince Andrew" - 27 times (substring), "the************the" - 34 times. Could it be used in compression? my guess - why not, though such preprocessing is a time/resource consuming.
BTW, keeping such layered dictionaries in form of a graph produces structures (surprise!) similar to what we call ANNs. Those exotic ANNs are very good of supporting fast huge dictionaries (or Environment models if you will)
Question: are there known multi-layered dictionaries out there?
f) I'm not much interested in compression itself. but some effects of the approach:
f.1) said dictionaries-ANNs are universally good for looking for patterns in images too, just define the pattern size as MxN. But image patterns are even more RAM hungry (mysterious visual cortex for God sake).
f.2) they are good for images stream-video and multi-streams as well. I implemented and open-sources the image-text engine with the rapper App and stopped there - not enough RAM.
f.3) they are good for packing motoric circuitry, a stream is a stream. And, well, mapping of input to output is again a collection of patterns (inference that is).
Those are supposed to be interested to AI people, but they generally shy away (unlike you guys!)

From that vantage point I have a couple of questions for compression community:
f.4) are there adversarial examples for Claude Shannon "source coding limits" of about 1.2 bps? Not like sequence of primes or constant-based (pi) heuristics, real texts-like?
f.5) may Shannon theory be applied to images?
f.6) may Kolmogorov complexity be generalized to images?

That is all, a bit messy (have my excuses, did not expected to get to the computer today, but here we are... disappearing for a few days again).
What do you think?

31. Originally Posted by Bullbash
> Question: are there known multi-layered dictionaries out there?
BPE
Re-pair
Sequitur
IRR
GLZA

32. ## The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

Bullbash (27th March 2019),Gotty (27th March 2019)

33. > I'd called that a different form of a dictionary presentation.

That's right in theory.
We can consider all substrings of processed data to be a part of our dictionary
and LZ77's {offset;length} pair to be a fancy dictionary indexing scheme.
{distance;length} is more annoying, since it implies that dictionary is adaptive
and reindexed after every processing step.

> a) collecting recurrent substrings (as opposed to subsequences) are the
> good [but not only] way to prepare to compress *text* (as opposed to
> tensors/images)

1) in theory, it may be
2) in practice, in a computer program, its preferable to not have any explicit strings at all.

Current computer hardware is not designed to work with text strings,
and data compression and related tasks are computationally intensive -
even for 1D data types we can't afford to use wrong tools.

> b) number of all possible substrings of length L in text of length M is (M
> - L)/L - it is not that big, thus usually manageable, could be kept in some
> kind of a dictionary

Its actually M-L+1.

> c) number of possible subsequences derived from each substring of length L is 2^L,
> so in worst case scenario a text of length M contains 2^L*(M-L)/L, any dictionary capacity is not enough

Here my point about using right tools becomes important.
Yes, we can't afford to store all subsets of all n-grams as string objects.

Yet, depending on implementation, working with these subsets may be possible.

For example, there's such a term as "lazy evaluation" -
1) we can generate necessary subsets on demand at any time, so there's no point to store them all.
2) we should actually consider what kind of subsets are relevant;
with additional assumptions like "using a subset should improve compression",
or at least "only subsets of a string that occurs at least twice are interesting",
we won't have such "capacity problems".

This is actually why programmer is still a profession -
we don't have automated translation software for tasks like this.

> I mined some patterns of length 18 symbols on 20GB of RAM.

Yeah, well, I previously demonstrated that lzma compressor
handles all substrings of lengths up to 273, including
subsets of these strings, using only 11.5N memory, where
N is filesize, so 35Mb in your case.

Its a matter of correct implementation.
It could be 5N using BWT, or even 0.25N using compressed BWT like here: http://www.mattmahoney.net/dc/text.html#1605

> Question: are there known multi-layered dictionaries out there?

Yes, but it seems to me that you're struggling with your programming language here,
rather than actually solving the problem.

https://en.wikipedia.org/wiki/Burrow...eler_transform
Its how fast string indices are actually built.

> f.4) are there adversarial examples for Claude Shannon "source coding limits" of about 1.2 bps?
> Not like sequence of primes or constant-based (pi) heuristics, real texts-like?

http://www.mattmahoney.net/dc/text.html shows
a 1Gb text compression record at 116940874*8/1E9=0.936bps

But for e.coli dna sequence, which is technically text, we have 1095987*8/4638690=1.89bps.
Here we can see it in actual text: http://corpus.canterbury.ac.nz/descr...ge/E.coli.html

> f.5) may Shannon theory be applied to images?

Yes?
At least if you mean this: https://en.wikipedia.org/wiki/Inform...rmation_source
Then it can be perfectly reached by these coding methods:
https://en.wikipedia.org/wiki/Arithmetic_coding
https://en.wikipedia.org/wiki/Asymme...umeral_systems

Direct statistical contextual coders for images also compress better than
transformation-based ones.

> f.6) may Kolmogorov complexity be generalized to images?

It can be applied to any data really.
Matt Mahoney even made a compressor called ZPAQ, based on
a special VM/bytecode designed for compression -
in theory, any file can be turned into a zpaq archive, containing config + incompressible noise data,
which would be a good approximation of KC.

For images, there're some attempts at actual implementation:
https://en.wikipedia.org/wiki/Fractal_compression
https://3dprintingindustry.com/news/...models-146286/

34. ## The Following User Says Thank You to Shelwien For This Useful Post:

Bullbash (27th March 2019)

35. Originally Posted by Kennon Conrad
BPE
Re-pair
Sequitur
IRR
GLZA
Bullbash, you may like the paper Identifying Hierarchical Structure in Sequences: A linear-time algorithm. It's about Sequitur.
Also don't forget to check out the paper in the first post in https://encode.ru/threads/2427-GLZA

36. ## The Following User Says Thank You to Gotty For This Useful Post:

Bullbash (27th March 2019)

37. Originally Posted by Bullbash
So, to look for 18-symbols long patterns I defined dictionary layers with patterns sizes like 3*3*2 = 18. And I found some full substrings and some subsequences: "said Prince Andrew" - 27 times (substring), "the************the" - 34 times. Could it be used in compression?
Yes, it can be used in compression.

I'll try to squeeze here some info about how paq8px works with such patterns/sequences and how it stores them (well, it does not actually store them at all).

Paq8px is a context mixing data compressor. It predicts next symbols by "averaging" the predictions of a large number of models. (More precisely: symbols are bits, "averages" are weighted averages, where the weights are updated adaptively.)
Being a context mixing compressor paq8px considers *multiple* different sequences just preceding the current file position. A couple of models that seem to be related to your ideas:
(The list is not exhaustive, and then descriptions are simplified.)

There is a "NormalModel". It collects statistics about the previous 1 byte, 1-2 bytes, 1-3 bytes, ... 1-14 byte sequences.
There is a "SparseModel". It collects statistics about the previous 1-8 bytes with gaps.
There is a "MatchModel". It consideres long sequences only (with length: 5-65535) and it predicts the next symbol based on the last such long match.
There is a "sparseMatchModel". See the paq8px thread: #post57767 #post57770 #post57860
There is a "Wordmodel". Among others it uses the previous 1-5 words as contexts. Not all 5 words are considered. Some such contexts are: "word * word", "word * * word", "word word".

These models or their equivalents are (or may be) also present in other context mixing data compressors like emma, cmv, paq8pxd and cmix.

Paq8px does not store the strings/sequences.
It uses the hashes of the strings/sequences to access and update statistics in hash tables. By hashing it uses less memory. Hashes are faster to process than strings/sequences. And more importantly: compression is better by being able to keep more information in memory.
When hash tables are full, it simply overwrites old hash elements with new ones.

"dictionary":
As you can see paq8px does not have a dictionary in the classical sense. It has a blurry fixed amount of memory so to speak. One of your major questions is about the storage of a "large dictionary". This is one possible approach.

"lossy" vs. "lossless":
Using hashes is lossy and overwriting statistics is lossy. But the compression is lossless.

38. ## The Following 2 Users Say Thank You to Gotty For This Useful Post:

Bullbash (27th March 2019),schnaader (27th March 2019)

39. Thank you guys for input. Will go thru and come back .

Page 1 of 2 12 Last

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•