> My staring at 1288-LZMA-markup-tool PNGs yielded approximate max size
> of the patterns in the dictionary, like 14 symbols.
It's not really a maximum... just whatever minimizes the compressed size of the file.
Actual max match length is 273 for LZMA (= 2+16+255).
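(If I recall the length coder right, that's the minimum match length of 2, plus the 16 lengths covered by the two short length slots, plus up to 255 more from the 8-bit "long length" slot.)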
There's also no explicit dictionary; it can just reference any substring of already-processed data.
> There are "some wildcards", which answer my question.
> But... any 14-symbols sequence produce 2^14 - 1 of "wildcard sub-patterns",
> and this is what I'm after.
Yes, but there's no point in considering patterns with fewer than 2 instances.
So, rather than counting all possible subsets, an LZMA encoder compares the current data with known similar strings (those sharing the same prefix of 2+ symbols).
If it improves compression even with some differences, LZMA will use a partial match too.
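Just to sketch the matching part (the names and the brute-force index here are mine; a real encoder maintains hash chains or the bt4 tree incrementally, in C/C++):

    def build_index(data, upto):
        # earlier positions grouped by their 2-byte prefix (a poor man's hash chain)
        index = {}
        for p in range(max(0, upto - 1)):
            index.setdefault(data[p:p + 2], []).append(p)
        return index

    def longest_match(data, pos, index, max_len=273):
        # extend every candidate sharing the 2-byte prefix, keep the longest;
        # the encoder may still prefer a shorter match (or plain literals)
        # if that happens to code cheaper
        best_len, best_dist = 0, 0
        for cand in index.get(data[pos:pos + 2], []):
            l = 2
            while pos + l < len(data) and l < max_len and data[cand + l] == data[pos + l]:
                l += 1
            if l > best_len:
                best_len, best_dist = l, pos - cand
        return best_len, best_dist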
> So, if you allow me another question - how big are the typical/max dictionaries when max sequence length <= 14?
LZMA can effectively use a window size of up to 2G; actual memory use is window_size*11.5.
With that it would be able to find any substring of any size (well, 2+ symbols) within the "window".
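For example, a 256M window with the bt4 match finder would need roughly 256*11.5 ≈ 2.9G of memory, plus some fixed overhead.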
> You do not obviously keep all the patterns, right?
> What are the criteria to discard low utility patterns?
Their codelengths are longer?
You can also talk about this in terms of "maximum likelihood" or "minimum description length",
but in the end it's the same Shannon entropy.
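In plain numbers (my framing, nothing LZMA-specific): a pattern with probability p costs about -log2(p) bits per occurrence, so keeping a dedicated representation for it only pays off when occurrences * (bits saved per use) exceed the bits needed to describe the pattern itself. Everything that fails that test is the "low utility" stuff.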
> And so on. Each next dictionary ~x10 bigger then previous one, so at some
> level I must get rid of "low utility" patterns.
Once upon a time, I tried writing a tool that worked like this (a rough sketch in code follows below):
0) expand the bytes of the original data to 32 bits per symbol
1) scan the data, find the pattern <s1><gap><s2> with the highest occurrence count
2) replace all <s1><gap><s2> in the data with a new symbol (actually: delete <s2>, replace <s1> with the new symbol)
3) goto 1
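Roughly like this in Python (a toy reconstruction; the function name, the max_gap limit and the greedy left-to-right replacement are my guesses at the details, and it's far too slow for real data sizes):

    from collections import Counter

    def pair_substitute(data, max_gap=2, rounds=1000):
        # 0) "expand to 32 bits per symbol": here just a list of Python ints,
        #    so there is room for new symbols above 255
        syms = list(data)
        next_sym = 256
        for _ in range(rounds):
            # 1) count every <s1><gap><s2>, where gap = number of skipped symbols
            counts = Counter()
            for i, s1 in enumerate(syms):
                for gap in range(max_gap + 1):
                    j = i + 1 + gap
                    if j < len(syms):
                        counts[(s1, gap, syms[j])] += 1
            if not counts:
                break
            (s1, gap, s2), n = counts.most_common(1)[0]
            if n < 2:
                break                           # a pattern seen once can't help
            # 2) replace all <s1><gap><s2>: delete <s2>, put the new symbol where <s1> was
            out, i = [], 0
            while i < len(syms):
                j = i + 1 + gap
                if syms[i] == s1 and j < len(syms) and syms[j] == s2:
                    out.append(next_sym)
                    out.extend(syms[i + 1:j])   # whatever sits in the gap stays
                    i = j + 1                   # skip the deleted <s2>
                else:
                    out.append(syms[i])
                    i += 1
            syms = out
            next_sym += 1
            # 3) goto 1 = next loop iteration
        return syms

e.g. pair_substitute(b"abracadabra abracadabra") shrinks the symbol stream a little while growing the alphabet.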
> I mean, I ask my algo : find me some looong patterns in some looong texts,
> and it does (no dynamic programming, tables r too big).
> I mine those patterns for text comprehension like,
> wondered if you guys use something like that for compression?
In a way, that's why actual compression algorithms are only developed in C/C++.
The more popular languages are just too inefficient for this.
I'd suggest reading about BWT: https://en.wikipedia.org/wiki/Burrow...eler_transform
It's a convenient tool for finding string matches.
A full index only needs 5N memory, whereas LZMA's bt4 tree needs 11.5N.
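A toy illustration of why sorted suffixes make this easy (naive construction, nothing like a real 5N suffix-array/BWT build): repeated substrings show up as common prefixes of neighbouring suffixes in the sorted order.

    def longest_repeat(data):
        # brute-force suffix array: sort suffix start positions by the suffix itself
        sa = sorted(range(len(data)), key=lambda i: data[i:])
        best_len, best_pos = 0, 0
        for a, b in zip(sa, sa[1:]):
            # length of the common prefix of two adjacent sorted suffixes
            l = 0
            while a + l < len(data) and b + l < len(data) and data[a + l] == data[b + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, a
        return data[best_pos:best_pos + best_len]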
> Another question: are there "hierarchical dictionaries" at all?
Maybe? For example, PPM/CM papers would frequently talk about "order-N models" and such.
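Very roughly, "order-N" just means statistics indexed by the previous N symbols, with fallback to shorter contexts when the long one is empty. A sketch of the idea (mine, not any particular PPM variant):

    from collections import defaultdict, Counter

    class OrderN:
        def __init__(self, max_order=2):
            self.max_order = max_order
            # one table per context length: context bytes -> counts of the next symbol
            self.tables = [defaultdict(Counter) for _ in range(max_order + 1)]

        def update(self, history, sym):
            for order in range(min(self.max_order, len(history)) + 1):
                ctx = bytes(history[len(history) - order:]) if order else b""
                self.tables[order][ctx][sym] += 1

        def predict(self, history):
            # PPM-style fallback: use the longest context that has any statistics
            for order in range(min(self.max_order, len(history)), -1, -1):
                ctx = bytes(history[len(history) - order:]) if order else b""
                if ctx in self.tables[order]:
                    return self.tables[order][ctx]
            return Counter()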
> I've read some papers on DNA patterns discovery, state-of-the-art claims 10K
> symbols length sources with 4-symbols alphabets processable only...
> I do a few millions with 100 symbols alphabets... what to think?
Might not be that kind of processing, though.
I mean, scanning of actual DNA molecules is still done in chunks.
Btw, I actually tried looking for patterns with gaps in LZMA output.
Found stuff like this (world192.txt):
<,000 kW capacity; ??,>
< billion, per capita $??0>
< (1988), ??.>
But nothing really interesting.