# Thread: A proposed new fast match searching structure

1. ## A proposed new fast match searching structure

I've been investigating an old idea which had been occupying my mind: creating a fast approximate match finder for long-range searches. Although it took some time to frame the problem correctly, in the end the solution looks surprisingly simple.

The idea starts from a familiar structure, a Hash Table.
To create a fast look-up algorithm, it is enough to hash the initial "minmatch" sequence and store the position into the table. Later on, when checking for the same sequence, we'll find it in its cell.
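As a minimal sketch, the basic lookup could look like this (MINMATCH, HASH_LOG and the hash multiplier are illustrative choices, not taken from any particular compressor):

```c
#include <stdint.h>
#include <string.h>

#define MINMATCH  4
#define HASH_LOG  16
#define HASH_SIZE (1 << HASH_LOG)

static uint32_t hashTable[HASH_SIZE];   /* last position seen per hash */

/* Hash the first MINMATCH bytes at ptr (Fibonacci-style multiplier). */
static uint32_t hashSeq(const uint8_t *ptr)
{
    uint32_t v;
    memcpy(&v, ptr, MINMATCH);          /* MINMATCH == 4 bytes */
    return (v * 2654435761u) >> (32 - HASH_LOG);
}

/* Return the previous position whose first MINMATCH bytes hashed
   identically (the match candidate), then record the current position. */
static uint32_t insertAndGetCandidate(const uint8_t *base, uint32_t pos)
{
    uint32_t h = hashSeq(base + pos);
    uint32_t candidate = hashTable[h];
    hashTable[h] = pos;
    return candidate;
}
```

Each position overwrites the previous occupant of its cell, which is exactly why this structure returns only the single closest candidate.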

OK, that's fine, but even without collisions, it only provides a single answer: the closest sequence starting with the same minmatch bytes. But maybe, somewhere farther away, there is a better, i.e. longer, match.

Looking at these other possibilities can be done in several ways. A simple method consists in linking all positions sharing the same hash value into a list. This is called a Hash Chain, and it works fairly well. But if the sequence is really redundant, the number of positions searched can become terribly high, and with increased search depth it becomes prohibitive.

Hence the simple idea: to avoid waiting forever, the search is stopped after a number of attempts. The higher the number, the better the result; the lower the number, the faster the search. This is the trade-off.
Basically, for large-distance searches, there is no shame in making a "partial search" in order to achieve acceptable speed.
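A depth-limited hash chain can be sketched as follows (sizes and the MAX_ATTEMPTS cut-off are illustrative; the match-length check on each visited position is left out):

```c
#include <stdint.h>

#define HASH_LOG     16
#define HASH_SIZE    (1 << HASH_LOG)
#define WINDOW_SIZE  (1 << 20)
#define MAX_ATTEMPTS 16                 /* the speed/quality trade-off knob */

static uint32_t head[HASH_SIZE];        /* most recent position per hash, 0 = empty */
static uint32_t chain[WINDOW_SIZE];     /* link to previous position, same hash */

/* Link the new position in front of the existing chain for hash h. */
static void chainInsert(uint32_t h, uint32_t pos)
{
    chain[pos] = head[h];
    head[h]    = pos;
}

/* Visit at most MAX_ATTEMPTS previous positions for hash h;
   returns how many candidates were collected into out[]. */
static unsigned chainWalk(uint32_t h, uint32_t *out)
{
    unsigned n = 0;
    for (uint32_t p = head[h]; p != 0 && n < MAX_ATTEMPTS; p = chain[p])
        out[n++] = p;
    return n;
}
```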

OK, so since the number of searches can be arbitrarily limited, why not store the candidates in a table in the first place? The row number is given by the hash, and all corresponding positions are stored in order within the row. This is much faster to run, since there are fewer memory jumps, and easier to set up.
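The row-based table just described can be sketched like this (ROW_SIZE, HASH_LOG and the hash are illustrative; newest positions sit at the head of the row, the oldest falls off):

```c
#include <stdint.h>
#include <string.h>

#define HASH_LOG  16
#define HASH_SIZE (1 << HASH_LOG)
#define ROW_SIZE  4                       /* = number of attempts */

static uint32_t rowTable[HASH_SIZE][ROW_SIZE];
static unsigned rowFill[HASH_SIZE];       /* slots used per row */

static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - HASH_LOG);
}

/* Insert pos at the head of its row; the oldest entry is dropped. */
static void rowInsert(uint32_t h, uint32_t pos)
{
    unsigned n = (rowFill[h] < ROW_SIZE) ? rowFill[h] : ROW_SIZE - 1;
    memmove(&rowTable[h][1], &rowTable[h][0], n * sizeof(uint32_t));
    rowTable[h][0] = pos;
    if (rowFill[h] < ROW_SIZE) rowFill[h]++;
}

/* Test every stored position in the row, keep the longest match,
   then insert the current position. */
static uint32_t rowSearchBest(const uint8_t *base, uint32_t pos,
                              uint32_t limit, uint32_t *bestLen)
{
    uint32_t h = hash4(base + pos);
    uint32_t bestPos = 0;
    *bestLen = 0;
    for (unsigned i = 0; i < rowFill[h]; i++) {
        uint32_t cand = rowTable[h][i];
        uint32_t len = 0;
        while (pos + len < limit && base[cand + len] == base[pos + len]) len++;
        if (len > *bestLen) { *bestLen = len; bestPos = cand; }
    }
    rowInsert(h, pos);
    return bestPos;
}
```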

This method is not so new; it was notably championed in the early 90's by Ross Williams, for his compressor LZRW-3a. Therefore we'll call it LZRW.
The LZRW structure has, in my opinion, two advantages: one is controlled speed, through the size of the rows (= number of attempts), and the other is controlled memory, through the selection of the table size.

This latter property is critical: most "full search" methods require a huge amount of memory to work properly. By "huge", I mean something in the range of 10x the dictionary size (with the exception of Hash Chains, but they are too slow for long-distance searches anyway). So you want to search within a 256MB dictionary? You need more than 2.5GB of memory. Farewell, 32-bit systems.

One has to wonder: is it the right method? Such an amount of memory is simply prohibitive. Maybe by accepting a "less perfect" search that uses less memory, we may nonetheless achieve a better compression rate thanks to longer search distances. This is a position defended by Bulat Ziganshin for his compressor FreeArc. For this scenario, LZRW comes as an obvious candidate: you can, for example, set up a 256MB search table and use it to look into a 1GB dictionary. Now long-distance searches look affordable!

OK, so LZRW works, but there is no miracle: the longer the search, the more time it costs. In the end, the return on investment can be quite low, and with a large number of searches (say, 256), the search becomes so slow as to be impractical.
This is to be expected: all the table guarantees is that all elements share the same row, hence the same hash value. But that's it. So after finding an element with minmatch common bytes, we'll test another, and another, and so on. This is wasteful. What we want after finding minmatch bytes is to find another position with at least minmatch+1 common bytes.

To avoid testing each and every position, several methods are possible. One is to build a Binary Tree on top of the row. It is very efficient. A particularly good implementation of this idea was made by Christian Martelock for his RZM compressor (now discontinued). Unfortunately, it also means even more memory consumed for the BT structure. And in the end, it is not so different from a full BT structure over the entire search window, except that we lose full-search capability.

OK, I'm looking for something simpler. After checking a minmatch candidate, I want to look for a minmatch+1 candidate straight away. No searching, no wandering.
This can be done quite simply: just hash the "minmatch+1" sequence, and store its position directly into a hash table.
OK, so there are several tables, you say, one per sequence size?
Well, why? No, a single one.

Just share the hash table among all the sequence sizes. The simple idea here is that a given position should be stored only once in the table, either under the hash of its first "minmatch" bytes, or "minmatch+1", or "minmatch+2", and so on.
So, which sequence size should be stored? Well, just start with the "minmatch" one. When a new equivalent sequence lands in the same table cell, we just move on, and store the old position at its "minmatch+1" hash. The next time, move it to "minmatch+2", and so on. So the position will move through the table, up to the point where it is considered "dead", either off-limit or because of an elimination rule.
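The progressive re-insertion just described can be sketched as follows (all constants, names and the FNV-style hash are illustrative assumptions, not the actual implementation; positions start at 1 so that 0 can mean "empty"):

```c
#include <stdint.h>
#include <string.h>

#define MINMATCH  4
#define MAX_EXTRA 3              /* hash up to MINMATCH+MAX_EXTRA bytes */
#define HASH_LOG  16
#define HASH_SIZE (1 << HASH_LOG)

typedef struct { uint32_t pos; uint8_t len; } Cell;  /* len = hashed length */
static Cell table[HASH_SIZE];                        /* pos 0 = empty */

/* FNV-1a over the first len bytes, folded to HASH_LOG bits. */
static uint32_t hashLen(const uint8_t *p, unsigned len)
{
    uint32_t h = 2166136261u;
    for (unsigned i = 0; i < len; i++) { h ^= p[i]; h *= 16777619u; }
    return h >> (32 - HASH_LOG);
}

/* Insert pos hashed over `len` bytes; the displaced occupant moves on
   to its next-longer hash, until it dies past MINMATCH+MAX_EXTRA. */
static void phsInsert(const uint8_t *base, uint32_t pos, unsigned len)
{
    while (len <= MINMATCH + MAX_EXTRA) {
        uint32_t h = hashLen(base + pos, len);
        Cell old = table[h];
        table[h] = (Cell){ pos, (uint8_t)len };
        if (old.pos == 0) return;        /* slot was empty: done */
        pos = old.pos;                   /* displaced: re-insert longer */
        len = old.len + 1;
    }                                    /* else: position eliminated */
}

/* Probe from the longest hashed length down; return the first
   candidate whose bytes actually verify. */
static uint32_t phsSearch(const uint8_t *base, uint32_t pos,
                          uint32_t limit, uint32_t *matchLen)
{
    for (unsigned len = MINMATCH + MAX_EXTRA; len >= MINMATCH; len--) {
        if (pos + len > limit) continue;
        Cell c = table[hashLen(base + pos, len)];
        if (c.pos && c.len == len && !memcmp(base + c.pos, base + pos, len)) {
            *matchLen = len;
            return c.pos;
        }
    }
    *matchLen = 0;
    return 0;
}
```

The `c.len == len` check is what resolves collisions between different sequence sizes sharing the single table.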

Yes, we have to take into consideration the natural occurrence of collisions in this structure, and they can happen between any sequence sizes. For example, a newer "minmatch" sequence could be willing to occupy the same slot as an older (and different) "minmatch+2" sequence. Well, one has to move on, or be eliminated.
That's where different rules can be applied. The most basic one is that the youngest position always wins. It's good enough, and I recommend it. But it can also be mixed with some more complex heuristic, to avoid losing long-distance anchors too fast, keeping in mind that this structure is created precisely for situations where we can afford less than one cell per dictionary entry.

So we've got our basic idea here: a cross-over between cuckoo hashing and progressively longer hashed sequences. Hence the proposed name, Progressive Hash Series.

Now it is time to put this idea into practice. I've been creating a simple long-range match finder based on a greedy matching strategy; it's a simple and fast way to compare the above methods. The file tested is enwik9, used in Matt Mahoney's LTCB. This file is very redundant, and particularly intensive for the match finder. LZRW and PHS are compared on their capacity to find longer matches (which translates into better compression), and on their respective speeds. Here are the results :

So what do we learn from these figures ?
First, PHS looks more efficient than LZRW. For an identical amount of memory, it converges faster towards the optimal match length. This is particularly telling for the 4-attempts version, which is about as efficient as an 80-attempts LZRW.

Note however that for a given number of attempts, PHS is more complex, and requires more time. For example, the 4-attempts PHS is approximately the same speed as a 14-attempts LZRW. So, in the end, we've got the compression power of an 80-attempts LZRW for the cost of a 14-attempts one.

This is my first review of this new method, and I guess we have only scratched the surface of it. Further refinements are to be expected.
I have not yet found this idea described anywhere on the net. It is not even mentioned in the thorough Match Finder review by Charles Bloom. So, is it new?

 Shamelessly copied from http://fastcompression.blogspot.com/...method-to.html

2. ## A dumb question and suggestion?

I guess I am missing something: why the large memory requirement, when you first mentioned linking the hashes together?

In my code (and I may be misunderstanding what you were saying), the hash table contains the position of the last matching hash, which feeds into the 32-bit link table, meaning I only need hash_table*4 bytes + the data length + 4*the data length for the links.

Or is my misunderstanding that you are using 64-bit links? I have never needed to compress a window that was more than 16 MBytes in size, and the hash tables I use usually have only 64K to 64M entries.

Okay, I think that is my problem, I got to think bigger.

A suggestion on your limited searches. One trick I used when programming on an Amiga 1000 with only 4.5 MBytes of memory: instead of using 24- or 32-bit pointers, I used a 16-bit offset in each back-link entry (value = 0 meaning end of chain). Thus, if there were not many matches, the chains ended naturally by themselves.
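That offset trick could be sketched like this (the array size is an illustrative assumption; a distance that does not fit in 16 bits simply cuts the chain):

```c
#include <stdint.h>

#define NB_POSITIONS (1 << 20)

static uint16_t link16[NB_POSITIONS];   /* back-link offsets, 0 = end of chain */

/* Record that prev is the previous occurrence for position pos. */
static void setLink(uint32_t pos, uint32_t prev)
{
    uint32_t d = pos - prev;
    link16[pos] = (d <= 0xFFFF) ? (uint16_t)d : 0;   /* too far: cut chain */
}

/* Follow the back-link: returns the previous position, or pos itself
   when the chain has ended (offset 0). */
static uint32_t prevPos(uint32_t pos)
{
    return pos - link16[pos];
}
```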

I love your idea of using one hash table for more than one min-match length; I've got to try it out. Until now I have been using three separate hash tables; I never thought of just using one big table and hashing the different lengths into it. Thanks for the tip.

I hope I don't come across as a dumb idiot, I understand only a small percentage of the posts in this forum, but this idea of yours is helping me look at my code in a new light.

Thanks again, and good-luck.

3. Cyan, i don't understand what "simulated" speed means. you have a compressor; can you put the new method into it and publish executables for our testing?

LZRW hashing is all about reducing cache misses, so i'm not sure that PHS is able to outperform it

4. @Earl :
Yes, you are correct with the Hash Chain Memory requirement.
On 32-bit systems, it is 4N, plus the initial dispatch table, which can be any size but is typically 2N. So it ends up being 6N.

However, Hash Chains are not suitable for long-range matches. Charles Bloom just made a long, interesting study of Match Finders, with basically this conclusion. The only way to make Hash Chains viable for long range is to limit the number of hops, hence the number of attempts, resulting in incomplete searches.
So, if we accept the concept of an incomplete search, we can go all the way down to N.

All full-search structures which are viable for long-range matches are at least as complex as a Binary Tree, which needs 2 pointers per position, hence 8N on 32-bit systems. Add the 2N of the initial dispatch table, and this is 10N. Plus, obviously, N for the dictionary itself, and this is 11N.

@Bulat
Sure, simulated speed should be taken with a grain of salt.
I've tried to build a simulator which runs identically for all considered methods, so nothing has been really optimised.
I'm actually preparing a new version of LZ4 using the "2-attempts" PHS method. When it's ready, I'll propose it in this forum for evaluation.

> LZRW hashing is all about reducing cache misses, so i'm not sure that PHS is able to outperform it
That's correct, LZRW has all positions stored next to each other, therefore there is no memory jump between 2 attempts.
Except for the position to be tested ...

PHS requires calculating a new hash value for each attempt, resulting in one more memory jump than LZRW. That's why, for an equivalent number of attempts, LZRW is faster. In the simulation, a 4-attempts PHS has approximately the same speed as a 14-attempts LZRW.

But since PHS converges much faster, the point is that it is still faster than LZRW for an equivalent match finder strength.

5. >Except for the position to be tested ...

but these data may be cached, so you can test up to 8-16 match candidates with one memory access, while with phs you need 2-3-4 accesses. and since speed can be measured in number of memaccesses, we may expect that phs2/3/4 will have the same speed as ht16/24/32. you may try to measure the number of accesses as the best approximation of real speed. you may use tornado ht matchfinders to measure the number of memaccesses for optimized HT code

6. Cached? So you mean having a copy of the MINMATCH sequence alongside the position within the cell?
I tested that hypothesis long ago, but it did not deliver. Maybe I should try it again.
I can upgrade the simulator with this strategy to get some comparison figures.
Note that, since the comparison is done at an identical memory budget, it effectively divides the table size by 2 to accommodate some room for the cached data. Therefore, it affects the compression result.

Note also that I do not consider "hash collisions" to be the main issue with LZRW: even in a perfect case, where all positions in a row share identical "MINMATCH" characters, the point is that the LZRW structure provides no hint about character N+1, resulting in many useless comparison attempts.

I'll have a look at tornado matchfinders anyway. Thanks for the tip.

Rgds

7. This wasn't popular when I floated it, but it might be worth a glance (and a dismissal): http://encode.ru/threads/1121-hashing-LZ

Do you need to hash MINMATCH+1 and MINMATCH+2 (or a ROLZ prefix or whatever)? Why not just look ahead in the table for the MINMATCH hash, but offset by one in the look-ahead buffer instead?

If you probe for some match A, and you also probe for the next sequence B before dereferencing them, you are visiting two parts of the hash table before visiting the potential match in the buffer to validate it.

However, you are probably going to visit B anyway if you have no matches, so it's a question about the order in which you access cache lines, not whether you access them at all.

8. >Cached ? So you mean having a copy of the MINMATCH sequence alongside the position within the cell ?

no. short caching is just about storing a few bits of the match in the same 4 bytes, just to resolve match collisions. long caching is storing match[MINMATCH-1..MINMATCH+2] bytes with the pointer, in order to resolve match collisions AND check short matches w/o an additional memory access. you may find a long caching mf in the tornado CachingMatchFinder class, and short ones in REP and lzma ht
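A minimal sketch of such a long-caching cell, assuming MINMATCH = 4 and 4 cached bytes (the layout is illustrative, not tornado's actual CachingMatchFinder layout):

```c
#include <stdint.h>
#include <string.h>

#define MINMATCH 4

typedef struct {
    uint32_t pos;
    uint8_t  check[4];     /* copy of bytes MINMATCH-1 .. MINMATCH+2 */
} CachedCell;

/* Store the position plus 4 window bytes starting at offset MINMATCH-1. */
static void cellStore(CachedCell *c, const uint8_t *base, uint32_t pos)
{
    c->pos = pos;
    memcpy(c->check, base + pos + MINMATCH - 1, 4);
}

/* Quick rejection test: if the cached bytes disagree with the current
   sequence, the candidate fails without any access to the window. */
static int cellMayMatch(const CachedCell *c, const uint8_t *cur)
{
    return memcmp(c->check, cur + MINMATCH - 1, 4) == 0;
}
```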

9. Yep, I just read the tornado matchfinder, and indeed, the caching is about the next bytes after minmatch.
So this is much more interesting. It has great potential to reduce the number of data checks, therefore limiting memory fetches.

Still, it also means that there are basically half as many references for a table of identical size, resulting in decreased match-finder capability. This decreased efficiency can be compensated by increasing the size of the rows, which in turn decreases speed. So now it is a trade-off. I guess it is probably positive in the end. I shall have a deeper look at it for comparison.

10. Sometimes there are two or three spare bits next to the offset, in which you can store some additional bits from the hash. This can save a few dereferences.

I forget if it was BALZ or FreeArc or whatever that I put code in to count the effect of that, but it was quite good.

11. @willvarfar:
I've been reading your post, and I believe it is a pretty good idea.
At least, it is in the context of 2D hash tables. You are basically probing 2 rows linearly, looking for correlation, reducing cache misses.

Compared to the "cached" version of Bulat's match finder, I guess it can only be slower. Plus, the scheme is bound to have some limitations at long range (running out of candidates as a consequence of row size).
The nice compensation is that there is no need to cache any data; therefore it keeps the memory requirement at 4 bytes per stored position, which is beneficial to match-finder efficiency.

@both :

Yes, I know about the trick of using the available bits in the reference to further improve precision. I've not implemented it in the simulator, however, since I'm targeting very long-range search windows, and there are very few bits remaining, if any. Still, maybe I should ...

12. with lzma, i use 2 bits for caching, limiting dictionary to 1 gbyte
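That packing can be sketched as follows: 2 tag bits taken from the hash share a 32-bit word with a 30-bit position, which is what caps the dictionary at 1 GByte (2^30). The names and masks are illustrative, not lzma's actual code:

```c
#include <stdint.h>

#define TAG_BITS 2
#define POS_BITS 30                        /* 2^30 = 1 GByte dictionary cap */
#define POS_MASK ((1u << POS_BITS) - 1)
#define TAG_MASK ((1u << TAG_BITS) - 1)

/* Pack a 30-bit position with 2 extra hash bits into one 32-bit word. */
static uint32_t pack(uint32_t pos, uint32_t hash)
{
    return (pos & POS_MASK) | ((hash & TAG_MASK) << POS_BITS);
}

static uint32_t unpackPos(uint32_t packed) { return packed & POS_MASK; }

/* A candidate whose tag bits disagree with the current hash can be
   rejected without dereferencing the position at all. */
static int tagAgrees(uint32_t packed, uint32_t hash)
{
    return (packed >> POS_BITS) == (hash & TAG_MASK);
}
```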
