Results 1 to 6 of 6

Thread: HST4: hash-sorted transformation and matchfinder

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts

    HST4: hash-sorted transformation and matchfinder

    Let's compare output of ST4 transformation with snapshot of HT4 matcfinder table at some moment. They are pretty similar - both contains pointers to buffer sorted by 4 pointed chars. The difference is that, 1) ST4 doesn't suffer from match collisions, and 2) ST4 includes all buffer positions while HT4 holds only last N entries for every hash value

    ST4 output even may be used to search matches like the HT4 table - first, we should find pointer to the current position in the buffer (using some sort of binary search or index structure), and then scan preceding positions - either all with the same 4 bytes, or limiting search to fixed depth

    Going further, we can replace ST4 transformation with HST4 - sorting pointers by some hash of 4 bytes pointed plus position. HST4 advantages over ST4: 1) the transformation can be done in single pass and work faster, 2) we can save indexes used in the transformation and employ them for the following search operations:
    Code:
    //HST4: sorting
    for(i=0; i<N; i++)
      word = *(uint32*)(buf+i)
      index[hash(word)]++
    
    for(i=1; i<index_elems; i++)
      index[i+1] += index[i]
    
    for(i=0; i<N; i++)
      word = *(uint32*)(buf+i)  
      idx = index[hash(word)]++
      hst4_output[idx] = buf+i
    
    for(i=index_elems-1; i>0; i--)
      index[i] = index[i-1]
    
    // Now index[hash(word)] points to the first hst4_output[] element having the same hash(word)
    // As we will go through the buf[], these indexes will be updated to point to hst4_output[] element where we should start search:
    
    // HST4: searching
    for(i=0; i<N; i++)
      word = *(uint32*)(buf+i)
      idx = index[hash(word)]++
      for (; hash(*(uint32*)hst4_output[idx]) == hash(word); idx--)
        check match length at postion pointed by hst4_output[idx]
    this brings us the complete matchfinder solution but it's hardly any better that plain HT4. But unlike HT4 and any other dynamic matchfinders, HST4 allows to implement multi-threaded lazy/greedy matchfinder. The key is that hst4_output[] doesn't updated after sorting so it may be used in multiple threads. so we can split input buffer into muliple pieces and compress them all simultaneously. all we need to replicate for every thread is the index[] array. modified code parts are:
    Code:
    for (core=0; core<number_of_cores; core++)
      // first, save initial indexes for this part of buffer
      for(i=0; i<index_elems; i++)
        local_index[core][i] = index[i]
    
      // second, process the next N/number_of_cores positions
      for(i=core*N/number_of_cores; i<(core+1)*N/number_of_cores; i++)
        word = *(uint32*)(buf+i)  
        idx = index[hash(word)]++
        hst4_output[idx] = buf+i
    
    // HST4 searching in the piece number K
    for(i=K*N/number_of_cores; i<(K+1)*N/number_of_cores; i++)
      word = *(uint32*)(buf+i)
      idx = local_index[K][hash(word)]++
      for (; hash(*(uint32*)hst4_output[idx]) == hash(word); idx--)
        check match length at postion pointed by hst4_output[idx]
    Last edited by Bulat Ziganshin; 4th December 2013 at 23:15.

  2. #2
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Isn't LZHAM's matchfinder dynamic and multithreaded too?

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    it checks every position as used for optimal parsing:

    Before parsing, the main thread scans each 512KB block to determine all unique character digrams present in the block. The list of unique digrams is equally subdivided and distributed to the finder threads. Each digram (and binary tree) is associated with only a single helper thread, eliminating any need for synchronization while inserting strings and finding matches.
    Each match finder thread exhaustively finds a list of roughly sorted matches for every dictionary entry that begins with a digram assigned to that thread.
    Last edited by Bulat Ziganshin; 4th December 2013 at 23:32.

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Bulat,

    I wonder if a LCP table would speed things up. It depends on how you would use the hash table, eg what'e the limit on match length or what's the limit of checked positions during single search (ie search at one position)? LCP can not only reduce the number of comparisons but also the number of random memory accesses. For example, current match length (for current entry in hash table) is 5. The LCP table says the common prefix between current entry in hash table and next entry (in order of checking them) is higher than 5. In that case you know that with next entry you will still have match length 5 and you don't need to check that.

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    we have looked around BWT- and ST4-matchfinders for a years, since they have been somewhat employed in the IMP. but there were no good ideas. BWT is rather expensive to construct and while it immediately gives you the largest matchm, you will need to scan multiple entries if you need to find one that 1) precedes current position, 2) nearest of all equal matches and especially 3) minimizes the encoding cost. both BWT and ST4 has problems with searching for initial position (kept in local_index[] in my scheme)

    my scheme solves these problems, but finally, it has not much advantages over HT4. and since entries are sorted by hash, the LCP table will be useless for ~~50% of entries. well, i think it's useless even for ST4-sorted table, it looks reasonable only for BWT data where it should help to find match lengths without any checks to dictionary itself. so it looks like a great idea that may give new life to the BWT-sorted matchfinder. let's try to describe its construction in greater details!

    More readings: http://encode.ru/threads/1375-More-ideas-for-LZ-matchfinding
    Last edited by Bulat Ziganshin; 30th July 2014 at 15:53.

  6. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    my scheme solves these problems, but finally, it has not much advantages over HT4. and since entries are sorted by hash, the LCP table will be useless for ~~50% of entries. well, i think it's useless even for ST4-sorted table, it looks reasonable only for BWT data where it should help to find match lengths without any checks to dictionary itself.
    Actually, you can compute LCP on ST4 table if you keep the starting 4-byte chunks in final hash table (that means higher memory occupancy) or if you check those bytes through pointer before checking LCP (that means more random memory accesses).
    And you don't need very deep LCP, you can cap it at eg 20 or so. You can compute LCP naively, using simple string comparison. That would be also easy to modify to efficiently find tandem repeats ( http://en.wikipedia.org/wiki/Tandem_repeat ). And if you detect a tandem repeat you can mark it, eg by value -1 in LCP table so you know you can find super long matches then.
    You have a tandem repeat when equal prefixes of two strings are adjacent to each other.
    Last edited by Piotr Tarsa; 5th December 2013 at 00:40.

Similar Threads

  1. A new, minimally-sorted BWT variant
    By nburns in forum Data Compression
    Replies: 8
    Last Post: 12th July 2013, 05:00
  2. lzma's binary tree matchfinder
    By Shelwien in forum Data Compression
    Replies: 7
    Last Post: 13th October 2011, 08:38
  3. Nibble MMC matchfinder
    By Shelwien in forum Data Compression
    Replies: 4
    Last Post: 25th April 2011, 12:21
  4. Transformation for UIQ2
    By Jan Ondrus in forum Data Compression
    Replies: 49
    Last Post: 4th October 2009, 17:30
  5. Delta transformation
    By encode in forum Forum Archive
    Replies: 16
    Last Post: 4th January 2008, 12:13

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •