Results 1 to 4 of 4

Thread: Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time

    I bumped into this 2016 paper on HN: https://arxiv.org/abs/1607.04346

    We show that the compressed suffix array and the compressed suffix tree of a string T can be built in O(n) deterministic time using O(nlogσ) bits of space, where n is the string length and σ is the alphabet size. Previously described deterministic algorithms either run in time that depends on the alphabet size or need ω(nlogσ) bits of working space. Our result has immediate applications to other problems, such as yielding the first linear-time LZ77 and LZ78 parsing algorithms that use O(nlogσ) bits.
    (emphasis mine)

    HN comments: https://news.ycombinator.com/item?id=14333882

  2. #2
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I stumpled upon that paper last year when doing research for my SACA (I implemented it as second stage improvement for libdivsufsort). Gave up on it relativly fast because it uses 3*N memory (if I understand it corectly) and rank queries (which are usally quite slow due to cache misses) while normal SACAs already use 4 * N and there's already the paper 'A Linear-Time Burrows-Wheeler Transformusing Induced Sorting' which does the Burrows-Wheeler Transform in O(n) time using O(n * log σ * log log_σ(n)) (around 2,2 * n in practice) which is almost as fast as SAIS. I don't think that this can be reduced much but maybe I simply misunderstood it.

    Anyway last year was a very interesting year for SACAs: "Linear-time Suffix SortingA new approach for suffix array construction" Uwe Baier gave the first non recursive linear time SACA. I turned it into a O(n * log(n)) SACA that passes only twice over the suffix array (and therefore is quite cache friendly). It turned out to be around 50% faster with enwik9 in libdivsufsort than the original second stage. You can find it here: https://github.com/akamiru/sort. An implementation of his paper is available here: https://github.com/waYne1337/gsaca.

    Moreover I could show that "Linear-time Suffix Sorting A new approach for suffix array construction" also gave the first independant algorithm for generating the lyndon array without using global data structures. Prof. William F. Smyth gave a speak about this (and other things) at the London Stringology Days that I could share on request.

    "An Efficient Algorithm for Suffix Sorting" Z Peng, Y Wang sounded quite interessting, too. I guess it incorporates some of the ideas I used for daware but sadly its not available on the internet.

  3. The Following 5 Users Say Thank You to Christoph Diegelmann For This Useful Post:

    Bulat Ziganshin (15th May 2017),hexagone (15th May 2017),jibz (15th May 2017),Shelwien (16th May 2017),willvarfar (16th May 2017)

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

  5. #4
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    It turns out dsufsort ("An Efficient Algorithm for Suffix Sorting") works exactly like I thought from the abstract:

    Normal prefix doubling sorts each group of depth d by 2d by additonally sorting by ISA[SA[i] + d]. Therefore after each round the already sorted depth doubles (and therfore the parameter d) resulting in a maximum of log(n) rounds. Dsufsort uses an additional array D to store the actual sorting depth for each group and updating the depth to D[i] + D[ISA[SA[i] + D[i]]]. This means that when the first groups are already sorted, some of the next SA may be sorted by d + 2d = 3d. This means the average group depth may grow even faster than twice each round.

    Daware uses the trick of GSACA (Uwe Baier's new algo) to improve upon that. It works from the lexicographically highest group to the lowest and sorts them completly until their next smaller suffix (hence the sorting depth equals the value in the lyndon array). We only read every ISA value only once (GSACA is linear) after sorting (this is the only O(n*log n) part here) it. Therefore we can override it after the next sort with the new depth (so we save the space for D). In a second pass we sort from the lowest to the highest group finishing our sort.

    It's rather hard to explain in short so I recommend interested ones to read the GSACA paper for a formal prove or take a look at the pseudocode in my repo: https://github.com/akamiru/sort/wiki...tion-of-DAware.

  6. The Following User Says Thank You to Christoph Diegelmann For This Useful Post:

    Bulat Ziganshin (16th May 2017)

Similar Threads

  1. BWTIL: a set of tools to work with BWT-based text indexes
    By Nicola Prezza in forum Data Compression
    Replies: 31
    Last Post: 29th August 2014, 14:58
  2. Lightweight BWT construction
    By Matt Mahoney in forum Data Compression
    Replies: 2
    Last Post: 15th July 2013, 19:03
  3. An Elegant Algorithm for the Construction of Suffix Arrays
    By willvarfar in forum Data Compression
    Replies: 18
    Last Post: 11th July 2013, 16:01
  4. Dark Space Random Thoughts
    By Tribune in forum The Off-Topic Lounge
    Replies: 19
    Last Post: 14th March 2009, 15:22
  5. Welcome to MY space!
    By encode in forum Forum Archive
    Replies: 9
    Last Post: 20th May 2007, 12:58

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •