Results 1 to 6 of 6

Thread: Finding most frequency sequences in data?

  1. #1
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    156
    Thanks
    18
    Thanked 50 Times in 26 Posts

    Finding most frequency sequences in data?

    Given a sequence of bytes S, find a sub-sequence X which has maximum value of (Len(X)-1)*(Count(X)-1)... Then we can use shorter symbol to replace X in S.
    Is there any algorithm for that? (with acceptable time and memory usage)

    I've read XWRT's code, but it seems to seperate words by semantic, not for generic data?

  2. #2
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Do you mean text preprocessor? You should check dict in FreeArc.

  3. #3
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    156
    Thanks
    18
    Thanked 50 Times in 26 Posts
    Quote Originally Posted by Fu Siyuan View Post
    Do you mean text preprocessor? You should check dict in FreeArc.
    For generic data, we cannot use semantic information. For example, you can seperate English words by "," "."... but what if the input stream is a Chinese article?

  4. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Firstly do suffix sorting and compute LCP array (it could be done in one step).

    Then you can apply following algorithm (assuming input has length N):
    - prepare integer array of length N and initalize with zeros,
    - initialize variable max to 0,
    - for every item from LCP array (assuming you are iterating from beginning to end and keep current index in variable `index`):
    -- if item == max do nothing
    -- if item > max then do: while (max < item) { max++; array[max] = index; }
    -- if item < max then do: while (max > item) { try((index - array[max]) * (max - 1), sa[index], max); max--; }

    Where sa[] is an suffix array and try(weight, start, length) is a function that tracks the best subsequence (ie a simple max function inside).

    Or something like that, probably this is slightly broken but general idea should work.
    Last edited by Piotr Tarsa; 17th September 2012 at 18:46.

  5. #5
    Member Karhunen's Avatar
    Join Date
    Dec 2011
    Location
    USA
    Posts
    91
    Thanks
    2
    Thanked 1 Time in 1 Post
    Quote Originally Posted by RichSelian View Post
    For generic data, we cannot use semantic information. For example, you can seperate English words by "," "."... but what if the input stream is a Chinese article?
    I assume one would report the frequencies of the unicode characters and dump the contents as hex? Or maybe Google translate

  6. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Quote Originally Posted by Karhunen View Post
    Or maybe Google translate
    Prepare for a lossy transform ...

Similar Threads

  1. loseless data compression method for all digital data type
    By rarkyan in forum Data Compression
    Replies: 157
    Last Post: 9th July 2019, 17:28

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •