Results 1 to 6 of 6

Thread: Huffman tree, Autumn leaves and Winter cut-back

  1. #1
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts

    Huffman tree, Autumn leaves and Winter cut-back

    This is a sample of the ALWC (Autumn Leaves Winter Cut-back) algorithm I came up last night, it's a kind of false dynamic Huffman, not sure its really helpful since both the encoder and the decoder need the real number of occurrences of each letter beforehand which leads to larger headers for limited gains.

    Basically it starts by building a standard Huffman prefix-free code, after each symbol encoding/decoding it updates its occurrence count and keeps the symbol list sorted until the symbol count reaches zero — the Autumn Leave — resulting in a small prefix-free code change (two of the longest codes are fused together to form a one bit shorter code) — the Winter Cut-back —.

    Sample used "Les beaux kiwis que voici !"
    Symbol count between square brackets.

    Code:
    [5]  _   00
    [4]  i   010
    [3]  e   011
    [2]  s   1000
    [2]  u   1001
    [1]  x   1010
    [1]  !   10110
    [1]  L   10111
    [1]  a   11000
    [1]  b   11001
    [1]  c   11010
    [1]  k   11011
    [1]  o   11100
    [1]  q   11101
    [1]  v   11110
    [1]  w   11111
    
    L (10111)
    
    [5]  _   00        [5]  _   00
    [4]  i   010       [4]  i   010
    [3]  e   011       [3]  e   011
    [2]  s   1000      [2]  s   1000
    [2]  u   1001      [2]  u   1001
    [1]  x   1010      [1]  x   1010
    [1]  !   10110     [1]  !   1011
    [1]  a   10111     [1]  a   11000
    [1]  b   11000     [1]  b   11001
    [1]  c   11001     [1]  c   11010
    [1]  k   11010     [1]  k   11011
    [1]  o   11011     [1]  o   11100
    [1]  q   11100     [1]  q   11101
    [1]  v   11101     [1]  v   11110
    [1]  w   11110     [1]  w   11111
             11111
    
    e (011)
    
    [5]  _   00
    [4]  i   010
    [2]  e   011
    [2]  s   1000
    [2]  u   1001
    [1]  x   1010
    [1]  !   1011
    [1]  a   11000
    [1]  b   11001
    [1]  c   11010
    [1]  k   11011
    [1]  o   11100
    [1]  q   11101
    [1]  v   11110
    [1]  w   11111
    
    s (1000)
    
    [5]  _   00
    [4]  i   010
    [2]  e   011
    [2]  u   1000
    [1]  s   1001
    [1]  x   1010
    [1]  !   1011
    [1]  a   11000
    [1]  b   11001
    [1]  c   11010
    [1]  k   11011
    [1]  o   11100
    [1]  q   11101
    [1]  v   11110
    [1]  w   11111
    
    _ (00)
    
    [4]  _   00
    [4]  i   010
    [2]  e   011
    [2]  u   1000
    [1]  s   1001
    [1]  x   1010
    [1]  !   1011
    [1]  a   11000
    [1]  b   11001
    [1]  c   11010
    [1]  k   11011
    [1]  o   11100
    [1]  q   11101
    [1]  v   11110
    [1]  w   11111
    
    b (11001)
    
    [4]  _   00        [4]  _   00
    [4]  i   010       [4]  i   010
    [2]  e   011       [2]  e   011
    [2]  u   1000      [2]  u   1000
    [1]  s   1001      [1]  s   1001
    [1]  x   1010      [1]  x   1010
    [1]  !   1011      [1]  !   1011
    [1]  a   11000     [1]  a   1100
    [1]  c   11001     [1]  c   11010
    [1]  k   11010     [1]  k   11011
    [1]  o   11011     [1]  o   11100
    [1]  q   11100     [1]  q   11101
    [1]  v   11101     [1]  v   11110
    [1]  w   11110     [1]  w   11111
             11111
    
    e (011)
    
    [4]  _   00
    [4]  i   010
    [2]  u   011
    [1]  e   1000
    [1]  s   1001
    [1]  x   1010
    [1]  !   1011
    [1]  a   1100
    [1]  c   11010
    [1]  k   11011
    [1]  o   11100
    [1]  q   11101
    [1]  v   11110
    [1]  w   11111
    
    a (1100)
    
    [4]  _   00        [4]  _   00
    [4]  i   010       [4]  i   010
    [2]  u   011       [2]  u   011
    [1]  e   1000      [1]  e   1000
    [1]  s   1001      [1]  s   1001
    [1]  x   1010      [1]  x   1010
    [1]  !   1011      [1]  !   1011
    [1]  c   1100      [1]  c   1100
    [1]  k   11010     [1]  k   1101
    [1]  o   11011     [1]  o   11100
    [1]  q   11100     [1]  q   11101
    [1]  v   11101     [1]  v   11110
    [1]  w   11110     [1]  w   11111
             11111
    
    u (011)
    
    [4]  _   00
    [4]  i   010
    [1]  u   011
    [1]  e   1000
    [1]  s   1001
    [1]  x   1010
    [1]  !   1011
    [1]  c   1100
    [1]  k   1101
    [1]  o   11100
    [1]  q   11101
    [1]  v   11110
    [1]  w   11111
    
    x (1010)
    
    [4]  _   00        [4]  _   00
    [4]  i   010       [4]  i   010
    [1]  u   011       [1]  u   011
    [1]  e   1000      [1]  e   1000
    [1]  s   1001      [1]  s   1001
    [1]  !   1010      [1]  !   1010
    [1]  c   1011      [1]  c   1011
    [1]  k   1100      [1]  k   1100
    [1]  o   1101      [1]  o   1101
    [1]  q   11100     [1]  q   1110
    [1]  v   11101     [1]  v   11110
    [1]  w   11110     [1]  w   11111
             11111
    
    _ (00)
    
    [4]  i   00
    [3]  _   010
    [1]  u   011
    [1]  e   1000
    [1]  s   1001
    [1]  !   1010
    [1]  c   1011
    [1]  k   1100
    [1]  o   1101
    [1]  q   1110
    [1]  v   11110
    [1]  w   11111
    
    k (1100)
    
    [4]  i   00        [4]  i   00
    [3]  _   010       [3]  _   010
    [1]  u   011       [1]  u   011
    [1]  e   1000      [1]  e   1000
    [1]  s   1001      [1]  s   1001
    [1]  !   1010      [1]  !   1010
    [1]  c   1011      [1]  c   1011
    [1]  o   1100      [1]  o   1100
    [1]  q   1101      [1]  q   1101
    [1]  v   1110      [1]  v   1110
    [1]  w   11110     [1]  w   1111
             11111
    
    i (00)
    
    [3]  i   00
    [3]  _   010
    [1]  u   011
    [1]  e   1000
    [1]  s   1001
    [1]  !   1010
    [1]  c   1011
    [1]  o   1100
    [1]  q   1101
    [1]  v   1110
    [1]  w   1111
    
    w (1111)
    
    [3]  i   00        [3]  i   00
    [3]  _   010       [3]  _   010
    [1]  u   011       [1]  u   011
    [1]  e   1000      [1]  e   100
    [1]  s   1001      [1]  s   1010
    [1]  !   1010      [1]  !   1011
    [1]  c   1011      [1]  c   1100
    [1]  o   1100      [1]  o   1101
    [1]  q   1101      [1]  q   1110
    [1]  v   1110      [1]  v   1111
             1111
    
    i (00)
    
    [3]  _   00
    [2]  i   010
    [1]  u   011
    [1]  e   100
    [1]  s   1010
    [1]  !   1011
    [1]  c   1100
    [1]  o   1101
    [1]  q   1110
    [1]  v   1111
    
    s (1010)
    
    [3]  _   00        [3]  _   00
    [2]  i   010       [2]  i   010
    [1]  u   011       [1]  u   011
    [1]  e   100       [1]  e   100
    [1]  !   1010      [1]  !   101
    [1]  c   1011      [1]  c   1100
    [1]  o   1100      [1]  o   1101
    [1]  q   1101      [1]  q   1110
    [1]  v   1110      [1]  v   1111
             1111
    
    _ (00)
    
    [2]  _   00
    [2]  i   010
    [1]  u   011
    [1]  e   100
    [1]  !   101
    [1]  c   1100
    [1]  o   1101
    [1]  q   1110
    [1]  v   1111
    
    q (1110)
    
    [2]  _   00        [2]  _   00
    [2]  i   010       [2]  i   010
    [1]  u   011       [1]  u   011
    [1]  e   100       [1]  e   100
    [1]  !   101       [1]  !   101
    [1]  c   1100      [1]  c   110
    [1]  o   1101      [1]  o   1110
    [1]  v   1110      [1]  v   1111
             1111
    
    u (011)
    
    [2]  _   00        [2]  _   00
    [2]  i   010       [2]  i   010
    [1]  e   011       [1]  e   011
    [1]  !   100       [1]  !   100
    [1]  c   101       [1]  c   101
    [1]  o   110       [1]  o   110
    [1]  v   1110      [1]  v   111
             1111
    
    e (011)
    
    [2]  _   00        [2]  _   00
    [2]  i   010       [2]  i   01
    [1]  !   011       [1]  !   100
    [1]  c   100       [1]  c   101
    [1]  o   101       [1]  o   110
    [1]  v   110       [1]  v   111
             111
    
    _ (00)
    
    [2]  i   00
    [1]  _   01
    [1]  !   100
    [1]  c   101
    [1]  o   110
    [1]  v   111
    
    v (111)
    
    [2]  i   00        [2]  i   00
    [1]  _   01        [1]  _   01
    [1]  !   100       [1]  !   10
    [1]  c   101       [1]  c   110
    [1]  o   110       [1]  o   111
             111
    
    o (111)
    
    [2]  i   00        [2]  i   00
    [1]  _   01        [1]  _   01
    [1]  !   10        [1]  !   10
    [1]  c   110       [1]  c   11
             111
    
    i (00)
    
    [1]  i   00
    [1]  _   01
    [1]  !   10
    [1]  c   11
    
    c (11)
    
    [1]  i   00        [1]  i   0
    [1]  _   01        [1]  _   10
    [1]  !   10        [1]  !   11
             11
    
    i (0)
    
    [1]  _   0         [1]  _   0
    [1]  !   10        [1]  !   1
             11
    
    _ (0)
    
    [1]  !   1
             0
             
    ! ()

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    There are a lot of old papers that compare static arithmetic coder where one has all the counts in a header and than used the reduced stats for each symbol when it occurs. The paper shows that its basically the same as a simple adaptive arithmetic compressor where each symbol starts with weight of one and you increment up.

    So its unlikely this would be any better on average than a simple bijective adaptive huffman compressor.

  3. #3
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by caveman View Post
    This is a sample of the ALWC (Autumn Leaves Winter Cut-back) algorithm I came up last night, it's a kind of false dynamic Huffman, not sure its really helpful since both the encoder and the decoder need the real number of occurrences of each letter beforehand which leads to larger headers for limited gains.

    Basically it starts by building a standard Huffman prefix-free code, after each symbol encoding/decoding it updates its occurrence count and keeps the symbol list sorted until the symbol count reaches zero — the Autumn Leave — resulting in a small prefix-free code change (two of the longest codes are fused together to form a one bit shorter code) — the Winter Cut-back —.

    Sample used "Les beaux kiwis que voici !"
    Symbol count between square brackets.

    Code:
    [5]  _   00
    [4]  i   010
    [3]  e   011
    [2]  s   1000
    [2]  u   1001
    [1]  x   1010
    [1]  !   10110
    [1]  L   10111
    [1]  a   11000
    [1]  b   11001
    [1]  c   11010
    [1]  k   11011
    [1]  o   11100
    [1]  q   11101
    [1]  v   11110
    [1]  w   11111
    
    L (10111)
    
    [5]  _   00        [5]  _   00
    [4]  i   010       [4]  i   010
    [3]  e   011       [3]  e   011
    [2]  s   1000      [2]  s   1000
    [2]  u   1001      [2]  u   1001
    [1]  x   1010      [1]  x   1010
    [1]  !   10110     [1]  !   1011
    [1]  a   10111     [1]  a   11000
    [1]  b   11000     [1]  b   11001
    [1]  c   11001     [1]  c   11010
    [1]  k   11010     [1]  k   11011
    [1]  o   11011     [1]  o   11100
    [1]  q   11100     [1]  q   11101
    [1]  v   11101     [1]  v   11110
    [1]  w   11110     [1]  w   11111
             11111
    
    e (011)
    
    ....

    Here a sample from a bijective 2 pass adaptive Huffman compared to yours the static table takes up
    the same amount of space. Same number of variables and the counts add up to the same value so table
    should take same amount of code. Actually the Header of mine can be smaller up general but that't
    not the point I am trying to make. I am assuming that you have a list of used symbols and there count
    Yours orders from highest to lowest. I ordered from first seen to next seen. The numbers add to same
    value in each case. Then again depending on counts and symbols used your table be smaller than mine.
    I am only looking at this example you picked.

    However you count down. I count up and get a free spot for where they start
    instead of counts in first table store relative jumps

    Code:
    [1]  L
    [1]  e
    [1]  s
    [1]  _ 
    [1]  b
    [2]  a
    [1]  u
    [1]  x
    [2]  k
    [1]  i
    [1]  w
    [4]  q
    [4]  v
    [1]  o
    [2]  c
    [3]  !
    [/QUOTE]

    Now the data part of stream

    the "Les_b" is encoded with NULL (no bits)

    no for the current Huffman table the next character "e" is encoded with at most 2 bits
    since only 5 symbols used so far and each has a weight of 1

    the "aux" encoded with Null (no bits) since position known

    the "_" is encoded next since with at most 4 bits depending on how you build Huffman table

    the "kiw" again encoded with Null (no bits) since known position and so on

    So far have "Les_beaux_kiw" only used at most 6 bits so far and no symbol that only exist once
    need be encoded at all. The worst case encoding occurs for symbols that occur only twice
    and the max they could be encoded with would be 5 bits.

    I am sure this is shorter than your example.

  4. #4
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    Quote Originally Posted by biject.bwts View Post
    Here a sample from a bijective 2 pass adaptive Huffman compared to yours the static table takes up
    the same amount of space. Same number of variables and the counts add up to the same value so table
    should take same amount of code. Actually the Header of mine can be smaller up general but that't
    not the point I am trying to make. I am assuming that you have a list of used symbols and there count
    Yours orders from highest to lowest. I ordered from first seen to next seen. The numbers add to same
    value in each case. Then again depending on counts and symbols used your table be smaller than mine.
    I am only looking at this example you picked.

    However you count down. I count up and get a free spot for where they start
    instead of counts in first table store relative jumps

    Code:
    [1]  L
    [1]  e
    [1]  s
    [1]  _ 
    [1]  b
    [2]  a
    [1]  u
    [1]  x
    [2]  k
    [1]  i
    [1]  w
    [4]  q
    [4]  v
    [1]  o
    [2]  c
    [3]  !
    I had to read it at least six times to start to figure out how it works.
    I suppose that something like this gives your first table:
    Code:
    Les_beaux_kiwis_que_voici_!
    11111 211 211   4   41 2  3
    Now the data part of stream

    the "Les_b" is encoded with NULL (no bits)

    no for the current Huffman table the next character "e" is encoded with at most 2 bits
    since only 5 symbols used so far and each has a weight of 1
    should be at least 2 and at most 3, or I'm totally lost?!

    the "aux" encoded with Null (no bits) since position known

    the "_" is encoded next since with at most 4 bits depending on how you build Huffman table
    minimum variance would ensure 3 bits

    Now correct me if I'm wrong, you have to build or at least update the Huffman tree (Vitter algorithm?) when symbol counts change, whereas I only build it once and make a small change to the prefixfree codes when a symbol count has reached zero (which is of course way less precise, I wanted to stay away from manipulating the tree but still being able to virtually chop it down at relatively low CPU cost). Your proposal is very clever and more efficient but also seems slower since you have to progressively grow the tree... it is also definitely less poetic.
    Last edited by caveman; 10th January 2014 at 06:50.

  5. #5
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    No the values I stored are relative jumps. You stored counts The headers have the symbols used I used offset where you used counts.

    the "Les_b" is encoded for free. Since the table that is stored with the code only contains relative offsets. IT IS NOT A STANDARD HUFFMAN TABLE
    though the same size as yours.1

    the first time you have to write data when the next e appears since position unknown

    at this point if using Huffman you have 5 variables they all have weight of 1
    Code:
    [1] L     10       I go from  10000.....  to 000...  for Huffman vales  because of method I use to convert from bit string or byte file bijectively
    [1] e     01
    [1] s     11
    [1] _     001
    [1] b     000     so the first unknown character e is encoded with  01 two bits
    
     You know have compressed "Les_be"
    
    
    now e has weight of two    moved to top
    
    [2] e  10  
    [1] L  01
    [1] s  11
    [1] _  001
    [1] b  000
    
     Because of stored table you don't write code for
    next two letters there placement know from table so you still have the 01 two bits
    with the data "Les_beaux"  encoded
    but know the Huffman table used to compress that exists only in the computer doing the coding
    need to add the "aux" and encode for the next symbol used in the file.
    
    I don't really think of its as Vitter since Vitter always has cell for the unknown symbol but call it what you want.
    
    [2] e  100 
    [1] L  010
    [1] s  110
    [1] _  001
    [1] b  101
    [1] a  011
    [1] u  111
    [1] x  000
    
     now code the next _ as 001
    
      update table so _  has count of 2
    and  the "kiw"  is free
    add the kiw to running Huffman table
    
    [2] _  100 
    [2] e  010
    [1] L  110
    [1] s  001
    [1] b  101
    [1] a  0110
    [1] u  1110
    [1] x  0001
    [1] k  0111
    [1] i  1111
    [1] w  0000
    
    encode the i with 1111  
    
    you have compressed 
    up to "Les_beaux_kiwi"
    
    update and continue
    This table with jumps is like what I did with my bijective DC so it does work.
    Maybe will write this program sure I have done it before.

  6. #6
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by caveman View Post
    Now correct me if I'm wrong, you have to build or at least update the Huffman tree (Vitter algorithm?) when symbol counts change, whereas I only build it once and make a small change to the prefixfree codes when a symbol count has reached zero (which is of course way less precise, I wanted to stay away from manipulating the tree but still being able to virtually chop it down at relatively low CPU cost). Your proposal is very clever and more efficient but also seems slower since you have to progressively grow the tree... it is also definitely less poetic.
    Well I think bijective compression less wasteful more poetic but slower. I also think as quantum computers more common there will need to be in use encryption programs that
    have bijective compression with likely added noise in such a way that it is impossible to break since for quantum computers to solve a problem there has to be a single answer. If
    there are many keys that work to give something the quantum computer is of little use.

Similar Threads

  1. Demixer - new tree-based bitwise CM codec is in development
    By Piotr Tarsa in forum Data Compression
    Replies: 34
    Last Post: 17th March 2013, 20:33
  2. LZWC - A fast tree based LZW compressor
    By david_werecat in forum Data Compression
    Replies: 6
    Last Post: 16th January 2013, 03:45
  3. Suffix Tree's internal representation
    By Piotr Tarsa in forum Data Compression
    Replies: 4
    Last Post: 18th December 2011, 08:37
  4. lzma's binary tree matchfinder
    By Shelwien in forum Data Compression
    Replies: 7
    Last Post: 13th October 2011, 08:38
  5. OptimFROG is back!
    By m^2 in forum Data Compression
    Replies: 3
    Last Post: 16th February 2011, 00:23

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •