Thread: Easy for you, very hard for me... help!

  1. #1
    Member
    Join Date
    May 2014
    Location
    c++
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Easy for you, very hard for me... help!

    I want to program compression software (Huffman coding), but I have a problem.

    I'm a newbie in this domain, but my question is important: how does a program compress a large file in a short time? What program structure is used to analyze all the bytes of a file quickly?

    I wrote compression software, but it takes 10 minutes to compress a 1 MB file. That is far too long!

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

  3. #3
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    156
    Thanks
    18
    Thanked 50 Times in 26 Posts
    assuming you're using naive static huffman coding:

    counting the frequency table: O(n)
    building the huffman table: O(k)
    running the huffman encoder: O(n)

    how can it take 10 minutes for n = 1,000,000 and k = 256? (a sketch of the two O(n) passes follows below)

    BTW I think this page helps: http://www.catb.org/esr/faqs/smart-questions.html
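
    For reference, a minimal C++ sketch (mine, not RichSelian's code) of what those two O(n) passes can look like. It assumes the Huffman table is already built and that code lengths stay at or below 56 bits, which holds for any realistic frequency distribution. A common cause of a 10-minute run on 1 MB is not the algorithm but writing output one bit at a time through unbuffered file I/O; buffering bits in a machine word and flushing whole bytes avoids that.

        #include <cstdint>
        #include <cstdio>

        // O(n): one table lookup per input byte. freq must be zeroed by the caller.
        void count_freqs(const uint8_t* buf, size_t n, uint64_t freq[256]) {
            for (size_t i = 0; i < n; i++) freq[buf[i]]++;
        }

        // O(n): emit prebuilt codes through a 64-bit bit buffer, flushing whole
        // bytes. code[b]/len[b] come from the O(k) table-building step.
        // Assumes len[b] <= 56 so the buffer never overflows.
        void encode(const uint8_t* buf, size_t n,
                    const uint64_t code[256], const uint8_t len[256], FILE* out) {
            uint64_t bits = 0;
            int nbits = 0;                       // pending bits in the buffer
            for (size_t i = 0; i < n; i++) {
                bits = (bits << len[buf[i]]) | code[buf[i]];
                nbits += len[buf[i]];
                while (nbits >= 8) {             // flush whole bytes, not single bits
                    nbits -= 8;
                    fputc((int)(bits >> nbits), out);
                }
            }
            if (nbits) fputc((int)(bits << (8 - nbits)), out);  // pad the final byte
        }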

  4. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by RichSelian View Post
    assuming you're using naive static huffman coding:

    counting the frequency table: O(n)
    building the huffman table: O(k)
    running the huffman encoder: O(n)

    how can it take 10 minutes for n = 1,000,000 and k = 256?

    BTW I think this page helps: http://www.catb.org/esr/faqs/smart-questions.html
    What's the algorithm for building the tree in O(k)? With a binary heap, it's O(k log(k)).
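
    For comparison, the heap-based construction nburns mentions is the textbook one; here is a minimal C++ sketch (names are mine; zero-frequency symbols are assumed to have been filtered out). Each of the k-1 merges pays O(log k) for the heap operations, which is where the O(k log k) comes from.

        #include <cstdint>
        #include <functional>
        #include <queue>
        #include <utility>
        #include <vector>

        struct Node {
            uint64_t weight;
            int left = -1, right = -1;       // pool indices; -1 means leaf
        };

        // Returns the node pool; the last element is the root. freqs is non-empty.
        std::vector<Node> build_huffman(const std::vector<uint64_t>& freqs) {
            std::vector<Node> pool;
            using Item = std::pair<uint64_t, int>;   // (weight, pool index)
            std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
            for (uint64_t f : freqs) {
                pool.push_back({f});
                pq.push({f, (int)pool.size() - 1});
            }
            while (pq.size() > 1) {                  // k-1 merges, O(log k) each
                auto [w1, i1] = pq.top(); pq.pop();  // two lightest subtrees
                auto [w2, i2] = pq.top(); pq.pop();
                pool.push_back({w1 + w2, i1, i2});
                pq.push({w1 + w2, (int)pool.size() - 1});
            }
            return pool;
        }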

  5. #5
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    What's the algorithm for building the tree in O(k)? With a binary heap, it's O(k log(k)).
    http://www.staff.science.uu.nl/~leeuw112/huffman.pdf

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Building a Huffman code in O(k): http://mattmahoney.net/dc/dce.html#Section_31

    If you want something fast and simple, use LZ77 or LZSS with fixed codes. For example, code a match with 1 byte for length and 2 or 3 bytes for the offset. Code a literal with 1 byte for the length. Use 1 or 2 bits of the length code to indicate whether it is a match or a literal and how many match bytes. During compression use a hash table to look for matches. Look at the code for snappy, lazy, or lz4 for examples.
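
    To illustrate the kind of byte-aligned format Matt describes, here is a decoder sketch for a toy layout of my own (not snappy's or LZ4's actual encodings): the top bit of each token byte says match or literal run, the remaining 7 bits carry a length, and matches add a 2-byte offset. A matching compressor would find matches with a hash table, as in the post above.

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Decoder for the toy token layout described above. Assumes well-formed
        // input: offsets never reach before the start of the output buffer.
        void lzss_decode(const uint8_t* in, size_t n, std::vector<uint8_t>& out) {
            size_t i = 0;
            while (i < n) {
                uint8_t tok = in[i++];
                if (tok & 0x80) {                    // match: 7-bit length, 2-byte offset
                    size_t len = (tok & 0x7F) + 3;   // shortest match worth coding
                    size_t off = in[i] | ((size_t)in[i + 1] << 8);
                    i += 2;
                    size_t src = out.size() - off;
                    for (size_t j = 0; j < len; j++) // byte loop: copies may overlap
                        out.push_back(out[src + j]);
                } else {                             // literal run: 7-bit length
                    size_t len = (size_t)tok + 1;
                    out.insert(out.end(), in + i, in + i + len);
                    i += len;
                }
            }
        }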

    For better compression, you can use variable bit length codes (not whole bytes). For example, zpaq method 1. deflate (zlib) uses Huffman codes with tables stored in the compressed file, but in my tests there is not much advantage over using simple variable length codes.

    You can improve compression further using context modeling and arithmetic coding of matches and literals like LZMA (7zip) or zpaq method 3, but it is slower. I also use a suffix array to find matches in zpaq method 3, which finds better matches and slows compression but not decompression. Depending on heuristics I may compute the BWT from the suffix array instead, which is a fast step.

  7. #7
    Member
    Join Date
    May 2014
    Location
    c++
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thank you for this good tips

  8. #8
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    I skimmed through it, and it describes a linear approximate algorithm (so not necessarily an optimal prefix code) and an O(k lg k) exact algorithm (so no breakthrough).

    Quote Originally Posted by Matt Mahoney View Post
    Building a Huffman code in O(k): http://mattmahoney.net/dc/dce.html#Section_31
    Maintaining a priority queue of subtrees costs O(lg k) per merge, so the whole algorithm is O(k lg k).

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    You're right, sorting the codes by probability takes O(k log k) time.

  10. #10
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    I skimmed through it, and it describes a linear approximate algorithm (so not necessarily an optimal prefix code) and an O(k lg k) exact algorithm (so no breakthrough).


    Maintaining a priority queue of subtrees costs O(lg k) per merge, so the whole algorithm is O(k lg k).
    I'm sorry you're not impressed by Jan van Leeuwen's paper, which provided the first proof that optimal Huffman codes can be built in O(n) time when the frequencies are already sorted. It was a breakthrough at the time it was published.

  11. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Sorting takes O(k lg k) time unless you use a non-comparison sort. Also, I don't see any hard numbers, so there's a possibility that the constants hidden in the asymptotic complexity kill the idea (as with, e.g., the AKS sorting network). Does anyone use van Leeuwen's algorithm and can provide performance measurements?

    Edit:
    I've finally read a bit deeper into the paper, and the algorithms seem to be of pretty low complexity. Basically, all that's needed is a linked list of subtrees sorted by weight, plus an iterator that supports examining the immediate left and right neighbors, moving right, and inserting on the immediate left, all while maintaining the sorted order.
    Last edited by Piotr Tarsa; 7th May 2014 at 10:29.

  12. #12
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Sorting takes O(k lg k) time unless you use a non-comparison sort. Also, I don't see any hard numbers, so there's a possibility that the constants hidden in the asymptotic complexity kill the idea (as with, e.g., the AKS sorting network). Does anyone use van Leeuwen's algorithm and can provide performance measurements?

    Edit:
    I've finally read a bit deeper into the paper, and the algorithms seem to be of pretty low complexity. Basically, all that's needed is a linked list of subtrees sorted by weight, plus an iterator that supports examining the immediate left and right neighbors, moving right, and inserting on the immediate left, all while maintaining the sorted order.
    The van Leeuwen paper is referenced on Wikipedia. The way it's written is old-fashioned and confusing, because he talks about several kinds of trees without clearly distinguishing them. I'm fairly sure the algorithm builds real Huffman trees in O(n) for pre-sorted probabilities.

    The proof seems fairly easy. You have a sorted array and you're looping over it. For every two smallest elements, you create a new element equal to their sum. Pretend that each time you make a new element, you set it aside. In what order will the new elements come out?

    It seems clear that they come out in nondecreasing order (it's late, though): each sum is built from elements at least as large as those that fed the previous sum. So you have the sorted elements that were already there, and new elements coming out sorted; all you have to do is merge the two sequences.

    I think you can do this with no data structure other than the array itself. Further improvements may also be possible: all you really want are the code lengths, not the tree itself, i.e., how many codes of each length there are. So it may, in fact, be sublinear.
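
    A minimal C++ sketch of that two-queue reading of van Leeuwen's construction (names are mine; it expects at least two symbols with frequencies sorted ascending). The leaf queue and the internal-node queue each only advance forward, so every merge is O(1) and the whole build is O(k).

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // freqs must be sorted ascending, with at least two symbols.
        // Returns the optimal code length for each (sorted) symbol in O(k).
        std::vector<int> huffman_lengths(const std::vector<uint64_t>& freqs) {
            size_t k = freqs.size();
            std::vector<uint64_t> w(freqs);         // weights; internal nodes appended
            std::vector<int> parent(2 * k - 1, -1);
            size_t lo = 0;                          // front of the leaf queue
            size_t li = k;                          // front of the internal-node queue
            size_t next = k;                        // index of the next internal node
            auto take = [&]() -> size_t {           // pop the lighter of the two fronts
                if (li >= next || (lo < k && w[lo] <= w[li])) return lo++;
                return li++;
            };
            while (next < 2 * k - 1) {              // k-1 merges, O(1) each
                size_t a = take(), b = take();
                w.push_back(w[a] + w[b]);           // sums appear in nondecreasing order
                parent[a] = parent[b] = (int)next++;
            }
            std::vector<int> depth(2 * k - 1, 0);   // root depth 0; parents precede children
            for (size_t i = 2 * k - 2; i-- > 0; )
                depth[i] = depth[parent[i]] + 1;
            return std::vector<int>(depth.begin(), depth.begin() + k);
        }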

  13. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    i discovered this algorithm myself and implemented it with 2 arrays in tornado. but probably there are many huffman implementations doing the same.

  14. #14
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    There is a good explanation of how to merge the two arrays (leaf nodes and merged internal nodes) when the probabilities are already sorted:
    http://www.compressconsult.com/huffman/#codelengths

  15. #15
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    So what are the dominant costs in practice, if you're doing all this right? Radix sorting the frequencies? Outputting the bit stream?

  16. #16
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    156
    Thanks
    18
    Thanked 50 Times in 26 Posts
    Quote Originally Posted by Matt Mahoney View Post
    no FSE?

  17. #17
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    You mean asymmetric coding? http://mattmahoney.net/dc/dce.html#Section_33

  18. #18
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by cade View Post
    There is a good explanation of how to merge the two arrays (leaf nodes and merged internal nodes) when the probabilities are already sorted:
    http://www.compressconsult.com/huffman/#codelengths
    That is pretty good. I'm thinking you might be able to avoid building the tree entirely. For canonical codes, all you need to know is how many leaves are at each depth.

    You start at the leftmost leaf on the deepest level of the tree and visit the nodes level by level, ascending toward the root; it's basically a breadth-first traversal, but backwards. You know for certain that you've moved to the next level when you make a new internal node out of the first internal node from the previous level. There might be enough information to detect a level change in O(1) time, using essentially just CPU registers for storage. Since the levels are contiguous and you don't care about the symbols per se, just the counts, you could potentially skip some nodes and search for the end of a level with something like binary search. Some potential tools: the Kraft inequality; the total sum is known; the tree can be rearranged freely as long as the depths stay the same; and the starts of successive levels should be separated by roughly a factor of 2.

    It's not a very important problem, since it's already fast enough. But the final algorithm might be incredibly elegant.
    Last edited by nburns; 7th May 2014 at 23:53.
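
    This doesn't implement the backwards level-order traversal nburns sketches, but it shows why the per-depth counts are all you need: the standard canonical assignment (the scheme deflate uses, per RFC 1951) reconstructs every code from them.

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // lens[s] = code length of symbol s (0 = symbol unused).
        // Returns the canonical code for each symbol, as in RFC 1951.
        std::vector<uint32_t> canonical_codes(const std::vector<int>& lens, int max_len) {
            std::vector<int> count(max_len + 1, 0); // leaves at each depth: all we need
            for (int l : lens) count[l]++;
            count[0] = 0;                           // unused symbols get no code
            std::vector<uint32_t> next(max_len + 1, 0);
            uint32_t code = 0;
            for (int l = 1; l <= max_len; l++) {    // first code of each length
                code = (code + (uint32_t)count[l - 1]) << 1;
                next[l] = code;
            }
            std::vector<uint32_t> codes(lens.size(), 0);
            for (size_t s = 0; s < lens.size(); s++)
                if (lens[s]) codes[s] = next[lens[s]]++;  // consecutive within a length
            return codes;
        }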

  19. #19
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Quote Originally Posted by Paul W. View Post
    So what are the dominant costs in practice, if you're doing all this right? Radix sorting the frequencies? Outputting the bit stream?
    For me, it is sorting the frequencies. I use gnome sort for this because the arrangement is usually coherent between table generations. Beyond that, there is just a merge of two sorted arrays. Assigning codes is probably the second most costly step, but it takes fixed time and is relatively fast. The whole process is fast enough for practical purposes.
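
    Gnome sort rewards that coherence: on an already mostly-sorted array it does little more than one forward pass. A minimal sketch with assumed types, not cade's actual code:

        #include <cstddef>
        #include <cstdint>
        #include <utility>
        #include <vector>

        // Sort (frequency, symbol) pairs ascending. On a nearly sorted array
        // this is close to one linear pass.
        void gnome_sort(std::vector<std::pair<uint64_t, int>>& a) {
            size_t i = 1;
            while (i < a.size()) {
                if (a[i - 1] <= a[i]) {
                    i++;                        // in order: keep walking forward
                } else {
                    std::swap(a[i - 1], a[i]);  // out of order: bubble it backward
                    if (i > 1) i--;
                }
            }
        }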

  20. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by cade View Post
    I use gnome sort for this because the arrangement is usually coherent between table generations.
    i sorted on freq+symbol to ensure stable sorting. in fact it was exactly the bug fixed in tor 0.4a
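
    One way to realize the freq+symbol key Bulat describes (an illustration, not tornado's actual code): pack the frequency and the symbol into one integer, so the keys are unique and any correct sort, stable or not, produces the same order on both the compressor and decompressor side.

        #include <algorithm>
        #include <cstdint>

        // Sort 256 symbols by (frequency, symbol). The packed keys are unique,
        // so the resulting order is deterministic for any correct sort.
        void sort_freq_symbol(const uint32_t freq[256], int order[256]) {
            uint64_t key[256];
            for (int s = 0; s < 256; s++)
                key[s] = ((uint64_t)freq[s] << 8) | (uint64_t)s;  // freq major, symbol minor
            std::sort(key, key + 256);
            for (int s = 0; s < 256; s++)
                order[s] = (int)(key[s] & 0xFF);                  // symbols in sorted order
        }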

  21. #21
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Why a stable sort? Did you just want to handle the case where a user compressed and decompressed with tornado binaries that had different std::sort implementations compiled in? If so, a more efficient solution would be to include a fixed sort algorithm in the Huffman code building routine.

  22. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    well, it may be faster, but making the output depend on the details of the sorting implementation looks like a bad idea overall. imagine that i want to replace it with a new, better sort()

  23. #23
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    i sorted on freq+symbol to ensure stable sorting. in fact it was exactly the bug fixed in tor 0.4a
    To clarify, by "coherent" I did not mean stable sorting. I meant that the arrangement of frequencies has not changed much between generations, so gnome sort is nice because its input is mostly sorted.

  24. #24
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    i sorted on freq+symbol to ensure stable sorting. in fact it was exactly the bug fixed in tor 0.4a
    I was thinking that you'd keep the frequencies keyed by symbol and strip away the symbols before running Huffman. At the end of Huffman, you have a set of (freq, length) pairs. You can then do something like a counting sort to reattach symbols to lengths via their frequencies. (A sketch follows below.)
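
    A sketch of the reattachment nburns suggests (names are mine; std::sort stands in for whatever sort produced the Huffman input order). Among equal frequencies the pairing is arbitrary, which does not change the total code length, though encoder and decoder must of course break ties the same way.

        #include <algorithm>
        #include <cstddef>
        #include <cstdint>
        #include <unordered_map>
        #include <vector>

        // freq[s] = frequency of symbol s. len_sorted[i] = code length assigned
        // to the i-th smallest frequency (output of a sorted-input Huffman build).
        // Returns the code length of each symbol.
        std::vector<int> reattach(const std::vector<uint64_t>& freq,
                                  const std::vector<int>& len_sorted) {
            std::vector<uint64_t> sorted(freq);
            std::sort(sorted.begin(), sorted.end());
            std::unordered_map<uint64_t, size_t> next;  // frequency -> next unused slot
            for (size_t i = sorted.size(); i-- > 0; )
                next[sorted[i]] = i;                    // ends at the first occurrence
            std::vector<int> len(freq.size());
            for (size_t s = 0; s < freq.size(); s++)
                len[s] = len_sorted[next[freq[s]]++];   // ties consume slots in order
            return len;
        }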
