# Thread: Easy for you, very hard for me... aid!

1. ## Easy for you, very hard for me... aid!

I want to write compression software (Huffman coding), but I have a problem.

I'm a newbie in this domain. My question is important: how does a program compress a large data file in a short time? What program structure should be used to scan all the bytes of a file quickly?

I wrote compression software, but compressing a 1 MB file takes 10 minutes, which is far too long!

2.

3. Assume you're using naive static Huffman coding:

counting the frequency table: O(n)
building the Huffman table: O(k)
running the Huffman encoding: O(n)

How can it cost 10 minutes for n = 1,000,000 and k = 256? (A sketch of the two O(n) passes is below.)

BTW, I think this page helps: http://www.catb.org/esr/faqs/smart-questions.html
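A minimal sketch of the two O(n) passes described above, assuming a 256-symbol byte alphabet and an already-built code table; the function and array names here are illustrative, not taken from any poster's code:

```cpp
#include <cstdint>
#include <vector>

// Pass 1: count byte frequencies -- O(n), one table update per input byte.
void count_freqs(const std::vector<uint8_t>& in, uint64_t freq[256]) {
    for (int i = 0; i < 256; i++) freq[i] = 0;
    for (uint8_t b : in) freq[b]++;
}

// Pass 2: emit the code for each byte -- O(n), assuming code[b]/len[b]
// were produced by whatever Huffman table builder is used.
void encode(const std::vector<uint8_t>& in,
            const uint32_t code[256], const uint8_t len[256],
            std::vector<uint8_t>& out) {
    uint64_t bitbuf = 0;            // pending bits, least significant first
    int bitcount = 0;
    for (uint8_t b : in) {
        bitbuf |= (uint64_t)code[b] << bitcount;
        bitcount += len[b];
        while (bitcount >= 8) {     // flush whole bytes as they fill up
            out.push_back((uint8_t)bitbuf);
            bitbuf >>= 8;
            bitcount -= 8;
        }
    }
    if (bitcount > 0) out.push_back((uint8_t)bitbuf);   // final partial byte
}
```

On a 1 MB input, both passes are only a few million simple operations, so they should take well under a second; if a straightforward implementation needs minutes, the time is almost certainly going elsewhere (for example, per-bit file I/O or rebuilding the table for every byte).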

4. Originally Posted by RichSelian: Assume you're using naive static Huffman coding:

counting the frequency table: O(n)
building the Huffman table: O(k)
running the Huffman encoding: O(n)

How can it cost 10 minutes for n = 1,000,000 and k = 256?

BTW, I think this page helps: http://www.catb.org/esr/faqs/smart-questions.html
What's the algorithm for building the tree in O(k)? With a binary heap, it's O(k log k).
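For reference, a small sketch of the binary-heap construction being referred to: k leaves go into a min-heap, and each of the k-1 merges costs O(log k), so the whole build is O(k log k). The Node layout is just an illustration:

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Node {
    uint64_t weight;
    int left, right;   // indices into the node pool, -1 for leaves
};

// Build a Huffman tree from k weights; returns the node pool, root is the last node.
std::vector<Node> build_tree(const std::vector<uint64_t>& weights) {
    std::vector<Node> pool;
    using Item = std::pair<uint64_t, int>;   // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (uint64_t w : weights) {
        pool.push_back({w, -1, -1});
        heap.push({w, (int)pool.size() - 1});
    }
    while (heap.size() > 1) {                // k-1 merges, O(log k) each
        auto [w1, a] = heap.top(); heap.pop();
        auto [w2, b] = heap.top(); heap.pop();
        pool.push_back({w1 + w2, a, b});
        heap.push({w1 + w2, (int)pool.size() - 1});
    }
    return pool;
}
```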
5. Originally Posted by nburns: What's the algorithm for building the tree in O(k)? With a binary heap, it's O(k log k).

http://www.staff.science.uu.nl/~leeuw112/huffman.pdf

6. Building a Huffman code in O(k): http://mattmahoney.net/dc/dce.html#Section_31

If you want something fast and simple, use LZ77 or LZSS with fixed codes. For example, code a match with 1 byte for length and 2 or 3 bytes for the offset. Code a literal with 1 byte for the length. Use 1 or 2 bits of the length code to indicate whether it is a match or a literal and how many match bytes. During compression use a hash table to look for matches. Look at the code for snappy, lazy, or lz4 for examples.
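A rough sketch of the kind of byte-aligned LZSS coder described above: one token byte whose top bit distinguishes a match from a literal run, a 2-byte offset, and a hash table of recent positions for match finding. The exact token layout, hash function, and limits are arbitrary choices for illustration, not the format of snappy, lazy, or lz4:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

static const int MIN_MATCH = 4;

static uint32_t hash4(const uint8_t* p) {            // hash of the next 4 bytes
    uint32_t x; memcpy(&x, p, 4);
    return (x * 2654435761u) >> 20;                   // 12-bit hash -> 4096 slots
}

std::vector<uint8_t> compress(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    std::vector<int32_t> head(1 << 12, -1);           // last position seen for each hash
    size_t i = 0, lit_start = 0;
    auto flush_literals = [&](size_t end) {           // emit pending literals in runs of <=127
        while (lit_start < end) {
            size_t run = std::min<size_t>(end - lit_start, 127);
            out.push_back((uint8_t)run);              // top bit 0 = literal run
            out.insert(out.end(), in.begin() + lit_start, in.begin() + lit_start + run);
            lit_start += run;
        }
    };
    while (i + MIN_MATCH <= in.size()) {
        uint32_t h = hash4(&in[i]);
        int32_t cand = head[h];
        head[h] = (int32_t)i;                         // remember this position
        size_t len = 0;
        if (cand >= 0 && i - cand <= 65535) {         // candidate within the 2-byte offset window
            while (i + len < in.size() && len < (size_t)MIN_MATCH + 127 &&
                   in[cand + len] == in[i + len]) len++;
        }
        if (len >= (size_t)MIN_MATCH) {
            flush_literals(i);
            out.push_back((uint8_t)(0x80 | (len - MIN_MATCH)));  // top bit 1 = match
            uint32_t off = (uint32_t)(i - cand);
            out.push_back((uint8_t)off);                         // offset, little-endian
            out.push_back((uint8_t)(off >> 8));
            i += len;
            lit_start = i;
        } else {
            i++;
        }
    }
    flush_literals(in.size());
    return out;
}
```

Decompression just reads a token byte and either copies that many literal bytes or copies the match length from the stated offset back in the already-decoded output.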

For better compression, you can use variable bit-length codes (not whole bytes); zpaq method 1 is an example. Deflate (zlib) uses Huffman codes with tables stored in the compressed file, but in my tests there is not much advantage over using simple variable-length codes.
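As one concrete example of what a simple variable-length code can look like without any stored table, here is an Elias-gamma-style coder on top of a small LSB-first bit writer. This is only an illustration of the idea; it is not the scheme zpaq method 1 or deflate actually uses:

```cpp
#include <cstdint>
#include <vector>

struct BitWriter {
    std::vector<uint8_t> bytes;
    uint64_t buf = 0;                    // pending bits, least significant first
    int n = 0;
    void put(uint32_t bits, int count) { // caller passes only 'count' significant bits
        buf |= (uint64_t)bits << n;
        n += count;
        while (n >= 8) { bytes.push_back((uint8_t)buf); buf >>= 8; n -= 8; }
    }
    void flush() { if (n > 0) { bytes.push_back((uint8_t)buf); buf = 0; n = 0; } }
};

// Gamma-style code for v >= 1: (bits-1) zeros, a one, then the low bits-1 bits of v.
// Small values (the common short lengths) get short codes, and no table is needed:
// the decoder counts zeros up to the first one, then reads that many payload bits.
void put_gamma(BitWriter& bw, uint32_t v) {
    int bits = 0;
    for (uint32_t t = v; t; t >>= 1) bits++;          // significant bits in v
    bw.put(1u << (bits - 1), bits);                   // unary part: bits-1 zeros then a one
    bw.put(v & ((1u << (bits - 1)) - 1), bits - 1);   // remaining payload bits
}
```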

You can improve compression further using context modeling and arithmetic coding of matches and literals, like LZMA (7zip) or zpaq method 3, but it is slower. I also use a suffix array to find matches in zpaq method 3, which finds better matches and slows compression but not decompression. Depending on heuristics, I may compute the BWT from the suffix array instead, which is a fast step.

7. Thank you for these good tips.

8. I skimmed through that, and it describes a linear approximate algorithm (so not necessarily an optimal prefix code) and an O(k lg k) precise algorithm (so no breakthrough).

Originally Posted by Matt Mahoney: Building a Huffman code in O(k): http://mattmahoney.net/dc/dce.html#Section_31
Maintaining a priority queue of subtrees costs O(lg k) per merge, so the whole algorithm is O(k lg k).

9. You're right, sorting the codes by probability takes O(k log k) time.

10. Originally Posted by Piotr Tarsa: I skimmed through that, and it describes a linear approximate algorithm (so not necessarily an optimal prefix code) and an O(k lg k) precise algorithm (so no breakthrough).

Maintaining a priority queue of subtrees costs O(lg k) per merge, so the whole algorithm is O(k lg k).
I'm sorry you are not impressed by Jan van Leeuwen's paper, which provided the first proof that optimal Huffman codes can be built in O(n) time if the codes are already sorted by probability. It was a breakthrough at the time it was published.

11. Sorting takes O(k lg k) time unless you use a non-comparison sort. Also, I don't see any hard numbers, so it's possible that the constants hidden in the asymptotic complexity kill the idea (as with, e.g., the AKS sorting network). Does anyone use van Leeuwen's algorithm and can provide performance measurements?

Edit:
I've finally read a bit deeper into the paper, and the algorithms seem to have pretty low complexity.
Basically, all that's needed is a linked list of subtrees sorted by weight, plus an iterator that supports examining the immediate left and right neighbors, moving to the right, and inserting on the immediate left, all while maintaining the sortedness.

12. Originally Posted by Piotr Tarsa: Sorting takes O(k lg k) time unless you use a non-comparison sort. Also, I don't see any hard numbers, so it's possible that the constants hidden in the asymptotic complexity kill the idea (as with, e.g., the AKS sorting network). Does anyone use van Leeuwen's algorithm and can provide performance measurements?

Edit:
I've finally read a bit deeper into the paper, and the algorithms seem to have pretty low complexity.
Basically, all that's needed is a linked list of subtrees sorted by weight, plus an iterator that supports examining the immediate left and right neighbors, moving to the right, and inserting on the immediate left, all while maintaining the sortedness.
The van Leeuwen paper is referenced on Wikipedia. The way it's written is old-fashioned and confusing, because he talks about different kinds of trees without clearly distinguishing them. I am pretty sure that algorithm builds real Huffman trees in O(n) for pre-sorted probabilities.

It seems like a pretty easy proof. You have a sorted array and you're looping over it. For every two elements, you create a new element equal to their sum. Pretend that each time you make a new element, you just set it to the side. What order are the new elements going to come out in?

It seems clear to me that they're going to come out in increasing order (it's late, though). So you have sorted elements that were already there, and you have new elements coming out sorted. All you have to do is merge them.

I think you can do this with no data structures other than the array itself. Also, I think further improvements are possible. All you really want are the lengths; you don't want the tree. All you need to know is how many codes there are of each length, so it may be sublinear, in fact.
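A sketch of the two-queue construction the paper and the posts above describe, assuming the frequencies are already sorted in non-decreasing order; it returns only the code lengths, and the helper names are my own:

```cpp
#include <cstdint>
#include <vector>

// Two-queue construction (van Leeuwen style): with sorted input, merged subtrees
// also come out in non-decreasing weight order, so a second FIFO queue replaces
// the priority queue and the whole build is O(k).
std::vector<int> huffman_lengths_sorted(const std::vector<uint64_t>& freq) {
    int k = (int)freq.size();
    if (k == 0) return {};
    if (k == 1) return {1};                      // degenerate case: a single symbol
    int total = 2 * k - 1;                       // k leaves + (k-1) internal nodes
    std::vector<uint64_t> weight(total);
    std::vector<int> parent(total, -1);
    for (int i = 0; i < k; i++) weight[i] = freq[i];

    int leaf = 0;                                // front of the leaf queue
    int node = k;                                // front of the internal-node queue
    int next = k;                                // index of the next internal node
    auto take_smallest = [&]() {                 // pop the smaller of the two queue fronts
        if (node >= next || (leaf < k && weight[leaf] <= weight[node])) return leaf++;
        return node++;
    };
    for (; next < total; next++) {
        int a = take_smallest();
        int b = take_smallest();
        weight[next] = weight[a] + weight[b];
        parent[a] = parent[b] = next;
    }

    // Depths: a parent always has a larger index than its children,
    // so one backward pass from the root fills everything in.
    std::vector<int> depth(total, 0);
    for (int i = total - 2; i >= 0; i--) depth[i] = depth[parent[i]] + 1;
    return std::vector<int>(depth.begin(), depth.begin() + k);
}
```

Given a byte histogram, you would first sort the 256 (frequency, symbol) pairs, call this on the sorted frequencies, and then hand the lengths back to the symbols.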
13. I've discovered this algorithm myself and implemented it with two arrays in tornado, but probably there are many Huffman implementations doing the same.

14. There are some easy explanations of merging the two arrays (leaves and symbols) if you have the probabilities already sorted:
http://www.compressconsult.com/huffman/#codelengths

15. So what are the dominant costs in practice, if you're doing all this right? Radix sorting the frequencies? Outputting the bit stream?

16. Originally Posted by Matt Mahoney no FSE?

17. You mean asymmetric coding? http://mattmahoney.net/dc/dce.html#Section_33

18. Originally Posted by cade: There are some easy explanations of merging the two arrays (leaves and symbols) if you have the probabilities already sorted:
http://www.compressconsult.com/huffman/#codelengths
That is pretty good. I'm thinking you might be able to avoid building the tree, even. For canonical codes, all you want to know is how many leaves are at each depth.

You start at the leftmost leaf at the deepest level of the tree and you visit the nodes level-wise, ascending toward the root. It's basically a breadth-first traversal, but backwards. You can tell for sure that you're on the next level when you make a new internal node out of the first internal node from the previous level. It seems like there might be enough information to know when you've changed levels in O(1) time, using basically CPU registers for storage. Since you know that levels will be contiguous, and you don't care about symbols per se, just counts, you could potentially skip some nodes and just search for the end with something like binary search. Here are some potential tools: the Kraft inequality; the total sum is known; you can rearrange the tree as much as you want as long as the depths remain the same; the start of successive levels should be separated by around a factor of 2.

It's not a very important problem, since it's already fast enough. But the final algorithm might be incredibly elegant.
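A small sketch of why per-depth counts are enough for canonical codes: this is the usual canonical-code assignment (the same scheme deflate uses), where codes of equal length are consecutive integers and each length starts just past the end of the previous one. Here `len[s]` is the code length of symbol s, with 0 meaning the symbol is unused:

```cpp
#include <cstdint>
#include <vector>

std::vector<uint32_t> canonical_codes(const std::vector<int>& len) {
    int max_len = 0;
    for (int l : len) if (l > max_len) max_len = l;
    std::vector<int> count(max_len + 1, 0);       // how many codes of each length
    for (int l : len) if (l > 0) count[l]++;

    std::vector<uint32_t> next(max_len + 1, 0);   // first code of each length
    uint32_t code = 0;
    for (int l = 1; l <= max_len; l++) {
        code = (code + count[l - 1]) << 1;
        next[l] = code;
    }
    std::vector<uint32_t> out(len.size(), 0);
    for (size_t s = 0; s < len.size(); s++)       // same-length symbols get consecutive codes
        if (len[s] > 0) out[s] = next[len[s]]++;
    return out;
}
```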
19. Originally Posted by Paul W.: So what are the dominant costs in practice, if you're doing all this right? Radix sorting the frequencies? Outputting the bit stream?

For me, it is sorting the frequencies. I use gnome sort to do this because the arrangement is usually coherent between table generations. Beyond that, I just have a merge of two sorted arrays. Assigning codes is probably the second most costly part, but it is fixed time and relatively fast. The whole process is fast enough for practical purposes.
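For what it's worth, a minimal gnome sort over (frequency, symbol) pairs; on a nearly sorted table (the "coherent between generations" case) it does close to a single pass, while a badly shuffled table degrades to O(k^2):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sort (frequency, symbol) pairs in place by frequency, then symbol.
void gnome_sort(std::vector<std::pair<uint64_t, int>>& a) {
    size_t i = 1;
    while (i < a.size()) {
        if (a[i - 1] <= a[i]) {
            i++;                          // in order: step forward
        } else {
            std::swap(a[i - 1], a[i]);    // out of order: swap and step back
            if (i > 1) i--;
        }
    }
}
```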
20. Originally Posted by cade: I use gnome sort to do this because the arrangement is usually coherent between table generations.

I sorted on freq+symbol to ensure stable sorting. In fact, that was exactly the bug fixed in tor 0.4a.

21. Why a stable sort? Did you just want to handle the case where a user compressed and decompressed with tornado binaries that have different std::sort implementations compiled in? If so, a more efficient solution would be to include a fixed sort algorithm in the Huffman code-building routine.

22. Well, it may be faster, but overall, depending on the details of the sorting implementation, it looks like a bad idea. Imagine that I want to replace it with a new, better sort().

23. Originally Posted by Bulat Ziganshin: I sorted on freq+symbol to ensure stable sorting. In fact, that was exactly the bug fixed in tor 0.4a.

To clarify, I did not mean "coherent" as in stable sorting. I meant that the frequencies are coherent in that their arrangement has not changed much, so gnome sort is nice when your input is mostly sorted.

24. Originally Posted by Bulat Ziganshin: I sorted on freq+symbol to ensure stable sorting. In fact, that was exactly the bug fixed in tor 0.4a.

I was thinking that you'd keep a set of frequencies keyed by symbol and strip away the symbols for doing Huffman. At the end of Huffman, you have a set of (freq, length) pairs. You can then do something like counting sort to reattach symbols to lengths via their frequencies.
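A sketch of that reattachment step, using a map bucket per frequency value rather than a literal counting sort, just to keep it short. Symbols that share a frequency may trade lengths, but any such assignment is equally optimal. The function and parameter names are made up for illustration, and `sorted_freq` / `length_by_rank` are assumed to be the sorted frequencies and the lengths computed for them:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hand the computed lengths back to symbols by matching frequency values.
// Assumes sorted_freq is a permutation of freq_by_symbol.
std::vector<int> assign_lengths(const std::vector<uint64_t>& freq_by_symbol,
                                const std::vector<uint64_t>& sorted_freq,
                                const std::vector<int>& length_by_rank) {
    std::map<uint64_t, std::vector<int>> bucket;     // lengths grouped by frequency
    for (size_t i = 0; i < sorted_freq.size(); i++)
        bucket[sorted_freq[i]].push_back(length_by_rank[i]);

    std::vector<int> len(freq_by_symbol.size());
    for (size_t s = 0; s < freq_by_symbol.size(); s++) {
        auto& v = bucket[freq_by_symbol[s]];         // one length per symbol with that frequency
        len[s] = v.back();
        v.pop_back();
    }
    return len;
}
```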
