I have talked with another person, and I will likely post in this thread about bijective coding. I know it's too much to hope that people will make their whole approach to file compression bijective, but I would like to discuss how to make a standard entropy coder bijective.
The first question to discuss is how one makes a Huffman compressor with a fixed table map its tokens, which are of various lengths, to a file made of 8-bit bytes.
Actually, with the same ease you could map to any infinite structure of data blocks, even blocks with varying numbers of bits, but let's kill the byte case first.
I notice that many entropy coders, which are also the last stage in many popular file compressors, have difficulty meshing their endings with the byte structure of data files.
Most people seem to think you have to output extra stuff, such as a symbol count of some type, or reserve space in the Huffman table (if using Huffman) for some sort of EOF symbol. The EOF symbol is not only wasteful in itself but will likely not fill the last byte, so the last byte wastes space too. It also wastes space throughout the rest of the file, since it robs code space from the other symbols in the output.
Simple example: say you're compressing files of 4 symbols A B C D.
Suppose A occurs half the time, B occurs 1/4 of the time, and C and D each occur 1/8 of the time in the files of interest.
It would be nice to use a fixed table for these files so the table doesn't have to be stored with the file.
A = 1
B = 01
C = 001
D = 000
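To make the table concrete, here is a minimal sketch of encoding with it. The function name `encode` and the dictionary layout are my own choices for illustration, not anything from the post:

```python
# Hypothetical sketch: encode a string of A/B/C/D symbols with the fixed table.
TABLE = {"A": "1", "B": "01", "C": "001", "D": "000"}

def encode(symbols):
    """Concatenate the code for each symbol into one bit string."""
    return "".join(TABLE[s] for s in symbols)

print(encode("ABAD"))  # -> 1011000
```

Note the result is 7 bits, which already shows the problem: it doesn't fill a whole number of bytes.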
However, how would one end the file? Or better yet, how does one map this file set bijectively to the set of byte files?
Most users would say: let's create a number field in front that tells how many Huffman
tokens are used. Well, that wastes space for the number. Some smarter people may notice that since bytes are 8 bits and the longest token is only 3 bits, one could write a small number field in just part of the last byte, telling how many tokens that byte contains. Others might say: let's create a Huffman table for 5 symbols. Let's look at that:
A = 1
B = 01
C = 001
D = 0001
E = 0000
This is the optimal Huffman table for the probabilities mentioned, if you
assume that E, the EOF, is less likely than the other symbols.
Well, you run the Huffman coder using the above table, and at the last word you could cleverly just output enough zeros to fill the last byte. If the last symbol ends on the lucky last bit of a byte, you don't even need to write the EOF symbol, since you could pretend the file EOF is actually the Huffman EOF in this special lucky-alignment case.
Well, you still lose a lot of space. Since a D appears 1/8 of the time, we should optimally write 3 bits for it, but now we are writing 4, so we lose big time again on a long file.
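The loss is easy to quantify from the stated probabilities. A quick expected-code-length calculation (my own arithmetic, using the two tables above) shows the EOF table costs an extra 1/8 bit per symbol:

```python
# Expected code length per symbol under both tables, using the
# probabilities from the example (A: 1/2, B: 1/4, C: 1/8, D: 1/8).
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
optimal  = {"A": 1, "B": 2, "C": 3, "D": 3}  # the 4-symbol table
with_eof = {"A": 1, "B": 2, "C": 3, "D": 4}  # D stretched to make room for EOF

def cost(lengths):
    """Average bits per symbol = sum of probability * code length."""
    return sum(probs[s] * lengths[s] for s in probs)

print(cost(optimal))   # -> 1.75
print(cost(with_eof))  # -> 1.875, i.e. 1/8 bit wasted per symbol
```

On a long file that 1/8 bit per symbol adds up, which is the whole complaint.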
Here is a bijective solution for this case. Part one:
use the correct table, but give D two codes depending on context:
A = 1
B = 01
C = 001
D = 000 /* if at least one B or C token follows somewhere in the rest of the string */
D = 0001 /* if only zero or more A tokens follow till the end of the Huffman string */
Write this to a byte file; at the end of the file, keep writing zeros until the last byte is filled.
Notice this last byte will never be all zeros. (Note: if the Huffman tokens at the end are not all zeros, truncate any trailing zero bytes at this point, so that there is no final byte of all zeros.)
This can't be fully bijective yet, since we don't allow for a zero byte at the end. But the above is close to bijective; in fact it's at most one byte off.
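Here is how I read the part-one encoder, sketched in code. The function name `encode_part_one` and the look-ahead test are my interpretation of the two-code rule for D, not the author's actual implementation, and this shows only the encoding side (the decoder would undo the padding by the mirror-image rule):

```python
# A sketch (my interpretation) of the "part one" encoder:
# D gets two codes depending on what follows it in the symbol stream.
CODES = {"A": "1", "B": "01", "C": "001"}

def encode_part_one(symbols):
    bits = []
    for i, s in enumerate(symbols):
        if s != "D":
            bits.append(CODES[s])
        elif any(t in ("B", "C") for t in symbols[i + 1:]):
            bits.append("000")   # a B or C follows later, so 000 is unambiguous
        else:
            bits.append("0001")  # only A tokens (or nothing) follow to the end
    out = "".join(bits)
    out += "0" * (-len(out) % 8)  # pad the last byte with zeros
    return out
```

The padding zeros are unambiguous to a decoder precisely because a bare 000 at the end, with no B or C after it, can never be a real D under this table.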
Part two of the process:
Look at the byte just written. If the last byte is not 10000000, you are done.
If the next-to-last byte is neither 00000000 nor 10000000, you're done.
If it is 10000000, loop back and examine the next byte toward the front.
Otherwise you have stopped on a byte of 00000000, all zeros;
so drop the last byte written. You're done, and it's bijective.
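The backward scan above can be sketched like this. This is my assumed reading of the fix-up step (the name `fix_up` is hypothetical), matching the rules as stated: act only when the file ends in 10000000, skip back over a run of 10000000 bytes, and drop the last byte only when that run is preceded by an all-zero byte:

```python
# A sketch (assumed interpretation) of the "part two" fix-up pass.
def fix_up(data: bytes) -> bytes:
    if not data or data[-1] != 0b10000000:
        return data                      # last byte is not 10000000: done
    i = len(data) - 1
    while i >= 0 and data[i] == 0b10000000:
        i -= 1                           # skip the run of 10000000 bytes
    if i >= 0 and data[i] == 0b00000000:
        return data[:-1]                 # run preceded by all zeros: drop last byte
    return data                          # anything else: leave the file alone
```

This reclaims the one byte by which part one falls short of a true bijection.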
It's that simple. (Look, I proofread this a dozen times, but I am sure my English still sucks.)