Results 1 to 14 of 14

Thread: Question on Run Length Encoding

  1. #1
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts

    Question on Run Length Encoding

    As it customary for newbies , we start with simplest algorithms and below is my implementation of RLE

    Code:
    #include <iostream>
    #include <stack>
    #include <unordered_map>
    
    
    
    
    std::string encode(const std::string inputstr) {
        std::string encoded_str;
        std::unordered_map<char, int> str_umap;
    
    
        for (int i = 0; i < inputstr.size() ; ++i) {
             char key = inputstr.at(i);
             auto it = str_umap.find(key);
             if(it == str_umap.end()) { // key not in msp
                 str_umap[key] = 1;
             } else {
                 str_umap[key] += 1;
             }
            //Not at end of string
             if (i != inputstr.size() - 1) {
                 //Detect if run is over
                 if(inputstr.at(i) != inputstr.at(i + 1)) {
                     if(str_umap[key] == 1) {
                         encoded_str +=  key;
                     } else {
                         encoded_str += std::to_string(str_umap[key]) + key;
                     }
                     str_umap.clear();
                 }
             } else {
                 if(str_umap[key] == 1) {
                     encoded_str +=  key;
                 } else {
                     encoded_str += std::to_string(str_umap[key]) + key;
                 }
                 str_umap.clear();
             }
        }
    
    
        return encoded_str;
    
    
    }
    However this implementation is buggy as there is no single interpretation of the encoded string.

    This seems to work but it will break pretty soon

    Input string : AAAABBBAC
    Encoded string : 4A3BAC


    Input string : AAAABBBAC111c
    Encoded string : 4A3BAC31c

    The problem is that the decoder won't know where we want 3 `1` and 1 `c` or 31 `c`.

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    You need to come up with scheme that is reversible, then. For example:
    - encode run length after the symbol, not before,
    - have a minimum run length of two symbols after which you place run length - this is to conserve space,
    - instead of printing full number encode it in following way:
    -- divide the run length by 9 then place that many nines
    -- after that put symbol for (run_length % 9)

    Examples:
    ABBCD => ABB2CD
    AAAB => AA3B
    A => A
    AA => AA2
    AAA => AA3
    AAAAAAAAA => AA90
    AAAAAAAAAA => AA91
    AAAAAAAAAAA => AA92

    You also see that you can subtract minimum run length from actual run length before encoding. This way the above example will look like that:
    ABBCD => ABB0CD
    AAAB => AA1B
    A => A
    AA => AA0
    AAA => AA1
    AAAAAAAAA => AA7
    AAAAAAAAAA => AA8
    AAAAAAAAAAA => AA90
    Last edited by Piotr Tarsa; 28th April 2018 at 20:53.

  3. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    khavish (28th April 2018)

  4. #3
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    Thanks Piotr for your valuable insights.I added the run length after the symbol and there is still more to do in terms of optimizations for space as you describe.For now i am trying to get a correct encode/decode scheme that works for all strings.

    Code:
    std::string encode(const std::string inputstr) {
    
    
        //handle single character corner case
        if(inputstr.size() == 1) {
            return inputstr;
        }
    
    
        std::string encoded_str;
        std::unordered_map<char, int> str_umap;
    
    
        for (int i = 0; i < inputstr.size() ; i++) {
             char symbol = inputstr.at(i);
             auto it = str_umap.find(symbol);
             if(it == str_umap.end()) { // symbol not in msp
                 str_umap[symbol] = 1;
             } else {
                 str_umap[symbol] += 1;
             }
            //Not at end of string
             if (i != inputstr.size() - 1) {
                 //Detect if run is over
                 if(inputstr.at(i) != inputstr.at(i + 1)) {
                     encoded_str += symbol + std::to_string(str_umap[symbol]);
                     str_umap.clear();
                 }
             } else {
                 encoded_str += symbol + std::to_string(str_umap[symbol]);
                 str_umap.clear();
             }
        }
    
    
        return encoded_str;
    
    
    }
    
    
    std::string decode(const std::string encodedstr) {
        if(encodedstr.size() == 1) {
            return encodedstr;
        }
        std::string decoded_str;
        int next_expected = 0; // 0 - alpha , 1 - run_length
        char symbol = '\0';
        int num_of_times = 1;
        for (int i = 0; i < encodedstr.size() ; i++) {
            if(next_expected == 0) {
                symbol = encodedstr.at(i);
                next_expected = 1;
            } else {
                num_of_times = static_cast<char>(encodedstr.at(i)) - '0';
                for (int j = 0; j < num_of_times; j++) {
                    decoded_str += symbol;
                }
                next_expected = 0;
            }
        }
        return decoded_str;
    }
    It will work for some string but not for others...For e.g

    ABBCD => ABB2CD works

    AAAAAAAAAA => A10 decodes fails as expected

    Check this line here

    Code:
     num_of_times = static_cast<char>(encodedstr.at(i)) - '0';
    Initially i thought i could just put it in a loop to get correct run length(10) but while it will work for this string it is still buggy as an RLE codec

    Encoding currently is as follows

    <symbol><run_length><symbol><run_length><symbol><r un_length> ....

    The problem here is that i can't guarantee that symbol won't start with a number so if i loop for all digits between letters , it will still break.

    Thanks a lot for your help once again

  5. #4
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    My Internet connection is breaking so I'll write a short post.

    1. You don't really need a map:
    Code:
    std::unordered_map<char, int> str_umap;
    a last seen symbol and it's count will be enough.
    2. When encoding length you need to encode it like I described i.e. decompose run length to a * 9 + b (where b < 9), then write 'a' nines and 'b' as char. You're code just writes normal decimal representation which won't work:
    Code:
    encoded_str += symbol + std::to_string(str_umap[symbol]);
    3. After you change the way you encode run lengths you need to fix the decompressor, i.e. change this line:
    Code:
    next_expected = 0;
    to this:
    Code:
    next_expected = num_of_times == 9;
    4. There are many more bugs

  6. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    khavish (1st May 2018)

  7. #5
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    Here is the updated encode function that implements run length encode.Let me know what ypu think and thanks again for your time

    Code:
    std::string encodeRunLength(int rlen) {
        std::string result;
       int no_of_nine =  (rlen / 9);
       if (no_of_nine > 0) {
           for (int i = 0; i < no_of_nine ; ++i) {
               result += "9";
           }
       }
       result += std::to_string(rlen % 9);
       return result;
    }
    
    std::string encode(const std::string inputstr) {
    
    
        char last_seen_symbol = inputstr.at(0);
        int last_seen_symbol_count = 0;
        char current_symbol;
        std::string encoded_str;
        for (int i = 0; i < inputstr.size(); i++) {
             current_symbol = inputstr.at(i);
             if(current_symbol == last_seen_symbol) {
                 last_seen_symbol_count += 1;
             } else {
                 //encoded_str += last_seen_symbol + std::to_string(last_seen_symbol_count);
                 if(last_seen_symbol_count > 1) {
                     encoded_str += last_seen_symbol;
                     last_seen_symbol_count -= 2;
                     encoded_str += last_seen_symbol + encodeRunLength(last_seen_symbol_count);
                 } else {
                     encoded_str += last_seen_symbol;
                 }
                 last_seen_symbol = current_symbol;
                 last_seen_symbol_count = 1;
             }
        }
        if(last_seen_symbol_count > 1) {
            encoded_str += last_seen_symbol;
            last_seen_symbol_count -= 2;
            encoded_str += last_seen_symbol + encodeRunLength(last_seen_symbol_count);
        } else {
            encoded_str += last_seen_symbol;
        }
        return  encoded_str;
    
    
    }

  8. #6
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Looks okay(ish). What output it gives on my examples?

    Couple of nitpicks:
    1. "if (no_of_nine > 0) {" isn't needed as the condition is checked anyway in for loop inside that if.
    2. "encode" function should work for empty strings so you should add an if at the beginning: "if (inputstr.empty()) return std::string();"
    3. If you initialize last_seen_symbol_count with 1 then main loop can start from 1, i.e. "for (int i = 1; i < inputstr.size(); i++) {"

  9. #7
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    Looks okay(ish). What output it gives on my examples?

    Couple of nitpicks:
    1. "if (no_of_nine > 0) {" isn't needed as the condition is checked anyway in for loop inside that if.
    2. "encode" function should work for empty strings so you should add an if at the beginning: "if (inputstr.empty()) return std::string();"
    3. If you initialize last_seen_symbol_count with 1 then main loop can start from 1, i.e. "for (int i = 1; i < inputstr.size(); i++) {"
    Great observations.It seems that the encode works properly now for all your examples as well as my own.I fixed some warnings and here is the updated code :
    Code:
    std::string encodeRunLength(int rlen) {
        std::string result;
        int no_of_nine = (rlen / 9);
        for (size_t  i = 0; i < no_of_nine; ++i) {
            result += "9";
        }
        result += std::to_string(rlen % 9);
        return result;
    }
    
    
    std::string encode(const std::string inputstr) {
    
    
        if (inputstr.empty()) return std::string();
    
    
        char last_seen_symbol = inputstr.at(0);
        int last_seen_symbol_count = 1;
        char current_symbol;
        std::string encoded_str;
        for (size_t i = 1; i < inputstr.size(); i++) {
            current_symbol = inputstr.at(i);
            if (current_symbol == last_seen_symbol) {
                last_seen_symbol_count += 1;
            } else {
                //encoded_str += last_seen_symbol + std::to_string(last_seen_symbol_count);
                if (last_seen_symbol_count > 1) {
                    encoded_str += last_seen_symbol;
                    last_seen_symbol_count -= 2;
                    encoded_str += last_seen_symbol + encodeRunLength(last_seen_symbol_count);
                } else {
                    encoded_str += last_seen_symbol;
                }
                last_seen_symbol = current_symbol;
                last_seen_symbol_count = 1;
            }
        }
    
    
        if (last_seen_symbol_count > 1) {
            encoded_str += last_seen_symbol;
            last_seen_symbol_count -= 2;
            encoded_str += last_seen_symbol + encodeRunLength(last_seen_symbol_count);
        } else {
            encoded_str += last_seen_symbol;
        }
        return encoded_str;
    
    
    }
    ABBCD => ABB0CD
    AAAB => AA1B
    A => A
    AA => AA0
    AAA => AA1
    AAAAAAAAA => AA7
    AAAAAAAAAA => AA8
    AAAAAAAAAAA => AA90
    Last edited by khavish; 12th May 2018 at 21:00.

  10. #8
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    For the decoding part , I am wondering how to fetch the run length.

    For e.g AAAAAAAAAAA1 => AA901

    After a pair of similar symbol(AA) we are sure that the run length is next.How do we determine whether the encoded run length is 90 or 901 ?

  11. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    How do we determine whether the encoded run length is 90 or 901 ?
    It's simple. Stop after first digit that is not a '9'. So in this case you read '9', notice it's '9' so you read further, then you get '0', notice it's not '9' so you stop there. After that you add 9 + 0 = 9 and that's the shrinked run length. Now you add 2 that was subtracted before encoding run length and you get 9 + 2 = 11 A's.

  12. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    khavish (8th May 2018)

  13. #10
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    There are at various ways to do RLE. Off the top of my head:

    A) With guard byte - the simplest IMO. Here a special "literal" symbol is used to mean "run". If we're encoding plain text, then byte 0 (not "0", but value 0) could be used. Then when we see ABBACCCCD we can replace it with ABBA<0><4>CD, or ABBAC<0><4>D, etc. Order doesn't really matter. It's not worth replacing the Bs here as we don't save anything. If we want to do this on binary data that happens to contain the guard byte, then we have to encode it using the guard itself. Eg with the first ordering, AB<0>CD would be AB<0><1><0>CD.

    B) With automatic length suffixes. Here we encode each literal in turn, but if we go beyond a certain number of repetitions we say "and X more". Thus ABBCCCDDDDE may be ABBCCC<0>DDD<1> if minimum run length is 3, or ABB<0>CC<1>DD<2>E if the minimum length is 2. This has no guard at all and is trivial to decode, however it may be poorer performing as we could get a lot of <0> suffixes added.

    C) The RLE-Meso format (posted elsewhere here) started with a list of symbols that benefit from RLE. All of those symbols would be replaced by symbol+length and RLE wouldn't be used anywhere else. So ABBACCCCABAAAAAD with B and C labelled as worthy of RLEing would be (preamble) <2>BC plus data AB<2>AC<4>AB<1>AAAAAD. The advantage of this is you don't have the lead-in (eg DDD<1> in B above) and can save more, but only if the repeats are predictable.

    D) Multi-byte RLE. DUCKducklingducklingducklingducklingducklingDUCK could be described in English as DUCK, 5 ducklings and another DUCK - that's RLE in language. . In binary you're now looking more like Z_RLE in zlib (limited LZ basically), so maybe DUCKduckling<0><8><4>DUCK (ie 8 back, copy string 4 times, with a special guard byte).

    There are probably more schemes out there too, or a mixture of the two - such as joint B/C where some symbols always get lengths while others only get additional lengths after so many repetitions. Runs longer than 255 can either be implemented using variable length integer encoding (eg write 7 bits with top being meaning 7 more to come) or just limiting the length and having multiple literal+run combinations.

  14. The Following 2 Users Say Thank You to JamesB For This Useful Post:

    Gotty (9th May 2018),khavish (9th May 2018)

  15. #11
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    Quote Originally Posted by JamesB View Post
    There are at various ways to do RLE. Off the top of my head:

    A) With guard byte - the simplest IMO. Here a special "literal" symbol is used to mean "run". If we're encoding plain text, then byte 0 (not "0", but value 0) could be used. Then when we see ABBACCCCD we can replace it with ABBA<0><4>CD, or ABBAC<0><4>D, etc. Order doesn't really matter. It's not worth replacing the Bs here as we don't save anything. If we want to do this on binary data that happens to contain the guard byte, then we have to encode it using the guard itself. Eg with the first ordering, AB<0>CD would be AB<0><1><0>CD.

    B) With automatic length suffixes. Here we encode each literal in turn, but if we go beyond a certain number of repetitions we say "and X more". Thus ABBCCCDDDDE may be ABBCCC<0>DDD<1> if minimum run length is 3, or ABB<0>CC<1>DD<2>E if the minimum length is 2. This has no guard at all and is trivial to decode, however it may be poorer performing as we could get a lot of <0> suffixes added.

    C) The RLE-Meso format (posted elsewhere here) started with a list of symbols that benefit from RLE. All of those symbols would be replaced by symbol+length and RLE wouldn't be used anywhere else. So ABBACCCCABAAAAAD with B and C labelled as worthy of RLEing would be (preamble) <2>BC plus data AB<2>AC<4>AB<1>AAAAAD. The advantage of this is you don't have the lead-in (eg DDD<1> in B above) and can save more, but only if the repeats are predictable.

    D) Multi-byte RLE. DUCKducklingducklingducklingducklingducklingDUCK could be described in English as DUCK, 5 ducklings and another DUCK - that's RLE in language. . In binary you're now looking more like Z_RLE in zlib (limited LZ basically), so maybe DUCKduckling<0><8><4>DUCK (ie 8 back, copy string 4 times, with a special guard byte).

    There are probably more schemes out there too, or a mixture of the two - such as joint B/C where some symbols always get lengths while others only get additional lengths after so many repetitions. Runs longer than 255 can either be implemented using variable length integer encoding (eg write 7 bits with top being meaning 7 more to come) or just limiting the length and having multiple literal+run combinations.
    Thanks for such as amazing overview of the different schemes for RLE.

    Based on what i have observed thus far RLE has two main problems.

    1.Symbols with too few repetitions very often don't lead to compression.
    2.In situation where we have a very long repetition of a symbol , the run length can become so large that it doesn't lead to compression(i.e in images)

    There are different schemes designed to fix those problems as you outlined in the post above.

    @Tarsa
    I have implemented both the encode and decode successfully
    Code:
    #include <iostream>
    #include <cstdlib>
    
    
    std::string encodeRunLength(int rlen) {
        std::string result;
        int no_of_nine = (rlen / 9);
        for (size_t i = 0; i < no_of_nine; ++i) {
            result += "9";
        }
        result += std::to_string(rlen % 9);
        return result;
    }
    
    
    std::string encode(const std::string inputstr) {
    
    
        if (inputstr.empty()) return std::string();
    
    
        char last_seen_symbol = inputstr.at(0);
        int last_seen_symbol_count = 1;
        char current_symbol;
        std::string encoded_str;
        for (size_t i = 1; i < inputstr.size(); i++) {
            current_symbol = inputstr.at(i);
            if (current_symbol == last_seen_symbol) {
                last_seen_symbol_count += 1;
            } else {
                if (last_seen_symbol_count > 1) {
                    encoded_str += last_seen_symbol;
                    last_seen_symbol_count -= 2;
                    encoded_str += last_seen_symbol + encodeRunLength(last_seen_symbol_count);
                } else {
                    encoded_str += last_seen_symbol;
                }
                last_seen_symbol = current_symbol;
                last_seen_symbol_count = 1;
            }
        }
    
    
        if (last_seen_symbol_count > 1) {
            encoded_str += last_seen_symbol;
            last_seen_symbol_count -= 2;
            encoded_str += last_seen_symbol + encodeRunLength(last_seen_symbol_count);
        } else {
            encoded_str += last_seen_symbol;
        }
        return encoded_str;
    
    
    }
    
    
    
    
    std::string decode(const std::string encodedstr) {
    
    
        if (encodedstr.empty()) return std::string();
    
    
        if (encodedstr.length() == 1) return encodedstr;
    
    
        char last_seen_symbol = encodedstr.at(0);
        bool run_length_expected = false;
        std::string decoded_str;
        unsigned int rlen = 0;
        decoded_str += last_seen_symbol;
        char current_symbol;
        for (size_t i = 1; i < encodedstr.size(); i++) {
            current_symbol = encodedstr.at(i);
            if (current_symbol == last_seen_symbol) {
                decoded_str += current_symbol;
                run_length_expected = true;
                last_seen_symbol = current_symbol;
                continue;
            }
    
    
            if (run_length_expected) {
                //time to build run length
                //current_symbol is a number here
                int current_digit = static_cast<char>(current_symbol) - '0';
                rlen += current_digit;
                if(current_digit != 9) {
                    for (int j = 0; j < rlen; ++j) {
                        decoded_str += last_seen_symbol;
                    }
                    run_length_expected = false;
                    rlen = 0;
                } else {
                    continue;
                }
            } else {
                decoded_str += current_symbol;
            }
            last_seen_symbol = current_symbol;
        }
        return decoded_str;
    }
    I will be impementing other RLE schemes soon to cement my understanding.

  16. #12
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    RLE-Meso format could be extended to provide separate minimum run length for each possible symbol, eg:
    - 'A' - minimum run = 2 symbols
    - 'B' - minimum run = 3 symbols
    - 'C' - minimum run = 5 symbols
    - 'D' - RLE disabled
    With above scheme (4 possibilities per symbol == 2 bits to encode choice) we have 256 possible symbols * 2 bit/ symbol = 512 bits = 64 bytes to encode choices for full set of possible bytes.

    Length of RLE'd sequence can also be encoded using variable length quantity scheme: https://en.wikipedia.org/wiki/Variable-length_quantity
    I have implemented VLQ encoding that support signed numbers: https://github.com/tarsa/SortAlgoBox...mber_codec.hpp

    2.In situation where we have a very long repetition of a symbol , the run length can become so large that it doesn't lead to compression(i.e in images)
    What do you mean? Show an example of that.

  17. #13
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    Quote Originally Posted by Piotr Tarsa View Post
    What do you mean? Show an example of that.
    KIndly ignore that , it was a misconception on my part.

    I have a question regarding use of RLE for images.For a greyscale image , we have a 2d array of number where each number is a pixel that holds a meaning(0-255).The rows and colums represent the width and the height of the image.If we collapse the image into 1d and apply RLE on it , what do we put in the empty "entries" before concert the 1d back to 2d.

  18. #14
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    First you need to have methods to losslessly (bijectively) convert between 2D and 1D representations. Then you do RLE and unRLE on 1D representation and also care about bijectivity - ie unRLE(RLE(array1D)) == array1D. If you following steps then you don't have to worry about 2D RLE:
    Code:
    // compression from array2D to rleArray1D
    var array2D = // input array
    var array1D = convert2Dto1D(array2D, columns_number, rows_number)
    val rleArray1D = rle(array1D)
    Code:
    // decompression from rleArray1D to array2D
    var rleArray1D = // input array
    val array1D = unRle(rleArray1D)
    var array2D = convert1Dto2D(array1D, columns_number, rows_number)

Similar Threads

  1. TurboRLE: Turbo Run Length Encoding
    By dnd in forum Data Compression
    Replies: 23
    Last Post: 16th June 2019, 17:39
  2. Run-Length Decoding of almost-random data
    By Alexander Rhatushnyak in forum The Off-Topic Lounge
    Replies: 2
    Last Post: 8th February 2018, 05:03
  3. Implementing Run-length encoding in CUDA
    By nemequ in forum Data Compression
    Replies: 1
    Last Post: 28th June 2016, 13:08
  4. Replies: 38
    Last Post: 27th April 2016, 18:01
  5. xrle-eXtreme Run Length Encoding
    By algorithm in forum Data Compression
    Replies: 5
    Last Post: 20th April 2015, 23:02

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •