
Thread: Is there a solution to change the frequency of the letters within the text in a file?

  1. #1
    Member
    Join Date
    Aug 2016
    Location
    libya
    Posts
    4
    Thanks
    9
    Thanked 0 Times in 0 Posts

    Is there a solution to change the frequency of the letters within the text in a file?

    TEXT FILE = ((HDADGABHAEABABEAEGEDEBHEGEAEDEAEABBBEBEHHEGFEHEBEBEDEAEEAEHABDBGEBGADAEDAHFEFAHFEHACEHBDGDABAAHDHEEDDBBABBBGGEHAHCAFAEAGABHADABABAAAFHAGABACAGBBBAFABBAGECFEBEFAEBEBEHEBEGEHEDEBBHGABCCAGAFAECEADGDABGEABAHDGBDBEGEABDAHGGBAEEEAABADEEBAHCFFEEFCABBHHACBHDFBDBDBEEGHCFBACBCEAGAEBBEGEBEFEBEEEBEBHBDAHEBAEBFCGEABBHABHBGBABAABAHBAEGHHAFFCAFCCAHHDAHHBCFBDADABADABACAGAFADAEAEAEAABABDBABBHABFBABFHHHFBEBA ))
    I have a text file that contains letters; each character has a frequency. Example:

    Order  Symbol  Frequency
    ------------------------
      1      A        70
      2      B        62
      3      E        35
      4      H        32
      5      D        29
      6      G        28
      7      F        25
      8      C        19

    TOTAL = 300 chars

    How can I change the letter frequencies? For example:

    Order  Symbol  Frequency
    ------------------------
      1      A        85
      2      B        52
      3      E        30
      4      H        32
      5      D        29
      6      G        28
      7      F        25
      8      C        19

    TOTAL = 300 chars

    thanks
    Omran Frhat
    Last edited by omran farhat; 29th August 2016 at 01:15.

  2. #2
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    EDIT: I had misunderstood the question, the following examples are wrong.

    Here are three example C functions (not tested) to change an upper case letter into another upper case letter.
    c0 is the current character, c1 is the previous character.

    Code:
    int change(int c0) {
      static int counter = 0;
      return (c0 >= 'A' && c0 <= 'Z') ? ((c0 + counter++) % 26 + 'A') : c0;
    }
    
    int change(int c0, int c1) {
      return (c0 >= 'A' && c0 <= 'Z') ? ((c0 + c1) % 26 + 'A') : c0;
    }
    
    // Call change(-1) to initialize the MTF table.
    int change(int c0) {
      static int mtf[26];
      int i, c0b = c0;
      if (c0 == -1)
        for (i = 0; i < 26; i++) mtf[i] = 'A' + i;
      else if (c0 >= 'A' && c0 <= 'Z') {
        // Maybe it's not the best MTF implementation.
        for (c0b = 0; c0b < 25 && c0 != mtf[c0b]; c0b++) ;  // find the rank of c0
        for (i = c0b; i > 0; i--) mtf[i] = mtf[i - 1];      // shift the prefix down
        mtf[0] = c0;                                        // move the symbol to front
        c0b += 'A';                                         // emit the rank as a letter
      }
      return c0b;
    }
    Last edited by Mauro Vezzosi; 29th August 2016 at 12:02. Reason: See EDIT

  3. The Following User Says Thank You to Mauro Vezzosi For This Useful Post:

    omran farhat (29th August 2016)

  4. #3
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Here's how you do it with a range coder - http://ideone.com/HZ9q6R (the method, not an actual implementation).
    This method actually makes it possible to produce the specific requested frequency distribution, because the 2nd distribution is "smaller" than the 1st one:

    c[n_,k_]:=N[Binomial[n,k]]
    // 1st distribution
    Log[2.,c[300,70]*c[300-70,62]*c[300-70-62,35]*c[300-70-62-35,32]*c[300-70-62-35-32,29]*c[300-70-62-35-32-29,28]*c[300-70-62-35-32-29-28,25]*c[300-70-62-35-32-29-28-25,19]]
    Log[2.,Binomial[19,19]*Binomial[44,25]*Binomial[72,28]*Binomial[101,29]*Binomial[133,32]*Binomial[168,35]*Binomial[230,62]*Binomial[300,70]]
    http://www.wolframalpha.com/input/?i...5B300,70%5D%5D
    =833.036 bits of data can be encoded via the choice of a specific permutation with the given freqs

    // 2nd distribution
    Log[2.,c[300,85]*c[300-85,52]*c[300-85-52,30]*c[300-85-52-30,32]*c[300-85-52-30-32,29]*c[300-85-52-30-32-29,28]*c[300-85-52-30-32-29-28,25]*c[300-85-52-30-32-29-28-25,19]]
    Log[2.,Binomial[19,19]*Binomial[44,25]*Binomial[72,28]*Binomial[101,29]*Binomial[133,32]*Binomial[163,30]*Binomial[215,52]*Binomial[300,85]]
    http://www.wolframalpha.com/input/?i...5B300,85%5D%5D
    =822.44 bits

    There are fewer permutations of the 2nd distribution, so it's possible to transform the 1st into the 2nd perfectly with a range coder.
    But you'd lose 10.596 bits of data in that transformation, so a perfect inverse transform would be impossible -
    you'd lose (833.036 - 822.44)/Log[2, StringLength["ABEHDGFC"]] = 3.532, i.e. basically 4 symbols from the first string.
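
    For reference, here's a small C sketch of that computation (an illustration of the formula, not code from the thread); it sums log2 of the telescoping binomials using lgamma(), since 300! overflows ordinary integer and double arithmetic:

    Code:
    #include <math.h>
    #include <stdio.h>
    
    /* log2(Binomial[n, k]) computed via lgamma() to avoid overflow */
    static double log2_binomial(int n, int k) {
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2.0);
    }
    
    int main(void) {
        int f1[] = {70, 62, 35, 32, 29, 28, 25, 19};  /* 1st distribution */
        int f2[] = {85, 52, 30, 32, 29, 28, 25, 19};  /* 2nd distribution */
        double bits1 = 0, bits2 = 0;
        int left1 = 300, left2 = 300, i;
        for (i = 0; i < 8; i++) {
            bits1 += log2_binomial(left1, f1[i]); left1 -= f1[i];
            bits2 += log2_binomial(left2, f2[i]); left2 -= f2[i];
        }
        printf("1st: %.3f bits\n", bits1);  /* ~833.036 */
        printf("2nd: %.3f bits\n", bits2);  /* ~822.44  */
        return 0;
    }
    Compile with -lm; the printed values should match the WolframAlpha results above.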

  5. The Following User Says Thank You to Shelwien For This Useful Post:

    omran farhat (29th August 2016)

  6. #4
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    To use a simple example, he is asking how to do a transform that would change the distribution (I guess) to something less uniform to make the file more compressible, like "ABAB" (A=2, B=2) to "AAAB" (A=3, B=1). The answer is you can't. The second string has lower entropy (in an order-0 model), so it would compress smaller, but the transform is lossy. You lose information. There are 6 possible strings with A=2, B=2 but only 4 with A=3, B=1.
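
    That count is small enough to verify by brute force; a minimal C sketch (illustrative only) that enumerates all 16 length-4 strings over {A, B} and buckets them by the number of A's:

    Code:
    #include <stdio.h>
    
    int main(void) {
        int mask, counts[5] = {0};
        /* Each bit of mask picks A (1) or B (0) for one of the 4 positions. */
        for (mask = 0; mask < 16; mask++) {
            int i, a = 0;
            for (i = 0; i < 4; i++)
                if (mask & (1 << i)) a++;
            counts[a]++;
        }
        printf("A=2,B=2: %d strings\n", counts[2]);  /* 6 */
        printf("A=3,B=1: %d strings\n", counts[3]);  /* 4 */
        return 0;
    }
    Since 6 > 4, any map from the first set into the second must collide somewhere, so the transform cannot be inverted.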

  7. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    omran farhat (30th August 2016)

  8. #5
    Member
    Join Date
    Aug 2016
    Location
    libya
    Posts
    4
    Thanks
    9
    Thanked 0 Times in 0 Posts
    Thanks Matt
    That's what I'm looking for.
    If you have a file with
    a = 1000 * (1) = 1000 bits
    b = 800 * (2) = 1600 bits
    c = 300 * (2) = 600 bits
    ----------------------------
    total = 3200 bits = 400 bytes
    If we make the frequency change we will get a better compression ratio:
    a = 1300 * (1) = 1300 bits ******** a plus 300 chars
    b = 600 * (2) = 1200 bits ******* lower by 200 chars
    c = 200 * (2) = 400 bits ****** lower by 100 chars
    ------------------------------
    total = 2900 bits ~ 362.5 bytes
    Is there a compression algorithm that works this way?

  9. #6
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    That's not how you count information.
    There are a_freq occurrences of "a" in a string of length total.
    So from combinatorics you'd know that there are Binomial[total, a_freq] = total!/a_freq!/(total-a_freq)! unique placements of a_freq x "a" in the string.
    1) Log[2.,Binomial[2100,1000]*Binomial[1100,800]] = 3015.40 http://www.wolframalpha.com/input/?i...1100,800%5D%5D
    2) Log[2.,Binomial[2100,1300]*Binomial[800,600]] = 2651.57 http://www.wolframalpha.com/input/?i...B800,600%5D%5D
    That's how many bits you need to encode each string when you know the freqs.
    Arithmetic coding also makes it possible to reach these numbers almost perfectly.
    But there are more combinations in (1), so you can't transform a string2 back into every possible string1 - that's exactly why it takes more bits to encode string1.

    Although simply using arithmetic coding would be just as good as what you want, though 3016 bits instead of 2900 :)
    But remember that the freqs actually also have to be encoded :)
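
    The same lgamma() trick from the sketch earlier in the thread reproduces these two numbers (again only an illustration of the formula, not code from the post):

    Code:
    #include <math.h>
    #include <stdio.h>
    
    /* log2(Binomial[n, k]) computed via lgamma() to avoid overflow */
    static double log2_binomial(int n, int k) {
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2.0);
    }
    
    int main(void) {
        /* string 1: a=1000, b=800, c=300 in 2100 symbols */
        printf("%.2f bits\n", log2_binomial(2100, 1000) + log2_binomial(1100, 800));  /* ~3015.40 */
        /* string 2: a=1300, b=600, c=200 in 2100 symbols */
        printf("%.2f bits\n", log2_binomial(2100, 1300) + log2_binomial(800, 600));   /* ~2651.57 */
        return 0;
    }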

  10. The Following User Says Thank You to Shelwien For This Useful Post:

    omran farhat (30th August 2016)

  11. #7
    Member
    Join Date
    Aug 2016
    Location
    libya
    Posts
    4
    Thanks
    9
    Thanked 0 Times in 0 Posts
    Thank you Mr. Shelwien and Dr. Matt Mahoney
    Example: we have a text file abcabcdbeaca ............
    step 1: I use Huffman's algorithm with a variable-length code table:
    a (500) 00 2 bits
    b (400) 010 3 bits
    c (300) 0110 4 bits
    d (200) 01110 5 bits
    e (100) 01111 5 bits
    In the normal case, the source coding looks like:
    a b c a b c d b e a c a ...............
    00 010 0110 00 010 0110 01110 010 01111 00 0110 00 ............ Here is the compressed file
    == 4900 bits - 612.5 bytes
    step 2: But before the completion of the encoding process, we want to change the frequency of the source symbols.
    step 3: Change in frequency.
    For example, you get the following frequencies:
    a (650) 00
    b (300) 010
    c (250) 0110
    d (200) 01110
    e (100) 01111
    step 4: Now in this step, encode the source symbols and add to the header the characters that have been changed:
    a b a a b c d a e a a a ........
    00 010 00 00 010 0110 01110 00 01111 00 00 00 ........
    == 4700 bits - 587.5 bytes
    With Huffman's algorithm we got 4900 bits:
    a b (c) a b c d (b) e a (c) a .........
    a b (a) a b c d (a) e a (a) a .......
    00 010 (0110) 00 010 0110 01110 (010) 01111 00 (0110) 00
    00 010 (00) 00 010 0110 01110 (00) 01111 00 (00) 00
    If we make the change in frequency, we can probably get 4700 bits and save 25 bytes.
    That was the idea.
    omran farhat
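
    The totals in this example can be checked mechanically; here's a short C sketch (illustrative only, using the code lengths from the post) that multiplies each symbol's frequency by its code length:

    Code:
    #include <stdio.h>
    
    int main(void) {
        int len[]     = {2, 3, 4, 5, 5};            /* bits for a, b, c, d, e */
        int orig[]    = {500, 400, 300, 200, 100};  /* original frequencies   */
        int changed[] = {650, 300, 250, 200, 100};  /* modified frequencies   */
        long bits_orig = 0, bits_changed = 0;
        int i;
        for (i = 0; i < 5; i++) {
            bits_orig    += (long)orig[i]    * len[i];
            bits_changed += (long)changed[i] * len[i];
        }
        printf("original: %ld bits\n", bits_orig);     /* 4900 = 612.5 bytes */
        printf("changed:  %ld bits\n", bits_changed);  /* 4700 = 587.5 bytes */
        return 0;
    }
    As Matt's reply below notes, though, the transform is lossy: without also recording which symbols were substituted the original cannot be restored, and recording them would typically cost at least as many bits as were saved.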

  12. #8
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by omran farhat View Post
    Is there a compression algorithm works this way?
    No. As I explained, changing character frequencies loses information, so it is not possible to restore the original text.

  13. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    omran farhat (22nd September 2016)

  14. #9
    Member
    Join Date
    Apr 2015
    Location
    Norway
    Posts
    16
    Thanks
    5
    Thanked 2 Times in 2 Posts
    Each character is unique. As Matt Mahoney said, if you change frequencies you mess up the original information.
    Frequencies in themselves store some information about the content. One can think philosophically about what this frequency information actually means for the whole.
    If you want to make one symbol out of other symbols (in your example: TEXT FILE = ((HDADGABHAEABABEAEGEDEBHE..."):
    HD AD GA BH AE AB AB EA EG ED EB HE in pairs instead of single characters, you get something like:
    HD=1, AD=1, GA=1, BH=1, AE=1, AB=2, EA=1, EG=1, ED=1, EB=1, HE=1, ...

    It's better to do this bitwise (instead of bytewise) though, since the frequency information stays closer to the original if you change the bit length.
    You can then make a table (or dictionary of some sort) of all the pair instances, run a sorting algorithm that is reversible, and exclude the pairs that do not exist to save space.
    Depending on the bit length of each symbol pair you are able to reduce the frequency information, but you increase the complexity of the symbol lookups.
    Whether or not one is able to compress with this method I don't know. How to store and decode information about this complexity is not on my list to figure out right now.
    What you are able to achieve with this method is at least some kind of transformation of how to interpret the data. (A small pair-counting sketch follows below.)
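
    Here's a minimal C sketch of that pair counting on a prefix of the posted file (illustrative only; it prints each pair once, with its count, in first-seen order):

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main(void) {
        /* Prefix of the posted TEXT FILE, split into non-overlapping pairs. */
        const char *text = "HDADGABHAEABABEAEGEDEBHE";
        int counts[26][26] = {{0}};
        size_t i, n = strlen(text);
        for (i = 0; i + 1 < n; i += 2)
            counts[text[i] - 'A'][text[i + 1] - 'A']++;
        for (i = 0; i + 1 < n; i += 2) {
            int a = text[i] - 'A', b = text[i + 1] - 'A';
            if (counts[a][b]) {
                printf("%c%c=%d ", 'A' + a, 'A' + b, counts[a][b]);
                counts[a][b] = 0;  /* print each distinct pair only once */
            }
        }
        printf("\n");
        return 0;
    }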

  15. The Following User Says Thank You to Rudi For This Useful Post:

    omran farhat (26th October 2016)

