
Thread: I am looking for the best TEXT compression algorithm available

  1. #1
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts

    I am looking for the best TEXT compression algorithm available

    Hello users!

I am a newbie on this forum, but I am a skilled user when it comes to compression/decompression algorithms.

I am currently looking for the "best" TEXT compression algorithm available so far - I need to compress a text file with many repeated strings (but these strings are not grouped together).

    Explanation:

The input text file consists of these characters:

    162434E125F11718191A262728292A35363738393A43444546 4748494A535455565758595A636465666768696A7374757677 78797A82838485868788898A92939495969798999AA2A3A4A5 A6A7A8A9AAB2B3B4B5B6B7B8B9BAC2C3C4C5C6C7C8C9CAD2D3 D4D5D6D7D8D9DAE2E3E4E5E6E7E8E9EAF2F3F4F5F6F7F8F9FA FFDA000C03010002110311003F0079BE20673D297EDC1875AC 6129EE78A9EDD5A68BCC43B9738047AD711A421CCEC77B339C F95179EE4E0EDE7BD574796E25FBACB1E01DC056ED96951896 40467643827DCD6AD9E991790B1328F9C119F4209C53A74DC9 9E8462A2AC44E6A270CA577731F4FB2D36DAEA0BA40F34D19C A939E0D7536D7524C41718E2B1DAC25B49385DCB9AD2B3663C 15229422A2B42C726DEE49B08C36D71BE37D415AF6DAD03E04 49924762DCFF00203F3AEB11BE5AF36F15DC2B6A8D36371676 CFD38C56757E033AF2B248BA7F11A50576C7E933CF1DC4D1B9 254AEE56FC40ABFAC6A11C7731194B2B88C2851D1F278FEB54 B46B88ED6DDDEE07EECFDC27A568DFC1E6341A94902B2DB025 54F7071CFE1CD71287BF72ECEE74463AF30EDA1911EABAA3A8 486FCC68830A8BD8554D6ADEFF00586592EE7596454091B018 C0049C1FCCD49AEDC470DC2CD0B0114AB91C7435930EAF2467 0CC1C7A1ADF9AEAC444E67135918D240F0CC52552A41C1CD58 8B4CBCBA42D6F6CEE9FDE02B5EE658751B6126C5F333803DFF 00CF1459EB00C4221F23A0C019AB1DAE8E6B1A6CCC69ACE7B6 C8B881D3201524511C992D9E78033F4ADE9EF5EEA231142EAD C15EB8A8344F0DDDEA376CBF208A294A487773C75C0A86064D 1ACA377A0BA368F77A9879C2E21538666E95D4691A77D92E96 78E647118230061B06ADC73893477B78AD4DB794A405030057 3835958F6AB310EA7EF038FC2A1EE2DD91185F53AAD6

    I need something like this:

    1(26)2(15)E(8) etc.

    Explanation:
I need an algorithm that is capable of compressing (or "expressing") repeated strings found in a text document as much smaller strings - e.g. the algorithm starts looking for the first string (in our case "1") and continues until the end of the document is reached. After that, the algorithm writes the repetition count of each searched string (e.g. 1(26), A(40) etc.) instead of the actual characters.

In conclusion: input file 250 KB; output 1 KB.

    Time to compress or decompress is absolutely NOT important.

So, is there any compression algorithm that can meet my specifications? I have found RLE (Run Length Encoding), but that algorithm is "reset" whenever it detects a new string, e.g. "AAAAAAAAAAAABBBBAAAAAAABB" is expressed as "12A4B7A2B" instead of "19A6B".
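
To make that behaviour concrete, here is a minimal run-length encoder sketch (illustrative Python only, not any particular tool mentioned in this thread). It emits a count followed by the symbol and necessarily "resets" at every symbol change:

Code:
# Minimal RLE sketch: emits <count><symbol> pairs and restarts the count
# whenever the symbol changes, so "AAAAAAAAAAAABBBBAAAAAAABB" -> "12A4B7A2B".
def rle_encode(text):
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        out.append(f"{j - i}{text[i]}")
        i = j
    return "".join(out)

def rle_decode(encoded):
    # Inverse: accumulate digits, then expand the symbol they apply to.
    # (Assumes the symbols themselves are not digits.)
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

print(rle_encode("AAAAAAAAAAAABBBBAAAAAAABB"))  # prints 12A4B7A2B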

    Thanks a lot for your help.

    Best regards,
    CompressMaster

  2. #2
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Hello CompressMaster!
    Welcome to the Forum!

    For decompression: "19A6B" must become "AAAAAAAAAAAABBBBAAAAAAABB" again.
    So, how do you decompress "19A6B" back to "AAAAAAAAAAAABBBBAAAAAAABB"?
The compressed expressions 19A and 6B do tell you the individual frequencies, but they tell you nothing about the order in which the characters appear. It seems to be an irreversible process.
    Last edited by Gotty; 28th July 2018 at 13:45.

  3. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    I suggest you take a look at http://mattmahoney.net/dc/text.html for a comparison of various text compression algorithms and http://mattmahoney.net/dc/dce.html if you want to learn more about them.

    The Transforms section in particular describes Run Length Encoding and LZ algorithms, which are important text processing steps.

  4. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Quote Originally Posted by CompressMaster View Post
    Time to compress or decompress is absolutely NOT important.
    We have a new cmix "customer".


  5. #5
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
OK, so what about an RLE Windows GUI implementation? Is there one? I was unable to find any.
Furthermore, can the RLE algorithm take advantage of the same repetitive strings (e.g. EEEE) inside multiple files in a TAR archive? So, inside the TAR archive there will be 10 files consisting of the same characters ("EEEE"). Is the RLE algorithm able to detect the text as "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE" and compress it to 44E? And what about the limitations of the pattern-searching algorithm? Is there any Windows GUI implementation with advanced settings?

    Thanks a lot for your willingness.

  6. #6
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    It is not possible to make a proper suggestion from your small sample.
    Better upload or provide a link to download a larger file (>1MB).

  7. #7
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Hello CompressMaster! Welcome to the forum.

    What you seek does exist, but maybe not in the form you expect it to be. Let me elaborate a little:

    * "TEXT compression": Even when it's possible to produce the above string of characters in a text editor, I wouldn't call that a "text" string. Text in data compression refers to coherent content expressing ideas in a language meant to be read and understood by humans. This is more likely to be pseudo random or encrypted content (which has actual meaning but is designed to appear as nonsense). Maybe some checksums?
    Why state the difference? Because the algorithm of choice will not be the same for the two groups. More on this later.

    *
    I need something like this:

1(26)2(15)E(8) etc.
Congratulations! You managed to come up with a compression algorithm on your own! It's not necessarily new or efficient, but still, it's more than most people ever manage. There are countless implementations of RLE (which is what you described) right here in the forum and out in the wild, and each one has its own way of addressing the issue brought up by Gotty, which is a real one. And yes, some even have a GUI and are meant to be used under Windows. But! Read on...

    * "The "best"" and "Time to compress or decompress is absolutely NOT important.": RLE is arguably the weakest method of compression out of them all. If you reaaally want to achieve the very best result, you'll have to resort to some neural network-powered compressors like cmix or paq8px. Beware, though, they are extremely slow (not "7z ultra" slow, I mean, really slow) and need a lot of RAM and CPU power to work. There is nothing better than those two and maybe a couple more programs.

    * "Is there any WINDOWS GUI implementation with advanced settings?": Short answer: EMMA. Really a piece of work. A bit overkill for your use case, though.

Longer / more useful answer: When you ask for the BEST, you are inevitably asking for experimental software, written yesterday, barely tested, maybe even without a release binary, let alone a Windows one. There is no way a cutting-edge developer also happens to have the time and skill set to write a nice GUI. Well, maybe Márcio with EMMA - that would be the one exception confirming the rule. Usually one prefers to stick to more standard solutions like WinRAR or 7-Zip for day-to-day work, because they are thoroughly tested and backed by an extensive user base, maybe even a company.

    Hope it was useful. Happy packing!
    Last edited by Gonzalo; 14th June 2018 at 23:05. Reason: Advanced editor not working... Messed up the content


  9. #8
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Quote Originally Posted by CompressMaster View Post
there will be 10 files consisting of the same characters ("EEEE"). Is the RLE algorithm able to detect the text as "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE" and compress it to 44E?
    I will try again. Let's see if it becomes more clear. What you have in mind does not work.

A compressor program usually processes a file from the beginning to the end and encodes what it sees - character by character.
When a decompressor decodes the compressed file back to the original, it decodes the characters one by one and reconstructs the original file from the beginning to the end.
With a theoretical RLE compressor, "EEEE" would become "4E", but then something else comes in the source, the compressor will need to process and encode it, and then later there will be "EEEE" again, which would become "4E" again. Do you see that it cannot become 44E? There were characters to process in between.

If you intend to encode "EEEE" from 10 different places as "44E", you will lose the information about the locations of these "EEEE"s.
When your decompressor tries to decode the compressed file back to the original file, it would see the encoded "44E", have no clue that it came from 10 different places or where those places were, and so it would not be able to reconstruct the original file.

    You are right: the algorithm you describe is most similar to RLE. But you cannot really circumvent the "resetting" that you would like to avoid.

    We understood from your first post, that you are looking for an algorithm. Please tell us: do you intend to implement it in your own software? Or do you wish to learn about (text) compression?
    Or... are you looking for just a tool?
    Last edited by Gotty; 28th July 2018 at 13:46.


  11. #9
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
    Quote Originally Posted by dnd View Post
    It is not possible to make a proper suggestion from your small sample.
    Better upload or provide a link to download a larger file (>1MB).
OK, so here (https://send.firefox.com/download/a8...7RwevYs1ay7X0g) is my sample text data that I want to compress losslessly. The archive is an uncompressed RAR (is TAR a better format? I don't think so - I've tried it, but the resulting archive was larger than the source folder with the samples).
The archive contains 47 files. Each file consists of the character presented in its file name - e.g. the file B_sampl27.txt contains only the "B" character, nothing more. Assuming it were a TAR archive, would it be possible for RLE or a similar algorithm to take advantage of the repeating pattern across multiple files? The files are pure raw text data without any headers/footers, so I think it should be possible for RLE or similar compression algorithms to process the files as "AAAAAAAAAAAAAAAAABBBBBBBBB" and compress that as 17A9B regardless of the files' positions (hence the TAR format). I know that the TAR format is uncompressed and is used to bundle folders. So, is it possible to achieve my specifications? Or do you have any other suggestions? Of course I have tried the solutions suggested by other members, but for example the "CMIX" algorithm shrinks the resulting file only a little bit. Thanks a lot for your help.

P.S. I have tried the EMMA software utility, but no luck - the resulting file was slightly larger.

    CompressMaster

  12. #10
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    41
    Thanks
    9
    Thanked 16 Times in 11 Posts
If several algorithms fail to shrink it at all, then a) it's likely (pseudo)random and therefore effectively incompressible, or it is already compressed, and b) RLE will definitely not help. However, the sample you have shown so far is limited to just a few characters, so something is really fishy - any simple entropy coder should cut it down to about 50% of its size, and you claim they don't.
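
A quick way to sanity-check that 50% figure is to measure the order-0 (byte-histogram) entropy of the file yourself. A minimal sketch, assuming plain 8-bit characters and ignoring any higher-order structure (the file name here is hypothetical):

Code:
# Order-0 entropy estimate: only a first sanity check, since it ignores any
# ordering/context. 16 roughly equally frequent hex digits give ~4 bits per
# character, i.e. ~50% of the 8 bits each ASCII character occupies on disk.
import math
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

data = open("sample.txt", "rb").read()   # hypothetical file name
h = order0_entropy_bits_per_byte(data)
print(f"{h:.3f} bits/byte -> best order-0 size is about {h / 8:.1%} of the original")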

  13. #11
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    7
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Stefan Atev View Post
If several algorithms fail to shrink it at all, then a) it's likely (pseudo)random and therefore effectively incompressible, or it is already compressed, and b) RLE will definitely not help. However, the sample you have shown so far is limited to just a few characters, so something is really fishy - any simple entropy coder should cut it down to about 50% of its size, and you claim they don't.
    Perhaps he's mistakenly trying to compress an already compressed version of the data rather than re-compressing the original data?

  14. #12
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Now I see.
    In your first post you mentioned one 250K text file, but what you actually have is a lot of 1-byte files that you tried to put in a container format such as an uncompressed rar file. Is that correct?

    You will need to concatenate the file contents in your own way (i.e. not with rar and not with tar).
    Their content will then become one continuous stream, byte after byte, so you'll be able to compress them tightly.
    The (uncompressed) rar (and the tar format, too) preserves filenames, file lengths, and other metadata interleaved with the contents of the input files. And that's a problem with these container formats - as you experienced it.
    Unfortunately no rar/tar/etc would keep the content as a continuous stream, so you'll need to invent your own container format: all filenames first, concatenated content second. But don't rush to do that, yet.
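
(If you do eventually go down that road, a toy container along those lines could look like the sketch below - a hypothetical format, just to make the idea concrete: a small header listing each filename and length, then all contents concatenated into one continuous stream.)

Code:
# Toy container sketch (hypothetical format): header lines "name<TAB>length",
# a blank line, then the raw file contents back to back with no separators,
# so a later compressor sees the payload as one continuous stream.
import os

def pack(filenames, out_path):
    with open(out_path, "wb") as out:
        header = "".join(f"{name}\t{os.path.getsize(name)}\n" for name in filenames)
        out.write(header.encode() + b"\n")       # blank line terminates the header
        for name in filenames:
            with open(name, "rb") as f:
                out.write(f.read())              # contents, one after another

def unpack(in_path):
    with open(in_path, "rb") as f:
        entries = []
        while (line := f.readline()) not in (b"", b"\n"):
            name, length = line.decode().rstrip("\n").split("\t")
            entries.append((name, int(length)))
        for name, length in entries:
            with open(name, "wb") as out:
                out.write(f.read(length))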

From your description ("Each file consists of the character presented in its file name") and after seeing the sample you provided, it is also clear that you don't need to compress or store the first character of the filename. It is given by the 1-byte file content.

    Please tell us about those filenames. What pattern do the filenames follow? If they have a specific pattern you may generate them with some simple c++ code or a perl script, etc.. So you won't need to compress the filenames at all.
    But then this algorithm would work with these files only. Is that OK?

    From your first post I also suspect that the files would contain only hexadecimal characters (0123456789ABCDEF). Is that correct?


  16. #13
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
Of course I have tried the solutions suggested by other members, but for example the "CMIX" algorithm shrinks the resulting file only a little bit. Thanks a lot for your help.

P.S. I have tried the EMMA software utility, but no luck - the resulting file was slightly larger.
Yep, definitely hashing or some other type of non-compressible data. I'm sorry, but it seems your data won't get smaller any time soon. The only way would be to reproduce the process by which it was generated. A pseudo-random generator will faithfully reproduce its output given the same input, for example. If it is a checksum and you have access to the original data, it can also be reduced to virtually zero size. I think that is what ECM does. Other than that, you are out of luck. It does not have anything to do with the filenames either - they tend to be fairly compressible. The problem lies in the data itself. Don't look at just a few bytes; that can mislead you, for example, into thinking RLE will be of any help. If you still want to try for yourself, just search for "rle" on GitHub or here in the forum and download some of the tools. I hope you are comfortable with a console, though!

    Best of wishes!


    EDIT: I actually wanted to give it a try myself but the link expired...

  17. #14
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
    Hello Gotty!

    Thanks a lot for your useful explanation. As for your questions...

    1.
    "The compressed expression 19A and 6B does tell about the indivudual frequencies, but it does not tell anything about the order they appear."
    1.1 - Of course I know that, but I am looking for an algorithm that is capable to "detect" both position and occurences of each presented character in text document e.g. 19A(2,8,16,17,20,24,26,27); 23B(1,3,5,10,15,22,30,58,74,88,9) etc. So, inside test folder, there will be total of 16 files with range from 0-9, A-F in filenames plus total number of characters prior that.
For example (the underscores are here only to separate the filename from the demonstrated file content, and there is no literal "etc." at the end, of course):
    19A.txt______1,3,45,10,15,22,30,58,74,88,98 etc.
    23B.txt______8,11,22,65,95,20,30,148 etc.
    18C.txt______87,54,55,12,18,658,748 etc.

P.S. Detecting the positions is not a problem, but I suppose that the file content will have a very poor compression ratio due to its mixed content.
This approach is good. But I don't think this form is easily compressible - I need the smallest possible output I can get.
A better approach will be... discussed in a moment...
So, is there some software (Windows GUI preferably), or at least an algorithm, that can meet my specifications?

    2.
    "Please tell us: do you intend to implement it in your own software? Or do you wish to learn about (text) compression?
    Or... are you looking for just a tool? "
2.1 I don't know at this time.
2.2 I do not have that much time to learn about text compression...
2.3 Yes, I am currently looking for EXE GUI software or at least an ALGORITHM.

    3.
    "In your first post you mentioned one 250K text file, but what you actually have is a lot of 1-byte files that you tried to put in a container format such as an uncompressed rar file. Is that correct?"
3.1 Both questions are correct. I started with random hexadecimal characters generated by myself, written into one text file, because I supposed that there is an algorithm capable of expressing these repeated strings with much smaller data - for example DEFLATE. As far as I know, it works by finding REPETITIVE PATTERNS and storing them in a dictionary. I think that it works as follows:
First of all, DEFLATE starts by processing the text document from beginning to end and encodes, character by character, BOTH the position and the occurrences of each character.
In the decompression stage, the DEFLATE algorithm writes each character with its correct position into a newly created file.
At the end, the algorithm is able to reconstruct the original text file losslessly by inserting each character at its corresponding position in the final TXT document.
So, I supposed that DEFLATE was what I wanted, but... it failed, as I can see.

So, is there an algorithm capable of achieving what I want?

    4.
    "Their content will then become one continuous stream, byte after byte, so you'll be able to compress them tightly.
    The (uncompressed) rar (and the tar format, too) preserves filenames, file lengths, and other metadata interleaved with the contents of the input files. And that's a problem with these container formats - as you experienced it.
    Unfortunately no rar/tar/etc would keep the content as a continuous stream, so you'll need to invent your own container format: all filenames first, concatenated content second."
4.1 You mentioned "TAR format preserves filenames, file lengths and other metadata INTERLEAVED with file content".
I am not exactly sure whether the file-length info could be discarded (although it will be 0 bytes only).
Regarding the interleaving - would it be the same if the TAR archive contained only filenames without any content at all? What about filename compression? Must the filenames follow a specific pattern or not? In other words, if I have only the filenames, is TAR capable of expressing the names as one continuous stream? Or is that a problem of the TAR format itself?

Is there a way to express the content or the filenames as a continuous stream? I am a software developer, but algorithm development is not my area of interest so far...

    5.
    "From your description ("Each of the file consist of character presented in file name") and after seeing the sample you provided it is also clear that you don't need to compress or store the first character of the filename. It is given from the 1-byte file content.

    Please tell us about those filenames. What pattern do the filenames follow? If they have a specific pattern you may generate them with some simple c++ code or a perl script, etc.. So you won't need to compress the filenames at all.
    But then this algorithm would work with these files only. Is that OK?


    From your first post I also suspect that the files would contain only hexadecimal characters (0123456789ABCDEF). Is that correct? "
5.1 Storing the filenames is important, but storing the content is NOT. So the content can be discarded - only the filenames are important.

Which approach could lead to a better compression ratio?
1. 1-A.txt; 2-C.txt; 5-U.txt; 9-G.txt etc.
2. A-210.txt; A-145.txt; A-100.txt; B-35.txt; D-74.txt; D-79.txt etc.

The filenames contain the exact position of each character present in the text document, so the filenames MUST BE COMPLETELY PRESERVED... or at least must be decompressible to their original form.
Of course I know that if I have approx. 10,000 zero-byte files, their data (or, more precisely, the filenames) is stored somewhere else (MFT entries, allocation tables, etc.), so it's not exactly 0 bytes of occupied space, as reported by Windows.

    "But then this algorithm would work with these files only. Is that OK?" - I don´t exact determine what your sentence means... Could you explain it?

Yes, the files would contain only hexadecimal characters generated by myself.

    I will be very glad if you could help me.
    Thanks a lot for your willingness.

    Best regards,
    CompressMaster
    Last edited by CompressMaster; 20th June 2018 at 15:10.

  18. #15
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Quote Originally Posted by CompressMaster View Post
...for example DEFLATE. As far as I know, it works by finding REPETITIVE PATTERNS and storing them in a dictionary. I think that it works as follows:
First of all, DEFLATE starts by processing the text document from beginning to end and encodes, character by character, BOTH the position and the occurrences of each character.
In the decompression stage, the DEFLATE algorithm writes each character with its correct position into a newly created file.
At the end, the algorithm is able to reconstruct the original text file losslessly by inserting each character at its corresponding position in the final TXT document.
It replaces a duplicate string with a pointer (a distance and a length) to a previous occurrence. DEFLATE works with recurring strings, not individual characters. Do you have repeating strings?
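
To illustrate the mechanism, here is a toy LZ77-style tokenizer (a sketch only, not real DEFLATE, which adds Huffman coding and many details on top). It emits either a literal character or a (distance, length) reference to an earlier copy of the same string - this is how repeated strings, rather than single characters, get shortened:

Code:
# Toy LZ77-style tokenizer (sketch, not real DEFLATE).
MIN_MATCH = 3

def lz_tokens(data: str):
    i, tokens = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - 255), i):           # small search window
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
                if j + k == i:                        # keep matches non-overlapping
                    break
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= MIN_MATCH:
            tokens.append(("match", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

# "abcabcabcX" -> lit a, lit b, lit c, (3,3), (6,3), lit X
print(lz_tokens("abcabcabcX"))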

    Quote Originally Posted by CompressMaster View Post
    I started with random hexadecimal characters generated by myself written in one text file, because I supposed that there is an algorithm that is capable to express these repeated strings with much smaller data...
    Did you really generate random content? Random content is incompressible, you probably know that. But if it is really random content did you really see repeated strings? I'm confused now.

    Quote Originally Posted by CompressMaster View Post
    The filenames contains the exact position of each character presented in text document...
So you say that you have a text document, you generated filenames based on its content, and you need to compress these filenames only. Why would you do that? Why not simply compress the original file?

Aha, I see. You would like to "cheat". You are converting a file into {filenames + almost zero content}. Is that what you are trying to do? It does not work.
When you "encode" information in filenames, you still need to compress that information (i.e. by compressing the filenames).
Information is still information, whether it is file content or content encoded in file names.
    Last edited by Gotty; 22nd June 2018 at 09:22.


  20. #16
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
    Hello Gotty,

    Please take a look at https://encode.ru/threads/2984-Compr...us-data-stream and give me detailed info about that.
    Thanks.

    Best regards,
    CompressMaster

  21. #17
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    Hello Compressmaster,

    I'm following the thread you referred to. Gonzalo replied to your latest question - which is very similar to the above.
    I don't know how I can help you. What is your question?

  22. #18
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
In one of your prior posts you mentioned that "their content will then become one continuous stream, byte after byte, so you will be able to compress them tightly".
So, I have altered my prior text file into a form full of Q and q (must I use another character set? A-F and 0-9 aren't allowed, unfortunately), as stated in my other thread, but the file is still large after compression - I need to compress my file down to 99.98% of its original size losslessly, of course.

Well, first of all - the 1st attached file (A.rar - 65 MB decompressed) contains a lot of "A" sequences separated by blank spaces. I was able to compress it down to 6 KB using the plain RAR5 algorithm tuned for the maximum possible compression ratio.
But when I tried the same process with the 2nd attached file (A-mix.rar - 15 MB decompressed), which is much, much smaller than the first, the result was terrible (66 KB). Where is the problem?

Take a look at the following sequence - it is here for demonstration purposes; the actual file is attached below (Q-file.txt):
    QQQqQQQqQQQqQQQQQQQqQQQQqQQQQQQqQQQQQQqQQQQqQQQQqQ QqQQqQQQQqQQQqQQQqQQQqQQQQqQQQQqQQQQQQQqQQQQQQQQqQ QQQQQQqaQQqQQqQQqQQQQqQQQQqQQQq

    This can be further simplified:
    33374664422433344787222443

where each number represents the number of occurrences of the character before the next sequence starts - "q" acts as a separator.

One form of simplification could be: 3.3 7 4 6.2 4.2 2.2 4 3.3 4.2 7 8 7 2.3 4.2 3 etc., but I suppose there will then be other characters (spaces and also dots) acting as counters that I would need to compress as well... what do you think?

Now we have one large integer... is it possible to simplify this large integer into a much smaller form by some decomposition process - factorization or some similar mathematical operation?

Or, what about counting only the separator itself? My software will be able to handle that separation and insert each character into the corresponding fields - assuming that each separator has its own newline-separated (or space-separated - which approach is better?) line in the software's config file.

Of course, all simplification operations will be handled in file-name form where possible, in order to further reduce the final file size.

    Thanks.
Attached Files

  23. #19
    Member cfeck's Avatar
    Join Date
    Jan 2012
    Location
    Germany
    Posts
    50
    Thanks
    0
    Thanked 17 Times in 9 Posts
    Quote Originally Posted by CompressMaster View Post
    I need to compress my file down to 99.98% of its original size losslessly, of course.
    Please read about Shannon's information entropy limit before trying that.

  24. #20
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
@CompressMaster, you have a lot of ideas. And that is good. I like that.

    But as cfeck suggested, you really should understand how data (information) compression works.
It's good that you are trying different approaches. But you are still trying to encode the same information. When the information content (~entropy) of your file is 100K, you can't get it down to 99K. You simply cannot.
When you transform your file in different ways, some transformations will help compression and some will hurt compression. Sometimes there is no transformation at all that would help compression.

I'm quite confused by your posts. In your earliest post (at the top of this thread) there were hexadecimal numbers. You then told us about random content. Now there are these Q's.
    Are they related? What exactly are you trying to do? I mean WHAT are you trying to compress? What are these Q's? What do they represent?
    ((And that is the information you are trying to compress.))

  25. #21
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
Actually, I have only one text file full of O's randomly separated by spaces.

I am trying to compress only these O's. It's purely a random experiment of mine.
My prior posts were only small experiments.

    Now, we have one text file (attached below) full of "O" and "o". Is there any algorithm for compressing this kind of data?
Attached Files

  26. #22
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    What is the goal of your experiment?

  27. #23
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
    My goal is described here.

  28. #24
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
Text compression isn't about compressing symbols in the ASCII range. Text compression is about e.g. decomposition into words, stemming, modelling formatted text, punctuation, etc. In your case you have only 'Q' and 'q' symbols. You can replace them with any other pair of symbols (either printable or non-printable) and get basically the same results. Files composed only of the symbols 'Q' and 'q' are not something that text filters and models are prepared for, so they won't help on such files.

  29. #25
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
    I mean: what is the goal of your experiment? What would you like to see or learn from it?

What are those Q's exactly? If we understand your data better, we may be able to help you better.

Seconding what Piotr has written quite nicely: your file is more of a "binary" file; it's not really a "text" file.

  30. #26
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
I'd like to see whether my data could be compressed down to at least 99.50% of its original size.

These Q's are a purely random pattern that I chose for my experiment (the Q's can be replaced with any other pair of symbols, of course).

I agree that my file looks like binary. OK, so what about if I have only "random" words instead of a symbol sequence? Let me elaborate a little...

    "QQQQQQq" will be replaced by "dog "
    "QQQQq" will be replaced by "car "
    "QQq" will be replaced by "cat "
    and so on...

Would it be irrelevant if the replacement words were written, for example, in Slovak instead of English? Or is it language independent? Do you know some software to do this? Maybe EMMA with the Text (slow) preset? Or PAQ?
Are there any specific rules for text modelling algorithms? I mean the choice of appropriate words.

In conclusion, one more question for Gotty, unrelated to compression - I know that you are from Hungary, as stated in your profile. I would therefore like to ask whether you could give me contact info for the owner of Multipék KFT, Kiskőrös, Petőfi Sándor út 72, 6200 Hungary. I know that the company is defunct, but I really need to contact its owner. The telephone number shown in Google's results is not currently assigned. Neither the website nor the email address - www.multipek.hu; info@multipek.hu - works at all; the domain hasn't been renewed. I would be very grateful. Thanks.

  31. #27
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    343
    Thanks
    235
    Thanked 226 Times in 123 Posts
The Q/q's are not random. If they were truly random, you could not reduce a file that contains only 2 symbols (Q/q) by more than 87.5%. See Shelwien's results with bit.exe. He encoded every Q and q as a 0 and a 1 bit. If the Q/q's were truly random, that would be the maximum you could get. The encoded result, however, could be compressed further. Thus those Q/q's do follow some pattern - that's why the file is compressible by more than 87.5%. So how did you generate the file?
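
The idea behind that kind of bit packing is simple (a sketch of the principle only, not Shelwien's actual bit.exe): with just two symbols, each character carries at most 1 bit of information, so 8 characters fit into 1 output byte - 12.5% of the original size, an 87.5% reduction - and only real structure in the data lets a model do better than that:

Code:
# Pack a Q/q-only text into bits: 8 characters per output byte.
def pack_bits(text: str) -> bytes:
    out = bytearray()
    for i in range(0, len(text), 8):
        byte = 0
        for ch in text[i:i + 8]:
            byte = (byte << 1) | (1 if ch == "q" else 0)
        out.append(byte)                      # last byte may hold fewer than 8 chars
    return bytes(out)

data = "".join(open("Q-file.txt").read().split())   # the attached sample, whitespace removed
print(len(data), "chars ->", len(pack_bits(data)), "packed bytes")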

    PM-ed about your extra request.

  32. #28
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
I agree that my file looks like binary. OK, so what about if I have only "random" words instead of a symbol sequence? Let me elaborate a little...

"QQQQQQq" will be replaced by "dog "
"QQQQq" will be replaced by "car "
"QQq" will be replaced by "cat "
and so on...

Would it be irrelevant if the replacement words were written, for example, in Slovak instead of English? Or is it language independent? Do you know some software to do this? Maybe EMMA with the Text (slow) preset? Or PAQ?
Are there any specific rules for text modelling algorithms? I mean the choice of appropriate words.
Text models usually help text compression, but they are tuned on real texts, so they rely on the properties of real texts, e.g. average word length, average amount of punctuation, rules like a space after a dot, semicolon or closing parenthesis, an uppercase letter after a dot, etc. If you add dots, parentheses, uppercase letters, etc. randomly, then text models can even worsen compression. OTOH, on textual data, models other than text models have less significance because they don't directly model the properties of real text. Artificially changing the type of your data is not going to magically help. Some transforms do indeed help compression, like dictionary-based preprocessing, executable filtering, etc., but such transforms are tuned to particular compressors. xml-wrt, for example, has four output modes, each tuned to a different type of compressor:
    Code:
    https://github.com/inikep/XWRT
    
    GENERAL OPTIONS (which also set default additional options):
    (...)
      -0 = preprocessed and uncompressed output optimized for further LZ77 compression
      -1 = preprocessed and uncompressed output optimized for further LZMA compression
      -2 = preprocessed and uncompressed output optimized for further PPM compression
      -3 = preprocessed and uncompressed output optimized for further PAQ compression
Speaking very tersely: compression relies on finding patterns which can be unambiguously described in less space than the original data. The Deflate algorithm used in ZIP and gzip files, for example, detects repetitions and encodes them as (offset, length) pairs. But encoding a repetition does not always bring gains. Sometimes encoding an (offset, length) pair can take more bits than just encoding the repeated data verbatim, byte by byte. I think writing a simple LZ77 + Huffman compressor would give you serious insight and intuition into the lossless compression process in general.

How do you find out whether a pattern can be unambiguously described in less space than the original data? Well, by trial and error. Implement a compression technique based on your idea, test it on the files you're interested in, and check whether the idea helps. Someone with a lot of experience in data compression can tell with high probability whether an idea is worth checking. Experience in data compression is mostly gained by trial and error. If a technique works well, you stick with it. If it works poorly, you scrap it.

  33. #29
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Gotty View Post
The Q/q's are not random. If they were truly random, you could not reduce a file that contains only 2 symbols (Q/q) by more than 87.5%. See Shelwien's results with bit.exe. He encoded every Q and q as a 0 and a 1 bit. If the Q/q's were truly random, that would be the maximum you could get. The encoded result, however, could be compressed further. Thus those Q/q's do follow some pattern - that's why the file is compressible by more than 87.5%. So how did you generate the file?

    PM-ed about your extra request.
You cannot do better than an 87.5% reduction if the data is random and the two symbols have equal probability. For the only-O.txt file, 88.7% of the symbols are O and 11.3% are o. If the data were random with these probabilities, the entropy would be about 225,473 bytes, or 6.37% of the original file size.

Your point is still accurate though: the symbols are not random. The file does not contain the pattern "oo".
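
As a quick check of those figures (a sketch of the arithmetic only): the order-0 entropy of a two-symbol source with these probabilities is about 0.509 bits per symbol, i.e. roughly 6.4% of the 8 bits each character occupies, which matches the estimate above:

Code:
import math
p = {"O": 0.887, "o": 0.113}                       # symbol frequencies quoted above
h = -sum(q * math.log2(q) for q in p.values())     # ~0.509 bits per symbol
print(f"{h:.3f} bits/symbol, {h / 8:.2%} of the original size")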


  35. #30
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
Your point is still accurate though: the symbols are not random. The file does not contain the pattern "oo".
    If that's true, we can losslessly compress it by removing one "O" before any "o" (except the first "o" because otherwise the transform would be ambiguous).
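
A sketch of that reversible transform (assuming, as observed above, that "oo" never occurs, so every "o" except possibly a file-initial one is preceded by an "O"):

Code:
# Forward: drop the "O" immediately before every "o" except the first "o".
# Inverse: re-insert an "O" before every "o" except the first "o".
def forward(text: str) -> str:
    out, seen_o = [], False
    for ch in text:
        if ch == "o" and seen_o and out and out[-1] == "O":
            out.pop()                 # remove the redundant "O"
        out.append(ch)
        seen_o = seen_o or ch == "o"
    return "".join(out)

def inverse(text: str) -> str:
    out, seen_o = [], False
    for ch in text:
        if ch == "o" and seen_o:
            out.append("O")           # put the dropped "O" back
        out.append(ch)
        seen_o = seen_o or ch == "o"
    return "".join(out)

s = "OOoOOOoOoOO"                     # no "oo" anywhere
assert inverse(forward(s)) == s and len(forward(s)) < len(s)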



