
Thread: I am looking for the BEST filenames compression

  1. #1
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    39
    Thanks
    6
    Thanked 0 Times in 0 Posts

    I am looking for the BEST filenames compression

    Hello users,

    I am currently looking for the BEST filename compression algorithm available.

    My sample consists of the following descending-ordered files:
    1.7.TXT
    2.4.TXT
    3.5.TXT
    4.2.TXT
    5.18.TXT
    6.11.TXT
    and so on.

    There are approximately 9830 files in total (1.7.TXT is the first file; 9830.47.TXT is the last). This is, of course, an oversimplified sample; in practice, the number of files will vary considerably from sample to sample.
    All files are empty, i.e. 0 bytes.

    You're probably asking WHAT the numbers represent: I chose them myself for a small experiment with filename compression...
    Of course, it's possible to replace the numbers with corresponding letters, but the number of letters must exactly match the given number. For example:
    the file "6.11.TXT" can be expressed as "6.COMPRESSION.TXT" (11 letters); furthermore, it can be replaced with a much simpler form, e.g. "6.IIIIIIIIIII.TXT"... or is that irrelevant? The .TXT extension can also be replaced if necessary (e.g. with ".A") in order to get a better ratio.

    So, is it possible to losslessly compress these filenames by at least 99.97% of their original size? The more, the better, of course. Any other ideas?

    Thanks a lot.

    Best regards,
    CompressMaster

  2. #2
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    151
    Thanks
    19
    Thanked 66 Times in 37 Posts
    If your filenames are just increasing numbers, then you can write a program to produce those filenames, one that supports file-0 to file-infinity; there will be no shorter representation than the source code that generated them.

    If you're looking for a compressor for short strings you can take a look at this https://github.com/antirez/smaz

    But since none of the data sets you produce is representative of real data, the compression approaches you're talking about aren't general-purpose or useful. Do you want to learn how to make compression algorithms? If so, I'd recommend reading this: http://mattmahoney.net/dc/dce .

  3. #3
    Member Gotty
    Join Date
    Oct 2017
    Location
    Hungary
    Posts
    264
    Thanks
    204
    Thanked 163 Times in 91 Posts
    Quote: Originally posted by CompressMaster
    My sample consists of the following descending-ordered files
    You say they are in descending order. What do you mean? They seem to be in ascending order.

    Of course you can achieve 99.97% compression, and even more. Put a billion "I" letters or more in the filenames, as you suggested: "6.IIIII...IIIIII.TXT". They will be highly compressible. Of course, that makes no sense. But your experiment doesn't make much sense either. Or does it?
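
    A minimal sketch of why the percentage alone says little, assuming Python's standard zlib module as the compressor and using the six second-field values from the first post. The padding factor of 1000 and the helper name ratio_saved are made up for illustration.

```python
import zlib

def ratio_saved(names):
    """Return (raw_bytes, compressed_bytes, percent_saved) for a list of names."""
    raw = "\n".join(names).encode("ascii")
    packed = zlib.compress(raw, 9)
    return len(raw), len(packed), 100.0 * (1 - len(packed) / len(raw))

# Second fields of the six sample names shown in the first post.
counts = [7, 4, 5, 2, 18, 11]

# The names as given: index, count in decimal, extension.
decimal_names = [f"{i}.{c}.TXT" for i, c in enumerate(counts, 1)]

# Same information, but every count is inflated into a long run of "I"
# (hypothetical padding factor of 1000 letters per unit).
padded_names = [f"{i}.{'I' * (c * 1000)}.TXT" for i, c in enumerate(counts, 1)]

dec = ratio_saved(decimal_names)
pad = ratio_saved(padded_names)
print("decimal:", dec)
print("padded: ", pad)
```

    The padded variant reports a far higher percentage saved, but only because its input was inflated first; the absolute compressed sizes are the numbers worth comparing.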

    Quote: Originally posted by CompressMaster
    You're probably asking WHAT the numbers represent: I chose them myself for a small experiment with filename compression...
    Yes. That's my question. But it's not the kind of answer I am looking for. I'm out.
    Last edited by Gotty; 9th August 2018 at 00:15. Reason: Typos

  4. #4
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    39
    Thanks
    6
    Thanked 0 Times in 0 Posts
    Lucas,

    Regarding filename generation, this process could be simplified further: it is enough to store only the last file number, i.e. 9830, because my software will be able to generate the files in the range from "1" to the number given in a configuration file using an incrementing function. So this is resolved. But there's another significant problem...
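
    The idea of storing only the last number can be sketched as follows. The function name generate_names is hypothetical, and the sketch assumes the names are purely sequential ("1.TXT", "2.TXT", ...).

```python
def generate_names(last_index, extension=".TXT"):
    """Rebuild the names 1.TXT .. <last_index>.TXT from one stored integer."""
    return [f"{i}{extension}" for i in range(1, last_index + 1)]

names = generate_names(9830)
print(names[0], names[-1], len(names))  # 1.TXT 9830.TXT 9830
```

    This only covers the running index. A second field that is not sequential, such as the 7, 4, 5, 2, 18, 11 in the first post, cannot be regenerated from the count and still has to be stored.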

    I've looked at smaz. Firstly, there is neither a command-line nor a GUI implementation. Or is there one? I don't have time to compile it myself... Secondly, smaz works with SHORT STRINGS. Let me elaborate a little...

    1.aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    2.aaaaaaaaaaaaaaaaaa
    3.aaaa
    4.aaaaaaaaaaaaa
    5.aaaaaaaaaaaaaaaaaaaaaaa
    6.aaa
    7.aaaaaaaaaaaa
    8.aaaaaaaaaaaaa
    and so on.
    So, is it possible with SMAZ to losslessly compress my filenames by at least 99.98% of their original size? I am afraid that the repetitions of those a's will most likely be lost.
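
    Independently of smaz, a run of one repeated letter loses nothing if only its length is stored. A minimal run-length sketch, assuming sequential indices and a single repeated letter; the function names and the example lengths are hypothetical:

```python
def encode_runs(names, letter="a"):
    """Turn ['1.aaa', '2.a', ...] into the run lengths [3, 1, ...]."""
    counts = []
    for expected_index, name in enumerate(names, 1):
        index, run = name.split(".", 1)
        # This sketch only accepts sequential indices and runs of one letter.
        if int(index) != expected_index or run != letter * len(run):
            raise ValueError(f"unexpected name: {name!r}")
        counts.append(len(run))
    return counts

def decode_runs(counts, letter="a"):
    """Rebuild the original names from the run lengths."""
    return [f"{i}.{letter * c}" for i, c in enumerate(counts, 1)]

# Hypothetical run lengths in the style of the list above.
names = ["1." + "a" * 41, "2." + "a" * 18, "3." + "a" * 4, "4." + "a" * 13]
counts = encode_runs(names)
print(counts)  # [41, 18, 4, 13]
assert decode_runs(counts) == names  # the round trip is lossless
```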

    2nd approach:
    If I have folders with 0-byte files, i.e.
    1 - 1.TXT; 2.TXT; 3.TXT
    2 - 1.TXT; 2.TXT; 3.TXT; 4.TXT; 5.TXT; 6.TXT
    3 - 1.TXT; 2.TXT
    4 - 1.TXT; 2.TXT; 3.TXT; 4.TXT; 5.TXT; 6.TXT; 7.TXT; 8.TXT

    OR
    1 - 2.TXT; 3.TXT; 4.TXT
    5 - 6.TXT; 7.TXT; 8.TXT; 9.TXT; 10.TXT; 11.TXT
    12 - 13.TXT; 14.TXT
    15 - 16.TXT; 17.TXT; 18.TXT; 19.TXT; 20.TXT; 21.TXT; 22.TXT; 23.TXT

    What do you think about that?
    Thanks.

  5. #5
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    39
    Thanks
    6
    Thanked 0 Times in 0 Posts
    Gotty,
    Yeah, it's ascending.
    In one of your prior posts you mentioned:

    Their content will then become one continuous stream, byte after byte, so you'll be able to compress them tightly.
    The (uncompressed) rar (and the tar format, too) preserves filenames, file lengths, and other metadata interleaved with the contents of the input files. And that's a problem with these container formats - as you experienced it.
    Unfortunately no rar/tar/etc would keep the content as a continuous stream, so you'll need to invent your own container format: all filenames first, concatenated content second.
    1.aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    2.aaaaaaaaaaaaaaaaaa
    3.aaaa
    4.aaaaaaaaaaaaa
    5.aaaaaaaaaaaaaaaaaaaaaaa
    6.aaa
    7.aaaaaaaaaaaa
    8.aaaaaaaaaaaaa
    and so on.

    2nd approach:
    If I have multiple folders with 0-byte files, i.e.
    1 - 1.TXT; 2.TXT; 3.TXT
    2 - 1.TXT; 2.TXT; 3.TXT; 4.TXT; 5.TXT; 6.TXT
    3 - 1.TXT; 2.TXT
    4 - 1.TXT; 2.TXT; 3.TXT; 4.TXT; 5.TXT; 6.TXT; 7.TXT; 8.TXT

    OR
    1 - 2.TXT; 3.TXT; 4.TXT
    5 - 6.TXT; 7.TXT; 8.TXT; 9.TXT; 10.TXT; 11.TXT
    12 - 13.TXT; 14.TXT
    15 - 16.TXT; 17.TXT; 18.TXT; 19.TXT; 20.TXT; 21.TXT; 22.TXT; 23.TXT

    OR
    1.TXT
    2..TXT
    3.TXT
    4.TXT
    5.TXT
    6..TXT
    and so on.

    OR
    1.TXT - A
    9.TXT - A
    18.TXT - A
    23.TXT - A
    and so on.

    What do you think about that?

    In my prior thread you mentioned:
    It replaces a duplicate string with a pointer and length to a previous occurrence. DEFLATE is working with reoccurring strings, not characters. Do you have repeating strings?
    So, is DEFLATE able to replace these strings if I have the "same" strings encoded across multiple files? Most likely not.
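
    A small experiment can show the difference, assuming Python's standard zlib module (a DEFLATE implementation) and 200 hypothetical names that share a common suffix. DEFLATE can only point back to an earlier occurrence inside the same stream, within its 32 KiB window, so the names have to be concatenated before compression.

```python
import zlib

# Hypothetical names sharing the repeated string ".COMPRESSION.TXT".
names = [f"{i}.COMPRESSION.TXT" for i in range(1, 201)]

# Each name compressed on its own: there is no earlier occurrence to
# point back to, and every output also carries its own zlib framing.
separate = sum(len(zlib.compress(n.encode("ascii"), 9)) for n in names)

# All names in one stream: the repeats can be replaced by back-references.
stream = "\n".join(names).encode("ascii")
packed = zlib.compress(stream, 9)
together = len(packed)

print("separate:", separate, "bytes; together:", together, "bytes")

# The single-stream version still restores every name exactly.
restored = zlib.decompress(packed).decode("ascii").split("\n")
assert restored == names
```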
    Thanks.

  6. #6
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    151
    Thanks
    19
    Thanked 66 Times in 37 Posts
    As for smaz, its purpose is to compress strings: URLs, filenames, SMS messages, or whatever you want. If you wanted to make a filename compression tool, you could use smaz to compress existing filenames and recover the originals later; that would be a potentially useful tool.

    All I can tell from your posts is that you're suggesting base-changing algorithms or swapping numbers for letters, which won't compress anything. If you want a way to identify a large number of files with a compressed but uniquely decodable code/name, just use Exponential-Golomb coding (https://en.wikipedia.org/wiki/Exponential-Golomb_coding); for small numbers it is hard to get much more efficient than that.
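
    For reference, a minimal sketch of order-0 Exponential-Golomb coding, applied to the six second-field values from the first post. Bits are kept as a text string of "0" and "1" for readability, which is an illustration choice; a real tool would pack them into bytes.

```python
def exp_golomb_encode(n):
    """Order-0 Exp-Golomb code of a non-negative integer, as a bit string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    binary = bin(n + 1)[2:]  # n + 1 in binary, without the '0b' prefix
    # Prefix with one zero per bit after the leading 1, so the decoder
    # knows how many bits to read.
    return "0" * (len(binary) - 1) + binary

def exp_golomb_decode(bits):
    """Decode a concatenation of order-0 Exp-Golomb codes."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # count the zero prefix
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2) - 1)
        pos += zeros + 1
    return values

counts = [7, 4, 5, 2, 18, 11]  # second fields from the first post
bits = "".join(exp_golomb_encode(c) for c in counts)
print(bits, len(bits))  # 36 bits in total
assert exp_golomb_decode(bits) == counts
```

    Because it is a prefix code, the codes can be concatenated without separators and still be decoded unambiguously.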

    Regardless of whether or not smaz has a GUI, smaz is incredibly easy to build since it's just two source files and a header. The header declares the compress and decompress functions, so you can embed it into your own application; I built it in under 5 minutes with GCC. To me it seems you aren't a programmer and are just being stubborn when we don't do all the work for you. You learn by doing, so I recommend learning the C language and reading a book about compression.

    Install GCC and start playing around with the provided source code if you want to learn the secret to compression.

    Regards