Results 1 to 15 of 15

Thread: Detecting previous matches, with very small code, but also good speed?

  1. #1
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    7
    Thanks
    3
    Thanked 0 Times in 0 Posts

    Question Detecting previous matches, with very small code, but also good speed?

    Hi... I'm making a compressor. It's for a game I made, where I have block data like minecraft.

    My progress so far is good. I'm able to get about 10x compression on my save-files. This is good!

    Everything in my game so far is implemented with a very small amount of code (relatively). Entire game is 300K... but feature all the subsystems needed for a modern game. (Hit-detection, level creation/management, graphics, sound-threading, 3D code, dynamic shader reloading (So I can live-edit my shaders and see results in-game), saving/loading, compression, string libraries, data file parsing/rendering, user-interface, etc etc)

    So I don't want to use compression like gzip because actually just putting it in, will already double the size of my game Not fun.

    My compress/decompress takes about 4K of code... It's nice.

    BUT it compresses very slow. My "pattern detection" code, is using "dynamic programming". It will ALWAYS find the best match. But it's so slow.

    I'm using something like this: https://algorithms.tutorialhorizon.c...mon-substring/ but adapted for compression of files.

    Advantage: Small code, simple code, tight compression.
    Disadvantage: SLOW.

    Can anyone advise me how to find/make some code that is small AND fast AND acheives good matching? I don't mind if its code that I need to adapt to make work. I don't need a "ready made solution"... just to see that someone else made something "small and fast" should be enough for me to learn it's technique...

    it would really be helpful

    Thanks!

    ...

    I did look at some existing "Small compressors" such as http://create.stephan-brumme.com/smallz4/ but... his code had a bug in it, and actually ran VERY VERY VERY slowly... like 100x slower than mine, on a simple text file. And also didn't compress well. Bad compression, bugs on some files, and also 100x slower than mine. I tried a few other "small code size compressors", but similar problem. All very slow and buggy.

    ...

    For info about my compressor. It's not special It's like LZ4 (offset match) except I'm using 2 byte codes instead of 3. Also it has more flexible encoding, so it can escape 16 bytes with 1 byte overhead. And optional "Extension" so I can use 3 byte codes for longer offset-matches... if needed. Not special but if anyone is super-interested I can give code...

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Here's a rather concise match finder source code: https://github.com/Cyan4973/mmc
    I don't find it particularly readable, though.
    Link to description of the algorithm of MMC: https://encode.ru/threads/1155-A-new...hing-structure

  3. The Following User Says Thank You to Piotr Tarsa For This Useful Post:

    BakingPowder (26th March 2018)

  4. #3
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    7
    Thanks
    3
    Thanked 0 Times in 0 Posts
    I will give it a look thanks If anyone else has alternatives let me know also. It's good to have multiple choices. I will read them all!

    Thank you Piotr!

  5. #4
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    7
    Thanks
    3
    Thanked 0 Times in 0 Posts
    I read the entire thread... O_O It's a lot of talk and I think I'm dizzzy now. (

    haha.

    However, his "mmc.c" file is very small! In fact, it's smaller than it seems due to his putting code like this:

    Code:
    {
       int c;
       for (c=0; c<NBCHARACTERS; c++)
       {
          MMC->segments[c].segments = (segmentInfo_t *) REALLOCATOR(MMC->segments[c].segments, NB_INITIAL_SEGMENTS * sizeof(segmentInfo_t));
          MMC->segments[c].max = NB_INITIAL_SEGMENTS;
          MMC->segments[c].start = 0;
          MMC->segments[c].segments[0].size = -1;
          MMC->segments[c].segments[0].position = (const BYTE*)beginBuffer - (MAX_DISTANCE+1);
       }
    }

    instead of this:
    Code:
    for (int c = 0; c < NBCHARACTERS; c++) {
        auto& Seg = MMC->segments[c];
        Seg.segments = (segmentInfo_t*) REALLOCATOR(Seg.segments, NB_INITIAL_SEGMENTS * sizeof(segmentInfo_t));
        Seg.max = NB_INITIAL_SEGMENTS;
        Seg.start = 0;
        Seg.segments[0].size = -1;
        Seg.segments[0].position = (const BYTE*)beginBuffer - (MAX_DISTANCE+1);
    }
    (8 lines instead of 11, also less repetition in code which makes it easier to read).

    But yes, it is small enough! Thanks a lot. Hopefully I can use it...

  6. #5
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Perhaps Yann wanted to write plain C so C++ features are not present.

    I think if you distribute modified MMC then you're bound to release the source code of it, due to LGPL license :]

  7. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Code:
    for (int c = 0;
    is better, but requires a C99 compliant compiler.
    Unfortunately, when targeting extended portability, that's not an option

    Code:
    {
       int c;
       for (c=0;
    is strictly equivalent, but requires one more nest level, so it's much less neat.
    However, it's C89 compatible.

    If compiler portability is not in your requirement list, and your target compiler supports C99,
    I recommend the first option.

  8. The Following User Says Thank You to Cyan For This Useful Post:

    BakingPowder (26th March 2018)

  9. #7
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    The reference algorithm in radix_engine.h at https://github.com/conor42/fast-lzma2 always finds the longest match up to a specified length limit. It's quite compact but is several times slower than the optimized version. It's a block algorithm instead of using a sliding window, but if your entire input fits into one block this is not an issue. The output is an array of match locations. Note that the output format is different if the block size is > 64Mb or the string length is >= 64 bytes. Below these limits, both the match position and length are packed into one 32-bit value. You can alter the size of the two bit fields.

  10. #8
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    7
    Thanks
    3
    Thanked 0 Times in 0 Posts
    @Conor... I looked at fast-lzma2 ... Unfortunately I think I need to understand it first. Do you know any good links to describe how radix is used for compression? If I google "radix" I find lots of stuff about "Radix sort". But I think that's different than what you are doing?

    I think my brain is a little beaten from the time I read about MMC, but I should be able to read more in a day or two if I give my brain some rest.

  11. #9
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    7
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Cyan View Post
    If compiler portability is not in your requirement list, and your target compiler supports C99,
    I recommend the first option.
    Thanks Cyan that explains it. Its trying to be portable code...

  12. #10
    Member
    Join Date
    Feb 2015
    Location
    Australia
    Posts
    73
    Thanks
    13
    Thanked 65 Times in 23 Posts
    Quote Originally Posted by BakingPowder View Post
    @Conor... I looked at fast-lzma2 ... Unfortunately I think I need to understand it first. Do you know any good links to describe how radix is used for compression? If I google "radix" I find lots of stuff about "Radix sort". But I think that's different than what you are doing?
    There are no links that I know of that describe anything similar. The functions RadixInitReference and RecurseListsReference demonstrate a simple version of the algorithm. The optimized version has a lot more code but produces an almost identical result. If you want to see how to run the matchfinder and use the output, the function FL2_compressCurBlock in fl2_compress.c does the allocation and function calls, and FL2_radixGetMatch in lzma2_enc.c pulls the matches from the table.

  13. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Actually, I've developed a radix sort based match finder: https://github.com/tarsa/TarsaMatchFinder
    I haven't integrated it with any LZ coder, though.

  14. #12
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    7
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Cyan View Post
    Code:
    for (int c = 0;
    If compiler portability is not in your requirement list, and your target compiler supports C99,
    I recommend the first option.

    Hi Cyan,


    I'm trying to understand this code in mmc.c:


    size_t MMC_insertAndFindBestMatch (MMC_ctx* MMC, const void* inputPointer, size_t maxLength, const void** matchpos)


    I got the code from https://github.com/Cyan4973/mmc/blob/master/mmc.c as recommended in this thread by Piotr Tarsa.


    I had a lot of struggles with trying to understand it... I wrote the struggles here on this thread then deleted them as I realised this version of mmc seems out of date. Compared to the version at https://github.com/yandex/balancer/t...ntrib/libs/lz4

    So that explains it. But I just let you know that an old version of mmc.c with incorrect documentation exists.

    Hopefully I can have a better luck with the version on the "yandex"... I'll try again tomorrow.

  15. #13
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Indeed, mmc project hasn't been updated in years.

    I've added a small example to the project, to showcase how to use mmc.

    Even then, mmc is no longer "cutting edge", if it ever was.
    I suspect that the comparatively simpler LZ4_HC match finder is probably more efficient these days.

  16. The Following User Says Thank You to Cyan For This Useful Post:

    BakingPowder (26th June 2018)

  17. #14
    Member
    Join Date
    Mar 2018
    Location
    London
    Posts
    7
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Thanks a lot Cyan. Can I ask in your opinion what is a good/simple match-finder code for me to look at, to use or learn from? Yours, or anyone else's?

    Do you have a link to it? You said the LZ4_HC match finder, is that the one at https://github.com/yandex/balancer/t...libs/lz4/mmc.c or is it a different one? It seems mostly the same as the outdated one, just tweaked. Maybe you meant a different LZ4_HC match finder?

  18. #15
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    I meant the match finder as in :
    https://github.com/lz4/lz4/blob/dev/lib/lz4hc.c

    In contrast with mmc, it's not designed as a library : the match finder is expected to work with its own local parsers, in the same file.
    But I guess you could probably use that as a starting point.

    Also, looking again into it today, the implementation has become significantly complex, in the name of performance.
    Should simplicity be more important, you might want to look at an older variant, such as v1.8.0 :
    https://github.com/lz4/lz4/blob/v1.8.0/lib/lz4hc.c#L120

Similar Threads

  1. How do I dump the strings LZ77 matches?
    By SolidComp in forum Data Compression
    Replies: 4
    Last Post: 27th August 2016, 08:16
  2. sorting rotations of a string - detecting an never-ending comparision
    By just a worm in forum The Off-Topic Lounge
    Replies: 9
    Last Post: 5th June 2015, 17:53
  3. DataSmoke: detecting text, multimedia and incompressible data
    By Bulat Ziganshin in forum Data Compression
    Replies: 33
    Last Post: 8th March 2014, 00:33
  4. Detecting non-immediate data correlations
    By GerryB in forum Data Compression
    Replies: 14
    Last Post: 5th December 2010, 10:30
  5. compression speed VS decomp speed: which is more important?
    By Lone_Wolf236 in forum Data Compression
    Replies: 14
    Last Post: 12th July 2010, 19:57

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •