Page 2 of 5 (Results 31 to 60 of 130)

Thread: LZPM's future

  1. #31
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Quote Originally Posted by encode
    Another idea:
    0xxxxxxx - literal runs - 1..128 literals
    1xxxxxxx yyyyyyyy yyyyyyyy - match, match length, match offset
    By the way, this is a cool idea. Firstly we have a larger window and longer MAXMATCH. Secondly, the decoder is even simpler. In addition this scheme has a nice handling of already compressed data.
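In C, the proposed token layout would decode roughly like the sketch below, matching the asm decoder posted later in the thread: literal runs of 1..128 bytes, matches of length 4..131 with a 16-bit offset (64KB window). Function and variable names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed token format:
     0xxxxxxx                    -> run of (x+1) literals (1..128)
     1xxxxxxx yyyyyyyy yyyyyyyy  -> match of length (x+4); little-endian
                                    16-bit y gives offset (y+1), i.e. a
                                    64KB window and a minimum match of 4. */
static void lzss_decode(const uint8_t *src, uint8_t *dst, size_t dst_len)
{
    uint8_t *end = dst + dst_len;
    while (dst < end) {
        uint8_t c = *src++;
        if (c < 0x80) {                       /* literal run */
            size_t n = (size_t)c + 1;
            while (n--) *dst++ = *src++;
        } else {                              /* match */
            size_t len = (size_t)(c - 0x80) + 4;
            size_t off = (size_t)(src[0] | (src[1] << 8)) + 1;
            src += 2;
            const uint8_t *from = dst - off;
            while (len--) *dst++ = *from++;   /* byte-wise: overlap-safe */
        }
    }
}
```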

    The compression comparison between two LZSS variants:
    4KB - current scheme
    64KB - new scheme

    MPTRACK.EXE (1,159,172 bytes)
    4KB: 660,927 bytes
    64KB: 711,399 bytes

    explorer.exe (1,032,192 bytes)
    4KB: 505,733 bytes
    64KB: 503,481 bytes

    Doom3.exe (5,427,200 bytes)
    4KB: 2,672,672 bytes
    64KB: 2,678,714 bytes

    Reaktor.exe (14,446,592 bytes)
    4KB: 4,205,904 bytes
    64KB: 3,463,415 bytes [!]

    TraktorDJStudio3.exe (29,124,024 bytes)
    4KB: 10,245,406 bytes
    64KB: 9,004,243 bytes

    test.exe (7,919,616 bytes)
    4KB: 2,221,001 bytes
    64KB: 1,720,188 bytes


  2. #32
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    fp.log (20,617,071 bytes)
    4KB: 3,738,831 bytes
    64KB: 1,489,511 bytes [!]

    world95.txt (2,988,578 bytes)
    4KB: 1,551,596 bytes
    64KB: 1,015,861 bytes

    A10.jpg (842,468 bytes)
    4KB: 944,651 bytes
    64KB: 848,320 bytes

    BOOK1 (768,771 bytes)
    4KB: 414,466 bytes
    64KB: 398,326 bytes

    calgary.tar (3,152,896 bytes)
    4KB: 1,387,985 bytes
    64KB: 1,303,582 bytes

    Hm, nice LZSS modification...

  3. #33
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Compression may be further improved by adding a more advanced parsing scheme. In addition, during parsing we may add some extra tricks suited to the new coding scheme (looking at how many literals are in the temp buffer, and so on...)


  4. #34
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    And the ASM decoder may be like this:

    Code:
     
    procedure decompress; 
    asm 
      pushad 
      cld 
     
      lea esi,src 
      lea edi,dst 
      mov ebp,siz 
      add ebp,edi 
     
    @@1: 
      movzx ecx,byte ptr[esi] 
      add esi,1 
      test ecx,128 
      jz @@3 
     
    @@2: 
      sub ecx,128 // or sub ecx,124, ICL does add ecx,-125 
      movzx eax,word ptr[esi] 
      add esi,2 
      not eax 
      add eax,edi 
      xchg eax,esi 
      movsb 
      movsb 
      movsb 
      movsb 
      rep movsb 
      xchg eax,esi // or mov esi,eax, if xchg is slower than mov 
      jmp @@4 
     
    @@3: 
      movsb // or add ecx,1 
      rep movsb 
     
    @@4: 
      cmp edi,ebp 
      jb @@1 
     
      popad 
    end;




    Hm, why is decompression incorrect with
    MOVSD
    REP MOVSB
    ?
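A likely cause (my guess, not stated in the thread): replacing the four MOVSBs with one MOVSD changes behavior for overlapping matches with offset < 4, since a 4-byte read grabs bytes before the byte-by-byte copy would have produced them. A small C model of the difference:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* LZ77 represents runs with matches whose source overlaps the
   destination; byte-by-byte copying deliberately re-reads bytes it has
   just written. A dword read followed by a dword write (what MOVSD does)
   picks up stale bytes instead. */
static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n--) *dst++ = *src++;        /* 4x MOVSB + REP MOVSB */
}

static void copy_dword_first(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint8_t tmp[4];
    memcpy(tmp, src, 4);                /* MOVSD: one 4-byte read...  */
    memcpy(dst, tmp, 4);                /* ...then one 4-byte write   */
    copy_bytewise(dst + 4, src + 4, n - 4);
}
```

With a match of offset 1 and length 4 over a buffer starting with 'A', the byte-wise copy replicates the 'A' four times, while the dword-first copy propagates three stale bytes.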




  5. #35
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    For testing purposes, self-extracting test.exe:
    test.exe


  6. #36
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by encode
    I believe that many things in this ASM source are done cleverly. I think this one is mostly better than others in its category.
    I'm pretty sure mine beats yours in terms of speed. I designed an LZSS+RLE compressor/decompressor specifically for extremely slow x86 hardware (think 4.77 MHz 8088). It uses 16-bit symbols to take advantage of the fastest instructions available for that platform, and all calculation is done in registers (i.e. the only memory accesses are fetching and storing tokens/symbols). Here's the inner loop:

    Code:
    SHL BX,1 ;get leftmost control bit into carry bit
    JC @process_literal ;if carry set, next token is a literal
    POP CX ;otherwise, it's a code: grab the length,
    SHR CX,1 ;SHR to both grab carry bit and adjust range
    JC @process_match ;carry bit set? It's a match
    POP AX ;not set? It's a run. Grab data to make run
    REP STOSW ;output the word AX, CX times
    JMP @token_finished ;Finished with this token
    @process_match:
    POP SI ;next token is the offset to copy FROM
    REP MOVSW ;Copy CX words
    JMP @token_finished ;Finished with this token
    @process_literal:
    POP AX ;grab the literal...
    STOSW ;...and store to output buffer
    @token_finished:


    While I'm still tuning it (rearranging the jumps so that literals "fall through" the jump comparison will help a bit, because they're the most common case), this is the fastest LZSS+RLE decompressor implementation possible on 16-bit 808x. It doesn't compress smaller than LZO/UCL, but it outperforms them on decompression speed.

  7. #37
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Maybe on 16-bit platforms - not on Pentium-class ones. Also, I bet my new LZSS gets higher compression. Anyway, the new optimized version for the Pentium is in progress...

  8. #38
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    By the way, which is faster - POP or MOV-based stuff? I will search the optimization guides. I think the latter should be faster.

  9. #39
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by encode
    Hm, why is decompression incorrect with
    MOVSD
    REP MOVSB
    ?
    It shouldn't be; you should step through it in a debugger and watch the registers and output to see what's going on.

    BTW, if you're using esi, eax, etc. then I'll assume you're on a 386 or later, and you can make some more optimizations to your code (movsd). Even without 386+ opcodes there are some easy ways to speed things up without penalty. For example, for short moves (more than 8 bytes and less than 32), instead of this:

    rep movsb

    ...you could double the speed of the move with something like

    shr cx,1
    rep movsw
    adc cx,0
    rep movsb

    This works because if cx=0 then rep movsb does nothing. It is mostly without penalty (no JMPs).
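In C terms, the quoted trick amounts to the following branch-free split of an n-byte copy (a sketch; like REP MOVSW, the word copy assumes source and destination don't overlap within a word):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes as n/2 words plus (n & 1) trailing bytes, with no branch
   on the odd-length case. */
static void copy_words_then_byte(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t words = n >> 1;              /* SHR CX,1 (odd bit -> carry) */
    size_t odd   = n & 1;               /* the carry bit               */
    while (words--) {                   /* REP MOVSW                   */
        dst[0] = src[0];
        dst[1] = src[1];
        dst += 2; src += 2;
    }
    while (odd--) *dst++ = *src++;      /* ADC CX,0 + REP MOVSB        */
}
```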

  10. #40
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by encode
    Maybe on 16-bit platforms - not on Pentium class ones
    I never claimed Pentium, although a few simple changes and it would be

    Quote Originally Posted by encode
    By the way, what is faster POP or MOV-based stuff?
    It depends on the architecture. If youre targeting Pentium, I think its a wash (ie. equal). If its Pentium II or higher, I think MOV. But there are many Pentium optimization guides out there, check those.

  11. #41
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Pops and pushes should be faster than moves to and from memory locations, because pops and pushes don't require offsets, so they are faster to decode. Of course, I'm talking about the situation where the stack is aligned on a double-word boundary - never push a 16- or 8-bit operand onto the stack, or there will be a speed penalty on newer processors.

    xchg and fxch on registers only are very fast. fxch on the Pentium Pro or higher takes about 0 clocks, as it's handled by the decoder and the register-renaming mechanism; nothing is done by the processor core for those instructions.

    http://www.agner.org/optimize/

  12. #42
    Member
    Join Date
    Dec 2006
    Posts
    611
    Thanks
    0
    Thanked 1 Time in 1 Post
    Ilia, which of these developments is targeted at LZPM, and which at something new?

  13. #43
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Quote Originally Posted by Black_Fox
    Ilia, which of these developments is targeted at LZPM, and which at something new?
    Generally speaking, all this stuff is for LZPM. I'm just playing with LZ77, searching for new possibilities. However, this new LZSS is for my EXE packer or for the new fast coder.
    Currently, the LZ77-based LZPM is better on some files and worse on others, and at the same time it is slower at compression, even with lazy matching. Until my LZ77 completely beats my ROLZ, I will not release a new version of LZPM.
    Meanwhile, I will add optimal parsing to my LZSS.

    Quote Originally Posted by Trixter
    shr cx,1
    rep movsw
    adc cx,0
    rep movsb
    I will do as ICL does - MOVZX/MOV/ADD opcodes only:
    Code:
    @@1:
    MOVZX EAX, BYTE PTR [ESI]
    MOV BYTE PTR [EDI], AL
    ADD ESI,1
    ADD EDI,1
    ADD ECX,-1
    JNE @@1

  14. #44
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Just check http://www.agner.org/optimize/ when optimizing. I didn't mention my example so that you could use it; I mentioned it so that you could think about different ways to optimize. In the specific bit you quoted, I illustrate how you can handle something without a JMP. JMPs (and all related instructions, like JNC, JCXZ, etc.) are costly because on a taken jump they empty the prefetch queue and stall the pipelines. Removing one JMP from an inner loop can sometimes speed it up by 50%, depending on the loop!

    Anyway, just giving out ideas, I know most people aren't targeting 16-bit today

  15. #45
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Quote Originally Posted by Trixter
    I mentioned it so that you could think about different ways to optimize.
    Actually, I do.

    Thank you for the advice anyway!

  16. #46
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I will do as ICL does - MOVZX/MOV/ADD opcodes only:

    Well, actually IntelC is only good when automatic vectorization
    and inlining/unrolling do improve timings,
    while its average code generation is far from ideal anyway.

    > @@1:
    > MOVZX EAX, BYTE PTR [ESI]
    > MOV BYTE PTR [EDI], AL
    > ADD ESI,1
    > ADD EDI,1
    > ADD ECX,-1
    > JNE @@1

    I'd try something like this instead:

    mov ebx,ecx
    add edi,ebx
    add esi,ebx
    neg ebx
    xor eax,eax
    @@1:
    mov al,byte ptr [esi+ebx]
    mov byte ptr [edi+ebx],al
    add ebx,1
    jnz @@1

    Don't think that either of these loops would be faster than
    rep movsb actually, but here's a common loop optimization
    technique which IntelC doesn't use somehow.

  17. #47
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Like I said, many guides say that LODSB/MOVSB are slow - MOVZX/MOV are preferred. Maybe stuff like [ESI+EBX]/[EDI+EBX] is slower than an actual increment in the loop. I think the guys from Intel knew what they were doing.

  18. #48
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    Quote Originally Posted by encode
    I think the guys from Intel knew what they were doing.
    You can ask them in Russian, as they sit in NN.

  19. #49
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Like I said many guides say that LODSB/MOVSB are slow - MOVZX/MOV are
    > preferred.

    Well, my experience says that instructions with long encodings are not
    healthy (e.g. MOVZX) - at times they can be slower despite their
    supposedly faster microcode.

    > Maybe stuff like [ESI+EBX]/[EDI+EBX] is slower than actual
    > increment in loop. I think, guys from Intel knew what whey do.

    Well, I've personally reported around 20 IntelC bugs (the last one
    was like a month ago), so I'm not sure about that.
    It's a huge project... partly outsourced to Russia, btw... so one can't
    expect anything perfect.

    Meanwhile, I made another test:

    http://shelwien.googlepages.com/lzss_sh2.rar

                                        (1)     (2)
    rep movs: codesize =  87, time = 7.954s  3.984s
    incr all: codesize = 101, time = 4.422s  4.468s
    incr ebx: codesize = 106, time = 4.750s  4.359s
    original: codesize =  90, time = 7.828s  4.312s
    (1) P4 3.1 GHz
    (2) CeleronM 1.7 GHz

    And here we see that on the old P4, rep movsb is somehow
    2x slower than the unrolled loop, and the [esi+ebx] loop has worse
    performance, probably due to its long initialization and label
    alignment. On the recent CeleronM, meanwhile, the timings are as
    they're supposed to be.

  20. #50
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    I suppose that the latest processors should have some REP MOVSB optimization.

    Anyway, I think we should not rush into things. Better to spend the time on a more clever design of a bytewise LZ77.

    For example I have a new idea:

    0XXXXXXX - Literal run (1..128 literals)

    10XXXXXX YYYYYYYY - Match with offset -1...-256

    11XXXXXX YYYYYYYY YYYYYYYY - Match with offset -1...-65536

    In this case we may compress matches starting from three bytes long.

    Also, we may tie the match length to the match offset size. For example, if len==3 then we read an 8-bit offset, and a 16-bit offset otherwise.

    Any ideas?
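A decoder for the three-token scheme above could look roughly like this C sketch. The length encodings (X+1 literals for runs, X+3 bytes for matches, so that three-byte matches are representable) are my assumptions; the post only fixes the offset ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the proposed three-token scheme:
     0XXXXXXX                     -> X+1 literals (1..128)
     10XXXXXX YYYYYYYY            -> match, len X+3, offset -(Y+1) (to -256)
     11XXXXXX YYYYYYYY YYYYYYYY   -> match, len X+3, offset -(Y+1) (to -65536) */
static void decode_v2(const uint8_t *src, uint8_t *dst, size_t dst_len)
{
    uint8_t *end = dst + dst_len;
    while (dst < end) {
        uint8_t c = *src++;
        if ((c & 0x80) == 0) {            /* 0XXXXXXX: literal run */
            size_t n = (size_t)c + 1;
            while (n--) *dst++ = *src++;
            continue;
        }
        size_t len = (size_t)(c & 0x3F) + 3;
        size_t off;
        if ((c & 0x40) == 0) {            /* 10XXXXXX: 8-bit offset */
            off = (size_t)*src++ + 1;
        } else {                          /* 11XXXXXX: 16-bit offset */
            off = (size_t)(src[0] | (src[1] << 8)) + 1;
            src += 2;
        }
        const uint8_t *from = dst - off;
        while (len--) *dst++ = *from++;   /* byte-wise, overlap-safe */
    }
}
```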

  21. #51
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Better make something like what _I_ think should be called LZP

    I mean, create 256 circular buffers, each holding 256 addresses
    of symbol occurrences, and encode the offset in an order-1 context.
    The decoder for that won't be much more complex than LZSS, but compression
    would be significantly better.
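A sketch of that structure, on my reading of the post: the buffers are indexed by an order-1 context byte, and a match would then be coded as a slot index into that context's buffer plus a length, instead of a raw offset. All names here are illustrative.

```c
#include <stdint.h>

#define SLOTS 256  /* 256 remembered positions per context */

typedef struct {
    uint32_t pos[256][SLOTS];  /* pos[c] = recent positions following byte c */
    uint8_t  head[256];        /* next write slot per context (wraps at 256) */
} Rolz;

/* Record position i under its order-1 context (the preceding byte).
   The uint8_t head gives the circular wrap-around for free. */
static void rolz_update(Rolz *r, const uint8_t *buf, uint32_t i)
{
    if (i == 0) return;
    uint8_t ctx = buf[i - 1];
    r->pos[ctx][r->head[ctx]++] = i;
}
```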

    Also, merging the offset and length is a common and reasonable idea too,
    but it might have an unexpectedly complex implementation.
    And anyway, you need to collect some statistics on the most frequent lengths
    for a given offset, _and_ on what happens to compression quality when you
    disable some infrequent offset-length pairs.

  22. #52
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    Quote Originally Posted by Shelwien
    Better make something like what _I_ think should be called LZP
    LZP was the first thing I tried. Surprisingly, standalone LZP provides poor compression even with a large offset table and with different output codings - I tried almost all the bytewise variants.

    You're talking about bytewise ROLZ. Well, I haven't tried it yet, but will. The decoder is more complex in this case, though, and in addition we need an extra data structure... Considering my latest experiments with LZPM, I think a ROLZ-like scheme should give no improvement, especially here.

    LZ77 rules!

  23. #53
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > You're talking about bytewise ROLZ. Well, I haven't tried it yet, but will.
    > The decoder is more complex in this case, though, and in addition we need an
    > extra data structure... Considering my latest experiments with LZPM, I think
    > a ROLZ-like scheme should give no improvement, especially here.

    Dunno how it can be worse, if you'd encode part of the offset as the first
    symbol of the match, for example.

    > LZ77 rules!

    Not really. For executables disassembler + simple entropy coder would be
    3x better than any LZ.

  24. #54
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    OK, stop talking - the actual results are here. I implemented bytewise ROLZ and compared it to my current LZSS:

    MPTRACK.EXE
    ROLZ: 742,281 bytes
    LZ77: 710,712 bytes

    Reaktor.exe
    ROLZ: 3,433,648 bytes
    LZ77: 3,460,142 bytes

    Photoshop.exe
    ROLZ: 11,066,785 bytes
    LZ77: 10,856,019 bytes

    TraktorDJStudio3.exe
    ROLZ: 9,014,118 bytes
    LZ77: 8,998,152 bytes

    Not worth it...

    Quote Originally Posted by Shelwien
    Not really. For executables disassembler + simple entropy coder would be
    3x better than any LZ.
    Can you provide the ASM source for the decoder?

    And what if the EXE contains a large picture or other resource?

  25. #55
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Not worth it...

    Ok, I'll try writing my own coder.
    But first give me your results for some files which I have.
    For example, Bergmans' acrord32.exe and mso97.dll.

    > Can you provide the ASM source for decoder?

    Well, I don't mind translating it from C++ to asm, if you make it.
    And obviously there's no sense in giving you my own model,
    as I can use it myself.

    > And what if EXE contains a large picture or other resource?

    Of course, that'd take some more custom models.
    But it's worth it anyway. Afair my assembly PPM was ~2.5k,
    so even if the decoder with all submodels grew
    to like 30k, it'd still be several times better than currently
    available exe compressors.

  26. #56
    Member
    Join Date
    May 2008
    Location
    USA
    Posts
    42
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Quote Originally Posted by Shelwien
    And here we see that on the old P4 rep movsb is somehow
    2x slower than unrolled loop
    That's because you're not supposed to use movsb. Use movsd on doubleword-aligned data and you should see much better times.

  27. #57
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    The performance of current LZSS on EXE/DLL from SFC:

    acrord32.exe: 2,271,869 bytes
    mso97.dll: 2,771,528 bytes

    It's about the simplest scheme possible. I will add a slightly more complex LZ output encoding, so that LZSS can compress smaller matches.


  28. #58
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,954
    Thanks
    359
    Thanked 332 Times in 131 Posts
    The performance with extended scheme:
    acrord32.exe: 2,168,356 bytes
    mso97.dll: 2,661,815 bytes


  29. #59
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Why use fixed-size offsets and lengths? Gamma encoding is very efficient and easy to decode. Look at aPLib: http://ibsensoftware.com/products_aPLib.html . It has very fast decoding, although it could be faster still: the gamma encoding in this program is flawed, since the author interleaves flags with the bits while gamma encoding. With normal gamma encoding (http://en.wikipedia.org/wiki/Elias_gamma_coding), a single BSR instruction can be used to determine the length of a number.
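As a sketch of the BSR-based idea above: for n >= 1, Elias gamma emits (bit-length - 1) zero bits followed by n's binary digits, and the bit length comes from one BSR. Here GCC/Clang's `__builtin_clz` stands in for BSR, and the MSB-first bit writer is purely illustrative; the output buffer must start zeroed.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t *buf; size_t bitpos; } BitWriter;

/* Pack one bit MSB-first; buf must be zero-initialized. */
static void put_bit(BitWriter *w, int b)
{
    if (b) w->buf[w->bitpos >> 3] |= 0x80u >> (w->bitpos & 7);
    w->bitpos++;
}

/* Elias gamma: e.g. 1 -> "1", 4 -> "00100", 5 -> "00101". */
static void gamma_encode(BitWriter *w, uint32_t n)   /* requires n >= 1 */
{
    int len = 31 - __builtin_clz(n);     /* BSR: index of the top set bit */
    for (int i = 0; i < len; i++) put_bit(w, 0);
    for (int i = len; i >= 0; i--) put_bit(w, (n >> i) & 1);
}
```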

  30. #60
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > The performance of current LZSS on EXE/DLL from SFC:
    >
    > acrord32.exe: 2,271,869 bytes
    > mso97.dll: 2,771,528 bytes

    Well, the results of my scheme with order1 offset encoding are:
    2272598 // acrord32.exe
    2739962 // mso97.dll
    Guess that's ok then... and here I thought it was bad and didn't post.
    So, this is the coder I made yesterday:
    http://shelwien.googlepages.com/lz_sh1.rar
    I think maybe I should try optimal parsing for it...
    Imho its effect would be much better than usual, due to the non-linearity of
    matches here. Also there might be some gain from switching to order2...


