
Thread: Optimized paq7asm.asm code not compatible with paq8px?

  1. #1
    Member
    Join Date
    May 2009
    Location
    Europe
    Posts
    67

    Optimized paq7asm.asm code not compatible with some paq8 variants?

    I'm trying to make optimized builds of some archivers, and I ran into a small problem with paq8px while trying to save some cycles.
    I found an optimized SSE2 version (by Dark Shikari) of dot_product() and train() in paq8p.
    I built paq8p and it is indeed (slightly) faster than the regular paq7asmsse.asm.
    But paq8px (tested v30 & v42) crashes. I didn't test all variants, but paq8l, Matt's last paq8, does not crash, so this optimized asm code does not seem to be tied specifically to paq8p. Paq8l outputs identical files whether it is compiled with the regular or the optimized code. Same for paq8p. So it looks like Dark Shikari's code can perfectly replace the "not optimized" code.
    Then why doesn't it work with paq8px? Does paq8px contain some changes that make it incompatible with this asm code? I don't think I did anything wrong during the assembly/compilation stages.
    If I keep the optimized dot_product() and replace the optimized train() with the regular train(), it does not crash. So the problem seems to be train().

    regular sse2 code:
    Code:
    ; Train n neural network weights w[n] on inputs t[n] and err.
    ; w[i] += t[i]*err*2+1 >> 17 bounded to +- 32K.
    ; n is rounded up to a multiple of 8.
    
    ; Train for SSE2
    ; Use this code to get some performance...
    
    global train ; (short* t, short* w, int n, int err)
    align 16
    train:
      mov eax, [esp+4]      ; t
      mov edx, [esp+8]      ; w
      mov ecx, [esp+12]     ; n
      add ecx, 7            ; n/8 rounding up
      and ecx, -8
      jz .done
      sub eax, 16
      sub edx, 16
      movd xmm0, [esp+16]
      pshuflw xmm0,xmm0,0
      punpcklqdq xmm0,xmm0
    .loop:                  ; each iteration adjusts 8 weights
      movdqa xmm3, [eax+ecx*2] ; t[i]
      movdqa xmm2, [edx+ecx*2] ; w[i]
      paddsw xmm3, xmm3     ; t[i]*2
      pmulhw xmm3, xmm0     ; t[i]*err*2 >> 16
      paddsw xmm3, [_mask]  ; (t[i]*err*2 >> 16)+1
      psraw xmm3, 1         ; (t[i]*err*2 >> 16)+1 >> 1
      paddsw xmm2, xmm3     ; w[i] + xmm3
      movdqa [edx+ecx*2], xmm2
      sub ecx, 8
      ja .loop
    .done:
      ret
    
    align 16
    _mask dd 10001h,10001h,10001h,10001h ; 8 words of 1, used as the memory operand of paddsw above
    optimized sse2 code:
    Code:
    ; Train n neural network weights w[n] on inputs t[n] and err.
    ; w[i] += t[i]*err*2+1 >> 17 bounded to +- 32K.
    ; n is rounded up to a multiple of 16.
    
    ; Train for SSE2
    ; Use this code to get some performance...
    
    global train ; (short* t, short* w, int n, int err)
    align 16
    train:
      mov         eax, [esp+4]          ; t
      mov         edx, [esp+8]          ; w
      mov         ecx, [esp+12]         ; n
      add         ecx, 15               ; n/16 rounding up
      and         ecx, -16
      jz .done
      sub         eax, 32
      sub         edx, 32
      movd       xmm0, [esp+16]
      pcmpeqb    xmm6, xmm6
      pshuflw    xmm0, xmm0, 0
      psrlw      xmm6, 15               ; pw_1
      punpcklqdq xmm0, xmm0
    .loop:                              ; each iteration adjusts 16 weights
      movdqa     xmm3, [eax+ecx*2 +0]   ; t[i]
      movdqa     xmm5, [eax+ecx*2+16]
      paddsw     xmm3, xmm3             ; t[i]*2
      paddsw     xmm5, xmm5
      pmulhw     xmm3, xmm0             ; t[i]*err*2 >> 16
      pmulhw     xmm5, xmm0
      paddsw     xmm3, xmm6             ; (t[i]*err*2 >> 16)+1
      paddsw     xmm5, xmm6
      psraw      xmm3, 1                ; (t[i]*err*2 >> 16)+1 >> 1
      psraw      xmm5, 1
      paddsw     xmm3, [edx+ecx*2+ 0]   ; w[i] + xmm3
      paddsw     xmm5, [edx+ecx*2+16]
      movdqa [edx+ecx*2+ 0], xmm3
      movdqa [edx+ecx*2+16], xmm5
      sub     ecx, 16
      ja .loop
    .done:
      ret
    I don't understand x86 assembler but, according to the comments, Dark Shikari changed the rounding: n is rounded up to a multiple of 16 instead of 8. Paq8l and paq8p don't mind, but maybe paq8px does not like this.
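
For readers who, like me, don't read SSE2 easily, here is a rough scalar C model of what the two train() listings appear to compute. It is a sketch only: `round_up`, `sat16`, `train_scalar` and the `step` parameter are hypothetical names made up for illustration, not code from any paq8 source. Passing step = 8 models the regular code and step = 16 the optimized one. The one visible difference is how many elements get touched. Whether paq8px actually calls train() with an n that is a multiple of 8 but not of 16 is not established here.

```c
#include <stdint.h>

/* Round n up to a multiple of step (a power of two), like
   "add ecx, step-1" followed by "and ecx, -step" in the listings. */
static int round_up(int n, int step) {
    return (n + step - 1) & -step;
}

/* Clamp to the int16 range, mimicking the saturation of paddsw. */
static int16_t sat16(int v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* Scalar model of train(). step = 8 models the regular code,
   step = 16 the optimized one. Assumes >> on a negative int is an
   arithmetic shift, as on mainstream x86 compilers. */
static void train_scalar(const int16_t *t, int16_t *w, int n, int err, int step) {
    int16_t e = (int16_t)err;            /* pshuflw keeps only the low 16 bits */
    int count = round_up(n, step);       /* elements actually read and written */
    for (int i = 0; i < count; ++i) {
        int t2 = sat16(t[i] * 2);        /* paddsw xmm3, xmm3 */
        int hi = (t2 * e) >> 16;         /* pmulhw xmm3, xmm0 */
        int delta = sat16(hi + 1) >> 1;  /* add 1 with saturation, then psraw 1 */
        w[i] = sat16(w[i] + delta);      /* saturating add into w[i] */
    }
}
```

With n = 8, calling `train_scalar(t, w, 8, err, 8)` updates w[0..7] only, while `train_scalar(t, w, 8, err, 16)` also reads t[8..15] and writes w[8..15]. If the caller sized its arrays for multiples of 8, the 16-wide version would run past the end.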

    EDIT: Not all input files produce the crash, and not all levels. At least it crashes on a file containing the first megabyte of ENWIK8 at levels 4, 5, 6, 7 and 8. Levels 0, 1, 2 and 3 are OK.
    Last edited by M4ST3R; 1st June 2009 at 00:46.

  2. #2
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Hi,

    Quote Originally Posted by M4ST3R View Post
    I'm trying to make optimized builds of some archivers, and I ran into a small problem with paq8px while trying to save some cycles.
    I found an optimized SSE2 version (by Dark Shikari) of dot_product() and train() in paq8p.
    I built paq8p ...
    Okay, so you built paq8p yourself with Dark Shikari's tweaked asm code. What compiler+assembler (name, version, OS, options used, cpu targeted)?

    Quote Originally Posted by M4ST3R
    and it is indeed (slightly) faster than the regular paq7asmsse.asm.
    Uh, no. At least not in my tests. What CPU do you run (name, model, GHz, features, etc.)? In Matt's (and my own) tests, any SSE2 version (including the original or DS's) is barely 5% faster than the MMX versions.

    Now, I've never really heard any tests from anyone else about this concerning newer machines with better SSE bandwidth (Phenom, Core 2), so I don't know for sure. But my AMD64x2 seems to not show any difference between either SSE2 version.

    In fact, Matt originally just built always assuming MMX with no CPUID check just because "it's ten years worth of cpus", hence he wasn't too worried about compatibility. I must be the only one, then! (Hence my really weak mostly-DOS port of paq8o8 called paq8o8z.)

    Newer Intel CPUs (SSE 4.1) even support a "dot product" vector instruction, so who knows, maybe that would speed it up even more. But again, no one has bothered testing (and even Matt's new laptop lacks that).

    Quote Originally Posted by M4ST3R
    But paq8px (tested v30 & v42) crashes. I didn't test all variants, but paq8l, Matt's last paq8, does not crash, so this optimized asm code does not seem to be tied specifically to paq8p. Paq8l outputs identical files whether it is compiled with the regular or the optimized code. Same for paq8p. So it looks like Dark Shikari's code can perfectly replace the "not optimized" code.
    You saw this, right?

    Quote Originally Posted by Matt's page
    "paq8p by Andreas Morphis, released Aug. 25, 2008, has greatly improved .wav compression and slightly improved .bmp compression. Discussion. Note: the SSE2 compiled version is not always archive compatible with the MMX version (noted by Rugxulo, Mar. 9, 2009).

    here. paq8p1 (Oct. 8, 200 by Kaidorav fixes some bugs in paq8p.

    paq8p update: Jan 8, 2009. Added paq8o8sse2.asm (optimized assembler by Dark Shikari) and corresponding Windows Pentium 4+ compiled executable, paq8p_sse2.exe. It is archive compatible with paq8p.exe and might or might not be faster. "
    Quote Originally Posted by M4ST3R
    Then why doesn't it work with paq8px? Does paq8px contain some changes that make it incompatible with this asm code? I don't think I did anything wrong during the assembly/compilation stages.
    If I keep the optimized dot_product() and replace the optimized train() with the regular train(), it does not crash. So the problem seems to be train().

    (snip)

    I don't understand x86 assembler but, according to the comments, Dark Shikari changed the rounding: n is rounded up to a multiple of 16 instead of 8. Paq8l and paq8p don't mind, but maybe paq8px does not like this.

    EDIT: Not all input files produce the crash, and not all levels. At least it crashes on a file containing the first megabyte of ENWIK8 at levels 4, 5, 6, 7 and 8. Levels 0, 1, 2 and 3 are OK.
    I tested paq8o8z (with both original and modified train) with -1 -3 -4 -5 on the first 1.6 MB of enwik8 and saw no crashes or inconsistencies. What exactly does the crash do? Stop the console? GPF? Report an error msgbox? What OS? Sure, it could be a bug in paq8px, but I don't know. (I do know some generic x86 assembly, but my SIMD knowledge is very very lacking, so I can't tell from looking at it without further intense study, which is highly unlikely.)

  3. #3
    Member
    Join Date
    May 2009
    Location
    Europe
    Posts
    67
    Quote Originally Posted by Rugxulo View Post
    Okay, so you built paq8p yourself with Dark Shikari's tweaked asm code. What compiler+assembler (name, version, OS, options used, cpu targeted)?
    Visual Studio 2008 SP1 and Intel 11.0.074, on XP SP3 32-bit. I tried several options (SSE2, SSE4.1), assembled with both nasm and yasm, with optimizations on or off. Paq8px_v30 always crashes while compressing a file containing the 1st megabyte of ENWIK8 at the default level.
    Quote Originally Posted by Rugxulo View Post
    Uh, no. At least not in my tests. What CPU do you run (name, model, GHz, features, etc.)? In Matt's (and my own) tests, any SSE2 version (including the original or DS's) is barely 5% faster than the MMX versions.

    Now, I've never really heard any tests from anyone else about this concerning newer machines with better SSE bandwidth (Phenom, Core 2), so I don't know for sure. But my AMD64x2 seems to not show any difference between either SSE2 version.

    In fact, Matt originally just built always assuming MMX with no CPUID check just because "it's ten years worth of cpus", hence he wasn't too worried about compatibility. I must be the only one, then! (Hence my really weak mostly-DOS port of paq8o8 called paq8o8z.)

    Newer Intel CPUs (SSE 4.1) even support a "dot product" vector instruction, so who knows, maybe that would speed it up even more. But again, no one has bothered testing (and even Matt's new laptop lacks that).
    Intel E8400 E0 (SSE4.1), 3.0 GHz
    Compression of the 1st megabyte of ENWIK8 with paq8px_v30. No background task, priority above normal
    icl /O2 /QxSSE2 /DWINDOWS paq8px.cpp paq7asmmmx.obj ("official" mmx code) 82s08
    icl /O2 /QxSSE2 /DWINDOWS paq8px.cpp paq7asmsse2.obj ("official" sse2 code) 75s20
    icl /O2 /QxSSE2 /DWINDOWS paq8px.cpp paq7asmsse2opt.obj (Shikari) crashes at the start

    Just made some test with paq8p. Same input file:
    icl /O2 /QxSSE2 /DWINDOWS paq8p.cpp paq7asmmmx.obj 64s09
    icl /O2 /QxSSE2 /DWINDOWS paq8p.cpp paq7asmsse2.obj 58s69
    icl /O2 /QxSSE2 /DWINDOWS paq8p.cpp paq7asmsse2opt.obj 58s08

    So on my computer (at least on compression, on this file, at this level) Dark Shikari's code is slightly faster. And the official sse2 code is faster than mmx.
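
To put numbers on "slightly faster", here is a small sketch that turns the timings above into percentages. It assumes the timer format "82s08" means 82.08 seconds. The helper names `percent_saved` and `print_comparison` are made up for this post.

```c
#include <stdio.h>

/* Percentage of run time saved when going from `before` to `after` seconds. */
static double percent_saved(double before, double after) {
    return 100.0 * (before - after) / before;
}

/* Print one comparison line for a pair of timings. */
static void print_comparison(const char *label, double before, double after) {
    printf("%-32s %.2f s -> %.2f s  (%.2f%% less time)\n",
           label, before, after, percent_saved(before, after));
}

/* The three comparisons that can be made from the timings in this post. */
static void report_timings(void) {
    print_comparison("paq8px: MMX -> official SSE2", 82.08, 75.20);
    print_comparison("paq8p:  MMX -> official SSE2", 64.09, 58.69);
    print_comparison("paq8p:  official SSE2 -> opt.", 58.69, 58.08);
}
```

Calling `report_timings()` gives roughly 8.4% for MMX to SSE2 in both programs, and about 1% for the official SSE2 code to Dark Shikari's version in paq8p. These are single runs on one file, so the 1% figure is within what run-to-run noise could explain.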

    Quote Originally Posted by Rugxulo View Post
    I tested paq8o8z (with both original and modified train) with -1 -3 -4 -5 on the first 1.6 MB of enwik8 and saw no crashes or inconsistencies. What exactly does the crash do? Stop the console? GPF? Report an error msgbox? What OS? Sure, it could be a bug in paq8px, but I don't know. (I do know some generic x86 assembly, but my SIMD knowledge is very very lacking, so I can't tell from looking at it without further intense study, which is highly unlikely.)
    The Visual Studio just-in-time debugger pops up, reports a Win32 exception, and asks whether to use the selected debugger. The output file contains only the header.

    EDIT: Just narrowed the bug down:
    - file containing the first 999,999 bytes of enwik8: OK
    - file containing the first 1,000,000 bytes of enwik8: crash
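
In case anyone wants to reproduce the two test files, here is a minimal sketch of a helper that copies the first N bytes of a file. The name `copy_prefix` and the file names in the usage note are hypothetical, and any tool that truncates a file would do just as well.

```c
#include <stdio.h>

/* Copy the first `count` bytes of src_path to dst_path.
   Returns the number of bytes written (less than `count` if the source
   is shorter), or -1 if a file cannot be opened or a write fails. */
static long copy_prefix(const char *src_path, const char *dst_path, long count) {
    FILE *in = fopen(src_path, "rb");
    if (!in) return -1;
    FILE *out = fopen(dst_path, "wb");
    if (!out) { fclose(in); return -1; }

    char buf[65536];
    long written = 0;
    while (written < count) {
        long remaining = count - written;
        size_t want = remaining < (long)sizeof buf ? (size_t)remaining : sizeof buf;
        size_t got = fread(buf, 1, want, in);
        if (got == 0) break;                       /* end of input */
        if (fwrite(buf, 1, got, out) != got) {     /* write error */
            written = -1;
            break;
        }
        written += (long)got;
    }
    fclose(in);
    fclose(out);
    return written;
}
```

For example, `copy_prefix("enwik8", "enwik8_999999", 999999)` and `copy_prefix("enwik8", "enwik8_1000000", 1000000)` would produce the two inputs described above.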
    Last edited by M4ST3R; 1st June 2009 at 16:02.

  4. #4
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Quote Originally Posted by M4ST3R View Post
    Intel E8400 E0 (SSE4.1), 3.0 GHz
    Compression of the 1st megabyte of ENWIK8 with paq8px_v30. No background task, priority above normal
    icl /O2 /QxSSE2 /DWINDOWS paq8px.cpp paq7asmmmx.obj ("official" mmx code) 82s08
    icl /O2 /QxSSE2 /DWINDOWS paq8px.cpp paq7asmsse2.obj ("official" sse2 code) 75s20
    icl /O2 /QxSSE2 /DWINDOWS paq8px.cpp paq7asmsse2opt.obj (Shikari) crashes at the start
    I'm both amazed and unimpressed that there's any difference (10%), and that it's not more! (BTW, what option did you use for compressing the above, -5?)

    I weakly suspect the problem will disappear if you don't use /QxSSE2 at all. It's probably being overly zealous in its optimizations.

  5. #5
    Member
    Join Date
    May 2009
    Location
    Europe
    Posts
    67
    I used the default level, I think it's -5.

    Switching from /QxSSE2 to /arch:IA32 doesn't help, it still crashes. Same for /Od.

  6. #6
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Quote Originally Posted by M4ST3R View Post
    I used the default level, I think it's -5.

    Switching from /QxSSE2 to /arch:IA32 doesn't help, it still crashes. Same for /Od.
    Tested below on my AMD64x2 TK-53 1.7 GHz (x2) laptop (obviously slower than yours, heh). And actually I had to hex-edit the two "GenuineIntel" strings to "AuthenticAMD" before it would even run. (Yes, I've also got a P4 in the other room, but I'm too lazy to boot it up just for this.)

    It finishes compressing successfully although at the end it does say "paq8px has encountered an error and has to close" (which Vista unfortunately says not too infrequently).

    [ Vista ] - Tue 06/02/2009 >paq8px doydoy ENWIK.txt
    Creating archive doydoy.paq8px with 1 file(s)...

    1/1 Filename: ENWIK.txt (1048576 bytes)
    Block segmentation:
    0 | default | 1048576 bytes [from 0 to 1048575]
    Compressed from 1048576 to 217628 bytes.

    Total 1048576 bytes compressed to 217660 bytes.
    Time 229.04 sec, used 243182921 bytes of memory

    7fa9201394d231a9466109f02cf69ffd *c:\Armslurp\temp\moo\doydoy.paq8px
    Try another compiler (that's what I'm gonna do now), e.g. GCC, and see what happens.

    EDIT: No crash with DJGPP, but it doesn't match the first one's output.
    EDIT#2: Wait, I think I'm using a newer paq8px than you, oops.
    EDIT#3: What version did you use? I tried again with v42 and same md5sum as my previous builds (but not yours).
    EDIT#4: Okay, v30 matches your md5sum, and still no crashes with DJGPP, so it must be your compiler (but I'll try again with MinGW and OpenWatcom just to be sure).
    EDIT#5: Odd, same problem at end with MinGW, must be something to do with Windows (e.g. MSVCRT, perhaps). Gotta try again with OpenWatcom, which doesn't use that.
    EDIT#6: Seems to crash immediately for OpenWatcom yet runs fine under debugger, weird.
    EDIT#7.2: What the hell happened to edit #7? Did attaching a file delete it? Argh! Anyways, seems I messed up the alignment, so OpenWatcom actually works fine (same md5sum) albeit a little slower.
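
On the alignment point: both train() listings use movdqa, which faults if the address is not a multiple of 16, so a misaligned buffer would be enough to explain an immediate crash. Here is a minimal C sketch of how one might check or fix up a pointer before handing it to the asm. `is_aligned16` and `align_up16` are hypothetical helper names, not anything from paq8px, and I'm only assuming this is the kind of alignment problem involved.

```c
#include <stdint.h>

/* Nonzero if p is a multiple of 16, as movdqa requires. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* First 16-byte-aligned address at or after p. The caller must have
   allocated at least 15 spare bytes for this to stay in bounds. */
static void *align_up16(void *p) {
    return (void *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
}
```

A debug build could assert `is_aligned16(t) && is_aligned16(w)` just before calling train() to rule this cause in or out.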

    [ Vista/DJGPP ] - Tue 06/02/2009 >p doydoy2e ENWIK.txt
    Creating archive doydoy2e.paq8px with 1 file(s)...

    1/1 Filename: ENWIK.txt (1048576 bytes)
    Block segmentation:
    0 | default | 1048576 bytes [from 0 to 1048575]
    Compressed from 1048576 to 217628 bytes.

    Total 1048576 bytes compressed to 217660 bytes.
    Time 242.58 sec, used 226405703 bytes of memory

    [ Vista/DJGPP ] - Tue 06/02/2009 >md5sum doydoy2e.paq8px
    7fa9201394d231a9466109f02cf69ffd *doydoy2e.paq8px
    Morals of the story:

    1). It's not worth it, just use the "original" SSE2 version since we're talking less than 1% improvement (if at all).
    2). Intel's compiler is apparently still stupidly against AMD chips (and not free except non-commercial Linux version).
    3). Windows is weird. :-P
    Last edited by Rugxulo; 2nd June 2009 at 21:36.

  7. #7
    Member
    Join Date
    Jun 2008
    Location
    USA
    Posts
    111
    Okay (whew!), I think I've edited that one post enough times, so I hereby publicly announce that I have no more to offer on this subject (I think).

    EDIT: Fixed smiley, lol.

  8. #8
    Member
    Join Date
    May 2009
    Location
    Europe
    Posts
    67
    Thanks for the help

    It's interesting my build compresses successfully on your computer, even if you have an error at the end. I wonder whether the OS or the CPU makes the difference.

    Your OpenWatcom build works with no problem on my computer. Interesting. I tried Visual Studio and the Intel compiler. Both crash. I'll try to compile with MinGW when I get a chance.

    Of course, I'll stick with the original SSE2 code at the moment.

    About the GenuineIntel thing, I think it's because I compiled with /QxSSE2. /arch:SSE2 should generate an executable that runs on AMD.

    Thanks again for the time you spent!
