
Thread: benchmark of mixed data types with SIMD instructions

  1. #1
    Member just a worm
    Join Date
    Aug 2013
    Location
    planet "earth"
    Posts
    96
    Thanks
    29
    Thanked 6 Times in 5 Posts

    benchmark of mixed data types with SIMD instructions.

    Hello,
    I would like to know how big the penalty is when data types are mixed (integers and floating-point values) in SIMD instructions on the x86 architecture. For this purpose I have prepared a test program that executes a few SIMD instructions and counts the number of machine cycles taken.

    I only have 2 computers, and from what I have read their processors have no penalty in this situation, while other processors do. For this reason I would appreciate it if you could run the attached test software on your processor and report the measurement results together with the name of your processor.

    Test results (output of the program):
    Code:
    what this program does
    ----------------------
    This program runs several pieces of code with SIMD instructions and measures the execution time in machine cycles.
    
    The code pieces are designed to check whether processors exist that have a penalty when executing SIMD instructions with mixed data types.
    
    The program runs several tests. Before executing a piece of test code, the program shows the code. During a test the program executes the code 4,096 times in a loop. The execution time of the loop footer is included in the measurement result. The measurement result shows the execution time of the whole loop (several tens of thousands of machine cycles) plus a few extra instructions that keep track of the measurement results (less than 100 machine cycles). The loop is executed many times and only the lowest execution time is shown, because it contains the least interference from other programs and the operating system.
    
    possible crash warning
    ----------------------
    The program is going to execute
            - the instruction "read time-stamp counter" ("rdtsc"), which has existed since 1993 with the introduction of the Pentium processor,
            - instructions from the instruction set "streaming single instruction multiple data extensions" ("SSE"), which has existed since 26th February 1999 with the introduction of the Pentium 3 processor, and
            - instructions from the instruction set "streaming single instruction multiple data extensions 2" ("SSE2"), which has existed since 20th November 2000 with the introduction of the Pentium 4 processor.
    Even though most processors probably meet the requirements for running this program, there is still a slight chance that your processor does not support one of the instructions, even if it is relatively new. This simple program does not check whether your processor meets the requirements and might therefore crash instead of outputting the measurement results.
    
    test -1 (empty loop body)
    -------------------------
            data movement: none
            arithmetic:    none
    
            code:
                    none
    
            execution time in machine cycles: 8218
    
    test 0
    ------
            data movement: integer/general purpose
            arithmetic:    integer
    
            code:
                    /*|*/
    
                    # move aligned double quadword (integer/general purpose; SSE2)
                    # movdqa xmm0, RAM[eax]
                    66 0F 6F 00
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm1
                    66 0F FA C1
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm2
                    66 0F FA C2
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm3
                    66 0F FA C3
    
                    /*|*/
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm4
                    66 0F FA C4
    
                    # move aligned double quadword (integer/general purpose; SSE2)
                    # movdqa RAM[eax], xmm0
                    66 0F 7F 00
    
            execution time in machine cycles: 53284
    
    test 1
    ------
            data movement: floating point
            arithmetic:    integer
    
            code:
                    /*|*/
    
                    # move aligned packed single-precision floating-point values (floating point; SSE)
                    # movaps xmm0, RAM[eax]
                    0F 28 00
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm1
                    66 0F FA C1
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm2
                    66 0F FA C2
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm3
                    66 0F FA C3
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm4
                    66/*|*/ 0F FA C4
    
                    # move aligned packed single-precision floating-point values (floating point; SSE)
                    # movaps RAM[eax], xmm0
                    0F 29 00
    
            execution time in machine cycles: 53284
    
    test 2
    ------
            data movement: floating point
            arithmetic:    floating point
    
            code:
                    /*|*/
    
                    # move aligned packed single-precision floating-point values (floating point; SSE)
                    # movaps xmm0, RAM[eax]
                    0F 28 00
    
                    # subtract packed single-precision floating-point values (floating point; SSE)
                    # subps xmm0, xmm1
                    0F 5C C1
    
                    # subtract packed single-precision floating-point values (floating point; SSE)
                    # subps xmm0, xmm2
                    0F 5C C2
    
                    # subtract packed single-precision floating-point values (floating point; SSE)
                    # subps xmm0, xmm3
                    0F 5C C3
    
                    # subtract packed single-precision floating-point values (floating point; SSE)
                    # subps xmm0, xmm4
                    0F 5C C4
    
                    # move aligned packed single-precision floating-point values (floating point; SSE)
                    # movaps RAM[eax], xmm0
                    0F/*|*/ 29 00
    
            execution time in machine cycles: 5435423
    
    test 3
    ------
            data movement: floating point and integer
            arithmetic:    floating point and integer
    
            code:
                    /*|*/
    
                    # move aligned packed single-precision floating-point values (floating point; SSE)
                    # movaps xmm0, RAM[eax]
                    0F 28 00
    
                    # subtract packed single-precision floating-point values (floating point; SSE)
                    # subps xmm0, xmm1
                    0F 5C C1
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm2
                    66 0F FA C2
    
                    # subtract packed single-precision floating-point values (floating point; SSE)
                    # subps xmm0, xmm3
                    0F 5C C3
    
                    # subtract packed doubleword integers (integer; SSE2)
                    # psubd xmm0, xmm4
                    66 0F FA/*|*/ C4
    
                    # move aligned double quadword (integer/general purpose; SSE2)
                    # movdqa RAM[eax], xmm0
                    66 0F 7F 00
    
            execution time in machine cycles: 2968889
    I have no idea why the execution times of tests 2 and 3 are so high.

    I would appreciate your help in testing. Thank you.
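
    Since the source of the test program is not posted in the thread, the measurement idea described in the program output above (time a 4,096-iteration loop many times and keep the lowest reading) can be sketched roughly as follows. This is only an illustration in Python, not the actual program: `time.perf_counter_ns` stands in for `rdtsc`, so the readings are nanoseconds rather than machine cycles, and the names `min_elapsed`, `work` and `clock` are made up for this sketch.

```python
import time


def min_elapsed(work, iterations=4096, repeats=100, clock=time.perf_counter_ns):
    """Time `iterations` calls of `work`, `repeats` times; return the lowest reading.

    The minimum is kept because it is the run least disturbed by other
    programs and the operating system.
    """
    best = None
    for _ in range(repeats):
        start = clock()
        for _ in range(iterations):
            work()
        elapsed = clock() - start  # includes the loop overhead, as in the thread
        if best is None or elapsed < best:
            best = elapsed
    return best


if __name__ == "__main__":
    # Hypothetical stand-ins for "test -1" (empty loop body) and a real test body.
    empty = min_elapsed(lambda: None)
    body = min_elapsed(lambda: 1.5 - 0.25)
    print("empty loop:", empty, "ns; with body:", body, "ns")
```

    Comparing a test against the empty-loop reading, as the table below does with column -1, shows how much of a total is loop overhead rather than the tested instructions.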

    Test results so far:
    Code:
    with wrap-around:
        ┌─────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────┬─────────────┐
        │processor                                        │processing time for the test                                   │thanks to    │
        ├───────────────────────────────────┬─────────────┼───────────┬──────────┬──────────┬──────────┬──────────────────┤             │
        │manufacturer                       │model        │-1         │0         │1         │2         │3                 │             │
        │                                   │             │move: none │move: int │move: flo │move: flo │move: flo and int │             │
        │                                   │             │arith: none│arith: int│arith: int│arith: flo│arith: flo and int│             │
        ├───────────────────────────────────┼─────────────┼───────────┼──────────┼──────────┼──────────┼──────────────────┼─────────────┤
        │Intel Corporation                  │Atom D2550   │       8218│     53284│     53284│   5435423│           2968889│just a worm  │
        └───────────────────────────────────┴─────────────┴───────────┴──────────┴──────────┴──────────┴──────────────────┴─────────────┘
    
    without wrap-around:
        ┌─────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────┬─────────────┐
        │processor                                        │processing time for the test                                   │thanks to    │
        ├───────────────────────────────────┬─────────────┼───────────┬──────────┬──────────┬──────────┬──────────────────┤             │
        │manufacturer                       │model        │-1         │0         │1         │2         │3                 │             │
        │                                   │             │move: none │move: int │move: flo │move: flo │move: flo and int │             │
        │                                   │             │arith: none│arith: int│arith: int│arith: flo│arith: flo and int│             │
        ├───────────────────────────────────┼─────────────┼───────────┼──────────┼──────────┼──────────┼──────────────────┼─────────────┤
        │Advanced Micro Devices Incorporated│A8-5500      │       7541│     71059│     71060│    119603│            112137│Kennon Conrad│
        │Intel Corporation                  │Atom D2550   │       8218│     53291│     53291│    118832│             86058│just a worm  │
        │Intel Corporation                  │Atom E3815   │       4136│     36905│     36905│     68310│             49192│just a worm  │
        │Intel Corporation                  │Core i7-4790K│       3747│     37270│     37270│     67061│             63338│Kennon Conrad│
        └───────────────────────────────────┴─────────────┴───────────┴──────────┴──────────┴──────────┴──────────────────┴─────────────┘
    Attached Files
    Last edited by just a worm; 24th October 2015 at 00:38.

  2. #2
    Member just a worm
    .
    Last edited by just a worm; 5th June 2015 at 18:29.

  3. #3
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Providing only an executable isn't the best idea. If your code is really 100x slower with FP operations, the only possible reason I have found is:
    "Underflow and denormals: Operations that have denormal numbers as input or output or generate underflow take approximately 160 clock cycles unless the flush-to-zero mode and denormals-are-zero mode are both used."
    Agner said this only about Silvermont, not about the old Atom, but it may still be true. Try to fill the memory with 0.0f prior to running a test. Overall, read http://www.agner.org/optimize/microarchitecture.pdf
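
    To illustrate the denormal hint: a 32-bit pattern is a denormal single-precision value when its exponent field is zero and its mantissa is non-zero, so memory that holds small integers can read as denormals once `subps` interprets it as floats. The Python sketch below classifies bit patterns that way. The sample patterns are made up, since the thread does not show what the test buffer contained, and `classify_f32_bits` and `bits_to_float` are hypothetical helper names.

```python
import struct


def classify_f32_bits(bits):
    """Classify a 32-bit pattern as an IEEE 754 single-precision value."""
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    mantissa = bits & 0x7FFFFF       # 23-bit fraction field
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormal"
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "nan"
    return "normal"


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a float, the way subps would see it."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    # Made-up sample patterns: zero, small integers, the denormal/normal border, 1.0f.
    for bits in (0x00000000, 0x00000001, 0x00001000, 0x007FFFFF, 0x00800000, 0x3F800000):
        print(f"{bits:#010x} {classify_f32_bits(bits):8} {bits_to_float(bits)!r}")
```

    An all-zero buffer contains no denormals, which is why filling the memory with 0.0f is a cheap way to test this explanation.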
    Last edited by Bulat Ziganshin; 5th June 2015 at 23:30.

  4. #4
    Member just a worm
    Thank you for the hint. I had already guessed that the huge penalty occurs when a wrap-around happens, and I also thought that there might be some exceptions that I don't know about. But since I am executing SIMD instructions, the theory of exceptions didn't make much sense to me. Well, whatever caused these huge execution times, it was not the penalty I wanted to test.

    I have changed the code so that it now initializes the memory and the registers with zeros, and that removed the 100x+ slowdown.

    Code:
    test -1 (empty loop body)
    -------------------------
            data movement: none
            arithmetic:    none
    
            execution time in machine cycles: 8218
    
    test 0
    ------
            data movement: integer/general purpose
            arithmetic:    integer
    
            execution time in machine cycles: 53291
    
    test 1
    ------
            data movement: floating point
            arithmetic:    integer
    
            execution time in machine cycles: 53291
    
    test 2
    ------
            data movement: floating point
            arithmetic:    floating point
    
            execution time in machine cycles: 118832
    
    test 3
    ------
            data movement: floating point and integer
            arithmetic:    floating point and integer
    
            execution time in machine cycles: 86058
    Is anyone willing to help by providing some test results, please? I would like to find out whether the shorter floating-point data transfer instructions (3+ bytes per instruction) can replace some of the longer integer/general-purpose data transfer instructions (4+ bytes per instruction). I read in a different forum about another test which showed that the floating-point data transfer instructions actually speed up execution a tiny bit because the code is shorter. Unfortunately that test only compared floating-point with integer data transfer instructions, and in both cases they were combined with floating-point arithmetic instructions. I thought that these results might interest some other forum members, too.

    Even though those arithmetic instructions aren't too interesting for me, it's still surprising that executing 4 floating-point additions takes about 16 additional machine cycles per loop iteration compared to 4 integer additions ((118832 - 53291) / 4096 ≈ 16), at least on an Atom.
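
    The per-iteration figures behind that observation can be worked out from the totals above, assuming (as the program output states) that each total covers 4,096 loop iterations. A small Python sketch; the helper name `cycles_per_iteration` is made up.

```python
ITERATIONS = 4096  # loop count stated in the program output

# Totals in machine cycles from this post (Atom D2550, zero-initialized data).
totals = {"-1": 8218, "0": 53291, "1": 53291, "2": 118832, "3": 86058}


def cycles_per_iteration(total, baseline=0):
    """Average cycles per loop iteration, optionally relative to a baseline total."""
    return (total - baseline) / ITERATIONS


if __name__ == "__main__":
    empty = totals["-1"]
    for name in ("0", "1", "2", "3"):
        above = cycles_per_iteration(totals[name], empty)
        print(f"test {name}: {above:.2f} cycles/iteration above the empty loop")
    extra = cycles_per_iteration(totals["2"], totals["0"])
    print(f"test 2 vs. test 0: {extra:.2f} extra cycles/iteration")
```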

    Many thanks.
    Last edited by just a worm; 6th June 2015 at 22:39.

  5. #5
    Member Kennon Conrad
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    AMD A8-5500 @ 3.2 GHz:

    test -1 (empty loop body)
    -------------------------
    execution time in machine cycles: 7541

    test 0
    ------
    execution time in machine cycles: 71059

    test 1
    ------
    execution time in machine cycles: 71060

    test 2
    ------
    execution time in machine cycles: 119603

    test 3
    ------
    execution time in machine cycles: 112137


    i7-4790K @ 4.36 GHz:

    test -1 (empty loop body)
    -------------------------
    execution time in machine cycles: 3747

    test 0
    ------
    execution time in machine cycles: 37270

    test 1
    ------
    execution time in machine cycles: 37270

    test 2
    ------
    execution time in machine cycles: 67061

    test 3
    ------
    execution time in machine cycles: 63338

  6. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    just a worm (7th June 2015)

  7. #6
    Member just a worm
    Thank you. For easier comparison I added a table with the test results to my first post. I am assuming that you have used "SIMD-benchmark_adding_zeros.exe".

    So far it looks as if the shorter floating-point data transfer instructions can be used without a penalty.
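
    As a cross-check of that conclusion, the sketch below compares test 0 (`movdqa`) with test 1 (`movaps`) using the "without wrap-around" figures from the table in the first post, and counts the bytes of the two load encodings shown in the program output. It is only an illustration in Python; the variable names are made up.

```python
# (test 0, test 1) totals in machine cycles, copied from the table in the first post.
results = {
    "AMD A8-5500": (71059, 71060),
    "Intel Atom D2550": (53291, 53291),
    "Intel Atom E3815": (36905, 36905),
    "Intel Core i7-4790K": (37270, 37270),
}

# Load encodings as printed by the test program.
movdqa_load = bytes.fromhex("66 0F 6F 00")  # movdqa xmm0, [eax]
movaps_load = bytes.fromhex("0F 28 00")     # movaps xmm0, [eax]

# Difference per processor: positive means the movaps variant took longer.
differences = {cpu: t1 - t0 for cpu, (t0, t1) in results.items()}
bytes_saved = len(movdqa_load) - len(movaps_load)

if __name__ == "__main__":
    for cpu, diff in differences.items():
        print(f"{cpu}: test 1 - test 0 = {diff} cycles")
    print(f"bytes saved per load: {bytes_saved}")
```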

  8. #7
    Member Kennon Conrad
    Quote Originally Posted by just a worm View Post
    I am assuming that you have used "SIMD-benchmark_adding_zeros.exe".
    That is correct.

  9. #8
    Member just a worm
    thanks

  10. #9
    Member just a worm
    Added the Atom E3815 to the test results, and a program for Linux.
