Optimized 2x2 matrix multiplication: Slow assembly versus fast SIMD

#1
**Problem**

I am studying high-performance matrix multiplication algorithms such as OpenBLAS or GotoBLAS and I am trying to reproduce some of the results. This question deals with the inner kernel of a matrix multiplication algorithm. Specifically, I am looking at computing `C += AB`, where `A` and `B` are 2x2 matrices of type `double`, at the peak speed of my CPU. There are two ways to do this: one is to use compiler intrinsics for the SIMD instructions, and the other is to code directly in assembly using the SIMD registers.
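For concreteness, with the matrices stored in row-major order (this is just the naive triple loop from the Setup section written out, not new math):

    C[0] += A[0]*B[0] + A[1]*B[2];    C[1] += A[0]*B[1] + A[1]*B[3];
    C[2] += A[2]*B[0] + A[3]*B[2];    C[3] += A[2]*B[1] + A[3]*B[3];

Each row of `C` is a combination of the two rows of `B` scaled by entries of `A`, which is exactly the shape the SIMD kernel below exploits.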

**What I have looked at so far**

All the relevant papers and course webpages, and many many SO Q&As dealing with the subject (too many to list). I have compiled OpenBLAS on my computer, looked through the OpenBLAS, GotoBLAS, and BLIS source codes, and read Agner Fog's manuals.

**Hardware**

My CPU is an Intel i5-540M. You can find the relevant CPUID information on cpu-world.com. The microarchitecture is Nehalem (Westmere), so it can theoretically compute 4 double-precision flops per core per cycle. I will be using just one core (no OpenMP), so with hyperthreading off and 4-step Intel Turbo Boost, I should be seeing a peak of `( 2.533 GHz + 4*0.133 GHz ) * ( 4 DP flops/core/cycle ) * ( 1 core ) = 12.27 DP Gflops`. For reference, with both cores running at peak, Intel Turbo Boost gives a 2-step speed-up and the theoretical peak would be `22.4 DP Gflops`.

**Setup**

I declare my 2x2 matrices as `double` arrays and initialize them with random entries, as shown in the code snippet below; `T` will hold the reference answer.

    srand(time(NULL));
    const int n = 2;
    double A[n*n];
    double B[n*n];
    double C[n*n];
    double T[n*n];
    for(int i = 0; i < n*n; i++){
        A[i] = (double) rand()/RAND_MAX;
        B[i] = (double) rand()/RAND_MAX;
        C[i] = 0.0;
        T[i] = 0.0; // T must start at zero: the reference loop below accumulates into it
    }

I compute a reference answer using naive matrix-matrix multiplication (shown below), which lets me check my result either visually or by computing the L2 norm of the difference between `C` and the reference `T`.

    // "true" answer
    for(int i = 0; i < n; i++)
        for(int j = 0; j < n; j++)
            for(int k = 0; k < n; k++)
                T[i*n + j] += A[i*n + k]*B[k*n + j];

To run the code and get an estimate of the Gflops, I call each multiplication function once to warm up, and then execute it inside a `for` loop `maxiter` times, making sure to zero the `C` matrix each time since I am computing `C += AB`. The `for` loop is placed between two `clock()` calls and this is used to estimate the Gflops. The code snippet below illustrates this part.

    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
    mult2by2(A,B,C); // warmup
    time1 = clock();
    for(int i = 0; i < maxiter; i++){
        mult2by2(A,B,C);
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
    }
    time2 = clock() - time1;
    time3 = (double)(time2)/CLOCKS_PER_SEC;
    gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter; // 2*n^3 flops per call (n^3 multiplies + n^3 adds)
    mult2by2(A,B,C); // to compute the norm against T
    norm = L2norm(n,C,T);
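The `L2norm` routine is not shown here; a minimal sketch of what it might look like, with the signature assumed from the call `L2norm(n,C,T)`:

    #include <math.h>

    // Hypothetical reconstruction: L2 norm of the element-wise difference
    // between the computed matrix C and the reference matrix T.
    double L2norm(const int n, const double* C, const double* T)
    {
        double sum = 0.0;
        for(int i = 0; i < n*n; i++){
            double d = C[i] - T[i];
            sum += d*d;
        }
        return sqrt(sum);
    }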

**SIMD code**

My CPU supports 128-bit vectors, so I can fit 2 `double`s in each vector. This is the main reason why I am doing 2x2 matrix multiplication in the inner kernel. The SIMD code computes an entire row of `C` at one time.

    inline void
    __attribute__ ((gnu_inline))
    __attribute__ ((aligned(16))) mult2by2B(
        const double* restrict A,
        const double* restrict B,
        double* restrict C
    )
    {
        register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;

        // first row: C[0:1] += A[0]*B[0:1] + A[1]*B[2:3]
        xmm0 = _mm_load_pd(C);
        xmm1 = _mm_load1_pd(A);
        xmm2 = _mm_load_pd(B);
        xmm3 = _mm_load1_pd(A + 1);
        xmm4 = _mm_load_pd(B + 2);
        xmm1 = _mm_mul_pd(xmm1,xmm2);
        xmm2 = _mm_add_pd(xmm1,xmm0);
        xmm1 = _mm_mul_pd(xmm3,xmm4);
        xmm2 = _mm_add_pd(xmm1,xmm2);
        _mm_store_pd(C,xmm2);

        // second row: C[2:3] += A[2]*B[0:1] + A[3]*B[2:3]
        xmm0 = _mm_load_pd(C + 2);
        xmm1 = _mm_load1_pd(A + 2);
        xmm2 = _mm_load_pd(B);
        xmm3 = _mm_load1_pd(A + 3);
        //xmm4 = _mm_load_pd(B + 2); // B's second row is still live in xmm4
        xmm1 = _mm_mul_pd(xmm1,xmm2);
        xmm2 = _mm_add_pd(xmm1,xmm0);
        xmm1 = _mm_mul_pd(xmm3,xmm4);
        xmm2 = _mm_add_pd(xmm1,xmm2);
        _mm_store_pd(C + 2,xmm2);
    }

**Assembly (Intel Syntax)**

My first attempt was to create a separate assembly routine for this part and call it from the `main` routine. However, it was extremely slow because `extern` functions cannot be inlined, so I wrote the assembly as inline assembly, as shown below. It is *identical* to what is produced by `gcc -S -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel`. From what I understand of the Nehalem microarchitecture diagrams, this processor can perform `SSE ADD`, `SSE MUL`, and `SSE MOV` in parallel, which explains the interleaving of `MUL`, `ADD`, and `MOV` instructions. You will notice the intrinsics above appear in a different order, because I took a different understanding from Agner Fog's manuals. Nevertheless, `gcc` is smart, and the SIMD code above compiles to the same assembly shown in the inline version.
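As a quick sanity check of the numbers (my own arithmetic, not taken from any of the sources above): the kernel issues 4 `mulpd` and 4 `addpd`, each operating on 2 doubles, i.e. 16 DP flops per call. If one `SSE MUL` and one `SSE ADD` can issue each cycle, the kernel needs at least 4 cycles, which works out to exactly the 4 DP flops/cycle peak from the **Hardware** section, ignoring loads, stores, and instruction latencies.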

    inline void
    __attribute__ ((gnu_inline))
    __attribute__ ((aligned(16))) mult2by2A
    (
        const double* restrict A,
        const double* restrict B,
        double* restrict C
    )
    {
        __asm__ __volatile__
        (
        "mov edx, %[A]                      \n\t"
        "mov ecx, %[B]                      \n\t"
        "mov eax, %[C]                      \n\t"
        "movapd xmm3, XMMWORD PTR [ecx]     \n\t"
        "movapd xmm2, XMMWORD PTR [ecx+16]  \n\t"
        "movddup xmm1, QWORD PTR [edx]      \n\t"
        "mulpd xmm1, xmm3                   \n\t"
        "addpd xmm1, XMMWORD PTR [eax]      \n\t"
        "movddup xmm0, QWORD PTR [edx+8]    \n\t"
        "mulpd xmm0, xmm2                   \n\t"
        "addpd xmm0, xmm1                   \n\t"
        "movapd XMMWORD PTR [eax], xmm0     \n\t"
        "movddup xmm4, QWORD PTR [edx+16]   \n\t"
        "mulpd xmm4, xmm3                   \n\t"
        "addpd xmm4, XMMWORD PTR [eax+16]   \n\t"
        "movddup xmm5, QWORD PTR [edx+24]   \n\t"
        "mulpd xmm5, xmm2                   \n\t"
        "addpd xmm5, xmm4                   \n\t"
        "movapd XMMWORD PTR [eax+16], xmm5  \n\t"
        : // no outputs
        : // inputs
        [A] "m" (A),
        [B] "m" (B),
        [C] "m" (C)
        : // register clobber list
        "memory",
        "edx","ecx","eax",
        "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5"
        );
    }


**Results**

I compile my code with the following flags:

    gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel

The results for `maxiter = 1000000000` are below:

    ********** Inline ASM
    L2 norm: 0.000000e+000, Avg. CPU time: 9.563000, Avg. Gflops: 1.673115

    ********** SIMD Version
    L2 norm: 0.000000e+000, Avg. CPU time: 0.359000, Avg. Gflops: 44.568245

If I force the SIMD version to not be inlined with `__attribute__ ((noinline))`, the results are:

    ********** Inline ASM
    L2 norm: 0.000000e+000, Avg. CPU time: 11.155000, Avg. Gflops: 1.434334

    ********** SIMD Version
    L2 norm: 0.000000e+000, Avg. CPU time: 11.264000, Avg. Gflops: 1.420455


**Questions**

1. If both the inline ASM and SIMD implementations produce identical assembly output, why is the assembly version so much slower? It's as if the inline assembly did not get inlined, which is made evident by the second set of results showing identical performance for "inline" ASM versus "noinline" SIMD. The only explanation I can find is in *Agner Fog Volume 2 page 6*:

> Compiled code may be faster than assembly code because compilers can make
> inter-procedural optimization and whole-program optimization. The assembly
> programmer usually has to make well-defined functions with a well-defined call
> interface that obeys all calling conventions in order to make the code testable and
> verifiable. This prevents many of the optimization methods that compilers use, such
> as function inlining, register allocation, constant propagation, common subexpression
> elimination across functions, scheduling across functions, etc. These
> advantages can be obtained by using C++ code with intrinsic functions instead of
> assembly code.

But the assembler output for both versions is exactly the same.

2. Why am I seeing 44 Gflops in the first set of results? This is way above the 12 Gflops peak I calculated, and is what I would expect only if I were running both cores with single-precision calculations.



**EDIT 1**
A comment suggests there might be dead code elimination. I can confirm that this is happening for the SIMD version: the `-S` output shows that the `for` loop for SIMD only zeros the `C` matrix. I can disable this by turning off compiler optimization with `-O0`. In that case, SIMD runs 3x as slow as ASM, but ASM still runs at exactly the same speed. The norm is also nonzero now, but it is still acceptable at 10^-16. I also see that the inline ASM version is being inlined (evident from the `APP` and `NO_APP` tags), but it is also being unrolled 8 times in the `for` loop. I think unrolling that many times will impact performance heavily, as I usually unroll loops 4 times; anything more, in my experience, seems to degrade performance. One way to defeat the dead code elimination without `-O0` is sketched below.
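(My own sketch, not something tried in the post: making the result observable through a `volatile` sink means the stores to `C` can no longer be proven dead, so the kernel survives `-O3`.)

    // Sketch: the volatile sink is an observable side effect, so GCC must
    // actually compute C on every iteration instead of deleting the call.
    volatile double sink = 0.0;
    for(int i = 0; i < maxiter; i++){
        mult2by2(A,B,C);
        sink += C[0] + C[1] + C[2] + C[3]; // observable use of the result
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
    }

The extra loads and adds land inside the timed loop, so this slightly understates the kernel's true throughput.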

#2
GCC is optimizing away your inline function using intrinsics, `mult2by2B`, due to the line

    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

That line makes every store the function performs to `C` immediately dead, so GCC can remove the whole call. Without that line the benchmark takes 2.9 seconds on the computer from Coliru, and with the line it takes only 0.000001 seconds.

You can also see this in the assembly: compile the code below with `-S` and you will see that with that line of code present, GCC skips the function entirely.

However, when you use the inline assembly, GCC is NOT optimizing the function, `mult2by2A`, away (even though it inlines it). You can see this in the assembly as well.

    #include <stdio.h>
    #include <emmintrin.h> // SSE2 intrinsics
    #include <omp.h>       // omp_get_wtime

    inline void
    __attribute__ ((gnu_inline))
    __attribute__ ((aligned(16))) mult2by2B(
        const double* __restrict A,
        const double* __restrict B,
        double* __restrict C
    )
    {
        register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;

        xmm0 = _mm_load_pd(C);
        xmm1 = _mm_load1_pd(A);
        xmm2 = _mm_load_pd(B);
        xmm3 = _mm_load1_pd(A + 1);
        xmm4 = _mm_load_pd(B + 2);
        xmm1 = _mm_mul_pd(xmm1,xmm2);
        xmm2 = _mm_add_pd(xmm1,xmm0);
        xmm1 = _mm_mul_pd(xmm3,xmm4);
        xmm2 = _mm_add_pd(xmm1,xmm2);
        _mm_store_pd(C,xmm2);

        xmm0 = _mm_load_pd(C + 2);
        xmm1 = _mm_load1_pd(A + 2);
        xmm2 = _mm_load_pd(B);
        xmm3 = _mm_load1_pd(A + 3);
        //xmm4 = _mm_load_pd(B + 2);
        xmm1 = _mm_mul_pd(xmm1,xmm2);
        xmm2 = _mm_add_pd(xmm1,xmm0);
        xmm1 = _mm_mul_pd(xmm3,xmm4);
        xmm2 = _mm_add_pd(xmm1,xmm2);
        _mm_store_pd(C + 2,xmm2);
    }

    int main() {
        double A[4], B[4], C[4];
        int maxiter = 10000000;
        //int maxiter = 1000000000;
        double dtime;
        dtime = omp_get_wtime();
        for(int i = 0; i < maxiter; i++){
            mult2by2B(A,B,C);
            C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
        }
        dtime = omp_get_wtime() - dtime;
        printf("%f %f %f %f\n", C[0], C[1], C[2], C[3]);
        //gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter;
        printf("time %f\n", dtime);
    }
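(An aside not in the original answer: another standard way to keep GCC from deleting benchmarked work is an empty inline-asm "escape", which forces the compiler to assume the buffer may be read.)

    // Sketch of the inline-asm "escape" idiom: the empty asm statement with a
    // "memory" clobber tells GCC that memory reachable through p may be read,
    // so earlier stores to C cannot be eliminated as dead.
    static inline void escape(void* p)
    {
        __asm__ __volatile__("" : : "g"(p) : "memory");
    }

Calling `escape(C)` right after `mult2by2B(A,B,C)` in the loop above would keep the kernel in the generated code even with the zeroing line present.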


