07-24-2023, 05:24 AM
I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call `pow(a,2)` by compiling it into `a*a`, but the call `pow(a,6)` is not optimized and still calls the library function `pow`, which slows the code down considerably. (In contrast, the [Intel C++ Compiler][1], executable `icc`, will eliminate the library call for `pow(a,6)`.)
What I am curious about is that when I replaced `pow(a,6)` with `a*a*a*a*a*a`, using GCC 4.5.1 with the options "`-O3 -lm -funroll-loops -msse4`", the compiler emitted 5 `mulsd` instructions:

    movapd  %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13

while if I write `(a*a*a)*(a*a*a)`, it will produce

    movapd  %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. `icc` has similar behavior.
Why do compilers not recognize this optimization trick?
[1]: