07-24-2023, 03:15 AM
I have a number of tight loops I'm trying to optimize with GCC and intrinsics. Consider for example the following function.
```c
#include <immintrin.h>

void triad(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for (i = 0; i < n; i += 8) {
        _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
    }
}
```
This produces a main loop like this
```
20: vmulps  ymm0, ymm1, [rsi+rax*1]
25: vaddps  ymm0, ymm0, [rdi+rax*1]
2a: vmovaps [rdx+rax*1], ymm0
2f: add     rax, 0x20
33: cmp     rax, rcx
36: jne     20
```
But the `cmp` instruction is unnecessary. Instead of having `rax` start at zero and finish at `sizeof(float)*n` we can set the base pointers (`rsi`, `rdi`, and `rdx`) to the end of the array and set `rax` to `-sizeof(float)*n` and then test for zero. I am able to do this with my own assembly code like this
```
.L2:
    vmulps  ymm1, ymm2, [rdi+rax]
    vaddps  ymm0, ymm1, [rsi+rax]
    vmovaps [rdx+rax], ymm0
    add     rax, 32
    jne     .L2
```
but I can't manage to get GCC to do this. I have several tests now where this makes a significant difference. Until recently, GCC and intrinsics have served me well, so I'm wondering if there is a compiler switch, or a way to reorder or change my code, so that the `cmp` instruction is not produced by GCC.
I tried the following but it still produces `cmp`. All variations I have tried still produce `cmp`.
```c
void triad2(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    float *x2 = x + n;
    float *y2 = y + n;
    float *z2 = z + n;
    int i;
    __m256 k4 = _mm256_set1_ps(k);
    for (i = -n; i < 0; i += 8) {
        _mm256_store_ps(&z2[i], _mm256_add_ps(_mm256_load_ps(&x2[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y2[i]))));
    }
}
```
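Another reformulation worth trying is to do the pointer arithmetic in bytes rather than in `float` indices, so the loop counter maps directly onto the `rax` byte offset in the assembly above and the loop exits exactly when it reaches zero. This is only a scalar sketch (the `triad3` name and the `char *` offsetting are mine, not from the original code), and whether GCC actually folds the test into an `add`/`jne` pair depends on the GCC version and optimization flags:

```c
#include <stddef.h>

/* Scalar sketch of the byte-offset idea: i counts from -sizeof(float)*n
   up to 0, mirroring the add/jne loop shown in the assembly above.
   triad3 is a hypothetical name for illustration. */
void triad3(float *x, float *y, float *z, const int n) {
    float k = 3.14159f;
    /* Move the base pointers to the ends of the arrays. */
    char *x2 = (char *)(x + n);
    char *y2 = (char *)(y + n);
    char *z2 = (char *)(z + n);
    ptrdiff_t i = -(ptrdiff_t)n * (ptrdiff_t)sizeof(float);
    while (i) { /* the loop condition is a plain test against zero */
        *(float *)(z2 + i) = *(float *)(x2 + i) + k * *(float *)(y2 + i);
        i += sizeof(float);
    }
}
```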
Edit:
I'm interested in maximizing instruction level parallelism (ILP) for these functions for arrays which fit in the L1 cache (in practice, for `n=2048`). Although unrolling can be used to improve the bandwidth, it can decrease the ILP (assuming the full bandwidth can be attained without unrolling).
Edit:
Here is a table of results for a Core2 (pre-Nehalem), an Ivy Bridge, and a Haswell system. intrinsic is the result of using intrinsics, unroll1 is my assembly code without `cmp`, and unroll16 is my assembly code unrolled 16 times. The percentages are percentages of peak performance (frequency × num_bytes_cycle, where num_bytes_cycle is 24 for SSE, 48 for AVX, and 96 for FMA).
|                | SSE (Core2) | AVX (Ivy Bridge) | FMA (Haswell) |
|----------------|-------------|------------------|---------------|
| intrinsic      | 71.3%       | 90.9%            | 53.6%         |
| unroll1        | 97.0%       | 96.1%            | 63.5%         |
| unroll16       | 98.6%       | 90.4%            | 93.6%         |
| ScottD         | 96.5%       |                  |               |
| 32B code align |             | 95.5%            |               |
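For reference, the peak figures behind those percentages follow directly from the formula above. Here is a small helper that spells out the arithmetic (the function names are mine, for illustration):

```c
/* Peak performance from the formula in the text:
   peak = frequency * num_bytes_cycle,
   with num_bytes_cycle = 24 (SSE), 48 (AVX), or 96 (FMA). */
double peak_gbs(double freq_ghz, int num_bytes_cycle) {
    return freq_ghz * (double)num_bytes_cycle; /* GB/s */
}

/* Efficiency as a percentage of peak, as used in the table. */
double percent_of_peak(double measured_gbs, double freq_ghz, int num_bytes_cycle) {
    return 100.0 * measured_gbs / peak_gbs(freq_ghz, num_bytes_cycle);
}
```

For example, a hypothetical 3.0 GHz machine with AVX peaks at 3.0 × 48 = 144 GB/s, so the 96.1% unroll1 entry would correspond to roughly 138 GB/s sustained.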
For SSE I get almost as good a result without unrolling as with unrolling, but only if I don't use `cmp`. On AVX I get the best result without unrolling and without `cmp`. It's interesting that on Ivy Bridge unrolling is actually worse. On Haswell I get by far the best result by unrolling, which is why I asked a follow-up question; the source code used for these tests can be found in that question.
Edit:
**Based on ScottD's answer I now get almost 97% with intrinsics for my Core2 system (pre-Nehalem, 64-bit mode).** I'm not sure why the `cmp` actually matters, since the loop should take two clock cycles per iteration anyway. For Sandy Bridge it turns out the efficiency loss is due to code alignment, not to the extra `cmp`. On Haswell only unrolling works anyway.