Fully utilizing pipelines on Kaby Lake

#1
(This is a follow-up to a code-review question, with more details on the context of this loop.)

---

Environment:

- Windows 7 x64
- VS 2017 community
- Targeting x64 code on an Intel i7-7700K (Kaby Lake)

I don't write a lot of assembler code, and when I do, it's either short enough or simple enough that I don't have to worry much about squeezing the maximum amount of perf out of it. My more complex code is usually written in C and I let the compiler's optimizers worry about latency, code alignment, etc.

However in my current project, MSVC's optimizer does a remarkably poor job on the code in my critical path. So...

I haven't yet found a good tool that does either a static or runtime analysis of x64 assembler code with a view to removing stalls, improving latency, etc. All I've got is the VS profiler which tells me (roughly) which instructions are taking the most time. And the clock on the wall that tells me if the latest change has made things better or worse.

As an alternative, I've been slogging my way through Agner's docs in the hope of squeezing some more perf out of my code. The problem is that it's hard to understand any of his work until you understand all of it. But pieces of it make sense, and I'm trying to apply what I've learned.

With that in mind, here's the core of my innermost loop, which (not surprisingly) is where the VS profiler says my time is being spent:

    nottop:
            vpminub ymm2, ymm2, ymm3    ; reset out-of-range values
            vpsubb  ymm2, ymm2, ymm0    ; take a step

    top:
            vptest  ymm2, ymm1          ; check for out-of-range values
            jnz     nottop

            ; Outer loop that does some math, does a "vpsubb ymm2, ymm2, ymm0",
            ; and eventually jumps back to top

Yup, this is pretty much a textbook example of a dependency chain: Each instruction in this tight little loop depends upon the results of the previous operation. This means there can be no parallelism, which means I'm not taking full advantage of the processor.

Inspired by Agner's "optimizing assembler" doc, I've come up with an approach that (hopefully) allows me to do 2 operations at a time, so I could have one pipeline updating ymm2 and another updating (say) ymm8.
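
Roughly, the interleaved version would look something like this (a sketch only: the register choices are illustrative, and I've glossed over the exit handling, since each chain really needs to be tracked separately, which is a big part of why the change is non-trivial):

    nottop:
            vpminub ymm2, ymm2, ymm3   ; chain 1: reset out-of-range values
            vpminub ymm8, ymm8, ymm3   ; chain 2: independent registers, can run in parallel
            vpsubb  ymm2, ymm2, ymm0   ; chain 1: take a step
            vpsubb  ymm8, ymm8, ymm0   ; chain 2: take a step

    top:
            vptest  ymm2, ymm1         ; check chain 1
            jnz     nottop             ; (sketch only: real code must handle each chain's exit)
            vptest  ymm8, ymm1         ; check chain 2 (vptest clobbers flags, so test separately)
            jnz     nottop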

It's a non-trivial change though, so before I start ripping everything apart, I wonder if it's likely to help. Looking at Agner's "Instruction tables" for Kaby Lake (my target), I see that:

                uops each port    latency
    pminub      p01               1
    psubb       p015              1
    ptest       p0 p5             3

Given this, it looks like while one pipeline is using p0+p5 to do the vptest against ymm2, another can be utilizing p1 to do both vpminub and vpsubb on ymm8. Yeah, things are still going to get stacked up behind vptest, but it should help.

Or would it?

I'm currently running this code from 8 threads (yes, 8 threads really does give me better total throughput than 4, 5, 6, or 7). Given that my i7-7700K has 4 hyperthreaded cores, wouldn't the fact that there are 2 threads running on each core mean that I'm *already* maxing out the ports? Ports are "per core," not "per logical cpu," right?

So.

Based on my current understanding of Agner's work, it appears that there is no way to further optimize this code in its current form. If I want better perf, I'm going to need to come up with a different approach.

And yes, I'm sure if I posted my whole asm routine here, someone could suggest an alternate approach. But the purpose of this Question isn't to have someone write my code for me. I'm trying to see if I'm starting to understand how to think about optimizing asm code.

Is this (roughly) the right way to look at things? Am I missing a few pieces? Or is this flat-out wrong?



#2
A partial answer:

Intel provides a tool named Intel Architecture Code Analyzer (iaca) that does static analysis of code, showing (kind of) what ports are in use in a section of asm code.

Unfortunately:

- v2.3 doesn't include the essential (and only) header file. You can find this file in v2.2.
- v2.2 includes the header, but omits a python script (pt.py) for analyzing the output. This file is also missing from v2.3 (at least, I haven't found it).
- One of the output formats from iaca is a .dot file, which is read by graphviz, but the Intel docs fail to describe which of the 38 executables in graphviz is used to display the output.

But perhaps most importantly (for my needs):

- v2.3 (currently the most recent release) supports Skylake, but not Kaby Lake.

Given how the implementation details change between processors, that makes all the output suspect. Dates in the pdf file suggest that v2.3 was released July 2017, meaning that I'm probably going to have to wait a while for the next release.
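
For anyone else fighting the missing header: the markers iacaMarks.h defines are just magic byte sequences bracketing the code you want analyzed, so (assuming I've transcribed them correctly from v2.2) you can emit them directly in MASM:

            mov     ebx, 111            ; IACA "start analysis" marker
            db      064h, 067h, 090h    ; magic bytes (fs prefix, addr32 prefix, nop)

            ; ... code for iaca to analyze ...

            mov     ebx, 222            ; IACA "end analysis" marker
            db      064h, 067h, 090h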

#3
**TL;DR**: I think Hyperthreading should keep all your vector ALU ports busy with 2 threads per core.

---

`vptest` doesn't write either vector register, only flags. The next iteration doesn't have to wait for it, so its latency is mostly irrelevant.

Only `jnz` is dependent on `vptest`, and speculative execution + branch prediction hides the latency of control dependencies. `vptest` latency is relevant for how quickly a branch mispredict can be detected, but not for throughput in the correctly-predicted case.

---

Good point about hyperthreading. Interleaving two independent dep chains within a single thread can be helpful, but it's a lot harder to do correctly and efficiently.

Let's look at the instructions in your loop. A predicted-taken `jnz` will always run on p6, so we can discount it. (Unrolling could actually hurt: a predicted-not-taken `jnz` can also run on p0 or p6.)

On a core by itself, your loop should run at 2 cycles per iteration, bottlenecked on latency. It's 5 fused-domain uops, so it takes 1.25 cycles to issue. (Unlike `test`, `jnz` can't macro-fuse with `vptest`). **With hyperthreading, the front-end is already a worse bottleneck than latency**. Each thread can issue 4 uops every other cycle, which is less than the 5 uops every other cycle of the dependency-chain bottleneck.
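
Breaking the loop down per instruction (the uop counts are the ones from the tables quoted in the question; the annotations are mine):

    nottop:
            vpminub ymm2, ymm2, ymm3   ; 1 uop (p01)    - on the loop-carried chain: 1c
            vpsubb  ymm2, ymm2, ymm0   ; 1 uop (p015)   - on the loop-carried chain: 1c
    top:
            vptest  ymm2, ymm1         ; 2 uops (p0+p5) - reads ymm2, writes only flags
            jnz     nottop             ; 1 uop (p6)     - hidden by branch prediction
                                       ; total: 5 fused-domain uops; carried latency = 2c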

(This is common for recent Intel, especially SKL/KBL: many uops have enough ports to choose from that sustaining 4 uops per clock is realistic, especially with SKL's improved uop-cache and decoder throughput, which avoids issue bubbles caused by front-end limitations rather than by the back-end filling up.)

Every time one thread stalls (e.g. for a branch mispredict), the front-end can catch up on the other thread and get lots of future iterations into the out-of-order core for it to chew through at one iter per 2 cycles. (Or less because of execution-port throughput limits, see below).

----

Execution-port throughput (unfused-domain):

Only 1 of every 5 uops runs on p6 (the `jnz`). It can't be a bottleneck because the front-end issue rate limits us to less than one branch issuing per clock while running this loop.

The other 4 vector ALU uops per iteration have to run on the 3 ports with vector execution units. The p01 and p015 uops have enough scheduling flexibility that no single port will be a bottleneck, so we can just look at total ALU throughput. That's 4 uops / iter for 3 ports, for a max average throughput for a physical core of one iter per 1.333 cycles.

For a single thread (no HT), this is not the most serious bottleneck. But with two hyperthreads sharing those ports, that's one iter per 2.666 cycles for each thread.

**Hyperthreading should saturate your execution units, with some front-end throughput to spare**. Each thread should average one iter per 2.666c, while the front-end can issue at one per 2.5c. Since latency only limits you to one per 2c, the out-of-order core can catch up after any critical-path delays caused by resource conflicts (e.g. a `vptest` uop stealing a cycle from one of the other two uops).
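
Spelling out that arithmetic for one physical core running two copies of this loop:

            ; 4 vector-ALU uops/iter / 3 ports = 1.333 c/iter for the whole core
            ;                                  -> 2.666 c/iter per thread (2 threads)
            ; front-end: 5 uops/iter / (4 uops per 2 clocks) = 2.5 c/iter per thread
            ; latency:   vpminub (1c) + vpsubb (1c)          = 2.0 c/iter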

If you can change the loop to check any less often, or with fewer vector uops, that could be a win. But everything I'm thinking of is *more* vector uops (e.g. `vpand` instead of `vptest` and then `vpor` a couple of those results together before checking... Or `vpxor` to produce an all-zero vector when `vptest` would). Maybe if there was a vector XNOR or something, but there isn't.
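
For example, with the hypothetical two-accumulator version from the question (`ymm2` and `ymm8`), the two checks could at least be folded into one branch, at the cost of an extra ALU uop (an untested sketch; `ymm9` is a scratch register I'm inventing here):

            vpor    ymm9, ymm2, ymm8   ; fold both chains together
            vptest  ymm9, ymm1         ; ZF stays set only if neither chain is out of range
            jnz     fixup              ; one branch for both; "fixup" re-tests each chain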

----

To check what's actually happening, you could use perf counters to profile your current code, and see what uop throughput you're getting for a whole core (not just each logical thread separately). Or profile one logical thread and see if it's saturating about half of p015.
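
On Linux, for instance, something along these lines would show the per-port counts (these are the Skylake event names from `perf list`; I'm assuming they carry over to Kaby Lake, and `./yourprogram` is a placeholder):

    perf stat -e uops_issued.any,uops_dispatched_port.port_0,uops_dispatched_port.port_1,uops_dispatched_port.port_5 ./yourprogram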


