Thread Director: Windows 11 Does It Best

Every operating system runs what is called a scheduler – a low-level program that dictates where workloads should run on the processor, depending on factors like performance, thermals, and priority. A naïve scheduler that only has to deal with a single core or a homogeneous design has it pretty easy, managing only power and thermals. Since those single-core days though, schedulers have grown more complex.

One of the first issues that schedulers faced in monolithic silicon designs was simultaneous multi-threading, whereby a core can run more than one thread at once. Running two threads on a core usually improves performance, but the relationship is not linear. One thread on a core might run at 100%, while two threads on the same core might raise overall throughput to 140%, meaning each thread only runs at 70%. As a result, schedulers had to distinguish between physical cores and hyperthreads, placing new software on an idle core before filling up the hyperthreads. If a piece of software doesn't need full performance and is happy to run in the background, and the scheduler knows enough about the workload, it might be placed on a hyperthread instead. This is, at a simple level, what Windows 10 does today.
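To put rough numbers on that trade-off, here's a toy model; the 140% and 70% figures are the illustrative ones from the paragraph above, not measurements of any particular CPU:

```python
# Toy model of SMT scaling: two threads sharing one physical core.
# The 1.4x combined-throughput figure is illustrative only.
SMT_SCALING = 1.4

def per_thread_throughput(threads_on_core: int) -> float:
    """Relative throughput of each thread on one physical core."""
    if threads_on_core <= 1:
        return 1.0
    return SMT_SCALING / threads_on_core  # e.g. 1.4 / 2 = 0.7

# Two jobs spread across two idle cores: 1.0 each, 2.0 total.
print(2 * per_thread_throughput(1))
# Two jobs packed onto one core's two hyperthreads: 0.7 each, 1.4 total.
print(2 * per_thread_throughput(2))
```

Spreading threads across idle physical cores first keeps per-thread performance high, which is exactly the behavior described above.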

This way of doing things maximizes performance, but could have a negative effect on efficiency, as ‘waking up’ a core to run a workload on it may incur extra static power costs. Going beyond that, this simple view assumes each core and thread has the same performance and efficiency profile. When we move to a hybrid system, that is no longer the case.

Alder Lake has two sets of cores (P-cores and E-cores), but it actually has three levels of performance and efficiency: P-cores, E-cores, and hyperthreads on P-cores. To ensure the cores are used to their fullest, Intel had to work with Microsoft to implement a new hybrid-aware scheduler, one that interacts with an on-board microcontroller on the CPU for more information about what is actually going on.

The microcontroller on the CPU is what Intel calls Thread Director. It has a full-scope view of the whole processor: what is running where, what instructions are running, and what appears to be the most important. It monitors the instructions at the nanosecond level, and communicates with the OS at the microsecond level. It takes into account thermals and power settings, identifies which threads can be promoted to higher performance modes, and flags which ones can be bumped if something of higher priority comes along. It can also adjust its recommendations based on frequency, power, thermals, and additional sensor data not immediately available to the scheduler at that resolution. All of that gets fed to the operating system.

The scheduler is Microsoft’s part of the arrangement, and as it lives in software, it’s the one that ultimately makes the decisions. The scheduler takes all of the information from Thread Director, constantly, as a guide. So if a user comes in with a more important workload, Thread Director tells the scheduler which cores are free, or which threads to demote. The scheduler can override the Thread Director, especially if the user has a specific request, such as making background tasks a higher priority.

What makes Windows 11 better than Windows 10 in this regard is that Windows 10 focuses mostly on the raw performance of certain cores, whereas Windows 11 factors in efficiency as well. While Windows 10 knows the E-cores offer lower performance than the P-cores, it doesn't know how well each core performs at a given frequency for a given workload, whereas Windows 11 does. Combine that with an instruction prioritization model, and Intel states that under Windows 11, users should expect much more consistent performance on hybrid CPU designs.
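As a rough illustration of the extra information Windows 11 gets to play with, imagine the scheduler holding per-core performance and power figures at given frequencies. The numbers below are invented purely for the example, not Alder Lake data:

```python
# Hypothetical per-core data a hybrid-aware scheduler could reason about:
# relative performance and power at a given frequency. Values are made up.
CORES = {
    "P-core @ 4.9 GHz": {"perf": 100, "watts": 25},
    "P-core @ 3.0 GHz": {"perf": 70,  "watts": 9},
    "E-core @ 3.7 GHz": {"perf": 55,  "watts": 5},
}

def best_for_throughput(cores):
    """Pick the core with the highest raw performance."""
    return max(cores, key=lambda c: cores[c]["perf"])

def best_for_efficiency(cores):
    """Pick the core with the best performance per watt."""
    return max(cores, key=lambda c: cores[c]["perf"] / cores[c]["watts"])

print(best_for_throughput(CORES))   # P-core @ 4.9 GHz
print(best_for_efficiency(CORES))   # E-core @ 3.7 GHz
```

Windows 10 can only really make the first kind of choice; Windows 11 can also make the second.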

Under the hood, Thread Director is running a pre-trained algorithm based on millions of hours of data gathered during the development of the feature. It identifies the effective IPC of a given workload, and applies that to the performance/efficiency metrics of each core variation. If there's an obvious potential for better IPC or better efficiency, it suggests the thread is moved. Workloads are broadly split into four classes:

  • Class 3: Bottleneck is not in the compute, e.g. IO or busy loops that don’t scale
  • Class 0: Most Applications
  • Class 1: Workloads using AVX/AVX2 instructions
  • Class 2: Workloads using AVX-VNNI instructions

Anything in Class 3 is recommended for E-cores. Anything in Class 1 or 2 is recommended for P-cores, with Class 2 having higher priority. Everything else fits in Class 0, with frequency adjustments to optimize for IPC and efficiency if placed on the P-cores. The OS may force any class of workload onto any core, depending on the user.
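As a rough sketch of how those classes might turn into placement decisions, consider the following; the class numbers are Intel's from the list above, but the function names and the override parameter are mine, purely for illustration, since neither Intel nor Microsoft has published the interface at this level:

```python
from enum import IntEnum

class WorkClass(IntEnum):
    DEFAULT = 0      # Class 0: most applications
    VECTOR = 1       # Class 1: AVX/AVX2-heavy work
    VNNI = 2         # Class 2: AVX-VNNI-heavy work (highest P-core priority)
    NON_COMPUTE = 3  # Class 3: I/O-bound work or busy loops that don't scale

def thread_director_hint(work_class: WorkClass) -> str:
    """Core type the hardware would recommend for a thread of this class."""
    if work_class == WorkClass.NON_COMPUTE:
        return "E-core"
    # Classes 1 and 2 prefer P-cores (class 2 first); class 0 also
    # lands on a P-core when one is free, with frequency tuned for IPC.
    return "P-core"

def os_placement(work_class: WorkClass, user_override: str | None = None) -> str:
    """The OS scheduler has the final say and may ignore the hint."""
    return user_override or thread_director_hint(work_class)

print(os_placement(WorkClass.NON_COMPUTE))                    # E-core
print(os_placement(WorkClass.VNNI))                           # P-core
print(os_placement(WorkClass.VNNI, user_override="E-core"))   # E-core
```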

There was some confusion in the press briefing as to whether Thread Director can ‘learn’ during operation, and how long that would take – to be clear, Thread Director doesn’t learn; it already knows from the pre-trained algorithm. It analyzes the instruction flow coming into a core, identifies the class as listed above, calculates where the thread is best placed (which takes microseconds), and communicates that to the OS. I think the confusion came from the difference between ‘learning’ and ‘analyzing’. Thread Director is ‘learning’ the instruction mix of a thread so it can apply the algorithm, but the algorithm itself is not being updated or adjusting its classes. Even if you wanted the algorithm to learn your workflow over time, it can’t actually see which thread belongs to which program or utility – that mapping lives at the operating system level, and is down to Microsoft. Ultimately, Thread Director can suggest a series of things, and the operating system can choose to ignore them all, although that’s unlikely to happen in normal operation.

One situation where this might rear its head has to do with in-focus operation. As showcased by Intel, the default behavior of Windows changes depending on the power plan in use.

When a user is on the balanced power plan, Microsoft will move any software or window that is in focus (i.e. selected) onto the P-cores. Conversely, if you click away from one window to another, the thread for that first window will move to an E-core, and the new window now gets P-core priority. This makes perfect sense for the user that has a million windows and tabs open, and doesn’t want them taking immediate performance away.
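Expressed as pseudocode, the balanced-plan policy Intel described boils down to this; the function is a hypothetical illustration, not a real Windows API:

```python
def preferred_cores(window_in_focus: bool) -> str:
    """Balanced power plan policy as Intel described it: the
    foreground window's threads get the P-cores, and anything
    defocused drifts to the E-cores. Hypothetical helper only."""
    return "P-cores" if window_in_focus else "E-cores"

# Click away from a video export to an image editor:
print(preferred_cores(window_in_focus=True))   # the editor: P-cores
print(preferred_cores(window_in_focus=False))  # the export: E-cores
```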

However, this way of doing things might be a bit of a concern, or at least it is for me. The demonstration that Intel performed was where a user was exporting video content in one application, and then moved to another to do image processing. When the user moved to the image processing application, the video editing threads were moved to the E-cores, allowing the image editor to use the P-cores as needed.

Now, usually when I’m dealing with video exports, the export throughput is my limiting factor. I need the video to complete, regardless of what I’m doing in the interim. By defocusing the video export window, the export now moves to the slower E-cores. If I want to keep it on the P-cores in this mode, I have to keep the window in focus and not do anything else. The way this is described also means that any software fronted by a GUI, but which spawns a background process to do the actual work, will stay on the E-cores, because the background process never gets focus in normal operation.

In my mind, this is a bad oversight. I was told that this is explicitly Microsoft’s choice on how to do things.

The solution, in my mind, is for some sort of software to exist where a user can highlight programs to the OS that they want to keep on the high-performance track. Intel technically made something similar when it first introduced Turbo Boost Max 3.0, though it was unclear whether this was something that had to come from Intel or from Microsoft to work properly. I assume the latter, given the OS has ultimate control here.
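In the meantime, the closest crude workaround is setting processor affinity by hand, either through Task Manager or a script. A minimal sketch using the psutil Python package is below; the assumption that logical processors 0–15 map to the P-cores is mine for a hypothetical 8P+8E part, and the real mapping needs checking per system:

```python
# Keep a background export on the P-cores by pinning its process.
# Requires: pip install psutil
# ASSUMPTION: logical processors 0-15 are the P-cores (and their
# hyperthreads) on this hypothetical 8P+8E chip; check your own SKU.
import psutil

P_CORE_LPS = list(range(16))

def pin_to_p_cores(exe_name: str) -> None:
    """Set CPU affinity to the P-core logical processors for every
    running process whose executable name matches exe_name."""
    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == exe_name.lower():
            try:
                proc.cpu_affinity(P_CORE_LPS)
            except psutil.AccessDenied:
                pass  # e.g. protected system processes

# Example with a hypothetical encoder process name:
# pin_to_p_cores("ffmpeg.exe")
```

Hard affinity is a blunt instrument compared to a scheduler hint, since the pinned process can no longer spill onto the E-cores even when that would help, but it at least keeps a background export on the performance cores.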

I was, however, told that if the user changes the Windows power plan to High Performance, this behavior stops. In my mind this isn’t a proper fix, but it means we might see some users and reviewers getting lower performance if the work is being done by a background process and the system is left on the default Balanced power plan as installed. If the same policy applies to laptops, that’s a bigger issue.

Comments

  • Gothmoth - Wednesday, October 27, 2021 - link

    241W TDP in intel speak means 280W under full load.

    you pay twice the money per year for energy..... and how do you cool this thing?
  • Wrs - Wednesday, October 27, 2021 - link

Most of us don't sustain our processors at peak. Average consumption for fixed work and lightly threaded/idle power are substantial inputs to energy cost, which is probably not the biggest consideration for a desktop. If it were, you'd get a laptop?

    On cooling, if you're getting a 241W turbo processor you are not aiming for a low-profile build. Any $70 tower cooler ought to handle 241W with ease, if the processor interface is strong. Intel's have historically been strong. AMD is usually behind there. For example the 5800X package can only dissipate around 160W on the best liquid/air coolers, as the power density is too high on the CCD and the solder is thick.
  • Spunjji - Thursday, October 28, 2021 - link

    More like $100 to handle 250W "with ease".

    "Intel's have historically been strong"
    Not between Ivy Bridge through to (some) Comet Lake.

    "For example the 5800X package can only dissipate around 160W on the best liquid/air coolers"
    You don't *need* to dissipate more than that, though? You barely get more performance than running it around 120W.

    Really loving this game of talking up Intel's strategy for their CPUs producing absurd amounts of heat. Like great, they deal with heat better, *because they have to*. Inefficient is as inefficient does.
  • Wrs - Thursday, October 28, 2021 - link

    No, $70 for 250W+. There's a somewhat hard limit for heat pipes. A $100 cooler is typically dual tower or equivalent with a limit in the 350-450W range (though admit I never set out to measure those specifically). That doesn't mean there aren't crappy designs, but I'm referring to any reputable maker. That also doesn't mean a cooler will cool to the limit on any processor. There are choke points with thermal interfaces and die size. Back in '08 I had this single tower, two-fan Thermalright 120 extreme that successfully sustained 383W on LGA1366 at under 100C... that's an above average $100 cooler. Might have gone higher with a bigger/thinner die but just illustrates the possibilities.

    The 5800x, on the other hand, cannot practically sustain over ~160W on that caliber of cooler. In fact a high-end heat sink (I've used a D15 and U12S) remains cool to the touch while the CCD is throttling at 90C, with some 135W over 88mm2, simply because the thermal interface down below is the choke. On the double CCD Zen 3's, the amperage at the socket seems to be the limit. Plenty of people know Zen3 doesn't have as much overclocking headroom, simply by the difference between stock 1C turbo and all-core OC frequency, forcing users to pick between snappy ST performance and sustained MT. Notice how both Rocket Lake and ADL don't have that issue, given sufficient cooling.

    Lastly, let's not confuse efficiency with rated power limits. The review sites will have to measure ADL efficiency empirically. From a theoretical view I don't spot a big difference between Intel 7 and TSMC 7nm, so to see the Intel 7 part rated for so much higher power than the N7 part (all the best Zen3's are 142W turbo) tells me that Intel's package/process accommodates much higher heat dissipation and by extension has more room to perform better whether stock vs. stock or OC vs. OC. And it's kind of expected based on the physical characterization of a 208mm2 monolithic die (per der8auer) with a reduced z-height and thin solder, as compared to Zen 3's typically thick package and thin IHS.
  • Oxford Guy - Friday, October 29, 2021 - link

    '$70 for 250W+'

    Noisy.

    Let's look at the cost per watt for a quiet installation.
  • Wrs - Friday, October 29, 2021 - link

    Maximum noise for a cooler is based on the fans and airflow path through the cooler, not the heat. The duty cycle - and thus noise - for typical PWM fans is regulated based on processor temperature, again not actually the heat. So if you want quiet, get a quiet fan, or get a processor with good enough efficiency and thermal dissipation or heat tolerance that it won't need the fans at 100%.

    Hope I didn't overcomplicate the explanation. When I put a $70 stock Noctua U12S on my 5800x and start a game, it gets somewhat noisy. That's because the ~100W being put out by the CPU isn't dissipating well to the heatsink, not because 100W is a challenge for a U12S.
  • Spunjji - Friday, October 29, 2021 - link

    That's more a function of how you have your fan curves configured. If the CPU isn't putting out enough heat to saturate the heatpipes, and the die temp is going to be high no matter how fast you run the fans because of thermal density, then you have room to reduce the fan curve.
  • Oxford Guy - Friday, October 29, 2021 - link

    Coolers that are undersized (and less expensive) make more noise.

    It’s similar to the problem with Radeon VII. The die was designed to be small for profit and the clock had to be too high to try to compensate.

    Quiet cooling costs more money in high-wattage parts. It’s not complicated, beyond the fact that some expensive AIOs are noisier than others.
  • mode_13h - Sunday, October 31, 2021 - link

    > Radeon VII. The die was designed to be small for profit and
    > the clock had to be too high to try to compensate.

    Radeon VII was not designed as a gaming GPU. It was small because it was the first GPU made at 7 nm, by a long shot. At that time, it was probably one of the biggest dies made on that node.
    The fact they could turn a profit by selling it at a mere $700 was bonus.

    And the MI50/MI60 that were its primary target don't even have a fan. They have a rigid power limit, as well. So, the idea that AMD just said "hey, let's make this small, clock it high, and just run the fans fast" is a bit off the mark.

    https://www.amd.com/en/products/professional-graph...
  • Spunjji - Friday, October 29, 2021 - link

Maybe we live in different places. Where I am, a decent tower capable of cooling 250W comfortably - not maxed out - is the equivalent of $100 US.

    "The 5800x, on the other hand, cannot practically sustain over ~160W on that caliber of cooler"
    I don't know why anyone would bother, though. The difference between MT performance with PBO and overclocked MT performance is minimal. If you need more MT that badly and TDP isn't a problem, then Threadripper is a better option. If your use case doesn't cover the cost of Threadripper then it's unlikely you'll miss a few percent in performance and you'll probably benefit from not overspending on cooling just to get it. Rocket Lake doesn't compete well with Zen 3 in MT even when overclocked, so it's not a great argument for that chip. We'll have to see how it pans out with ADL, though it does look promising.

    "Lastly, let's not confuse efficiency with rated power limits"
    I'm not!

    "tells me that Intel's package/process accommodates much higher heat dissipation"
    Sure, but...

    "by extension has more room to perform better"
    ...this is absolutely not something you can deduce just from knowing that it can sustain higher power levels 😅
