The AMD FX (Bulldozer) Scheduling Hotfixes Tested
by Anand Lal Shimpi on January 27, 2012 12:47 PM EST
The basic building block of Bulldozer is the dual-core module, pictured below. AMD wanted better performance than simple SMT (à la Hyper-Threading) would allow, but without resorting to the full duplication of resources we get in a traditional dual-core CPU. The result is a duplication of integer execution resources and L1 caches, but a sharing of the front end and FPU. AMD still refers to this module as being dual-core, although it's a departure from the more traditional definition of the term. In the early days of multi-core x86 processors, dual-core designs were simply two single-core processors stuck on the same package. Today we still see simple duplication of identical cores in a single processor, but moving forward it's likely that we'll see more heterogeneous multi-core systems. AMD's Bulldozer architecture may be unusual, but it challenges the conventional definition of a core in a way that we're probably going to face one way or another in the not too distant future.
A four-module, eight-core Bulldozer
The bigger issue with Bulldozer isn't one of core semantics, but rather how threads get scheduled on those cores. Ideally, threads with shared data sets would get scheduled on the same module, while threads that share no data would be scheduled on separate modules. The former allows more efficient use of a module's L2 cache, while the latter guarantees each thread has access to all of a module's resources when there's no tangible benefit to sharing.
This ideal scenario isn't how threads are scheduled on Bulldozer today. Instead of intelligent core/module scheduling based on the memory addresses touched by a thread, Windows 7 currently just schedules threads on Bulldozer in order. Starting from core 0 and going up to core 7 in an eight-core FX-8150, Windows 7 will schedule two threads on the first module, then move to the next module, and so on. If the threads happen to be working on the same data, then Windows 7's scheduling approach makes sense. If the threads scheduled are working on different data sets, however, Windows 7's current treatment of Bulldozer is suboptimal.
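To make the in-order behavior concrete, here is a minimal Python sketch of it. This is a toy model, not Windows' actual scheduler: the names `module_of`, `in_order_placement`, and `modules_used` are made up for illustration, and it assumes cores 2m and 2m+1 belong to module m, as the description above implies.

```python
# Toy model of in-order thread placement on a four-module, eight-core chip.
# Assumption: cores 2m and 2m+1 share module m (hypothetical numbering).
CORES_PER_MODULE = 2
NUM_CORES = 8


def module_of(core):
    """Return the index of the module that a core belongs to."""
    return core // CORES_PER_MODULE


def in_order_placement(num_threads):
    """Place thread i on core i, wrapping around once all cores are used."""
    return [i % NUM_CORES for i in range(num_threads)]


def modules_used(placement):
    """Return the sorted list of modules that host at least one thread."""
    return sorted({module_of(core) for core in placement})


placement = in_order_placement(4)
print(placement)                # [0, 1, 2, 3]
print(modules_used(placement))  # [0, 1]
```

In this model, four threads end up packed onto two modules, so each pair shares a front end and FPU while the other two modules sit idle.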
AMD and Microsoft have been working on a patch to Windows 7 that improves scheduling behavior on Bulldozer. The result is two hotfixes that should both be installed on Bulldozer systems. Both hotfixes require Windows 7 SP1; they will refuse to install on a pre-SP1 installation.
The first update simply tells Windows 7 to schedule all threads on empty modules first, then on shared cores. The second hotfix increases Windows 7's core parking latency if there are threads that need scheduling. There's a performance penalty you pay to sleep/wake a module, so if there are threads waiting to be scheduled they'll have a better chance to be scheduled on an unused module after this update.
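The modules-first policy from the first hotfix can be sketched the same way. Again, this is only an illustrative model under the assumption that cores 2m and 2m+1 share module m; `modules_first_order`, `place`, and `shared_modules` are hypothetical helper names, not anything exposed by Windows or AMD.

```python
# Toy model of "empty modules first" placement on a four-module chip.
# Assumption: cores 2m and 2m+1 share module m (hypothetical numbering).
NUM_MODULES = 4
CORES_PER_MODULE = 2


def modules_first_order():
    """Core visiting order: the first core of every module, then the second cores."""
    return [module * CORES_PER_MODULE + slot
            for slot in range(CORES_PER_MODULE)
            for module in range(NUM_MODULES)]


def place(num_threads):
    """Assign each thread the next core in modules-first order, wrapping around."""
    order = modules_first_order()
    return [order[i % len(order)] for i in range(num_threads)]


def shared_modules(placement):
    """Return the sorted list of modules that host more than one thread."""
    counts = {}
    for core in placement:
        module = core // CORES_PER_MODULE
        counts[module] = counts.get(module, 0) + 1
    return sorted(module for module, n in counts.items() if n > 1)


print(modules_first_order())     # [0, 2, 4, 6, 1, 3, 5, 7]
print(shared_modules(place(4)))  # []
print(shared_modules(place(5)))  # [0]
```

With up to four threads, no module is shared in this model. Sharing only begins with the fifth thread, which is the behavior the first update aims for.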
Note that neither hotfix enables optimal scheduling on Bulldozer. Rather than being thread aware and scheduling dependent threads on the same module and independent threads across separate modules, the updates simply move to a better default case of scheduling on empty modules first. This should improve performance in most cases, but there's a chance that some workloads will see a performance reduction. AMD tells me that it's still working with OS vendors (read: Microsoft) to better optimize for Bulldozer. If I had to guess, I'd say that we may see the next big step forward with Windows 8.
AMD was pretty honest when it described the performance gains FX owners can expect to see from this update. In its own blog post on the topic AMD tells users to expect a 1 - 2% gain on average across most applications. Without any big promises I wasn't expecting the Bulldozer vs. Sandy Bridge standings to change post-update, but I wanted to run some tests just to be sure.
|Motherboard:||ASUS P8Z68-V Pro (Intel Z68), ASUS Crosshair V Formula (AMD 990FX)|
|Hard Disk:||Intel X25-M SSD (80GB), Crucial RealSSD C300|
|Memory:||2 x 4GB G.Skill Ripjaws X DDR3-1600 9-9-9-20|
|Video Card:||ATI Radeon HD 5870 (Windows 7)|
|Video Drivers:||AMD Catalyst 11.10 Beta (Windows 7)|
|Desktop Resolution:||1920 x 1200|
|OS:||Windows 7 x64 SP1 w/ BD Hotfixes|
KonradK - Friday, January 27, 2012
"Ideally, threads with shared data sets would get scheduled on the same module, while threads that share no data would be scheduled on separate modules."
I think that it is impossible for the scheduler to predict how a thread will behave, and it is not practical to track the behaviour of a running thread (tracking which areas of memory are accessed by threads would be as computationally intensive as emulation).
So ultimately there is a choice between "threads should be scheduled on separate modules if possible" and "do not care which cores belong to the same module" (pre-hotfix behaviour).
The second means that Bulldozer will behave like a PIV EE (2 cores, 4 threads) on Windows 2000, at least for threads that use the FPU heavily. Windows 2000 does not distinguish between logical and physical cores.
Araemo - Friday, January 27, 2012
I've noticed that Windows doesn't always schedule jobs well to take advantage of Intel Turbo Boost. I realize that it probably doesn't have a noticeable level of impact, but I do notice that running only 1 thread of high-cpu-utilization still doesn't often kick turbo above the 3/4 cores active frequency. I can use processor affinity on the various common background tasks to pull them all to 1 or 2 cores to activate full turbo, but if a process is only using a percent or so of cpu resources, why schedule it to an otherwise-inactive core if there is an already-active, but 98% un-utilized core available? I think the power gating efficiencies would actually be more useful than the pure mhz-related turbo efficiencies (Running 2 cores 100Mhz faster is probably going to waste less power than you gain by shutting down the other two cores completely/partially).
Is there anything to address that behavior?
taltamir - Friday, January 27, 2012
Wouldn't those hotfixes improve performance on Intel HT processors as well?
tipoo - Friday, January 27, 2012
No, Windows already leaves virtual threads from hyperthreading alone until all the physical cores are used, so this won't improve things on the Intel side any. This is specifically for Bulldozer and future architectures like this.
Hale_ru - Monday, February 6, 2012
Bullshit!
Win7 had nothing to do with Intel HT until AMD hit them in the head!
I had so much asspain with Win7's shitty CPU scheduler on FEM and FDTD simulations.
An 8 HT-core setup just reduces overall performance by up to 50% (a HALF!!!) compared to a NO-HT setup.
A simple task manager checkup showed that Win7 was just putting low-threaded processes on the same core without an option. It's just the simplest increment scheduler they have for Intels.
Hale_ru - Monday, February 6, 2012
So, it is recommended to use the AMD optimization patches (only the core-addressing one, not the C6 state patch) on any Win7 machine using simple multithreaded mathematics.
hescominsoon - Friday, January 27, 2012
Shared cpu modules that have to compete for resources? Reminds me of HT v1. IMO this is basically a quad core chip with the other 4 threads available when the primary core isn't being used all the way. I've looked at the design and it's just nonsensical. This is not a futuristic bet but a desperate attempt at differentiation...with most likely disastrous results. AMD has now painted themselves into a niche product instead of a high performance general purpose cpu.
dgingeri - Friday, January 27, 2012
I like the design of the Bulldozer overall, but there is obviously a bottleneck that is causing problems with this chip. I'm thinking the bottleneck is likely the decoder. It can only handle 4 instructions per clock cycle, and feeds 2 full Int cores and the FP unit shared between the two cores. I bet increasing the decoder capacity would show a really big increase in speed. What do you think?
bji - Saturday, January 28, 2012
I think that if it was something easy, AMD would have done it already or is in the process of doing it.
I also think that it's unlikely that it's something as simple as improving the decoder throughput, because one would think that AMD would have tried that pretty early on when evaluations of the chip showed that the decoder was limiting performance.
These chips are incredibly complicated and all of the parts are incredibly interrelated. The last 25% of IPC is incredibly hard to achieve.
bobbozzo - Friday, January 27, 2012
The hotfixes also support Windows Server 2008 R2 Service Pack 1 (SP1).