AMD's Steamroller Detailed: 3rd Generation Bulldozer Core
by Anand Lal Shimpi on August 28, 2012 4:39 PM EST
Posted in: CPUs, Bulldozer, AMD, Steamroller
Today at the annual Hot Chips conference, AMD’s new CTO Mark Papermaster unveiled the first details about the Steamroller x86 CPU core.
Steamroller is the third instantiation of AMD’s Bulldozer architecture, first conceived in the mid-2000s and finally brought to market in late 2011. Committed to this architecture for at least one more design after Steamroller, AMD has settled on roughly yearly updates to the architecture. For 2012 we have the introduction of Piledriver, the optimized Bulldozer derivative that formed the CPU foundation for AMD’s Trinity APU. By the end of the year we’ll also see a high-end desktop CPU without processor graphics based on Piledriver.
Piledriver saw a switch to hard edge flip flops, which allowed for a considerable decrease in power consumption at the cost of more careful design and validation work. Performance didn’t change, but AMD saw a 10% - 20% reduction in active power. Piledriver also brought some scheduling efficiency improvements, with prefetching and branch prediction as its two other major design improvements.
Steamroller is designed to keep the ball rolling. It takes fundamentals from the Bulldozer/Piledriver architectures and offers a healthy set of evolutionary improvements on top of them. In Intel speak, Steamroller wouldn’t be a tick, as it isn’t accompanied by a significant process change (28nm bulk is pretty close to 32nm SOI), but it isn’t a tock either, as the architecture is enhanced but largely unchanged. Steamroller fits somewhere in between those two extremes.
Front End Improvements
One of the biggest issues with the front end of Bulldozer and Piledriver is the shared fetch and decode hardware. This table from our original Bulldozer review helps illustrate the problem:
Front End Comparison
| | AMD Phenom II | AMD FX | Intel Core i7 |
| Instruction Decode Width | 3-wide | 4-wide | 4-wide |
| Single Core Peak Decode Rate | 3 instructions | 4 instructions | 4 instructions |
| Dual Core Peak Decode Rate | 6 instructions | 4 instructions | 8 instructions |
| Quad Core Peak Decode Rate | 12 instructions | 8 instructions | 16 instructions |
| Six/Eight Core Peak Decode Rate | 18 instructions (6C) | 16 instructions | 24 instructions (6C) |
Steamroller addresses this by duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. Don’t expect a doubling of performance since it’s rare that a 4-issue front end sees anywhere near full utilization, but this is easily the single largest performance improvement from all of the changes in Steamroller.
The penalties are pretty obvious: area goes up, as does power consumption. However, the tradeoff is likely worth it, and both of these downsides can be offset in other areas of the design, as you’ll soon see.
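To make the decode arithmetic concrete, here is a minimal sketch of the peak decode rate with and without a shared decoder. It is an idealized model, not a simulation of either chip: the function name and parameters are hypothetical, cores are assumed to be packed two per module, and the "dedicated" figures simply follow from giving every core its own 4-wide decoder as described above.

```python
def peak_decode_rate(active_cores, decode_width=4, cores_per_module=2,
                     shared_decoder=True):
    """Idealized peak instructions decoded per cycle across all active cores."""
    if shared_decoder:
        # Bulldozer/Piledriver style: one decoder per module, so two cores in
        # the same module have to split its decode width between them.
        decoders = -(-active_cores // cores_per_module)  # ceiling division
    else:
        # Steamroller style: every core brings its own decoder.
        decoders = active_cores
    return decoders * decode_width


for cores in (1, 2, 4, 8):
    shared = peak_decode_rate(cores, shared_decoder=True)
    dedicated = peak_decode_rate(cores, shared_decoder=False)
    print(f"{cores} cores: shared={shared}, dedicated={dedicated}")
```

With the shared decoder the model reproduces the AMD FX column of the table (4, 4, 8 and 16 instructions). With dedicated decoders the peak doubles once both cores in a module are active, which is an upper bound on decode bandwidth rather than a prediction of real-world performance.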
Steamroller inherits the perceptron branch predictor from Piledriver, but in an improved form for better performance (mostly in server workloads). The branch target buffer is also larger, which contributes to a reduction in mispredicted branches by up to 20%.
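For readers unfamiliar with the idea, the sketch below shows a textbook-style perceptron branch predictor. It is not AMD's design, whose details haven't been disclosed here; the class name, table size, history length and training threshold are all arbitrary assumptions chosen for illustration, and real hardware would use small saturating weights rather than unbounded integers.

```python
class PerceptronPredictor:
    """Textbook-style perceptron predictor (illustrative, not AMD's design)."""

    def __init__(self, num_entries=64, history_len=8, threshold=30):
        self.threshold = threshold  # arbitrary illustrative training threshold
        # One weight vector per table entry: a bias plus one weight per history bit.
        self.weights = [[0] * (history_len + 1) for _ in range(num_entries)]
        # Global history of recent outcomes: +1 = taken, -1 = not taken.
        self.history = [1] * history_len

    def _output(self, pc):
        w = self.weights[pc % len(self.weights)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        # Predict taken when the weighted sum is non-negative.
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[pc % len(self.weights)]
        # Train on a misprediction or when confidence is below the threshold.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w[0] += t
            for i, h in enumerate(self.history):
                w[i + 1] += t * h
        # Shift the newest outcome into the global history.
        self.history = self.history[1:] + [t]


# Train on a branch that strictly alternates taken / not taken.
predictor = PerceptronPredictor()
for i in range(200):
    predictor.update(pc=0x400, taken=(i % 2 == 0))
```

After training, the predictor has learned that the outcome is the opposite of the most recent history bit, a pattern a simple per-branch counter would struggle with. The branch target buffer mentioned above is a separate structure that caches where taken branches go, and is not modeled here.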
Execution Improvements
AMD streamlined the large, shared floating point unit in each Steamroller module. There’s no change in the execution capabilities of the FPU, but there’s a reduction in overall area. The MMX unit now shares some hardware with the 128-bit FMAC pipes. AMD wouldn’t offer many specifics, saying only that the sharing applies to mutually exclusive MMX/FMA/FP operations and thus shouldn’t result in a performance penalty.
The reduction of pipeline resources is supposed to deliver the same throughput at lower power and area, basically a smarter implementation of the Bulldozer/Piledriver FPU.
There’s no change to the integer execution units themselves, but there are other changes that improve integer performance.
The integer and floating point register files are bigger in Steamroller, although AMD isn’t being specific about how much they’ve grown. Load operations (two operands) are also compressed so that they only take a single entry in the physical register file, which helps increase the effective size of each RF.
The scheduling windows also increased in size, which should enable greater utilization of existing execution resources.
Store to load forwarding sees an improvement. Compared to its predecessors, Steamroller is better at detecting interlocks, cancelling the load and getting the data from the store instead.
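As a rough illustration of what store to load forwarding means, the sketch below models a store queue in front of memory. It is a simplified assumption-laden model, not Steamroller's actual mechanism: the class and method names are hypothetical, addresses must match exactly, and partial overlaps, access sizes and speculation are ignored.

```python
from collections import deque


class StoreQueue:
    """Toy model of store to load forwarding (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory    # dict: address -> value
        self.pending = deque()  # (address, value), oldest store first

    def store(self, address, value):
        # The store waits in the queue until it retires to memory.
        self.pending.append((address, value))

    def load(self, address):
        """Return (value, forwarded) for a load that follows all queued stores."""
        # Search from the youngest store to the oldest for a matching address.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value, True  # data comes straight from the store
        return self.memory.get(address, 0), False  # no match: read memory

    def retire_oldest(self):
        # Commit the oldest pending store to memory.
        addr, value = self.pending.popleft()
        self.memory[addr] = value


memory = {0x10: 1}
queue = StoreQueue(memory)
queue.store(0x10, 7)
queue.store(0x10, 9)
print(queue.load(0x10))  # (9, True): forwarded from the youngest matching store
```

The load never has to wait for the pending stores to reach memory, which is the latency the hardware is trying to hide. Detecting the match quickly and reliably is the part the article says has improved.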
thehat2k5 - Wednesday, August 29, 2012
at the very least we are suggesting Radeon 7870 or GTX570. How much are those where you come from? Up here, there is no way i can build you a computer for $469 that we will put our name on and certify it for BF3 at Ultra!
Origin64 - Thursday, August 30, 2012
You need at least 2GB of vram to run BF3 on ultra. And the flops to match it of course. Good luck getting that under 400 bucks.
Hardin4188 - Wednesday, August 29, 2012
Is it ok if I laugh at all seven of your employees?
thehat2k5 - Wednesday, August 29, 2012
it sure is ok, as long as you can link me the parts you are using to make this miracle machine;)
Novulux - Wednesday, August 29, 2012
I built a PC for my younger brother with $250 of parts from Microcenter, and gave him two HD 5770s bought on Ebay for ~$110 for both. Bought an HDD from Newegg for $70. He only plays at 1440x900 though.
Spunjji - Thursday, August 30, 2012
Yes, and presumably not at Ultra settings, unless he likes his textures being swapped in/out of RAM constantly..?
CeriseCogburn - Wednesday, August 29, 2012
Close your doors hat2k5 - you don't know what you're doing.
LOL
Not surprised.
http://www.tomshardware.com/reviews/fx-4100-core-i...
hapkiman - Sunday, September 2, 2012
Not trying to stir the pot, but I get around 60 FPS consistently on BF3 with ALL settings on Ultra. Everything, including having ambient occlusion turned on.
I have a overclocked XFX Radeon HD 6950, which is not a $500 card.
I have a i7 3770, and 16GB of 1600MHz RAM, and the game and all maps are loading from an Intel 520 160GB SSD.
Believe it or not but its the truth. I just finished a game on Back to Karkand map, and I averaged 50-60 FPS, with spikes going well over 60.
AssBall - Wednesday, August 29, 2012
Cool story, Bro.
CeriseCogburn - Wednesday, August 29, 2012
You just can't make up the crap that amd fan boys do, since they are clueless.
http://www.tomshardware.com/reviews/fx-4100-core-i...
The i3 2100 STOMPS the fx4100 .