Electromigration: Why AMD Ryzen Current Boosting Won't Kill Your CPU
by Dr. Ian Cutress on June 9, 2020 9:10 AM EST
Where there is a will to get extra performance out of a CPU, there is often a way. Either through end-user overclocking or motherboard vendors tweaking settings to improve their stock performance, at the end of the day everyone wants better performance, and for a multitude of reasons. This insatiable drive for peak performance, however, means that some of these tweaks and adjustments can start to skirt the lines of what is ‘in specification’. And as a result, we sometimes see methods of increasing processor performance that clearly deliver on their promises, but perhaps at the expense of thermals or longevity.
To this end, it has recently come to light that motherboard vendors have been taking advantage of a setting on AMD motherboards to misrepresent the current delivered to the CPU. By doing so, they are able to increase the processor's power headroom, ultimately allowing for higher performance at the cost of higher thermals. To be sure, this kind of tweaking isn’t new, but recent events have led to no shortage of confusion over what exactly is going on, and what the ramifications are for AMD Ryzen processors. So to try to clarify matters, here’s our take on the situation.
The Old Fashioned Way: Spread Spectrum, MultiCore Enhancement, PL2
One of the common themes I've noticed throughout my time at AnandTech as our motherboard editor and now our CPU editor is the lengths to which motherboard vendors will go in order to get increased performance over the competition. We were the first outlet to break out features such as MultiCore Enhancement, way back in August 2012, which led to higher-than-specified all-core frequencies, or in some cases, outright overclocks. But the history of motherboard vendors adjusting and tweaking features for performance goes further back than that, such as varying the base frequency from 100 MHz to 104.7 MHz with Spread Spectrum, leading to increased performance on systems that can support it.
More recently, on Intel platforms, we’ve seen vendors increase their turbo power limits so that the motherboard can sustain the highest turbo for as long as the world remains in existence, just because the motherboard vendors are overengineering the power delivery in order to support it. In the past couple of weeks, we have also found examples of motherboards ignoring Intel’s new Thermal Velocity Boost requirements, which is something we'll be delving into more in a future article.
In short, motherboard vendors want to be the best, and that often means pushing the limits of what is considered the ‘base specification’ of the processor. As we’ve regularly discussed on topics like this with Intel’s turbo power limits, the differentiation between a ‘specification’ and a ‘recommended setting’ can get quite blurred – for Intel, the turbo power listed in the documents is a recommended setting, and any value the motherboard is set to is technically ‘in specification’. The point at which Intel considers it overclocking, it seems, is when the peak turbo frequency is increased.
Tweaking AM4 Above and Beyond
So now we move on to the news of the day, with motherboard manufacturers now attempting to tweak AMD-based Ryzen motherboards in order to drive higher performance. As thoroughly explained over on the HWiNFO forums by The Stilt and summarized here, AM4 platforms typically have three defined limiters: Package Power Tracking (PPT), which indicates the power threshold that is allowed to be delivered to the socket; Thermal Design Current (TDC), which is the maximum current delivered by the motherboard's voltage regulators under thermal limits; and Electrical Design Current (EDC), which is the maximum current that can be delivered by the voltage regulators at any time. Some of these values are compared to metrics derived internally in the CPU or externally in the power delivery, to see if these limits have been triggered.
In order to calculate the software-based power measurement against which PPT is compared, the power management co-processor takes the value of current from the voltage regulator management controller. This isn’t an actual value of current, but a dimensionless value (0 to 255) designed so that 0 represents 0 amps and 255 represents the peak amps that the VRMs can handle. The power management co-processor on the CPU then performs its power calculation (power in watts = voltage in volts multiplied by current in amps).
The dimensionless value range has to be calibrated for each motherboard layout, based on the componentry used (VRMs, controllers) as well as the tracing, the board layers, and the quality of the design. In order to get an accurate scaling value for this dimensionless range, a motherboard vendor should accurately probe the correct values and then write the firmware to use that look-up table in the system power calculations.
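As a rough illustration of the calculation described above, the following Python sketch converts a 0 to 255 telemetry code into amps and then into watts. The function names, the linear scale, and the 200 A full-scale figure are assumptions made for the example; real firmware may use a per-board look-up table rather than a straight line.

```python
def telemetry_to_amps(raw_code: int, full_scale_amps: float) -> float:
    """Map a dimensionless 0-255 telemetry code to amps.

    Assumes a linear scale: 0 -> 0 A, 255 -> full_scale_amps.
    """
    if not 0 <= raw_code <= 255:
        raise ValueError("raw_code must be in the range 0..255")
    return raw_code / 255 * full_scale_amps


def estimated_power_watts(raw_code: int, full_scale_amps: float, volts: float) -> float:
    """Power in watts = voltage in volts multiplied by current in amps."""
    return volts * telemetry_to_amps(raw_code, full_scale_amps)


# Hypothetical board: 200 A full scale, CPU at 1.30 V, telemetry code 128.
print(f"{estimated_power_watts(128, 200.0, 1.30):.1f} W")
```

The point of the sketch is that the result depends entirely on the full-scale calibration: if the value the CPU is given corresponds to less current than is really flowing, the computed power comes out low by the same proportion.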
This means that there is a potential way to fiddle with the way that the system interprets the peak power value of the processor. Motherboard vendors can reduce this dimensionless value of current in order to make the processor and the power management co-processor think that there is less power going to the CPU. As a result, the package power tracking (PPT) limit has not yet been reached, and more power can be supplied. This allows the processor to turbo further than was originally intended by AMD.
This has knock-on effects. The processor will be consuming more power, mostly in the form of increased amps, leading to more heat being generated and increased thermals. Because the processor is turboing further (by being allowed to draw more power than what the software is reporting) the processor will also perform better in benchmarks.
As The Stilt points out, if you are running a CPU with a base TDP of 105 W and a PPT value of 142 W, under normal circumstances you should expect to see 142 W power being reported by the CPU at stock settings. However, if the dimensionless current value is only 75% of the real-world current, then the real-world power consumption is actually ~190 W, which is the 142 W value divided by the 0.75 factor. Assuming that none of the other limits have been hit (TDC, EDC), the processor will report only 75% of the power it is actually drawing, which is the cause of a lot of the confusion.
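The arithmetic in this example can be written out as a short Python sketch. The function name and the `reporting_fraction` parameter are made up for illustration; the 142 W and 0.75 figures are the ones quoted above.

```python
def actual_power_watts(reported_watts: float, reporting_fraction: float) -> float:
    """Estimate real power draw when only a fraction of the current is reported.

    reporting_fraction is reported current / real current (0.75 means the
    CPU is shown 75% of the current that is really flowing).
    """
    if not 0 < reporting_fraction <= 1:
        raise ValueError("reporting_fraction must be in (0, 1]")
    return reported_watts / reporting_fraction


real = actual_power_watts(142.0, 0.75)
print(f"Reported 142.0 W, estimated real draw {real:.1f} W")
# Reported 142.0 W, estimated real draw 189.3 W
```

In this case the chip draws roughly 47 W more than it believes it is drawing, while still appearing to sit exactly at its 142 W PPT limit.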
Is it Out of Specification?
If we are considering PPT, TDC, and EDC to be the be-all and end-all of AMD specifications for power draw and current draw, then yes, this is out of specification. However, PPT by its very nature is going beyond TDP, so we get into this mysterious world of how to define "turbo", similar to what we’ve covered in detail with Intel.
As we’ve previously discussed, in Intel land, the peak power consumed while in a turbo mode is only provided by Intel to motherboard vendors as a ‘recommended value’. As a result, Intel chips will actually accept any value for that peak power limit, including reasonable values like 200 W or 500 W, but even unreasonable values like 4000 W. More often than not (and depending on the processor) a chip might hit other limits first; but for the high-end models, it is certainly worth tracking. Meanwhile the turbo duration, Tau, which defines the size of the bucket of energy that Turbo can draw from, can also be extended: instead of the default of between 8 and 56 seconds, Tau can be drawn out to what's effectively an infinite amount of time. According to Intel, this is all within specification, if the motherboard manufacturers can build boards that can provide it.
What Intel considers out of specification is when the CPU goes beyond the frequencies listed in the turbo tables for Turbo Boost 2.0 (or TBM 3.0, or Thermal Velocity Boost). When the processor runs above the frequency as defined by the turbo tables, then Intel considers this overclocking, and has no obligation to honor the chip's warranty.
The problem is that while we can try and transplant the same rules to the AMD situation, AMD doesn’t really use Turbo Tables as such. AMD processors work by attempting to offer the highest possible frequency given the power and current limits at any given time. As more cores are ramped, the power per core decreases, and the overall frequency decreases. We get into the minutiae of frequency envelope tracking, which can get more complex given that AMD can work in 25 MHz steps rather than 100 MHz steps like Intel.
AMD also uses features that push a chip's frequency above the turbo frequency listed on the specifications page. If you wanted to strictly argue about those being overclocking, then judging by the number on the box, it could very well be. AMD purposefully blurs the lines here, but the upside is often more performance.
Is My CPU At Risk?
To answer the big question right off the bat then, no, your CPU is not at risk. For regular users with enough cooling running at stock frequency, there is no issue to any degree that will matter within the expected lifetime of the product.
Most modern x86 processors come with either a three-year warranty for retail boxed parts, or are sold as OEM parts with a one-year warranty. Past those support periods, while AMD or Intel won’t replace the processor in the event of failure, most processors are expected to live well into the 15+ year range. We are still very happily able to test old CPUs in old motherboards, even though they have gone out of service for a long time (and more often than not, it is the old motherboard capacitors that tend to blow up, not the CPU).
When a CPU wafer comes off the manufacturing line, the company gets a reliability report about those processors, which helps it get a sense of potential avenues for binning those CPUs. This will include elements such as voltage/frequency response, but also characteristics relating to electromigration.
Aside from physical damage, or thermal limits being disabled and the CPU cooking itself, the main way for a modern processor to become non-functional is through electromigration. This is the act of electrons making their way through the wires on a processor and ever so slightly bumping into the atoms in that wire, moving them out of the crystal lattice. It is in itself a fairly rare event (how long have the wires been in your house, for example), however at the small scale it can change how a processor works.
Adapted From "Electromigration" by Linear77, License: CC BY 3.0
By moving a metal atom from a wire out of place in a crystal lattice, the cross-section of the wire, at that point, is reduced. This increases the resistance, as resistance is inversely proportional to the cross-sectional area of the wire. If enough atoms are moved out of place, the wire disconnects and the processor is no longer usable. This also happens in transistors, and is commonly referred to as transistor aging, with the transistor needing a higher voltage over the lifetime of the product (voltage drift).
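For reference, the inverse relationship mentioned above is the standard formula for the resistance of a uniform conductor. The symbols are the usual textbook ones rather than anything taken from the article: ρ is the resistivity of the metal, L the length of the wire segment, and A its cross-sectional area.

```latex
R = \rho \, \frac{L}{A}
```

Halving the cross-section at a given point doubles the resistance of that segment, and for the same current the local current density (current divided by A) doubles as well.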
The amount of electromigration increases under certain conditions – temperature, use, and voltage. One of the main ways to get over the increased resistance is to increase the voltage, which in turn increases the temperature of the processor. It becomes a positive feedback loop, building itself for worse electrical performance, over the lifetime of the processor.
With higher voltage (energy per electron), and higher current density (electrons per unit area), this means that there are more chances for an electron migration event to occur. This can get worse at higher temperatures, and all these elements act as different factors when it comes down to the number of electrons that might have enough energy to enable an electromigration event. For anyone studying reaction kinetics, this is a similar principle to concentration but with a variable energy per incident.
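The article does not give a formula, but the dependence on current density and temperature is commonly summarized by Black's equation, an empirical model for the median time to failure of a wire under electromigration. It is offered here only as a rough sketch: A is a constant that depends on the wire's material and geometry, j is the current density, n is an empirically fitted exponent (often quoted as between 1 and 2), E_a is an activation energy, k is Boltzmann's constant, and T is the absolute temperature.

```latex
\mathrm{MTTF} = A \, j^{-n} \exp\!\left(\frac{E_a}{k T}\right)
```

In this model a higher current density or a higher temperature both shorten the expected lifetime. Voltage does not appear explicitly in this form; it acts through the current and heat that it drives.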
So this is bad, right? Well, it used to be. As processor manufacturers and semiconductor fabs have iterated through the design of logic gates in CMOS and FinFET processors, there have been active countermeasures put in place to reduce the levels of electromigration (or reduce the effect of the levels of electromigration). As we shrink process nodes, and voltages decrease, it also becomes less of an issue, although the fact that wires also decrease in area has the opposite effect. But as mentioned, the manufacturers now actively take steps to reduce the effect of electromigration inside a processor.
Electromigration has not been an issue for most consumer semiconductor products for a substantial time. The only time I personally have been affected by electromigration issues is when I owned a Sandy Bridge-based 2011 Core i7-2600K, that I used to use for overclocking competitions at 5.1 GHz under some extreme cooling scenarios. It eventually got to a point, after a couple of years, where it needed more voltage to run at stock.
But that was a processor I ran to the ragged edge. Modern day equipment is designed to run for a decade or longer. What we are seeing with these numbers, while there is an increase in thermals due to the increased power, isn’t actually a sizable shift. In The Stilt’s report, because the processor sees that it has extra power headroom, it raises the voltage slightly in order to get the +75 MHz extra that the budget will allow, which increases the average voltage from 1.32 volts to 1.38 volts during a CineBench R20 run. The peak voltage, which matters a lot for electromigration, only moves from 1.41 volts to 1.42 volts. The overall power was increased 25 W, which makes for around 30 A more. That is nowhere near an order-of-magnitude change.
So if I end up with a motherboard that adjusts this perceived current value, will it brick my processor? No. Not unless you have something else seriously wrong with your setup (such as thermals). Within the given lifetime of that product, and the next decade after, it is not likely to make a difference. And as stated previously, even if this did affect electromigration on a large scale, the processor manufacturers have built in mechanisms to deal with it. The only way to actively monitor it, as an end user, would be to observe your average and peak voltage values over the course of years, and see if the processor automatically adjusts itself to compensate.
It is perhaps worth mentioning that the dimensionless current value isn’t adjustable by the end user – it is something the motherboard controls through BIOS updates. If you are a user that overclocks, you are doing more towards electromigration than this adjustment ever will. For those concerned about thermals, I suspect you are already monitoring and adjusting your BIOS limits as needed for your system.
How To Check if My Motherboard Is Doing It
First, you need to be running a stock system. Changing any of the regular PPT/TDC/EDC limits already means that the system is being adjusted, so we need to focus only on users with stock systems.
Next, acquire the latest version of HWiNFO, and a test that will cause 100% load on the system, such as CineBench R20.
Inside HWiNFO, there is a metric called “CPU Power Reporting Deviation”. Observe that number while the system is at the full load. A normal motherboard should say ‘100%’, while a motherboard with an adjusted current/VRM reported value will say something below 100%.
Just to clarify, this metric is only valid:
- If your AMD Ryzen CPU is running at full stock settings in the BIOS. No OC, no adjustments to power or current limits.
- When your CPU is running at a full 100% load, such as Cinebench.
If your system does not meet both of these requirements, then the value of the Power Reporting Deviation does not mean anything. If it does meet them and the value is under 100%, then your motherboard is affected. Please let us know in the comments below.
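The check above can be sketched as a small Python helper. This does not read anything from HWiNFO; the readings are typed in by hand, and the function name, return labels, and the rule that actual power is reported power divided by the deviation are assumptions for illustration, following the same 142 W / 0.75 arithmetic used earlier in the article.

```python
from typing import Optional, Tuple


def check_power_reporting(
    deviation_percent: float,
    reported_watts: float,
    stock_settings: bool,
    full_load: bool,
) -> Tuple[str, Optional[float]]:
    """Interpret a 'CPU Power Reporting Deviation' reading entered by hand.

    Returns a (verdict, estimated_actual_watts) pair. The estimate is None
    when the reading is not meaningful.
    """
    # The metric is only valid at full stock settings and at 100% load.
    if not (stock_settings and full_load):
        return ("not meaningful", None)
    if deviation_percent <= 0:
        raise ValueError("deviation_percent must be positive")
    # Assumption: actual power = reported power / (deviation / 100).
    estimated_actual = reported_watts / (deviation_percent / 100.0)
    if deviation_percent < 100.0:
        return ("under-reporting", estimated_actual)
    return ("not under-reporting", estimated_actual)


print(check_power_reporting(75.0, 142.0, stock_settings=True, full_load=True))
print(check_power_reporting(75.0, 142.0, stock_settings=False, full_load=True))
```

The helper applies the article's threshold of exactly 100% literally. Readings on real systems fluctuate, and the article does not say how far below 100% counts as significant, so any tolerance band would be a further assumption.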
What Are My Options?
If your motherboard is juicing the processor, but you are happy with the thermal performance of your cooler and the power draw at the wall, then enjoy the extra performance. Even if it is only 75 MHz.
AMD doesn’t necessarily need to comment on the matter, as this is an issue with the motherboard manufacturers. Users might want to probe their motherboard manufacturer, and ask for a BIOS update. Users who want to return their motherboards will have to check on their retailer, as it might depend on where it was purchased.
While this appears to break PPT specifications, it doesn’t actually go beyond any frequency specifications (which are ill defined). In that sense it may be similar to how motherboard manufacturers play with power limits on Intel systems, which is to say that it's something that's "just there". Though it would probably be handy to get a BIOS option to enable/disable it.
- The Stilt at HWiNFO Forums
- MultiCore Enhancement (August 2012)
CiccioB - Tuesday, June 9, 2020 - link
Of course he's not.
If you take the same current absorption of a 32nm CPU and a 7nm CPU you would see that the current density that passes through the connection lines is much higher. Current density is a parameter that increases electromigration.
If you take a 300mm^2 die and a today 80mm^2 die and calculate the energy density in them you will see that the latter dies suffer much more than old ones. High temperature (and hot spot) increases electromigration.
If you look at voltages used in 32nm PP and in actual ones you will see they are almost the same. But inner tracks are now much more thinner. Voltage is more than a linear factor that increases electromigration.
So comparing an old CPU to actual ones and say the problem is no more is actually the opposite of what physics is suggesting.
Intel encountered electromigration problem in their first 22nm chipset, just as an example.
Fataliity - Tuesday, June 9, 2020 - link
Old chips were planar. Todays chips are finfets. Old chips used aluminum todays use cobalt. Also today's 80mm2 chip has much much more wire inside of it, the power delivery is much more advanced and spread out. And the HP cells used for the high frequency parts of the chip are overbuilt.
Do some research and shut up.
CiccioB - Wednesday, June 10, 2020 - link
Finfet and planar transistor type have nothing to do with electromigration or its mitigation.
80mm^2 must have more wiring inside because the power absorption is the same as the old 300mm^2 chips and if they haven't so many lines and metal inside they would melt in few seconds.
But those traces are much narrower than old chips. And dimension for electromigration counts. Really. Do some research.
WaltC - Wednesday, June 10, 2020 - link
FINFET means the transistors are able to shut themselves down individually--the technology exists to keep energy from coming off of the CPU in waves of excess thermal energy caused by current leakage--its purpose is to stop current leakage, etc. It works very well. As Ian mentions, smaller nm processes today require less energy than older designs--coupled with FINfet and *other things* it's whole different ballgame in that regard. And no, a stock motherboard isn't going to create electromigration in normal use. That would be nuts...;)
PeterCollier - Thursday, June 11, 2020 - link
The biggest factor in electromigration is the use of resonant clock meshes. The resonance phenomenon accelerates failure similar to the Tacoma narrows bridge.
Fataliity - Tuesday, June 9, 2020 - link
And that's not even mentioning bamboo structures, the Blech effect, EDA tools designed around electromigration, doping, etc. There are so many things done to prevent this stuff. Chips are literally designed and verified to avoid this. Sure, it can slip through the cracks in rare cases. But most chips are engineered to not suffer this fate.
The science is wayyyyyyyyyy more advanced than 22nm was.
jim bone - Tuesday, June 9, 2020 - link
The chips are designed for what they're designed for; often a 10 year Gaussian mean life. If you increase the current density you will reduce the statistical life of the chip below what it was designed for, and not necessarily in a predictable way; you may fall off a aging cliff. Engineers will *always* waive some EM violations - no tapeout is clean in this regard.
I guess I agree on average the headline claim is true most of the time; until it isn't.
FWIW I'm an IC designer working at a major semiconductor company working in the most advanced nodes available right now. In the past I've worked for AMD designing parts of their chips. I look at EM and aging results several times a year for as part of standard tapeout sign-off.
CiccioB - Wednesday, June 10, 2020 - link
The science is what is put into a PP.
Tell me what makes 7nm PP less prone to electromigration than 22nm.
Moreover I have just a simple idea: if the chip could sustain that increased stress for the desired life time it was assigned, why the producer, in this case AMD, has not chosen to sell it with those increased specs?
More performance at zero costs... but that was not the case.
My great scientist, tell me why AMD chose to lose money.
Lord of the Bored - Wednesday, June 10, 2020 - link
CiccioB, I am glad you asked. I have a reasonable answer: marketing.
AMD has had a reputation for making hot, inefficient chips of late. They'd very much like to be rid of that reputation, so they're setting a lower wattage than is strictly necessary.
And then they're treating that as a hard limit, so they can score points against Intel with their "this is our power limit, except we'll probably draw twice this and not tell you" policy.
CiccioB - Thursday, June 11, 2020 - link
And so it is for this percent point of efficiency they created a "marketing issue" that had resonance in all sites that treat technology by releasing a BIOS that could not make the CPU run at the advertised speeds to try to limit the voltage to the minimum possible?
My feeling is that this sudden bothering on voltages value trying to keep them as low as possible, even lower than needed and so preventing correct boosting, is quite suspect.
They historically have never been short on voltage in any of their chips, be them CPUs or GPUs.