Vega’s NCU: Packed Math, Higher IPC, & Higher Clocks

As always, I want to start at the heart of the matter: the shader core. In some respects AMD has not significantly altered their shader core since the launch of the very first GCN parts 5 years ago. Various iterations of GCN have added new instructions and features, but the shader core has remained largely constant, and IPC within the shader core itself hasn’t changed too much. Even Polaris (GCN 4) followed this trend, sharing its whole ISA with GCN 1.2.

With Vega, this is changing. By just how much remains to be seen, but it is clear that even with what we see today, AMD is already undertaking the biggest change to their shader core since the launch of GCN.

Meet the NCU, Vega’s next-generation compute unit. As we already learned from the PlayStation 4 Pro launch and last month’s Radeon Instinct announcement, AMD has been working on adding support for packed math formats for future architectures, and this is coming to fruition in Vega.

With their latest architecture, AMD is now able to handle a pair of FP16 operations inside a single FP32 ALU. This is similar to what NVIDIA has done with their high-end Pascal GP100 GPU (and Tegra X1 SoC), and it allows for potentially massive improvements in FP16 throughput. If a pair of instructions is compatible – and by compatible, vendors usually mean instruction-type identical – then those instructions can be packed together on a single FP32 ALU, increasing the number of lower-precision operations that can be performed in a single clock cycle. This is an extension of AMD’s FP16 support in GCN 1.2 & GCN 4, where the company supported FP16 data types for the memory/register space savings, but FP16 operations themselves were processed no faster than FP32 operations.
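To make the packing idea concrete, here is a minimal software sketch of two FP16 values sharing one 32-bit word, with a single "packed add" operating on both halves at once. It only illustrates the data layout; it is not AMD's hardware implementation, and the helper names (`pack_half2`, `unpack_half2`, `packed_add`) are hypothetical.

```python
import struct

def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word (lo in the low 16 bits)."""
    return struct.unpack("<I", struct.pack("<ee", lo, hi))[0]

def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 lanes."""
    return struct.unpack("<ee", struct.pack("<I", word))

def packed_add(a, b):
    """Apply the same operation (an add) to both lanes of two packed words."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    return pack_half2(a_lo + b_lo, a_hi + b_hi)

x = pack_half2(1.5, 2.0)
y = pack_half2(0.25, 4.0)
print(hex(x))                           # 0x40003e00
print(unpack_half2(packed_add(x, y)))   # (1.75, 6.0)
```

Note that both lanes go through the same operation, which matches the "instruction-type identical" condition described above: an add in one half paired with a multiply in the other would not fit this model.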

And while previous announcements may have spoiled that AMD offers support for packed FP16 formats, what we haven’t known until today is that they will also support a high-speed path (analogous to packed FP16) for 8-bit integer operations. INT8 is a data format that has proven especially useful for neural network inference – the actual execution of trained neural networks – and is a major part of what has made NVIDIA’s most recent generation of Pascal GPUs so potent at inferencing. By running dot products and certain other INT8 operations along this path, INT8 performance can be greatly improved.
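AMD has not described the instructions behind this INT8 path, so the following is only a sketch of the general idea, modeled on the four-way INT8 dot product with 32-bit accumulate that NVIDIA exposes on Pascal as DP4A. The function names are hypothetical, and Python integers do not wrap at 32 bits the way a hardware accumulator would.

```python
import struct

def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]

def unpack_int8x4(word):
    """Recover the four signed 8-bit lanes from a 32-bit word."""
    return struct.unpack("<4b", struct.pack("<I", word))

def dot4_accumulate(a_word, b_word, acc):
    """Multiply matching lanes, sum the four products, add to an accumulator."""
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))

a = pack_int8x4([1, -2, 3, 4])
b = pack_int8x4([5, 6, -7, 8])
print(dot4_accumulate(a, b, 10))   # 1*5 - 2*6 - 3*7 + 4*8 + 10 = 14
```

The appeal for inference is that one 32-bit register slot carries four weights or activations, so a single operation of this kind does the work of four multiplies and four adds.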

Though before we get too far – and this is a longer discussion to have closer to Vega’s launch – it’s always important to note when and where these faster operations can be used in consumer workloads, as the odds are most of you reading this are thinking gaming. While FP16 operations can be used for games (and in fact are in the mobile space), in the PC space they are virtually never used. When PC GPUs made the jump to unified shaders in 2006/2007, the decision was made to do everything at FP32 since that’s what vertex shaders typically required to begin with, and it’s only recently that anyone has bothered to look back. So while there is some long-term potential here for Vega’s fast FP16 math to become relevant for gaming, at the moment it wouldn’t do anything. Vega will almost certainly live and die in the gaming space based on its FP32 performance.

Moving on, the second thing we can infer from AMD’s slide is that a CU on Vega is still composed of 64 ALUs, as 128 FP32 ops/clock (64 ALUs each performing a fused multiply-add, which counts as two operations) is the same rate as a classic GCN CU. Nothing here is said about how the Vega NCU is organized – if it’s still four 16-wide vector SIMDs – but we can at least reason out that the total size hasn’t changed.
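The arithmetic behind that inference is short enough to write down. This sketch assumes the usual convention that a fused multiply-add counts as two operations. The Fury X figures (64 CUs at 1050MHz) are that card's public specifications rather than anything from AMD's Vega slide, and are used only as a sanity check.

```python
ALUS_PER_CU = 64   # from the article: a Vega CU still has 64 ALUs
OPS_PER_FMA = 2    # convention: a fused multiply-add counts as two operations

def ops_per_clock(alus=ALUS_PER_CU, values_per_alu=1):
    """Peak ops per clock for one CU; values_per_alu=2 models packed FP16."""
    return alus * OPS_PER_FMA * values_per_alu

def peak_tflops(cus, clock_mhz, values_per_alu=1):
    """Theoretical peak throughput in TFLOPS for a whole GPU."""
    return cus * ops_per_clock(values_per_alu=values_per_alu) * clock_mhz / 1e6

print(ops_per_clock())                      # 128 FP32 ops/clock per CU
print(ops_per_clock(values_per_alu=2))      # 256 with two FP16 values per ALU
print(peak_tflops(cus=64, clock_mhz=1050))  # Fury X check: about 8.6 TFLOPS
```

These are peak rates only; they say nothing about how often real workloads can keep every ALU busy.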

Finally, along with outlining their new packed math formats, AMD is also confirming, at a high level, that the Vega NCU is optimized for both higher clockspeeds and a higher IPC. It goes without saying that both of these are very important to overall GPU performance, and it’s an area where, very broadly speaking, AMD hasn’t compared to NVIDIA too favorably. The devil is in the details, of course, but a higher clockspeed alone would go a long way towards improving AMD’s performance. And as AMD’s IPC has been relatively stagnant for some time here, improving it would help AMD put their relatively sizable lead in total ALUs to good use. AMD has always had a good deal more ALUs than a comparable NVIDIA chip, but getting those ALUs to all do useful work outside of corner cases has always been difficult.

That said, I do think it’s important not to read too much into this last point, especially given how AMD has drawn this slide. It’s fairly muddled whether “higher IPC” means a general increase in IPC, or if AMD is counting their packed math formats as the aforementioned IPC gain.

Geometry & Load Balancing: Faster Performance, Better Options

As some of our more astute readers may recall, when AMD launched GCN 1.1 they mentioned that at the time, GCN could only scale out to 4 of what AMD called their Shader Engines: the logical workflow partitions within the GPU that bundle together a geometry engine, a rasterizer, CUs, and a set of ROPs. And when the GCN 1.2 Fiji GPU was launched, while AMD didn’t bring up this point again, they still held to a 4 shader engine design, presumably because GCN 1.2 did not remove this limitation.

Fiji 4x Shader Engine Layout

With Vega, however, it looks like that limitation has finally gone away. AMD is teasing that Vega offers an improved load balancing mechanism, which pretty much directly hints that AMD can now efficiently distribute work over more than 4 engines. If so, this would represent a significant change in how the GCN architecture works under the hood, as work distribution is very much all about the “plumbing” of a GPU. Of the few details we do have here, AMD has told us that they are now capable of looking across draw calls and instances, to better split up work between the engines.

This in turn is a piece of the bigger picture when looking at the next improvement in Vega, which is AMD’s geometry pipeline. Overall AMD is promising a better than 2x improvement in peak geometry throughput per clock. Broadly speaking, AMD’s geometry performance in recent generations hasn’t been poor (it’s one of the areas where Polaris improved even further), but it has also hurt them at times. So this is potentially important for removing a bottleneck to squeezing more out of GCN.

And while AMD's presentation and comments themselves don't go into detail on how they achieved this increase in throughput, buried in a footnote of AMD's slide deck is this nugget: "Vega is designed to handle up to 11 polygons per clock with 4 geometry engines." So this clearly reinforces the idea that the overall geometry performance improvement in Vega comes from improving the throughput of the individual geometry engines, as opposed to simply adding more as the scalability improvements presumably allow. This is one area where Vega’s teaser paints a tantalizing view of future performance, but in the process raises further questions on just how AMD is doing it.
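Working the footnote through gives a rough per-engine figure. The two Vega numbers below come from the quoted footnote; the baseline of one polygon per clock per geometry engine for earlier 4-engine GCN parts is an assumption on my part, not something stated in AMD's materials.

```python
VEGA_POLYS_PER_CLOCK = 11          # from the slide deck footnote
VEGA_GEOMETRY_ENGINES = 4          # from the slide deck footnote
ASSUMED_BASELINE_PER_ENGINE = 1.0  # assumption: earlier GCN, polygons/clock/engine

per_engine = VEGA_POLYS_PER_CLOCK / VEGA_GEOMETRY_ENGINES
speedup = per_engine / ASSUMED_BASELINE_PER_ENGINE

print(per_engine)   # 2.75 polygons per clock per engine
print(speedup)      # 2.75x under the assumed baseline
```

Under that assumption the result lines up with AMD's "better than 2x" claim without needing any additional geometry engines.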

In any case, however AMD is doing it, the updated geometry engines will also feature one more advancement, which AMD is calling the primitive shader. A new shader stage that runs in place of the usual vertex and geometry shader path, the primitive shader allows for the high-speed discarding of hidden/unnecessary primitives. Along with improving the total primitive rate, discarding primitives is the next best way to improve overall geometry performance, especially as game geometry gets increasingly fine, and very small, overdrawn triangles risk choking the GPU.
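Since AMD has not said how the primitive shader decides what to discard, the sketch below shows only the generic idea of early primitive culling: dropping back-facing and zero-area triangles before they reach the rasterizer. The counter-clockwise front-face convention and the function names are assumptions for illustration, not a description of Vega.

```python
def signed_area2(tri):
    """Twice the signed screen-space area of a triangle of (x, y) vertices."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)

def cull(triangles):
    """Keep front-facing triangles (assumed counter-clockwise, area > 0)."""
    return [t for t in triangles if signed_area2(t) > 0]

triangles = [
    ((0, 0), (4, 0), (0, 4)),   # counter-clockwise: kept
    ((0, 0), (0, 4), (4, 0)),   # clockwise (back-facing): discarded
    ((0, 0), (1, 1), (2, 2)),   # degenerate, zero area: discarded
]
print(len(cull(triangles)))     # 1
```

Every triangle rejected this way is one that never consumes rasterizer or pixel shader time, which is why discard rate matters as much as raw primitive rate.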

AMD isn’t offering any real detail on how the primitive shader operates, and as a result I’m curious whether this is something that AMD’s shader compiler can automatically add, or if it requires developers to specifically call it (like they would vertex and geometry shaders).


  • jjj - Thursday, January 5, 2017

    I get to some 500mm2 too.
    As for cost, Fiji was on 28nm, much cheaper on an area basis. They will get better yield for packaging but the overall costs will be much higher than with Fiji. They could have a SKU with 25% of the CUs disabled at 499$ and the full Vega 10 at 699 to .. the upper limit depends on where perf lands vs Titan X.
  • Yojimbo - Friday, January 6, 2017

    One must also consider the difference in cost of the process that each chip is made with as well as the price of the larger capacity of HBM2 compared to smaller amount of HBM1. But a smaller interposer will definitely help.

    It'd better smoke Fiji performance-wise. I think it's a foregone conclusion that it will, though. Judging by their released benchmarks I'm guessing it'll be modestly faster than the 1080, on average, but significantly slower than a forthcoming 1080Ti. So it'll probably be priced around the same $650 as the Fury X.
  • jjj - Friday, January 6, 2017

    You are rushing into judging perf in a big way.
    AMD is demoing Vega, just showing it up and running, NOT showing perf. Not quite sure why so many don't get that.

    You can bet that it is using early software and that it's clocked at 1-1.2GHz only for now. They are not gonna show their hand months before retail availability. They are just showing 4k gaming at 60FPS or better.

    You can look at min perf Vega 10 should offer in many ways.
    1. It is assumed to have the same number of "cores" as the Fury X but almost 50% higher clocks. Then, some 15% architectural and software gains would put it on par with Titan X. So the question is, if they can do better and by how much and at what power.
    A note here, given the number of cores, the scaling from 28nm to 14/16nm is very poor. They are clearly sacrificing area for huge gains elsewhere, you wouldn't do it otherwise. 16 and 8 bit is one thing but there must be a lot more.
    2. Even the Polaris architecture scaled to 12.5TFLOPS and this memory bandwidth, would match the Titan X. So the question is, how much better can they do.

    One could argue that AMD is nuts and Vega is worse than Polaris but that's less than reasonable.
  • eachus - Saturday, January 14, 2017

    The 12.5 TFLOPS is Vega 10 doing 16-bit floating point, not Polaris. And this, in part, explains why Vega 10 doesn't scale well compared to Fury X. There is a new compute engine with lots of new features. I'm hoping that one of them is a fused multiply-add, but we do know that it can do 16-bit floating point twice as fast as single precision (32 bit).

    So remember that, for now, any comparisons you see with Vega will almost certainly use none of the new features. AMD can and should spend most of their driver effort right now on changes that benefit Polaris. Some support for Vega? Sure, but don't expect support for the new features in drivers until just before (or after) Vega ships.
  • SunnyNW - Thursday, January 5, 2017

    With AMD's countdown on the site and their marketing almost everyone thought there would be a live stream incoming, but instead it was a countdown to an NDA lift, like really AMD? I believe AMD made a huge mistake here because there are A LOT of angry people out there who were waiting with much anticipation. I kinda thought something was odd with the time being 6 a.m. PST and had told my friends as much but still they were convinced too that it was for a live stream.

    Why has AMD marketing been SO bad for so long? I do think they have gotten a little better in recent history but they are way behind Nvidia for example. It's like AMD has kept the same marketing team for years and years and just does not want to let them go for who knows what reason. In business when you fail for so long you are usually replaced. Has AMD just been replacing one under-performer after another or just keeping the same people? I just don't get it.

    And this is coming from someone who is an AMD supporter and would like to see them succeed.
  • Michael Bay - Thursday, January 5, 2017

    Marketing is where careers go to die. If you can even call those things careers.
    What else do you expect?
  • alphasquadron - Thursday, January 5, 2017

    Marketing for large companies is usually one of the higher paid positions. The ability to persuade a mass of people gets you paid well.
  • eldakka - Friday, January 6, 2017

    like a cult leader?
  • The_Assimilator - Friday, January 6, 2017

    This is exactly my biggest problem with AMD: their marketing is, to be frank, absolute bullshit. It's why I don't trust them to deliver on their promises until I see independent reviewers validate them, while if Intel says their next chip is going to have an extra 10% IPC or NVIDIA says their next GPU is going to be 20% faster, I'm inclined to believe those companies.

    AMD marketing needs to understand the difference between hype and lies, and that consistent lies hurt them far more than they help.
  • Meteor2 - Friday, January 6, 2017

    This. I'm tiring of AMD 'teasing' products.

    When will it be available? What can it do? How much will it cost? This is what I want to know.
