Intel Xeon Sapphire Rapids: How To Go Monolithic with Tiles
by Dr. Ian Cutress on August 31, 2021 10:00 AM EST
The March of More Silicon: Connectivity Matters
To date, all of Intel’s leading-edge Xeon Scalable processors have been monolithic, i.e. one piece of silicon. Having a single piece of silicon has its advantages, namely a fast in-silicon interconnect between cores and a single power interface to manage. However, as we move to smaller and smaller process nodes, one large piece of silicon has downsides: large dies are hard to manufacture in volume without defects, which increases the cost of the high-core-count models, and the maximum size of a single die ultimately caps how much silicon a processor can have.
The alternative to a large monolithic design is to cut it up into smaller bits of silicon and connect them together. The main advantage here is better silicon yield, along with configurability from having different silicon for different functions as needed. With a multi-die design, you can ultimately end up with more silicon than a monolithic design can provide – the reticle (manufacturing) limit for a single silicon die is ~700-800 mm², and with a multi-die processor several smaller silicon dies can be put together, easily pushing over 1000 mm². Intel has stated that each of its silicon tiles is ~400 mm², for a total of around 1600 mm². But the major downside to multi-die designs is connectivity and power.
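To put rough numbers on the yield and area argument, here is a minimal sketch using a simple Poisson defect model. The defect density is an illustrative assumption, not an Intel or foundry figure, and `poisson_yield` is a hypothetical helper. Only the tile area, tile count, and reticle range come from the text above.

```python
import math

RETICLE_LIMIT_MM2 = 800.0   # upper end of the ~700-800 mm² range quoted above
TILE_AREA_MM2 = 400.0       # Intel's approximate per-tile figure
TILE_COUNT = 4

# ASSUMPTION: illustrative defect density, not a disclosed figure.
DEFECT_DENSITY_PER_CM2 = 0.1


def poisson_yield(area_mm2, defects_per_cm2=DEFECT_DENSITY_PER_CM2):
    """Expected fraction of defect-free dies under a simple Poisson model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


total_area = TILE_AREA_MM2 * TILE_COUNT
tile_yield = poisson_yield(TILE_AREA_MM2)
reticle_yield = poisson_yield(RETICLE_LIMIT_MM2)

print(f"total silicon: {total_area:.0f} mm² (reticle limit ~{RETICLE_LIMIT_MM2:.0f} mm²)")
print(f"modelled yield, one 400 mm² tile:   {tile_yield:.1%}")
print(f"modelled yield, one 800 mm² die:    {reticle_yield:.1%}")
```

Whatever defect density is plugged in, the smaller die always yields better in this model. The absolute percentages depend entirely on the assumed density and should not be read as real Sapphire Rapids figures.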
The simplest way to package two chips in a substrate together is through intra-substrate connections, or what essentially amounts to PCB traces. This is a high-yielding process, however it has the two drawbacks listed above: connectivity and power. It costs more energy to send a bit over a PCB connection than it does through silicon, and the bandwidth is much lower because the signals cannot be as densely packed. As a result, without careful planning, a multi-die connected product will have to be aware of how far away data is at any one time, an issue few monolithic products have.
The way around this is with a faster interconnect. Rather than putting that connectivity through the substrate, through the package, what if it was through silicon anyway? By placing these connected dies on a piece of silicon, such as an interposer, the connectivity traces have better signal integrity and better power efficiency. Using an interposer in this way is commonly referred to as 2.5D packaging. It costs a bit more than standard packaging technology (there’s also scope for active interposers with logic), and there is another limitation in that the interposer has to be bigger than all the silicon put together. But overall, this is a better option, especially if you want your multi-die product to act as if it were monolithic.
Intel decided that the best way to beat the downsides of interposers but still get the benefits of an effective monolithic silicon design was instead to create super small interposers that live inside the substrate. By pre-embedding them in the right location, with the right packaging tools two chips could be placed on this small Embedded Multi-Die Interconnect Bridge (EMIB), and voilà, a system that works as close to a monolithic design as is physically possible.
Intel has worked on the EMIB technology for over a decade. The development has had three major milestones from our perspective: (1) being able to embed the bridge into a package with a high yield, (2) being able to place big silicon die on the bridge at high yield, and (3) being able to put two high-powered die next to each other on a bridge. It is that third part that I think Intel has struggled with the most – by having two high-powered die next to each other, especially if the die have different coefficients of thermal expansion and different thermal properties, there is the potential of weakening the substrate around the bridge or the connections to the bridge itself. Almost all of Intel’s products that have used EMIB so far have been about connecting a CPU/GPU to high-bandwidth memory, which is an order of magnitude lower power than what it’s being connected to. Because of that, I wasn’t convinced putting two high-powered tiles together was possible, at least until Intel announced a multi-die FPGA connected by EMIB using two high-powered FPGA tiles in late 2019. From that point on, it was only a matter of time before Intel enabled the technology on its CPU product stack. We’re finally getting that with Sapphire Rapids.
10x EMIB on Sapphire Rapids
Sapphire Rapids is going to be using four tiles connected with 10 EMIB connections using a 55-micron connection pitch. Normally you might think that a 2x2 array of tiles would need equal EMIBs per tile-to-tile connection, so in this case with 2 EMIBs per connection, that would be eight – why is Intel quoting 10 here? That comes down to the way Sapphire Rapids is designed.
Because Intel wants SPR to look monolithic to every operating system, Intel has essentially cut its inter-core mesh horizontally and vertically. That way each connection through the EMIB is seen purely as the next step on the mesh. But Intel’s monolithic designs are not symmetric in either of those dimensions – usually features like the PCIe or QPI are on the edges, and not in the same place in every corner. Intel has told us that in Sapphire Rapids, this is similarly the case, and one dimension is using 3 EMIBs per connection while the other dimension is using 2 EMIBs per connection.
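The count of ten follows from simple grid arithmetic, sketched below. The helper names are hypothetical, and which direction gets three bridges is an assumption (Intel has not said); for a 2x2 grid the total is the same either way.

```python
def adjacent_pairs(rows, cols):
    """Count neighbouring tile pairs in a rows x cols grid, per direction."""
    side_by_side = rows * (cols - 1)   # pairs that share a vertical edge
    stacked = (rows - 1) * cols        # pairs that share a horizontal edge
    return side_by_side, stacked


def total_emibs(rows, cols, per_side_by_side, per_stacked):
    """Total bridges when each direction uses a fixed number per tile pair."""
    side_by_side, stacked = adjacent_pairs(rows, cols)
    return side_by_side * per_side_by_side + stacked * per_stacked


# Naive guess: 2 EMIBs on every tile-to-tile connection.
naive = total_emibs(2, 2, 2, 2)

# As described: 3 per connection in one direction, 2 in the other.
# ASSUMPTION: the side-by-side direction is the one with 3.
as_described = total_emibs(2, 2, 3, 2)

print(f"naive: {naive}, as described: {as_described}")
# prints "naive: 8, as described: 10"
```

A 2x2 grid has two tile-to-tile connections in each direction, so the total is 2 × 3 + 2 × 2 = 10.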
By avoiding strict rotational symmetry in its design, and without a central IO hub, Intel is leaning heavily towards acting as a monolithic die – leaning so heavily it’s almost falling over to do so. As long as the EMIB connections are consistent between tiles, software shouldn’t have to worry, although until we get further details, it’s hard to say exactly how this works without trying to reverse-engineer Intel’s mesh designs and how the extra parts all connect together. SPR sounds like a monolithic design cut up, rather than a ground-up multi-die design, if that makes sense.
Intel announced earlier this year that it will make an HBM version of Sapphire Rapids, using four HBM tiles. These will also be connected by EMIB, one per tile.
Tiles Tiles Tiles
Intel did give an insight into what exactly each of the separate tiles will have inside it, however this was extremely high level:
Each tile has:
- Cores, cache, and mesh
- A memory controller with 2x64-bit DDR5 channels
- UPI links
- Accelerator links
- PCIe links
In this situation, and throughout the presentation, it looks like all four tiles are equal, with the rotational symmetry I mentioned above. To make silicon that does this, in the way presented, isn’t as easy as mirroring the design and printing that onto a silicon wafer. The crystal plane of the wafer limits how designs can be built, and so any mirroring has to be redesigned completely. As a result, Intel confirmed that it has to use two different sets of masks to build Sapphire Rapids, one each for the two dies it has to make. It can then rotate each of these two dies to build the 2x2 tile grid as shown.
It’s worth comparing this to AMD’s first-generation EPYC, which also used a 2x2 chiplet method, albeit with connectivity through the package. AMD escaped the need for multiple silicon designs by making its die rotationally symmetric – AMD built four die-to-die interfaces on the silicon, but only used three for each rotation. It’s a cheaper solution (and one that was right for AMD’s financial situation at the time) at the cost of die area, but it also enables a level of simplicity. AMD’s central IO die method in newer EPYCs moves away from this issue entirely. From my perspective, it’s something Intel is going to have to move towards if it wants to scale beyond SPR, and also for a different reason.
As it stands, each of the tiles holds 128 bits of DDR5 memory interfaces, for a total of 512 bits across all four tiles. Physically, this means we will see eight 64-bit memory controllers* for either eight or sixteen memory modules per socket in a system. That’s perfectly fine for versions of Sapphire Rapids with all four compute tiles.
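The memory arithmetic above can be written out as a short sketch. All constant names are hypothetical. The figures come from the tile description and the footnote; the one or two modules per controller are the standard configurations behind the "eight or sixteen" count.

```python
TILES = 4
CONTROLLERS_PER_TILE = 2      # "2x64-bit DDR5 channels" per tile
BITS_PER_CONTROLLER = 64
DDR5_SUBCHANNEL_BITS = 32     # per the footnote: two 32-bit channels per module

bits_per_tile = CONTROLLERS_PER_TILE * BITS_PER_CONTROLLER
total_bits = TILES * bits_per_tile
controllers = TILES * CONTROLLERS_PER_TILE

# Counting the way the DDR5 footnote does: 32-bit channels across the socket.
subchannels = total_bits // DDR5_SUBCHANNEL_BITS

# One or two memory modules attached to each 64-bit controller.
modules_by_population = {per_ctrl: controllers * per_ctrl for per_ctrl in (1, 2)}

print(f"{bits_per_tile} bits per tile, {total_bits} bits per socket")
print(f"{controllers} x 64-bit controllers, or {subchannels} x 32-bit channels")
print(f"modules per socket: {modules_by_population}")
```

This also shows why the terminology matters: the same socket can be described as having 8 or 16 "channels" depending on whether 64-bit controllers or 32-bit DDR5 channels are being counted.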
However, we know that the Sapphire Rapids processor offering is going to have to scale down to fewer cores. In the past, Intel would create three different monolithic silicon variations to cater for these markets and optimize silicon output, but all the processors would have the same memory controller count.
This means that if SPR is going to offer versions with fewer cores, Intel is either going to create dummy tiles without any cores on them that still keep the PCIe/DDR5 as required, or quite simply those lower core count parts are going to have fewer memory controllers. That’s going to be a pain for system manufacturers who want to build catch-all systems, because they’re going to have to build for both extremes.
The other alternative is that Intel has monolithic versions of SPR with all 8 memory channels for lower core count designs. But at this time, Intel has not disclosed how it is going to cater to those markets.
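To make the trade-off concrete, here is a hedged sketch of how the 64-bit controller count would vary under the options discussed above. This is speculation expressed as arithmetic, not a disclosed Intel product plan, and the function names are hypothetical.

```python
CONTROLLERS_PER_TILE = 2   # from the per-tile description
FULL_TILE_COUNT = 4


def controllers_fewer_tiles(compute_tiles):
    """Option: drop tiles entirely, losing their memory controllers too."""
    return compute_tiles * CONTROLLERS_PER_TILE


def controllers_dummy_tiles(compute_tiles):
    """Option: replace missing compute tiles with core-less tiles that keep DDR5."""
    return FULL_TILE_COUNT * CONTROLLERS_PER_TILE


def controllers_monolithic():
    """Option: a separate monolithic die with all 8 channels."""
    return FULL_TILE_COUNT * CONTROLLERS_PER_TILE


for tiles in range(1, FULL_TILE_COUNT + 1):
    print(
        f"{tiles} compute tile(s): "
        f"fewer tiles -> {controllers_fewer_tiles(tiles)} controllers, "
        f"dummy tiles -> {controllers_dummy_tiles(tiles)} controllers"
    )
print(f"monolithic low-core die -> {controllers_monolithic()} controllers")
```

Only the first option changes the memory configuration with tile count, which is what would force system manufacturers to design for both a two-controller and an eight-controller socket.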
*technically DDR5 puts two 32-bit channels on a single module, but as yet the industry doesn’t have a term to differentiate between a module with one 64-bit memory channel on it vs. a module with two 32-bit memory channels on it. The word ‘channel’ has often been interchangeable with ‘memory slot’ to date, but this will have to change.
Noctrn - Tuesday, August 31, 2021
It seems 99% likely that we're looking at 8 cores per tile for a max of 32 cores in the package. Intel has so far proven incapable of making a single piece of functional silicon with more than 8 large cores on it using anything smaller than 14nm.
Kamen Rider Blade - Tuesday, August 31, 2021
All the rumors for Sapphire Rapids are pointing to 14 Cores per Tile in a 4x4 Grid.
2x of those nodes are for Vertical/Horizontal Interconnect management.
repoman27 - Tuesday, August 31, 2021
Not exactly. 4x4 grid with 15 CPU core tiles and one IMC tile for the two DDR5 controllers. See my post further up for additional details and link to actual die shots.
thestryker - Tuesday, August 31, 2021
Except for the entire Ice Lake Xeon line which scales up to 40 cores...
dullard - Tuesday, August 31, 2021
Psst, don't go putting facts into discussions! It throws the rest of us off.
Kamen Rider Blade - Tuesday, August 31, 2021
"*technically DDR5 puts two 32-bit channels on a single module, but as yet the industry doesn’t have a term to differentiate between a module with one 64-bit memory channel on it vs. a module with two 32-bit memory channels on it. The word ‘channel’ has often been interchangeable with ‘memory slot’ to date, but this will have to change."
What about calling them 2x 32-bit Sub-Channels?
TeXWiller - Tuesday, August 31, 2021
JEDEC calls them independent channels, that is 2 channels per module. Hence, an eight DIMM server board has a 16 channel memory system. I don't know why these terms are in flux everywhere.
Ian Cutress - Wednesday, September 1, 2021
Because people have been interchanging module and channel for years, and one module = one channel. The fact that DDR5 moves down to 32-bit channels from 64-bit channels means I'm going to be sprinkling the word controller around to be absolutely specific.
kpb321 - Tuesday, August 31, 2021
The lower core count versions will certainly be interesting.
If the comments are correct and it is 14 cores per tile then you'd have 56 cores max. Certainly you could see them doing 52 and 48 core versions from die harvesting with 1 or 2 cores disabled per tile but the further below that you go the less it makes sense. On the other hand looking at the high level chip diagrams you pretty much have IO going around the entire outside of the cluster of tiles. I'm not sure how much smaller you can make the tiles and still have enough room for all the IO. What's the min core count going to look like? Are there going to be a 16 or 20 core version? Are they still going to use tiles for those or design a different monolithic die?
Spunjji - Friday, September 3, 2021
It would be logical to infer that they're going to need at least one more monolithic design, to allow for designs with fewer tiles with the same number of memory channels.
Unless they just leave the lower-core designs with less memory bandwidth, which would be a product segmentation strategy of sorts, I guess?