The March of More Silicon: Connectivity Matters

To date, all of Intel’s leading-edge Xeon Scalable processors have been monolithic, i.e. built from one piece of silicon. A single piece of silicon has its advantages, namely a fast in-silicon interconnect between cores and a single power interface to manage. However, as we move to smaller and smaller process nodes, one large piece of silicon has downsides: big dies are hard to manufacture in volume without defects, which increases the cost of the high core count models, and the design ultimately runs into a hard size limit.

The alternative to a large monolithic design is to cut it up into smaller pieces of silicon and connect them together. The main advantages here are better silicon yield, but also configurability, by using different silicon for different functions as needed. With a multi-die design, you can ultimately end up with more silicon than a monolithic design can provide – the reticle (manufacturing) limit for a single silicon die is ~700-800 mm2, whereas a multi-die processor can put several smaller silicon dies together and easily push over 1000 mm2. Intel has stated that each of its silicon tiles is ~400 mm2, creating a total of around ~1600 mm2. But the major downside to multi-die designs is connectivity and power.
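
As a quick back-of-the-envelope check of those numbers, here is a minimal sketch (Python, purely illustrative; the reticle limit and tile area are the approximate figures quoted above):

```python
# Rough silicon-area comparison: monolithic reticle limit vs a four-tile package.
# Figures are the approximate numbers quoted above.
reticle_limit_mm2 = 800      # top end of the ~700-800 mm2 single-die limit
tile_area_mm2 = 400          # Intel's stated ~400 mm2 per Sapphire Rapids tile
num_tiles = 4

total_tiled_area = tile_area_mm2 * num_tiles
print(f"Total tiled silicon: ~{total_tiled_area} mm2")                          # ~1600 mm2
print(f"vs. max monolithic die: ~{total_tiled_area / reticle_limit_mm2:.1f}x")  # ~2.0x
```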

The simplest way to package two chips together on a substrate is through intra-substrate connections, which essentially amount to PCB traces. This is a high-yielding process, however it has the two drawbacks listed above: connectivity and power. It costs more energy to send a bit over a PCB connection than through silicon, and the bandwidth is much lower because the signals cannot be packed as densely. As a result, without careful planning, a multi-die product has to be aware of how far away its data is at any one time, an issue few monolithic products have.

The way around this is a faster interconnect. Rather than routing that connectivity through the substrate and the package, what if it went through silicon anyway? By placing the connected dies on a piece of silicon, such as an interposer, the connecting traces get better signal integrity and lower power. This interposer-based approach is commonly referred to as 2.5D packaging. It costs a bit more than standard packaging technology (there’s also scope for active interposers with logic in them), and it has another limitation in that the interposer has to be bigger than all the silicon put together. But overall, this is a better option, especially if you want your multi-die product to act as if it were monolithic.

Intel decided that the best way to beat the downsides of interposers while still getting the benefits of an effectively monolithic silicon design was to create super-small interposers that live inside the substrate. By pre-embedding one in the right location, with the right packaging tools two chips can be placed on this small Embedded Multi-Die Interconnect Bridge (EMIB), and voilà, a system that works as close to a monolithic design as is physically possible.

Intel has worked on the EMIB technology for over a decade. The development has had three major milestones from our perspective: (1) being able to embed the bridge into a package with high yield, (2) being able to place big silicon dies on the bridge at high yield, and (3) being able to put two high-powered dies next to each other on a bridge. It is that third part that I think Intel has struggled with the most – with two high-powered dies next to each other, especially if the dies have different coefficients of thermal expansion and different thermal properties, there is the potential of weakening the substrate around the bridge or the connections to the bridge itself. Almost all of Intel’s EMIB products so far have connected a CPU/GPU to high-bandwidth memory, which is an order of magnitude lower power than what it’s being connected to. Because of that, I wasn’t convinced putting two high-powered tiles together was possible, at least until Intel announced a multi-die FPGA connected by EMIB using two high-powered FPGA tiles in late 2019. From that point on, it was only a matter of time before Intel enabled the technology on its CPU product stack. We’re finally getting that with Sapphire Rapids.

10x EMIB on Sapphire Rapids

Sapphire Rapids is going to use four tiles connected with 10 EMIB connections at a 55-micron connection pitch. Normally you might think that a 2x2 array of tiles, with its four tile-to-tile connections, would need an equal number of EMIBs per connection – at 2 EMIBs per connection, that would be eight – so why is Intel quoting 10 here? That comes down to the way Sapphire Rapids is designed.

Because Intel wants SPR to look monolithic to every operating system, it has essentially cut its inter-core mesh horizontally and vertically, so that each hop through an EMIB is seen purely as the next step on the mesh. But Intel’s monolithic designs are not symmetric in either of those dimensions – features like the PCIe or UPI blocks usually sit on the edges, and not in the same place in every corner. Intel has told us that this is similarly the case in Sapphire Rapids, and one dimension uses 3 EMIBs per connection while the other dimension uses 2 EMIBs per connection.

By avoiding strict rotational symmetry in its design, and by not having a central IO hub, Intel is leaning heavily into acting as a monolithic die – leaning so heavily it’s almost falling over to do so. As long as the EMIB connections are consistent between tiles, software shouldn’t have to worry, although until we get further details it’s hard to speculate on exactly why this layout was chosen without going through the motions of figuring out Intel’s mesh designs and how the extra parts all connect together. SPR sounds like a monolithic design cut up, rather than a ground-up multi-die design, if that makes sense.

Intel announced earlier this year that it will make an HBM version of Sapphire Rapids using four HBM tiles. These will also be connected by EMIB, one bridge per HBM tile.
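
As a quick sanity check on those bridge counts, here is a minimal sketch (Python, purely illustrative) using the figures above: 3 EMIBs per tile-to-tile connection in one dimension, 2 in the other, plus one bridge per HBM tile on the HBM variant.

```python
# EMIB bridge count for the 2x2 tile layout described above.
connections_per_dimension = 2   # a 2x2 grid has two tile-to-tile connections in each dimension
emibs_dim_a = 3                 # bridges per connection in one dimension
emibs_dim_b = 2                 # bridges per connection in the other dimension

compute_tile_emibs = connections_per_dimension * emibs_dim_a + connections_per_dimension * emibs_dim_b
print(compute_tile_emibs)              # 10, matching Intel's quoted figure

hbm_tiles = 4                          # the HBM version adds four HBM tiles, one bridge each
print(compute_tile_emibs + hbm_tiles)  # 14 bridges on the HBM variant
```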

Tiles Tiles Tiles

Intel did give an insight into what exactly each of the separate tiles will have inside it, although this was extremely high level:

Each tile has:

  • Cores, cache, and mesh
  • A memory controller with 2x64-bit DDR5 channels
  • UPI links
  • Accelerator links
  • PCIe links

In this situation, and throughout the presentation, it looks like all four tiles are equal, with the rotational symmetry I mentioned above. Making silicon that does this, in the way presented, isn’t as easy as mirroring the design and printing it onto a silicon wafer. The crystal plane of the wafer limits how designs can be built, and so any mirrored layout has to be redesigned completely. As a result, Intel confirmed that it has to use two different sets of masks to build Sapphire Rapids, one for each of the two dies it has to make. It can then rotate each of these two dies to build the 2x2 tile grid as shown.
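
To make that arrangement concrete, here is a minimal sketch (Python, purely illustrative) of how two die designs plus rotation can fill the 2x2 grid – the ‘A’/‘B’ naming and the 180-degree rotations are my assumptions for illustration, not Intel’s own labels.

```python
# Hypothetical mapping of the 2x2 tile grid onto Intel's two die designs.
# Assumption for illustration: each design is reused, rotated 180 degrees,
# for the diagonally opposite position; only two mask sets are required.
tile_grid = {
    ("top", "left"):     {"die": "A", "rotation_deg": 0},
    ("bottom", "right"): {"die": "A", "rotation_deg": 180},
    ("top", "right"):    {"die": "B", "rotation_deg": 0},
    ("bottom", "left"):  {"die": "B", "rotation_deg": 180},
}

mask_sets = {tile["die"] for tile in tile_grid.values()}
print(f"Mask sets required: {len(mask_sets)}")   # 2
```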

It’s worth comparing this to AMD’s first-generation EPYC, which also used a 2x2 chiplet arrangement, albeit with connectivity through the package. AMD avoided the need for multiple silicon designs by making its die rotationally symmetric – AMD built four die-to-die interfaces into the silicon, but only used three in each rotation. It’s a cheaper solution (and one that was right for AMD’s financial situation at the time) at the cost of die area, but it also enables a level of simplicity. AMD’s central IO die approach in newer EPYCs moves away from this issue entirely. From my perspective, it’s something Intel is going to have to move towards if it wants to scale beyond SPR, but also for a different reason.

As it stands, each of the tiles holds 128 bits of DDR5 memory interface, for a total of 512 bits across all four tiles. Physically, this means we will see eight 64-bit memory controllers* for either eight or sixteen memory modules per socket in a system. That’s perfectly fine for versions of Sapphire Rapids with all four compute tiles.
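
Summing that up as a quick sketch (Python, purely illustrative; the two 32-bit sub-channels per DDR5 module are explained in the footnote below):

```python
# DDR5 interface math for a four-tile Sapphire Rapids package, per the figures above.
bits_per_tile = 128                        # 2x 64-bit DDR5 channels per tile
tiles = 4
total_bits = bits_per_tile * tiles         # 512 bits across the package

controllers_64bit = total_bits // 64       # eight 64-bit memory controllers
subchannels_32bit = controllers_64bit * 2  # each DDR5 module carries two 32-bit channels

print(controllers_64bit)   # 8  -> eight or sixteen modules per socket (one or two per controller)
print(subchannels_32bit)   # 16 x 32-bit DDR5 channels
```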

However, we know that the Sapphire Rapids product stack is going to have to scale down to fewer cores. In the past, Intel would create three different monolithic silicon variants to cater to these markets and optimize silicon output, but all the processors would have the same memory controller count.

This means that if SPR is going to offer versions with fewer cores, Intel is either going to create dummy tiles without any cores on them, while keeping the PCIe/DDR5 as required, or those lower core count parts are quite simply going to have fewer memory controllers. That’s going to be a pain for system manufacturers who want to build catch-all systems, because they’re going to have to build for both extremes.

The other alternative is that Intel builds monolithic versions of SPR with all eight memory channels for the lower core count designs. But at this time, Intel has not disclosed how it is going to cater to those markets.

*Technically DDR5 puts two 32-bit channels on a single module, but as yet the industry doesn’t have a term to differentiate between a module with one 64-bit memory channel on it and a module with two 32-bit memory channels on it. The word ‘channel’ has often been interchangeable with ‘memory slot’ to date, but this will have to change.

Comments

  • Silver5urfer - Tuesday, August 31, 2021

    Finally something innovative from Intel after years of abandoning HEDT and Xeon leadership. I would give credit to Intel here because there's no big.LITTLE BS scam. Mirroring the die mask design, using separate masks on top of that, and running EMIB across such large silicon – damn, it looks super complicated vs AMD's chiplet strategy.

    Now for the worst part, the I/O controller and memory controller. That is going to be an issue for this design. Intel's mesh had power problems and inferior x86 performance on SKL Xeon, and Ice Lake solved that issue, but the core problem is that AMD moved the Zen-based chiplets of the EPYC 7000 series to a central I/O die and memory controller design, eliminating the NUMA node performance hit. With shorter traces due to EMIB this looks great, but it will still have the hopping issues of Zen-style chiplet processors.

    So an SPR-based HEDT LGA4xxx socket is coming, but when? 2022? Zen 3 Threadripper Chagall / Genesis Peak is coming this year. And Zen 3 based V-Cache Milan EPYC will be coming next year once the factories start producing them, and those will trickle down to AM4 socket processors. SPR needs to prove a lot; Zen 4 is dropping soon with 96C and beastly IPC on top of a 12-channel memory design on TSMC 5N.

    IPC is a whole other big equation, BUT most important is how the Intel 7 / 10nm design compares vs the TSMC 7N based EPYC in terms of clock potential and efficiency-to-performance ratios. Especially given that Intel had to cut its x86 cores down to those small, inferior SKL-class crap cores on the LGA1700 socket to keep up with the rising power consumption of its x86 processor designs. This one maxes out at 56C apparently, with each tile at 14C – a big shame all these 14C tiles couldn't make it to LGA1700; they would have been perfect for desktop compute, but for those stupid thin and light BGA junk sales they axed it and shoved those efficiency designs into the desktop LGA platform.
  • Kamen Rider Blade - Tuesday, August 31, 2021

    The NUMA domain situation is going to be interesting with the 4x memory controllers split across the dies having to cross domain boundaries.

    And there appear to be 2x cross-tile interconnects on each tile that hold a CHA (Caching and Home Agent) and an LLC (Last Level Cache) to handle resolution of coherency across multiple processors.

    Each core has a tiny CHA to check against the larger dedicated CHA & LLC to improve scalability according to the number of cores, memory controllers, or the sub-NUMA clustering mode. The address space is interleaved across the different CHAs, which act like a single logical agent.

    So that's an interesting solution to their data locality issues when multi-threading or having cross-core communication.
  • repoman27 - Tuesday, August 31, 2021

    Why do you presume there will be any NUMA domain boundaries on-package? The whole point of going with EMIB and tiles vs. conventional chiplets on an organic substrate is that the EMIBs essentially act as "long wires" bridging the metal layers of the individual dies and extending the mesh without having to pipe the data over a separate link with a SerDes / PHY at each end.
  • JayNor - Tuesday, August 31, 2021

    The leaked Emerald Rapids slides show CXL.mem. With the info that Sapphire Rapids doesn't implement CXL.mem, that finally makes sense.

    CXL.mem isn't mandatory, according to the ServeTheHome article "Compute Express Link or CXL What it is and Examples" from May 21, 2021.
  • JayNor - Tuesday, August 31, 2021

    The CXL.mem feature in Emerald Rapids can be seen in the slides in the adoredtv transcription article, "intel-rapidly-getting-worse", from June 29, 2021.
  • Wereweeb - Tuesday, August 31, 2021

    Not what I wanted to see, but EMIB is a pretty cool piece of tech
  • Blastdoor - Wednesday, September 1, 2021

    Itanium, Optane, Knight's Landing, and Intel's original 10nm plans were all cool pieces of tech, too. Yet they lost to less cool tech like AMD64, flash, GPUs, and TSMC 7nm.

    After reading all this I'm left wondering if Intel designed this thing to showcase how cool EMIB is rather than EMIB enabling the optimal design (aka, a misalignment between cart and horse).

    I look forward to seeing how this super cool glue compares to the combo of AMD's Elmer's plus a big slab of SRAM plus high-yielding multi-use chiplets.

    Let's also see how yields and watts look.
  • wira6444 - Tuesday, August 31, 2021

    I thought Intel hated "GLUE"?
  • Kamen Rider Blade - Tuesday, August 31, 2021

    They changed their mind.
  • Oxford Guy - Thursday, September 2, 2021

    'Truth as convenience' is one of the defining qualities of corporate reasoning.
