In the news cycle today, Intel is announcing an update to the planned deployment of its next-generation Xeon Scalable platform, known as Sapphire Rapids. Sapphire Rapids is the main platform behind the upcoming Aurora supercomputer, and is set to feature support for leading-edge technologies such as DDR5, PCIe 5.0, CXL, and Advanced Matrix Extensions. Today's announcement is Intel reaffirming its commitment to bringing Sapphire Rapids to market for wide availability in the first half of 2022, while early customers are already working with early silicon for testing and optimization.

In a blog post by Lisa Spelman, CVP and GM of Intel’s Xeon and Memory Group, Intel is getting ahead of the news wave by announcing that additional validation time is being incorporated into the product development cycle, to assist its top-tier partners and customers in streamlining optimizations and, ultimately, deployment. To that end, Intel is working with those top-tier partners today on early silicon, typically ES0 or ES1 in Intel’s internal designations, with those partners helping to validate the hardware against their wide-ranging workloads. As former Intel CTO Mike Mayberry stated at the 2020 VLSI conference, Intel’s hyperscale partners end up testing 10-100x more use cases and edge cases than Intel can validate itself, so working with them becomes a critical part of the launch cycle.

As validation continues, Intel works with its top-tier partners on the specific monetizable goals and features they have requested, so that when the time comes for production (Q1 2022), ramp (Q2 2022), and a full public launch (1H 2022), those key partners are already benefiting from working closely with Intel. Intel has stated that as more information about Sapphire Rapids becomes public, such as at upcoming events like Hot Chips in August or Intel’s own event in October, there will be a distinct focus on the benchmarks and metrics that customers rely upon for monetizable workflows, which is in part what this cycle of deployment assists with.

Top-tier partners getting early silicon 12 months in advance, and then deploying final silicon before launch, is nothing new. It happens for all server processors regardless of source, so by the time we finally get a proper public launch of a product, those hyperscalers and HPC customers have already had it for six months. In that time, those relationships allow the CPU vendors to optimize the final details to which general public/enterprise customers are often more sensitive.

It should be noted that a 1H 2022 launch has not always been the date in Sapphire Rapids presentations. In 2019, Ice Lake Xeon was a 2020 product and Sapphire Rapids was a 2021 product. Ice Lake slipped to 2021, but Intel was still promoting that it would be delivering Sapphire Rapids to the Aurora supercomputer by the end of 2021. In an interview with Lisa Spelman in April this year, we asked about the close proximity of the delayed Ice Lake to Sapphire Rapids, and she stated that Intel expected a fast follow-on between the two platforms. AnandTech is under the impression that, because Aurora has been delayed repeatedly, the ‘end of 2021’ date was a hard part of Intel’s latest contract with Argonne for the machine’s key deliverables. At Computex 2021, Spelman announced in Intel’s keynote that Sapphire Rapids would be launching in 2022, and today’s announcement reiterates that. We expect general availability to fall more toward the end of Q2 or into Q3.

It’s still coming later than expected, but it does space the Ice Lake/Sapphire Rapids transition out a bit more. Whether this constitutes an additional delay depends on your perspective; Intel contends that it is nothing more than a validation extension, whereas we are aware that others may ascribe the commentary to something more fundamental, such as manufacturing. The level of manufacturing capacity Intel has for its 10nm process, and particularly for the 10nm Enhanced SuperFin (ESF) variant on which Sapphire Rapids is built, is not well known beyond the ‘three ramping fabs’ announced earlier this year. Intel appears to be of the opinion that it makes sense to work more closely with its key hyperscaler and HPC customers, who accounted for 50-60%+ of all Xeons sold in the previous generation, as a priority before a wider market launch, in order to focus on their monetizable workflows. (Yes, I realize I’ve said monetizable a few times now; ultimately it’s all a function of revenue generation.)

As part of today’s announcement, Intel also lifted the lid on two new Sapphire Rapids features.

First is Advanced Matrix Extensions (AMX), which has technically been announced before, and there is plenty of programming documentation about it already; however, today Intel is confirming that AMX and Sapphire Rapids are the initial pairing for this technology. The focus of AMX is matrix multiply, enabling more machine learning compute performance for training and inference in Intel’s key ‘megatrend markets’, such as AI, 5G, cloud, and others. Also part of today’s AMX disclosures is some level of performance: Intel is stating that early Sapphire Rapids silicon with AMX, at a pure hardware level, is enabling at least a 2x performance increase over Ice Lake Xeon silicon with AVX512. Intel was keen to point out that this is early silicon without any additional software enhancements on Sapphire Rapids. AMX will form part of Intel’s next-gen DL Boost portfolio at launch.
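
For those curious about the programming model, Intel’s published ISA documentation already defines tile configuration, tile load/store, and BF16 tile-multiply intrinsics for AMX. Below is a minimal, illustrative sketch of a single-tile BF16 multiply using those intrinsics; the tile shapes, compiler flags, and function name are assumptions for illustration, and shipping toolchain and OS support may differ.

// Hypothetical sketch: one 16x32 BF16 tile times one 32x16 BF16 tile accumulated
// into a 16x16 FP32 tile, using the AMX intrinsics from Intel's public documentation.
// Assumes a compiler with AMX support (e.g. -mamx-tile -mamx-bf16) and an OS that
// enables the extended tile state.
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

struct tile_config {              // 64-byte tile configuration blob (palette 1)
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];           // bytes per row for each tile register
    uint8_t  rows[16];            // rows for each tile register
};

void amx_bf16_tile_matmul(const void *A, const void *B, float *C)
{
    struct tile_config cfg;
    memset(&cfg, 0, sizeof(cfg));
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   // tmm0: 16x16 FP32 accumulator
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   // tmm1: 16x32 BF16 tile of A
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   // tmm2: BF16 tile of B (pair-interleaved)
    _tile_loadconfig(&cfg);

    _tile_zero(0);                         // clear the accumulator tile
    _tile_loadd(1, A, 64);                 // load A, 64-byte row stride
    _tile_loadd(2, B, 64);                 // load B
    _tile_dpbf16ps(0, 1, 2);               // tmm0 += tmm1 x tmm2 (BF16 dot-products into FP32)
    _tile_stored(0, C, 64);                // write back the FP32 results
    _tile_release();                       // release the tile state
}

The appeal is that the single _tile_dpbf16ps call covers what would otherwise be a long chain of vector FMA instructions, which is part of where the claimed uplift over AVX-512 comes from.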

The second feature is that Intel is integrating a Data Streaming Accelerator (DSA). Intel has had documentation about DSA on the web since 2019, describing it as a high-performance data copy and transformation accelerator for streaming data between storage, memory, and other parts of the system, through a DMA-remapping hardware unit/IOMMU. DSA has been a request from specific hyperscaler customers who are looking to deploy it within their own internal cloud infrastructure, and Intel is keen to point out that some customers will use DSA, some will use Intel’s new Infrastructure Processing Unit, and some will use both, depending on what level of integration or abstraction they are interested in.

Yesterday we learned that Intel will be offering versions of Sapphire Rapids with integrated HBM to every customer, with the first deployment of those going to Aurora. As mentioned, Intel is confirming that it will disclose more details at Hot Chips in August and at its own Innovation event in October. According to today’s press release, there may also be some details about the architecture before then.

Comments

  • AshlayW - Friday, July 9, 2021 - link

    That is exactly what is coming. Zen3+ EPYC with over 1GB of stacked L3 cache per socket is coming; Sapphire Rapids is more or less DOA except for a few niche AI cases that are better off with NVIDIA GPUs anyway.
  • mode_13h - Saturday, July 10, 2021 - link

    Yeah, though AMX will certainly be an interesting feature. Intel has some other goodies in there, like DSA (Data Streaming Accelerators). Depending on how they use the HBM, Sapphire Rapids will still have some niches and benchmarks where it excels.
  • Unashamed_unoriginal_username_x86 - Tuesday, June 29, 2021 - link

    Incredible! Intel really is clawing its way back to the top with this aggressive cadence, launching in H3 2021!
  • Arsenica - Tuesday, June 29, 2021 - link

    Don't worry about dates, worry about the new exploits soon to be enabled by DSA!
  • JayNor - Tuesday, June 29, 2021 - link

    "enabling at least a 2x performance increase over Ice Lake Xeon silicon with AVX512enabling at least a 2x performance increase over Ice Lake Xeon silicon with AVX512"

    Ice Lake doesn't have bfloat16 support. How does AMX performance compare to Cooper Lake?
  • SystemsBuilder - Tuesday, June 29, 2021 - link

    Agree. The "enabling at least a 2x performance increase over Ice Lake Xeon silicon with AVX512" statement makes no sense. AMX is BF16 floats (and byte + dword accumulate) and Icelake does not support BF16 (closest is FP32). So yeah if you compare BF16 vs FP32 you probably get 2x throughput but at half the precision (though exponent range is the same) just by changing format and adjusting the FPU units in AMX -> that would mean no additional improvement beyond format. Cooper Lake does support BF16 though and that is the interesting like for like comparison.
  • mode_13h - Thursday, July 1, 2021 - link

    > How does AMX performance compare to Cooper Lake?

    AMX is probably going to be one of those 10x improvements we see every now and then. The question will be how relevant it is by the time SPR actually launches.

    If the hyperscalers and cloud operators are already mid-transition to HW AI accelerators by then, it could be worth little more than a few big benchmarks to help shift the GeoMean.
  • SystemsBuilder - Thursday, July 1, 2021 - link

    AMX could potentially be a ~10x BF16 float improvement over Cooper Lake in VERY special cases. In commercial AI the matrices and data sets are big (>> larger than cache) -> the bottleneck is not the FPU capability of the CPU (AVX-512 or AMX) but the memory bandwidth/latency. AVX-512 is already quite powerful, so it can fully saturate the DRAM bandwidth doing large-scale FP32 matrices. The BF16 format alone allows a 2x improvement over FP32 (16 bits vs 32 bits, so 2 elements instead of 1 with the same data bandwidth). Unless Intel does something dramatic with the cache and this DSA “thing” (like near-L1-cache-level bandwidth for GBs of data), AMX will sit mainly idle waiting for data to feed it. The area where AMX could potentially pull out a ~10x FP throughput improvement is small matrices that can be kept entirely in the 8 AMX tile registers (8KB of configurable matrix registers) and the L1 cache. If the matrices are significantly bigger than that, then you end up with memory bandwidth being the bottleneck again... there is only so much that caches and register files can hide.
  • mode_13h - Friday, July 2, 2021 - link

    > In commercial AI the matrices and data sets are
    > big (>>larger than cache) -> the bottleneck is not the FPU capabilities
    > of the CPU (AVX-512 or AMX) but the memory bandwidth/latency.

    A close reading of AMX seems to suggest it's as much a data-movement optimization as a compute optimization. It seems ideally-suited to accelerate small, non-separable 2D convolutions. The data movement involved makes that somewhat painful to optimize with instructions like AVX-512.

    From what I've seen, about the only things it currently does are load tiles and compute dot-products (at 8-bit and BFloat16 precision).

    > The BF16 format alone allows a 2x improvement over FP32 (16 bits vs 32 bits so
    > 2 elements instead of 1 with the same data bandwidth).

    Since Ivy Bridge, they've had instructions to do fp16 <-> fp32 conversion. So, although I know fp16 isn't as good as BF16, you could already do a memory bandwidth optimization by storing your data in fp16 and only converting it to fp32 after loading it. (A minimal sketch of this follows after this comment.)

    > If the matrices are significantly bigger than that, then you end up with memory
    > bandwidth being the bottle neck again... there is only so much cache and
    > register files can only hide.

    Yes, but... a number of the AI accelerators I'm seeing coming to market don't have a whole lot more on-die SRAM than modern server CPUs. And some don't even have more memory bandwidth, either. That tells me they're probably doing some clever data movement optimizations around batching, where they leave parts of the model on chip and run the input data through, before swapping in the next parts of the model.
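
As an aside, the fp16 <-> fp32 conversion instructions mentioned above are the F16C extensions (VCVTPH2PS/VCVTPS2PH), available since Ivy Bridge. Below is a minimal sketch of the "store in fp16, compute in fp32" idea; the function name and array layout are assumptions for illustration, with no claim about how any particular framework implements it.

// Illustrative only: dot product with FP16 storage and FP32 arithmetic using the
// F16C conversion intrinsics. Compile with e.g. -mavx -mf16c; n is assumed to be
// a multiple of 8, and the uint16_t arrays hold IEEE half-precision bit patterns.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

float dot_fp16_storage(const uint16_t *a, const uint16_t *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        // Load 8 half-precision values (16 bytes per side) and widen to FP32.
        __m256 va = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256 vb = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(b + i)));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));   // arithmetic stays in FP32
    }
    // Reduce the 8-lane accumulator to a single float.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

Halving the bytes streamed from memory is the whole point; as noted in the comments, BF16 keeps the FP32 exponent range, which FP16 does not.
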
  • SystemsBuilder - Saturday, July 3, 2021 - link

    Some points on AMX/AVX-512 and neural nets – and sorry for the long post...
    1. BF16 is not FP16 (as you said), and at least in my experience FP16 just isn't good enough in training scenarios. BF16 is a good compromise between precision and bandwidth and will do just fine - BF16 was created for AI because FP16 is not good enough (in training)... so streaming FP16 and then upconverting to FP32 is not a good enough path... I think we are saying the same thing here.
    2. AMX will make BF16 matrix multiplications on Sapphire Rapids much easier compared to Cooper Lake, since we would only need one instruction to multiply two matrix tiles (plus, of course, the initial tile configuration, tile loads, and final store). So yes, it will be easier to program and faster when executed compared to AVX-512 for BF16 matrix multiplications for those reasons alone (assuming AMX has competitive clock-cycle improvements over AVX-512). Having said that, AVX-512 tile x tile multiplications are not super hard either (although in FP32, so half speed). It basically involves repetitive use of the read-ahead (prefetcht0) and vfmadd231ps instructions (apologies for the low-level technical detail, but it's key; a stripped-down example follows after this comment). AVX-512 dgemm implementations use the same approach and can get close to fully saturating the memory bandwidth and close to the theoretical max FP32 FLOPS (for CPUs with 2 AVX-512 units per core). So how much can AMX improve on this, beyond the BF16 compression advantage?
    3. Matrix multiplication requires at least O(i*j) + O(j*k) reads, O(i*k) writes, and O(i*j*k) FMAs (multiply-adds) (O = big-O notation – comp sci theory…):
    So if i, j, k (the dimensions of the matrices) are large enough (>> the cache dimensions), it mathematically does not matter how clever you are with caching or how fast you can compute: you still need to read every single matrix element from RAM at least once, and that (the RAM bandwidth) sets the theoretical max throughput you can get. Best case: all data is in cache and the FPU units are never idle (= all matrices are in cache when needed), and that's not happening anytime soon. My point is that RAM bandwidth is the fundamental bottleneck, and AMX's matrix FLOPS advantage over Cooper Lake needs a corresponding bump in streaming bandwidth - maybe the DSA “thing” will do just that… if not, Intel will just have upgraded the matrix “engine” without being able to feed it fast enough…
    4. Back to AI: without getting too technical (and I'm sure I have overstepped many times already): neural nets (whatever flavor you are using) require tons of matrix multiplications of sometimes very large matrices (>> cache) during training, which additionally are algorithmically serially dependent on each other (e.g. in the feed-forward and backward-prop case you need to finish the calculation of one layer before you can move on to the next, forward and backward, saving and reading the intermediate results in between). Therefore it is fundamentally a data-flow and bandwidth constrained problem, not an FPU constrained problem. Batching/mini-batching of the training data just brings down the size of each iteration, but you still need to work through the entire training set, and then do it all over again and again, hundreds of times, until convergence. Net net: the nature of the AI algorithms makes it inherently difficult to break through the RAM speed limit.
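
To make point 2 above concrete, here is a stripped-down FP32 micro-kernel of the prefetcht0 + vfmadd231ps variety, written with the _mm_prefetch and _mm512_fmadd_ps intrinsics. Blocking, edge handling, and tuning are omitted; the panel shape and names are assumptions for illustration, not a production GEMM kernel.

// Sketch: C[16x4] += A[16xK] * B[Kx4], column-major panels, compiled with e.g.
// -mavx512f. Each iteration prefetches upcoming A data (prefetcht0) and issues
// four FMAs (vfmadd231ps) against broadcast B scalars.
#include <immintrin.h>

void avx512_microkernel(const float *A, const float *B, float *C,
                        int K, int lda, int ldb, int ldc)
{
    __m512 c0 = _mm512_loadu_ps(C + 0 * ldc);
    __m512 c1 = _mm512_loadu_ps(C + 1 * ldc);
    __m512 c2 = _mm512_loadu_ps(C + 2 * ldc);
    __m512 c3 = _mm512_loadu_ps(C + 3 * ldc);

    for (int k = 0; k < K; ++k) {
        // Hint upcoming columns of A into L1; prefetching past the end is harmless.
        _mm_prefetch((const char *)(A + (k + 4) * lda), _MM_HINT_T0);
        __m512 a = _mm512_loadu_ps(A + k * lda);                      // 16 rows of A
        c0 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[k + 0 * ldb]), c0);  // fused multiply-add
        c1 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[k + 1 * ldb]), c1);
        c2 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[k + 2 * ldb]), c2);
        c3 = _mm512_fmadd_ps(a, _mm512_set1_ps(B[k + 3 * ldb]), c3);
    }

    _mm512_storeu_ps(C + 0 * ldc, c0);
    _mm512_storeu_ps(C + 1 * ldc, c1);
    _mm512_storeu_ps(C + 2 * ldc, c2);
    _mm512_storeu_ps(C + 3 * ldc, c3);
}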
