Intel recently released a new version of its document for software developers revealing additional details about its upcoming Xeon Scalable ‘Cooper Lake-SP’ processors. As it turns out, the new CPUs will support AVX512_BF16 instructions and therefore the bfloat16 format. The main intrigue is that, at this point, AVX512_BF16 appears to be supported only by the Cooper Lake-SP microarchitecture, but not by its direct successor, the Ice Lake-SP microarchitecture.

The bfloat16 format is a truncated 16-bit version of the 32-bit IEEE 754 single-precision floating-point format. It preserves all 8 exponent bits, but reduces the precision of the significand from 24 bits to 8 bits, saving memory, bandwidth, and processing resources while retaining the same dynamic range. The bfloat16 format was designed primarily for machine learning and near-sensor computing applications, where relatively high precision is needed near zero but not so much at the extremes of the range. The number representation is supported by Intel’s upcoming FPGAs as well as its Nervana neural network processors, and by Google’s TPUs. Given that Intel already supports the bfloat16 format across two of its product lines, it makes sense to support it elsewhere as well, which is what the company is going to do by adding AVX512_BF16 instruction support to its upcoming Xeon Scalable ‘Cooper Lake-SP’ platform.

AVX-512 Support Propagation by Various Intel CPUs
Newer uArch supports older uArch

Xeon General                        Xeon Phi
Skylake-SP: AVX512BW                Knights Landing
Cannon Lake: AVX512VBMI             Knights Mill
Cascade Lake-SP: AVX512_VNNI
Cooper Lake: AVX512_BF16
Ice Lake: AVX512_VNNI (not BF16)

Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (page 16)

The list of Intel’s AVX512_BF16 Vector Neural Network Instructions includes VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. All of them can operate on 128-bit, 256-bit, or 512-bit vectors, and each width comes in plain, masked, and zero-masked forms, so software developers can pick one of a total of nine versions of each instruction based on their requirements.

Intel AVX512_BF16 Instructions

VCVTNE2PS2BF16: Convert Two Packed Single Data to One Packed BF16 Data

Intel C/C++ Compiler Intrinsic Equivalent:
VCVTNE2PS2BF16 __m128bh _mm_cvtne2ps_pbh (__m128, __m128);
VCVTNE2PS2BF16 __m128bh _mm_mask_cvtne2ps_pbh (__m128bh, __mmask8, __m128, __m128);
VCVTNE2PS2BF16 __m128bh _mm_maskz_cvtne2ps_pbh (__mmask8, __m128, __m128);
VCVTNE2PS2BF16 __m256bh _mm256_cvtne2ps_pbh (__m256, __m256);
VCVTNE2PS2BF16 __m256bh _mm256_mask_cvtne2ps_pbh (__m256bh, __mmask16, __m256, __m256);
VCVTNE2PS2BF16 __m256bh _mm256_maskz_cvtne2ps_pbh (__mmask16, __m256, __m256);
VCVTNE2PS2BF16 __m512bh _mm512_cvtne2ps_pbh (__m512, __m512);
VCVTNE2PS2BF16 __m512bh _mm512_mask_cvtne2ps_pbh (__m512bh, __mmask32, __m512, __m512);
VCVTNE2PS2BF16 __m512bh _mm512_maskz_cvtne2ps_pbh (__mmask32, __m512, __m512);
VCVTNEPS2BF16: Convert Packed Single Data to Packed BF16 Data

Intel C/C++ Compiler Intrinsic Equivalent:
VCVTNEPS2BF16 __m128bh _mm_cvtneps_pbh (__m128);
VCVTNEPS2BF16 __m128bh _mm_mask_cvtneps_pbh (__m128bh, __mmask8, __m128);
VCVTNEPS2BF16 __m128bh _mm_maskz_cvtneps_pbh (__mmask8, __m128);
VCVTNEPS2BF16 __m128bh _mm256_cvtneps_pbh (__m256);
VCVTNEPS2BF16 __m128bh _mm256_mask_cvtneps_pbh (__m128bh, __mmask8, __m256);
VCVTNEPS2BF16 __m128bh _mm256_maskz_cvtneps_pbh (__mmask8, __m256);
VCVTNEPS2BF16 __m256bh _mm512_cvtneps_pbh (__m512);
VCVTNEPS2BF16 __m256bh _mm512_mask_cvtneps_pbh (__m256bh, __mmask16, __m512);
VCVTNEPS2BF16 __m256bh _mm512_maskz_cvtneps_pbh (__mmask16, __m512);
VDPBF16PS: Dot Product of BF16 Pairs Accumulated into Packed Single Precision

Intel C/C++ Compiler Intrinsic Equivalent:
VDPBF16PS __m128 _mm_dpbf16_ps(__m128, __m128bh, __m128bh);
VDPBF16PS __m128 _mm_mask_dpbf16_ps(__m128, __mmask8, __m128bh, __m128bh);
VDPBF16PS __m128 _mm_maskz_dpbf16_ps(__mmask8, __m128, __m128bh, __m128bh);
VDPBF16PS __m256 _mm256_dpbf16_ps(__m256, __m256bh, __m256bh);
VDPBF16PS __m256 _mm256_mask_dpbf16_ps(__m256, __mmask8, __m256bh, __m256bh);
VDPBF16PS __m256 _mm256_maskz_dpbf16_ps(__mmask8, __m256, __m256bh, __m256bh);
VDPBF16PS __m512 _mm512_dpbf16_ps(__m512, __m512bh, __m512bh);
VDPBF16PS __m512 _mm512_mask_dpbf16_ps(__m512, __mmask16, __m512bh, __m512bh);
VDPBF16PS __m512 _mm512_maskz_dpbf16_ps(__mmask16, __m512, __m512bh, __m512bh);

Only for Cooper Lake?

When Intel mentions an instruction in its Intel Architecture Instruction Set Extensions and Future Features Programming Reference, the company usually names the first microarchitecture to support it and indicates that its successors also support it (or are set to support it) by appending the phrase ‘and later’. For example, Intel’s original AVX is listed as supported by ‘Sandy Bridge and later’.

This is not the case with AVX512_BF16, which is said to be supported by ‘Future Cooper Lake’ only. Meanwhile, after the Cooper Lake-SP platform comes the long-awaited 10nm Ice Lake-SP server platform, and it would be a bit odd for it not to support something its predecessor does. However, this is not an entirely impossible scenario. Intel is keen on offering differentiated solutions these days, so tailoring Cooper Lake-SP for certain workloads while focusing Ice Lake-SP on others may be the case here.

We have reached out to Intel for additional information and will update the story if we get some extra details on the matter.

Update: Intel has sent us the following:

At this time, Cooper Lake will add support for Bfloat16 to DL Boost.  We’re not giving any more guidance beyond that in our roadmap.


Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (via InstLatX64/Twitter)

Comments

  • mode_13h - Friday, April 5, 2019 - link

    Good riddance, IMO. There are good reasons IEEE 754 used a different bit-allocation for their half-precision format.

    Given that bfloat16 isn't useful for much beyond deep learning, I think it'd be a good call to drop it.
    It's a mistake to burn CPU cycles on deep learning, when GPUs do it far more efficiently. Keep in mind that around the time of Ice Lake SP's introduction, Intel should have its own GPUs. Not to mention Nervana's chips and Altera's FPGAs.
  • Yojimbo - Saturday, April 6, 2019 - link

    But Intel is pursuing a strategy of some sort of unified code across their product lines. So their CPUs should be able to run that code, even if it does so more slowly than an accelerator, or a fracture would be created and the strategy would seem suspect. There is likely to be a market for machine learning on CPU-only servers beyond 2020 and 1) if bfloat16 can help its performance then it would be good for it to be there and 2) it would be a problem if this market's codebase was separated from that of the rest of Intel's constellation because of the lack of compatibility with the format.

    Of course, I wonder how useful such a unified system will be when dealing with very different underlying architectures. Maybe in a relatively narrow field like deep learning they can make it work, but if things start to require more vertical integration I would think that to get good performance with any sort of portability everything would need to be heavily re-optimized for each architecture and even mix of architectures. That would seem to defeat the purpose and presumably the heavy machine-learning only applications should be running almost entirely on Nervana hardware, anyway, assuming they believe in their Nervana accelerator.
  • mode_13h - Saturday, April 6, 2019 - link

    Have they even said much about oneAPI? I doubt it can mean something like bfloat16 everywhere, because that would instantly obsolete all of their existing products. Perhaps they're talking about higher-level interfaces that don't necessarily expose implementation-specific datatypes.

    I guess time will tell.
  • saratoga4 - Saturday, April 6, 2019 - link

    >But Intel is pursuing a strategy of some sort of unified code across their product lines. So their CPUs should be able to run that code, even if it does so more slowly than an accelerator,

    Intel's compiler has been able to do this for 20 years, at least since SSE, maybe even MMX. You don't need to support all instructions on all devices so long as your development tools do proper runtime checking.
  • Yojimbo - Sunday, April 7, 2019 - link

    Intel seems to be marketing their platform as a method of portability, they aren't marketing the capability of their compiler to target all their architectures. Something like bfloat16 most likely completely changes the parameters of what makes an effective neural network. So if Intel will market their CPUs for deep learning training, will market bfloat16 as the preferred data type for deep learning training, will market their platform for portability between their architectures, then how could they leave bfloat16 out of their CPUs? Either the CPUs will have bfloat16 or the situation Intel is presenting to their customers isn't as smooth as their marketing team has recently suggested it would be.
  • Yojimbo - Sunday, April 7, 2019 - link

    A third possibility is of course that I am misinterpreting what Intel is saying...
  • KurtL - Tuesday, April 9, 2019 - link

    It is not a mistake to use a CPU for deep learning even though GPUs may do a better job. For one thing, for occasional deep learning it doesn't make sense to invest in overpriced NVIDIA GPUs. And a CPU has a distinct advantage: its memory may be a lot slower than a GPU's, but it has one big memory pool that no GPU solution can beat. Offering alternatives also keeps the pressure on those who make AI accelerators to keep pricing reasonable.

    bfloat16 is as useful as the IEEE half-precision format, or maybe even more useful. I know IEEE half precision is supported in some graphics and acceleration frameworks, and I am fully aware that it is a first-class citizen in iPhone hardware. But so far it looks like in typical PC applications, areas where you would expect to see it actually use 32-bit floating point and in a number of cases even 64-bit floating point. And for scientific computing IEEE half precision is useless.

    The true reason the instructions are not in Ice Lake is because Ice Lake was scheduled to be launched much earlier and its design was probably more or less finished quite a while ago. Cooper Lake is a quick stopgap adding some additional support for deep learning on top of what Cascade Lake offers and was probably easier to implement. Especially if the rumours are true that we actually won't see high core count Ice Lake CPUs for a while due to too low yields on 10nm. But I'd expect to see these new instructions on an Ice Lake successor.
  • mode_13h - Wednesday, April 10, 2019 - link

    For the most part, memory speed is vastly more important for deep learning than size.

    I think the main reason half-precision didn't see much use was a chicken-and-egg problem. Game developers didn't go out of their way to use it, because most GPUs implemented only token support. And GPUs didn't bother implementing proper support since no games used it. But since Broadwell, Vega, and Turing that has completely changed - the big 3 makers of PC GPUs now offer twice the fp16 throughput of their fp32. We've already seen some games start taking advantage of that.

    As for fp16 not being useful for scientific computing, you can find papers on how the Nvidia V100's tensor cores (which support only fp16 w/ fp32 accumulate) have been harnessed in certain scientific applications, yielding greater performance than if the same computations were run in the conventional way, on the same chip.

    Finally, I had the same suspicion about Ice Lake as you say. We knew that Intel had some 10 nm designs that were just waiting on their 10 nm node to enter full production.

    To be honest, I don't really mind if bfloat16 sticks around - I just don't want to see IEEE 754 half-precision disappear, as I feel that format is more usable for many applications.
  • The Hardcard - Friday, April 5, 2019 - link

    I know it is still a new tech direction, but I am hoping Anandtech or someone will soon get more in depth on training and inference. Hopefully there will be tools developed that will allow an idea of the relative capabilities of the approaches of these companies, from smartphone SOCs, through general CPUs and GPUs, to specialized accelerators.

    This seems to be a significant proportion of nearly every company’s R&D budget currently. IBM is working with a float8 that they claim loses virtually no accuracy. I hope there can be more articles about this, from the advantage of floating point over integer, to the remarkable lack of a need for precision that using float8 entails. Can it really stand up to FP32?
  • mode_13h - Friday, April 5, 2019 - link

    It seems to me that training benchmarks are harder than inference, especially when novel approaches are involved. The reason being that techniques like reducing precision usually involve some trade-off between speed and accuracy. So, you'd have to define some stopping condition, like a convergence threshold, and then report how long it took to reach that point and what level of accuracy the resulting model delivered.

    Inferencing with reduced precision is a little similar, but at least you don't have the challenge of defining what it is you actually want to test. Forward propagation is straightforward enough. But you still have to measure both speed *and* accuracy.

    I do see value in dissecting different companies' approaches, not unlike Anandtech's excellent coverage of Nvidia's Tensor cores. That said, there's not really much depth to some of the approaches, to the extent they simply boil down to a slight variation on existing FP formats.

    int8 and int4 coverage could be more interesting, since you could get into the area of the layer types for which they're most applicable and the tradeoffs of using them (i.e. not simply accuracy, but maybe you've also got to use larger layers) and the typical speeds that result. But, that has more to do with how various software frameworks utilize the underlying hardware capabilities than with the hardware, itself.

    As for IBM's FP8, this describes how they had to reformulate or refactor aspects of the training problem in order to use significantly fewer than 16 bits for training:

    On a related note, Nervana (now part of Intel) introduced what they call "flexpoint":
