Intel's Xeon Phi in 10 Petaflops supercomputer

by Johan De Gelas on September 11, 2012 7:41 PM EST

Intel announced the Xeon Phi ("Knights Corner") a few months ago and bought the QLogic InfiniBand team and Cray's fabric team to bolster its HPC efforts: a clear signal that Intel will not stand idly by while GPU vendors try to conquer the HPC market.

Dell, Intel and the Texas Advanced Computing Center (TACC) are building the first supercomputer based on the Xeon Phi, called Stampede. Stampede can spit out 10 Petaflops. If it were released right now, it would occupy third place in the Top 500 list of supercomputers. Stampede will go live on January 7, 2013.

The Xeon Phi consists of 64 x86 cores (256 threads), each with a 512-bit vector unit. The vector unit can dispatch 8 double precision SIMD operations per clock. The Xeon Phi runs at 2 GHz (more or less, probably more soon) and thus delivers (2 GHz x 64 cores x 8 FLOPs) about 1 TFlops. For comparison, a quad-core Haswell at 4 GHz will deliver about one fourth of that in 2013. NVIDIA and AMD GPUs can deliver similar FLOPs, but programming the Xeon Phi should be a lot easier than using CUDA or OpenCL. The same development tools as for the regular Xeons are available: OpenMP, Intel's Threading Building Blocks, MPI, the Math Kernel Library (MKL)...
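
The peak figure above is plain multiplication, and a minimal Python sketch makes it easy to check. The function name `peak_gflops` is only illustrative, and the Haswell value of 16 double precision FLOPs per cycle per core (two 256-bit FMA units) is an assumption that the article does not spell out.

```python
def peak_gflops(clock_ghz, cores, dp_flops_per_cycle):
    """Theoretical double precision peak in GFLOPS: clock x cores x FLOPs/cycle."""
    return clock_ghz * cores * dp_flops_per_cycle


# Figures quoted in the article: 2 GHz, 64 cores, 8 DP operations per cycle.
xeon_phi_gflops = peak_gflops(2.0, 64, 8)

# Assumed Haswell figures: 4 GHz, 4 cores, 16 DP FLOPs per cycle per core.
haswell_gflops = peak_gflops(4.0, 4, 16)

print(f"Xeon Phi: {xeon_phi_gflops:.0f} GFLOPS")
print(f"Haswell:  {haswell_gflops:.0f} GFLOPS")
print(f"Ratio:    {haswell_gflops / xeon_phi_gflops:.2f}")
```

Under these assumptions the Haswell part lands at one fourth of the Xeon Phi figure, which matches the comparison in the text.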

Gallery: Supermicro's Xeon Phi Server

Anyway, the Xeon Phi is definitely not limited to ultra expensive supercomputers. Supermicro showed us the Superserver 2027GR-TRF, which contains 4 Xeon Phi cards and is powered by two redundant 1800W (Platinum) PSUs. The rest of the server consists of two Xeon E5 CPUs and 16 DIMM slots in total, supporting up to 256 GB. So it seems that one Xeon Phi card consumes about 300W.
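
The 300W estimate can be sanity-checked against the power supply rating with a short Python sketch. It assumes that a redundant pair must be able to run the whole server from a single 1800W PSU, which the article does not state; the helper name `remaining_budget_watts` is hypothetical.

```python
PSU_WATTS = 1800        # rating of one PSU (the second one is assumed to be a spare)
CARDS = 4               # Xeon Phi cards in the chassis
WATTS_PER_CARD = 300    # the article's per-card estimate


def remaining_budget_watts(psu_watts, cards, watts_per_card):
    """Power left for CPUs, memory, fans and drives once the cards are fed."""
    return psu_watts - cards * watts_per_card


card_total = CARDS * WATTS_PER_CARD
headroom = remaining_budget_watts(PSU_WATTS, CARDS, WATTS_PER_CARD)
print(f"Cards: {card_total} W, left for the rest of the server: {headroom} W")
```

If that assumption holds, four cards at 300W leave 600W for the two Xeon E5 CPUs, the DIMMs and everything else.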

15 Comments

djgandy - Wednesday, September 12, 2012 - link
Intel's SIMD units are designed in such a way that performance scales inversely with precision, so FP64 is half the performance of FP32 which is half of FP16.

Theoretical performance is what gets quoted for any card. The performance is there, you just have to use it :)

Your numbers are also not quite correct: you left 2 ops/clock out of your calculation, which makes the performance 2 TFLOPS.
1008anan - Wednesday, September 12, 2012 - link
This is what I thought, djgandy.

Can someone confirm that there are 16 double precision flops per core per cycle? Or 4 double precision flops per thread per cycle?

This would mean that at 2 Gigahertz:

2 Gigahertz * (64 cores)*(4 threads/core)*(4 double precision flops per thread/clock) = 2 double precision teraflops theoretical maximum speed.

Actual speed would be quite a bit less than 2 double precision teraflops. If we assume 70% efficiency [completely pulled out of nothing], we would get 1.4 double precision teraflops.

Is there confirmation that this is true (aside from the efficiency estimate since I doubt Intel has released that information yet)? Is there also confirmation that each Xeon Phi SoC only has a TDP of 75 watts? If so that is astounding.

This would mean that the whole system delivers:
1.4 double precision teraflops/75 watts = about 19 double precision gigaflops/watt.

Can this be right?
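
The arithmetic in this comment can be reproduced with a minimal Python sketch. Every input is the commenter's own assumption (4 double precision FLOPs per thread per clock, 70% efficiency, 75 watts), not a confirmed specification, and the variable names are only illustrative.

```python
CLOCK_GHZ = 2.0
CORES = 64
THREADS_PER_CORE = 4
DP_FLOPS_PER_THREAD_PER_CLOCK = 4   # the commenter's unconfirmed assumption
EFFICIENCY = 0.70                   # "completely pulled out of nothing"
TDP_WATTS = 75                      # unconfirmed figure from the comment

# Theoretical peak in GFLOPS under the per-thread assumption.
peak = CLOCK_GHZ * CORES * THREADS_PER_CORE * DP_FLOPS_PER_THREAD_PER_CLOCK

# Sustained throughput and the resulting performance per watt.
sustained = peak * EFFICIENCY
per_watt = sustained / TDP_WATTS

print(f"Peak:      {peak:.0f} GFLOPS")
print(f"Sustained: {sustained:.0f} GFLOPS")
print(f"Per watt:  {per_watt:.1f} GFLOPS/W")
```

The result is only as good as the three inputs, each of which is questioned in this thread.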
ArCamiNo - Thursday, September 13, 2012 - link
The threads don't add any peak FLOPS. They are only there to help get close to that peak.
4 threads per core means there are 4 complete sets of registers in each core.
For example, if the thread currently executing doesn't use all the units of the core, another thread can use them.
Two threads can't use the same resource at the same time, but even though the resources per core (number of ALUs, FPUs, decode and dispatch units, etc.) stay the same, they are used more efficiently.

So for me it's a 1 GHz (maybe a little more) chip with 64 cores. Each core can run a fused multiply-add instruction (like the future AVX2 instruction set of Haswell). That means 2 operations per cycle on 512 bits (so 8 double precision floats) = 1 TFlops of double precision peak performance (2*64*8).
So maybe the frequency is a little more than 1 GHz to achieve the 1 TFlops in double precision on LINPACK that they claimed. But with this kind of architecture and the 4 threads per core, the real performance won't be that far from the theoretical performance, unlike GPUs, where it's about 60%.
ArCamiNo - Thursday, September 13, 2012 - link
I just checked on my Ivy Bridge processor, and I can reach the theoretical performance peak with the Intel Linpack Benchmark (http://software.intel.com/en-us/articles/intel-mat...).
I get 82 GFlops in double precision. The theoretical peak is 8 double precision FLOPs per cycle per core. At 2.6 GHz (3720QM) on 4 cores, that's 83.2 GFlops.
So I'm now pretty sure that it will be the same with the Xeon Phi, and that the frequency will be 1 GHz.
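
The Ivy Bridge check above boils down to measured throughput divided by theoretical peak. Here is a minimal Python sketch of that ratio using only the numbers from this comment; the variable names are illustrative.

```python
DP_FLOPS_PER_CYCLE = 8    # per core, as stated in the comment
CLOCK_GHZ = 2.6           # Core i7-3720QM base clock used in the comment
CORES = 4
MEASURED_GFLOPS = 82.0    # the commenter's Linpack result

# Theoretical peak in GFLOPS, then the share of it that Linpack reached.
peak_gflops = DP_FLOPS_PER_CYCLE * CLOCK_GHZ * CORES
efficiency = MEASURED_GFLOPS / peak_gflops

print(f"Peak: {peak_gflops:.1f} GFLOPS, efficiency: {efficiency:.1%}")
```

Whether a many-core part reaches a similar share of its peak is the commenter's expectation, not something these numbers demonstrate.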

I don't think that the power consumption will be only 75W per card. If you subtract the power for the RAM, that would mean around 1 watt per core, which is the power consumption of an ARM core. I think it's more like 3-4 watts per core.
JohanAnandtech - Monday, September 17, 2012 - link
Seems like we are not far from the mark.

One core has several threads, but that is just to keep the flow going. For FLOPs, you should focus on the vector unit, not the pipeline or threads. So each vector unit can do 8 DP, not 16. The core clock is around 2 GHz.

We reported 300W per card, and Charlie is reporting about 200W on idle. So 300W maxing out seems very reasonable to me.
http://semiaccurate.com/2012/09/14/hard-numbers-fo...

Right?
