Yesterday HP announced retail availability of two ARM-based servers, the ProLiant m400 and m800. Each is offered as a server cartridge for the Moonshot System, and a single 4.3U Moonshot chassis can hold 45 server cartridges. Usually a higher number means better, but in this case the m400 and m800 are so different that I wouldn’t consider them competitors. The m800 is focused on parallel compute and DSP, while the m400 is focused on compute, memory bandwidth and IO bandwidth, and features the first 64-bit ARM processor to reach retail server availability.

HP ProLiant ARM Servers

                              m400                             m800
Processors                    1                                4
Processor                     AppliedMicro X-Gene              TI KeyStone II 66AK2H
                              Custom 64-bit ARMv8              Cortex-A15 ARMv7A + DSP
Compute cores per processor   8 CPU                            4 CPU
                                                               8 DSP
Clock Speed                   2.4 GHz                          1.0 GHz
Cache Memory                  Each core: 32KB L1 D$ and I$     Each DSP core: 1MB L2
                              Each pair: 256KB L2
                              All cores: 8MB L3
Memory                        Quad Channel                     Single Channel
                              8 SODIMM Slots                   4 SODIMM Slots
                              DDR3-1600 Low Voltage            DDR3-1600 Low Voltage
                              Max: 64GB (8x8GB)                Max: 32GB (4x8GB)
Network Controller            Dual 10GbE                       Dual 1GbE
Storage                       M.2 2280                         M.2 2242
PCIe                          3.0                              2.0

Starting with the m400, HP designed in a single AppliedMicro X-Gene SoC at 2.4 GHz. AppliedMicro has been discussing the X-Gene processor for several years now, and with this announcement becomes the first vendor other than Apple to achieve retail availability of a 64-bit ARMv8 SoC. Considering Apple doesn’t sell its processors stand-alone, this is a significant milestone. AppliedMicro has also beaten AMD’s A1100 processor to market by a wide margin, as AMD has not yet entered production. Marquee features of the X-Gene SoC include 8 custom 64-bit ARM cores, which at quad-issue should offer higher performance than A57, quad channel DDR3 memory, and integrated PCIe 3.0 and dual 10GbE interfaces. Look out for a deep dive on the X-Gene SoC in a future article.

The m800 is a 32-bit ARM server containing four Texas Instruments KeyStone II 66AK2H SoCs at 1.0 GHz. Each KeyStone II SoC contains four A15 CPU cores alongside eight TI C66x DSP cores and single channel DDR3 memory, for a total of 16 CPU and 32 DSP cores per cartridge. IO steps back to dual GbE and PCIe 2.0 interfaces. It is clear from the differences in these servers that the m400 and m800 target different markets. There isn’t yet a best-of-both-worlds server combining the core count of the m800 with the memory and IO interfaces of the m400.
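
To put the per-cartridge figures above into chassis terms, here is a rough back-of-the-envelope sketch. It multiplies out the core counts from the table and estimates theoretical peak memory bandwidth per SoC. The bandwidth part rests on an assumption that is not from HP: each DDR3-1600 channel is treated as a 64-bit data bus (ECC bits ignored), giving 12.8 GB/s per channel. Sustained bandwidth will be lower. The names `Cartridge`, `summarize` and `peak_channel_gb_per_s` are made up for this illustration.

```python
from dataclasses import dataclass

CARTRIDGES_PER_CHASSIS = 45           # from the article: a 4.3U chassis holds 45 cartridges
DDR3_1600_TRANSFERS_PER_SEC = 1600e6  # DDR3-1600 = 1600 mega-transfers per second
BYTES_PER_TRANSFER = 8                # ASSUMPTION: 64-bit data bus per channel, ECC ignored


@dataclass(frozen=True)
class Cartridge:
    """Per-cartridge figures taken from the spec table (hypothetical helper type)."""
    name: str
    socs: int
    cpu_cores_per_soc: int
    dsp_cores_per_soc: int
    mem_channels_per_soc: int


M400 = Cartridge("m400", socs=1, cpu_cores_per_soc=8,
                 dsp_cores_per_soc=0, mem_channels_per_soc=4)
M800 = Cartridge("m800", socs=4, cpu_cores_per_soc=4,
                 dsp_cores_per_soc=8, mem_channels_per_soc=1)


def peak_channel_gb_per_s() -> float:
    """Theoretical peak of one DDR3-1600 channel in GB/s (under the assumption above)."""
    return DDR3_1600_TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9


def summarize(c: Cartridge) -> dict:
    """Multiply out core counts and per-SoC peak memory bandwidth."""
    cpu = c.socs * c.cpu_cores_per_soc
    dsp = c.socs * c.dsp_cores_per_soc
    soc_bw = c.mem_channels_per_soc * peak_channel_gb_per_s()
    return {
        "cpu_cores_per_cartridge": cpu,
        "dsp_cores_per_cartridge": dsp,
        "cpu_cores_per_chassis": cpu * CARTRIDGES_PER_CHASSIS,
        "dsp_cores_per_chassis": dsp * CARTRIDGES_PER_CHASSIS,
        "peak_gb_per_s_per_soc": soc_bw,
        "peak_gb_per_s_per_cpu_core": soc_bw / c.cpu_cores_per_soc,
    }


if __name__ == "__main__":
    for cartridge in (M400, M800):
        print(cartridge.name, summarize(cartridge))
```

Under these assumptions a full chassis of m800 cartridges carries twice as many CPU cores as a chassis of m400 cartridges, while each X-Gene core has roughly twice the theoretical memory bandwidth of each KeyStone II A15 core, which is consistent with the two cartridges targeting different workloads.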

Each server is available with Ubuntu and IBM Informix database preinstalled, and will be demonstrated at ARM TechCon October 1-3 in Santa Clara, California.

Source: HP

33 Comments

  • Wilco1 - Saturday, October 4, 2014

    Check the Intel optimization manual, it clearly states the FP issue queues are in-order. Memory operations are similar, except that there is a small retry buffer that handles cache misses.
  • patrickjchase - Saturday, October 4, 2014

    It was discussed ad nauseam in the (400+ post) discussion thread from the very same RWT article you cite. Look for the posts titled "No out-of-order in FP cluster of silvermont" (and yes, I'm the same 'Patrick Chase', and 'Wilco1' above is the same 'Wilco'). It's rather depressing how it's always the same group of people with nothing better to do with our lives... :-)

    The gcc commit comments cited in the initial post in that thread are somewhat misleading in describing the memory pipe as in-order. That's true for the initial L1 D$ probe, but ops that miss L1 are then forwarded to an OoO reissue queue. The result is an in-order memory pipeline for L1 but OoO for L2/DDR, which is basically what you want given that arguably the main benefit of having OoO in the first place is that it allows you to execute past cache misses (compilers are good at scheduling around known latencies such as L1 D$ access - it's the unknown/variable ones that require runtime "help").

    I recall David Kanter confirming the in-order nature of Silvermont FP somewhere, but can't find that post.
  • Wilco1 - Saturday, October 4, 2014

    Silvermont has a small reorder buffer and a mostly in-order FP pipeline. Memory operations are also mostly in-order, with no speculation whatsoever. All loads/stores always issue in program order (ie. there is no out-of-order issue at all). Stores with unknown addresses stall the whole pipeline until resolved. Loads with unknown address and cache misses go into a small retry queue after using an issue cycle, and are retried later (so there is limited out of order retry). This is almost identical to a hit-under-miss load/store pipeline on an in-order CPU.

    So Silvermont is limited and partially OoO. If you call this "aggressive" or "top notch" then what do you call eg. Cortex-A15 with its much larger reorder buffer, full OoO FP, int, branch and memory pipelines with full speculation? You'll run out of superlatives...

    AnandTech would do well to ditch their hopeless JS rubbish - the results are not even from the same browser (let alone same version of the same browser!), and different browsers show >2x difference in performance on eg. SunSpider. Using JS as an indication of browser performance is bad enough (as it isn't at all), but using JS benchmarks to claim better CPU performance is insane.

    Did you read what I wrote about the Geekbench integer test? I said I removed the AES subtest as that used hardware acceleration on BT but not on A15 which skews the score. Without that the A15 wins the integer test too despite running at a significantly lower frequency (remember that Avoton runs at 2.6GHz turbo).

    Yes, the large number of on-board interfaces on X-Gene is definitely a differentiator, and this improves the power efficiency equation. Personally I think the 4 DDR3L channels are more interesting, providing a whopping 60GB/s or 7.5GB/s per core!
  • patrickjchase - Saturday, October 4, 2014

    I call A15 a big, power-hungry design that looks great on paper but that has consistently underperformed in the real world, probably due to memory subsystem and branch prediction limitations. Yes, it's 3-wide and has a massive ROB, but it performs no better than the much less aggressive A17 (and not much better clock-for-clock than Krait, for that matter - There's a reason why Snapdragons dominated the market in that generation).

    ARM have effectively admitted as much by backtracking to a much less aggressive design in A17.
  • Wilco1 - Sunday, October 5, 2014

    The main reason A17 does so well is the streamlined memory system that was designed in the 2 years since the A15. Better prefetchers and the ability to do 2 loads or 2 stores per cycle easily give 20-30% gain. You can see a 2.2GHz comparison of A15 with A17 here: http://browser.primatelabs.com/geekbench3/compare/... - almost identical integer and FP scores, but A17 scores 15% better on memory.

    If any ARM CPU could be compared with Pentium 4, Krait would fit the description perfectly - it is a design for high frequency with low IPC. So it has to be clocked extremely high to get decent performance (there is a reason Samsung uses either 1.9GHz A15 or 2.5-2.7GHz Kraits in their phones - they have nearly identical performance).

    Krait's only redeeming factor is that its power consumption is very good. However Snapdragon's dominance in the last 2 years is unlikely to continue now that other SoCs finally support built-in modems and QC got the timing of the 64-bit generation badly wrong.
  • patrickjchase - Sunday, October 5, 2014

    Now that I think about it some more, A15 can be described by a single "superlative": It's the Pentium-4 of the ARM world. Really.
  • shodanshok - Sunday, October 5, 2014

    Wilco, patrick,
    Thank you for giving some references for the Silvermont FP cluster.

    @wilco
    "Aggressive" and "top notch" were meant relative to current common ARM cores. Silvermont performance is quite good even compared to A15 which, as patrick noted, is impressive on paper but underperforms in real world tests and consumes a noticeable amount of power. And until A12/A17, Cortex memory performance was quite bad.

    I think that the real plus of Intel cores is their very advanced prefetcher...

    Web benchmarks, while flawed, represent a real world scenario which is quite significant for end users.

    Anyway, the X-Gene discussed above is quite a different beast ;)

    Regards.
  • TadzioPazur - Tuesday, October 14, 2014

    The benchmark that you referenced (how relevant is it to server products?) indicates that an ARM-based tablet is on par with the server Avoton, and needs to catch up when the field levels.
    Intel Atom C2750 @ 2.40 GHz trades blows with the nVidia tn8 @ 2.22 GHz (in single threaded benchmarks).
    Now, let's see what the ramifications are of making it a server-grade chip:
    1. Buffered memory with ECC (latencies go up) - the Avoton already uses them
    2. Write-through D cache (that is ARM's solution for cache coherency)
    3. Optional addition of L3 cache (which would mitigate point 2 somewhat)
    4. Further latency increase if we go to multisocket machines (optional, would hurt both lines - not needed for SAN/NAS appliances)

    So no, ARM chips are not clearly better than low-power, low-performance Intel CPUs, at least performance-wise.
  • michael2k - Wednesday, October 1, 2014

    Um, what world do you live in where ARM isn't competitive in tablets? In the real world it is Intel that isn't competitive. Apple managed to ship a 6 issue dual core OOE processor at 1.3GHz last year and 1.4GHz this year, broadly comparable to the 2GHz 4 issue dual core Core Duo that Intel shipped in 2006.
  • AFigueira - Thursday, October 2, 2014

    Can you provide some sources for those claims, please?
