Test Bed and Setup - Compiler Options

For the rest of our performance testing, we’re disclosing the details of the various test setups:

Ampere "Mount Jade" - Dual Altra Q80-33

Obviously, for the Ampere Altra system we’re using the provided Mount Jade server as configured by Ampere.

The system features two Altra Q80-33 processors on Ampere's Mount Jade DVT motherboard.

In terms of memory, we're using the bundled 16 DIMMs of 32GB Samsung DDR4-3200, for a total of 512GB – 256GB per socket.

CPU             2x Ampere Altra Q80-33 (3.3 GHz, 80 cores, 32 MB L3, 250 W)
RAM             512 GB (16x 32 GB) Samsung DDR4-3200
Internal Disks  Samsung MZ-QLB960NE 960 GB
                Samsung MZ-1LB960NE 960 GB
Motherboard     Mount Jade DVT Reference Motherboard
PSU             2000 W (94%)

The system came preinstalled with CentOS 8, and we continued to use that OS. It's worth noting that the server is Arm SBSA compatible, so it can run any standard Linux distribution.

Ampere makes special note of Oracle's active support of its Oracle Linux variant for Altra, which makes sense given that Oracle announced a few months ago that it is adopting Altra systems for its own cloud offerings.

The only other note to make about the system is that the OS runs with 64KB pages rather than the usual 4KB pages. This can be viewed either as a testing discrepancy or as an inherent advantage of the Arm system: the next page-size step on x86 is 2MB huge pages, which aren't feasible for general use-case testing and are something deployments would have to explicitly enable.
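For readers wanting to verify this on their own machines, the active page size is easy to query; a minimal sketch using only the Python standard library (illustrative, not part of our actual test harness):

    import resource

    # 4096 (4KB) on typical x86 installs; 65536 (64KB) on this Altra's CentOS 8 kernel.
    page = resource.getpagesize()
    print(f"kernel page size: {page // 1024} KB")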

The system has all relevant security mitigations activated, including SSBS (Speculative Store Bypass Safe) against Spectre variants.
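The mitigation state is straightforward to confirm from userspace on any of the three systems, as the kernel reports it under sysfs; a minimal sketch (the directory exists on Linux 4.15 and later):

    from pathlib import Path

    # Each file reports "Not affected", "Vulnerable", or the mitigation in effect.
    vulns = Path("/sys/devices/system/cpu/vulnerabilities")
    for entry in sorted(vulns.iterdir()):
        print(f"{entry.name:28s} {entry.read_text().strip()}")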

AMD - Dual EPYC 7742

For our AMD system, we unfortunately hit some issues with our Daytona reference server motherboard, and moved over to a test-bench setup on a SuperMicro H11DSI0.

We’re also equipping the system with 256GB of memory per socket – eight channels of DDR4-3200 with one DIMM per channel – matching the Altra system.

CPU             2x AMD EPYC 7742 (2.25-3.4 GHz, 64 cores, 256 MB L3, 225 W)
RAM             512 GB (16x 32 GB) Micron DDR4-3200
Internal Disks  OCZ Vector 512 GB
Motherboard     SuperMicro H11DSI0
PSU             EVGA 1600 T2 (1600 W)

As an operating system we’re using Ubuntu 20.10 with no further optimisations. In terms of BIOS settings we’re using complete defaults, including retaining the default 225W TDP of the EPYC 7742s and leaving other CPU configurables on auto, except for the NPS setting, where we explicitly state the configuration used in the results.
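The NPS (NUMA nodes per socket) setting directly changes how many NUMA nodes the firmware exposes to the OS, which is easy to sanity-check from userspace; a minimal sketch reading the standard sysfs topology:

    from pathlib import Path

    # A dual-socket system in NPS1 should report 2 nodes; NPS4 reports 4 per socket (8 total).
    nodes = sorted(p.name for p in Path("/sys/devices/system/node").glob("node[0-9]*"))
    print(f"{len(nodes)} NUMA nodes: {', '.join(nodes)}")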

The system has all relevant security mitigations activated against speculative store bypass and Spectre variants.

Intel - Dual Xeon Platinum 8280

For the Intel system we’re also using a test-bench setup – in fact with the same SSD and OS image, as we didn’t have enough RAM to run both systems concurrently.

Because the Xeons only have 6-channel memory, their maximum capacity is limited to 384GB of the same Micron memory, running at a default 2933MHz to remain within the processor’s official spec.
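To double-check the configured (as opposed to rated) speed of each DIMM, the SMBIOS tables can be queried; a sketch using the dmidecode utility (requires root, and the field name varies slightly between dmidecode versions):

    import subprocess

    out = subprocess.run(["dmidecode", "--type", "memory"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        line = line.strip()
        # Newer dmidecode prints "Configured Memory Speed", older "Configured Clock Speed".
        if line.startswith(("Configured Memory Speed", "Configured Clock Speed")):
            print(line)  # expect 2933 MT/s per populated DIMM on the Xeon testbed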

CPU             2x Intel Xeon Platinum 8280 (2.7-4.0 GHz, 28 cores, 38.5 MB L3, 205 W)
RAM             384 GB (12x 32 GB) Micron DDR4-3200 (running at 2933 MHz)
Internal Disks  OCZ Vector 512 GB
Motherboard     ASRock EP2C621D12 WS
PSU             EVGA 1600 T2 (1600 W)

The Xeon system was similarly run on BIOS defaults on an ASRock EP2C621D12 WS with the latest firmware available.

The system has all relevant security mitigations activated against the various vulnerabilities.

Compiler Setup

For compiled tests, we’re using the release version of GCC 10.2. The toolchain was compiled from scratch on both x86 systems as well as the Altra system. We’re using dynamically linked binaries built against each system’s libc.
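As a rough illustration of what a per-platform build invocation looks like – note that the -march/-mtune values below are assumptions for each microarchitecture, not the exact flag set used for the results in this review:

    import subprocess

    # Illustrative tuning flags per platform – assumptions, not our verified settings.
    FLAGS = {
        "altra": ["-Ofast", "-march=armv8.2-a", "-mtune=neoverse-n1"],
        "epyc":  ["-Ofast", "-march=znver2"],
        "xeon":  ["-Ofast", "-march=cascadelake"],
    }

    def build(platform: str, source: str, output: str) -> None:
        # Dynamically linked against the system libc, as described above;
        # assumes the freshly built GCC 10.2 is installed as "gcc-10".
        subprocess.run(["gcc-10", *FLAGS[platform], source, "-o", output], check=True)

    build("altra", "bench.c", "bench")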

Comments

  • mode_13h - Thursday, December 31, 2020 - link

    Isn't Blender included in SPECfp2017 as 526.blender_r? Or is that something different?
  • Teckk - Friday, December 18, 2020 - link

    Whoever decided on naming these products — fantastic job. Simple, clear and effective.
    Maybe you can offer some free advice to Intel and Sony.
  • Calin - Friday, December 18, 2020 - link

    The answer to the question of "how powerful it is" is clear - more than good enough.
    The real question in fact is:
    "How much can they produce?"
    AMD has the crown in x86 processor performance, but this doesn't really matter very much as long as they can only build enough processors for a part of the market.
  • jwittich - Friday, December 18, 2020 - link

    How many do you need? :)
  • Bigos - Friday, December 18, 2020 - link

    64kB pages might significantly enhance performance on workloads with large memory sets, as up to 16x fewer TLB entries are needed to cover the same footprint. On the other hand, the memory usage of the Linux file system cache will also increase a lot.

    Would you be able to test the effect of 64kB vs 4kB page size on at least some workloads?
  • Andrei Frumusanu - Friday, December 18, 2020 - link

    It's something that I wanted to test, but it requires an OS reinstall / kernel recompile - I didn't want to go down that rabbit hole of a time sink, as I'd already spent a lot of time verifying a lot of data across the three platforms over a few weeks.
  • arnd - Friday, December 18, 2020 - link

    I'd love to see that as well. For workloads that use transparent huge pages, there should not be much difference, since both would use 2MB huge pages (512*4KB or 32*64KB), plus one or more even larger page sizes, but it needs to be measured to be sure.

    The downsides of 64KB pages (larger disk I/O and higher RAM usage) are often harder to quantify, as most benchmarks try to avoid the interesting cases.

    I've tried benchmarking kernel compiles on Graviton2 both ways and found 64kB pages to be a few percent faster when there is enough RAM, but forcing the system to swap by limiting the total RAM made the 64kB kernel easily 10x to 1000x slower than the 4kB one, depending on how the available memory lined up with the working set. (One way to set up such a RAM cap is sketched after the comments.)
  • abufrejoval - Friday, December 18, 2020 - link

    Thank you for the incredible amount of information and the work you put into this: Anandtech's best!

    Yet I wonder who would deploy this, and where. The purchasing price of the CPU would seem to become a rather minuscule part of the total system cost, especially once you go into big-RAM territory. And I wonder if it's not similar with the energy budget: I see my larger systems requiring more $ and Watts on RAM than on the CPUs. Are they doing anything, or can they do anything, to reduce DRAM energy consumption vs. Intel/AMD?

    The cost of the ecosystem change to ARM may be less relevant once you have the scale to pay for it, but where exactly would those scale benefits come from? And what scales are we talking about? Would you need 100k or 1m servers to break even?

    And what sort of system load would you have to reach/maintain to have significant energy advantages vs. x86 iron?

    Do they support special tricks like powering down quadrants and RAM banks for load management? Do they enable quick standby/activation modes, so that servers can be taken off and on for load management?

    And how long would the benefits last? AMD has demonstrated rather well that the ability to execute over at least three generations of hardware is required to shift attention even from the big guys, and they still have all the scaling benefits the x86 installed base provides.

    These guys are on a 2nd-generation product and promise a 3rd, but essentially this would seem to have the same level of confidence as the first EPYC.
  • askar - Friday, December 18, 2020 - link

    Would you mind testing ML performance, e.g. Python's SKLearn library classes that can be multithreaded (random forest, for example)?
  • mode_13h - Sunday, December 20, 2020 - link

    MLPerf?
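
As a follow-on to arnd's RAM-capping experiment above, here is a minimal sketch of one way to cap a workload's memory so that overflow pages are forced out to swap. It assumes cgroup v2 mounted at /sys/fs/cgroup and root privileges; the group name and the 4G limit are arbitrary placeholders:

    import subprocess
    from pathlib import Path

    # Create a cgroup with a hard memory cap; pages beyond it get reclaimed/swapped.
    cg = Path("/sys/fs/cgroup/pagesize-test")
    cg.mkdir(exist_ok=True)
    (cg / "memory.max").write_text("4G\n")

    # Launch the workload, then move it into the capped group.
    # (A real harness would start the process suspended to avoid the brief uncapped window.)
    proc = subprocess.Popen(["make", "-j64"])
    (cg / "cgroup.procs").write_text(str(proc.pid))
    proc.wait()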
