Memory Performance

ARM made several improvements to the A73’s memory system. Both L1 caches grow: the I-cache from 48KB (A72) to 64KB, and the D-cache doubles to 64KB. The A73 includes several other changes, such as enhanced prefetching, that should improve cache performance too.

The A73 still has 2 AGUs like the A72, but each one is now capable of both load and store operations; in the A72, one AGU was dedicated to loads and the other to stores. This more flexible arrangement should improve the issue rate for memory operations.

The Kirin 960’s larger 64KB L1 cache maintains a steady latency of 1.27ns versus 1.74ns for the Kirin 950, a 27% improvement that far exceeds the 2.6% difference in CPU frequency, highlighting the A73’s L1 cache improvements. L2 cache latency is essentially the same, but the Kirin 960 again shows a 27% latency improvement over the Kirin 950 when accessing main memory, which should benefit the latency-sensitive CPU cores.
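As a sanity check, the latency deltas quoted above are simple percent-change calculations over the measured values (the helper name here is ours, not part of the test suite):

```python
def improvement(old_ns: float, new_ns: float) -> float:
    """Percent reduction of new_ns relative to old_ns (lower latency is better)."""
    return (old_ns - new_ns) / old_ns * 100

# L1 load-to-use latency: Kirin 950 = 1.74 ns, Kirin 960 = 1.27 ns
print(f"{improvement(1.74, 1.27):.0f}%")  # ~27% lower latency
```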

Memory bandwidth results are less definitive, however. The Kirin 960 shows up to a 30% improvement in L1 read bandwidth over the Kirin 950 depending on the access pattern used, although L1 write bandwidth is lower by nearly the same amount. The 960’s L2 cache bandwidth is also lower for both read and write by up to about 30%.

The two graphs above, which show read and write bandwidth using NEON instructions with two threads, help to illustrate the Kirin 960’s memory bandwidth. When reading, the Kirin 960’s L1 cache outperforms the 950’s, but bandwidth drops once accesses spill into the L2 cache. The Kirin 950 outpaces the 960 when writing to both L1 and L2, only falling below the 960’s bandwidth when writing to system memory. This reduction in cache bandwidth could help explain the Kirin 960’s performance regression in several of Geekbench 4’s floating-point tests.

Geekbench 4 - Memory Performance (Single-Threaded)

| Test | Kirin 960 | Kirin 950 (960 Advantage) | Exynos 7420 (960 Advantage) | Snapdragon 821 (960 Advantage) |
|------|-----------|---------------------------|------------------------------|--------------------------------|
| Memory Copy | 4.55 GB/s | 3.67 GB/s (23.87%) | 3.61 GB/s (26.04%) | 7.82 GB/s (-41.84%) |
| Memory Latency | 12.1 Mops/s | 9.6 Mops/s (25.39%) | 5.6 Mops/s (115.67%) | 6.6 Mops/s (81.82%) |
| Memory Bandwidth | 15.5 GB/s | 9.2 GB/s (69.28%) | 7.5 GB/s (105.84%) | 13.5 GB/s (14.53%) |
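The “% Advantage” figures above are simple ratios of the published numbers; a minimal sketch for reproducing them (the helper name is ours, and small deviations from the quoted percentages come from rounding in the published GB/s values):

```python
def advantage(kirin960: float, other: float) -> float:
    """Kirin 960's lead over another SoC, as a percentage (negative = deficit)."""
    return (kirin960 / other - 1) * 100

# Memory Copy, single-threaded (GB/s figures from the table above)
print(f"{advantage(4.55, 3.67):+.1f}%")  # vs. Kirin 950, ~ +24%
print(f"{advantage(4.55, 7.82):+.1f}%")  # vs. Snapdragon 821, ~ -42%
```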

While the Kirin 960’s L1/L2 cache performance is mixed, it holds a clear advantage over the Kirin 950 when using system memory. Memory latency improves by 25%, in line with what our internal testing shows, and memory bandwidth improves by 69%. The A73’s two load/store AGUs are likely responsible for a large chunk of the additional memory bandwidth, with the Mate 9’s higher memory bus frequency helping some too.

System Performance

Now it’s time to see how the Kirin 960’s lower-level CPU and memory results translate into real-world performance. Keep in mind that OEMs can influence the balance between performance and battery life in a number of ways, including adjusting thermal limits and the parameters that govern CPU scheduler and DVFS behavior, which is one reason why two devices with the same SoC can perform differently.

PCMark - Work 2.0 Performance Overall

PCMark - Web Browsing 2.0

PCMark - Writing 2.0

PCMark - Data Manipulation 2.0

PCMark includes several realistic workloads that stress the CPU, GPU, RAM, and NAND storage using Android API calls many common apps use. The Mate 9 and its Kirin 960 SoC land at the top of each chart, outpacing the Mate 8 and its Kirin 950 by 15% overall and the top-performing Snapdragon 821 phones by up to 20%.

The Mate 9’s advantage over the Mate 8 is only 4% in the Web Browsing test, but it’s still the fastest phone we’ve tested so far. Integer performance is not the Kryo CPU’s strength, and in this integer-heavy test all of the Snapdragon 820/821 phones fall behind SoCs using ARM’s A72 and A73 CPUs, with LeEco’s Le Pro3, the highest performing Snapdragon 821 phone, finishing 18% slower than the Mate 9.

The Writing test performs a variety of operations, including PDF processing and file encryption (both integer workloads), along with some memory operations and even reading and writing some files to internal NAND, and it tends to generate frequent, short bursts of activity on the big CPU cores. This seems to suit the Mate 9 just fine, because it extends its performance advantage over the Mate 8 to 23%. There’s a pretty big spread between the Snapdragon 820/821 phones; the LeEco Le Pro3, the best performer in the family, is 40% faster than the Galaxy S7 edge, a prime example of how other hardware components and OEM software tinkering can affect the overall user experience.

The Data Manipulation test is another primarily integer workload that measures how long it takes to parse chunks of data from several different file types and then records the frame rate while interacting with dynamic charts. In this test, the Mate 9 is 30% faster than the Mate 8 and 37% faster than the Pixel XL.

Kraken 1.1 (Chrome/Safari/IE)

WebXPRT 2015 (Chrome/Safari/IE)

JetStream 1.1 (Chrome/Safari)

All of the Snapdragon 820/821 phones perform well in the Kraken JavaScript test, pulling ahead of the Mate 9 by a small margin. The P9 uses Kirin 955’s 7% CPU frequency advantage to help it keep up with the Mate 9 in Kraken and JetStream. The Mate 9 still pulls ahead by 11% in WebXPRT 2015, though, and outperforms the Mate 8 by 10% to 19% in all three tests. The Moto Z Play Droid, the only phone in the charts to use an octa-core A53 CPU configuration, cannot even manage half the performance of the Mate 9, which is similar to what our integer IPC tests show.

The Kirin 960 showed mixed results in our lower-level CPU and memory testing, pulling ahead of the Kirin 950 in some areas while falling behind in others. But when looking at system-level tests using real-world workloads, the Mate 9 and its Kirin 960 are the clear winners. There are many hardware and software layers between you and the SoC, which is why it’s important not to use an SoC benchmark to test system performance, or a system benchmark, such as PCMark, to test CPU performance.

86 Comments

  • Eden-K121D - Tuesday, March 14, 2017

    Samsung only
  • Meteor2 - Wednesday, March 15, 2017

    I think the 820 acquitted itself well here. The 835 could be even better.
  • name99 - Tuesday, March 14, 2017

    "Despite the substantial microarchitectural differences between the A73 and A72, the A73’s integer IPC is only 11% higher than the A72’s."

    Well, sure, if you're judging by Intel standards...
    Apple has been able to sustain about a 15% increase in IPC from A7 through A8, A9, and A10, while also ramping up frequency aggressively, maintaining power, and reducing throttling. But sure, not a BAD showing by ARM; the real issue is whether they'll keep delivering this sort of improvement at least annually?

    Of more technical interest:
    - the largest jump is in mcf. This is a strongly memory-bound benchmark, which suggests a substantially improved prefetcher. In particular simplistic prefetchers struggle with it, suggesting a move beyond just next-line and stride prefetchers (or at least the smarts to track where these are doing more harm than good and switch them off.) People agree?

    - twolf appears to have the hardest branches to predict of the set, with vpr coming up second. So it's POSSIBLE (?) that their relative shortcomings reflect changes in the branch/fetch engine that benefit
    most apps but hurt specifically weird branching patterns?

    One thing that ARM has not made clear is where instruction fusion occurs, and so how it impacts the two-decode limit. If, for example, fusion is handled (to some extent anyway) as a pre-decode operation when lines are pulled into L1I, and if fusion possibilities are being aggressively pursued [basically all the ideas that people have floated --- compare+branch, large immediate calculation, op+storage (?), short (+8) branch+op => predication like POWER8 (?)] there could be a SUBSTANTIAL fraction of fused instructions going through the system, so that the 2-wide decode is basically as good as the 3-wide of A72?
  • fanofanand - Wednesday, March 15, 2017

    Once WinArm (or whatever they want to call it) is released, we will FINALLY be able to compare apples to apples when it comes to these designs. Right now there are mountains of speculation, but few people actually know where things are at. We will see just how performant Apple's cores are once they can be accurately compared to Ryzen/Core designs. I have the feeling a lot of Apple worshippers are going to be sorely disappointed. Time will tell.
  • name99 - Wednesday, March 15, 2017

    We can compare Apple's ARM cores to the Intel cores in Apple laptops today, with both GeekBench and Safari. The best matchup I can find is this:
    https://browser.primatelabs.com/v4/cpu/compare/177...

    (I'd prefer to compare against the MacBook 12" 2016 edition with Skylake, but for some reason there seem to be no GB4 results for that.)

    This compares an iPhone (so ~5W max power?) against a Broadwell that turbo's up to 3.1 GHz (GB tends to run everything at the max turbo speed bcs it allows the core to cool between the [short] tests), and with TDP of 15W.

    Even so, the performance is comparable. When you normalize for frequency, you get that A10 is about 20% better IPC, so drops down to maybe 15% better IPC for Skylake.
    Of course that A10 runs at a lower (peak) frequency --- but also at much lower power.

    There's every reason to believe that the A10X will beat absolutely the equivalent Skylake chip in this class (not just m-class but also U-class), running at a frequency of ?between 3 and 3.5GHz? while retaining that 15-20% IPC advantage over Skylake and at a power of ?<10W?
    Hopefully we'll see in a few weeks --- the new iPads should be released either end March or beginning April.

    Point is --- I don't see why we need to wait for WinARM server --- specially since MS has made no commitment to selling WinARM to the public, all they've committed to is using ARM for Azure.
    Comparing GB4 or Safari on Apple devices gives us comparable compilers, comparable browsers, comparable OSs, comparable hardware design skill. I don't see what a Windows equivalent brings to the table that adds more value.
  • joms_us - Wednesday, March 15, 2017

    Bwahaha keep dreamin iTard, GB is your most trusted benchmark. =D

    Why don't you run both machine with A10 and Celeron released in 2010. You will see how pathetic your A10 is in realworld apps.
  • name99 - Wednesday, March 15, 2017

    When I was 10 years old, I was in the car and my father and his friend were discussing some technical chemistry. I was bored with this professional talk of pH and fractionation and synthesis, so after my father described some particular reagent he'd mixed up, I chimed in with "and then you drank it?", to which my father said "Oh be quiet. Listen to the adults and you might learn something." While some might have treated this as a horrible insult, the cause of all their later failures in life, I personally took it as serious advice and tried (somewhat successfully) to abide by it, to my great benefit.
    Thanks Dad!

    Relevance to this thread is an exercise left to the reader.
  • joms_us - Wednesday, March 15, 2017

    Even the latest Ryzen is just barely equal or faster than Skylake clock per clock, so what makes you think a worthless low-powered mobile chip will surpass them? A10 is not even better than SD821 on real-world apps comparison. Again, real-world apps, not Antutu, not Geekbench.
  • zodiacfml - Wednesday, March 15, 2017

    Intel's chips are smaller than Apple's. Apple also has the luxury to spend much on the SoC.
  • Andrei Frumusanu - Tuesday, March 14, 2017

    Stamp of approval.
