Arm Cortex X925: Leading The Way in Single-Threaded IPC

The Arm Cortex-X925, codenamed "Black Hawk," as Arm boldly claims, stands at the forefront of single-threaded instruction per clock (IPC) performance, setting things up for improved performance and efficiency in a big way, at least from Arm's claims. This core is a pivotal part of Arm's move to the 3 nm process node and integrates seamlessly into the second-generation Armv9.2 architecture. If Arms claims were taken as gospel, the Cortex X925 would be positioned as a leader in high-performance mobile computing and is an example of where Arm and its focus on a highly efficient PPA is the driving force with Arm's 2024 CPU Core Cluster.

The Cortex-X925 is built on architectural improvements designed to maximize IPC. One of the standout features is its 10-wide decode and dispatch width, significantly increasing the number of instructions processed per cycle. This enhancement allows the core to execute more instructions simultaneously, leading to better utilization of execution units and higher overall throughput.

Arm has doubled the instruction window size to support this wide instruction path, allowing more instructions to be held in flight at any given time. This reduces stalls and improves the efficiency of the execution pipeline. Additionally, the core boasts a 2X increase in L1 instruction cache (I$) bandwidth and a similar increase in L1 instruction translation lookaside buffer (TLB) size. These enhancements ensure that the core can quickly fetch and decode instructions, minimizing delays and maximizing performance.

The Cortex-X925 also features a highly advanced branch prediction unit, which reduces the number of mispredicted branches. By incorporating techniques such as folded-out unconditional direct branches, Arm has removed several architectural roadblocks, enabling a more streamlined and efficient execution path. This leads to fewer pipeline flushes and higher sustained IPC.

The front end of the Arm Cortex-X925 showcases plenty of improvements within the design, including boosting instruction throughput and reducing latency. Central to these improvements is the 10-wide decode and dispatch width, which allows the core to handle more instructions per cycle compared to previous architectures. This wide instruction path increases the parallelism in instruction processing, enabling the core to execute more tasks simultaneously.

Additionally, the Cortex-X925 features a doubled instruction window size, accommodating more instructions in flight and minimizing pipeline stalls. The L1 instruction cache (I$) bandwidth has also been increased by 2x, along with a similar expansion in the L1 instruction translation lookaside buffer (iTLB) size. These enhancements ensure that the core can quickly fetch and decode instructions, significantly reducing fetch bottlenecks and improving overall performance.

The backend of the Cortex-X925 has seen significant growth in out-of-order (OoO) execution capabilities, with a 25-40% increase. This growth allows the core to execute instructions more flexibly and efficiently, reducing idle times and improving overall performance. Furthermore, the core's register file structure has been enhanced, increasing the reorder buffer size and instruction issue queues, contributing to ultimately smoother and, thus, faster instruction execution.

Despite its high performance, the Cortex-X925 is designed to be power efficient. The 3 nm process technology is crucial, enabling better power efficiency than previous generations. The core's design includes features such as dynamic voltage and frequency scaling (DVFS), which allows it to adjust power and performance levels based on the workload. This ensures energy is used efficiently, extending battery life and reducing thermal output.

The Cortex-X925 also incorporates advanced power management features, such as per-core DVFS and improved voltage regulation. These features help manage power consumption more effectively, ensuring the core delivers high performance without compromising energy efficiency. This balance is particularly beneficial for mobile devices requiring sustained performance and long battery life.

The Cortex-X925 is also designed for and optimized for AI-based workloads, with dedicated AI accelerators and software optimizations that enhance AI processing efficiency. With up to 80 TOPS (trillion operations per second), the core can handle complex AI tasks, from natural language processing to computer vision. These capabilities are further supported by Arm's Kleidi AI and Kleidi CV libraries, which provide developers with the tools needed to build advanced AI applications.

Interestingly, Arm hasn't moved into the realm of NPU or AI accelerators itself. Instead, it allows its partners, such as MediaTek, to incorporate their own, ensuring that the Core Cluster can provide the necessary support and integration capabilities. With its reference software stack and optimized libraries, the CSS platform provides a robust foundation for developers. The inclusive Arm Performance Studio offers advanced tooling environments that help developers optimize their applications for the new architecture.

The CSS platform's integration with operating systems such as Android, Linux variants, and Windows through its reinvigorated Windows on Arm OS ensures broad compatibility and ease of development. This cross-operating system support enables developers to quickly and efficiently build applications that leverage the capabilities of the Cortex-X925, along with the entirety of the updated Armv9.2 Core Cluster, which not only accelerates time-to-market but ensures compatibility across multiple devices.

Arm Unveils 2024 CPU Core Designs, Cortex X925, A725 and A520: Arm v9.2 Redefined For 3nm Arm Cortex A725: Improvements to Middle Core Efficiency
POST A COMMENT

55 Comments

View All Comments

  • StormyParis - Wednesday, May 29, 2024 - link

    Do these processors have anything to prevent exploits such as RowHammer etc ... ? Those & variants have been a big story, then disappeared, but we were never told about an actual solution ? Reply
  • GeoffreyA - Wednesday, May 29, 2024 - link

    Are these companies so lame with their AI desperation? Reply
  • abufrejoval - Wednesday, May 29, 2024 - link

    Memory tagging extensions have been around since ARM 8.5. When you say ARM 9.2 MTE, does that mean they have been significantly upgraded e.g. in the direction of what CHERI does for RISC-V?

    I've been trying to find out if ARM has an "AVX-512" issue with their different big/middle/small designs, too. That is if these distinct cores might actually differ in the range of instruction set extensions they support. And I can't get a clear picture, either.

    So if say the big cores support some clever new vector formats for AI and the middle or small cores won't, how will apps and OS deal with the issue?
    Reply
  • GeoffreyA - Wednesday, May 29, 2024 - link

    I don't know enough about ARM to comment, but should think that there are compatibility issues, with instructions, spanning different models and generations. Perhaps there's a feature-level type of method? Reply
  • Findecanor - Wednesday, May 29, 2024 - link

    ARM MTE is much cruder than CHERI. It can be described as "memory colouring": Every allocation in memory is tagged with one of 16 colours. Two adjacent allocations can't have the same colour. When you use a pointer the colour bits in otherwise unused top bits of the pointer have to match the colour of the allocation it points into.

    With SVE both E and P cores need to have the same vector length, yes. The vector length is usually no larger than a cache line which have to be the same size anyway.

    I don't know specifically about SME but many extensions have to first be enabled by the OS on a core to be available to user-mode programs. If not all cores have an extension, the OS may choose to not enable it on any.
    Reply
  • mode_13h - Thursday, May 30, 2024 - link

    > The vector length is usually no larger than a cache line which have to be the same size anyway.

    Cache lines are usually 64 bytes, which is 512 bits. Presumably, the A520 has only SVE2 @ 128 bits. So, don't let ARM off the hook *that* easily!
    Reply
  • eastcoast_pete - Wednesday, May 29, 2024 - link

    That is very much something I am wondering, too. Reports/rumors have it that, for example, Qualcomm chose not to enable SVE in the big cores of their SD 8 Gen3. Qualcomm isn't exactly forthcoming with information about that, not that I would expect them to comment. Reply
  • name99 - Wednesday, May 29, 2024 - link

    That's a silly response. It's like being present at the birth of Mac or Windows and saying "why are these stupid hardware companies trying so hard to make their chips run graphics fast?"

    The hope of LLMs is that they will provide a substantial augmentation to existing UI. So instead of having to understand a complicated set of Photoshop commands, you'll be able to say something like "Highlight the subject of the photo. Now move it about an inch left. Now remove that power line in the background".
    This is not a trivial task; it requires substantial replumbing on existing apps, along with a fair degree of rethinking app architecture. Well, no-one said it would be easy to convert Visicalc to Excel...

    But that is where things are headed. And because ARM (and Apple, and QC, and MS) are not controlled by idiots who think only in terms of tweets and snark, each of these companies is moving heaven and earth to ensure that they will not be irrelevant during this shift.

    (Oh, you thought the entire world consisted of LLMs answering questions did you? Strange how the QUESTION-ANSWERING COMPANY, ie Google, has created that impression...
    Try thinking independently for once. Everything in the world happens along a dozen dimensions at once. All it takes to be a genius is to be able to hold *more than one* dimension in your head simultaneously.)
    Reply
  • FunBunny2 - Wednesday, May 29, 2024 - link

    Well, no-one said it would be easy to convert Visicalc to Excel..

    well... Mitch did it, in assembler at first, and called it Lotus 1-2-3
    Reply
  • GeoffreyA - Thursday, May 30, 2024 - link

    You can go on with your ad hominen and air of superiority; it won't change that these companies are tripping over themselves, in insecurity and desperation, to grab dollars from the AI pot or not be left behind.

    You make assumptions about my ideas based on one sentence. In fact, AI is quite interesting to me. Not how it's going to help someone in Photoshop or Visual Studio, but where LLMs eventually lead; whether they end up being the language faculty in strong AI, or more; what's missing from today's LLMs (state, being trained in real-time, connecting to the sense modalities, using little power in a small space, etc.); and whether consciousness, the last great riddle, will ever be solved, how, and the moral implications. That's of interest to me.

    But when one see Microsoft and Intel making an "AI PC," or AMD calling their CPU "Ryzen AI," and so on, it is little about true AI and more about money, checklists, and the bandwagon. Independence of thought is about seeing past fashion and the in-thing. And no thank you: I have got no desire, nor the insecurity, to want to be a genius.
    Reply

Log in

Don't have an account? Sign up now