Hi Sarah, can you post any links (including rumors) about that? Given ARM's focus on bigger, high performance-oriented designs, the LITTLE cores haven't gotten a lot of love in recent years. The persistence of the in-order designs for ARM LITTLE cores is one of the reasons why I find the dominance of ARM troubling; that clearly stood still because there is nowhere else to turn to for many, i.e. they didn't have to change it. In x86, at least we have two larger players having their own, yet compatible designs.
I've seen it reported in a few places, including on RWT which is a pain to search - but since task migration generally requires compatible instruction sets between big and little cores, it's pretty clear that Matterhorn will bring a small, low-power friend when it arrives.
I wonder if they could simply repurpose either a refresh of the A73 or A75 as the little core. Surely with the new fabrication processes available, die area relative to a big Matterhorn core should be comparable to A55 vs A78/X1, but the question becomes performance / energy. Integer performance of A75/73 vs. Ice Storm is comparable with the former winning by a bit in FP, but efficiency is light years apart:
That could make sense; there's fairly little information on the micro-architecture of the A65 or A65AE at present except that it does do OoOE, and it's unclear what clocks and efficiency it can achieve as well:
It does sport a bigger maximum L2 configuration than the A55. They do need to up their game here as the A55 makes a pretty poor showing for efficiency compared to Apple's small core (which got even worse in the A14 generation):
Got it, thanks for that! The A65 is interesting, without SMT they are quoting a pretty modest bump in integer performance < 20% at a bit more than half the power of A55 at 7nm:
They could probably tune this to be better without SMT, but are you against having SMT for security reasons?
It's still not close to Apple's small cores in performance, but efficiency might be in the same ballpark now. ARM designs are quite good in terms of PPA, but even their performance-oriented X1 is likely only 70% of the die area of a Firestorm core, and their cache hierarchies are more complex as core designs pull double duty for server parts too.
It probably made sense to have fewer transistors per CPU core as quite a few Android SoC vendors integrated modems on die, but this may change once Qualcomm digests its Nuvia purchase and moves to a smaller node. All parties may hit a wall for per-core improvements as slowing SRAM density improvements at new nodes bottleneck what gains are gotten from logic density improvements.
TL;DR - ARM needs to focus on a new product stack. It needs to have a diverse ARMv9 lineup of small, medium, large chipset options. With the small chipset being very scalable down to Tiny IoT Sensor level. Whereas the large chipset being scalable up to large supercomputers and servers. Whilst the medium chipset focusing on phones and tablets. As this covers full SoC, it includes both CPUs and GPUs.
Long version: I know making these architectures is a huge challenge, but ARM has been a little lazy in some scenarios. I know they're basically following the money in the industry, and that means chasing the "phablet" market for CPUs and GPUs. But they've been leaving themselves vulnerable to gaps, in either smaller power or larger power systems, that can be exploited by competitors, such as RISC-V. If not, even x86 might poke some wins here and there.
Ages ago, like 2013, they had the A7 (tiny), A15 (small), and A57 (medium) core designs. Basically covering most bases. Along with the Mali-400 iGPU, and 1GB-2GB Shared-RAM, to do some compute tasks. To say ARM was innovative would be a disservice to the technology they brought forward. That's in contrast to x86 Intel's Atom (small) and Intel's Core-i7Y/M (large), as well as Intel Iris Pro iGPU with 8GB Shared-RAM in systems of the time. Then ARM made the leap into 64bit processing around 2016. The lineup evolved into the A35 (tiny), A53 (small), A73 (medium) core designs, running with 1GB-2GB-4GB sRAM, and used modest G31 (tiny) to G51 (small) to G71 (medium) iGPU options. Again, this lineup was very innovative and impressive. Contrast that to the new x86 competition in AMD's 16nm Vega Large-iGPU, and Zen1 Large-CPU.
However... There hasn't been any upgrades for the "tiny" portfolio, being stuck to the offerings of Cortex A35 CPU and G31 GPU ever since. There has been only a slight refresh to the "small" portfolio, upgrading to the Cortex A55 CPU, and the G52 and G57 iGPUs. To the point that they're a joke, and easily surpassable by the competitors. ARM really needs a revolutionary new design here, it needs to be super-efficient. Perhaps something that can scale between both tiny and small categories: with performance ranging from the A55 (or more) at the "tiny" power-level, to the A73 (or more) at the "small" power-level. Basically catching up to Apple, if unable to surpass them.
Whereas the "medium" portfolio has seen very frequent upgrades, in the CPU-side to the Cortex A75, A76, A77, and A78. In the GPU-side we've seen G72, G76, G77, G78 which have been mostly competitive, surpassing some custom implementations (Samsung/MediaTek) and losing to others (Apple/Qualcomm). Not much needs to change here to be honest. We've also seen the emergence of a new "large" category of ARM processors. Firstly popularised by custom implementations from Apple (A10 and onwards), then Samsung (Mongoose M3, and onwards). Now it's supported officially by ARM in the form of the Cortex A77+ and the Cortex A78X / X1. This has been mostly underwhelming and uncompetitive, with Apple being the only one implementing good designs. There hasn't been any new "large" category for iGPUs from ARM or competitors, with the only Large-iGPU exception actually being inside the Apple Silicon M1. ARM (without counting Apple) needs to do better here, and it looks like ARM might already be focussing here in the future with ARMv9. Again contrast this to the x86 markets offering 7nm Large-CPUs of Zen2 and Zen3, with RDNA-1 and RDNA-2 Large-GPUs.
> 2013, they had the A7 (tiny), A15 (small), and A57
> Then ARM made the leap into 64bit processing around 2016.
A57 is a 64-bit core.
> Contrast that to the new x86 competition in AMD
No. Why would we do that? They were competing in totally different markets, at the time. The only partial overlap was embedded Ryzen.
> There hasn't been any upgrades for the "tiny" portfolio, being stuck to ... Cortex A35 CPU
> There has been only a slight refresh to the "small" portfolio, upgrading to the Cortex A55 CPU
The A35 and A55 both launched in 2017.
> they're a joke, and easily surpassable by the competitors.
In terms of what? PPA? Perf/W? Perf/$? Might want to be sure you're comparing apples to apples and not comparing competing "small" core with ARM "tiny".
> There hasn't been any new "large" category for iGPUs from ARM or competitors
Samsung is using RDNA and MediaTek is licensing a Nvidia GPU for its upcoming SoCs.
Might want to do a little more research, before writing another longpost. I agree that A55 could use a refresh, but ARMv9 will force that, anyway. I don't even know where A35 is used, but same story, there.
It's worth noting that ARM has also been active in the microcontroller market, with both 32-bit and 64-bit offerings.
Firstly, apologies. I know the A57 is 64bit, but there have been many (most?) implementations of it running in 32bit mode. The A57 was really a "rough draft" for ARM, in moving towards both "medium" sized cores and into 64bit computing. Hence, it feels more at home next to its A7 and A15 brethren.
The contrast is there, and necessary to show the landscape of the time. The tech industry is a fast-paced one. And if your code/calculations are agnostic, such that they can run on any platform, you would consider all options (not that I recommend people go creating agnostic code, compared to specialized or hardware-accelerated code).
The Cortex A35 launched in 2015. It's long due for an upgrade, or replacement. Where this core likes to be in is in small, low-power, and cheap devices. In particular the microcontroller market as you mentioned. ARM hasn't been as active in this field as you think they have, with many of the products being custom designs from the ODMs.
I already mentioned the A55 was a slight refresh for the A53, and that itself is also surpassed. Have a look at Apple's "small" cores. They are Out-of-Order processors, they are slightly faster than an A73, they use slightly less power than an A53. It's mind boggling. Others disagree, and say they're actually faster than A75, and more efficient than A55... but at this scale we're splitting hairs. With that much room for difference, it's not inconceivable (heck it's likely) that an outside competitor like RISC-V will surpass the A55 in terms of Perf/W, Perf/PPA, Perf/$, or a combination of the lot. And remember, the Cortex-A53 is the most popular core out there, where it's getting stamped out on so many different Chinese products.
Samsung isn't using Radeon iGPUs YET, and neither is MediaTek. Besides, we have yet to see them in the wild and find out details of their architecture. These might be licensed from AMD or Nvidia, but they might be "small" iGPUs instead of "large" iGPU designs. I did forget to mention that the Tegra X1, and some Nvidia SBCs, did actually use their "large" iGPU architecture (ie Maxwell etc).
The gist of my rant is that ARM was a revolutionist early on, basically creating the market. Then they were extremely innovative and competitive, basically dominating the market. Now they are competitive but not as revolutionary nor as competitive/innovative as they used to. With ARMv9 they have a chance to start fresh, and return to status quo, by having a trifecta of products for the computing industry. I was pointing out the gaps in their history and portfolio. They shouldn't just focus on mobile phones, that's boring.
Okay, the date I saw was wrong. It seems to have been announced in November 2015. The A55 seems to have been announced in May 2017.
> this core likes to be in is in small, low-power, and cheap devices.
> In particular the microcontroller market as you mentioned.
They have actual microcontrollers, though. The A35 is still too power-hungry (and expensive?) for most IoT devices.
> Have a look at Apple's "small" cores.
You focus on performance and efficiency, but what about area? Apple has a narrower focus and different process, cost, & area targets than ARM.
The point we can definitely agree on is that ARM's bottom & middle tier cores should've been refreshed more frequently. But, everyone seems to think that ARM is directly competing with Apple, but it's not. Their objectives meaningfully differ, resulting in ARM probably being driven more towards making smaller cores than Apple.
It's only at the top end of their mobile stacks that you can really say ARM and Apple are in direct competition. However, even on something like the A78, ARM is still put in a position of having to make compromises that Apple isn't.
> ARM was a revolutionist early on, basically creating the market.
> Now they are competitive but not as revolutionary nor as competitive/innovative as they used to.
That's how these things work. A small upstart has a lot of freedom. The bigger a company gets, the more constrained it becomes by its customers, its market, the cost of changing, and the downside risk. I'm still just not totally convinced that entirely explains what we're seeing.
If they can manage to cleave their server cores entirely from their mobile cores, and then really make big cores that are performance-first (instead of scaled up versions of mostly-performance cores, like the X1 and A78 situation), then we might see them start to compete at Apple's level. Basically, to compete they'd have to start by designing the X1 first, and then make the A78 by putting it on a diet.
> They shouldn't just focus on mobile phones, that's boring.
LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.
> LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.
Focusing on the same-ol' same-ol' business is exactly how once-profitable companies fade into irrelevance as technology moves on. Plenty of mediocre CEOs do that.
A great CEO can find the future revenue opportunities and prove it to the company's owners.
Yeah, but you can't afford to walk away from your bread and butter. Any new growth areas you pursue can't come at the expense of revenues in your core business. If you even threatened to starve your core business, you'd be out of a job before your new ambitions could ever get off the ground.
Just look at what happened with Qualcomm, they tried to invest in new areas, but their investors absolutely wouldn't tolerate it. Granted, they're more exposed than ARM would be, either under Soft Bank or Nvidia.
What you said is EXACTLY what Blockbuster said before they went bankrupt. In case you didn't know, the board members passed the opportunity to buy Netflix for $50 Million. The CEO then tried to right that wrong by acquiring another competitor, and shifting their revenue stream. The board fired their CEO, saying that their late-fee revenue was the bread and butter of their business model. Blockbuster was too narrow focused and stuck in the past, that not only did they miss the opportunity of becoming a whole new behemoth, but they sunk their own ship at the same time.
> What you said is EXACTLY what Blockbuster said before they went bankrupt.
If grant3 is saying that Blockbuster should close half its stores while they're still profitable, to divert money into R&D on getting into the (then) almost non-existent streaming market, no company in the world would do that.
Now, it's not like ARM is ignoring other markets, of course. They just can't turn their back on the mobile market, in order to do so.
> Blockbuster was too narrow focused and stuck in the past
The genius of capitalism is that the failure of Blockbuster to transition into a streaming platform didn't keep streaming from happening. Its investors could even get in on the game by shifting their investments into players in the streaming market. If the CEO was such a believer, he could've quit and gone to work for a streaming company or founded his own.
Also, let's not forget that there have already been losers in streaming, and it wasn't clear Netflix would've successfully made the transition from movies-by-mail. Who remembers Google Video? Yahoo even bought some company in the space. And just last year, there was Quibi. I'm sure there are others I'm forgetting.
I think we all want to see ARM succeed outside of mobile. They've been investing a lot, in order to do so. Some in this very thread have been complaining at their lack of focus on their smaller, lower-power cores (currently A35 & A55), which you could see as evidence they've already been making sacrifices to try and compete outside their niche. I don't know if that's accurate, but it's plausible.
If Nvidia's acquisition goes through (as I expect it will), I hope and expect it will provide ARM with the funds to do even more ambitious things.
Why would you need rumours when we know for a FACT that there will be an A55 successor unless b.L design is abandoned for no good reason. I'll give you a hint, b.L can't have mixed architectures that's why big cores stayed at ARMv8.2a for so long.
Maybe the shift to ARMv9 will force ARM's hand with giving the LITTLE cores out-of-order designs; however, current bigLITTLE designs already mix big, out-of-order designs with LITTLE in-order cores like the A55. So, bL can and has worked with mixed architectures for quite a while. However, I hope you are correct in that the shift to ARMv9 will force the issue, and we'll finally get out-of-order LITTLE cores also on non-Apple devices.
Maybe dotjaz meant you couldn't mix 8.5 and 8.2 architectures?
In any case, DynamIQ, not big.LITTLE, is more relevant now. Also, if people really want to push for an out of order big.LITTLE, why not use the A78 for the big core and the older A76 as the little core? Both A76 and A78 can be fabricated at 5nm, and the A76 would use less power by dint of being able to do less work per clock, which is fine for the kind of work a little core would do anyway.
Thanks for asking. I can't stand that for years the small A55 didn't get any update or successor.
For me it would be even more important to update those, as lots of tasks run on those rather than on the high-performance cores. But I guess it is just better for marketing to talk about big gains in theoretical performance.
At least I expect an update now. Just hope it won't be the only one...
The lack of deep uarch details on the N2 is disappointing, but I guess we'll probably see what Matterhorn looks like in a few weeks so not a huge deal.
I am waiting for the first in-silicone V1 design that Andrei and others can put through its paces. N2 is quite a while away, but yes, maybe we'll see a Matterhorn design in a mobile chip in the next 12 months. As for V1, I am curious to learn what, if anything, Microsoft has cooked up. They've been quite busy trying to keep up with AWS and its Gravitons.
‘Fast-forward to 2021, the Neoverse N1 design today employed in designs such as the Ampere Altra is still competitive, or beating the newest generation AMD or Intel designs – a situation that which a few years ago seemed anything but farfetched.’
Hmm... That last bit is odd. Either it’s just ‘farfetched’ or it’s ‘expected’.
Yes, those slides look very promising; now eagerly awaiting an eventual test of one or two of these in actual silicone. I guess then we'll see how they measure up.
Not to be confused with the chemical element silicon.
A silicone or polysiloxane is a polymer made up of siloxane (−R2Si−O−SiR2−, where R = organic group). They are typically colorless, oils or rubber-like substances. Silicones are used in sealants, adhesives, lubricants, medicine, cooking utensils, and thermal and electrical insulation.
I'll have to take this up with auto-correct. It keeps changing silicon to silicone. Now that I forced it again to leave silicon alone (for the umpteenth time), maybe it will stop (:
Good, finally confirmed N2 is in fact ARMv9 as suspected. Now we'll just have to wait and see how the new mobile counterparts are. Hopefully we'll see some real improvements.
It'll be interesting to see how small the new low power v9 core is given that it has to have a 128b SVE2 pipeline instead of 2x64b NEON.
> finally confirmed N2 is in fact ARMv9 as suspected.
> Now we'll just have to wait and see how the new mobile counterparts are.
> Hopefully we'll see some real improvements.
The data presented on N2 doesn't give me much hope that v9 changed much, besides the feature baseline. I was hoping for something slightly revolutionary, but it's certainly not that.
We've known for a couple of years ARMv9 is just ARMv8.x rebased. Your hopes weren't realistic to begin with. Besides, what "revolutionary" features would you expect ISAs to include? Can you name one? ARMv8.5a+SVE2 already has everything you need to design an excellent and efficient uarch. Why re-invent the wheel just for the sake of it?
> We've known for a couple of years ARMv9 is just ARMv8.x rebased.
You knew this according to where? It's one thing to assume that, and clearly it wasn't an unreasonable assumption, but it's another thing to *know* it. So, how did you *know* it?
> Besides, what "revolutionary" features would you expect ISAs to include? Can you name one?
It's a fair question. Generally speaking, anything that would help improve efficiency. Maybe things like scheduling hints or maybe some kind of tags to indicate memory writes that are thread-private and terminal reads. Just some examples, off the top of my head.
> ARMv8.5a+SVE2 already has everything you need to design an excellent and efficient uarch.
The issue I see is that IPC and efficiency gains are going to become ever more hard-won, so there needs to be some more creativity in redefining the SW/HW interface to unlock further gains. ARMv9 is going to be with us for probably another decade and it could end up having to compete with yet-to-be-identified alternatives like maybe RISC VI or something completely out of left-field. So, I see it as a wasted opportunity. A pragmatic decision, for sure, but a little disappointing.
SoftBank is already publicly traded on the Tokyo Stock Exchange. Why rely on the NVIDIA buyout, which in all likelihood won't happen any time soon, if at all?
> SoftBank is already publicly traded on the Tokyo Stock Exchange.
They also invested heavily in WeWork, when it was highly over-valued. I have no idea what other nutty positions they might've taken, but I think it's not a great proxy for ARM just due to its sheer size.
Never. There has to be an OEM putting these chips into DIY and consumer machines, and so far, apart from HPE's A64FX system, there's no way for a consumer to buy these ARM processors, and even that one is very expensive, at a five-digit price. And then the drivers / sw ecosystem comes into play; there are passion projects like the Pi, as we all know, but they are nowhere near desktop-class performance.
ARM Graviton 2 was made because AWS wants to save money on their Infrastructure, that's why their Annapurna design team is working there. Simply because of that reason Amazon put more effort onto it AND the fact that ARM is custom helps them to tailor it to their workloads and spread their cost.
Altra is niche, and Marvell is nowhere near, as their plan was to make custom chips on order. And from the coverage above we see India, Korea, and the EU use custom design licensing for their HPC supercomputer designs.
Then there's a rumor that MS is also making their own chips, again custom tailored for their Azure, Google also rumored esp their Whitechapel mobile processor (it won't beat any processor on the market that's my guess) and maybe their GCP oriented own design.
These numbers projection do look good vs x86 SMT machines finally to me after all these years, BUT have to see how they will compete once they are out vs 2021 HW is the big question, since if these CPUs outperform the EPYC Milan technically AWS should replace all of them right ? since you have Perf / Power improvements by a massive scale. Idk, gotta see. Then the upcoming AMD Genoa and Sapphire Rapids competition will also show how the landscape will be.
If they don't replace all the x86 systems in AWS with ARM, that *must* mean Neoverse is somehow secretly inferior, right??
Or, you know, it could mean that x86 compatibility matters for a fair chunk of the EC2 installed base, especially on the Windows Server side (which is not small) but on Linux too (Oracle DB, for instance, which does not yet run on ARM.)
"The first Qualcomm® Snapdragon™ platforms to feature Qualcomm Technologies' new internally designed CPUs are expected to sample in the second half of 2022 and will be designed for high performance ultraportable laptops."
Uh, that means new machines won't be using them until at least the end of next year. And if we want more cores than an ultraportable, it's still no good.
I wouldn't put it past them to do a desktop or server sized SoC eventually if they have a great in-house core design that isn't a commoditized IP block that anyone can license from ARM. It would give them an advantage at the higher tiers of performance that they will want a piece of for sure.
They also seem to be devoted to providing an open ARM computing platform in working with Linux developers and Windows when compared with Apple. That they added a hypervisor to the 888 should give you some indication to their future compute ambitions...
> I wouldn't put it past them to do a desktop or server sized SoC
They already tried this, but their investors killed it. Look up "Centriq". Building out a whole server infrastructure & ecosystem takes a lot of investment, and now they'd have established competitors with a multi-year lead.
I wasn't talking about servers (at least not right away), more consumer oriented and workstation scale compute. Amon did say that the designs they had in mind with Nuvia were "scalable" and that they were going to be addressing multiple markets.
You need three things to create a higher performance core than Apple:
- designers (check)
- an implementation team (hmm. maybe? this means *enough* good people and superb simulation/design tools)
- management willing to pay the costs [design costs, and willing to accept a substantially larger core] (hmmmmmmmm? will they chicken out and assume no-one is willing to pay for such a core, the way they always have for watch, phone, then centriq?)
> if these CPUs outperform the EPYC Milan technically AWS should replace all of them right ?
No, because a lot of people are still stuck on x86. Also, Amazon could be fab-limited, like just about everyone else. The sun might be setting on x86, but it's still a long time until dark.
An Avantek Ampere workstation might be available in a stand-alone system. Andrei expects Ampere to include N2 in their next gen systems instead of V1. Apple might also launch something in that segment in the coming years.
A UK-based company called Avantek makes Ampere-based workstations. Their eMAG-based version was reviewed on this site, a couple years ago, and they now have one with Altra. So, I'd say better than average chances we might see one with a V1-based CPU by maybe the end of the year or so.
Looking at Cortex-X-next. It seems like Arm can put out a new Cortex-X for every new Cortex-A78 successor, since the Cortex-X is very similar but bigger.
> The Cortex-X1 was designed within the frame of a new program at Arm, which the company calls the “Cortex-X Custom Program”. The program is an evolution of what the company had previously already done with the “Built on Arm Cortex Technology” program released a few years ago. As a reminder, that license allowed customers to collaborate early in the design phase of a new microarchitecture, and request customizations to the configurations, such as a larger re-order buffer (ROB), differently tuned prefetchers, or interface customizations for better integrations into the SoC designs. Qualcomm was the predominant benefactor of this license,
Do any of the current x86 cores pair up SSE operations for >= 4x throughput per cycle?
AVX2 has been around for long enough that a lot of the code which could benefit from it has already been written to do so, yet *most* people are still compiling to baseline x86-64 (or just above that), since Intel is still making low-power cores without any AVX. So, I'm sure there's still *some* code that could benefit from >= 4x SSEn execution.
Quick addition. The term SLC is more popular lately, as it emphasizes that the cache is not only shared among the cores but also with the system (GPU, DMAs etc).
Thanks. I guess I should've just waited until I'd finished reading it, because the interconnect slide made it abundantly clear.
Now, I'm wondering about this "snoop filter" and why so much RAM is needed for it, when Graviton 2 & Altra have so little SLC. So, I gather it's not like tag RAM, then? Does it index the L2 of the adjacent cores, or something like that?
That slide was provided by ARM and I think they're trying to have at least the *appearance* of maintaining anonymity, even if the identities are abundantly clear.
Also, you realize that their Vendor A is your Vendor I, right?
How does the narrower front end and shallower pipeline of the N2 compare to Apple's M1? I'm thinking about how this could translate to the A78 successor, if that uses an evolution of the X1 core with improvements from N2 brought in.
What do you mean by "compare?" Apple is 8-wide Decode, Map, Rename. But that doesn't include the fact that Apple does a ton of clever work in those three stages:
- simple branches handled at Decode
- a variety of zero-cycle moves and immediate handling in Rename
- two-level scheduler, with the higher level able to accept an 8-wide feed from Rename, even though the lower-level scheduler is narrower [6 for int, 4 for FP or LS]
Apple is *astonishingly* wide at the completion end. 16-wide register freeing and history file release, up to 56(!!!)-wide release of ROB entries.
The Apple pattern so far (insofar as pattern-detection is worth anything) has been a 1st generation of four cores (A7/8/9/10) with similar design and 6-wide, constantly iterating on details within that framework; then a 2nd generation (A11/12/13/14) that makes explicit the big.LITTLE structure (in A10 that was mostly invisible) and based on an 8-wide (with 6 integer units) structure. If one has to bet, the reasonable thing to bet might be something like starting with A15 we transition to 10-wide (initially with 6, later with 8 integer units), and 2xSVE256. Once again lay the framework, then scale out the pieces over the subsequent three cores.
One thing that is very clear (and presumably part of Apple's success) is that they have been very willing to keep modifying how they do things; they don't just settle on a design and leave it unchanged except perhaps for some scaling up. For example the way they handle the MOV xn, xm instruction has gone through at least three very different schemes. This may seem trivial (who cares about how a single instruction is implemented?) except that these schemes indicate a substantial reworking of the entire register file and how registers are allocated and then freed. This is in comparison to x86 which seems to live in (probably justified) terror that any change they make, no matter how low level, will probably break something because the whole system is so complex and so interconnected that no one person holds the entire thing in their head.
They also seem to have a good system in place for hiding new functionality behind chicken bits, so that they can effectively debug new features within shipping hardware. For example there are reasons to believe that A14 might have in place most of the pieces required for physical register file amplification (avoid allocation for back-to-back register usage and grab the intermediary off the bypass bus; early release of logically overwritten registers) but these are not visible -- probably behind chicken bits so that they can be tested under all circumstances in shipping HW, and made visible for A15. And anyone who has not looked at the details is unaware of just how impressive the underlying Apple µArch platform is. There is substantial room there for ongoing growth! As I continue to explore it, not only do I see how well it works today, I also see multiple directions in which it could "easily" (ie feasibly, on schedule and within budget) be improved for years to come. The only other artifact I know of that comes close in terms of quality of implementation and ability for continuing growth is the Mathematica code base -- other artifacts like other CPUs, or various OS implementations, are in a totally different (and far inferior) league.
To expand on my point, it's great that ARM are including so many good ideas, but it's also astonishing the extent to which pretty much every good idea already has an Apple precedent.
For example consider the MPAM discussion: "The mechanism to which this can be achieved can also include microarchitectural features such as dispatch throttling where the core slows down the dispatched instructions, smoothing out high power requirements in workloads having high execution periods, particularly important now with the new wider 2x256b SVE pipelines for example." This sounds like (and IS) a good idea -- certainly a lot better than reducing frequency the way Intel does for AVX512.
But look at this Apple patent from 2011(!) https://patents.google.com/patent/US9009451B2 "A system and method for reducing power consumption through issue throttling of selected problematic instructions. A power throttle unit within a processor maintains instruction issue counts for associated instruction types. The instruction types may be a subset of supported instruction types executed by an execution core within the processor. The instruction types may be chosen based on high power consumption estimates for processing instructions of these types. The power throttle unit may determine a given instruction issue count exceeds a given threshold. In response, the power throttle unit may select given instruction types to limit a respective issue rate. The power throttle unit may choose an issue rate for each one of the selected given instruction types and limit an associated issue rate to a chosen issue rate. The selection of given instruction types and associated issue rate limits is programmable."
I just keep bumping into this stuff! Arm release new cores with what seem like good ideas (and of course ARM tell us a lot more about what's new than Apple does). I do some exploring -- and find Apple patented that idea five or more years earlier!
Zen 3 needn't blush when standing next to Apple. 4-wide decode might be small but that does pick up to 6, coming out of the micro-op dispatch. Then, going down, you've got 10-wide issue on the INT side, and 6-wide on FP. Admittedly, narrower register files and 8-wide retire from the (smaller) ROB, along with smaller caches. As for move elimination, even Skylake has that. Yes, everything tends to be narrower. But I think it goes to show there's nothing particularly out of this world on the Apple side.
I did not say that move elimination was the interesting part. I said that what was interesting is that over Apple's short CPU career they have already implemented it in three significantly different ways.
That strikes me as interesting and important -- there is no resting on the laurels, no acceptance that "we have the feature, OK to slow down". You honestly believe that Intel operates according to that same mentality?
I'll admit, Apple isn't resting; and they aren't scared to break orthodoxy in advancing their designs. If the others do not wake up, they'll be left in the dust. As for Intel, complacency has put them in the well-deserved pickle they're in today. AMD deserves credit, though, for doing much these past few years; and, like Apple, aren't resting on their laurels either (arguably like they did in the K8 era).
How big are Apple's cores, though? Area is tricky, because they tend to be on a newer process node.
But, my point is that maybe AMD and Intel aren't making their cores even larger and more complex, because they're targeting the server market and found that a more area-efficient way to scale performance is by adding more cores, rather than making their existing cores even more complex.
Eyeballing it, each large core is about 2.5% of that area, so 2.2mm^2. Throw in the L2 at about 3.5%, so 3mm^2 (shared between two cores). Throw in the SLC (not exactly an L3, but pretend it is if you insist) at 8.8% and about 8mm^2.
I guess if you were targeting a server type design, we could probably treat it as something like 2.5+(3/4)+ (8.8/8) [making rough guesses about what sort of L2 and L3 would be optimal for a server type design] so ~4.4mm^2. Could fit ~100 in a 440mm^2 (though you'd also want some memory controllers and IO!) Definitely a lot larger than something like an N1 or N2 -- but of course, Apple isn't designing for the data center -- if they were, they'd probably adopt something halfway between Fire Storm and Ice Storm.
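For anyone who wants to replay that back-of-envelope sum, here is a small Python sketch. It only uses the figures quoted above; the function and variable names are made up for illustration. Note that the sum as written plugs in the percentage figures (2.5 and 8.8) for the core and SLC, while the mm^2 estimates given just before it are 2.2 and 8, so the sketch shows both readings. The sharing factors of 4 and 8 are the rough guesses from the comment, not measured values.

```python
def per_core_area(core, l2, slc, l2_sharers=4, slc_sharers=8):
    """Area charged to one core: its own logic plus its share of L2 and SLC."""
    return core + l2 / l2_sharers + slc / slc_sharers


die_budget_mm2 = 440.0  # the die size used in the comment

# The sum exactly as written in the comment: 2.5 + (3/4) + (8.8/8).
as_written = per_core_area(2.5, 3.0, 8.8)

# The same sum using the mm^2 estimates quoted for the core and the SLC.
in_mm2 = per_core_area(2.2, 3.0, 8.0)

for label, area in (("as written", as_written), ("mm^2 figures", in_mm2)):
    cores = int(die_budget_mm2 // area)
    print(f"{label}: {area:.2f} per core, about {cores} cores in {die_budget_mm2:.0f} mm^2")
```

Either way the answer lands around 100 cores before memory controllers and IO are accounted for, so the rough conclusion does not depend on which set of figures is used.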
The problem is not that Intel and AMD are chasing the server market, it is that the way they are chasing it is incoherent. IF your primary goal is the server market, then WTF are you designing for super-high frequencies? The data center cores never run at those frequencies -- but being able to boast about them means your transistor density is half to a third that of Apple (or more relevant ARM/Altra/Graviton) on the same process... Pick a goal and optimize for it! But Intel's goal seems to be to optimize for marketing that they can hit 5.x GHz (for half a second...) Not clear that designing the entire company around that goal (of zero interest to the data center, and little interest to most users) is such a great long-term strategy.
> IF your primary goal is the server market, then WTF are you designing for super-high frequencies?
AMD actually has lagged Intel in frequency, and I think that's one of the reasons. Remember, AMD is the only one using the same exact silicon on both the mainstream desktop and in all their server products.
Intel, on the other hand, has completely separate silicon for their server dies, and we don't know all of the subtle ways they could differ from their desktop or laptop cores. We just know they tend to reuse the same basic core micro-architecture up and down their product lines (except for the really cheap/low-power stuff).
> The data center cores never run at those frequencies
A few Cascade Lake Xeons could turbo up to 4.5 GHz, which benefits certain workloads. The fastest turbo clock of an Ice Lake Xeon is 4.4 GHz
The fastest EPYC can boost up to 4.1 GHz.
In Intel & AMD's defense, a simpler core can clock higher, but runs more efficiently at lower clocks and enables higher densities. So, it seems like a pretty good strategy to me.
> Pick a goal and optimize for it!
Like ARM, Intel and AMD are trying to balance power (in a server/laptop application, at least) and area. Apple is the only one who really has the luxury not to care much about area and just optimize for a single target. When Apple reuses its laptop core micro-architecture in both desktops AND servers, then we can compare them to the other guys. Until then, I think it's a case of Apples and pears.
"Until then, I think it's a case of Apples and pears."
I think that's it. Well, soon process improvements will be a thing of the past, owing to quantum effects, and then we'll see who does what. The free ride is almost over.
(a) Sandy Bridge was the last such. (b) Look at the relative spacing (in time) for the two cases.
Look, I'm not interested in "x86 vs ARM. FIGHT!!!" I'm simply pointing out various patterns I've noted that strike me as interesting and significant. If other people have similar such patterns to point out -- interesting and non-obvious aspects of new x86 micro-architectures, or patterns in how those micro-architectures have evolved over the past few years, they should add a comment. But to this outsider the micro-architectures look stagnant -- utterly so in the case of Intel, mostly so in the case of AMD. In particular slight scaling up of an existing micro-architecture because a new process is more dense is not interesting! What is interesting is a new way of conceptualizing the problem that allows for a step change in the micro-architecture; and that is what I am not seeing on the x86 side. I do see it in IBM (though for purposes that are, to me, uninteresting, both for POWER and for z/). I do see it in ARM Ltd.
> What is interesting is a new way of conceptualizing the problem that allows for a step change in the micro-architecture
Yes, but I think that largely depends on the ISA. And there, ARM has indeed been rather stagnant. Besides SVE and their new security features, most of their ISA changes have been tweaking around the margins. Not a fundamental rethink, or anything close to it.
What we need is more willingness to rethink the SW/HW divide and look at what more software can do to make hardware more efficient. Whenever I say this, people immediately seem to think I mean doing a VLIW-like approach, but that's too extreme for most workloads. You just have to look at an energy breakdown of a modern CPU and think creatively about where compilers could make the hardware's job a little bit easier or simpler, for the same or better result.
You can also flip it around, and ask where the primitives CPUs provide don't quite match up with what software is trying to do. I think TSX/HLE stands as an interesting example of that, and probably one where Intel doesn't get enough credit (granted, partly due to their own missteps).
Architecture and micro-architecture are two different things. You want to fantasize about different architectures, be my guest. But I'm interested in MICRO-ARCHITECTURE and that was the content of my comments.
> Architecture and micro-architecture are two different things.
The principal manifestation of the HW/SW divide is the ISA. That's why I talk about it rather than "architecture", which is a word that can mean different things to different people and in different contexts.
> You want to fantasize about different architectures, be my guest.
It's about as on-topic here as ever, given that we've gotten our most detailed look at ARMv9, yet. And performance + efficiency numbers!
> But I'm interested in MICRO-ARCHITECTURE and that was the content of my comments.
There's only so much you can do, within the constraints of an ISA. ARM had a chance to think really big, but they chose to play it safe and be very incremental. That could turn out to be a very costly mistake, for them and some of their licensees.
I just want what I think we all want, which is another decade of progress in performance and efficiency like the last one. So far, I'm not very hopeful. I guess we need to really hit the wall, before people are ready to get serious about embracing options to push it back, a bit further.
95 Comments
yeeeeman - Tuesday, April 27, 2021 - link
what about a cortex a55 successor?
SarahKerrigan - Tuesday, April 27, 2021 - link
I'd expect to see one next month launching alongside Matterhorn.
eastcoast_pete - Tuesday, April 27, 2021 - link
Hi Sarah, can you post any links (including rumors) about that? Given ARM's focus on bigger, high performance-oriented designs, the LITTLE cores haven't gotten a lot of love in recent years. The persistence of the in-order designs for ARM LITTLE cores is one of the reasons why I find the dominance of ARM troubling; that clearly stood still because there is nowhere else to turn to for many, i.e. they didn't have to change it. In x86, at least we have two larger players having their own, yet compatible designs.
SarahKerrigan - Tuesday, April 27, 2021 - link
I've seen it reported in a few places, including on RWT which is a pain to search - but since task migration generally requires compatible instruction sets between big and little cores, it's pretty clear that Matterhorn will bring a small, low-power friend when it arrives.
Raqia - Tuesday, April 27, 2021 - link
I wonder if they could simply repurpose either a refresh of the A73 or A75 as the little core. Surely with the new fabrication processes available, die area relative to a big Matterhorn core should be comparable to A55 vs A78/X1, but the question becomes performance / energy. Integer performance of A75/73 vs. Ice Storm is comparable with the former winning by a bit in FP, but efficiency is light years apart:
https://images.anandtech.com/doci/13614/SPEC-perf-...
https://images.anandtech.com/doci/16192/spec2006_A...
SarahKerrigan - Tuesday, April 27, 2021 - link
I think use of a refreshed A65 without multithreading and with the new ops seems more plausible to me.
Raqia - Tuesday, April 27, 2021 - link
That could make sense; there's fairly little information on the micro-architecture of the A65 or A65AE at present except that it does do OoOE, and it's unclear what clocks and efficiency it can achieve as well:
https://developer.arm.com/documentation/100439/010...
It does sport a bigger maximum L2 configuration than the A55. They do need to up their game here as the A55 makes a pretty poor showing for efficiency compared to Apple's small core (which got even worse in the A14 generation):
https://images.anandtech.com/doci/14072/SPEC2006ef...
At least wattage and hence current draw is low.
SarahKerrigan - Tuesday, April 27, 2021 - link
A65 is E1, which has had a uarch dive on this site.
Raqia - Wednesday, April 28, 2021 - link
Got it, thanks for that! The A65 is interesting, without SMT they are quoting a pretty modest bump in integer performance < 20% at a bit more than half the power of A55 at 7nm:
https://images.anandtech.com/doci/13959/07_Infra%2...
https://images.anandtech.com/doci/13959/07_Infra%2...
They could probably tune this to be better without SMT, but are you against having SMT for security reasons?
It's still not close to Apple's small cores in performance, but efficiency might be in the same ballpark now. ARM designs are quite good in terms of PPA but even their performance oriented X1 is likely only 70% the die area as a Firestorm core, and their cache hierarchies are more complex as core designs pull double duty for servers parts too.
It probably made sense to have fewer transistors per CPU core as quite a few Android SoC vendors integrated modems on die, but this may change once Qualcomm digests its Nuvia purchase and moves to a smaller node. All parties may hit a wall for per core improvements as slowing SRAM density improvements at new nodes bottleneck what gains are gotten from logic density improvements.
Kangal - Thursday, April 29, 2021 - link
TL;DR - ARM needs to focus on a new product stack. It needs to have a diverse ARMv9 lineup of small, medium, large chipset options. With the small chipset being very scalable down to Tiny IoT Sensor level. Whereas the large chipset being scalable up to large supercomputers and servers. Whilst the medium chipset focusing on phones and tablets. As this covers full SoC, it includes both CPUs and GPUs.
Long version:
I know making these architectures is a huge challenge, but ARM has been a little lazy in some scenarios. I know they're basically following the money in the industry, and that means chasing the "phablet" market for CPUs and GPUs. But they've been leaving themselves vulnerable to gaps, in either smaller power or larger power systems, that can be exploited by competitors, such as RISC-V. If not, even x86 might poke some wins here and there.
Ages ago, like 2013, they had the A7 (tiny), A15 (small), and A57 (medium) core designs. Basically covering most bases. Along with the Mali-400 iGPU, and 1GB-2GB Shared-RAM, to do some compute tasks. To say ARM was innovative would be a disservice to the technology they brought forward. That's in contrast to x86 Intel's Atom (small) and Intel's Core-i7Y/M (large), as well as Intel Iris Pro iGPU with 8GB Shared-RAM in systems of the time. Then ARM made the leap into 64bit processing around 2016. The lineup evolved into the A35 (tiny), A53 (small), A73 (medium) core designs, running with 1GB-2GB-4GB sRAM, and used modest G31 (tiny) to G51 (small) to G71 (medium) iGPU options. Again, this lineup was very innovative and impressive. Contrast that to the new x86 competition in AMD's 16nm Vega Large-iGPU, and Zen1 Large-CPU.
However... There hasn't been any upgrades for the "tiny" portfolio, being stuck to the offerings of Cortex A35 CPU and G31 GPU ever since. There has been only a slight refresh to the "small" portfolio, upgrading to the Cortex A55 CPU, and the G52 and G57 iGPUs. To the point that they're a joke, and easily surpassable by the competitors. ARM really needs a revolutionary new design here, it needs to be super-efficient. Perhaps something that can scale between both tiny and small categories: with performance ranging from the A55 (or more) at the "tiny" power-level, to the A73 (or more) at the "small" power-level. Basically catching up to Apple, if unable to surpass them.
Whereas the "medium" portfolio has seen very frequent upgrades, in the CPU-side to the Cortex A75, A76, A77, and A78. In the GPU-side we've seen G72, G76, G77, G78 which have been mostly competitive, surpassing some custom implementations (Samsung/MediaTek) and losing to others (Apple/Qualcomm). Not much needs to change here to be honest. We've also seen the emergence of a new "large" category of ARM processors. Firstly popularised by custom implementations from Apple (A10 and onwards), then Samsung (Mongoose M3, and onwards). Now it's supported officially by ARM in the form of the Cortex A77+ and the Cortex A78X / X1. This has been mostly underwhelming and uncompetitive, with Apple being the only one implementing good designs. There hasn't been any new "large" category for iGPUs from ARM or competitors, with the only Large-iGPU exception actually being inside the Apple Silicon M1. ARM (without counting Apple) needs to do better here, and it looks like ARM might already be focussing here in the future with ARMv9. Again contrast this to the x86 markets offering 7nm Large-CPUs of Zen2 and Zen3, with RDNA-1 and RDNA-2 Large-GPUs.
mode_13h - Thursday, April 29, 2021 - link
Uh...
> 2013, they had the A7 (tiny), A15 (small), and A57
> Then ARM made the leap into 64bit processing around 2016.
A57 is a 64-bit core.
> Contrast that to the new x86 competition in AMD
No. Why would we do that? They were competing in totally different markets, at the time. The only partial overlap was embedded Ryzen.
> There hasn't been any upgrades for the "tiny" portfolio, being stuck to ... Cortex A35 CPU
> There has been only a slight refresh to the "small" portfolio, upgrading to the Cortex A55 CPU
The A35 and A55 both launched in 2017.
> they're a joke, and easily surpassable by the competitors.
In terms of what? PPA? Perf/W? Perf/$? Might want to be sure you're comparing apples to apples and not comparing competing "small" core with ARM "tiny".
> There hasn't been any new "large" category for iGPUs from ARM or competitors
Samsung is using RDNA and MediaTek is licensing a Nvidia GPU for its upcoming SoCs.
Might want to do a little more research, before writing another longpost. I agree that A55 could use a refresh, but ARMv9 will force that, anyway. I don't even know where A35 is used, but same story, there.
It's worth noting that ARM has also been active in the microcontroller market, with both 32-bit and 64-bit offerings.
Kangal - Friday, April 30, 2021 - link
Firstly, apologies.
I know the A57 is 64bit, but there have been many (most?) implementations of it running in 32bit mode. The A57 was really a "rough draft" for ARM, in moving towards both "medium" sized cores and into 64bit computing. Hence, it feels more at home next to its A7 and A15 brethren.
The contrast is there, and necessary to show the landscape of the time. The tech industry is a fast-paced one. And if your code/calculations are platform-agnostic, so that they can run on any platform, you would consider all options (not that I recommend people go creating agnostic code, compared to specialized or hardware-accelerated code).
The Cortex A35 launched in 2015. It's long due for an upgrade, or replacement. Where this core likes to be in is in small, low-power, and cheap devices. In particular the microcontroller market as you mentioned. ARM hasn't been as active in this field as you think they have, with many of the products being custom designs from the ODMs.
I already mentioned the A55 was a slight refresh for the A53, and that itself is also surpassed. Have a look at Apple's "small" cores. They are Out-of-Order processors, they are slightly faster than an A73, they use slightly less power than an A53. It's mind boggling. Others disagree, and say they're actually faster than A75, and more efficient than A55... but at this scale we're splitting hairs. With that much room for difference, it's not inconceivable (heck it's likely) that an outside competitor like RISC-V will surpass the A55 in terms of Perf/W, Perf/PPA, Perf/$, or a combination of the lot. And remember, the Cortex-A53 is the most popular core out there, where it's getting stamped out on so many different Chinese products.
Samsung isn't using Radeon iGPUs YET, and neither is MediaTek. Besides, we have yet to see them in the wild and find out details of their architecture. These might be licensed from AMD or Nvidia, but they might be "small" iGPUs instead of "large" iGPU designs. I did forget to mention that the Tegra X1, and some Nvidia SBC did actually use their "large" iGPU architecture (ie Maxwell etc).
The gist of my rant is that ARM was a revolutionist early on, basically creating the market. Then they were extremely innovative and competitive, basically dominating the market. Now they are competitive but not as revolutionary nor as competitive/innovative as they used to. With ARMv9 they have a chance to start fresh, and return to status quo, by having a trifecta of products for the computing industry. I was pointing out the gaps in their history and portfolio. They shouldn't just focus on mobile phones, that's boring.
mode_13h - Friday, April 30, 2021 - link
> The Cortex A35 launched in 2015.
Okay, the date I saw was wrong. It seems to have been announced in November 2015. The A55 seems to have been announced in May 2017.
> this core likes to be in is in small, low-power, and cheap devices.
> In particular the microcontroller market as you mentioned.
They have actual microcontrollers, though. The A35 is still too power-hungry (and expensive?) for most IoT devices.
> Have a look at Apple's "small" cores.
You focus on performance and efficiency, but what about area? Apple has a narrower focus and different process, cost, & area targets than ARM.
The point we can definitely agree on is that ARM's bottom & middle tier cores should've been refreshed more frequently. But, everyone seems to think that ARM is directly competing with Apple, but it's not. Their objectives meaningfully differ, resulting in ARM probably being driven more towards making smaller cores than Apple.
It's only at the top end of their mobile stacks that you can really say ARM and Apple are in direct competition. However, even on something like the A78, ARM is still put in a position of having to make compromises that Apple isn't.
> ARM was a revolutionist early on, basically creating the market.
> Now they are competitive but not as revolutionary nor as competitive/innovative as they used to.
That's how these things work. A small upstart has a lot of freedom. The bigger a company gets, the more constrained it becomes by its customers, its market, the cost of changing, and the downside risk. I'm still just not totally convinced that entirely explains what we're seeing.
If they can manage to cleave their server cores entirely from their mobile cores, and then really make big cores that are performance-first (instead of scaled up versions of mostly-performance cores, like the X1 and A78 situation), then we might see them start to compete at Apple's level. Basically, to compete they'd have to start by designing the X1 first, and then make the A78 by putting it on a diet.
> They shouldn't just focus on mobile phones, that's boring.
LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.
grant3 - Saturday, May 1, 2021 - link
> LOL, it's also where most of their revenue still lies. If you were CEO, you wouldn't last a day.
Focusing on the same-ol' same-ol' business is exactly how once-profitable companies fade into irrelevance as technology moves on. Plenty of mediocre CEOs do that.
A great CEO can find the future revenue opportunities and prove it to the company's owners.
mode_13h - Sunday, May 2, 2021 - link
Yeah, but you can't afford to walk away from your bread and butter. Any new growth areas you pursue can't come at the expense of revenues in your core business. If you even threatened to starve your core business, you'd be out of a job before your new ambitions could ever get off the ground.
Just look at what happened with Qualcomm, they tried to invest in new areas, but their investors absolutely wouldn't tolerate it. Granted, they're more exposed than ARM would be, either under Soft Bank or Nvidia.
Kangal - Sunday, May 2, 2021 - link
No, grant3 is exactly right.
What you said is EXACTLY what Blockbuster said before they went bankrupt. In case you didn't know, the board members passed the opportunity to buy Netflix for $50 Million. The CEO then tried to right that wrong by acquiring another competitor, and shifting their revenue stream. The board fired their CEO, saying that their late-fee revenue was the bread and butter of their business model. Blockbuster was too narrow focused and stuck in the past, that not only did they miss the opportunity of becoming a whole new behemoth, but they sunk their own ship at the same time.
mode_13h - Sunday, May 2, 2021 - link
> What you said is EXACTLY what Blockbuster said before they went bankrupt.
If grant3 is saying that Blockbuster should close half its stores while they're still profitable, to divert money into R&D on getting into the (then) almost non-existent streaming market, no company in the world would do that.
Now, it's not like ARM is ignoring other markets, of course. They just can't turn their back on the mobile market, in order to do so.
> Blockbuster was too narrow focused and stuck in the past
The genius of capitalism is that the failure of Blockbuster to transition into a streaming platform didn't keep streaming from happening. Its investors could even get in on the game by shifting their investments into players in the streaming market. If the CEO was such a believer, he could've quit and gone to work for a streaming company or founded his own.
Also, let's not forget that there have already been losers in streaming, and it wasn't clear Netflix would've successfully made the transition from movies-by-mail. Who remembers Google Video? Yahoo even bought some company in the space. And just last year, there was Quibi. I'm sure there are others I'm forgetting.
I think we all want to see ARM succeed outside of mobile. They've been investing a lot, in order to do so. Some in this very thread have been complaining at their lack of focus on their smaller, lower-power cores (currently A35 & A55), which you could see as evidence they've already been making sacrifices to try and compete outside their niche. I don't know if that's accurate, but it's plausible.
If Nvidia's acquisition goes through (as I expect it will), I hope and expect it will provide ARM with the funds to do even more ambitious things.
Spunjji - Friday, April 30, 2021 - link
That's a sound argument for that expectation - it's definitely long since past time for an update.
dotjaz - Tuesday, April 27, 2021 - link
Why would you need rumours when we know for a FACT that there will be an A55 successor unless b.L design is abandoned for no good reason. I'll give you a hint, b.L can't have mixed architectures that's why big cores stayed at ARMv8.2a for so long.
eastcoast_pete - Tuesday, April 27, 2021 - link
Maybe the shift to ARMv9 will force ARM's hand with giving the LITTLE cores out-of-order designs; however, current bigLITTLE designs already mix big, out-of-order designs with LITTLE in-order cores like the A55. So, bL can and has worked with mixed architectures for quite a while. However, I hope you are correct in that the shift to ARMv9 will force the issue, and we'll finally get out-of-order LITTLE cores also on non-Apple devices
michael2k - Tuesday, April 27, 2021 - link
Maybe dotjaz meant you couldn't mix 8.5 and 8.2 architectures?
In any case, DynamIQ, not big.LITTLE, is more relevant now. Also, if people really want to push for an out of order big.LITTLE, why not use the A78 for the big core and the older A76 as the little core? Both A76 and A78 can be fabricated at 5nm, and the A76 would use less power by dint of being able to do less work per clock, which is fine for the kind of work a little core would do anyway.
Does DynamIQ allow for a mix of A76 and A78?
smalM - Thursday, April 29, 2021 - link
Yes.
But the maximum is 4 A7x Cores. Only A78C can scale to 8 Cores in one DynamIQ cluster.
dotjaz - Thursday, April 29, 2021 - link
No, big.LITTLE is the correct term. DynamIQ is an umbrella term. The part related to mixing uarch is still b.L, nothing has changed.
https://community.arm.com/developer/ip-products/pr...
dotjaz - Thursday, April 29, 2021 - link
And yes, I mean what I wrote, architectures or ISA, not uarch.
dotjaz - Thursday, April 29, 2021 - link
Name one example where ARCHITECTURES were mixed. Microarchitectures are of course mixed, otherwise it won't be b.L
Zingam - Wednesday, April 28, 2021 - link
Do you remember the forum experts taunting that Intel is so much better and arm so weak, it will never be competitive?
Matthias B V - Tuesday, April 27, 2021 - link
Thanks for asking. Can't believe that for years the small A55 didn't get any update or successor.
For me it would be even more important to update those as lots of tasks run on those rather than high performance cores. But I guess it is just better for marketing to talk about big gains in theoretical performance.
At least I expect an update now. Just hope it won't be the only one...
SarahKerrigan - Tuesday, April 27, 2021 - link
The lack of deep uarch details on the N2 is disappointing, but I guess we'll probably see what Matterhorn looks like in a few weeks so not a huge deal.
eastcoast_pete - Tuesday, April 27, 2021 - link
I am waiting for the first in-silicone V1 design that Andrei and others can put through its paces. N2 is quite a while away, but yes, maybe we'll see a Matterhorn design in a mobile chip in the next 12 months. As for V1, I am curious to learn what, if anything, Microsoft has cooked up. They've been quite busy trying to keep up with AWS and its Gravitons.
mode_13h - Tuesday, April 27, 2021 - link
> in-silicone
Just picturing a jiggly, squidgy CPU core... had to LOL at that!
Oxford Guy - Tuesday, April 27, 2021 - link
‘Fast-forward to 2021, the Neoverse N1 design today employed in designs such as the Ampere Altra is still competitive, or beating the newest generation AMD or Intel designs – a situation that which a few years ago seemed anything but farfetched.’
Hmm... That last bit is odd. Either it’s just ‘farfetched’ or it’s ‘expected’.
eastcoast_pete - Tuesday, April 27, 2021 - link
Yes, those slides look very promising; now eagerly awaiting an eventual test of one or two of these in a actual silicone. I guess then we'll see how they measure up.
mode_13h - Tuesday, April 27, 2021 - link
Silicone - From Wikipedia, the free encyclopedia
Not to be confused with the chemical element silicon.
A silicone or polysiloxane is a polymer made up of siloxane (−R2Si−O−SiR2−, where R = organic group). They are typically colorless, oils or rubber-like substances. Silicones are used in sealants, adhesives, lubricants, medicine, cooking utensils, and thermal and electrical insulation.
eastcoast_pete - Thursday, April 29, 2021 - link
I'll have to take this up with auto-correct. It keeps changing silicon to silicone. Now that I forced it again to leave silicon alone (for the umpteenth time), maybe it will stop (:
Mondozai - Tuesday, April 27, 2021 - link
Fantastic overview by Andrew. AT's most underrated reporter. Hopefully he gets more responsibility to cover more things in the future.
Linustechtips12#6900xt - Tuesday, April 27, 2021 - link
AGREED
dotjaz - Tuesday, April 27, 2021 - link
Good, finally confirmed N2 is in fact ARMv9 as suspected. Now we'll just have to wait and see how the new mobile counterparts are. Hopefully we'll see some real improvements.
It'll be interesting to see how small the new low power v9 core is given that it has to have a 128b SVE2 pipeline instead of 2x64b NEON.
mode_13h - Wednesday, April 28, 2021 - link
> finally confirmed N2 is in fact ARMv9 as suspected.
> Now we'll just have to wait and see how the new mobile counterparts are.
> Hopefully we'll see some real improvements.
The data presented on N2 doesn't give me much hope that v9 changed much, besides the feature baseline. I was hoping for something slightly revolutionary, but it's certainly not that.
dotjaz - Thursday, April 29, 2021 - link
> hoping for something slightly revolutionary
We've known for a couple of years ARMv9 is just ARMv8.x rebased. Your hopes weren't realistic to begin with. Besides, what "revolutionary" features would you expect ISAs to include? Can you name one? ARMv8.5a+SVE2 already has everything you need to design an excellent and efficient uarch. Why re-invent the wheel just for the sake of it?
mode_13h - Thursday, April 29, 2021 - link
> We've known for a couple of years ARMv9 is just ARMv8.x rebased.
You knew this according to where? It's one thing to assume that, and clearly it wasn't an unreasonable assumption, but it's another thing to *know* it. So, how did you *know* it?
> Besides, what "revolutionary" features would you expect ISAs to include? Can you name one?
It's a fair question. Generally speaking, anything that would help improve efficiency. Maybe things like scheduling hints or maybe some kind of tags to indicate memory writes that are thread-private and terminal reads. Just some examples, off the top of my head.
> ARMv8.5a+SVE2 already has everything you need to design an excellent and efficient uarch.
The issue I see is that IPC and efficiency gains are going to become ever more hard-won, so there needs to be some more creativity in redefining the SW/HW interface to unlock further gains. ARMv9 is going to be with us for probably another decade and it could end up having to compete with yet-to-be-identified alternatives like maybe RISC VI or something completely out of left-field. So, I see it as a wasted opportunity. A pragmatic decision, for sure, but a little disappointing.
Dug - Tuesday, April 27, 2021 - link
Now is when I wish ARM was publicly traded.
mode_13h - Tuesday, April 27, 2021 - link
Well, you could buy NVDA, under the assumption the acquisition will go through.
dotjaz - Thursday, April 29, 2021 - link
SoftBank is already publicly traded on the Tokyo Stock Exchange. Why rely on the NVIDIA buyout, which in all likelihood won't happen any time soon, if at all?
mode_13h - Thursday, April 29, 2021 - link
> SoftBank is already publicly traded on the Tokyo Stock Exchange.
They also invested heavily in WeWork, when it was highly over-valued. I have no idea what other nutty positions they might've taken, but I think it's not a great proxy for ARM just due to its sheer size.
cjcoats - Tuesday, April 27, 2021 - link
As an environmental modeling (HPCC) developer: what is the chance of putting a V1 machine on my desk in the foreseeable future?
Silver5urfer - Tuesday, April 27, 2021 - link
Never. Since there has to be an OEM for these chips to put in DIY and Consumer machines, so far except the HPE's A64FX ARM there's no way any consumer can buy these ARM processors, and that one is also highly expensive, at over a 5 digit figure. And then the drivers / sw ecosystem comes into play; there are passion projects like the Pi, as we all know, but they are nowhere near Desktop class performance.
ARM Graviton 2 was made because AWS wants to save money on their Infrastructure; that's why their Annapurna design team is working there. Simply because of that reason Amazon put more effort into it, AND the fact that ARM is custom helps them to tailor it to their workloads and spread their cost.
Altra is niche, and Marvell is nowhere near, as their plan was to make custom chips on order. And from the coverage above we see India, Korea, and the EU use custom design licensing for their HPC Supercomputer designs.
Then there's a rumor that MS is also making their own chips, again custom tailored for their Azure, Google also rumored esp their Whitechapel mobile processor (it won't beat any processor on the market that's my guess) and maybe their GCP oriented own design.
These numbers projection do look good vs x86 SMT machines finally to me after all these years, BUT have to see how they will compete once they are out vs 2021 HW is the big question, since if these CPUs outperform the EPYC Milan technically AWS should replace all of them right ? since you have Perf / Power improvements by a massive scale. Idk, gotta see. Then the upcoming AMD Genoa and Sapphire Rapids competition will also show how the landscape will be.
SarahKerrigan - Tuesday, April 27, 2021 - link
If they don't replace all the x86 systems in AWS with ARM, that *must* mean Neoverse is somehow secretly inferior, right??
Or, you know, it could mean that x86 compatibility matters for a fair chunk of the EC2 installed base, especially on the Windows Server side (which is not small) but on Linux too (Oracle DB, for instance, which does not yet run on ARM.)
Silver5urfer - Tuesday, April 27, 2021 - link
That was a joke.
Spunjji - Friday, April 30, 2021 - link
Was it, though? Schrodinger's Joke strikes again.
Raqia - Tuesday, April 27, 2021 - link
Maybe not a V1, but you could probably get a more open high performance ARM core than the Apple MX series pretty soon:
https://investor.qualcomm.com/news-events/press-re...
"The first Qualcomm® Snapdragon™ platforms to feature Qualcomm Technologies' new internally designed CPUs are expected to sample in the second half of 2022 and will be designed for high performance ultraportable laptops."
mode_13h - Tuesday, April 27, 2021 - link
> sample in the second half of 2022
Uh, that means new machines won't be using them until at least the end of next year. And if we want more cores than an ultraportable, it's still no good.
Raqia - Wednesday, April 28, 2021 - link
I wouldn't put it past them to do a desktop or server sized SoC eventually if they have a great in house core design that isn't a commoditized IP block that anyone can license from ARM. It would give them an advantage at the higher tiers of performance that they will want a piece of for sure.
They also seem to be devoted to providing an open ARM computing platform, working with Linux developers and Windows, when compared with Apple. That they added a hypervisor to the 888 should give you some indication of their future compute ambitions...
mode_13h - Wednesday, April 28, 2021 - link
> I wouldn't put it past them to do a desktop or server sized SoC
They already tried this, but their investors killed it. Look up "Centriq". Building out a whole server infrastructure & ecosystem takes a lot of investment, and now they'd have established competitors with a multi-year lead.
Raqia - Wednesday, April 28, 2021 - link
I wasn't talking about servers (at least not right away), more consumer oriented and workstation scale compute. Amon did say that the designs they had in mind with Nuvia were "scalable" and that they were going to be addressing multiple markets.
mode_13h - Wednesday, April 28, 2021 - link
I hope you're right. If anyone can compete with Apple right now, it's probably Nuvia/Qualcomm.
name99 - Thursday, April 29, 2021 - link
You need three things to create a higher performance core than Apple:
- designers (check)
- an implementation team (hmm. maybe? this means *enough* good people and superb simulation/design tools)
- management willing to pay the costs [design costs, and willing to accept a substantially larger core] (hmmmmmmmm? will they chicken out and assume no-one is willing to pay for such a core, the way they always have for watch, phone, then centriq?)
And Apple won't stand still...
mode_13h - Tuesday, April 27, 2021 - link
> so far except the HPE's A64FX
Gigabyte makes Altra motherboards and servers that I'm sure you can buy for less than a HPE A64FX-based machine.
And, if you're counting A64FX as a "consumer machine", you ought to include Avantek's Altra-based workstations that I mentioned below.
mode_13h - Tuesday, April 27, 2021 - link
> if these CPUs outperform the EPYC Milan technically AWS should replace all of them right ?
No, because a lot of people are still stuck on x86. Also, Amazon could be fab-limited, like just about everyone else. The sun might be setting on x86, but it's still a long time until dark.
Rudde - Tuesday, April 27, 2021 - link
An Avantek Ampere workstation might be available in a stand-alone system. Andrei expects Ampere to include N2 in their next gen systems instead of V1. Apple might also launch something in that segment in the coming years.
mode_13h - Tuesday, April 27, 2021 - link
A UK-based company called Avantek makes Ampere-based workstations. Their eMAG-based version was reviewed on this site, a couple years ago, and they now have one with Altra. So, I'd say better than average chances we might see one with a V1-based CPU by maybe the end of the year or so.
nandnandnand - Tuesday, April 27, 2021 - link
Looking at Cortex-X-next. It seems like Arm can put out a new Cortex-X for every new Cortex-A78 successor, since the Cortex-X is very similar but bigger.
mode_13h - Tuesday, April 27, 2021 - link
From an earlier article:
> The Cortex-X1 was designed within the frame of a new program at Arm,
> which the company calls the “Cortex-X Custom Program”.
> The program is an evolution of what the company had previously
> already done with the “Built on Arm Cortex Technology” program
> released a few years ago. As a reminder, that license allowed
> customers to collaborate early in the design phase of a new
> microarchitecture, and request customizations to the configurations,
> such as a larger re-order buffer (ROB), differently tuned prefetchers,
> or interface customizations for better integrations into the SoC designs.
> Qualcomm was the predominant benefactor of this license,
Alistair - Tuesday, April 27, 2021 - link
I just want to be able to use ARM in standard DIY with an Asus motherboard and a socket, just like AMD and Intel.
mode_13h - Tuesday, April 27, 2021 - link
I wonder if Nvidia will put out a Jetson-style board in something like a mini-ITX form factor.
Alistair - Wednesday, April 28, 2021 - link
I sure hope so, and something not massively overpriced like right now.
mode_13h - Thursday, April 29, 2021 - link
Yeah, because Nvidia is known for their bargain pricing!
; )
Although, if they wanted to create a whole new product segment, it's conceivable they might keep prices rather affordable for a couple generations.
nandnandnand - Wednesday, April 28, 2021 - link
I want it. You want it. Some people seem to want it. Maybe demand is forming? Get on it, China.
16-core Cortex-X2 please.
mode_13h - Wednesday, April 28, 2021 - link
They already did, sort of. See: https://e.huawei.com/us/products/servers/kunpeng/k...
Whoops! Had to get this out of Google cache, because the page 404'd:
Board Model: D920S10
Processors: 1 Kunpeng 920 processor, 4/8 cores, 2.6 GHz
Internal Storage: 6 SATA 3.0 hard drive interfaces, 2 M.2 SSD slots
Memory: 4 DDR4-2666 UDIMM slots, up to 64 GB
PCIe Expansion: 1 PCIe 3.0 x16, 1 PCIe 3.0 x4, and 1 PCIe 3.0 x1 slots
LOM Network Ports: 2 LOM NIC, supporting GE network ports or optical ports
USB: 4 USB 3.0 and 4 USB 2.0
mode_13h - Tuesday, April 27, 2021 - link
Do any of the current x86 cores pair up SSE operations for >= 4x throughput per cycle?
AVX2 has been around for long enough that a lot of the code which could benefit from it has already been written to do so, yet *most* people are still compiling to baseline x86-64 (or just above that), since Intel is still making low-power cores without any AVX. So, I'm sure there's still *some* code that could benefit from >= 4x SSEn execution.
AntonErtl - Wednesday, April 28, 2021 - link
Zen has 4 128-bit FP units (2 FMA and 2 FADD). Not sure if that's what you are interested in.
mode_13h - Wednesday, April 28, 2021 - link
Ah, yes! wikichip says of Zen 1:
> Accordingly the peak throughput is four SSE/AVX-128 instructions
> or two AVX-256 instructions per cycle.
And Zen 2:
> This improvement doubles the peak throughput of AVX-256 instructions to four per cycle
Wow!
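To put the wikichip figures quoted above side by side, here is a minimal back-of-the-envelope sketch in Python. It only counts 32-bit lanes issued per cycle. The function name is made up for illustration, and it assumes every issued instruction is a full-width packed FP32 op, which ignores the FMA/FADD pipe mix that determines real throughput.

```python
def fp32_lanes_per_cycle(vector_bits: int, instrs_per_cycle: int) -> int:
    """Packed 32-bit lanes processed per cycle at the quoted peak issue rate."""
    return (vector_bits // 32) * instrs_per_cycle


# Peak issue rates as quoted from wikichip in the comment above.
zen1_sse = fp32_lanes_per_cycle(128, 4)      # four SSE/AVX-128 per cycle
zen1_avx256 = fp32_lanes_per_cycle(256, 2)   # two AVX-256 per cycle
zen2_avx256 = fp32_lanes_per_cycle(256, 4)   # four AVX-256 per cycle

print(zen1_sse, zen1_avx256, zen2_avx256)    # prints: 16 16 32
```

By this count Zen 1 gets the same lane throughput from 128-bit code as from AVX-256, which is the situation the original question was asking about.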
mode_13h - Tuesday, April 27, 2021 - link
What's SLC? I figured it was Second-Level Cache, until I saw the slide referencing "SLC -> L2 traffic".
"System Level Cache", maybe? Could it be the term they use instead of L3 or LLC?
Thala - Tuesday, April 27, 2021 - link
I think you are totally right - SLC == LLC.
Thala - Tuesday, April 27, 2021 - link
Quick addition. The term SLC is more popular lately, as it emphasizes that the cache is not only shared among the cores but also with the system (GPU, DMAs etc).
mode_13h - Wednesday, April 28, 2021 - link
Thanks. I guess I should've just waited until I'd finished reading it, because the interconnect slide made it abundantly clear.
Now, I'm wondering about this "snoop filter" and why so much RAM is needed for it, when Graviton 2 & Altra have so little SLC. So, I gather it's not like tag RAM, then? Does it index the L2 of the adjacent cores, or something like that?
mode_13h - Tuesday, April 27, 2021 - link
Question and corrections on Page 6: PPA & ISO Performance Projections
What do the colors on the chip plots mean?
> Only losing out 10% IPC versus the N1
I'm sure that's meant to say "V1".
> In terms of absolute IPC improvements
Huh? These are definitely "relative IPC improvements" or just "IPC improvements".
Calin - Wednesday, April 28, 2021 - link
AWS share by vendor type: It should have been "Vendor A" and "Vendor I"
mode_13h - Wednesday, April 28, 2021 - link
That slide was provided by ARM and I think they're trying to have at least the *appearance* of maintaining anonymity, even if the identities are abundantly clear.
Also, you realize that their Vendor A is your Vendor I, right?
serendip - Wednesday, April 28, 2021 - link
How does the narrower front end and shallower pipeline of the N2 compare to Apple's M1? I'm thinking about how this could translate to the A78 successor, if that uses an evolution of the X1 core with improvements from N2 brought in.
mode_13h - Thursday, April 29, 2021 - link
Good point. It suggests the A78+1 will perform < N2.
Although, a derivative X-core would likely be > N2.
name99 - Thursday, April 29, 2021 - link
What do you mean by "compare?"
Apple is 8-wide Decode, Map, Rename. But that doesn't include the fact that Apple does a ton of clever work in those three stages (
- simple branches handled at Decode,
- a variety of zero-cycle moves and immediate handling in Rename
- two-level scheduler, with the higher level able to accept an 8-wide feed from Rename, even though the lower-level scheduler is narrower [6 for int, 4 for FP or LS] )
Apple is *astonishingly* wide at the completion end. 16-wide register freeing and history file release, up to 56(!!!)-wide release of ROB entries.
The Apple pattern so far (insofar as pattern-detection is worth anything) has been a 1st generation of four cores (A7/8/9/10) with similar design and 6-wide, constantly iterating on details within that framework;
then a 2nd generation (A11/12/13/14) that makes explicit the big.LITTLE structure (in A10 that was mostly invisible) and based on an 8-wide (with 6 integer units) structure.
If one has to bet, the reasonable thing to bet might be something like starting with A15 we transition to 10-wide (initially with 6, later with 8 integer units), and 2xSVE256. Once again lay the framework, then scale out the pieces over the subsequent three cores.
One thing that is very clear (and presumably part of Apple's success) is that they have been very willing to keep modifying how they do things; they don't just settle on a design and leave it unchanged except perhaps for some scaling up. For example the way they handle the MOV xn, xm instruction has gone through at least three very different schemes. This may seem trivial (who cares about how a single instruction is implemented?) except that these schemes indicate a substantial reworking of the entire register file and how registers are allocated and then freed.
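For readers wondering why a MOV scheme touches register allocation and freeing at all, here is a minimal Python sketch of one generic textbook approach: zero-cycle move elimination via reference-counted physical registers. This is not a description of any of Apple's schemes, whose details aren't given in this thread. The class and method names are hypothetical, and it simplifies by releasing the old mapping at rename time, where real hardware would wait for retirement.

```python
class RenameTable:
    """Toy rename map with reference-counted physical registers (illustrative only)."""

    def __init__(self, num_arch: int, num_phys: int):
        assert num_phys >= num_arch
        self.map = list(range(num_arch))                  # arch reg i -> phys reg i
        self.refs = [1] * num_arch + [0] * (num_phys - num_arch)
        self.free = list(range(num_arch, num_phys))       # unallocated phys regs

    def _release(self, phys: int) -> None:
        # Drop one reference; the register is reusable once nothing maps to it.
        self.refs[phys] -= 1
        if self.refs[phys] == 0:
            self.free.append(phys)

    def write(self, dst: int) -> int:
        """Ordinary instruction producing a new value in dst: allocate a fresh phys reg."""
        new = self.free.pop(0)      # IndexError here would model a rename stall
        self.refs[new] = 1
        old = self.map[dst]
        self.map[dst] = new
        self._release(old)
        return new

    def mov(self, dst: int, src: int) -> int:
        """MOV dst, src handled at rename: alias dst to src's phys reg, no execution."""
        phys = self.map[src]
        if self.map[dst] == phys:
            return phys             # already aliased, nothing to do
        self.refs[phys] += 1
        old = self.map[dst]
        self.map[dst] = phys
        self._release(old)
        return phys


table = RenameTable(num_arch=4, num_phys=8)
table.mov(1, 0)      # x1 now shares x0's physical register
table.write(0)       # x0 gets a new register; x1 keeps the old value alive
```

The point of the sketch is that once two architectural registers can share a physical one, freeing can no longer be "free the old mapping when overwritten"; it needs some form of sharing bookkeeping, which is why changing the MOV scheme ripples into the whole register file design.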
This is in comparison to x86 which seems to live in (probably justified) terror that any change they make, no matter how low level, will probably break something because the whole system is so complex and so interconnected that no one person holds the entire thing in their head.
They also seem to have a good system in place for hiding new functionality behind chicken bits, so that they can effectively debug new features within shipping hardware. For example there are reasons to believe that A14 might have in place most of the pieces required for physical register file amplification (avoid allocation for back-to-back register usage and grab the intermediary off the bypass bus; early release of logically overwritten registers) but these are not visible -- probably behind chicken bits so that they can be tested under all circumstances in shipping HW, and made visible for A15.
And anyone who has not looked at the details is unaware of just how impressive the underlying Apple µArch platform is. There is substantial room there for on-going growth! As I continue to explore it, not only do I see how well it works today, I also see multiple directions in which it could "easily" (ie feasibly, on schedule and within budget) be improved for years to come. The only other artifact I know of that comes close in terms of quality of implementation and ability for continuing growth is the Mathematica code base -- other artifacts like other CPUs, or various OS implementations, are in a totally different (and far inferior) league.
name99 - Thursday, April 29, 2021 - link
To expand on my point, it's great that ARM are including so many good ideas, but it's also astonishing the extent to which pretty much every good idea already has an Apple precedent.
For example consider the MPAM discussion: "The mechanism to which this can be achieved can also include microarchitectural features such as dispatch throttling where the core slows down the dispatched instructions, smoothing out high power requirements in workloads having high execution periods, particularly important now with the new wider 2x256b SVE pipelines for example."
This sounds like (and IS) a good idea -- certainly a lot better than reducing frequency the way Intel does for AVX512.
But look at this Apple patent from 2011(!)
https://patents.google.com/patent/US9009451B2
"A system and method for reducing power consumption through issue throttling of selected problematic instructions. A power throttle unit within a processor maintains instruction issue counts for associated instruction types. The instruction types may be a subset of supported instruction types executed by an execution core within the processor. The instruction types may be chosen based on high power consumption estimates for processing instructions of these types. The power throttle unit may determine a given instruction issue count exceeds a given threshold. In response, the power throttle unit may select given instruction types to limit a respective issue rate. The power throttle unit may choose an issue rate for each one of the selected given instruction types and limit an associated issue rate to a chosen issue rate. The selection of given instruction types and associated issue rate limits is programmable."
I just keep bumping into this stuff! Arm release new cores with what seem like good ideas (and of course ARM tell us a lot more about what's new than Apple does). I do some exploring -- and find Apple patented that idea five or more years earlier!
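The mechanism in the quoted patent abstract (count issues of selected power-hungry instruction types, compare against a threshold, then cap the issue rate of the offenders) can be sketched as a toy Python model. Everything here is an assumption for illustration: the class name, the fixed counting window, and the single shared rate limit are not taken from the patent or from any ARM or Apple design.

```python
from collections import Counter


class IssueThrottle:
    """Toy model of issue throttling for selected instruction types."""

    def __init__(self, tracked: set, threshold: int, limited_rate: int, window: int):
        self.tracked = tracked            # instruction types deemed power-hungry
        self.threshold = threshold        # issues per window that trigger throttling
        self.limited_rate = limited_rate  # max issues per cycle while throttled
        self.window = window              # cycles per counting window
        self.counts = Counter()
        self.throttled = set()
        self.cycle = 0

    def issue(self, requested: dict) -> dict:
        """requested maps instruction type -> number ready to issue this cycle."""
        granted = {}
        for kind, n in requested.items():
            cap = self.limited_rate if kind in self.throttled else n
            granted[kind] = min(n, cap)
            if kind in self.tracked:
                self.counts[kind] += granted[kind]
        self.cycle += 1
        if self.cycle % self.window == 0:
            # Re-evaluate at the window boundary: throttle types over the threshold.
            self.throttled = {k for k, c in self.counts.items() if c > self.threshold}
            self.counts.clear()
        return granted


throttle = IssueThrottle(tracked={"sve256"}, threshold=6, limited_rate=1, window=4)
for _ in range(4):
    throttle.issue({"sve256": 2, "int": 4})        # 8 wide-vector issues in one window
print(throttle.issue({"sve256": 2, "int": 4}))     # sve256 is now capped to 1 per cycle
```

Note that only the tracked type is slowed; the integer ops keep issuing at full rate, which is the advantage over dropping the clock frequency for the whole core.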
mode_13h - Thursday, April 29, 2021 - link
> consider the MPAM discussion
The slide calls that MPMM. The article confuses it with Memory Partitioning And Monitoring.
Anyway, the N2's PDP sounds a lot more advanced.
GeoffreyA - Friday, April 30, 2021 - link
Zen 3 needn't blush when standing next to Apple. 4-wide decode might be small but that does pick up to 6, coming out of the micro-op dispatch. Then, going down, you've got 10-wide issue on the INT side, and 6-wide on FP. Admittedly, narrower register files and 8-wide retire from the (smaller) ROB, along with smaller caches. As for move elimination, even Skylake has that. Yes, everything tends to be narrower. But I think it goes to show there's nothing particularly out of this world on the Apple side.
name99 - Friday, April 30, 2021 - link
I did not say that move elimination was the interesting part.
I said that what was interesting is that over Apple's short CPU career they have already implemented it in three significantly different ways.
That strikes me as interesting and important -- there is no resting on the laurels, no acceptance that "we have the feature, OK to slow down". You honestly believe that Intel operates according to that same mentality?
GeoffreyA - Friday, April 30, 2021 - link
I'll admit, Apple isn't resting; and they aren't scared to break orthodoxy in advancing their designs. If the others do not wake up, they'll be left in the dust. As for Intel, complacency has put them in the well-deserved pickle they're in today. AMD deserves credit, though, for doing much these past few years; and, like Apple, they aren't resting on their laurels either (as they arguably did in the K8 era).
mode_13h - Friday, April 30, 2021 - link
How big are Apple's cores, though? Area is tricky, because they tend to be on a newer process node.
But, my point is that maybe AMD and Intel aren't making their cores even larger and more complex, because they're targeting the server market and found that a more area-efficient way to scale performance is by adding more cores, rather than making their existing cores even more complex.
name99 - Friday, April 30, 2021 - link
A14 is 88mm^2.
Eyeballing it, each large core is about 2.5% of that area, so 2.2mm^2
Throw in the L2 at about 3.5%, so 3mm^2 (shared between two cores).
Throw in the SLC (not exactly an L3, but pretend it is if you insist) at 8.8% and about 8mm^2.
I guess if you were targeting a server type design, we could probably treat it as something like
2.5+(3/4)+ (8.8/8) [making rough guesses about what sort of L2 and L3 would be optimal for a server type design] so ~4.4mm^2.
Could fit ~100 in a 440mm^2 (though you'd also want some memory controllers and IO!)
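The eyeballed estimate above can be redone with every term kept in mm^2. This is only a sketch of the same arithmetic in Python: the percentages are the ones given in the comment, while the sharing factors (L2 split across 4 cores, SLC across 8) are my reading of the "(3/4)" and "(8.8/8)" terms, and the variable names are made up.

```python
A14_DIE_MM2 = 88.0

# Area fractions eyeballed in the comment above.
core_mm2 = 0.025 * A14_DIE_MM2   # one large core
l2_mm2 = 0.035 * A14_DIE_MM2     # L2 block
slc_mm2 = 0.088 * A14_DIE_MM2    # system level cache


def per_core_area(core: float, l2: float, slc: float,
                  cores_per_l2: int = 4, cores_per_slc: int = 8) -> float:
    """Core area plus its share of L2 and SLC, all in mm^2."""
    return core + l2 / cores_per_l2 + slc / cores_per_slc


server_core_mm2 = per_core_area(core_mm2, l2_mm2, slc_mm2)
cores_in_440mm2 = 440.0 / server_core_mm2

print(f"{server_core_mm2:.2f} mm^2 per core, ~{cores_in_440mm2:.0f} cores in 440 mm^2")
```

Done this way the per-core figure lands a little under the ~4.4mm^2 in the comment, which seems to add the 2.5 and 8.8 percentage values directly, but the conclusion of roughly a hundred cores in a 440mm^2 die (before memory controllers and IO) is unchanged.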
Definitely a lot larger than something like an N1 or N2 -- but of course, Apple isn't designing for the data center -- if they were, they'd probably adopt something halfway between Fire Storm and Ice Storm.
The problem is not that Intel and AMD are chasing the server market, it is that the way they are chasing it is incoherent. IF your primary goal is the server market, then WTF are you designing for super-high frequencies? The data center cores never run at those frequencies -- but being able to boast about them means your transistor density is half to a third that of Apple (or more relevant ARM/Altra/Graviton) on the same process...
Pick a goal and optimize for it! But Intel's goal seems to be to optimize for marketing that they can hit 5.x GHz (for half a second...) Not clear that designing the entire company around that goal (of zero interest to the data center, and little interest to most users) is such a great long-term strategy.
mode_13h - Saturday, May 1, 2021 - link
> IF your primary goal is the server market, then WTF are you designing for super-high frequencies?
AMD actually has lagged Intel in frequency, and I think that's one of the reasons. Remember, AMD is the only one using the same exact silicon on both the mainstream desktop and in all their server products.
Intel, on the other hand, has completely separate silicon for their server dies, and we don't know all of the subtle ways they could differ from their desktop or laptop cores. We just know they tend to reuse the same basic core micro-architecture up and down their product lines (except for the really cheap/low-power stuff).
> The data center cores never run at those frequencies
A few Cascade Lake Xeons could turbo up to 4.5 GHz, which benefits certain workloads. The fastest turbo clock of an Ice Lake Xeon is 4.4 GHz
The fastest EPYC can boost up to 4.1 GHz.
In Intel & AMD's defense, a simpler core can clock higher, but runs more efficiently at lower clocks and enables higher densities. So, it seems like a pretty good strategy to me.
> Pick a goal and optimize for it!
Like ARM, Intel and AMD are trying to balance power (in a server/laptop application, at least) and area. Apple is the only one who really has the luxury not to care much about area and just optimize for a single target. When Apple reuses its laptop core micro-architecture in both desktops AND servers, then we can compare them to the other guys. Until then, I think it's a case of Apples and pears.
GeoffreyA - Saturday, May 1, 2021 - link
"Until then, I think it's a case of Apples and pears."
I think that's it. Well, soon process improvements will be a thing of the past, owing to quantum effects, and then we'll see who does what. The free ride is almost over.
GeoffreyA - Friday, April 30, 2021 - link
"This is in comparison to x86 which seems to live in (probably justified) terror that any change they make, no matter how low level"
P6, Netburst, Sandy Bridge, and Bulldozer seem like pretty big changes.
name99 - Friday, April 30, 2021 - link
(a) Sandy Bridge was the last such.
(b) Look at the relative spacing (in time) for the two cases.
Look, I'm not interested in "x86 vs ARM. FIGHT!!!"
I'm simply pointing out various patterns I've noted that strike me as interesting and significant. If other people have similar such patterns to point out -- interesting and non-obvious aspects of new x86 micro-architectures, or patterns in how those micro-architectures have evolved over the past few years, they should add a comment.
But to this outsider the micro-architectures look stagnant -- utterly so in the case of Intel, mostly so in the case of AMD. In particular, slight scaling up of an existing micro-architecture because a new process is more dense is not interesting! What is interesting is a new way of conceptualizing the problem that allows for a step change in the micro-architecture; and that is what I am not seeing on the x86 side.
I do see it in IBM (though for purposes that are, to me, uninteresting, both for POWER and for z/)
I do see it in ARM Ltd.
mode_13h - Friday, April 30, 2021 - link
> What is interesting is a new way of conceptualizing the problem that allows for a step change in the micro-architecture
Yes, but I think that largely depends on the ISA. And there, ARM has indeed been rather stagnant. Besides SVE and their new security features, most of their ISA changes have been tweaking around the margins. Not a fundamental rethink, or anything close to it.
What we need is more willingness to rethink the SW/HW divide and look at what more software can do to make hardware more efficient. Whenever I say this, people immediately seem to think I mean doing a VLIW-like approach, but that's too extreme for most workloads. You just have to look at an energy breakdown of a modern CPU and think creatively about where compilers could make the hardware's job a little bit easier or simpler, for the same or better result.
You can also flip it around, and ask where the primitives CPUs provide don't quite match up with what software is trying to do. I think TSX/HLE stands as an interesting example of that, and probably one where Intel doesn't get enough credit (granted, partly due to their own missteps).
name99 - Friday, April 30, 2021 - link
Architecture and micro-architecture are two different things.
You want to fantasize about different architectures, be my guest. But I'm interested in MICRO-ARCHITECTURE and that was the content of my comments.
mode_13h - Saturday, May 1, 2021 - link
> Architecture and micro-architecture are two different things.
The principal manifestation of the HW/SW divide is the ISA. That's why I talk about it rather than "architecture", which is a word that can mean different things to different people and in different contexts.
> You want to fantasize about different architectures, be my guest.
It's about as on-topic here as ever, given that we've gotten our most detailed look at ARMv9, yet. And performance + efficiency numbers!
> But I'm interested in MICRO-ARCHITECTURE and that was the content of my comments.
There's only so much you can do, within the constraints of an ISA. ARM had a chance to think really big, but they chose to play it safe and be very incremental. That could turn out to be a very costly mistake, for them and some of their licensees.
I just want what I think we all want, which is another decade of progress in performance and efficiency like the last one. So far, I'm not very hopeful. I guess we need to really hit the wall, before people are ready to get serious about embracing options to push it back, a bit further.