
NVIDIA "Vera" CPU Benchmarked: Beating Intel Xeon and AMD EPYC

erek

"For comparison, Phoronix tested single and dual Intel Xeon "Granite Rapids" 6980P CPUs, as well as AMD EPYC "Turin" and "Turin Dense" models like the AMD EPYC 9755, 9575F, and 9475F. They also included NVIDIA's first-generation "Grace" design based on Arm Neoverse V2 cores. NVIDIA allowed only a specific subset of tests on this pre-release chip, including standard workloads like code compilation, stream memory performance, video encoding, Python/Java, and database performance. In the geometric mean of all test results, NVIDIA's "Vera" topped the chart, performing nearly 11% better than AMD's most advanced designs and about 55.3% better than the best single-socket Intel Xeon. It also outperformed dual-socket configurations, suggesting that some workloads have scaling issues across multiple sockets. These limited results place "Vera" above any Arm-based design, with a 450 W TDP for the CPU and 50 W for the 768 GB memory pool.

NVIDIA is projected to sell about $20 billion worth of "Vera" and "Grace" CPUs, tapping into a $200 billion Total Addressable Market (TAM) with its standalone offerings. As NVIDIA partners with every major hyperscaler to supply "Vera" CPU racks, we are starting to see many deployments across infrastructure providers for their own use cases and offerings to third-party customers. This approach allows NVIDIA to tap into a massive market, potentially propelling it to become one of the largest CPU makers this year and likely for years to come."

1779810909451.png

Source: https://www.techpowerup.com/349361/...g-intel-xeon-and-amd-epyc-in-select-workloads
 
Seems like Nvidia's CPU effort is well run:

Going into this, I didn't really know what to expect of NVIDIA's Vera with the new Olympus cores. But in the end I was left realizing this is the most formidable competition to Intel and AMD x86_64 processors ever realized. NVIDIA Vera is much more performant than what we have seen out of the likes of Ampere Computing or the custom in-house ARM solutions at public cloud providers like Google Compute Engine and Microsoft Azure.
....
Beyond the aggressive performance out of NVIDIA Vera with its competitiveness to x86_64 and excellent gen-on-gen performance compared to Grace, the other enlightening experience from my time at NVIDIA was seeing the great upstream open-source support for Vera. Given they aren't relying on ugly Device Tree files and complying nicely with the various Arm standards, the upstream kernel support for NVIDIA Vera is in good shape and in turn for modern, prominent AArch64 Linux distributions. .. with the Olympus cores it's great to see everything able to run nicely on a mainline Linux kernel
....
Additionally, NVIDIA having upstreamed their Olympus core support to the GCC and LLVM Clang compilers well in advance of launch.


As they point out, Nvidia could keep up a yearly refresh pace (its current two-year generations with a yearly refresh, or something even more aggressive like Apple's roughly one-year full generational pace), which would be a big expense and effort for AMD and Intel to keep up with.
 
x86 has too many legacy ops imo.

There's a lot to test and validate every release.
 
Nvidia keeping all those cores, cache and fabric on one big die, with just some small I/O off to the side, could be quite something latency-wise.

1779812969281.png


Under Nvidia's current business model, the foundry cost of a giant, low-yield ~600 mm² chip is not the issue; that is not where the cost sits in a complete liquid-cooled NVL72 rack. That could make it hard for Intel's and AMD's chiplet designs, which aggressively chase margin and profit, to compete.

And GPU makers are used to selling silicon at a much cheaper price than CPU makers.
 
Apparently Nvidia is expecting nearly $20 billion from the CPU business in 2026.

That could already be more than Intel's data center CPU revenue for 2026, or close to it if not, and it could be significantly more than AMD's data center CPU revenue (its latest quarter, with CPU + GPU + everything else in that space, was running at a $23 billion yearly pace).
 

Indeed.

Granted, I believe a lot of this breaks down to the creation of specialist chips VS GPUs. Think ASICs for mining Bitcoin vs GPUs, same idea. That doesn't mean Nvidia can't also create similar chips, but there's a definite trend that their customers are also quickly becoming their competitors (Cerebras, Google with the TPU, Amazon's Annapurna Labs, etc). Nvidia's still the go-to for a lot of things in the space, but things are definitely heating up!
 
Jensen and his staff seem to have shifted the focus of Nvidia's future from just AI to robotics, where the GPU serves a dual role of training the AI as well as running the physics simulation used to train the model. And it seems to be working, in that models trained in simulation are able to run on robots in the real world, though there is still some sim2real gap. I suspect Jensen is well aware of the competition and is already working on how to place GPUs in roles where ASICs can't compete with a more general-purpose SIMD processor.
 
x86 has too many legacy ops imo.

There's a lot to test and validate every release.

That legacy is what makes it so valuable though all the same - a ball and chain that if you cut loose is just cutting your nose to spite your face.

If they want to compete here (aside from just performance increases to x86_64 chips) it's better to develop separate chip lines/families versus dropping legacy (which both AMD and Intel are doing currently, with their own ARM chips). Itanium was just too early in a way.
 
specialist chips VS GPUs ... That doesn't mean Nvidia can't also create similar chips,
I suspect Jensen is well aware of the competition and is already working on how to place GPUs in roles where ASICs can't compete with a more general-purpose SIMD processor.
Yes, Jensen is well aware, and he paid $20 billion to buy Groq for a reason. Their low-latency LPU is an ASIC for LLM-type inference only; it does just that, with extreme simplicity and efficiency, and was built from the ground up for it by the guy who made the Google TPU a decade before. He saw the challenge and the best solution quite early, and Nvidia has it now.
 
That legacy is what makes it so valuable though all the same - a ball and chain that if you cut loose is just cutting your nose to spite your face.

If they want to compete here (aside from just performance increases to x86_64 chips) it's better to develop separate chip lines/families versus dropping legacy (which both AMD and Intel are doing currently, with their own ARM chips). Itanium was just too early in a way.
Do legacy applications from before 2000 really need the performance uplift of a 2026 CPU, or can you run them in a VM and still get an order of magnitude more performance than they required back then?
 
Highlighting the most important part of this slide...

1780014336190.png


In other words: Geometric mean more is better.

Also: Made up numbers on a slide are made up numbers on a slide.
 
Do legacy applications from before 2000 really need the performance uplift of a 2026 CPU, or can you run them in a VM and still get an order of magnitude more performance than they required back then?

I'd say yes, when those VMs can rely on CPU/HW passthrough and don't have to emulate a different architecture altogether, which creates even more overhead and slowdowns.
 
Do legacy applications from before 2000 really need the performance uplift of a 2026 CPU, or can you run them in a VM and still get an order of magnitude more performance than they required back then?
Aren't the CPUs themselves already doing this a bit? With the way the decoder cuts big operations into µops anyway, I'm not so sure how much gain there is to be had here.

Vera's big advantage here is support for much faster memory, NVLink, being monolithic instead of cutting cost with chiplets, and, ISA-wise, having fixed-length instructions that make it easy to go really wide on a really big chip (at the cost of slightly bigger code in memory).

An x86 that dropped some old instructions would still need a variable-length decoder for all the ones in use, and that is still a massive number of instructions for a complex decoder. The very ancient, arcane ones already have a ready-made table of µops, all of which takes a very small transistor budget on a modern chip.
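
To illustrate the fixed- versus variable-length point, here is a toy sketch. It is not real AArch64 or x86 encoding: the 4-byte width matches AArch64, but the variable-length format (first byte holds the instruction's length) is invented purely for illustration, and both function names are hypothetical.

```python
def fixed_boundaries(code: bytes, width: int = 4) -> list[int]:
    """Fixed-length ISA: every start offset is known up front, so a wide
    decoder can pick up many instructions independently."""
    return list(range(0, len(code), width))


def variable_boundaries(code: bytes) -> list[int]:
    """Toy variable-length ISA: an instruction's start depends on the
    length of every instruction before it, so finding starts is serial."""
    offsets, pc = [], 0
    while pc < len(code):
        offsets.append(pc)
        length = code[pc]  # hypothetical format: first byte = length
        if length == 0:
            raise ValueError("zero-length instruction in toy stream")
        pc += length
    return offsets


fixed_stream = bytes(16)  # four 4-byte instructions
variable_stream = bytes([1, 3, 0, 0, 2, 0, 5, 0, 0, 0, 0])

print(fixed_boundaries(fixed_stream))        # [0, 4, 8, 12]
print(variable_boundaries(variable_stream))  # [0, 1, 4, 6]
```

Real x86 front ends soften the serial part with tricks like predecode information and µop caches, which is part of why it is hard to say how much dropping old instructions would actually buy.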

Intel did have its x86-S effort:
https://www.tomshardware.com/pc-com...-de-bloat-x86-instruction-set-comes-to-an-end

That continued with AMD: dropping 16-bit real mode and other simplifications. I can see them saving a bit of money on design, but how much will it pay off in performance?

Also: Made up numbers on a slide are made up numbers on a slide.
Not sure what your issue with how Phoronix averages its benchmark results could be; it seems like the only way to do it that makes sense.

You can look at the individual benchmarks: https://www.phoronix.com/review/nvidia-vera-benchmarks/3

An 88-core Arm CPU is as fast as 2x 64-core EPYC CPUs (or 2x 128) at compiling code. These were Nvidia-approved workloads only, which is an important caveat, but they are still quite general ones rather than pure AI workloads. (We can imagine that with agentic AI, the agents want to write, compile and execute code all the time, so a good workstation-class CPU needs to be available to them.) And having 4 times the memory bandwidth per core is just raw power that will show up all around.
 
You guys are missing some things. Firstly, Phoronix didn't have full control over testing. "NVIDIA also requested only specific workloads relevant to the intended workloads/domains that Vera is catering to in the data center be tested." They did the testing at Nvidia's office. To give you an idea, Stream was one of the biggest wins and this was said for the test. "For Stream memory benchmarks, NVIDIA okay'ed the plain upstream Stream code where as some vendors prefer their own customized versions or built with their own compiler toolchains. With upstream Stream and all built using GCC, NVIDIA Vera shows its impressive memory bandwidth capabilities."

I'm more concerned that Phoronix allowed their tests to be limited.
 
I mean, from the guy who said the RTX 5070 was "4090 performance at $549", what idiot would trust anything NVIDIA says in regards to their own performance claims?
 
Phoronix was super clear about this; in fact you took it from their first page. No one missed it. And a datacenter CPU being tested on datacenter-type workloads is not that special. Here they were even able to go a little outside of that: compiling the Godot game engine is not necessarily the typical use case.

As for not using the vendors' special tricks in the comparison, as was done here: for real-world usage that can be the best way to do it. It is far from certain that the code and all its dependencies will be rebuilt for Intel oneAPI and the like, with super-aggressive optimisation flags and a lot of tuning work.
 
It's odd to not only not give a sample to Phoronix, but also limit their testing and only allow it in their presence. In fact, if you read the comments section, Michael mentions that this has only been done once before, 16 years ago, for a Calxeda ARM server. You're not comfortable with your CPU's performance if you have to do all this.

To give you an idea how these tests skewed the results, I don't see too many wins for Nvidia's Vera, but where it does win it usually wins big, like Stream, LuaJIT, and Renaissance. Even Grace does well in some of these tests. Can't trust these tests.
vera LuaJIT.png
 
Well, I never assumed these cpus would try to be as general purpose as most.
 
As a mathematician I am sure that you know that using the mean (aka average for non math folks) is one of the worst statistical tools available. A few very high or very low numbers can skew the average a lot especially with a small sample set. It is used because it is easy to calculate and most common people understand it.
 
1780167011830.jpeg


Geometric mean is used to show the typical proportional changes. Using the linear mean here would not make sense.

They could use the geometric median, which is robust to outliers, similar to how the linear median is robust to outliers.

Not going to dispute that, but if there was a big difference in those two I would probably show both to highlight that some differences dominate the data.

R = geo mean / geo median

If R ~ 1 the data is clean.
R > 1 there's skew depending on how much.

You can play with some toy data to see this pretty easily.
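
A quick toy version of that check, as a sketch only. The speedup ratios are made up for illustration, not taken from the Phoronix results, and "geometric median" is taken here to mean the plain median of the per-test ratios (for an odd number of tests that is the same as exponentiating the median of the logs). The function name `skew_ratio` is hypothetical.

```python
import statistics


def skew_ratio(ratios):
    """R = geometric mean / median of per-test speedup ratios."""
    return statistics.geometric_mean(ratios) / statistics.median(ratios)


# Hypothetical per-test speedups versus a baseline (1.0 = same speed).
clean = [0.9, 1.0, 1.1, 1.05, 0.95]
skewed = [0.9, 1.0, 1.1, 1.05, 4.0]  # one Stream-style blowout win

print(f"clean  R = {skew_ratio(clean):.3f}")
print(f"skewed R = {skew_ratio(skewed):.3f}")
```

With the clean set R stays very close to 1; swapping a single result for a 4x win pushes it well above 1, which is the signal that a few tests are dominating the average.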
 
It makes sense for Vera to absolutely crush in some metrics and get absolutely wrecked in others.
Nvidia knows what their existing CPU's are primarily used for, they know how Intel and AMD CPU's are primarily used when paired with Nvidia's Datacenter GPU's.
I have to assume that Nvidia has spent a lot of time optimizing for those use case scenarios and tailored the fit accordingly.

Nvidia isn't likely selling this as a run of the mill all rounder CPU like you would expect from an x86 chip, it is specialized, and I would expect it to significantly outperform other more generic products in the tasks it's specialized for.
 
I agree that the geometric mean is less sensitive to extreme values, but why use that here? My math skills are very rusty since I got two other degrees after the math degree, so I used a refresher on when to use which.

Use the arithmetic mean when your data adds together (e.g., total quantities, uniform distributions), and the geometric mean when your data multiplies together (e.g., growth rates, ratios, compounded returns).

Arithmetic Mean
Best for: Data where quantities are additive and independent (e.g., total sales, average test scores, uniform distributions)

Geometric Mean
Best for: Data where quantities are multiplicative and compound over time (e.g., growth rates, ratios, compounded returns)
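
One way to see why benchmark roundups that average normalized scores lean on the geometric mean: it gives the same verdict no matter which machine is used as the baseline, while the arithmetic mean of ratios does not. A small sketch with made-up timings (the machine names and numbers are hypothetical):

```python
import statistics

# Hypothetical run times in seconds for two tests (lower is better).
times = {"A": [10.0, 100.0], "B": [20.0, 50.0]}


def relative(machine, baseline):
    """Per-test speedup of `machine` over `baseline` (higher is better)."""
    return [b / m for m, b in zip(times[machine], times[baseline])]


for base in ("A", "B"):
    other = "B" if base == "A" else "A"
    r = relative(other, base)
    print(f"{other} vs baseline {base}: "
          f"arithmetic={statistics.mean(r):.2f} "
          f"geometric={statistics.geometric_mean(r):.2f}")
```

With these numbers the arithmetic mean claims each machine is 25% faster than the other, depending on which one is the baseline, while the geometric mean calls it a tie both ways.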
 
Well, yes. But you often hear about the median, and for linear data the median is the middle value. It's less sensitive to outliers.

The geometric median is a little different, but same idea with sensitivity to outliers.

But for performance tests the geometric mean makes sense. With the linear mean of continuous measurements you would need to normalize the data before averaging, and that doesn't make sense for these kinds of tests.
 
I have never seen a video card or CPU review use the geometric mean before.
 
It's odd to not only not give a sample to Phoronix, but also limit their testing and only doing so in their presence. In fact, if you read the comments section we can see that Michael mentions that this is only been done once before 16 years ago for Calxeda ARM server. You're not comfortable with your CPU's performance if you have to do all this.
They are not comfortable with the current power usage, as the chips are not power-tuned yet (that is the reason given to Phoronix for why they would not be allowed to run power metrics on them).

To give you an idea how these tests skewed the results, I don't see too many wins for Nvidia's Vera, but where it does win it usually wins big, like Stream, LuaJIT, and Renaissance. Even Grace does well in some of these tests. Can't trust these tests.
You cannot trust that a fast Fourier transform (a good workload for a GPU), run by Phoronix from open-source code they compiled themselves, runs quite fast on these higher-memory-bandwidth CPUs? Fourier transforms tend to be memory-bottlenecked, these CPUs have way better bandwidth and latency, and Grace was also an impressive CPU for bandwidth.


Well, I never assumed these cpus would try to be as general purpose as most.
That's the big surprise here: being faster at compiling a Linux kernel than EPYC CPUs with more cores is a bit strange...

I imagine that:
- AI agents use tools, compile code to run it, query SQL databases, parse text/JSON files, and run Python that can do a lot of stuff
- They use them in at least 3 different products, one being an actual CPU rack and another being the BlueField storage system
 
Code compilation is highly branchy. I suspect that Nvidia doesn't care much about SIMD or general math performance and instead focused on strings and branch prediction. The heavy math would be done by the gpu. The fast interconnection makes the gpu basically a coprocessor.
 
Highlighting the most important part of this slide...

View attachment 806091

In other words: Geometric mean more is better.

Also: Made up numbers on a slide are made up numbers on a slide.

https://www.phoronix.com/review/nvidia-vera-benchmarks

You could read the full article. The numbers are not pulled out of backsides. It's an average of all the individual tests they ran.
Like comparing Intel and AMD, there are situations where one is faster than the other. There are situations in which they are so close it makes no difference.

What is of note here is how this differs from previous ARM server parts. With those it was always a case of "let's look at the power usage". The entire point was that ARM does more per watt, and/or more per $ spent. As in: for 50k you could get X number of x86 cores that suck this much power and do this much work, or for 50k you get twice as many ARM cores that draw the same power and do 1.2x the work because you have twice as many cores.

With Vera (and to be fair this is also true of Amazon's and Microsoft's latest ARM chips) the comparisons are starting to be made on a 1:1 basis, comparing one x86 core and one ARM core. It's becoming clear that for many workloads ARM is not just using less power or offering more value... it is starting to win outright in terms of computational power.
 
ARM has an intrinsic advantage when you are dealing with a system that has to handle a crapload of small, simple tasks. In a highly transactional environment where the CPU is mostly performing fetch tasks, having a crapload of weak cores can perform better overall than a small number of strong cores, even if the strong cores benchmark much better, because in a large environment, the CPU isn't (or shouldn't be) your bottleneck; you have a dozen other factors at play that impact overall system performance before you even get to the CPU. So, really, the individual cores need to be powerful enough that they aren't your bottleneck, and ARM does a really good job of falling into that category.
 
You cannot trust that a fast Fourier transform (a good workload for a GPU), run by Phoronix from open-source code they compiled themselves, runs quite fast on these higher-memory-bandwidth CPUs? Fourier transforms tend to be memory-bottlenecked, these CPUs have way better bandwidth and latency, and Grace was also an impressive CPU for bandwidth.
My analysis reposted.

LuaJIT is weird. I think of it like a real-time compiler. It takes byte code, reinterprets it, caches it, then optimizes its execution.

What this isn't is a server bench. 1.4k Mflops for an FFT is basically nothing. Say for instance you're computing a 1024-point complex FFT. FFTW will execute that in about 5 usec or 200k iterations a sec. That works out to roughly 10.2 Gflops for a single core.
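
The back-of-envelope arithmetic behind that figure, as a sketch. It assumes the usual 5·N·log2(N) flop-count convention for a complex FFT (the one FFTW's own benchmark pages use) and takes the ~5 usec per transform from the post as given rather than measured. The function name `fft_gflops` is hypothetical.

```python
import math


def fft_gflops(n_points: int, seconds_per_fft: float) -> float:
    """Nominal GFLOPS for a complex FFT, using the 5*N*log2(N) convention."""
    flops = 5 * n_points * math.log2(n_points)
    return flops / seconds_per_fft / 1e9


n = 1024
t = 5e-6  # ~5 usec per transform, the figure quoted in the post

print(f"{1 / t:,.0f} transforms per second")
print(f"{fft_gflops(n, t):.2f} GFLOPS nominal")
print(f"ratio to 1.4 GFLOPS: {fft_gflops(n, t) / 1.4:.1f}x")
```

That is 5 × 1024 × 10 = 51,200 nominal flops per transform, which at 200k transforms a second comes to about 10.2 GFLOPS, roughly 7x the ~1.4 GFLOPS in the chart.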

This is a micro anomaly skewing multi-core metrics. It may belong in a comparison of Java vs LuaJIT vs native compilers for different platforms, but not in a multi-core server review.

I guess some people do run LuaJIT on servers for things like Nginx routing, so it can't be totally discounted, but it is not a realistic metric for high-performance floating-point hardware.

2nd post

Continuing with my obsession with FFTs. Vera on paper would be very good at them.

double precision FFT
L1: 96 KB cache, up to 2048 points without spilling. * see below.
L2: 2 MB, up to 128k points.
L3: 162 MB, up to 8M points.
monolithic - no chiplet fabric penalty
32 registers - faster operations before it becomes memory bound.

The shortcomings are not in the actual benchmark, nor the chips being tested, but the lack of context and variance that makes this unpalatable.

With what I know about FFTs, a few extra data points could provide a lot of context on what is going on here. If we had a full picture of which FFTs were being calculated, I wouldn't have to guess so much. From experience, as stated above, this looks like a 1024-2048 point FFT that fits in Vera's huge L1 cache. If a 256 or 512 point problem were passed through the benchmark and the gap shrank, you could ascertain how big a role cache was playing in this result. Sizes further up, 2048 and 4096, would give a better map of where the problem sits in the hierarchy.

So that's the FFT side. On the JIT side, we know the approximate performance of FFTW. If it gets 10 GFlops at 1024 and this gets 1.4 GFlops, that creates a ~7x gap. If that gap is applied to the x86 side it balloons to 10 / 0.5 = 20x. This provides some idea of what LuaJIT is costing the system. Further, in my research it seems LuaJIT is very register dependent. Vera's 32-register complement could be showing its advantage.

* LightningDust pointed out that the cache is 96k. Typically in a FFT you need the size of the problem + twiddle factors + buffers + etc. So a 1.5x cache size can usually handle the next cache up. So

32k = 512
48K and 64k = 1024
96K and 128k = 2048
 
having a crapload of weak cores can perform better overall than a small number of strong cores,
Not sure, looking at Apple's or Nvidia's core counts, that this is what's going on here.

88 cores was not a high count in the Phoronix benchmark (the x86 competition ranged from 64 to 256 cores). Same for Apple's M5 CPUs: they often have similar or smaller core counts than their Intel and AMD competition.

ARM has an intrinsic advantage when you are dealing with a system that has to handle a crapload of small, simple tasks.
Why? The predefined fixed-length instruction size, or the rule of no operations directly on memory?

I can see that, but versus a chiplet CPU, the giant memory bandwidth feeding a big monolithic tile is where I think the Apple/Nvidia/Google/Amazon advantage will tend to be.

Chiplets are nice for keeping Intel's and AMD's costs down, but that comes at a big performance cost, particularly for that type of scenario. Same for how they manage RAM.

Grace was a giant chip near the ~850 mm² die-size limit (774 apparently). Vera is apparently 550-600 mm² for its main tile, with a system I/O, 8 memory controllers and NVLink chiplets around it.

It would be hard to separate what comes from the ISA choice (x86 vs ARM) from how much cost-cutting Intel and AMD are doing on the CPU compared with Amazon and Nvidia.
 
ARM has an intrinsic advantage when you are dealing with a system that has to handle a crapload of small, simple tasks. In a highly transactional environment where the CPU is mostly performing fetch tasks, having a crapload of weak cores can perform better overall than a small number of strong cores, even if the strong cores benchmark much better, because in a large environment, the CPU isn't (or shouldn't be) your bottleneck; you have a dozen other factors at play that impact overall system performance before you even get to the CPU. So, really, the individual cores need to be powerful enough that they aren't your bottleneck, and ARM does a really good job of falling into that category.
Traditionally.
That paradigm is about to get smashed to pieces. That is my take away from benchmarks from the very latest ARM server parts.

There is nothing keeping the ARM ISA from being just as performant PER core. It used to be you get 1.5x as many cores per $, and per watt. So for low power server stuff ARM shines.
The very latest ISAs, and with software starting to properly leverage things like the latest versions of SVE... ARM is starting to hold its own when you compare one core to one core.

Of course most ARM things will still lean into the low power benefits of the ISA.

I think the next year or two though is when we are going to start seeing ARM server solutions that are ZERO compromise vs X86. The progress is happening fast. I mean in just one generation Nvidia has essentially doubled their ARM performance in most tasks.
 
Apple optimizes its programs to make extensive use of every accelerator and instruction set available to it.
Built into Xcode and the LLVM compiler, Apple has Whole Module Optimization (its rough equivalent of Intel's Binary Optimization Tool), and they just do a fantastic job of making sure that programs on their platforms take advantage of every accelerator or instruction set available to them. It's something Microsoft and most Linux developers just can't do; there are simply too many hardware variations out there.

In terms of bottlenecks, though, I am not really talking about bandwidth to the CPU. That is a factor, of course, but so are storage, networking, GPU accelerator bandwidth, system RAM bandwidth, and communication times from the GPU to system storage and from system storage to system RAM. In big environments, storage isn't always on the same device where the CPU lives, so it is often a massive storage array that is being accessed by dozens of other machines across the network.
Then there is also the interrupt pipeline every time something needs something from elsewhere in the system.
A GPU, for example, can't talk directly to system storage for data. The GPU needs to ask the CPU for permission, and the CPU responds with a yes or no. If yes, the GPU requests the data, and the CPU begins fetching it and placing it into system RAM. The CPU then lets the GPU know when it has finished copying the data and asks if the GPU is ready for it. When the GPU responds with a yes, the CPU starts transferring the data from system RAM to VRAM. The ACK/NAK process generated by system components, like the NIC or GPU, generates a lot of interruptions for other things in the stack, as they quite literally interrupt them and take processing priority (which is why they are called system interrupts). The bigger the system and the more individual users within it, the more interrupts are generated.
Even if we ignore interruptions caused by requests, systems have lots of tiny background processes, be it data logging, verification, securing traffic, or encrypting and decrypting data as it is sent or received. System schedulers aren't that good at actually packing multiple requests together, so they are always left with a lot of dead space in the tasks sent to any one CPU core. The smaller the cores, the less dead space; the more small cores, the more tasks processed overall.
Granted, it's a trade-off, something always is, but the benefits often outweigh the negatives, especially for data centers, which is why ARM is rapidly expanding in highly transactional environments from the likes of Amazon, Google, and Meta.
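To put rough numbers on the dead-space argument, here is a toy model in Python. It is only a sketch under loud assumptions: the "silicon budget", core sizes, and task size are made-up units, every core runs exactly one task per scheduling slot (no packing), and every core is at least big enough for the task. None of it comes from a real scheduler or real hardware.

```python
# Toy model only: all names and numbers here are hypothetical illustrations.

def tasks_per_slot(silicon_budget: int, core_size: int, task_size: int) -> int:
    """Tasks finished per scheduling slot when each core runs one task per slot."""
    if core_size < task_size:
        # Outside the model: the core itself would be the bottleneck.
        raise ValueError("core must be at least as large as the task")
    # The scheduler does not pack tasks, so throughput equals the core count.
    return silicon_budget // core_size


def dead_space(core_size: int, task_size: int) -> float:
    """Fraction of one core's capacity left idle by a single small task."""
    if core_size < task_size:
        raise ValueError("core must be at least as large as the task")
    return 1 - task_size / core_size


if __name__ == "__main__":
    budget = 256  # arbitrary units of silicon to spend on cores
    task = 1      # one small fetch-style task
    for core in (8, 4, 2, 1):
        print(core, tasks_per_slot(budget, core, task), dead_space(core, task))
```

In this model the same budget spent on smaller cores finishes more small tasks per slot and wastes less of each core, which is the trade-off described above. It says nothing about heavy single-threaded work, where the assumption that the core is "big enough" stops holding.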

It's also why it makes sense for Nvidia: in their systems, the CPU is mostly running fetch requests for the GPU and networking hardware. It's not doing any of the heavy processing work; it's primarily handling data movement, encryption, and verification.
 
Well, as ARM's datacenter and enterprise presence grows, more software platforms get optimized for ARM. It's not an ISA issue, it's an optimization issue.
There are always going to be some problems you can't optimize your way out of, but that's why the other platforms aren't really going anywhere any time soon.
In fact, Amazon, Google, Meta, and the rest are putting significant work into optimizing systems for having different architectures coexist within a single platform. It won't be long before we see systems primarily run by ARM, but with x86 accelerator cards sitting right alongside the GPUs, and with tasks being processed by the most efficient option. For the record, that isn't always the fastest option, because electrical bills are dwarfing the other bills in most of these environments, and optimizing for those is the biggest priority.
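The "most efficient isn't always the fastest" point can be sketched in a few lines of Python. Everything here is hypothetical: the `Backend` class, the backend names, and the time and power figures are invented for illustration and are not measurements of any real part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    seconds: float  # assumed run time of the task on this backend
    watts: float    # assumed average power draw while running it

    @property
    def joules(self) -> float:
        # Energy used by the task = power x time.
        return self.seconds * self.watts


def fastest(backends):
    """Pick the backend that finishes the task soonest."""
    return min(backends, key=lambda b: b.seconds)


def most_efficient(backends):
    """Pick the backend that spends the least energy on the task."""
    return min(backends, key=lambda b: b.joules)


# Hypothetical figures, chosen only to show the two goals can disagree.
options = [
    Backend("arm_core", seconds=4.0, watts=5.0),
    Backend("x86_card", seconds=2.0, watts=30.0),
    Backend("gpu", seconds=1.0, watts=300.0),
]

if __name__ == "__main__":
    print("fastest:", fastest(options).name)
    print("most efficient:", most_efficient(options).name)
```

With these made-up numbers the GPU wins on time while the ARM core wins on energy, so a dispatcher that optimizes for the power bill would route the task differently from one that optimizes for latency.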
 
A GPU, for example, can't talk directly to system storage for data. The GPU needs to ask the CPU for permission, and the CPU responds with a yes or no. If yes, the GPU requests the data, and the CPU begins fetching it and placing it into system RAM. The CPU then lets the GPU know when it has finished copying the data and asks if the GPU is ready for it. When the GPU responds with a yes, the CPU starts transferring the data from system RAM to VRAM. The ACK/NAK process generated by system components, like the NIC or GPU, generates a lot of interruptions for other things in the stack, as they quite literally interrupt them and take processing priority (which is why they are called system interrupts). The bigger the system and the more individual users within it, the more interrupts are generated.
That seems quite ISA independent for the most part. It could well be true, but there often seems to be a piece missing: a "because of ARM's fixed instruction size, or the rule that you must put data in a register before manipulating it, advantage X, Y, or Z over x86 performance will always be there."

If Amazon, Nvidia, or Apple had the legal right to build a giant monolithic CPU that used the x86 ISA for their specialized use cases, I am not sure we would necessarily see much of a gap.
 
Nah, no need for them to build a monolithic CPU for x86 specialized tasks; Intel and AMD are far more likely to resume selling x86 chips on a PCIe card.
Trying to add multiple architectures into a single CPU package gets ugly, and you get a lot of duplication and complication for something pretty niche, so you just balloon your silicon footprint for an edge case scenario.
I am sure if a business case exists for such a CPU, then somebody will make it one day, so never say never, but I can't think of a scenario where it wouldn't be more efficient to route that work to another system or a dedicated accelerator card.
 

It's not like x86 is that general of a compute ISA either. There is a reason they have to keep extending it.
You may well be right in that x86 might be treated as an accelerator for legacy things. I'm sure that prospect scares the hell out of Intel and AMD.
I think that is the case, though; I don't see x86 really being all that important in a decade. All the highest performance x86 stuff server side is being powered by extensions like AVX. ARM has essentially the same hardware... SVE is in many ways superior software wise vs AVX. AMD's AVX implementation has been superior... that is changing though. One of the big changes Vera has made is how it handles SVE.
NV is also introducing Spatial Multithreading, essentially SMT, giving the 88 Olympus cores 176 threads.
Per their release: "a new type of multithreading that runs two hardware threads per core by physically partitioning resources instead of time-slicing, enabling a run-time tradeoff between performance and efficiency. This approach increases throughput and virtual CPU density while maintaining predictable performance and strong isolation"
Based on the description, it sounds like their implementation of SMT is a little different than SMT as we have seen it on x86. It sounds more dynamic... it will be interesting to see more in-depth testing when these end up in the wild.

IMO that is the crazy thing about the early testing Phoronix was able to do. There is still a lot of optimization coming for this platform. SVE optimizations are industry wide, but it seems to me a lot will be possible on this specific NV hardware. If they are able to match AMD and Intel for the most part currently, it's going to fall more their way 6 months and a year from now as the software gets tightened up. I think of Meta using things like the LAVD CPU scheduler... and I am pretty sure with user space Rust CPU schedulers like that you should be able to make the scheduler aware of their spatial multithreading stuff. Meta found huge improvements switching to LAVD for their x86 servers, as it was aware of specific latencies in cache, clusters, and CCDs. Meta wrote a paper on how they were able to tune the LAVD scheduler to hold jobs, or split jobs, based on very specific parameters on used cache space in clusters. I suspect it won't be long before the same type of work will be scheduling jobs around the NV chip's ability to split jobs, bypassing dumber time-slice type job splits. I could be wrong, but it sounds like this is going to allow a software CPU scheduler some very granular control over how jobs are split OR not split, or even exactly when to split conditionally. LAVD is already aware of a bunch of different ARM topologies; I will be paying attention to see if Nvidia's new version of SMT gets specifically patched in at some point.

https://developer.nvidia.com/blog/i...-platform-six-new-chips-one-ai-supercomputer/
Olympus is using a 6x 128b SVE2 FP8 implementation. In software, ARM SVE is vector-length agnostic: the same code can run on hardware vector lengths anywhere from 128 to 2048 bits. The flexible software vector length is really a much better solution than what they have on the x86 side at this point.
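For anyone who hasn't seen what a flexible vector length means in practice, here is a plain Python sketch of the idea. It is not SVE code and uses no real intrinsics; `vla_add` and its parameters are hypothetical names, and the lane-by-lane inner loop only mimics what a predicated vector instruction would do in one step.

```python
def vla_add(a, b, vector_bits, element_bits=32):
    """Element-wise add written once, for any hardware vector length.

    Assumes len(a) == len(b). vector_bits stands in for the length the
    hardware would report at run time.
    """
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 bits, 128 to 2048")
    lanes = vector_bits // element_bits  # elements processed per "instruction"
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        # Predicate: which lanes still fall inside the array. This handles
        # the tail without a separate scalar clean-up loop.
        active = [i + lane < n for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes  # step by however wide the hardware happens to be
    return out


if __name__ == "__main__":
    x = list(range(10))
    y = list(range(10, 20))
    for bits in (128, 256, 512, 2048):
        print(bits, vla_add(x, y, bits))
```

The loop never hard-codes a width, so the same function gives the same answer whether the "hardware" is 128 or 2048 bits wide. With a fixed-width extension, code is typically written or compiled separately for each width.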

Notice Nvidia even mentioned the Phoronix testing in their press release.
https://investor.nvidia.com/news/pr...-Unveils-Vera-the-CPU-for-Agents/default.aspx
 
This is why getting the fabs up and running for third parties is so critical for Intel; there is a very plausible reality where Intel's fabs are worth substantially more than the CPU and design side of things.
Intel and AMD can happily trade x86 clients back and forth every couple of generations, but every client lost to ARM is a client lost forever.

If a client is willing to make the transition over and rebuild their platform accordingly, they had a hell of a good reason to do so, and winning them back will equally need one hell of a good reason, and neither AMD nor Intel has managed to provide one yet.
 