Parsing JSON faster with Intel AVX-512 (lemire.me)
PragmaticPulp 654 days ago [-]
> Could we do better? Assuredly. There are many AVX-512 instructions that we are not using yet. We do not use ternary Boolean operations (vpternlog). We are not using the new powerful shuffle functions (e.g., vpermt2b). We have an example of coevolution: better hardware requires new software which, in turn, makes the hardware shine.

> Of course, to get these new benefits, you need recent Intel processors with adequate AVX-512 support

AVX-512 support can be confusing because it’s often referred to as a single instruction set.

AVX-512 is actually a large family of instructions that have different availability depending on the CPU. It’s not enough to say that a CPU has AVX-512 because it’s not a binary question. You have to know which AVX-512 instructions are supported on a particular CPU.

Wikipedia has a partial chart of AVX-512 support by CPU: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

Note that some instructions that are available in one generation of CPU can actually be unavailable (superseded, usually) in the next generation of the CPU. If you go deep enough into AVX-512 optimization, you essentially end up targeting a specific CPU for the code. This is not a big deal if you’re deploying software to 10,000 carefully controlled cloud servers with known specifications, but it makes general use and especially consumer use much harder.
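
(For illustration: on GCC or Clang you can probe the individual subsets at runtime with __builtin_cpu_supports. A minimal sketch; the feature-name strings are the ones those compilers document:)

    #include <cstdio>

    int main() {
        // Each AVX-512 subset has its own CPUID bit, so probe them one by one.
        std::printf("avx512f:     %d\n", __builtin_cpu_supports("avx512f") != 0);
        std::printf("avx512vbmi2: %d\n", __builtin_cpu_supports("avx512vbmi2") != 0);
        std::printf("avx512vnni:  %d\n", __builtin_cpu_supports("avx512vnni") != 0);
    }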

mikepurvis 654 days ago [-]
Are there good libraries for doing runtime feature detection? Eg, include three versions of hot function X in the binary, and have it seamlessly insert the correct function pointer at startup? Or have the function contain multiple bodies and just JMP to the correct block of code?

I know you can do this yourself, but last time I looked it was a heavily manual process— you had to basically define a plugin interface and dynamically load your selected implementation from a separate shared object. What are the barriers to having compilers able to be hinted into transparently generating multiple versions of key functions?

loeg 654 days ago [-]
GCC has offered Function Multiversioning for about a decade now (GCC ~4.8 or 4.9). GCC 6's resolver apparently uses CPUID to resolve the ifunc once at program start: https://lwn.net/Articles/691932/ .

Clang added it in 7.0.0: https://releases.llvm.org/7.0.0/tools/clang/docs/AttributeRe...

A nice presentation on it: https://llvm.org/devmtg/2014-10/Slides/Christopher-Function%...
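
For anyone who hasn't seen it, the whole mechanism is one attribute. A minimal sketch (target_clones is the GCC spelling; newer Clang accepts it too):

    #include <cstddef>

    // The compiler emits one body per listed target, plus an ifunc resolver
    // that runs CPUID once at load time and wires up the best version.
    __attribute__((target_clones("default", "avx2", "avx512f")))
    void add(float* dst, const float* a, const float* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = a[i] + b[i];
    }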

indygreg2 654 days ago [-]
While this IFUNC feature does exist and it is useful, when I performed binary analysis on every package in Ubuntu in January, I found that only ~11 distinct packages have IFUNCs. It certainly looks like this ELF feature is not really used much [in open source] outside of GNU toolchain-level software!

https://gregoryszorc.com/blog/2022/01/09/bulk-analyze-linux-...

burntsushi 654 days ago [-]
Do note that IFUNC is a convenience. You don't actually need to use IFUNCs to write target specific code that dispatches dynamically at runtime.

For example, ripgrep's dependencies dispatch dynamically at runtime by querying CPUID, but nothing uses GCC's "IFUNC" thingy. So it's likely that much more software is utilizing target specific code than not. Still, it's probably less than one would like.
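For illustration, the same pattern in C++ terms. A minimal sketch with no IFUNC anywhere, just a function pointer resolved on first call (assumes GCC/Clang for the builtin and attribute; the AVX-512 body is a stand-in, not ripgrep's actual code):

    #include <cstddef>

    static long sum_scalar(const int* p, std::size_t n) {
        long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }

    // Stand-in for a hand-vectorized version; a real one would use intrinsics.
    __attribute__((target("avx512f")))
    static long sum_avx512(const int* p, std::size_t n) {
        long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }

    long sum(const int* p, std::size_t n) {
        // CPUID is queried once; the winner is cached in a function pointer.
        static const auto impl =
            __builtin_cpu_supports("avx512f") ? sum_avx512 : sum_scalar;
        return impl(p, n);
    }
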

(I think this is less of a response to you and more of a response to this entire thread. It seems like some folks are conflating "IFUNC" with "all forms of dynamic dispatching based on CPUID.")

TkTech 654 days ago [-]
I've wanted to use them many times in the past, but the limited support on other compilers (looking at you MSVC) always made it a non-starter. If I have to support some other method of feature detection anyways, there's no point.
arthur2e5 654 days ago [-]
The way ifunc (well, actually language-level FMV) works in GCC and clang is that the input source code, not the command-line switches, specifies what ISA extensions to build for on a candidate function. This naturally means that packagers and other vendors are not using it as much as you hope: they would need to have a separate patch to add these attributes for each architecture and that’s just not maintainable.

Even Intel’s Clear Linux does not bother to patch individual libraries. They just use the glibc library multi-versioning feature to load from different directories depending on the cpuid.

In my opinion, GCC and Clang could make the whole thing more ergonomic. Ideally you declare a function as "interesting for multi-versioning" in the source using an attribute, then in the command line define what -march's to actually clone for. Kinda like ICC’s /Qax. (On second thought preprocessor defs are sufficient, duh.)
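
For reference, the source-level form being described looks like this (a sketch of explicit multi-versioning via the target attribute, which is exactly the part packagers would have to patch in per architecture):

    // Every version is pinned to an ISA extension in the source itself.
    __attribute__((target("default")))
    int popcount64(unsigned long long x) {
        int c = 0;
        while (x) { x &= x - 1; ++c; }  // portable fallback
        return c;
    }

    __attribute__((target("popcnt")))
    int popcount64(unsigned long long x) {
        return __builtin_popcountll(x);  // compiles to a single POPCNT
    }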

jeffbee 654 days ago [-]
The use of ifunc in libc, openssl, and zlib covers just about everything people want from it. You want that optimized memcpy, you only need it in one place.
matheusmoreira 654 days ago [-]
This is interesting. It looks like the multiversioned functions are resolved by the dynamic linker or loader. What happens if the binary uses static linking? Is there a way to resolve the functions generated by GCC ourselves? Can we specify their symbols?
bremac 654 days ago [-]
I'm unsure about library support, but gcc and clang support function multi-versioning (FMV), which resolves the function based on CPUID the first time the function is called.

This LWN article has some additional information: https://lwn.net/Articles/691932/

mikepurvis 654 days ago [-]
TIL! I guess it makes sense that popular numeric libraries like BLAS, Eigen, and so on would take advantage of this, but I wonder how widely used it is overall.
bee_rider 654 days ago [-]
For BLAS, it has enough legacy that the vendors just make their own libraries, so it would probably be best to use Intel MKL (which has its own internal dispatching, but I don’t think it uses FMV) if you are targeting Intel, and maybe AMD’s branch of BLIS if you are targeting AMD. I believe Eigen comes with its own version of BLAS built in, but it can be made to link to MKL at least, probably BLIS if you are really stubborn.
Scene_Cast2 654 days ago [-]
For vectorizable workloads, there's Halide. It can be set up in multiple ways, one of which is a JIT that auto-detects CPU capabilities and cache sizes (to better optimize scheduling).
colejohnson66 654 days ago [-]
Check out Agner Fog's vectorclass library: https://github.com/vectorclass/version2
robocat 654 days ago [-]
To add, they are using[2] the relatively recent VBMI2 instructions of AVX512. This article[1] talks about the advantages of VBMI on Ice Lake, released in 2021.

[1] https://www.singlestore.com/blog/a-programmers-perspective/ comments https://news.ycombinator.com/item?id=28179111

[2] https://news.ycombinator.com/item?id=31522464

paulmd 653 days ago [-]
> AVX-512 support can be confusing because it’s often referred to as a single instruction set.

This has been broadly overplayed. There are two main families: consumer AVX-512 and server AVX-512. Server gets some additional VNNI instructions and BFLOAT16 support, consumer gets VPOPCNTDQ and IFMA/VBMI. That's basically all you need to know.

The chart looks confusing in Wikipedia because it's ordered by date, not by "series" and generation, and some intrepid wikipedia editors have munged up the features even then. This is what the feature set looks like to me:

https://i.imgur.com/idAjB1X.png

Xeon Phi is sorta its own thing for sure, they did a weird 4-wide architecture and it got special 4-wide VNNI instructions to match. And since it's been abandoned it never was updated. But there's zero reason for average consumer/server software to target Xeon Phi anyway these days - it's abandoned and it was almost entirely a supercomputing/HPC product even when it wasn't.

The other instructions are (almost) a straight superset within their series, so you just need to know if you're targeting server or consumer. Laptop gets things a bit earlier because Intel started 10nm first on laptops, and 14nm was stuck with older architectures and then a backported architecture, so it's not quite a strict series in terms of dates, but generation-on-generation they haven't ever regressed a feature within a given "series".

The only exception is Alder Lake doesn't get anything because lol Intel - they disabled AVX-512 entirely, but if you got a model that supports it then it's also a superset of all previous consumer generations.

Again, all you really need to remember is, consumer gets better FMA and server gets VNNI/BFLOAT, that's the major difference. Big deal, who cares.

But "it's just strict supersets of features within server/client" wouldn't get the clicks/exasperated outcry of "gosh isn't this just an intractible mess!". It's not, and they've never regressed a feature within a series.

Also for the record, AVX2/AVX1/SSE/MMX look really shitty if you split them up in this same fashion. There were like 4 different "sets" of SSE4 feature sets alone and some of them were abandoned and never used again, and both AMD and Intel adopted them all at different dates and sometimes dropped them back out, so if you made a "feature set vs architecture" table it would be a giant fucking mess for SSE as well, let alone if you did it by year instead of by architectural series.

Yet I never hear anyone bring up how awful and how confusing AVX and SSE are, just AVX-512. Feature sets are messy!

https://en.wikipedia.org/wiki/SSE4

Eventually, AVX-512 support will converge around the sets of features that are useful, and some of the less-useful features may eventually be deprecated, just like the SSE4a instruction set was. It's gonna be fine.

skavi 654 days ago [-]
Hopefully we'll see AVX-512 in Intel's little cores soon. Centaur's last CPU architecture proves that it is possible to implement the extension without a huge amount of area [0]. Once that happens, I expect we'll finally consistently see AVX-512 on new Intel processors. The masks really are a huge improvement to the design.

AMD should be implementing AVX-512 on their own cores soon as well. Once Armv9 (with SVE2) becomes dominant, we'll pretty much be in a golden age of SIMD.

[0]: https://chipsandcheese.com/2022/04/30/examining-centaur-chas...

dragontamer 654 days ago [-]
> we'll pretty much be in a golden age of SIMD.

We already are in the golden age of SIMD. NVidia and AMD GPUs are easier and easier to program through standard interfaces.

Intel / AMD are pushing SIMD on a CPU, which is useful for sure, but always is going to be smaller in scope than a dedicated SIMD-processor like A100, 3060, AMD Vega, AMD 6800 xt and the like.

SIMD-on-a-CPU is useful because you can perform SIMD over the L1 cache as communication (rather than traversing L1 -> L2 -> L3 -> DDR4 / PCIe -> GPU VRAM -> GPU Registers -> SIMD, and back). But if you have a large-scale operation that can work SIMD, the GPU-traversal absolutely works and is commonly done.

skavi 654 days ago [-]
Good point. Should have clarified I was referring to CPU SIMD.
dragontamer 654 days ago [-]
AVX2 is not as good as AVX512. But AVX2 still has vgather instructions, pshufb, and a few other useful tricks.

AVX512 and ARM SVE2 bring the CPU up to parity with maybe 2010s-era GPUs or so (full gather/scatter, more permutation instructions, etc. etc.). But GPUs continued to evolve. Butterfly-shuffles are the generic any-to-any network building block, and are exposed in PTX (NVidia assembly) as shfl.bfly, and in AMD DPP (data-parallel primitives).

Having a richer set of lane-to-lane shuffles (especially ready-to-use butterfly networks) would be best. It really is surprising how many problems require those rich sets of data-movement instructions, or otherwise benefit from them.

NEON and SVE had hard-coded data-movement for specific applications. The general-purpose instruction (pshufb) is kinda like permute/shfl from AMD/NVidia. A backwards-permute IIRC doesn't exist yet on CPU-side.

And butterfly networks are the general-purpose solution, capable of implementing any arbitrary data-movement in just log(width) steps. (pshufb / permute instructions would be the full-sized butterfly network, but some cases might be "easier" and faster to execute with only a limited number of butterfly swaps, such as what inevitably comes up in sorting)

--------

Still, all of these operations can be implemented in AVX2 (albeit slower / less efficiently). So it's not like the "language" of AVX2 / AVX is incomplete... it's just missing a few general-purpose instructions that could lead to better performance.
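
(For the unfamiliar: pshufb is a byte-granularity table lookup inside the register, which is why it's the building block for so much of this. A tiny sketch, SSSE3 assumed:)

    #include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)
    #include <cstdint>
    #include <cstdio>

    int main() {
        alignas(16) uint8_t in[16], out[16];
        for (int i = 0; i < 16; ++i) in[i] = uint8_t(i);

        // Each index byte selects a source byte; this table reverses the lane.
        const __m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                          7, 6, 5, 4, 3, 2, 1, 0);
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(in));
        _mm_store_si128(reinterpret_cast<__m128i*>(out), _mm_shuffle_epi8(v, idx));

        for (int i = 0; i < 16; ++i) std::printf("%d ", out[i]);
        std::printf("\n");  // prints 15 14 13 ... 0
    }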

torginus 654 days ago [-]
I'm kinda torn on AVX-512 (and SIMD in general). On one hand, AVX-512 finally introduced a sane programming model with mask registers for branching code, which makes the lives of compilers much easier.

On the other hand, the tooling for turning high-level languages into SIMD code is not there yet; ISPC refuses to support ARM and is still kind of a novelty tool.

Additionally, 512-bit wide vectors are just too big - the resulting vector units take up too much die space even on big cores, and the power consumption causes said dies to downclock. It probably won't be viable on small cores.
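
(To make the mask-register point concrete, a sketch assuming AVX-512F and exactly one 16-float vector: the predicate lives in a k register and is folded into the arithmetic, with no separate blend:)

    #include <immintrin.h>

    // dst[i] += a[i], but only where a[i] > 0: the branch becomes a mask.
    void add_positive16(float* dst, const float* a) {
        __m512 va = _mm512_loadu_ps(a);
        __m512 vd = _mm512_loadu_ps(dst);
        __mmask16 m = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_GT_OQ);
        // Lanes where m is 0 simply keep their old value from vd.
        vd = _mm512_mask_add_ps(vd, m, vd, va);
        _mm512_storeu_ps(dst, vd);
    }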

dr_zoidberg 654 days ago [-]
> Additionally, 512-bit wide vectors are just too big - the resulting vector units take up too much die space even on big cores, and the power consumption causes issues causing said dies to downclock.

This is no longer true, citing [0]:

> At least, it means we need to adjust our mental model of the frequency related cost of AVX-512 instructions. Rather than the prior-generation verdict of “AVX-512 generally causes significant downclocking”, on these Ice Lake and Rocket Lake client chips we can say that AVX-512 causes insignificant (usually, none at all) license-based downclocking and I expect this to be true on other ICL and RKL client chips as well.

And we still have to see AMD's implementation of AVX512 on Zen4 to know what behavior and limits it may have (if any).

[0] https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...

jeffbee 654 days ago [-]
Considering that the execution units, register file, etc that support AVX-512 are themselves nearly as large as the entire Gracemont core ... don't hold your breath.
brigade 654 days ago [-]
You don’t need anything larger than the 128-bit ALUs or the 207x128-bit register file Gracemont already has to implement AVX-512. It doesn’t make sense on its own with that backend, but for ISA compatibility with a big core it does.
jeffbee 654 days ago [-]
I'm not sure that users would accept that. You could have a situation where an ifunc is resolved on a fast core with a slightly superior AVX-512 definition, but then the thread migrates to an efficiency core and the AVX-512 definition is dramatically slower than what could have been achieved with AVX2 (e.g. if a microcoded permute was 16x slower).
brigade 654 days ago [-]
Most reasonable IMO would be a hypothetical AVX-256 that was AVX-512 minus ZMM registers. Intel chose against that.

So the only reasonable options for a big little system are:

1. Not have a big little system. But little cores are quite useful for power efficiency in laptops…

2. Don’t support AVX-512 on any cores. Everything beyond AVX2 becomes a dead extension outside of server chips.

3. The little cores support AVX-512 as best as is reasonable. Then thread director can weight AVX-512 usage even heavier than it already weights AVX2.

Also, AVX2 performance on Alder Lake already differs enough between the cores that optimal implementations can be different.

jeffbee 654 days ago [-]
That makes sense and alludes to my feeling that the big-little thing has been a big mistake. If they come out with an all-big workstation CPU that enables AVX-512 then I am changing to that.
Dylan16807 654 days ago [-]
Can the shuffling instructions be reasonably efficient with a small ALU?
brigade 654 days ago [-]
Depends on what you consider reasonable. Worst case is 512-bit vpermi2*, which could be implemented with 16x 128-bit vpermi2-like uops, if the needed masking was implicit.

Which to me is reasonable for ISA compatibility. (Also considering that having to deal with ISA incompatibility across active cores is not reasonable at all.)

moonchild 654 days ago [-]
> having to deal with ISA incompatibility across active cores is not reasonable at all

If the OS would give me the tools to deal with it, I would find it completely reasonable to write (eg) both avx2 and avx512 kernels to be run concurrently on the same machine.

timerol 654 days ago [-]
> Of course, to get these new benefits, you need recent Intel processors with adequate AVX-512 support and, evidently, you also need relatively recent C++ processors. Some of the recent laptop-class Intel processors do not support AVX-512 but you should be fine if you rely on AWS and have big Intel nodes.

What is meant by "relatively recent C++ processors"? Is that supposed to be "compilers"?

Narishma 654 days ago [-]
It's supposed to be Intel, not C++.
bfrog 654 days ago [-]
At what point is JSON not the right option? Surely when trying to do this sort of thing?

At what point is it saner to use something like flatbuffers or capnproto style message encoding instead?

avg_dev 654 days ago [-]
Good thought. If you are coding in C++ maybe you can use some sort of binary serialization thing. Even in other languages, if JSON parsing is a bottleneck, it can possibly be optimized away through use of a binary wire format. That said, vector operations being available to programmers is always a welcome thing, I’d say. And who knows how much production JSON parsing this library really does; it could be a ton.

I’m torn. I’ve worked at shops where we aim over time to reduce response time while serving business logic and using statistical models that get iterated on. Even there I haven’t seen a blatant need for non-JSON RPC. But I know my experience doesn’t mirror everyone’s. And I like seeing and learning about instruction sets. I’m currently taking a course in parallel computing, and I just used AVX2 for the first time in a toy program to subtract one vector from another in a single instruction, which, while not particularly useful, is a window into more interesting things and is still SIMD.
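
(The toy program boiled down to roughly this; a sketch assuming AVX2 and unaligned loads:)

    #include <immintrin.h>

    // Subtract eight 32-bit ints at once: c = a - b in one vector instruction.
    void sub8(const int* a, const int* b, int* c) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(c),
                            _mm256_sub_epi32(va, vb));
    }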

I think on the whole, making JSON parsing faster for a large enough fraction of processors is probably a huge win for the environment. But who is parsing JSON in C++?

ollien 654 days ago [-]
> But who is parsing json in C++?

Well, Facebook for one! Folly has lots of utilities for this (see folly::dynamic[1]). We make extensive use of this at my (non-Facebook) job.

[1] https://github.com/facebook/folly/blob/master/folly/docs/Dyn...

beached_whale 654 days ago [-]
A lot. I know of a few people using my library who are in finance doing some cryptocurrency stuff, and the feeds are JSON and/or done via JSON-RPC, I think. I didn't really get into the details. But even then, doing 2ish GB/s isn't that bad.

But people seem to think of C++ as only a systems language, ignoring the decades of desktop usage and existing software. Plus, C++ web services use small fractions of the resources that systems like Node use. In some examples I have done, a C++ web service used about 3MB of RAM, plus usage depending on the size of the request, whereas Node started at 300MB. With cloud costs, that's a lot more bang for the buck.

But JSON is the lingua franca of networked I/O these days; you need to interop.
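
For reference, the consumer side of the article's library is only a few lines. A sketch adapted from simdjson's quick start, with "twitter.json" as a stand-in input file:

    #include "simdjson.h"
    #include <cstdint>
    #include <iostream>

    int main() {
        simdjson::ondemand::parser parser;
        simdjson::padded_string json = simdjson::padded_string::load("twitter.json");
        simdjson::ondemand::document tweets = parser.iterate(json);
        // The fastest kernel the CPU supports (AVX-512 included) is chosen at runtime.
        std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results\n";
    }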

neurostimulant 654 days ago [-]
GTA Online used a very slow C++ JSON parser for years, which struggled with parsing their JSON file (which had grown to 10MB over the years). The load time was pretty bad; in some cases it could be more than 15 minutes. Fixing it actually cut startup time by 70%.

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

avg_dev 654 days ago [-]
I just read this; I remember reading it a while back. It's a good article. Approachable and actually achieves its goal. Thanks for the link.

> Original online mode load time: ~6m flat

> Time with only duplication check patch: 4m 30s

> Time with only JSON parser patch: 2m 50s

> Time with both issues patched: 1m 50s

>

> (6 * 60 - (1 * 60+50)) / (6 * 60) = 69.4% load time improvement (nice!)

I would have loved to see someone swap out the parser for simdjson, especially without access to the source code. That would be amazing. Maybe the 1m50s can be beaten.

Just thinking about that core heating up for 6 minutes, one can see that this sort of code improvement helps reduce electricity usage, heat generated, and so on. It's just less computation. I see now that people need to parse JSON in C++ as they would in any other language, and someone has to do it, and it should be done as fast and correctly as possible. It's a different issue from defining a performant wire protocol with good developer ergonomics (which, to be frank, I am not sure why my brain even went there to begin with; I had protobuf on the brain).

Makes sense.

refulgentis 654 days ago [-]
Probably a lot of people. JSON is a pretty popular interchange format.
smabie 654 days ago [-]
Often you do not get the choice of whether you want to be parsing json or not.
vardump 654 days ago [-]
Sometimes you just don't have a choice when you need to interface with a third party data feed or software.

Isn't it better to have all options open?

jeffbee 654 days ago [-]
Everyone who offers a JSON API because their unsophisticated customers demand it has reason to parse those requests in C++ as soon as they hit any kind of scale.
Andys 654 days ago [-]
I was evaluating different JSON-like formats and found that JSON parsing can be extremely fast and quite competitive with static binary formats.
beached_whale 654 days ago [-]
If only Intel wasn't dropping support for it on a lot of CPUs.
nomel 654 days ago [-]
Well, if it means more cores, it's almost certainly worth it, in the grand scheme of things.
jrimbault 654 days ago [-]
I'm using this comment as a jumping-off point.

What's the cost/opportunity of optimizing for a specific platform/instruction set? At what point is it worth doing, and when isn't it? AVX-512 strikes me as something... "ephemeral".

throwaway92394 654 days ago [-]
Well, I mean, this article is demoing a 28% improvement (if I did my math right) for JSON parsing.

Sure, AVX-512 is only applicable to specific workloads, and even for many of those workloads the cost/opportunity of optimizing for AVX-512 might not be worth it. But there clearly ARE use cases that would benefit, and it might be worth it for more consumer applications to optimize for AVX-512 - but only if it can be used.

The way I see it is that the benefit of optimizing for AVX-512 is far higher if it becomes normal for consumer CPUs to have it. A 28% improvement is pretty decent, but it's only worth implementing if enough people can utilize it.

beached_whale 654 days ago [-]
For many, maybe not, but when writing the foundations of software it is good to start fast. There are libraries that abstract various SIMD architectures now, too. simdjson has its own, and there are ones like KUMI.
skavi 654 days ago [-]
Writing code for SIMD can get you absolutely massive performance improvements. Whether it's worth the added complexity depends on the situation. If your data is already arranged in a cache-friendly way (SoA), it shouldn't be incredibly difficult to use SIMD intrinsics to optimize. I'd first take a look at what the compiler is already generating for you to see if manual intervention is worth it.
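
(SoA meaning roughly this, for anyone unfamiliar; a hypothetical particle example where the SoA form gives each field a contiguous, directly loadable array:)

    #include <vector>

    // AoS: fields interleaved per particle; SIMD needs gathers or shuffles.
    struct ParticleAoS { float x, y, z, mass; };

    // SoA: one contiguous array per field; plain vector loads just work.
    struct ParticlesSoA {
        std::vector<float> x, y, z, mass;
    };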
jcranmer 654 days ago [-]
Intel is not dropping support for it on a lot of CPUs.

The only thing they've done is disable it in the hybrid Alder Lake cores, presumably because the E-cores couldn't support it (while the P-cores could), and they didn't want to deal with the headaches of ISA extensions being supported only on some cores in the system.

coder543 654 days ago [-]
> The only thing they've done is disable it in the hybrid Alder Lake cores

That is incorrect. You can buy Alder Lake CPUs that only have one type of core (the i3 series only has P-cores, for example), and those do not support AVX-512 either. They're not "hybrid" in any way.

Some of their motherboard partners initially allowed you to access AVX-512, but Intel has put a stop to this and the feature is disabled on all Alder Lake CPU SKUs, period.

Newer Alder Lake chips have AVX-512 fused off in silicon, if the firmware blob disabling it isn’t enough: https://www.tomshardware.com/news/intel-nukes-alder-lake-avx...

> Intel is not dropping support for it on a lot of CPUs.

That seems like a pretty questionable statement. Intel might keep AVX-512 around for Xeon, but it seems extremely dead on the consumer market. If Intel decides to bring it back for the next generation, that would be strange and very poor planning.

g42gregory 654 days ago [-]
In the consumer market, Intel supported AVX-512 only on the -X HEDT processors. There is a rumor that either Alder Lake-X is coming out or it will be folded into the Xeon-W workstation line. It will support AVX-512.
coder543 654 days ago [-]
I’m fairly sure the bog-standard consumer Rocket Lake processors had AVX-512.
gpderetta 654 days ago [-]
It seems likely that the reason is that some Intel customers are willing to pay a significant premium for the feature, and Intel doesn't want it to be available for cheap.
arthur2e5 654 days ago [-]
If what they’ve done with ECC memory is any indication, this sort of market segmentation would only slow down or even kill adoption of AVX-512 in the consumer sphere. Welp, at least there’s no reliability issue here, so Torvalds is unlikely to get angry anytime soon.
gpderetta 654 days ago [-]
Probably Intel doesn't care about AVX-512 in the consumer sphere.
Aardwolf 654 days ago [-]
> Intel is not dropping support for it on a lot of CPUs.

There are 0 current-generation consumer CPUs from either Intel or AMD that have it

> The only thing they've done is disable it in the hybrid Alder Lake cores

Which happen to be all the current generation Intel CPUs

beached_whale 654 days ago [-]
Ah, headlines foiled me. I read it as disabled in Alder Lake altogether.
temac 654 days ago [-]
It is disabled in all consumer Alder Lake (and I don't remember if there will be Xeons of that gen with P-cores only -- IIRC Intel stopped the AVX-512 validation late on those cores, but it was still before it was formally finished, so probably not). At one point it worked with some BIOSes on P-core-only chips, or if you disabled the E-cores on hybrid ones, but with up-to-date Intel microcode it does not work anymore.
pabs3 654 days ago [-]
I wonder if simdjson is going to replace the native JSON parsing in various languages at some point.
ec109685 654 days ago [-]
One of the most “10,000 foot” unintuitive things is that a text format like JSON is faster to parse than a binary protocol like protobufs.
winrid 654 days ago [-]
If you apply the same optimizations to the binary protocol it will be much faster than JSON.

Even comparing common deserialization libraries like gson and thrift, thrift is faster despite being much older than protobufs.

ec109685 652 days ago [-]
Good point. I was comparing to protobufs, which is slower to parse than JSON. Other formats are faster.
jxi 654 days ago [-]
Is it actually faster?
NegativeLatency 654 days ago [-]
"new" is relative since they've been out for almost 10 years: https://www.intel.com/content/www/us/en/developer/articles/t...
jeffbee 654 days ago [-]
This code uses VBMI2, which came out quite recently.
worewood 654 days ago [-]
Using specialized instructions does not always turn into performance improvements. Processors are pretty smart these days, and the generated µops may be the same.
incrudible 654 days ago [-]
In my experience, even some of the simplest loops that a compiler can trivially vectorize do get a noteworthy speedup from using SIMD instructions.
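
(The canonical example of such a loop, as a sketch; build with something like -O3 -march=native and inspect the assembly to see whether it was vectorized:)

    #include <cstddef>

    // Independent iterations, unit stride, no aliasing (promised via restrict):
    // exactly the shape auto-vectorizers are built for.
    void saxpy(float* __restrict y, const float* __restrict x,
               float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }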
Const-me 653 days ago [-]
Upgrading from scalar to SSE vector, I often see an improvement by a large factor like 3-4. Sometimes, for smaller integer lanes or bytes, much more.

When upgrading vectors from SSE to AVX, the speedup is very rarely by a factor of 2. More often it's within 30-70%.

I never programmed AVX512, but for real-life code I would expect the AVX512/AVX2 performance improvement to be much closer to AVX/SSE than to SSE/scalar.

incrudible 653 days ago [-]
There is definitely an argument to be made regarding uops: just because the vector width of the interface doubled, the hardware resources need not have doubled as well. AVX512 originally came from KNL accelerators, which were designed for more specific purposes.
Const-me 653 days ago [-]
I agree. And these resources are not even limited to execution units inside cores.

Another one is bandwidth. On computers with dual-channel DDR4, aligned SSE vectors are delivered from main memory with a single transaction. For ideal scaling from SSE, AVX512 would need octa-channel DDR4 and 4x as much cache.

Another one is waste heat. On modern processors, sustained performance is often limited by thermals. For ideal scaling, AVX512 would need twice as much cooling compared to AVX.
