I actually looked at this in detail about a year ago for some automated driving compute work at my previous job, and I found that the detailed info you'd want from Nvidia was just 100% unavailable. There are pretty good proxies in some of the data you can get out of Nvidia tools, and there's some extra info you can glean from the function call stack in the open source Nvidia driver shim layer (because the actual main components are still a binary blob, even with the "open source" driver), but overall you still can't get much useful info out.
Now that Brendan works for Intel, he can get a lot of this info from the much more open source Intel GPU driver, but that's only so useful since everyone is either Nvidia or AMD still. The more hopeful sign is that a lot of the major customers of Nvidia are going to start demanding this sort of access and there's a real chance that AMD's more accessible driver starts documenting what to actually look at, which will create the market competition to fill this space. In the meantime, take a look at the flamegraph capabilities in PyTorch and similar frameworks, up an abstraction level, and eke out what performance you can.
ryao 55 days ago [-]
I just sent the link to a driver developer at Nvidia. If he shares the link with others at Nvidia, they should become aware of the idea tomorrow. That said, I have no idea if he will do that, but at least I tried.
sleepybrett 55 days ago [-]
Are they interested in you optimizing your workloads or just selling you more gpus to help you get to market faster...
saagarjha 55 days ago [-]
It is in Nvidia's interest that their cards have better developer experience and cost less to run than their competitors.
wcunning 55 days ago [-]
The problem is that CUDA already does that, and they're not incentivized to improve much beyond that baseline, given the capability and ease of use of the ROCm or Intel solutions.
yanniszark 55 days ago [-]
I'm not sure, it seems to me like this should be doable on Nvidia as well. This is a paper that uses instruction sampling (exposed through CUPTI) on Nvidia GPUs to provide optimization advice:
It seems like the instruction sampler is there, and it also provides the stall reason.
wcunning 55 days ago [-]
The issue there is that that info is what Nvidia chooses to port out from the on-chip execution. Most of what we can do for observation is in the kernel driver space and not really on-chip or even low level transit to the chip. One of the other commenters pointed out that you can get huge benefits from avoiding busy waiting on the returned data from the chip, which makes total sense, but also increases latency, which didn't work for my near-realtime use case when I was investigating. Other than those types of low hanging fruit where you can accept a little latency for better power state management, it's hard to find low level optimizations specifically for Nvidia through the closed source parts of the CUDA stack or through the driver transit to chip when those are intentionally hidden.
A while ago, I read a paper on dissecting the Nvidia architecture using very specifically tuned microbenchmarking to understand things like cache structure on chip and the like [0]. Unfortunately, no one has done this for seriously in use, recent architectures, so it's hard to use this info today. Similarly, there isn't an eBPF VM running on the chip to summarize all of this and the Nvidia tools aren't intended to make this kind of info easy to get, probably specifically because of this paper...
> Imagine halving the resource costs of AI and what that could mean for the planet and the industry -- based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030.
Why would it be the case that reducing the costs of AI reduces power consumption, as opposed to increasing AI usage (or another application using electricity)? I would think with cheaper AI its usage would become more ubiquitous: LLMs in fridges, toasters, smart alarms, etc.
For example, food got cheaper and consumption has increased to the extent that obesity is a major problem, but this is much less than you might conclude from the degree to which productivity has increased per farmer.
For image generation, the energy needed to create an image is rapidly approaching the energy cost of a human noticing that they've seen an image — once it gets cheap enough (and good enough) to have it replace game rendering engines, we can't really spend meaningfully more on it.
(Probably. By that point they may be good enough to be trainers for other AI, or we might not need any better AI — impossible to know at this point).
For text generation, difficult to tell because e.g. source code and legal code have a lot of text.
_heimdall 56 days ago [-]
Food may be a bit of an outlier, the number of consumers won't change quickly in response and each person can only eat so much.
When it comes to converting electricity into images and text, there really is no upper bound in sight. People are happy to load the internet up with as much content as they can churn out.
wongarsu 55 days ago [-]
If we assume that text and images are made for human consumption then there is a limit in how much we can consume. In fact I doubt there is much room for our society's per-person media consumption to increase. There is obviously room for growth in fewer people seeing the same content, and room for some "waste" (i.e. content nobody ever sees). The upper bound (ignoring waste) would be if everybody only saw and read content that nobody else has ever seen and will ever see. But if we assume society continues to function as it does the real limit will be a lot lower.
Now maybe waste is a bigger issue with content than with food. I'm not sure. Both have some nonzero cost to waste. It might depend on how content is distributed
ben_w 55 days ago [-]
Mm.
I'd say that text is capable of being extremely useful even when no human reads it, because of source code, maths proofs, etc.
But I'm curious: 238 wpm / 0.75 words per token * 16 (waking) hours per day * 83 years * $10.00 / 1M output tokens (current API cost for 4o without batching) means the current cost of making as many tokens as a human can read in a lifetime is $92,300: https://www.wolframalpha.com/input?i=238+words+per+minute+%2...
With these numbers, a well-written project with even a billion lines of code would be a rounding error even if only a thousand people used any specific such software and none of that was ever shared with what other people wanted to get done.
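The lifetime-reading estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the inputs are the ones quoted in the comment, the reading rate is divided by 0.75 words per token (that is the reading that reproduces the quoted figure), and 365.25 days per year is my own assumption since the comment does not state one.

```python
# Back-of-envelope check: cost of generating as many tokens as one
# person could read in a lifetime, using the figures quoted above.
WORDS_PER_MINUTE = 238          # reading speed from the comment
WORDS_PER_TOKEN = 0.75          # conversion factor from the comment
WAKING_HOURS_PER_DAY = 16
DAYS_PER_YEAR = 365.25          # assumption, not stated in the comment
YEARS = 83
USD_PER_MILLION_TOKENS = 10.00  # quoted output-token price


def lifetime_reading_tokens() -> float:
    """Tokens a person could read while awake for YEARS years."""
    tokens_per_minute = WORDS_PER_MINUTE / WORDS_PER_TOKEN
    minutes = 60 * WAKING_HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
    return tokens_per_minute * minutes


def lifetime_reading_cost_usd() -> float:
    """Cost of generating that many output tokens at the quoted price."""
    return lifetime_reading_tokens() / 1_000_000 * USD_PER_MILLION_TOKENS


if __name__ == "__main__":
    print(f"tokens: {lifetime_reading_tokens():.3e}")
    print(f"cost:   ${lifetime_reading_cost_usd():,.0f}")
```

Multiplying by 0.75 instead of dividing would give a noticeably lower figure, so the quoted total only works with the division.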
_heimdall 55 days ago [-]
It's an interesting question for sure. Anecdotally it seems to me like there's a ton of content thrown online that is rarely, if ever, consumed. From bot-generated blog posts to social media posts, surely some of it is never seen or viewed only a few times before it gets buried and never seen again.
Market dynamics should push people to stop generating that content if they don't get enough value to justify the cost. In practice, though, it hasn't seemed to happen yet and we must be past a threshold where there's more content created online than we could ever value.
It'd make for an interesting study, but short of having verifiable data I have to assume we'll continue increasing the rate at which content is created whether the value is there or not.
UltraSane 55 days ago [-]
Yandex image search works really well at finding similar images but it also leads you to some very strange parts of the internet that are exactly what you are describing: bot generated pages that almost no one reads.
HappMacDonald 53 days ago [-]
The simplest example of what you describe regarding fully bespoke content would be realtime generation of VR feeds. Of course even in VR people would be consuming still more 2D content: the environments are built out of 3d models textured by 2d content at higher resolutions than most viewers will ever closely inspect.
You'd most likely categorize all of the unseen textures or higher-than-needed resolution in your "waste" bucket, and I can't argue with that. But VR still clearly means that there is at least theoretical room for "realtime video generated custom for every viewer, which in turn is composed of even more content sources".
ben_w 56 days ago [-]
You don't see an end at the level of everyone having it at 60fps (or so) in each eye?
_heimdall 56 days ago [-]
I'm not quite sure what you mean, 60fps would have something to do with output displays but nothing to do with the content. There's no upper bound to how much content people would have LLMs make, whether that content is being consumed on cell phone screens or some kind of in-eye display.
ben_w 55 days ago [-]
If you generate a new image 60 times per second, that's reasonably described as "60 fps", this is how the output of video game engines has been described for at least 25 years*.
If everyone's doing that all day every day on each eye, that's a reasonable guess of an upper bound: you as a human cannot actually consume more even if you make it.
GANs can already do that speed, but any given GAN is a specialist AI and not a general model; diffusion models are general, but they're slower (best generation speed I've seen is 4-5 frames per second on unknown hardware). LLMs aren't really suited to doing images at all, but can control other models (this is what ChatGPT does when "it makes an image" — it calls out to DALL•E).
* how long I've been paying attention to that, not a detailed historical analysis
_heimdall 55 days ago [-]
Sure, I suppose you could calculate a limit by looking at how many human eyes there are, how many frames per second they can see, and max resolution visible. That still isn't actually a limit on how many images could be made, only how many could be consumed.
That said, if we got to such a massive scale I'd expect us to hit other limits first (electricity available, best produced, storage space, network transmission, etc.).
Or did I totally misunderstand your example here? I may have misread it completely, if so sorry about that!
ben_w 55 days ago [-]
> Sure, I suppose you could calculate a limit by looking at how many human eyes there are, how many frames per second they can see, and max resolution visible. That still isn't actually a limit on how many images could be made, only how many could be consumed.
Sure, absolutely. But I can say the same of food, which is why I drew the analogy between them previously.
> That said, if we got to such a massive scale I'd expect us to hit other limits first (electricity available, best produced, storage space, network transmission, etc.).
Difficult to guess when the quality isn't yet at the right threshold: GANs are already this speed on phone hardware*, so we're not bounded on that specific combination with available electrical energy; on the other hand, 2 years ago I was seeing images for about 3 kJ, which is in the region of hundreds of kilowatts for 2 eyes at 60 fps, which is absolutely going to be a problem… if they were limited to that hardware and with that model (though both are moving targets, I'd be very surprised if the unknown hardware that I've seen doing 4-5 fps was burning 12-15 kW, but it's not strictly speaking impossible it really was that power hungry).
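The power figures in that comment follow from multiplying energy per image by frame rate. A minimal sketch of the arithmetic, using only the numbers quoted above (3 kJ per image, 60 fps per eye, and the 4-5 fps case); the function name is just for illustration:

```python
def continuous_power_watts(joules_per_image: float,
                           frames_per_second: float,
                           eyes: int = 1) -> float:
    """Sustained power needed to generate images at a given rate.

    Energy per image (J) times images per second gives watts.
    """
    return joules_per_image * frames_per_second * eyes


if __name__ == "__main__":
    # 3 kJ per image, 60 fps, two eyes: 360 kW ("hundreds of kilowatts").
    print(continuous_power_watts(3_000, 60, eyes=2) / 1_000, "kW")
    # The same 3 kJ per image at 4 and 5 fps: 12 kW and 15 kW.
    print(continuous_power_watts(3_000, 4) / 1_000, "kW")
    print(continuous_power_watts(3_000, 5) / 1_000, "kW")
```

This assumes the per-image energy stays constant regardless of frame rate, which is a simplification; batching or newer hardware would change the per-image cost.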
The ultimate end-game for image-gen AI is a closed-loop system where a computer can monitor sexual arousal levels and generate the most arousing porn possible for the subject. This would be VERY addictive. Unless people can just become completely immune to all pornographic stimuli.
ben_w 55 days ago [-]
I'd say the Matrix rather than that; most of us have a refractory period where that would at best do nothing and at worst be actively undesirable.
debugnik 55 days ago [-]
You're assuming people will create content to consume it, and not just to spam various platforms, competing for attention. Most of it might only be ever consumed by crawlers, if at all.
AcerbicZero 55 days ago [-]
I think you're missing the broader analogy here; Cheap LLMs == LLMs everywhere. Cheap food == People everywhere.
I'm no Malthusian, but the paradox holds here pretty well.
ben_w 55 days ago [-]
The population indeed went up, and at the same time the fertility rate is declining. What Malthus was expecting is that more food would just lead to more people on the knife-edge of famine, and we're wildly far from that in most of the world. (What is paradoxical is that the USA is simultaneously very rich, has high obesity, and somehow manages to also have a huge problem with kids going hungry).
The very specific point I'm claiming is that the increased consumption isn't always unbounded.
hhdhdbdb 55 days ago [-]
Why is fertility declining? I posit we are hitting non-food constraints. Political ones. Land use constraints. If you build millions of homes fertility will go up.
HappMacDonald 53 days ago [-]
In wealthier, modern economies:
* More women work and invest in their own education, so fewer spend their time at home as they might in poorer countries, where that arrangement facilitates giving birth and investing time in childcare.
* More men and women derive their primary income from work that children cannot easily participate in, e.g. office work or work-from-home computer work, as opposed to farming or working with one's hands. In many poorer countries it is common practice to have more children at least partially to bolster the labor force around the house.
* Wealthier nations have better access to family planning: contraception, abortion, pastimes that can meaningfully compete against getting laid in the first place.
I'd assume environmental, but there's also more subtle answers than will fit in a comment box — whatever the cause, it has to be near-global.
China's building loads more houses, still has a fertility decline.
shadowmnifold 54 days ago [-]
Surely, the reasons are multivariate with all kinds of interactions and feedback mechanisms between the variables.
It is really a good example of what natural dimension reducers we are, even when we know it makes no sense. It is like we can't help but reduce things to one explanatory variable.
My favorite is the news headline "The market went up today because of X".
hhdhdbdb 53 days ago [-]
They never say that.
They say: Tesla shares up as revelations surface that the wind is blowing east.
hhdhdbdb 54 days ago [-]
Yes I forgot to mention the implied: Homes, that meet code, with connected utilities in places people want to live that are not being landbanked.
AcerbicZero 55 days ago [-]
The fertility rate trends are missing the core point here. Your obesity and hunger examples actually reinforce the Jevons paradox - when a resource becomes cheap enough, we find ways to use it even beyond what seems rational. But more importantly, you're still not getting the original Malthusian comparison: Malthus wasn't predicting that cheaper food would make people eat more (obesity) - he was predicting that cheaper food would lead to more total people. Similarly, cheaper AI won't just make individual AIs consume more - it means AI will be deployed everywhere possible. The parallel is about multiplication of instances, not increased individual consumption.
skybrian 55 days ago [-]
Image generation isn't cheap enough until we have sites that work like Google Image search, filling the page with image variations nearly instantly and available for free.
Also TIL this is generated at 20 frames per second, the best I've used myself was "only" 4-5; does anyone know the performance and power consumption of a Google TPU?
hhdhdbdb 55 days ago [-]
Bitcoin is a pure example that shows the limit to energy consumption is how much money people have to throw at it. And if that money is thrown into generating more energy, it is a cycle. There are no stomach-size or human-reproduction constraints. We can waste power as quickly as we can generate more.
The only hope is to generate this power greenly.
ben_w 55 days ago [-]
The existence of examples where it happens by design does not say anything either way about if it must happen all the time.
hhdhdbdb 55 days ago [-]
Yeah I am not saying all the time, but I am saying when it happens it can be less bounded than "human population growth in the early 21st century."
esafak 56 days ago [-]
It's possible to decrease costs faster than usage can rise.
Ensorceled 55 days ago [-]
Insulation, double glazed windows and other improvements in reducing heating and cooling costs in houses resulted in houses doubling in size.
Increasing fuel economy resulted in many more cars being replaced by SUVs.
AI usage will definitely increase to fill the space.
airstrike 56 days ago [-]
You specifically picked things like toasters and fridges which seem like frivolous if not entirely useless applications of LLMs.
But you can be more charitable and imagine more productive uses of AI on the edge that are impossible today. Those uses would presumably create some value, so if by reducing AI energy costs by 90% we get all the AI usage we have today plus those new uses that aren't currently viable, it's a better bang for buck.
ithkuil 55 days ago [-]
AI will be useful with toasters and fridges, but of course that doesn't mean it will have to run on the devices themselves.
derektank 55 days ago [-]
I actually think that fridges with image recognition would be a value add depending on the price. Could evaluate whether or not your food has spoiled, queue up a list of items to purchase, etc.
spockz 55 days ago [-]
Maybe for larger kitchens/restaurants. But for residential use I think it would only serve to further distance the human from nature with all subsequent drawbacks.
workflowsauce 55 days ago [-]
Fridge snake that crawls through the fridge and maps out the food
lodovic 56 days ago [-]
I had the same thought - power use will not be halved, usage will double instead.
theptip 56 days ago [-]
The answer depends on what is rate-limiting growth; while we are supply-constrained on GPUs you can’t just increase AI usage.
The next bottleneck will be datacenter power interconnects, but in that scenario as you say you can expect power usage to expand to fill the supply gap if there is a perf win.
layer8 55 days ago [-]
That depends on whether AI cost is dominated by power consumption cost [0]. I don’t think it is?
[0] For inference, that is. Training is another matter, and energy consumption for hardware manufacturing yet another.
xnx 56 days ago [-]
> Imagine halving the resource costs of AI and what that could mean for the planet and the industry
Google has done this:
"In eighteen months, we reduced costs by more than 90% for these queries through hardware, engineering, and technical breakthroughs, while doubling the size of our custom Gemini model." https://blog.google/inside-google/message-ceo/alphabet-earni...
moffkalast 55 days ago [-]
That would be notable... if anyone was actually using Gemini.
xnx 55 days ago [-]
People who don't are missing out. I get perfect JSON formatted responses to my prompts for pennies.
moffkalast 55 days ago [-]
Even Llama 3.1 can give you perfect JSON formatted responses for free these days. Also you really ought to be using yaml instead, you save 30% on tokens.
Tried the Gemini Advanced trial last week. For some reason their so-called 1M context model is limited to 10 files at a time, so you can't upload a codebase for it to reference, and even with the extra data the end result is somehow worse than both Sonnet and 4o without much given context at all. It's definitely not on the same level as a coding assistant at least.
htrp 56 days ago [-]
rephrased as "We took compute from everything else.... and gave it to AI"
dan-robertson 55 days ago [-]
Being able to ‘connect’ call stacks between python, c++, and the gpu/accelerator seems useful.
I wonder if this pushes a bit much towards flamegraphs specifically. They were an innovation when they were first invented and the alternatives were things like perf report, but now I think they’re more one tool among many. In particular, I think many people who are serious about performance often reach for things like pprof for statistical profiles and various tracing and trace-visualisation tools for more fine-grained information (things like bpftrace, systemtap, or custom instrumentation on the recording side and perfetto or the many game-development oriented tools on the visualisation (and sometimes instrumentation) side).
I was particularly surprised by the statement about intel’s engineers not knowing what to do with the flamegraphs. I read it as them already having tools that are better suited to their particular needs, because I think the alternative has to be that they are incompetent or, at best, not thinking about performance at all.
Lots of performance measuring on Linux is done through the perf subsystem and Intel have made a lot of contributions to make it good. Similarly, Intel have added hardware features that are useful for measuring and improving performance – an area where their chips have features that, at least on chips I’ve used, easily beat AMD’s offerings. This kind of plumbing is important and useful, and I guess the flamegraphs demonstrate that the plumbing was done.
stefan_ 55 days ago [-]
It's a bit weird, very much a "software optimization" approach. But looking at the flame graph, you couldn't tell a model running in FP32 from one in INT8, taking 3x the time and energy.
bornfreddy 55 days ago [-]
And? This is an information trivially obtainable in a different way (e.g. using a stopwatch), while flamegraphs visualise where that time was spent, helping us to determine the parts that need to be optimised.
kevg123 56 days ago [-]
> based on Intel EU stall profiling for hardware profiling
For each function you know how much CPU is spent in the function itself, as opposed to child calls. All in a simple text file without the need for constantly scrolling, panning, and enlarging to get the information you need.
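As a rough illustration of the "self time versus child calls in a plain text file" idea: the sketch below assumes the profile is in the folded-stack text format commonly fed to flame graph tools (one line per stack, frames joined by ";", followed by a sample count). The function names and sample data are made up for the example; the comment above does not say which tool or format it has in mind.

```python
from collections import Counter


def self_and_total(folded_lines):
    """Return (self_samples, total_samples) per function.

    Self samples are counted only for the leaf frame of each stack;
    total samples are counted for every frame that appears in the stack.
    """
    self_samples = Counter()
    total_samples = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        frames = stack.split(";")
        self_samples[frames[-1]] += count
        # set() so a recursive function is not counted twice per stack
        for frame in set(frames):
            total_samples[frame] += count
    return self_samples, total_samples


# Hypothetical profile, purely for illustration.
SAMPLE = [
    "main;train_step;matmul 70",
    "main;train_step;softmax 10",
    "main;train_step 5",
    "main;load_batch 15",
]

if __name__ == "__main__":
    self_s, total_s = self_and_total(SAMPLE)
    for name in sorted(total_s, key=total_s.get, reverse=True):
        print(f"{name:12} self={self_s[name]:3} total={total_s[name]:3}")
```

A sorted table like this answers "where is the time actually spent" without any panning or zooming, at the cost of losing the visual call hierarchy.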
davidclark 56 days ago [-]
This is so cool! Flame graphs are super helpful for analyzing bottlenecks. The eflambe library for elixir has let us catch some tricky issues.
I never really liked flamegraphs much but I am going to put that aside for a bit and try to be as objective as possible.
I don't find the use case presented here compelling. Cutting out the "yo we will save you $x billion in compute costs" pitch, the tools presented here seem to be…stacktraces for your kernels. Stacktraces that go from your Python code through the driver shim to the kernel and finally onto the GPU. Neat. I don't actually know very much about what Intel has in this area so perhaps this is a step forward for them? If so, I will always applaud people figuring out how to piece together symbols and whatnot to make profiling work.
However, I am still not very impressed. Sure, there are some workloads where it is nice to know that 70% of your time is spent in some GEMM. But I think the real optimization doesn't look like that at all. For most "real" workloads, you already know the basics of how your kernels look and execute. Nobody is burning a million dollars an hour on a training run without knowing what each and every one of the important kernels are. Some of them were probably written by hand. Some might be written in higher-level PyTorch/Triton/JAX/whatever. Still others might be built on some general library. But the people who do this are not stupid, and they aren't going to be caught unawares that a random kernel has suddenly popped up on their flamegraph. They should already know what is there. And most of these tools have debugging facilities to dump intermediate state in forms that tools understand. Often this is incomplete and buggy, I know. But it's there and people do use them.
What these people are optimizing are things that flamegraphs do not show. That's things like latency in kernel launches, or synchronization overhead with the host. It's global memory traffic and warp stalls. Sure, the tools to profile this are immature compared to what the hyperscalers have for CPUs. But they are still present and used heavily: I don't buy the argument that knowing that your python code calls a kernel through __cuda12_ioctl_whatever is actually helpful. This seems like a solution searching for a problem, or maybe a basic diagnostic tool at best.
bornfreddy 55 days ago [-]
> What these people are optimizing are things that flamegraphs do not show. That's things like latency in kernel launches, or synchronization overhead with the...
What OP is showing is an example of what can be shown on flamegraphs. They are a generic visualisation tool so if you want to include latency or whatever (financial cost maybe?) you are free to do it.
As for the rest, Intel is here providing tools for developers who would like to optimize the sw stacks on their platform. Invaluable if you would like to efficiently support non-NVidia hardware.
saagarjha 54 days ago [-]
Flamegraphs categorically cannot represent timeseries data. That's not what they are designed to do and they don't have a way to display it.
bornfreddy 53 days ago [-]
That is not true, they definitely can represent some timeseries data in specific ways. But that's not even connected to what I said - I specifically mentioned latency which can be included in profiling data. Or am I misunderstanding what you are trying to say?
saagarjha 52 days ago [-]
How would you indicate how long a kernel takes to launch in a flamegraph?
_heimdall 56 days ago [-]
> Imagine halving the resource costs of AI and what that could mean for the planet and the industry -- based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030
The way this is phrased threw me off. It sounded to me like the author was comparing the power use of a more efficient LLM industry to US usage without LLMs and expecting it to be 10% lower.
Looking into the source linked with the claim, it doesn't even hold up when compared against how much power LLMs use today. The linked article raises an estimate that LLM power use could increase 15-23 times between 2023 and 2027, and that by 2030 LLMs could account for 20-25% of our total energy use.
Working that math backwards, the benefit the author is hailing as a success is that we would only increase energy use by say 7.5-11.5 times by 2027 and that in 2030 LLMs would only be 10% of the total energy use. That's not a win in my book, and doesn't account for the Jevons paradox problem where we would almost certainly just use all that efficiency gain to further grow LLM use compared to the 2030 prediction without the efficiency gains.
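For anyone who wants to see the halving worked out, here is a tiny sketch using only the ranges quoted in the comment (15-23x growth by 2027, 20-25% share of energy use by 2030). It assumes "halving the resource costs" translates one-to-one into halved energy use with no change in demand, which is exactly the assumption the comment goes on to question.

```python
def halve(value_range):
    """Halve both ends of a (low, high) range."""
    low, high = value_range
    return (low / 2, high / 2)


GROWTH_BY_2027 = (15, 23)       # projected multiple of 2023 LLM power use
SHARE_BY_2030_PCT = (20, 25)    # projected share of total energy use, percent

if __name__ == "__main__":
    print("growth with halved cost:", halve(GROWTH_BY_2027))      # (7.5, 11.5)
    print("share with halved cost: ", halve(SHARE_BY_2030_PCT))   # (10.0, 12.5)
```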
have_faith 56 days ago [-]
> Imagine halving the resource costs of AI ... based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030
Is that implying that by 2030 they expect at least 20% of all US energy to be used by AI?
benreesman 56 days ago [-]
Data centers are big consumers of energy. Most modern data centers will have a mix of vector and scalar compute because ML/AI is a bunch of stuff, most of which was ubiquitous a decade ago.
In the limit case where Prineville just gets 100k BH100 slammed into it? The absolute best you’re going to do is to have Brendan Gregg looking at the cost. He’s the acknowledged world expert on profiling and performance tuning on modern gear in the general case. There are experts in a vertical (SG14, you want to watch Carl Cook).
I’ve been around the block and my go-to on performance trouble is “What’s the Gregg book say here…”; it's your first stop.
Writingdorky 56 days ago [-]
The data source is linked and is based on the ARM Datacenter Energy prediction.
But I don't think it's too far-fetched.
The compute needed for digital twins, simulating a whole army of robots and then uploading the result to the robots, which still need a ton of compute, is not unrealistic.
Cars like Tesla have A TON of compute built in too.
And we have seen what suddenly happens to an LLM when you scale up the number of parameters. We were in an investment hell where it was not clear what to invest in (the crypto, blockchain and NFT bubbles burst), but AI opened up the sky again.
If we continue like this, it is not far-fetched that everyone will have their own private agent running and will pay for it (private / isolated for data security), plus a work agent.
klysm 56 days ago [-]
Seems pretty absurd
benreesman 56 days ago [-]
Given who said it, I chose to read for understanding.
adrianco 56 days ago [-]
This is super interesting and useful. I tried reading the code to understand how GPU workloads worked last year and it was easy to get lost in all the options and pluggable layers.
Veserv 56 days ago [-]
I do not really understand the mentioned difficulties with instruction profiling.
Are they saying it is hard to sample the stacks across the boundary? Are they saying it is hard to do so coherently because the accelerator engine is actually asynchronous so you need to do some sort of cross-boundary correlation?
However, they then talk about file systems and /proc representations which have nothing to do with the actual sampling process; only posing problems for the display of human-readable information. Many naive profiling, tracing, and logging implementations conflate these actions to their detriment; are they being conflated here or is it just a generic statement of the scope of problems?
yanniszark 55 days ago [-]
Trying to find out more about this EU stall thing Brendan talks about. Is it instruction sampling that gives you the reason for the stall? Sounds like a pretty advanced hw functionality.
shidoshi 56 days ago [-]
I can imagine Nelson and other Anthropic engineers jumping for joy at this release.
treefarmer 56 days ago [-]
Would love it if it was available and open source so people could use it in their own projects (or on their own hardware), instead of only being available on Intel's AI Cloud. But cool idea and execution nevertheless!
flamingspear 55 days ago [-]
Yeah, would love to see built-in support for this in PyTorch or TF
r3tr0 55 days ago [-]
I am actually working on a platform that makes this sort of stuff easy. We use BPF under the hood and let you remotely deploy probes across a cluster and visualize them.
Unrelated, but on the topic of reducing power consumption, I want to once again note that both AMD and NVidia max out a CPU core per blocking API call, preventing your CPU from entering low power states even when doing nothing but waiting on the GPU, for no reason other than to minimally rice benchmarks.
Basically, these APIs are set up to busyspin while waiting for a bus write from the GPU by default (!), rather than use interrupts like every other hardware device on your system.
This saves me 20W whenever my GPU is busy in ComfyUI.
Every single device using the default settings for CUDA/ROCM burns a CPU core per worker thread for no reason.
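To make the busy-spin versus blocking distinction concrete without needing a GPU, here is a small pure-Python sketch. It is not the CUDA or ROCm mechanism itself: a timer thread stands in for the GPU finishing its work, and the function name is made up. It just measures how much CPU time the waiting thread burns under each strategy. (On the CUDA side, the runtime does expose a blocking mode via cudaSetDeviceFlags with cudaDeviceScheduleBlockingSync; whether that is what the comment above refers to is my assumption.)

```python
import threading
import time


def cpu_seconds_while_waiting(spin: bool, delay: float = 0.2) -> float:
    """CPU time consumed by this process while waiting `delay` seconds.

    spin=True  polls a flag in a tight loop (busy-spin).
    spin=False sleeps on the event until it is signalled (blocking wait).
    """
    done = threading.Event()
    timer = threading.Timer(delay, done.set)  # stand-in for "GPU finished"
    start = time.process_time()
    timer.start()
    if spin:
        while not done.is_set():
            pass  # burns a core until the flag flips
    else:
        done.wait()  # thread is descheduled until signalled
    cpu_used = time.process_time() - start
    timer.join()
    return cpu_used


if __name__ == "__main__":
    print(f"busy-spin: {cpu_seconds_while_waiting(spin=True):.3f} s CPU")
    print(f"blocking:  {cpu_seconds_while_waiting(spin=False):.3f} s CPU")
```

The spinning variant should report CPU time close to the full wait, while the blocking one should report close to zero; the trade-off is the extra wake-up latency discussed in the replies.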
bob1029 56 days ago [-]
> for no reason other than to minimally rice benchmarks.
For AI/ML applications, perhaps no one will notice.
For gaming, yielding threads of execution to the OS can periodically incur minimum scheduler delays of 10-20ms. Many gamers will notice an ~extra frame of latency being randomly injected.
FeepingCreature 56 days ago [-]
Sure, but CUDA is an AI/ML API, and anyways you're not doing blocking calls when writing a graphics engine regardless. (Well, you better not.) And anyways, these calls will already busyspin for a few millis before yielding to the OS - it's just that you have to explicitly opt in to the latter part. So these are the sorts of calls that you'd use for high-throughput work, but they behave like calls designed for very-low-latency work. There is no point in shaving a few milliseconds off a low-seconds call other than to make NVidia look a few percent better in benchmarks. The tradeoffs are all wrong, and because nobody knows about it, megawatts of energy are being wasted.
saagarjha 55 days ago [-]
This is important if you are launching many kernels and orchestrating their execution from the CPU.
FeepingCreature 55 days ago [-]
In that case (which tbh is kind of bad design imo), you should have to explicitly opt in to the power-hungry mode.
saagarjha 54 days ago [-]
This is a thing that people want, hence the decision. Unfortunately those people pay Nvidia a lot more money than you do.
FeepingCreature 54 days ago [-]
The thing is, it's really not hard to recognize this access pattern. Just bucket API call times and switch modes on the fly.
There is simply no excuse for an app that does 10 API calls a second to burn 100% CPU.
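A rough sketch of what "bucket API call times and switch modes" could look like, purely as an illustration. The class name, the 1 ms threshold and the window size are all made up here; nothing in this is how the CUDA or ROCm runtimes actually decide anything.

```python
from collections import deque


class WaitModeSelector:
    """Hypothetical heuristic: pick spin vs. block from recent wait times."""

    def __init__(self, spin_threshold_s=0.001, window=32):
        # Waits shorter than this are treated as latency-sensitive (assumed value).
        self.spin_threshold_s = spin_threshold_s
        # Only the most recent `window` observations are kept.
        self.recent = deque(maxlen=window)

    def record(self, wait_seconds):
        """Store how long the last blocking API call actually waited."""
        self.recent.append(wait_seconds)

    def mode(self):
        """Return "spin" for short typical waits, "block" for long ones."""
        if not self.recent:
            return "spin"  # no data yet: keep the current default behaviour
        ordered = sorted(self.recent)
        median = ordered[len(ordered) // 2]
        return "spin" if median < self.spin_threshold_s else "block"


# An app doing ~10 calls a second, each waiting ~100 ms on the GPU.
slow_app = WaitModeSelector()
for _ in range(10):
    slow_app.record(0.1)

# An app launching many tiny kernels, each waiting ~50 microseconds.
fast_app = WaitModeSelector()
for _ in range(10):
    fast_app.record(0.00005)

print(slow_app.mode(), fast_app.mode())
```

Using the median rather than the mean means one long stall in a stream of short waits would not flip a latency-sensitive app into blocking mode.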
nonamepcbrand1 56 days ago [-]
totally looks like self promotion article lol
tantalor 56 days ago [-]
This guy invented flame graphs (among other things) so... I'm gonna allow it.
There has been a bit of hyperbole of late about energy saving AI.
There isn't a magic bullet here, it's just people improving a relatively new technology. Even though the underlying neural nets are fairly old now, the newness of transformers and the newness of the massive scale means there's quite a lot of low hanging fruit still. Some of the best minds are on this problem and are reaching for the hardest to get fruit.
A lot of these advancements work well together improving efficiency a few percent here, a few percent there.
This is a good thing, but people are doing crazy comparisons by extrapolating older tech into future use cases.
This is like estimating the impact of cars by correctly guessing that there are 1.4 billion cars in the world and multiplying that by the impact of a single Model T Ford.
https://ieeexplore.ieee.org/document/9370339
It seems like the instruction sampler is there, and it also provides the stall reason.
A while ago, I read a paper on dissecting the Nvidia architecture using very specifically tuned microbenchmarking to understand things like cache structure on chip and the like [0]. Unfortunately, no one has done this for seriously in use, recent architectures, so it's hard to use this info today. Similarly, there isn't an eBPF VM running on the chip to summarize all of this and the Nvidia tools aren't intended to make this kind of info easy to get, probably specifically because of this paper...
[0] https://arxiv.org/pdf/1804.06826
Why would it be the case that reducing the costs of AI reduces power consumption as opposed to increasing AI usage (or another application using electricity)? I would think with cheaper AI its usage would become more ubiquitous: LLMs in fridges, toasters, smart alarms, etc.
For example, food got cheaper and consumption has increased to the extent that obesity is a major problem, but this is much less than you might conclude from the degree to which productivity has increased per farmer.
For image generation, the energy needed to create an image is rapidly approaching the energy cost of a human noticing that they've seen an image — once it gets cheap enough (and good enough) to have it replace game rendering engines, we can't really spend meaningfully more on it.
(Probably. By that point they may be good enough to be trainers for other AI, or we might not need any better AI — impossible to know at this point).
For text generation, difficult to tell because e.g. source code and legal code have a lot of text.
When it comes to converting electricity into images and text, there really is no upper bound in sight. People are happy to load the internet up with as much content as they can churn out.
Now maybe waste is a bigger issue with content than with food. I'm not sure. Both have some nonzero cost to waste. It might depend on how content is distributed.
I'd say that text is capable of being extremely useful even when no human reads it, because of source code, maths proofs, etc.
But I'm curious: 238 wpm / 0.75 words per token * 16 (waking) hours per day * 83 years * $10.00 / 1M output tokens (current API cost for 4o without batching) means the current cost of making as many tokens as a human can read in a lifetime is $92,300: https://www.wolframalpha.com/input?i=238+words+per+minute+%2...
With these numbers, a well-written project with even a billion lines of code would be a rounding error even if only a thousand people used any specific such software and none of that was ever shared with what other people wanted to get done.
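The lifetime-reading estimate above can be checked in a few lines. This is only a sketch: it assumes "0.75 words per token" means tokens = words / 0.75, and a 365-day year; the variable names are mine.

```python
WORDS_PER_MINUTE = 238
WORDS_PER_TOKEN = 0.75          # assumed: tokens = words / 0.75
WAKING_HOURS_PER_DAY = 16
DAYS_PER_YEAR = 365             # assumed: no leap days
YEARS = 83
DOLLARS_PER_MILLION_TOKENS = 10.00

# Tokens a person could read per minute, then over a whole lifetime.
tokens_per_minute = WORDS_PER_MINUTE / WORDS_PER_TOKEN
lifetime_tokens = (
    tokens_per_minute * 60 * WAKING_HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
)

# Cost of generating that many output tokens at the quoted API price.
lifetime_cost = lifetime_tokens / 1_000_000 * DOLLARS_PER_MILLION_TOKENS

print(f"{lifetime_tokens:,.0f} tokens, about ${lifetime_cost:,.0f}")
```

That comes out to roughly 9.2 billion tokens, which is where the figure of about $92,300 comes from.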
Market dynamics should push people to stop generating that content if they don't get enough value to justify the cost. In practice, though, it hasn't seemed to happen yet and we must be past a threshold where there's more content created online than we could ever value.
It'd make for an interesting study, but short of having verifiable data I have to assume we'll continue increasing the rate at which content is created whether the value is there or not.
You'd most likely categorize all of the unseen textures or higher-than-needed resolution in your "waste" bucket, and I can't argue with that. But VR still clearly means that there is at least theoretical room for "realtime video generated custom for every viewer, which in turn is composed of even more content sources".
If everyone's doing that all day every day on each eye, that's a reasonable guess of an upper bound: you as a human cannot actually consume more even if you make it.
GANs can already do that speed, but any given GAN is a specialist AI and not a general model; diffusion models are general, but they're slower (best generation speed I've seen is 4-5 frames per second on unknown hardware). LLMs aren't really suited to doing images at all, but can control other models (this is what ChatGPT does when "it makes an image" — it calls out to DALL•E).
* how long I've been paying attention to that, not a detailed historical analysis
That said, if we got to such a massive scale I'd expect us to hit other limits first (electricity available, best produced, storage space, network transmission, etc.).
Or did I totally misunderstand your example here? I may have misread it completely, if so sorry about that!
Sure, absolutely. But I can say the same of food, which is why I drew the analogy between them previously.
> That said, if we got to such a massive scale I'd expect us to hit other limits first (electricity available, best produced, storage space, network transmission, etc.).
Difficult to guess when the quality isn't yet at the right threshold: GANs are already this speed on phone hardware*, so we're not bounded on that specific combination with available electrical energy; on the other hand, 2 years ago I was seeing images for about 3 kJ, which is in the region of hundreds of kilowatts for 2 eyes at 60 fps, which is absolutely going to be a problem… if they were limited to that hardware and with that model (though both are moving targets, I'd be very surprised if the unknown hardware that I've seen doing 4-5 fps was burning 12-15 kW, but it's not strictly speaking impossible it really was that power hungry).
* Specifically: on an iPhone 11, BlazeStyleGAN model was generating images in 12.14 ms, which is just over 82 fps — https://research.google/blog/mediapipe-facestylizer-on-devic...
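The power figures in this comment follow from simple multiplication. A quick check, assuming a flat 3 kJ per image regardless of hardware or frame rate (which is the simplification the comment itself makes); the helper name is mine:

```python
JOULES_PER_IMAGE = 3_000  # the "about 3 kJ" per image quoted above


def power_watts(frames_per_second, streams=1):
    """Continuous power needed to generate images at a given rate."""
    return JOULES_PER_IMAGE * frames_per_second * streams


# Two eyes at 60 fps each.
vr_watts = power_watts(60, streams=2)

# A single stream at the 4-5 fps seen from diffusion models.
diffusion_watts = (power_watts(4), power_watts(5))

# Frame rate implied by 12.14 ms per image on the iPhone 11.
blaze_fps = 1000 / 12.14

print(f"VR: {vr_watts / 1000:.0f} kW")
print(f"4-5 fps: {diffusion_watts[0] / 1000:.0f}-{diffusion_watts[1] / 1000:.0f} kW")
print(f"BlazeStyleGAN: {blaze_fps:.1f} fps")
```

So "hundreds of kilowatts" is 360 kW, and the 12-15 kW range is just 3 kJ times 4 or 5 frames per second.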
I'm no Malthusian, but the paradox holds here pretty well.
The very specific point I'm claiming is that the increased consumption isn't always unbounded.
* More women work more and invest in their own education and fewer spend time alone at home as they might in poorer countries which would facilitate giving birth and investing time on childcare that way.
* More men and women derive their primary income from work that children cannot easily participate in. EG: office work, work from home computer work, vs farming or working with one's hands. In many poorer countries it is common practice to have more children at least partially to bolster the labor force around the house.
* Wealthier nations have better access to family planning: contraception, abortion, pastimes that can meaningfully compete against getting laid in the first place.
Sources: Colleran, H., Snopkowski, K. Variation in wealth and educational drivers of fertility decline across 45 countries. Popul Ecol 60, 155–169 (2018). https://doi.org/10.1007/s10144-018-0626-5 https://link.springer.com/article/10.1007/s10144-018-0626-5
More Work, Fewer Babies: What Does Workism Have to Do with Falling Fertility? - Laurie DeRose and Lyman Stone https://ifstudies.org/ifs-admin/resources/reports/ifs-workis...
I'd assume environmental, but there's also more subtle answers than will fit in a comment box — whatever the cause, it has to be near-global.
China's building loads more houses, still has a fertility decline.
It is really a good example of what natural dimension reducers we are, even when we know it makes no sense. It is like we can't help but reduce things to one explanatory variable.
My favorite is the news headline "The market went up today because of X".
They say: Tesla shares up as revelations surface that the wind is blowing east.
https://arxiv.org/abs/2408.14837
Also TIL this is generated at 20 frames per second, the best I've used myself was "only" 4-5; does anyone know the performance and power consumption of a Google TPU?
The only hope is to generate this power greenly.
Increasing fuel economy resulted in many more cars being replaced by SUVs.
AI usage will definitely increase to fill the space.
But you can be more charitable and imagine more productive uses of AI on the edge that are impossible today. Those uses would presumably create some value, so if by reducing AI energy costs by 90% we get all the AI usage we have today plus those new uses that aren't currently viable, it's a better bang for buck.
The next bottleneck will be datacenter power interconnects, but in that scenario as you say you can expect power usage to expand to fill the supply gap if there is a perf win.
[0] For inference, that is. Training is another matter, and energy consumption for hardware manufacturing yet another.
Google has done this: "In eighteen months, we reduced costs by more than 90% for these queries through hardware, engineering, and technical breakthroughs, while doubling the size of our custom Gemini model." https://blog.google/inside-google/message-ceo/alphabet-earni...
Tried the Gemini Advanced trial last week. For some reason their so called 1M context model is limited to 10 files at a time, so you can't upload a codebase for it to reference and even with the extra data the end result is somehow worse than both Sonnet or 4o without much given context at all. It's definitely not on the level as a coding assistant at least.
I wonder if this pushes a bit much towards flamegraphs specifically. They were an innovation when they were first invented and the alternatives were things like perf report, but now I think they’re more one tool among many. In particular, I think many people who are serious about performance often reach for things like pprof for statistical profiles and various tracing and trace-visualisation tools for more fine-grained information (things like bpftrace, systemtap, or custom instrumentation on the recording side and perfetto or the many game-development oriented tools on the visualisation (and sometimes instrumentation) side).
I was particularly surprised by the statement about Intel’s engineers not knowing what to do with the flamegraphs. I read it as them already having tools that are better suited to their particular needs, because I think the alternative has to be that they are incompetent or, at best, not thinking about performance at all.
Lots of performance measuring on Linux is done through the perf subsystem and Intel have made a lot of contributions to make it good. Similarly, Intel have added hardware features that are useful for measuring and improving performance – an area where their chips have features that, at least on chips I’ve used, easily beat AMD’s offerings. This kind of plumbing is important and useful, and I guess the flamegraphs demonstrate that the plumbing was done.
It wasn't clearly defined but I think EU stall means Execution Unit stall which is when a GPU "becomes stalled when all of its threads are waiting for results from fixed function units" https://www.intel.com/content/www/us/en/docs/gpa/user-guide/...
https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter...
For each function you know how much CPU is spent in the function itself, as opposed to child calls. All in a simple text file without the need for constantly scrolling, panning, and enlarging to get the information you need.
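To make the self-versus-children distinction concrete, here is a small sketch that turns "folded" stack samples (the `frame;frame;frame count` text format that flame graph tooling consumes) into a flat per-function table. The sample stacks and function names are invented for illustration, and this is not gprof's actual algorithm, only the same idea applied to sampled stacks.

```python
from collections import Counter


def self_and_total(folded_lines):
    """Return (self_samples, total_samples) per function from folded stacks."""
    self_samples = Counter()
    total_samples = Counter()
    for line in folded_lines:
        # Each line is "root;child;leaf <count>".
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        frames = stack.split(";")
        # Self time: samples where this function is the leaf frame.
        self_samples[frames[-1]] += count
        # Total time: samples where it appears anywhere (once per stack).
        for frame in set(frames):
            total_samples[frame] += count
    return self_samples, total_samples


# Hypothetical samples, not taken from any real profile.
SAMPLE = [
    "main;train;gemm 70",
    "main;train 10",
    "main;load 20",
]

self_samples, total_samples = self_and_total(SAMPLE)
for name, total in total_samples.most_common():
    print(f"{name:<6} self={self_samples[name]:>3} total={total:>3}")
```

Here `train` shows 80 total samples but only 10 of its own, which is exactly the information the flat text profile gives you without any panning or zooming.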
https://github.com/Stratus3D/eflambe/blob/master/README.adoc
I don't find the usecase presented here compelling. Cutting out the "yo we will save you $x billion in compute" costs the tools presented here seem to be…stacktraces for your kernels. Stacktraces that go from your Python code through the driver shim to the kernel and finally onto the GPU. Neat. I don't actually know very much about what Intel has in this area so perhaps this is a step forward for them? If so, I will always applaud people figuring out how to piece together symbols and whatnot to make profiling work.
However, I am still not very impressed. Sure, there are some workloads where it is nice to know that 70% of your time is spent in some GEMM. But I think the real optimization doesn't look like that at all. For most "real" workloads, you already know the basics of how your kernels look and execute. Nobody is burning a million dollars an hour on a training run without knowing what each and every one of the important kernels are. Some of them were probably written by hand. Some might be written in higher-level PyTorch/Triton/JAX/whatever. Still others might be built on some general library. But the people who do this are not stupid, and they aren't going to be caught unawares that a random kernel has suddenly popped up on their flamegraph. They should already know what is there. And most of these tools have debugging facilities to dump intermediate state in forms that tools understand. Often this is incomplete and buggy, I know. But it's there and people do use them.
What these people are optimizing are things that flamegraphs do not show. That's things like latency in kernel launches, or synchronization overhead with the host. It's global memory traffic and warp stalls. Sure, the tools to profile this are immature compared to what the hyperscalers have for CPUs. But they are still present and used heavily: I don't buy the argument that knowing that your python code calls a kernel through __cuda12_ioctl_whatever is actually helpful. This seems like a solution searching for a problem, or maybe a basic diagnostic tool at best.
What OP is showing is an example of what can be shown on flamegraphs. They are a generic visualisation tool so if you want to include latency or whatever (financial cost maybe?) you are free to do it.
As for the rest, Intel is here providing tools for developers who would like to optimize the sw stacks on their platform. Invaluable if you would like to efficiently support non-NVidia hardware.
The way this is phrased threw me off. It sounded to me like the author was comparing the power use of a more efficient LLM industry to US usage without LLMs and expecting it to be 10% lower.
Looking into the source linked with the claim, it doesn't even hold up when compared against how much power LLMs use today. The linked article raises an estimate that LLM power use could increase 15-23 times between 2023 and 2027, and that by 2030 LLMs could account for 20-25% of our total energy use.
Working that math backwards, the benefit the author is hailing as a success is that we would only increase energy use by say 7.5-11.5 times by 2027 and that in 2030 LLMs would only be 10% of the total energy use. That's not a win in my book, and doesn't account for the Jevons paradox problem where we would almost certainly just use all that efficiency gain to further grow LLM use compared to the 2030 prediction without the efficiency gains.
Is that implying that by 2030 they expect at least 20% of all US energy to be used by AI?
In the limit case where Prineville just gets 100k BH100 slammed into it? The absolute best you’re going to do is to have Brendan Gregg looking at the cost. He’s the acknowledged world expert on profiling and performance tuning on modern gear in the general case. There are experts in a vertical (SG14, you want to watch Carl Cook).
I’ve been around the block and my go-to on performance trouble is “What’s the Gregg book say here…” It's your first stop.
But I don't think it's too far fetched.
The compute needed for digital twins, simulating a whole army of robots then uploading it to the robots, who still need a ton of compute, is not unrealistic.
Cars like Tesla have A TON of compute built in too.
And we have seen what suddenly happens to an LLM when you change the number of parameters. We were in an investment hell where it was not clear what to invest in (the crypto, blockchain and NFT bubbles burst) but AI opened up the sky again.
If we continue like this, it will not be far fetched that everyone has their own private agent running and paying for it (private / isolated for data security) + your work agent.
Are they saying it is hard to sample the stacks across the boundary? Are they saying it is hard to do so coherently because the accelerator engine is actually asynchronous so you need to do some sort of cross-boundary correlation?
However, they then talk about file systems and /proc representations which have nothing to do with the actual sampling process; only posing problems for the display of human-readable information. Many naive profiling, tracing, and logging implementations conflate these actions to their detriment; are they being conflated here or is it just a generic statement of the scope of problems?
Check us out: https://yeet.cx
Our current package index is a bit thin:
https://yeet.cx/discover
We have a ton in the pipeline and are going to add more in the coming weeks and release an SDK.
Basically, these APIs are set up to busyspin while waiting for a bus write from the GPU by default (!), rather than use interrupts like every other hardware device on your system.
You turn it off with
NVidia: `cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)`
AMD: `hipSetDeviceFlags(hipDeviceScheduleBlockingSync)`
On PyTorch
NVidia: `import ctypes; ctypes.CDLL('libcudart.so').cudaSetDeviceFlags(4)`
AMD: `import ctypes; ctypes.CDLL('libamdhip64.so').hipSetDeviceFlags(4)`
This saves me 20W whenever my GPU is busy in ComfyUI.
Every single device using the default settings for CUDA/ROCM burns a CPU core per worker thread for no reason.
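A slightly more defensive version of the ctypes one-liners, as a sketch. The helper name and the idea of trying several libraries are mine. I'm assuming the unversioned library names above resolve on your system (many installs only ship versioned names such as `libcudart.so.12`), and as far as I know the flag is most reliably set before the process does any other GPU work.

```python
import ctypes

# Numeric value of cudaDeviceScheduleBlockingSync / hipDeviceScheduleBlockingSync,
# i.e. the literal 4 used in the one-liners above.
DEVICE_SCHEDULE_BLOCKING_SYNC = 0x04


def request_blocking_sync(candidates):
    """Try each (library, symbol) pair in order.

    Returns (library_name, status_code) for the first runtime that loads,
    or None if none of them could be loaded. A status of 0 means success.
    """
    for lib_name, symbol in candidates:
        try:
            runtime = ctypes.CDLL(lib_name)
        except OSError:
            continue  # library not installed or not on the loader path
        set_flags = getattr(runtime, symbol, None)
        if set_flags is None:
            continue  # library loaded but does not export the symbol
        set_flags.argtypes = [ctypes.c_uint]
        set_flags.restype = ctypes.c_int
        return lib_name, set_flags(DEVICE_SCHEDULE_BLOCKING_SYNC)
    return None


# Library names taken from the comment above; adjust for your install.
RUNTIMES = [
    ("libcudart.so", "cudaSetDeviceFlags"),
    ("libamdhip64.so", "hipSetDeviceFlags"),
]
```

Call `request_blocking_sync(RUNTIMES)` at the very top of your script, before importing torch, and check that the returned status is 0.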
https://en.wikipedia.org/wiki/Brendan_Gregg