> This is not the first time we can see Nvidia taking shortcuts to achieve maximum performance of their GPUs
Why is implementing it correctly not performant? For context I have no idea how rounding is typically implemented anyways.
crote 4 hours ago [-]
Another thing to keep in mind is that CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago.
For a lot of applications the difference between a denormal and zero is small enough to be irrelevant, so if you expect near-zero values to be common, enabling a denormals-to-zero compiler flag might give you a pretty nice performance boost for free.
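To make "denormal" concrete, here is a minimal Python sketch of what flushing to zero does to a value. It is only a software stand-in: real flush-to-zero is a hardware mode (on x86 it is a bit in the MXCSR register), which plain Python does not expose. The helper names `is_subnormal` and `flush_to_zero` are made up for illustration.

```python
import math
import sys

# Smallest positive *normal* double (2**-1022 for IEEE 754 binary64).
SMALLEST_NORMAL = sys.float_info.min


def is_subnormal(x: float) -> bool:
    """True for nonzero values whose magnitude is below the smallest normal."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def flush_to_zero(x: float) -> float:
    """Software imitation of FTZ: replace a subnormal with a zero of the same sign."""
    return math.copysign(0.0, x) if is_subnormal(x) else x


half_min = SMALLEST_NORMAL / 2      # representable only as a subnormal
print(is_subnormal(half_min))       # nonzero, but below the normal range
print(flush_to_zero(half_min))      # the information in half_min is discarded
print(flush_to_zero(1e-300))        # normal values pass through unchanged
```

Whether losing values this small matters depends entirely on the application, which is the trade-off described above.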
mananaysiempre 10 minutes ago [-]
> CPU processing of denormals tends to be extremely slow - I vaguely recall running into something like a 10x slowdown a decade ago
Intel CPU processing, where slowdowns can be as bad as a couple hundred cycles. AMD CPUs penalize them much more mildly, usually single-digit cycles. (No idea about ARM.)
adgjlsfhk1 2 hours ago [-]
CPUs that aren't Intel are plenty fast on denormals. Intel is the only one where denormals are 100x slower (and Intel has fixed that on their new CPUs, but only on their E-cores).
andrepd 58 minutes ago [-]
More like 100x, but not sure how true that is nowadays.
yosefk 2 hours ago [-]
Flush denormals to zero. Even their inventor had trouble writing correct code in their presence - see the Appendix to that "what every programmer should know..." paper
mananaysiempre 13 minutes ago [-]
On the other hand, they (unexpectedly to the inventor, who intended them to be a debugging tool) underpin a few foundational results in correctly rounded computation, such as https://en.wikipedia.org/wiki/Sterbenz_lemma.
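As a rough illustration of the Sterbenz lemma: if y/2 <= x <= 2y, the floating-point difference x - y is exact, and near the bottom of the range that only works because the result is allowed to be a denormal. The sketch below checks exactness with `fractions.Fraction`; the function name `subtraction_is_exact` is hypothetical, and the inputs are picked by hand for doubles.

```python
import sys
from fractions import Fraction


def subtraction_is_exact(x: float, y: float) -> bool:
    """Compare the rounded float difference with the exact rational difference."""
    return Fraction(x - y) == Fraction(x) - Fraction(y)


tiny = sys.float_info.min   # smallest positive normal double, 2**-1022
x, y = 1.5 * tiny, tiny     # satisfies y/2 <= x <= 2*y
diff = x - y                # 2**-1023, representable only as a denormal

print(subtraction_is_exact(x, y))   # exact thanks to gradual underflow
print(0.0 < diff < tiny)            # the result lies in the denormal range
```

With denormals flushed to zero, `diff` would become 0 even though `x != y`, so the usual guarantee that `x - y == 0` implies `x == y` would no longer hold for these inputs.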
loicd 2 hours ago [-]
> Even their inventor had trouble writing correct code in their presence
I didn't know that. Could you provide a more specific reference?
andrepd 56 minutes ago [-]
It's one of several issues with the design of IEEE floats, unfortunately. I wish we could start thinking more seriously about a new design, to complement if not replace IEEE in the long term. Posits are an example https://github.com/andrepd/posit-rust