Really cool solution! One question: maybe I missed it, but there's no technical reason the tag bits could not use the entire range of exponent bits, no? Other than the fact that having up to 2048 tags would be ridiculously branchy, I guess.
Here's a variation I just thought of, which probably has a few footguns I'm overlooking right now: use the seven highest exponent bits instead of three. Then we can directly read the most significant byte of a double, skipping the need for a rotation altogether.
After reading the most significant byte, subtract 16 (or 0b00010000) from it, then mask out the sign bit. To test for unboxed doubles, check whether the resulting value is bigger than 15. If so, it is an unboxed double; otherwise the lower four bits of the byte are available as alternative tags (so 16 tags).
Effectively, we adjusted the exponent range 2⁻⁷⁶⁷..2⁻⁵¹¹ into the 0b(0)0000000 - 0b(0)0001111 range and made them boxed doubles. Every other double, which is 15/16th of all possible doubles, is now unboxed. This includes subnormals, zero, all NaN encodings (so it even can exist in superposition with a NaN or NuN tagging system I guess?) and both infinities.
To be clear, this is off the top of my head so maybe I made a few crucial mistakes here.
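For what it's worth, the test above is short enough to sketch directly. This is my own sketch of the speculative scheme described here, so it inherits any mistakes in it:

```javascript
// Sketch of the 7-bit-exponent-tag test described above (speculative,
// as is the scheme itself): read the most significant byte of the
// double, subtract 16, mask out the sign bit, and treat results <= 15
// as colliding with the 16 alternative tags (so those doubles get
// boxed); anything larger means the double can stay unboxed.
const dv = new DataView(new ArrayBuffer(8));

function tagOf(x) {
  dv.setFloat64(0, x);                    // big-endian: byte 0 is the MSB
  const t = (dv.getUint8(0) - 16) & 0x7f; // subtract 16, mask out the sign
  return t > 15 ? null : t & 0x0f;        // null: unboxed double;
}                                         // 0..15: collides with a tag, box it
```

With this, `tagOf(1.0)` is null (unboxed, since 1.0's most significant byte is 0x3f), as are zero, NaN, and the infinities, while something like 2⁻⁶⁰⁰ lands in the boxed window.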
omelancon 9 days ago [-]
Hi, I'm one of the authors of the paper. Thanks for your questions and comments!
There are many reasons why 3-bit tags work well in practice. Importantly, it allows aligning heap objects on 64-bit machine words. Dereferencing a tagged pointer can then be done in a single machine instruction, a MOV offset by the tag.
One of our goals is to make self-tagging as straightforward as possible to implement in existing systems. Requiring a move away from the ubiquitous 3-bit tag scheme is therefore a definite no-go.
Another goal is performance, obviously. It turns out that if the transformation from self-tagged float to IEEE 754 float requires more than a few (2-3) machine instructions, it is no longer as advantageous. Hence the choice of tags 000, 011 and 100, whose encoding/decoding is a single bitwise rotation.
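To illustrate what "a single bitwise rotation" means here, a rough reconstruction (my own, not code from the paper, using the bits-60..62 placement described elsewhere in the thread for the CRuby commit):

```javascript
// Reconstruction of the rotate-to-tag idea (illustrative, not the
// paper's code): the top three exponent bits (bits 60..62) become the
// low bits after one rotation, where they can be compared against the
// float tags 000, 011 and 100. The sign bit lands in bit 3 and is
// ignored by the tag test.
const dv = new DataView(new ArrayBuffer(8));
const M64 = (1n << 64n) - 1n;
const rotl = (b, n) => ((b << n) | (b >> (64n - n))) & M64;
const rotr = (b, n) => ((b >> n) | (b << (64n - n))) & M64;

function encode(x) {                      // tagged word, or null: box it
  dv.setFloat64(0, x);
  const w = rotl(dv.getBigUint64(0), 4n); // bits 60..62 -> bits 0..2
  const tag = w & 7n;
  return tag === 0n || tag === 3n || tag === 4n ? w : null;
}

function decode(w) {                      // inverse: a single rotation back
  dv.setBigUint64(0, rotr(w, 4n));
  return dv.getFloat64(0);
}
```

Here `encode(1.0)` carries tag 011 and decodes back losslessly, while a value like 1e100 falls outside the covered exponent range and returns null (box it).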
Also keep in mind that assigning more tags to self-tagging to capture floats that are never used in practice just puts a strain on other objects. That's why our paper includes a usage analysis of floats to guide tag selection.
In fact, we are currently working on an improved version that uses a single tag to capture all "useful" values. Capturing 3/8 of floats seems unnecessary, especially since one of the tags is only meant to capture ±0.0. The trick is to rotate the tag to bits 2-3-4 of the exponent instead of 1-2-3 and add an offset to the exponent to "shift" the range of captured values.
But in the end, it feels like we are barely scratching the surface and that a lot of fine-tuning can still be done by toying with tag placement and transformation. But I think the next step is a quality over quantity improvement: capture less floats but capture the right ones.
vanderZwan 9 days ago [-]
Thank you for the explanation! I was not aware of the context behind the choice of three-bit tags.
> The trick is to rotate the tag to bits 2-3-4 of the exponent instead of 1-2-3 and add an offset to the exponent to "shift" the range of captured values.
Maybe I misunderstand, but isn't that a similar idea to what I just described? Adding an offset to "rotate" the ranges of the exponent by a segment, putting the one with zero on the high side? The main difference being that you stick to the upper four bits of the exponent, whereas I suggested using one of the upper three bit sequences to mark bits 4-5-6-7 as tag bits? (I mistakenly included the sign bit before and claimed this unboxes 15/16ths of all doubles; it's actually 7/8ths.) Which probably has consequences for minimizing instructions, like you mentioned.
> But I think the next step is a quality over quantity improvement: capture less floats but capture the right ones.
I suspect that ensuring NaN and Infinity are in there will be crucial to avoid performance cliffs in certain types of code; I have seen production code where it is known that initialized values are always finite, so either of those two is then used as a way to "tag" a float value as "missing" or something to that effect.
Anyway, looking forward to future results of your research!
omelancon 8 days ago [-]
> Maybe I misunderstand, but isn't that a similar idea to what I just described? Adding an offset to "rotate" the ranges of the exponent by a segment...
Yes, it is similar. It seems to me that there really aren't that many useful operations that can be applied to the exponent besides adding an offset. But that's only a suspicion, don't take my word for it.
> I suspect that ensuring NaN and Infinity are in there will be crucial to avoid performance cliffs...
This is a reasonable assumption. There are in fact ways to rotate and add an offset such that the exponent can overflow/underflow to capture exponents 0 and 0x7ff (for inf and nan) with a single well-positioned tag. Making it work in practice is not as simple, but we are working on it.
vanderZwan 7 days ago [-]
Yeah, this is one of those situations where speculating about behavior is one thing, but well-designed benchmarks and tests may very well give some surprising results. I am very much looking forward to the follow-up paper where you'll share your results :). Good luck with the research to you and your colleagues!
jacobp100 8 days ago [-]
Just to add some JS engine info: every engine stores numbers as 32-bit integers if possible, which is usually done by tagging a pointer. Also, JSC needs pointers to be represented exactly as-is, because when running the GC it will scan the stack for anything that looks like a pointer and retain it.
funny_falcon 8 days ago [-]
Just correction: CRuby uses “Float Self-Tagging” for years.
refulgentis 8 days ago [-]
This is your 4th comment claiming this: it's not true.
You're right that Ruby uses tags; Objective-C does too, and has for a while.
The innovation here is that it's a tag without the tag bits. That's why it's self-tagging, not tagging.
SonOfLilit 8 days ago [-]
In another comment, this user links a CRuby commit that they claim adds it. It seems legit.
Linked commit contains code for rotating tagged floats so bits 60..62 go to the least significant positions, and a comment about a range of unboxed floats between 1.7...e-77 and 1.7...e77, plus special casing 0.0
Also, funny enough, this idea and variations of it look surprisingly easy to implement in JS itself using typed arrays. I won't make any claims about the impact on performance though...
You need a Float64Array for the values (plus a Uint32Array/Uint8Array over the same buffer to be able to manipulate the integer bits, which technically also requires knowledge of endianness, since byte order isn't pinned down in the JS spec and some funny mobile hardware still exists out there). Mocking a heap is easy enough: the "heap" can be a plain array, the "pointers" indices into the array. Using a Map<value, index> would let you "intern" duplicate values (i.e. ensure doubles outside the unboxed range are only "heap"-allocated once).
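A minimal mock along these lines (all names here are invented for illustration, not from any real engine):

```javascript
// Mock of the setup described above: a shared buffer viewed as both
// floats and bytes, an endianness probe, a plain-array "heap" with
// indices as "pointers", and a Map to intern boxed values.
const f64 = new Float64Array(1);
const u8 = new Uint8Array(f64.buffer);
f64[0] = 1;                          // bits 0x3FF0...: probe byte order
const MSB = u8[7] === 0x3f ? 7 : 0;  // little-endian stores the MSB last

const heap = [];                     // "heap": plain array
const interned = new Map();          // value -> "pointer" (array index)

function intern(x) {                 // "box" x at most once
  let p = interned.get(x);
  if (p === undefined) { p = heap.push(x) - 1; interned.set(x, p); }
  return p;
}

function highByte(x) {               // the byte a self-tag scheme inspects
  f64[0] = x;
  return u8[MSB];
}
```

`highByte` gives you the byte a self-tagging scheme would test, and `intern(3.14) === intern(3.14)`, so out-of-range doubles get "heap"-allocated only once.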
pizlonator 8 days ago [-]
I doubt this is actually faster than NaN tagging, or what they call NuN tagging. The code sequences they cite for encoding and decoding are worse than what I expect NuN tagging to give you.
If they want to convince me that their thing is faster, they should do a comparison against a production implementation of NuN tagging. Note that the specifics of getting it right involve wacky register allocation tricks on x86 and super careful instruction selection on arm.
It seems that they use some very nonstandard JS implementation of NaN tagging as a strawman comparison.
(Source: I wrote a lot of the NuN tagging optimizations in JavaScriptCore, but I didn’t invent the technique.)
Leszek 7 days ago [-]
Agreed, NuN tagging is an awesome use of the NaN space, and it has the same benefits of being a no-op for pointers. I would only consider this for cases where NuN tagging is impossible, e.g. some system that needs more pointer bits than NuN tagging allows.
I'd add that the claim that this could be implemented in V8 doesn't take into account pointer compression, where on-heap V8 tagged pointers are 32-bit, not 64-bit.
dgreensp 7 days ago [-]
It’s also worth noting that engines with JIT, like V8, don’t box intermediate floating-point numbers in calculations after optimization has been performed. Arrays of (only) numbers also don’t box the numbers (though all numeric object property values that aren’t small integers are now boxed in V8). This means you can read floats from arrays, do math on them, and write them to arrays, without boxing or unboxing. This wouldn’t necessarily be true in a less sophisticated runtime that lacks an optimizing compiler (of which V8 has two or three IIRC).
djwatson24 7 days ago [-]
Can you go into more detail on the "wacky register allocation tricks" or the instruction selection needed to support NuN tagging? Or pointers to code somewhere? It would be nice to compare some of them to the paper.
plagiarist 8 days ago [-]
The idea in the paper is really cool.
People who enjoyed this might also like to read how Apple used tagged pointers for short strings in Objective-C [0]. I think that's when I first learned about tagged pointers. NaN-boxing was mindblowing for me. I love this kind of stuff.
Another cool thing that seems related: exploiting alignment to free up N bits in a 'pointer' representation, because your values have to be aligned. The JVM does this to expand the set of possible addresses representable in 32 bits: https://shipilev.net/jvm/anatomy-quarks/23-compressed-refere...
So, for example, with 3 bits of alignment required, the first valid address for a pointer to point to after 0x0 is 0x8, and after that is 0x10, but you represent those as 0x1 and 0x2 respectively, and use a shift to get back the actual address (0x1 << 3 = 0x8, actual address). I think this is gestured at in section 1.1 of the paper, sort of, except they envision using the space thus freed for tags, rather than additional bits. (Which only makes sense if your address is 32 bits anyway, rather than 64 as in the paper: no one has 67-bit addresses. So saving 3 bits doesn't buy you anything. I think.)
> Aligning all heap-allocated values to 64-bit machine words conveniently frees the low bits of pointers to store a 3-bit tag.
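The shift described above is tiny in code. A sketch of the JVM-style scheme (not actual JVM code):

```javascript
// Compressed-reference sketch: with 8-byte alignment the low 3 bits of
// every object address are zero, so they can be dropped when storing a
// reference and restored by a shift on load. A 32-bit "compressed"
// reference then covers a 35-bit (32 GB) heap.
const SHIFT = 3n;
const compress = (addr) => addr >> SHIFT; // assumes addr is 8-byte aligned
const expand = (ref) => ref << SHIFT;
```

This reproduces the worked example above: address 0x8 is stored as 0x1, address 0x10 as 0x2, and the shift recovers the real address.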
plagiarist 8 days ago [-]
It's interesting which runtimes exploit the extra space for what reasons! It definitely makes more sense to have the extra address space on 32 bits compared to 64. I wonder if the extra addresses are specific to the JVM / not something that works well in the C family?
lmm 8 days ago [-]
Well in C you have non-aligned pointers, because you can have pointers to things that aren't objects and might not be aligned (e.g. individual chars or shorts). In Java everything is at least 8-byte-aligned, you can't store a loose char/short/int on the heap (it has to go in a boxed object that's 8-byte-aligned, though the compiler will do this semi-automatically) and you can't take a pointer to an individual element of an array.
If you applied the C standard strictly, you could use a JVM-style representation for pointers to longs, pointers, and structs that start with longs and pointers, so you could theoretically have an implementation where those pointers were shorter. But you'd have to convert back and forth when casting to and from void* (and char*), and in practice C people expect to be able to cast a long* to int*, cast that to void*, and get the same result as casting long* to void*, even though doing that and using it is undefined behaviour according to the standard.
pavpanchekha 9 days ago [-]
I loved this clever, weird, awesome paper, so a short summary.
In many dynamic languages some values are stored on the heap ("boxed") and represented as a pointer, while others are represented as an immediate value ("unboxed" or "immediate"). Pointer tagging is a common way to do that: the low bits of the value tell you the value's type, and some types are immediate while others are boxed.
Naturally, the tag bits have a fixed value, so can't be used to store data. So for example your language might offer 61-bit integer immediates instead of 64-bit integers; the other three bits are used for tags. Possibly, larger integers are stored on the heap and treated as a different type (for example Python 2.X had separate int and long types for these cases).
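A 61-bit immediate integer scheme like that fits in a few lines (the 3-bit tag value for integers here is invented for the example):

```javascript
// Low-bit tagging sketch: 3 tag bits leave 61 bits for immediate ints.
// The tag value 0b001 for integers is illustrative, not from any
// particular runtime.
const TAG_BITS = 3n;
const TAG_INT = 0b001n;
const boxInt = (n) => (n << TAG_BITS) | TAG_INT; // n must fit in 61 bits
const unboxInt = (v) => v >> TAG_BITS;           // BigInt >> keeps the sign
const tagOf = (v) => v & ((1n << TAG_BITS) - 1n);
```

An integer too wide for 61 bits would instead be heap-allocated and given a different tag, as the Python 2.x int/long split illustrates.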
However, it's hard to use this strategy for floats, because floats need all 64 bits (or 32 bits for single-precision, same difference). There's a trick called "NaN boxing" which makes use of the large number of NaNs in the float representation, but read the paper if you want more on that.
The authors' insight is that, suppose you have a three-bit tag and 011 is the tag for floats. By totally random chance, _some_ floats will end in 011; you can represent those as immediates with those tag bits. Obviously, that's unlikely, though you can raise the chances by using, like, 010, 011, 100, and 101 all as float tags. Still, the low bits are a bad choice. But what about high bits? Most floats have one of a couple high bit patterns, because most floats are either 0 or between, say, 1e-100 and 1e100. Floats outside that range can be boxed but since they're really rare it's not a big cost to box them.
So basically, we use high bits as our tag bits and map all the common float prefixes to float tags. This allows unboxing the vast majority of floats, which leads to big speedups on float-heavy benchmarks.
A personal note: I've been working in numerics and floating-point for a decade now and have had to deal with float boxing both from a research point of view (lots of runtime analysis systems for floats), from a user point of view (using unboxed float vectors for significant speedup in my own software), and from a teaching point of view (discussing boxing in my compilers class, using NaN-boxing as an example of cleverness).
This idea is so simple, so crazy, so stupid, and works so well, but I never thought of it. Bravo to the authors.
daanx 8 days ago [-]
> This idea is so simple, so crazy, so stupid, and works so well, but I never thought of it. Bravo to the authors.
Thanks for the nice summary -- looking forward to reading the paper!
The same idea of self-tagging is actually also used in the Koka language [1] runtime system, where by default the Koka compiler only heap-allocates float64's when their absolute value is outside the range [2e-511,2e512) and not 0, infinity, or NaN (see [2]). This indeed saves many (many!) heap allocations for float-intensive programs.
Since Koka only uses 1 bit to distinguish pointers from values, another slightly faster option is to only box negative float64's, but of course negative numbers are still quite common, so it saves fewer allocations in general.
ps. If you enjoy reading about tagging, I recently wrote a note on efficiently supporting seamless large integer arithmetic (as used in Koka as well) and discuss how certain hardware instructions could really help to speed this up [3]:
> This allows unboxing the vast majority of floats, which leads to big speedups on float-heavy benchmarks.
NaN-boxing allows all floats to be unboxed though. The main benefit of the self-tagging approach seems to be that by boxing some floats, we can make space for 64-bit pointers which are too large for NaN-boxing.
The surprising part of the paper is that "some floats" is only a small minority of values - not, say, 50% of them.
skybrian 8 days ago [-]
A small minority, but apparently it includes all the floats you’re likely to use. It seems the insight is that you only need 8 bits of exponent in most cases. (And single-precision floating point only has 8 bits of exponent.)
Most double-precision floats are never used because they have high exponents.
bee_rider 7 days ago [-]
Hot take, but a double with only 8 bits of exponent actually seems kind of nice: you get the extra precision, but it can be cast down to a single and you only lose precision; you don't have values that are outside the range of singles.
pansa2 8 days ago [-]
> A small minority, but apparently it includes all the floats you’re likely to use.
Sorry, I meant a small minority need to be boxed - all the floats you're likely to use can remain unboxed.
adgjlsfhk1 8 days ago [-]
50% means you only get 1 tag bit.
also you totally can fit 64-bit pointers inside a NaN. Pointers are only 48 bits and you have 53 bits of NaN payload. (you could also get an extra 3 bits if you only allow storing 8-byte-aligned pointers unboxed)
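The payload arithmetic is easy to check in code. A sketch (the quiet-NaN pattern chosen is standard IEEE layout, but the specific boxing convention is illustrative):

```javascript
// NaN-boxing payload sketch: a quiet NaN has all exponent bits set plus
// the top fraction bit, leaving ample payload below - enough for a
// 48-bit pointer (and 3 more bits if pointers are 8-byte aligned).
const QNAN = 0x7ff8000000000000n;    // sign 0, exp all ones, quiet bit
const PTR_MASK = (1n << 48n) - 1n;
const boxPtr = (p) => QNAN | (p & PTR_MASK);
const unboxPtr = (v) => v & PTR_MASK;
const isBoxedPtr = (v) => (v & ~PTR_MASK) === QNAN;
```

Any bit pattern produced by `boxPtr` is a NaN (the quiet bit keeps the fraction nonzero even for a null pointer), so it can never collide with a real double.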
pansa2 8 days ago [-]
> 50% means you only get 1 tag bit.
That's enough to distinguish between "unboxed float" and "something else", where the latter can have additional tag bits.
> [64-bit] pointers are only 48 bits and you have 53 bits of NaN payload.
The paper specifically talks about support for "high memory addresses that do not fit in 48 bits". If you don't have to handle those high addresses, I don't think this approach has any benefits compared to NaN-boxing.
dzaima 8 days ago [-]
Of note is that even if you have some massive ≥2^48 data sources, you could still quite likely get away with having NaN-boxed pointers to within the low-size heap, with an extra indirection for massive data. This only would break apart if you managed to reach around 2^45 distinct referenceable objects, which you probably shouldn't ever have (esp. in a GCd language).
rtpg 8 days ago [-]
Do all float operations need to reconfirm those bits afterwards, though? I suppose if you have some sort of JIT you can end up with a bunch of unboxed floats and would only pay the cost at the boundaries.
pansa2 8 days ago [-]
> reconfirm those bits afterwards
Thanks - I hadn't thought about that but it seems to be the main downside of this approach. The benefit of NaN-boxing is that it reassigns values that are otherwise unused - floating-point calculations will never generate NaNs with those bit patterns.
AlotOfReading 8 days ago [-]
An additional wrinkle is that NaNs are a bit unstable and can have large performance penalties. You can't let the NaNs ever escape into arithmetic, and you may have issues even storing them in a register.
phire 8 days ago [-]
Yes, but there should be some optimisation opportunities.
Off the top of my head: Any multiply by a constant less than 1.0 will never overflow the unboxed range (but might underflow) and there should be times when it's provably better to check the inputs are inside a range, rather than checking the outputs.
It's worth pointing out that these overflow/underflow checks will be very cold (on typical code). They won't waste much in the way of branch-prediction resources.
I wonder if it's worth taking advantage of floating point overflow/underflow exceptions. I think a multiplication by 2^767 will trigger an exception if the value would overflow, and the corresponding multiply by 2^-765 will catch underflows.
It's tempting to allocate two more tags for floats (001 and 010), covering the entire range from -2^257 to +2^257. It will be rare to actually see those small floats near zero, but it could be worth eliminating the possibility of underflows.
ithkuil 8 days ago [-]
You check the tag before doing float operations
pansa2 8 days ago [-]
And afterwards, because floating-point arithmetic can change the value of the tag. This isn't necessary with NaN-boxing, because it uses NaN bit patterns that the hardware never generates.
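Concretely, each arithmetic result needs a cheap range test before it can stay unboxed. A sketch (the bounds, roughly 2^-255..2^256 plus zero, match the ~1.7e-77..1.7e77 figures quoted for the CRuby commit elsewhere in the thread; treat them as illustrative):

```javascript
// After an arithmetic op, a self-tagged result must be re-checked: the
// value may have left the exponent range the float tags cover, in which
// case it has to be boxed. NaN-boxing skips this check entirely, since
// hardware never generates its reserved NaN patterns.
const stillUnboxed = (x) => {
  const a = Math.abs(x);
  return x === 0 || (a >= 2 ** -255 && a < 2 ** 256);
};
```

So `1e300 * 1e300` (Infinity) fails the check and must be boxed, while ordinary magnitudes pass and never take the slow path.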
lifthrasiir 8 days ago [-]
Only when they have to be boxed, but yes if you are talking about that.
ndjdjddjsjj 8 days ago [-]
You need to check after the floating point operation though just in case. Or after the boundary where you pass the float to something else expecting this scheme.
gus_massa 8 days ago [-]
Nice explanation, but it took me a while to understand the trick. They are hiding the tag in the "exponent" of the float, not in the "mantissa"!
skybrian 8 days ago [-]
It’s clever, but not random chance. That would be too much of a coincidence. They rotate the floats to make it happen the way they want.
It’s hardly random that only 8 bits of exponent are needed for many calculations.
jonnycomputer 8 days ago [-]
Thank you for the clear explanation!
comex 8 days ago [-]
> For instance, NaN-tagging prevents (or largely complicates) optimizations relying on stack allocations. The stack uses high memory addresses that do not fit in 48 bits unless encoded relative to the location of the stack segment.
Er, what? The paper says they tested on a Xeon CPU, so x86-64, running Linux. On traditional x86-64, all pointers fit in 48 bits, period. Stack memory is no exception. More recently the architecture was extended to allow 56-bit pointers, but my impression is that Linux (like other OSes) keeps them disabled by default in userspace. According to the documentation [1]:
> Not all user space is ready to handle wide addresses. [..] To mitigate this, we are not going to allocate virtual address space above 47-bit by default.
So how would the stack end up above 47 bits? Is the documentation out of date?
The address space size limitation doesn't mean that only the least significant bits are used; the memory hole is in the middle of the address space[1].
I don't know what Linux does specifically (or under what configurations), but on some other operating systems the user space stack is in the higher half[2].
In TXR Lisp, on 64 bit targets, NaN boxing is used. Specifically, that variant whereby pointers represent themselves and a delta operation is needed to recover double precision float values. Is that the same as "NuN tagging"?
I like the "tagging" terminology better than "boxing". "NaN Boxing" specifically enables unboxed floats, so the name is stupid. Everyone seems to be using it though.
Joker_vD 7 days ago [-]
I imagine that in a "heavy numerical" part of the code one would ideally use identity-tagged floats, i.e., with no rotation applied. Instead, the (few) pointers that are used in this part would be stored in rotated state.
But thinking about how to mesh this scheme together with other, "normal" (that is, pointer-heavy) part of the code makes me slightly nauseous.
naasking 4 days ago [-]
> But thinking about how to mesh this scheme together with other, "normal" (that is, pointer-heavy) part of the code makes me slightly nauseous.
Haskell already solved this with kinds. Unboxed values are of kind # rather than the usual kind * of all other values, and you can lift a # to a * via the usual boxing operation.
> CRuby uses this technique on 64bit platforms for years.
What do you mean by "this technique"?
The paper says that CRuby uses tagged objects but could benefit from the innovation being discussed here, a specific bit pattern used to tag floats. See the following quote:
> Therefore, implementations that represent floats as tagged pointers could benefit from it with minimal implementation effort. Such popular implementations include CPython [11], CRuby [32] and Google’s V8 [33].
funny_falcon 8 days ago [-]
In fact, CRuby successfully combined “Self Tagging” with pointer tagging. Here's the commit:
Linked commit contains code for rotating tagged floats so bits 60..62 go to the least significant positions, and a comment about a range of unboxed floats between 1.7...e-77 and 1.7...e77, plus special casing 0.0.
funny_falcon 8 days ago [-]
I mean that CRuby has done "Float Self Tagging" for years. The paper is just mistaken about CRuby.
[0] https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-point...
So, for example, with 3 bits of alignment required, the first valid address for a pointer to point to after 0x0 is 0x8, and after that is 0x10, but you represent those as 0x1 and 0x2 respectively, and use a shift to get back the actual address (0x1 << 3 = 0x8, actual address). I think this is gestured at in section 1.1 of the paper, sort of, except they envision using the space thus freed for tags, rather than additional bits. (Which only makes sense if your address is 32 bits anyway, rather than 64 as in the paper: no one has 67-bit addresses. So saving 3 bits doesn't buy you anything. I think.)
> Aligning all heap-allocated values to 64-bit machine words conveniently frees the low bits of pointers to store a 3-bit tag.
If you applied the C standard strictly, you could use a JVM-style representation for pointers to longs, pointers, and structs that start with longs and pointers, so you could theoretically have an implementation where those pointers were shorter. But you'd have to convert back and forth when casting to and from void* (and char*), and in practice C people expect to be able to cast a long* to int, cast that to void*, and get the same result as casting long* to void*, even though doing that and using it is undefined behaviour according to the standard.
In many dynamic languages some values are stored on the heap ("boxed") and represented as a pointer, while others are represented as an immediate value ("unboxed" or "immediate"). Pointer tagging is a common way to do that: the low bit of the value tells you the value's type, and some types are immediate while others are boxed.
Naturally, the tag bits have a fixed value, so can't be used to store data. So for example your language might offer 61-bit integer immediates instead of 64-bit integers; the other three bits are used for tags. Possibly, larger integers are stored on the heap and treated as a different type (for example Python 2.X had separate int and long types for these cases).
However, it's hard to use this strategy for floats, because floats need all 64 bits (or 32 bits for single-precision, same difference). There's a trick called "NaN boxing" which makes use of the large number of NaNs in the float representation, but read the paper if you want more on that.
The authors' insight: suppose you have a three-bit tag and 011 is the tag for floats. By pure chance, _some_ floats will end in 011; you can represent those as immediates with those tag bits. Obviously, any given float is unlikely to match, though you can raise the chances by using, like, 010, 011, 100, and 101 all as float tags. Still, the low bits are a bad choice. But what about high bits? Most floats have one of a couple of high-bit patterns, because most floats are either 0 or between, say, 1e-100 and 1e100. Floats outside that range can be boxed, but since they're really rare it's not a big cost to box them.
So basically, we use high bits as our tag bits and map all the common float prefixes to float tags. This allows unboxing the vast majority of floats, which leads to big speedups on float-heavy benchmarks.
A personal note: I've been working in numerics and floating-point for a decade now and have had to deal with float boxing both from a research point of view (lots of runtime analysis systems for floats), from a user point of view (using unboxed float vectors for significant speedup in my own software), and from a teaching point of view (discussing boxing in my compilers class, using NaN-boxing as an example of cleverness).
This idea is so simple, so crazy, so stupid, and works so well, but I never thought of it. Bravo to the authors.
Thanks for the nice summary -- looking forward to reading the paper!
The same idea of self-tagging is actually also used in the Koka language [1] runtime system, where by default the Koka compiler only heap-allocates float64's when their absolute value is outside the range [2e-511,2e512) and not 0, infinity, or NaN (see [2]). This indeed saves many (many!) heap allocations for float-intensive programs.
Since Koka only uses 1 bit to distinguish pointers from values, another slightly faster option is to only box negative float64's, but of course negative numbers are still quite common, so that saves fewer allocations in general.
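Reading the bounds above as the powers of two 2⁻⁵¹¹ and 2⁵¹² (an assumption about the comment's notation; the real check in box.c works on the raw exponent bits), the unboxability test amounts to:

```python
import math

def koka_unboxable(d):
    # Sketch of the heuristic described above: keep the float64 unboxed
    # when it is 0, an infinity, or a NaN, or when its magnitude lies in
    # [2**-511, 2**512). Bounds assumed; not Koka's actual C code.
    if d == 0.0 or math.isinf(d) or math.isnan(d):
        return True
    return 2.0**-511 <= abs(d) < 2.0**512
```

Everyday values like 1.0 or -3.14 pass, so only extreme magnitudes pay for a heap allocation.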
[1] https://koka-lang.github.io/koka/doc/book.html#sec-value-typ...
[2] https://github.com/koka-lang/koka/blob/dev/kklib/src/box.c#L...
ps. If you enjoy reading about tagging, I recently wrote a note on efficiently supporting seamless large integer arithmetic (as used in Koka as well) and discuss how certain hardware instructions could really help to speed this up [3]:
[3] https://www.microsoft.com/en-us/research/uploads/prod/2022/0... (ML workshop 2022)
NaN-boxing allows all floats to be unboxed though. The main benefit of the self-tagging approach seems to be that by boxing some floats, we can make space for 64-bit pointers which are too large for NaN-boxing.
The surprising part of the paper is that "some floats" is only a small minority of values - not, say, 50% of them.
Most double-precision bit patterns never occur in practice, because they correspond to extreme exponents.
Sorry, I meant a small minority need to be boxed - all the floats you're likely to use can remain unboxed.
also you totally can fit 64 bit pointers inside a NaN. 64 bit pointers are only 48 bits and you have 53 bits of NaN payload. (you could also get an extra 3 bits if you only allow storing 8 byte aligned pointers unboxed)
That's enough to distinguish between "unboxed float" and "something else", where the latter can have additional tag bits.
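A toy version of that layout (the quiet-NaN prefix below is one common convention, not any specific engine's scheme):

```python
QNAN = 0x7FF8000000000000   # quiet-NaN exponent + quiet bit, used as the box marker
PTR_MASK = (1 << 48) - 1    # 48-bit canonical pointers

def box_ptr(addr):
    assert addr == addr & PTR_MASK, "pointer exceeds 48 bits"
    assert addr != 0  # payload 0 would collide with the canonical NaN
    return QNAN | addr

def is_boxed_ptr(word):
    # Hardware float ops produce the canonical quiet NaN (payload 0), so
    # in this toy scheme any quiet NaN with a nonzero low-48 payload is a pointer.
    return (word & QNAN) == QNAN and (word & PTR_MASK) != 0

def unbox_ptr(word):
    return word & PTR_MASK
```

Ordinary float bit patterns (1.0 is 0x3FF0000000000000, infinity is 0x7FF0000000000000) fail the prefix test, so they pass through as unboxed floats.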
> [64-bit] pointers are only 48 bits and you have 53 bits of NaN payload.
The paper specifically talks about support for "high memory addresses that do not fit in 48 bits". If you don't have to handle those high addresses, I don't think this approach has any benefits compared to NaN-boxing.
Thanks - I hadn't thought about that but it seems to be the main downside of this approach. The benefit of NaN-boxing is that it reassigns values that are otherwise unused - floating-point calculations will never generate NaNs with those bit patterns.
Off the top of my head: Any multiply by a constant less than 1.0 will never overflow the unboxed range (but might underflow) and there should be times when it's provably better to check the inputs are inside a range, rather than checking the outputs.
It's worth pointing out that these overflow/underflow checks will be very cold (on typical code). They won't waste much in the way of branch-prediction resources.
I wonder if it's worth taking advantage of floating point overflow/underflow exceptions. I think a multiplication by 2^767 will trigger an exception if the value would overflow, and the corresponding multiply by 2^-765 will catch underflows.
It's tempting to allocate two more tags for floats (001 and 010), covering the entire range from -2^257 to +2^257. It will be rare to actually see those small floats near zero, but it could be worth eliminating the possibility of underflows.
It’s hardly random that only 8 bits of exponent are needed for many calculations.
Er, what? The paper says they tested on a Xeon CPU, so x86-64, running Linux. On traditional x86-64, all pointers fit in 48 bits, period. Stack memory is no exception. More recently the architecture was extended to allow 56-bit pointers, but my impression is that Linux (like other OSes) keeps them disabled by default in userspace. According to the documentation [1]:
> Not all user space is ready to handle wide addresses. [..] To mitigate this, we are not going to allocate virtual address space above 47-bit by default.
So how would the stack end up above 47 bits? Is the documentation out of date?
[1] https://docs.kernel.org/arch/x86/x86_64/5level-paging.html
I don't know what Linux does specifically (or under what configurations), but on some other operating systems the user space stack is in the higher half [2].
[1] https://en.wikipedia.org/wiki/X86-64#Canonical_form_addresse...
[2] https://github.com/illumos/illumos-gate/blob/master/usr/src/...
I like the "tagging" terminology better than "boxing". "NaN Boxing" specifically enables unboxed floats, so the name is stupid. Everyone seems to be using it though.
But thinking about how to mesh this scheme together with other, "normal" (that is, pointer-heavy) part of the code makes me slightly nauseous.
Haskell already solved this with kinds. Unboxed values are of kind # rather than the usual kind * of all other values, and you can lift a # to a * via the usual boxing operation.
https://wiki.haskell.org/Unboxed_type
Edit: the commit https://github.com/ruby/ruby/commit/b3b5e626ad69bf22be3228f8...
What do you mean by "this technique"?
The paper says that CRuby uses tagged objects but could benefit from the innovation being discussed here, a specific bit pattern used to tag floats. See the following quote:
> Therefore, implementations that represent floats as tagged pointers could benefit from it with minimal implementation effort. Such popular implementations include CPython [11], CRuby [32] and Google’s V8 [33].
https://github.com/ruby/ruby/commit/b3b5e626ad69bf22be3228f8...
The linked commit contains code for rotating tagged floats so bits 60..62 go to the least significant positions, along with a comment about a range of unboxed floats between 1.7...e-77 and 1.7...e77, plus special-casing of 0.0.
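In Python pseudocode, the predicate that commit describes might look like the following (a sketch reconstructed from the description above, not CRuby's exact logic, which also excludes the boundary value itself):

```python
import struct

def flonum_p(d):
    # A double becomes an immediate "flonum" when bits 60..62 of its
    # IEEE-754 representation are 0b011 or 0b100, which corresponds
    # roughly to 1.7e-77 <= |d| < 1.7e77; 0.0 is special-cased.
    if d == 0.0:
        return True
    b = struct.unpack('<Q', struct.pack('<d', d))[0]
    return ((b >> 60) & 0x7) in (0b011, 0b100)
```

Positive doubles in range hit the 0b011 pattern and negative ones hit 0b100, so everyday values on both sides of zero stay unboxed.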