The problem is that this breaks down once you try to use SIMD instructions. I'd developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.
The true irony is that SIMD text parsing would outperform this!
nine_k 13 minutes ago [-]
I think these are different use cases. If you talk about SIMD, you talk about the CPU and efficient processing of large numbers of integers. I think that when a solution like this crops up, it's about storage or transmission, and dense packing at the cost of non-uniformity. It's more like time-series databases pack numbers by delta encoding.
the-lazy-guy 6 minutes ago [-]
Stored and transmitted data also has to be decoded. With modern datacenter hardware bottleneck is often CPU rather than network or disk (SSD). It depends on specific properties of the data. (I used to work on search index implementation which is about decoding and intersecting large amounts of hit-lists; and right SIMD-friendly varint encoding is obviously crucial)
kstenerud 10 minutes ago [-]
The thing is, most real-world numbers will fit within 1-3 bytes (even at 7 bits per byte), so ultradense packing doesn't actually buy much outside of benchmarks.
I spent WAYYYYYYYY too much time exploring this...
michaelmure 2 minutes ago [-]
One nice upside of having a single way to encode a value is fuzzing: when you work on an encoder/decoder, you can use a fuzzer and do round-trip comparison until you find crashes or inputs/outputs that don't match (and therefore issues in the code). But with LEB128 for example, the fuzzer quickly learn about those alternatives encoding and there is not much you can do from there.
stebalien 57 minutes ago [-]
I've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.
I'm working on a C++ library at work that binds wire data and application data through token and model layers, which includes among other things a fair amount of tokenizers/composers for various formats (JSON, CBOR, BSON, CSV...).
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
omoikane 43 minutes ago [-]
UTF-8 has the same issue ("overlong encoding") where multiple representations are possible the same code point. Someone proposed a similar tweak to remove the overlapping ranges by adjusting the base offset for byte sequences that are longer than 1. That was discussed here:
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
billpg 26 minutes ago [-]
I forget where I encountered it, but I've seen similar encodings that eliminated the possibility of many possible encodings for the same number by making the length part of the value.
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
In short: instead of a truly indefinite-length solution with a signal bit on the current byte saying whether to check the next byte, this uses a counter. Values 0x0 to 0xF7 are one-byte integers, 0xF8 to 0xFF use the upper 5 bits as a counter for the number of subsequent bytes. This limits the maximum magnitude to slightly less than 2 ^ 264 (almost all 33-byte values), which seems to be okay for practical computations. The proposed standard limits the supported size to u64 though.
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.
> ...adversarial input, which is rarely in the test suite.
This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
onlyrealcuzzo 19 minutes ago [-]
> I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
They might have a different definition of adversarial than you.
> My tests for quite pedestrian APIs often contain adversarial input of obvious shapes.
This doesn't seem like what I would call adversarial.
This seems like standard negative testing or boundary value analysis - which I would be shocked if they didn't do.
HansHamster 47 minutes ago [-]
It feels a bit unfair to say that it is faster by being able to tell the total length from the first byte and capping it at 64 bit, while some of the other formats can store arbitrarily large integers.
I guess you could use another variable length encoding for the prefix at the cost of some performance and using even more space...
petermcneeley 19 minutes ago [-]
esp when the number is capped at only 64 bits which is quite small for some bigInt style numbers.
willtemperley 52 minutes ago [-]
Maybe someone can explain why an encoder would ever create the padding bytes allowed in LEB128. I contributed the parser for LEB128 in apple/swift-binary-parsing and I’m still none the wiser. I’m genuinely mystified.
scottlamb 39 minutes ago [-]
I can think of two reasons.
The first is what they describe here: as an attack. It's like why would anyone ever overflow a buffer with shellcode.
It allows you to fill in padding in a buffer. For example, all data in a buffer will be interpreted by a downstream system, and someone pre-calculated the size of that buffer. Rather than encode everything twice (once to figure out the exact size needed, and a second time to actually populate the buffer) the buffer size was calculated using foreknowledge of how many values would be written to that buffer and then just pessimistically assuming all of them are max-size so writing will never fail. Another situation is when you're rewriting part of an already-encoded file. If you want to change a bit of payload then using padding bytes gives you more flexibility so you can do that without having to do any memcpy into a new buffer.
It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.
esrauch 45 minutes ago [-]
Let's say you are writing into a byte[] and have a LEB128 length-prefix followed by a payload, but that determining the length actually involves nontrivial encoding work. For example, you have a UTF16 string and want to write out a UTF8 string, you want to go over the characters and write them out, but the UTF8 length is not known without doing all of that work.
If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.
But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).
If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.
willtemperley 20 minutes ago [-]
Thank you for solving that mystery!
axod 29 minutes ago [-]
Maybe you want to byte align some data, or pack to a certain size but keep compat. I think they're going to be rare cases, but I can see it being used.
boricj 40 minutes ago [-]
Laziness probably. Maybe there's an argument if you want to avoid branches and just blast the integer out in a fixed number of statements/instructions/bytes, but that sounds a bit fringe.
I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.
layer8 46 minutes ago [-]
The issue is that non-unique encodings are an attack vector, because parsers may in practice behave differently for noncanonical (or nominally invalid) encodings.
kbolino 14 minutes ago [-]
For example, you have an envelope format that goes: length prefix in LEB128, message, signature. One party controls the length prefix and signature, a different party controls the message. The message-writing party carefully crafts the message so that, in isolation, it appears innocuous, but when wrapped in the envelope, the first few bytes of the message look like continuation of the length prefix. Best case is the receiving party safely fails to parse the message, worst case is the receiving party successfully parses the message, verifies its digital signature, and interprets it differently than the signer did.
Chaosvex 49 minutes ago [-]
You wouldn't. It's a strange argument that can be countered with, "maybe don't do that?"
willtemperley 46 minutes ago [-]
So why does the spec allow it? Like a good engineer I read the spec and tested against the over-wide example encodings given.
Chaosvex 37 minutes ago [-]
Because it's not a real standard and there is no blessed RFC for it. The DWARF spec is as close as you'll get and it says, "The integer zero is a special case, consisting of a single zero byte." So in a way, it doesn't.
Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.
Edit: to clarify, I was talking about the author's argument being strange, not yours.
willtemperley 18 minutes ago [-]
The WASM spec is more explicit about over-long LEB128 encoding.
Edit: a properly written decoder is a lot more than 10 lines if you properly deal with integer overflow and both signed and unsigned ints.
RedShift1 1 hours ago [-]
This seems quite convoluted just to avoid the "0 can be represented in more than one way" problem.
bjoli 42 minutes ago [-]
Having all numbers be valid in only one way is a great idea. So much that I believe webassembly enforced canonical leb128, at the cost of decoding speed.
And say you have it as part of some other data. If you want to be able to hash it by the raw memory bytes, many different ways to represent a number becomes a problem.
nine_k 60 minutes ago [-]
It allows finding out the length (and allocating memory) after reading the first byte.
ape4 50 minutes ago [-]
Comparing a number to zero is something that's done a lot
Chaosvex 45 minutes ago [-]
True but also not particularly relevant?
ahoka 57 minutes ago [-]
I think it's neat.
cantalopes 42 minutes ago [-]
I love the random hyperlink underlines on that page
alex-reyss 15 minutes ago [-]
[dead]
amluto 21 minutes ago [-]
Just a quick reminder:
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.
Rendered at 16:47:11 GMT+0000 (Coordinated Universal Time) with Vercel.
But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time.
The true irony is that SIMD text parsing would outperform this!
I spent WAYYYYYYYY too much time exploring this...
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.
[1]: https://github.com/multiformats/multicodec
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
Does this encoding have a name?
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.
> ...adversarial input, which is rarely in the test suite.
This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
They might have a different definition of adversarial than you.
> My tests for quite pedestrian APIs often contain adversarial input of obvious shapes.
This doesn't seem like what I would call adversarial.
This seems like standard negative testing or boundary value analysis - which I would be shocked if they didn't do.
The first is what they describe here: as an attack. It's like why would anyone ever overflow a buffer with shellcode.
The second is that they are implementing a spec that requires appending a varint length-prefixed field to a buffer but don't really care about the space optimization, don't know the field's length when they start appending it, and don't want to put the field into a second, temporary buffer or slide it down into place. https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08f47081... vs say my own code which does the slide: https://github.com/scottlamb/retina/blob/6972ac4261ce7bf5b58...
It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.
If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.
But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).
If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.
I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.
Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.
Edit: to clarify, I was talking about the author's argument being strange, not yours.
Edit: a properly written decoder is a lot more than 10 lines if you properly deal with integer overflow and both signed and unsigned ints.
And say you have it as part of some other data. If you want to be able to hash it by the raw memory bytes, many different ways to represent a number becomes a problem.
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.