Memory access is O(N^[1/3]) (vitalik.eth.limo)
hardwaresofton 3 minutes ago [-]
The meat of the article:

> In my binary field code, I found that an 8-bit precomputation table (in that context, 2^24 items taking up 128 MB) led to faster computations than a 16-bit precomputation table (2^32 items taking up 8 GB): while the latter fit into RAM, the former could fit into cache, and the faster access time of the former was decisive.

Interesting to see some first principles thinking to back out the “why”.
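
If you want to see that cliff on your own machine, here is a minimal sketch (not the author's binary-field code; the 64 KB / 256 MB sizes and the probe count are arbitrary, and CPython interpreter overhead mutes a gap that compiled code would show much more starkly):

    import os, random, time

    def probe(table, n_probes=5_000_000):
        # Random indices defeat the hardware prefetcher, so each lookup's cost
        # is dominated by where the table lives in the memory hierarchy.
        idx = [random.randrange(len(table)) for _ in range(n_probes)]
        acc = 0
        start = time.perf_counter()
        for i in idx:
            acc ^= table[i]
        return time.perf_counter() - start

    small = os.urandom(1 << 16)   # 64 KB  -- fits comfortably in L1/L2 cache
    large = os.urandom(1 << 28)   # 256 MB -- spills well past L3 into DRAM

    for name, table in (("64 KB table", small), ("256 MB table", large)):
        print(f"{name}: {probe(table):.2f} s for 5M random probes")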

j2kun 5 hours ago [-]
> The empirical argument

> We can ask a question: how long (in nanoseconds) does it take to access a type of memory of which an average laptop has N bytes? Here's GPT's answer:

"Here's what GPT says" is not an empirical argument. If you can't do better than that (run a benchmark, cite some literature), why should I bother to read what you wrote?

raincole 2 hours ago [-]
The cool thing about "here's what GPT says" is that you can make GPT say whatever you want!

https://chatgpt.com/share/68e6eeba-8284-800e-b399-338e6c4783...

https://chatgpt.com/share/68e6ef4a-bdd0-800e-877a-b3d5d4dc51...

mochomocha 1 hours ago [-]
The article started really well, and I was looking forward to the empirical argument.

Truly mind-boggling times where "here is the empirical proof" means "here is what chatGPT says" to some people.

LegionMammal978 7 hours ago [-]
Though at the limit, it would have to be at least O(sqrt(n)) thanks to the Bekenstein bound [0]. And of course, as mentioned in TFA, you can always do better if you can get away with local random access in parallel, rather than global random access.

[0] https://en.wikipedia.org/wiki/Bekenstein_bound
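
For reference, the rough geometry behind that sqrt(n) floor (the usual holographic-style argument, sketched, not a claim from TFA):

    max storable bits N ∝ bounding surface area ∝ R^2
    ⇒ radius needed to hold N bits: R ∝ N^(1/2)
    ⇒ a round-trip signal takes ≥ 2R/c, so a single random access is Ω(N^(1/2))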

dooglius 6 hours ago [-]
If one wants to start talking cosmology, it's unlikely to be the case that arbitrarily long-lived computers are possible; I don't think any of the theories in [0] are conducive to either an infinite-time or infinite-memory computer, so the strict mathematical definition of Big-O doesn't hold up. IMO it's better to use Big-O as an effective theory for predicting runtime on human-scale computers than to take the mathematical formalism too literally.

[0] https://en.wikipedia.org/wiki/Ultimate_fate_of_the_universe?...

jychang 4 hours ago [-]
> infinite-time or infinite-memory computer

That doesn't apply for the Bekenstein Bound though.

Literally the first line of the wikipedia article:

> In physics, the Bekenstein bound (named after Jacob Bekenstein) is an upper limit on the thermodynamic entropy S, or Shannon entropy H, that can be contained within a given *finite* region of space which has a *finite* amount of energy—or equivalently, the maximum amount of information that is required to perfectly describe a given physical system down to the quantum level.

hnuser123456 2 hours ago [-]
Okay, then we just use a bunch of tiny black holes and pack some extra dark energy between them, then we can get back to volumetric scaling. In fact, since dark matter isn't yet fully understood, once we can harness it, we can fit several times as much black hole surface area in the same space as a singular black hole.
dooglius 2 hours ago [-]
I'm arguing against the use of Big-O "in the limit" as GP puts it; our tech is far away from that limit and O(N^{1/3}) is a better model.
Dylan16807 7 hours ago [-]
Once you're talking about computers that verge on being black holes, I think you're making too many assumptions about how everything works to give it a speed rating.
jychang 4 hours ago [-]
Strong disagree- this is a fundamental core aspect of information theory.
Dylan16807 3 hours ago [-]
The core aspect of information theory is how much information fits in a sphere. Getting from there to memory access latency in a real computer is several abstractions away and a lot of those abstractions might not hold.
boothby 3 hours ago [-]
It says that for a sufficiently large storage system, the information within will ultimately be limited by the surface area and not the volume. That you can indeed judge a book by its cover. For the sake of asymptotic analysis of galactic algorithms, one need only consider schemes for reading and writing information on the surface of a sphere. Where it comes to "real hardware," this sort of analysis is inapplicable.
Dylan16807 3 hours ago [-]
The book you are presenting says nothing about latency. I judge the book as fine but not answering the right question.
justinpombrio 3 hours ago [-]
Could you name one that seems likely to fail?
Dylan16807 3 hours ago [-]
Off the top of my head: Assuming there's a specific point the data needs to get to. Assuming the size of the data sphere doesn't impact the speeds of anything inside it. Assuming we're using a classical computer. Assuming the support scaffolding of the computer stays a fixed percentage of the mass and doesn't eat into the data budget.

And I know some of those still fit into "at least" but if one of those would make it notably worse than sqrt(n) then I stand by my claim that it's a bad speed rating.

gpm 1 hours ago [-]
Time dilation for one. As you get closer to a black hole, an observer far away sees it as if time is slowing down for you. Not really sure how this changes the Big O analysis.
hinkley 5 hours ago [-]
Ultimately I think we will go to full NUMA like Sun and others tried. Instead of having L4 and then L5 caches, each core simply has 4GB of local working memory and you use programming languages that are ready for this. Erlang would be easy, and I think Rust has the semantics to make it work but it might take years of compiler advancement to make it efficient.

All shared state is communicated through shared memory pool that is either accessed directly through segregated address ranges or via DMA.

arka2147483647 2 hours ago [-]
Honestly, I doubt it. That exposes many details to programmers that many of them would prefer not to know.

The higher level the language, the less interest there is to manually manage memory. It is just something to offload to the gc/runtime/etc.

So, I think this is a no-go. The market won't accept it.

hinkley 46 minutes ago [-]
You already don’t have a choice. The reason we are all in the cloud is that hardware stopped scaling properly vertically and had to scale horizontally, and we needed abstractions that kept us from going insane doing that.

If you really want to dumb down what I’m suggesting, it’s tantamount to blade servers with a better backplane, treating the box as a single machine instead of a cluster. If IPC replaces a lot of the RPC, you kick the Amdahl’s Law can down the road at least a couple of process cycles before we have to think of more clever things to do.

We didn’t have any of the right tooling in place fifteen years ago when this problem first started to be unavoidable, but is now within reach, if not in fact extant.

dadoomer 1 hours ago [-]
You don't need every programmer to leverage the architecture for the market to accept it, just a few that hyper-optimize an implementation to a highly valuable problem (e.g., ads or something like that).
Legend2440 6 hours ago [-]
I've seen this idea before.

Similar article on the same topic from 2014: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

samuelknight 2 hours ago [-]
I disagree with this model because it assumes processing occurs at a point and memory is (optimally) distributed across space around it in every direction in an analog to a Von Neumann CPU architecture. However it is entirely possible to distribute compute with memory. For example, Samsung has a technology called PIM (Processing in Memory) where simple compute units are inserted inside HBM memory layers. Algorithms that can take advantage of this run much faster and at much lower power because it skips the bus entirely. More importantly, the compute scales in proportion to the memory size/space.
haberman 2 hours ago [-]
The article says exactly this in bold at the bottom:

> If you can break up a task into many parts, each of which is highly local, then memory access in each part will be O(1). GPUs are already often very good at getting precisely these kinds of efficiencies. But if the task requires a lot of memory interdependencies, then you will get lots of O(N^⅓) terms. An open problem is coming up with mathematical models of computation that are simple but do a good job of capturing these nuances.
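
The same idea at laptop scale, as a rough sketch (pure Python, so interpreter overhead blurs the gap, and you may want to raise n if your L3 cache is larger than 16 MB): the two loops below touch exactly the same bytes, one with high locality and one without.

    import os, time

    n = 4096
    buf = os.urandom(n * n)   # 16 MB, treated as an n x n byte matrix

    def row_major():
        # Sequential walk: each cache line is fully used and the prefetcher
        # stays ahead, so the per-byte cost is effectively O(1).
        acc = 0
        for i in range(n):
            base = i * n
            for j in range(n):
                acc ^= buf[base + j]
        return acc

    def col_major():
        # Stride-n walk: nearly every access lands on a fresh cache line,
        # so the "far memory" latency is paid on almost every element.
        acc = 0
        for j in range(n):
            for i in range(n):
                acc ^= buf[i * n + j]
        return acc

    for fn in (row_major, col_major):
        t0 = time.perf_counter()
        fn()
        print(f"{fn.__name__}: {time.perf_counter() - t0:.1f} s")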

cosmos0072 7 hours ago [-]
The math looks suspicious to me, or at least how it is presented.

If, as stated, accessing one register requires ~0.3 ns and available registers sum up to ~2560 B, while accessing RAM requires ~80 ns and available RAM is ~32 GiB, then it means that memory access time is O(N^1/3) where N is the memory size.

Thus accessing the whole N bytes of memory of a certain kind (registers, or L1/L2/L3 cache, or RAM) takes N * O(N^1/3) = O(N^4/3).

One could argue that the title "Memory access is O(N^1/3)" refers to memory access time, but that contradicts the very article's body, which explains in detail "in 2x time you can access 8x as much memory" both in text and with a diagram.

Such a statement would require that accessing the whole N bytes of memory of a certain kind requires O(N^1/3) time, while the measurements themselves produce a very different estimate: accessing the whole N bytes of memory of a certain kind requires O(N^4/3) time, not O(N^1/3).
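
For what it's worth, the per-access numbers quoted above do fit the cube-root scaling reasonably well as a quick check:

    (32 GiB / 2560 B)^(1/3) ≈ (1.34 × 10^7)^(1/3) ≈ 238
    238 × 0.3 ns ≈ 71 ns, in the same ballpark as the ~80 ns quoted for RAM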

timerol 7 hours ago [-]
I did not interpret the article as you did, and thought it was clear throughout that the author was talking about an individual read from memory, not reading all of a given amount of memory. "Memory access, both in theory and in practice, takes O(N^⅓) time: if your memory is 8x bigger, it will take 2x longer to do a read or write to it." Emphasis on "a read or write".

I read "in 2x time you can access 8x as much memory" as "in 2x time you can access any byte in 8x as much memory", not "in 2x time you can access the entirety of 8x as much memory". Though I agree that the wording of that line is bad.

In normal big-O notation, accessing N bytes of memory is already O(N), and I think it's clear from context that the author is not claiming that you can access N bytes of memory in less time than O(N).

hinkley 5 hours ago [-]
Nobody has ever had this confusion about the access time of hash tables except maybe in the introductory class. What you’re describing is the same reasoning as any data structure. Which is correct. Physical memory hierarchies are a data structure. Literally.

I’m confused by GP’s confusion.

tsimionescu 5 hours ago [-]
This is completely false. All regularly cited algorithm complexity classes are based on estimating a memory access as an O(1) operation. For example, if you model memory access as O(N^1/3), linear search worst case is not O(N), it is O(N^4/3): in the worst case you have to make N memory accesses and N comparisons, and if each memory access takes N^1/3 time, this requires N^4/3 + N time, which is O(N^4/3).
Kranar 15 minutes ago [-]
This is untrue: algorithms' time complexity classes are measured with respect to specific operations. For example, comparison-based sorting algorithms are typically cited as O(n log n), but that refers to the number of comparisons, irrespective of the complexity of memory accesses. It's possible that comparing two values has time complexity O(2^N), in which case sorting such a data structure would be impractical, and yet it would still be the case that the sorting algorithm itself has a time complexity of O(N log N), because time complexity is with respect to some given set of operations.

Another common scenario where this comes up, and actually results in a great deal of confusion and misconceptions, is hash maps, which are said to have a time complexity of O(1). That does not mean that if you actually benchmark the performance of a hash map with respect to its size, the graph will be flat or asymptotically approach a constant value. Larger hash maps are slower to access than smaller hash maps, because the O(1) isn't intended to be a claim about the overall performance of the hash map as a whole, but rather a claim about the average number of probe operations needed to look up a value.

In fact, in the absolute purest form of time complexity analysis, where the operations involved are literally the transitions of a Turing machine, memory access is not assumed to be O(1) but rather O(n).
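
That hash-map point is easy to reproduce. A minimal sketch (sizes are arbitrary, the largest dict needs a few GB of RAM, and CPython adds overhead of its own): the expected number of probes stays O(1), but the nanoseconds per lookup climb as the table outgrows each cache level.

    import random, time

    def avg_lookup_ns(n, n_lookups=2_000_000):
        d = {i: i for i in range(n)}
        keys = [random.randrange(n) for _ in range(n_lookups)]
        acc = 0
        start = time.perf_counter()
        for k in keys:
            acc += d[k]          # still O(1) expected probes at every size
        return (time.perf_counter() - start) / n_lookups * 1e9

    for n in (10_000, 1_000_000, 30_000_000):
        print(f"{n:>10} keys: {avg_lookup_ns(n):6.1f} ns per lookup")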

feoren 3 hours ago [-]
> linear search worse case is not O(N), it is O(N^4/3)

No, these are different 'N's. The N in the article is the size of the memory pool over which your data is (presumably randomly) distributed. Many factors can influence this. Let's call this size M. Linear search is O(N) where N is the number of elements. It is not O(N^4/3), it is O(N * M^1/3).

There's a good argument to be made that M^(1/3) should be considered a constant, so the algorithm is indeed simply O(N). If you include M^(1/3), why are you not also including your CPU speed? The speed of light? The number of times the OS switches threads during your algorithm? Everyone knows that an O(N) algorithm run on the same data will take different speeds on different hardware. The point of Big-O is to have some reasonable understanding of how much worse it will get if you need to run this algorithm on 10x or 100x as much data, compared to some baseline that you simply have to benchmark because it relies on too many external factors (memory size being one).

> All regularly cited algorithm complexity classes are based on estimating a memory access as an O(1) operation

That's not even true: there are plenty of "memory-aware" algorithms that are designed to maximize the usage of caching. There are abstract memory models that are explicitly considered in modern algorithm design.

tsimionescu 3 hours ago [-]
You can model things as having M be a constant - and that's what people typically do. The point is that this is a bad model that breaks down when your data becomes huge. If you're trying to see how an algorithm will scale from a thousand items to a billion items, then sure - you don't really need to model memory access speeds (though even this is very debatable, as it leads to very wrong conclusions, such as thinking that adding items to the middle of a linked list is faster than adding them to the middle of an array, for large enough arrays - which is simply wrong on modern hardware).

However, if you want to model how your algorithm scales to petabytes of data, then the model you were using breaks down, as the cost of memory access for an array that fits in RAM is much smaller than the cost of memory access for the kind of network storage that you'll need for this level of data. So, for this problem, modeling memory access as a function of N may give you a better fit for all three cases (1K items, 1G items, and 1P items).

> That's not even true: there are plenty of "memory-aware" algorithms that are designed to maximize the usage of caching.

I know they exist, but I have yet to see any kind of popular resource use them. What are the complexities of Quicksort and Mergesort in a memory aware model? How often are they mentioned compared to how often you see O(N log N) / O(N²)?

dooglius 2 hours ago [-]
> if you model memory access as O(N^1/3), linear search worse case is not O(N), it is O(N^4/3)

This would be true if we modeled memory access as Theta(N^{1/3}) but that's not the claim. One can imagine the data organized/prefetched in such a way that a linear access scan is O(1) per element but a random access is expected Theta(N^{1/3}). You see this same sort of pattern with well-known data structures like a (balanced) binary tree; random access is O(log(n)), but a linear scan is O(n).
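
That distinction is easy to demonstrate with a small sketch (the 64 MB buffer size is arbitrary, and Python overhead dampens the effect): both scans below touch the same cache lines, once in prefetch-friendly order and once in random order.

    import os, random, time

    N = 1 << 26                   # 64 MB buffer, well past L3 on most machines
    buf = os.urandom(N)

    def scan(order):
        acc = 0
        start = time.perf_counter()
        for i in order:
            acc ^= buf[i]
        return time.perf_counter() - start

    sequential = range(0, N, 64)  # one access per 64-byte cache line, in order
    shuffled = list(sequential)
    random.shuffle(shuffled)      # the same ~1M accesses, in random order

    print(f"sequential: {scan(sequential):.2f} s")
    print(f"shuffled:   {scan(shuffled):.2f} s")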

squirrellous 44 minutes ago [-]
I think the article glossed over a bit about how to interpret the table and the formula. The formula is only correct if you take into account the memory hierarchy, and think of N as the working set size of an algorithm. So if your working set fits into L1 cache, then you get L1 cache latency, if your working set is very large and spills into RAM, then you get RAM latency, etc.
gowld 7 hours ago [-]
> "in 2x time you can access 8x as much memory"

is NOT what the article says.

The article says (in three ways!):

> if your memory is 8x bigger, it will take 2x longer to do a read or write to it.

> In a three-dimensional world, you can fit 8x as much memory within 2x the distance from you.

> Double the distance, eight times the memory.

The key words there are "a", which means a single access, and "distance", which is a measure of time.

N is the amount of memory, and O() is the time to access an element of memory.

hinkley 5 hours ago [-]
The operation GP is thinking of is a full scan, and that will always take n(n^(1/3)) lower bound time. Though if done right all of that latency will be occupied with computation and allow people to delude themselves into thinking it doesn’t matter.

But when something is constrained two or three ways, it drastically reduces the incentive to prioritize tackling any one of the problems with anything but small incremental improvements.

Dylan16807 3 hours ago [-]
> The operation GP is thinking of is a full scan, and that will always take n(n^(1/3)) lower bound time.

It doesn't. Full scans are faster than accessing each memory address in an unordered way.

Let's look at a Ryzen 2600X. You can sustain 32 bytes per cycle from L1, 32 bytes per cycle from L2, and 20 bytes per cycle from L3. That's 64KB, 512KB, and 16MB caches all having almost the same bandwidth despite very different latencies.

You can also imagine an infiniband network that fills 2 racks, and another one that fills 50000 racks. The bandwidth of a single node is the same in both situations, so even though latency gets worse as you add more nodes and hops, it's going to take exactly O(n) time for a single thread to scan the entire memory.

You can find correlations between memory size and bandwidth, but they're significantly weaker and less consistent than the correlations between memory size and latency.

hinkley 41 minutes ago [-]
I don’t think I can agree to ignore the latency problem. Intermachine Latency is the source of a cube root of N term.
wazdra 3 hours ago [-]
I don't think we have such a lower bound: from a “theoretical" point of view (in the sense of the post), your processor could walk on the cube of memory and collect each bit one by one. Each move+read costs O(1) (if you move correctly), so you get O(n) to read the whole cube of n bits.

If you require the full scan to be done in a specific order however, then indeed, in the worst case, you have to go from one end of the cube to the other between reads, which incurs a O(n^{1/3}) multiplicative cost. Note that this does not constitute a theoretical lower bound: it might be possible to detect those jumps and use the time spent in a corner to store a few values that will become useful later. This does look like a fun computational problem; I don't know its exact worst-case complexity.

dheera 6 hours ago [-]
I hate big O notation. It should be O(N) = N^(1/3)

That way O is the function and it's a function of N.

The current way it's notated, O is effectively the inverse function.

pfortuny 6 hours ago [-]
Yes, the notation seems wrong, but it is because O(...) is a set, not a function. The function is what goes inside.

So it should be f€O(...) instead of f=...

(don't know how to write the "belongs" symbol on iOS).

singhrac 6 hours ago [-]
Here you go: ∈ (so f ∈ (...))
tsimionescu 4 hours ago [-]
O(N) is not a function, though, so the notation is doing a good job of saying this. When we say "the complexity class of an algorithm is O(log N)", we mean "the function WorstCaseComplexity(N) for that algorithm is in the class O(log N)". The function WorstCaseComplexity(N) measures how much time the algorithm will take for any input of size N in the worst case scenario. We can also say "the average case complexity of quicksort is O(N log N)", which means "the function AverageCaseComplexity(N) for quicksort is in the class O(N log N)", where AverageCaseComplexity(N) is a function that measures how much time quicksort will need to finish for an "average" input of size N.

Saying that a function f(n) is in the class O(g(n)) means that there exist an M and a C such that f(n) < C * g(n) for any n > M. That is, it means that, past some point, f(n) is always lower than C * g(n), for some constant C. For example, we can say the function f(n) = 2n + 1 is in the class O(n), because 2n + 1 < 7n for any n > 1. Technically, we can also say that our f(n) is in the complexity class O(2^n), because 2n + 1 < 2^n for any n >= 3, but people don't normally do this. Technically, what we typically care about is not O(f(n)), it's more of a "least upper bound", which is closer to big_theta(n).
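
Compactly, the definition being used:

    f(n) ∈ O(g(n))  ⟺  there exist C > 0 and M such that f(n) ≤ C·g(n) for all n > M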

purplesyringa 5 hours ago [-]
No it shouldn't. The function you're talking about is typically called T(N), for "time". The problem is that you can't write T(N) = N^(1/3) because it's not exactly N^(1/3) -- for one thing, it's approximate up to a constant factor, and for another thing, it's only an upper bound. Big-O solves both of these issues: T(N) = O(N^(1/3)) means that the function T(N) grows at most as fast as N^(1/3) (i.e.: forms a relationship between the two functions T(N) and N^(1/3)). The "T(N) =" is often silent, since it's clear when we're talking about time, so at the end you just get O(N^(1/3)).
bigbuppo 6 hours ago [-]
All their numbers are immediately suspect since they admit to using ChatGPT to get their numbers. Oh, and their wonderful conclusion that something that fits fully in the CPU's caches is faster than something sitting a few hops away in main memory.
aapoalas 6 hours ago [-]
I do like the idea of this, but after writing a longer response explaining my positive view on this I came to a different conclusion: I was thinking this'd be a possibly useful measure for programs running on CPUs with contention, where your data will occasionally drop out of cache because the CPU is doing something else. But in that sort of a situation, you'd expect the L1 memory speed to be overtaken _before_ L1 memory size is reached. This function instead fits fairly well to the actual L1 size (as given by ChatGPT anyway), meaning that it's best thought of as a measure of random access speed on an uncontended CPU.

That being said, I do still like the fundamental idea of figuring out a rough but usable O-estimate for random memory access speeds in a program. It never hurts to have more quick estimation tools in your toolbox.

hinkley 4 hours ago [-]
I had set associative caching on a final exam in 1993. We’ve had it for a long long time because it substantially improves the behavior with worst case eviction pattern relative to naive caches. The sophistication has crept up over time. But if somehow that trick had been missed by all computer engineers, I firmly believe we would have been having this discussion 25-30 years ago.
AdamH12113 6 hours ago [-]
I don't think this holds up. Historically, memory sizes have increased exponentially, but access times have gotten faster, not slower. And since the access time comes from the memory architecture, you can get 8 GB of RAM or 64 GB of RAM with the same access times. The estimated values in the table are not an especially good fit (30-50% off) and get worse if you adjust the memory sizes.

Theoretically, it still doesn't hold up, at least not for the foreseeable future. PCBs and integrated circuits are basically two-dimensional. Access times are limited by things like trace lengths (at the board level) and parasitics (at the IC level), none of which are defined by volume.

ta12653421 5 hours ago [-]
Not true, because then in theory you could build just L1/L2/L3 cache with 64 GB instead of 1-2 MB. For SRAM in L1/L2/L3 you need to manufacture 6 transistors per bit, while for DRAM you need 1 transistor and 1 capacitor. That would mean chips at that speed would become very big, and the travel time of signals through the wires would start to matter: at the semiconductor level it makes a difference whether you need to travel 1 inch or 10 inches billions of times per second, creating an "efficiency border" on how big your SRAM can be as a function of chip size (and other factors like thermal effects).

Source: "What every Programmer should know about memory" https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

hinkley 5 hours ago [-]
You’re cheating and I don’t think you realize it.

Why didn’t computers have 128 terabytes of memory ten years ago? Because the access time would have been shit. You’re watching generation after generation of memory architectures compromise between access time and max capacity and drawing the wrong conclusions. If memory size were free we wouldn’t have to wait five years to get twice as much of it.

zahlman 5 hours ago [-]
There are also economic considerations, power use, etc.
hinkley 4 hours ago [-]
On the whole I agree, but the details keep bumping back into my assertion. Power use was a factor of Dennard scaling until very recently. So again you just wait until the next hardware generation and then trade a little time for more space.
tsimionescu 5 hours ago [-]
While in absolute terms memory access has gotten faster, in relative terms it is MUCH slower today, compared to CPU speeds.

A modern CPU can perform hundreds or even thousands of computations while waiting for a single word to be read from main memory - and you get another order of magnitude slowdown if we're going to access data from an SSD. This used to be much closer to 1:1 with old machines, say in the Pentium 1-3 era or so.

And regardless of any speedup, the point remains as true today as it has always been: the more memory you want to access, the slower accessing it will be. Retrieving a word from a pool of 50PB will be much slower than retrieving a word from a pool of 1MB, for various fundamental reasons (even address resolution has an impact, even if we want to ignore physics).

Legend2440 5 hours ago [-]
Memory access times have not significantly improved in many years.

Memory bandwidth has improved, but it hasn't kept up with memory size or with CPU speeds. When I was a kid you could get a speedup by using lookup tables for trig functions - you'd never do that today, it's faster to recalculate.

2D vs 3D is legit, I have seen this law written down as O(sqrt N) for that reason. However, there's a lot of layer stacking going on on memory chips these days (especially flash memory or HBM for GPUs) so it's partially 3D.

marcosdumay 4 hours ago [-]
You are correct in that the OP used the wrong metric. He should have written a big omega, not a big O.

> PCBs and integrated circuits are basically two-dimensional.

Yes, which pushes the complexity to Ω(n^1/2); that fits the original claim.

> Access times are limited by things like trace lengths

Again, Ω(n^1/2)

> and parasitics

And those are Ω(n)

So, as you found, in practice it's much worse, but the limit in the article is also there.

Rhapso 3 hours ago [-]
You have to cool memory when it reads or writes, so accessible memory scales with surface area, not volume. O(N^0.5) becomes the new floor.

If it turns out the universe is holographic, then you have the same bound. This might just be enforcement of the holographic principle...

bawolff 6 hours ago [-]
This doesn't seem all that compelling to me - the practical argument relied on fitting into cache which is going to be more like a step function than N^1/3.

As far as the theoretical argument... idk could be true but i suspect this is too simple a model to give useful results.

hinkley 5 hours ago [-]
Order is about approaching infinity.

Bucketed response times still follow a curve as the X axis goes to infinity.

It’s really the same for addition and multiplication. If the add fits into a register it seems like it’s O(1), but if you’re dealing with Mersenne prime candidates the lie becomes obvious.

That we aren’t acknowledging any of this with cloud computing is quite frustrating. You can’t fit the problem on one core? Next bucket. Can’t fit it in one server? Next bucket. One rack? One data center? So on and so forth.

Order of complexity tells us which problems we should refuse to take on at all. You always have to remember that and not fool yourself into thinking the subset that is possible is the entire problem space. It’s just the productive slice of it.

strongpigeon 7 hours ago [-]
Is it just me, or does it feel like the "empirical argument" is correct (it is an observation after all), but the "theoretical argument" is wildly off?

My understanding is that different levels of cache and memory are implemented pretty differently to optimize for density/speed. As in, this scaling is not the result of some natural geometric law, but rather because it was designed this way by the designers for those chips to serve the expected workloads. Some chips, like the mainframe CPUs by IBM have huge caches, which might not follow the same scaling.

I'm no performance expert, but this struck me as odd.

zamadatix 7 hours ago [-]
The theoretical argument seems sound, but it does ignore there are massive constant factors in current implementations beyond just the theoretical limit alone (particularly cost and heat) and skips directly explaining why those end up having similar growth rates.

The actual formula used (at the bottom of the ChatGPT screenshot) includes corrections for some of these factors, without them it'd have the right growth rate but yield nonsense.

strongpigeon 7 hours ago [-]
I guess you're right in a purely geometric sense. It's just that it seems almost silly to consider that given that (AIUI) the 3D geometric constraints don't impact the memory access latency at all for now (and likely for any reasonable period of time).

Like you said, thermal and cost constraints dwarf the geometrical one. But I guess my point is that they make it a non-issue, and therefore it isn't a sound theoretical explanation as to why memory access is O(N^[1/3]).

zamadatix 6 hours ago [-]
Should thermal and cost constraints at scale not also tend to relate to the volume of the individual components in the same way (ignoring constant factors) as the growth factors for an idealized memory structure around the CPU itself? In a more literal sense: the size and quantity of transistors (or other alternative units) also describe the cost, heat dissipation, and volume of the memory simultaneously. Tweaking any of the parameters still ultimately results in a "how much can we handle in that volume of product" equation, which will be the ultimate bound.

The difference is we spread them out into differently optimized volumes instead of building a homogeneous cube, which is (most likely IMO) where most of the constant factors come from.

I think this is the part the article glossed over to just get to showing the empirical results, but I also don't feel it's an inherently unreasonable set of assumptions. At the very least, matches what the theoretical limit would be in the far future even if it were to only happen to coincidentally match current systems for other reasons.

hinkley 5 hours ago [-]
Thermal is a huge issue because Dennard Scaling has been dead for a long time. We are kind of limping along with Moore but anything that looks like Dennard is going to involve a change of materials or new chemistry.

I get the impression that backside power was the last big Dennard-adjacent win, and that’s more about relocating some of the heat to a spot closer to the chip surface, which gives thermal conductivity a boost since the heat has to move a shorter distance to get out. I think that leaves gallium, optical communication to create less heat in the bulk of the chip, and maybe new/cheaper silicon-on-insulator improvements for less parasitic loss? What other things are still in the bag of tricks? Because smaller gates isn’t.

wmf 4 hours ago [-]
> geometric constraints don't impact the memory access latency at all for now

I don't know; every time Intel/AMD increase cache size it also takes more cycles. That sounds like a speed of light limit.

hinkley 5 hours ago [-]
How you gonna pack bits onto a physical chip except to put them into a cube? What’s the longest path in a cube? What’s the average path length in a cube? They’re all functions of the surface area of the cube.
strongpigeon 5 hours ago [-]
Yes, but that's not at all how we're packing bits into physical chips right now. The fact that access latency is O(N^[1/3]) has nothing to do with this, it's just that the relative size of caches and memories have been designed that way.

That this latency is of the same order as if you were putting bits in a cube is more coincidental than anything.

hinkley 4 hours ago [-]
A bound can exist due to multiple factors. If you fixed the other bound, speed of light would still dictate a growth rate higher than the volume of a cube.
CamperBob2 5 hours ago [-]
Hypercube, e.g. the Connection Machine: https://www.tamikothiel.com/theory/cm_txts/index.html
hinkley 5 hours ago [-]
That is artistic license, and you know it. Do you have access to a tesseract? If so why the hell are you on HN?

Edit to add: And on further reflection, this won’t save you. Because the heat production in the hypercube is a function of internal volume, but the heat dissipation is a function of the interface with 3D space, so each hypercube is limited in size, and must then be separated in time or space to allow that heat to be transferred away, which takes up more than O(n^(1/3)) distance and distance means speed of light delays.

CamperBob2 3 hours ago [-]
Sounds like you've got it all figured out, all right.
hinkley 1 hours ago [-]
There is no beating the speed of light. You cannot stack servers like cordwood. So no matter what toroidal internetworking architecture you make, the maximum interconnect length is always, always a function of the dimensions of the devices, typically with a constant factor of 5x to 10x on top.

Star interconnects are limited to the surface area of each device, which is the cube root of the volume. Because you have to have space to plug the wires in.

“You’ve got it all figured out” is deflecting basic physics facts. What is your clever solution to data center physics?

andy99 5 hours ago [-]
I’ve worked a bit comparing precomputed lookup tables with computing the result in real time, and in many cases, e.g. a multiply operation will be faster than the corresponding lookup. Point being that, for the example he gives, it’s also necessary to know how optimized the machine is for performing the computation you want to look up.
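
The only way to settle it per machine is to measure; a toy harness along these lines can do it (the multiply-by-7 stand-in and the table size are made up, and CPython overhead narrows whatever gap compiled code would show):

    import random, time

    TABLE_BITS = 20
    table = [i * 7 for i in range(1 << TABLE_BITS)]   # precomputed results
    xs = [random.randrange(1 << TABLE_BITS) for _ in range(2_000_000)]

    def timed(label, fn):
        acc = 0
        start = time.perf_counter()
        for x in xs:
            acc += fn(x)
        print(f"{label}: {time.perf_counter() - start:.2f} s")

    timed("table lookup", lambda x: table[x])   # memory-bound once the table falls out of cache
    timed("recompute   ", lambda x: x * 7)      # ALU-bound, nothing to evict
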
hinkley 5 hours ago [-]
I’m having a difficult time adapting to this new reality. Memoization has been faster for most of my career, and if used carefully also improves reading comprehension.

Having to go back to inlining calculations is going to hurt my soul in tiny ways.

amelius 2 hours ago [-]
Not if the data is on a tape in a Turing machine.
luizfelberti 5 hours ago [-]
Ah yes, pretending we can access infinite amounts of memory instantaneously or in a finite/bounded amount of time is the achilles heel of the Von Neumann abstract computer model, and is the point where it completely diverges from physical reality.

Acknowledging that memory access is not instantaneous immediately throws you into the realm of distributed systems though and something much closer to an actor model of computation. It's a pretty meaningful theoretical gap, more so than people realize.

hinkley 5 hours ago [-]
I would like to see someone pick up Knuth’s torch and formulate a new order of complexity for distributed computing.

Many of the products we use, and for probably the last fifty years really, live in the space between theory and practice. We need to collect all of this and teach it. Computers have grown 6, maybe more, orders of magnitude since Knuth pioneered these techniques. In any other domain of computer science the solutions often change when the order of magnitude of the problem changes, and after several it’s inescapable.

ddtaylor 5 hours ago [-]
I think the notation is supposed to mean the worst performance. This is more an argument about amortized time analysis.
tsimionescu 5 hours ago [-]
No, this is a common misconception. Big O notation just represents a class of functions; it has nothing to do with complexity analysis, except for being very popular in the field. Whether an algorithm has O(N) worst case runtime or O(N) best case runtime or O(N) typical case runtime are all valid questions.
sparkie 4 hours ago [-]
The misconception is mixing up notations. Using Big-O for the upper bound has been the norm for describing algorithms for at least half a century.

In Knuth's description[1] of Big-O notation (and related variants), from 1976, he starts out by saying:

    Most of us have gotten accustomed to the idea of using the notation O(f(n)) to stand for any function whose magnitude is upper-bounded by a constant times f(n), for all large n.
Emphasis on upper-bounded. He goes on to describe other notations: Big-Omega for a lower bound, and so forth.

[1]:https://danluu.com/knuth-big-o.pdf

svat 4 hours ago [-]
Yes O(f(n)) shows how the function f(n) is upper-bounded, but the point of the comment you're replying to is that the function f could be the worst-case, average-case, or best-case running time of an algorithm, or even a (say) number-theoretic function that has nothing to do with running times. (More here: https://stackoverflow.com/a/1960493 including the example in the comments of an algorithm whose best-case cost is O(n^2) and worst-case cost is Ω(n^3) — it is perfectly meaningful to give an upper bound on the best case, and a lower bound on the worst case.)
tsimionescu 4 hours ago [-]
Again, the worst case complexity can be O(N), but it can also be Big-Omega(N^2). Note that Knuth talks about the magnitude of a function, not the runtime of an algorithm. The worst case complexity of an algorithm is a function, call it f_worst. The best case complexity of that same algorithm is a different function, call it f_best. It's just as meaningful to talk about O(f_worst) as it is to talk of O(f_best).

This is actually pretty common in CS. For example, people will often say that the typical case complexity of Quicksort is O(N log N), even though the worst case complexity is O(N^2). I doubt you'll find anyone who can tell you off the top of their head what the actual complexity function of quicksort is, and that will depend anyway on details of how the exact algorithm is implemented, even in pseudocode (will it do N reads, or 2N reads?).

Let's take a simple example: linear search. The worst case is that we need to read every element from the array, compare it to the desired value, and never find the desired value. So, in the worst case, the runtime is 2N (N reads + N comparisons) - f_worst(n) = 2n. In the best case, we read the first item in the array, we compare it to the desired value, and it is equal - so we need 2 operations - f_best(n) = 2. Now, we can say that O(f_worst(n)) = O(2n) = O(n). And we can also say that O(f_best(n)) = O(2) = O(1). So the best case complexity for linear search is O(1), and the worst case complexity is O(n). We never want to remember details like this to say things like "the best case complexity of linear search is 2".
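
In code, the function being classified looks like this (a trivial sketch of the example above):

    def linear_search(arr, target):
        # Worst case (target absent): n reads + n comparisons, so f_worst(n) = 2n, which is O(n).
        # Best case (first element matches): 1 read + 1 comparison, so f_best(n) = 2, which is O(1).
        for i, value in enumerate(arr):
            if value == target:
                return i
        return -1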

runeblaze 3 hours ago [-]
Ummm guys when we talk about memory access in theory can we just be rigorous and talk about the computational model?

The real RAM model “in theory” tells me that memory access is O(1). Of course real RAM is a spherical cow but like we could use a bit more rigor

eesmith 7 hours ago [-]
The fancy word for assuming O(1) access and not needing to spill over into longer addressing modes is, I believe, "transdichotomous model" - https://en.wikipedia.org/wiki/Transdichotomous_model

> In computational complexity theory, and more specifically in the analysis of algorithms with integer data, the transdichotomous model is a variation of the random-access machine in which the machine word size is assumed to match the problem size.

feoren 3 hours ago [-]
Let's be careful about exactly what we mean by this. When we say an algorithm is O(f(N)), we need to be very clear about what N we're talking about. The whole point is to focus on a few variables of interest (often "number of elements" or "total input size") while recognizing that we are leaving out many constants: CPU speed, average CPU utilization, the speed of light, and (usually) the size of memory. If I run a task against some data and say "good job guys, this only took 1 second on 1000 data points! Looks like our work here is done", it would be quite unnerving to learn that the algorithm is actually O(3^N). 1000 data points better be pretty close to the max I'll ever run it on; 2000 data points and I might be waiting until the heat death of the universe.

I'm seeing some commenters happily adding 1/3 to the exponent of other algorithms. This insight does not make an O(N^2) algorithm O(N^7/3) or O(N^8/3) or anything else; those are different Ns. It might be O(N^2 + (N*M)^1/3) or O((N * M^(1/3))^2) or almost any other combination, depending on the details of the algorithm.

Early algorithm design was happy to treat "speed of memory access" as one of these constants that you don't worry about until you have a speed benchmark. If my algorithm takes 1 second on 1000 data points, I don't care if that's because of memory access speed, CPU speed, or the speed of light -- unless I have some control over those variables. The whole reason we like O(N) algorithms more than O(N^2) ones is because we can (usually) push them farther without having to buy better hardware.

More modern algorithm design does take memory access into account, often by trying to maximize usage of caches. The abstract model is a series of progressively larger and slower caches, and there are ways of designing algorithms that have provable bounds on their usage of these various caches. It might be useful for these algorithms to assume that the speed of a cache access is O(M^1/3), where M is the size of memory, but that actually lowers their generality: the same idea holds between L2 -> L3 cache as L3 -> RAM and even RAM -> Disk, and certainly RAM -> Disk does not follow the O(M^1/3) law. See https://en.wikipedia.org/wiki/Cache-oblivious_algorithm

So basically this matters for people who want some idea of how much faster (or slower) algorithms might run if they change the amount of memory available to the application, but even that depends so heavily on details that it's not likely to be "8x memory = 2x slower". I'd argue it's perfectly fine to keep M^(1/3) as one of your constants that you ignore in algorithm design, even as you develop algorithms that are more cache- and memory-access-aware. This may justify why cache-aware algorithms are important, but it probably doesn't change their design or analysis at all. It seems mainly just a useful insight for people responsible for provisioning resources who think more hardware is always better.

dmitrygr 4 hours ago [-]
Mathematically wrong, unless restated correctly: "accessing ALL memory at a given cache level is O(N^[1/3])". After restatement: trite and useless.

Accessing any given one byte in the address space is amortized to O(1)

curiouscoding 2 hours ago [-]
Try measuring it! You'll quickly see that the latency of addressing n bytes of memory is definitely not O(1).

See eg "The Myth of RAM": https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

dmitrygr 1 hours ago [-]
100% wrong

8^(1/3) is 2

So according to OP (and you by agreement), upgrading my machine from 4GB of RAM to 32GB should double my RAM access time. Obviously, and demonstrably, nothing of the sort occurs!

westurner 7 hours ago [-]
> L3 cache is not built for mass throughput in the same way that DRAM is, and so it has roughly identical mass throughput despite its much closer distance to the computation.

"The von Neumann bottleneck is impeding AI computing?" (2025) https://news.ycombinator.com/item?id=45398473 :

> How does Cerebras WSE-3 with 44GB of 'L2' on-chip SRAM compare to Google's TPUs, Tesla's TPUs, NorthPole, Groq LPU, Tenstorrent's, and AMD's NPU designs?

From https://news.ycombinator.com/item?id=42875728 :

> WSE-3: 21 PB/S

From https://hackernoon.com/nvidias-mega-machine-crushes-all-of-2... :

> At Computex 2025, Nvidia’s Jensen Huang dropped a bombshell: the NVLink Spine, a compute beast pumping 130 terabytes per second, eclipsing the internet’s 2024 peak of 112.5 TB/s.

"A Comparison of the Cerebras Wafer-Scale Integration Technology with Nvidia GPU-based Systems for Artificial Intelligence" (2025-03) https://arxiv.org/abs/2503.11698v1

b0gb 7 hours ago [-]
time is relative... </joke>
sparkie 5 hours ago [-]
It's true though. Time complexity is a relative measure. Big-O notation is an upper bound, not a lower one. You don't take the fastest storage (L1 cache) as the baseline `x` and then say it takes some f(n) > x time to access storage because it's not in the fastest cache. This is a complete misrepresentation of Big-O notation.

When we say something is O(1), we're saying that there is a constant upper bound on the time to compute the algorithm. It is constant time if it takes no longer to perform the algorithm when `n` becomes large. O(1) simply means the worst case time to compute is independent of `n`. If the algorithm is accessing storages of varying latencies, then the baseline is the slowest medium.

For small `n` (ie, what might fit into cache), Big-O notation is not really that useful. When `n` is tiny a O(n) algorithm can outperform a O(1) algorithm. The notation doesn't tell us anything about how efficient a particular implementation is. It is mainly useful when we're talking about `n` which can grow to large sizes, where whatever micro-optimizations we, or the machine might perform are dominated by `n`.

tsimionescu 4 hours ago [-]
This is quite wrong. If an algorithm requires one memory read, and reading one byte from memory takes (size of data)^1/3 time steps, then the complexity of that algorithm is O(N^1/3), not O(1).

This is well known for doing large-number arithmetic. For example, we typically consider that a+a is an operation that takes O(1) time to execute. This is because we know we are limiting the analysis to small numbers, even if we have a lot of them. But if the numbers can be arbitrarily large, then a+a doesn't take O(1) time, it takes O(log a) time. So, an algorithm that needs to perform O(N) additions doesn't take O(N) time, it takes O(N log K) time, where K bounds the magnitude of the numbers. And if we are doing something like adding the first N numbers, the complexity is actually O(N log N), not O(N).

The same thing happens to memory: if we assume that the problem will fit into very fast memory even for the largest sizes, we can indeed approximate memory access as O(1). But if we admit that the more memory we need to store the components of the problem, the slower memory access will get, then we need to model the cost of accessing memory as a function of program size - and the author is proposing O(N^1/3) as that cost. This means that our algorithm for adding the first N numbers actually requires O(N^4/3) time for reading N numbers from memory and then O(N log N) additions, which gives it a total complexity of O(N^4/3 + N log N) = O(N^4/3).

Now, this type of analysis is not useful if we think all our numbers will fit into a single computer's memory, probably. But if we are trying to, say, figure out how our algorithm will scale from gigabytes of data to hundreds of petabytes of data, we need to model the rising cost of memory access as our data set gets bigger too. At some point as N grows to infinity, even just computing the address of the next byte becomes a complex operation that can take longer than the whole time it took to add the first 100GB of numbers.
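
Spelling out the sum-the-first-N-numbers example under that cost model (the constant c and the read/add split are schematic):

    T(N) ≈ N reads × c·N^(1/3) per read  +  N adds × O(log N) per add
         = O(N^(4/3)) + O(N log N)
         = O(N^(4/3))          (the memory term dominates as N grows)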

sparkie 4 hours ago [-]
O(1) simply means that there's a constant upper bound to the algorithm as n -> ∞.

In English: There is some worst-case time taken to compute the algorithm that adding any further `n` will not take any longer.

tsimionescu 4 hours ago [-]
Sure. And for any algorithm that needs to perform a read from memory, no such upper bound exists, this is the point.

For example, say we want to read the last number from an array. We get given the array index as an input, and we need to perform one memory read. You'd typically say that such an algorithm has a complexity of O(1).

However, if you actually write this algorithm and benchmark it for N=1, N=1MB, N=1GB, N=1TB, N=1PB and so on, you'll find that each of these is larger than the last. You'll also find that this number never stops growing. Accessing the last item in an array that is as large as a planet will take longer than accessing the last item in an array that is as large as a continent. So the actual complexity class is greater than O(1). Maybe O(n^1/3) is a good model, maybe O(n^1/2) is better, maybe O(log n) would be best - whatever it is, it's not a constant.

b0gb 4 hours ago [-]
the last or the first... it's a matter of perspective... still relative </joke^2>
AndyKelley 7 hours ago [-]
This post was really good right up until the screenshot of ChatGPT... It would be a big improvement to convert that to HTML, fact check it, and then not cite LLM as a source.
kccqzy 6 hours ago [-]
Especially since the ChatGPT answer starts with "you are right" suggesting that the author is convincing ChatGPT of a specific viewpoint.
6 hours ago [-]
masijo 5 hours ago [-]
It's getting very annoying to see posts that use ChatGPT as a source; sadly it seems like this is the future.
Rarebox 6 hours ago [-]
Also just stopped reading at that point. The idea seemed quite clever.
FuckButtons 6 hours ago [-]
This seems like a mathematician trying to shoehorn a closed-form solution onto something which is architecture-dependent and is probably better summarized with: only operate on values which fit inside the L2, or at least the L3, cache of your target machine for performance-sensitive code paths.