Copying collectors with block-structured heaps are unreliable (wingolog.org)
bjourne 8 days ago [-]
But this only happens if the copying collector's space is growable and discontinuous. I think a better scheme is to just use copying for the young generation and mark-and-sweep for the older ones. Structuring the heap into blocks seems quite misguided, imo.
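
Roughly the allocation path I have in mind: a sketch in C with made-up names, not any real collector's code.

    #include <stdint.h>
    #include <stdlib.h>

    /* Sketch: bump-pointer copying nursery in front of a mark-sweep old
       space. Stubbed for illustration: a real minor_gc would trace roots
       and copy survivors out; here the "old space" is just malloc. */
    typedef struct { uint8_t *start, *top, *end; } nursery_t;

    static void *old_space_alloc(size_t n) { return malloc(n); } /* stand-in */

    static void minor_gc(nursery_t *n) {
        /* Real version: copy live objects into the old space first. */
        n->top = n->start;
    }

    void *gc_alloc(nursery_t *n, size_t size) {
        if ((size_t)(n->end - n->top) < size) {
            minor_gc(n);                      /* evacuate, reset top */
            if ((size_t)(n->end - n->top) < size)
                return old_space_alloc(size); /* too big for the nursery */
        }
        void *obj = n->top;                   /* fast path: one pointer bump */
        n->top += size;
        return obj;
    }
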
ryanobjc 8 days ago [-]
This is exactly how the JVM's CMS collector works. What happens then is that the old gen becomes too fragmented to allocate from, you hit a concurrent mode failure, and the JVM falls back to a stop-the-world full GC that compacts the entire old heap.

When Hadoop depended on a single NameNode, a 64 GB+ heap would take 5+ minutes to collect in these failure modes. Quite painful.

bjourne 8 days ago [-]
Yes, but there are two easy solutions: 1. Larger nurseries, to minimize churn in the older generations; fragmentation is a symptom of too many "false survivors". 2. Concurrent heap compaction.
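
On HotSpot, (1) is just nursery sizing from the command line, e.g. (real flags, illustrative values, MyApp is a placeholder):

    java -Xmn4g -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=6 MyApp
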
jakewins 7 days ago [-]
The failure mode here is usually caching, in my experience. Lots of data, all dying old.
hinkley 7 days ago [-]
When JVMs started hitting a wall around 1GB of memory, we ended up moving long-lived data out of the old generation and into in-memory databases.
Someone 8 days ago [-]
> Structuring the heap into blocks seems quite misguided, imo.

So, how would you keep allocations fast on multi-core hardware?

FTA: “I am targeting embedders with multiple mutator threads, and a classic semi-space collector doesn’t scale – you could access the allocation pointer atomically, but that would be a bottleneck, serializing mutators, not to mention the cache contention.”
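
Concretely, the quoted bottleneck is a single shared bump pointer: every mutator does an atomic read-modify-write on the same cache line. A minimal C11 sketch:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One allocation pointer shared by all mutator threads: every
       allocation is an atomic RMW on one cache line, so mutators
       serialize on it exactly as the article describes. */
    static _Atomic uintptr_t alloc_ptr;
    static uintptr_t heap_end;

    void *shared_bump_alloc(size_t size) {
        uintptr_t obj = atomic_fetch_add(&alloc_ptr, size);
        if (obj + size > heap_end)
            return NULL;   /* out of space: a real VM would collect here */
        return (void *)obj;
    }
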

neonsunset 7 days ago [-]
Per-core heaps. The count, sizes, and proportions of the generational heaps can also be scaled dynamically according to allocation rate, the target % of CPU time spent in GC, and collection efficiency, as .NET's GC does.

https://devblogs.microsoft.com/dotnet/put-a-dpad-on-that-gc/

https://maoni0.medium.com/dynamically-adapting-to-applicatio...

hinkley 7 days ago [-]
Per-thread nurseries (TLABs) worked just fine for Java for close to a decade. Escape analysis made them more efficient, and escape analysis indirectly led to borrow checking and Rust.
kccqzy 7 days ago [-]
While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

convolvatron 7 days ago [-]
That would almost certainly be true if you could make assumptions about thread-CPU affinity.
scottlamb 7 days ago [-]
> While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

Yeah, that's pretty new relative to when the old gperftools version of tcmalloc hit the scene. It works via rseq (restartable sequences) on Linux; afaik no other OS is supported.

> I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

Both are quite effective at removing the contention, so I think it mostly comes down to which needs fewer allocation caches to do it. In the case of a tie, per-CPU is probably better because if a thread migrates across cores you want to use memory likely to be hot in the CPU cache. If you have far more threads than cores, per-CPU should win. If you're single-threaded, per-thread should win. If you have a thread-per-core reactor, and either use the whole machine, or restrict it with cpuset so it always stays on the same cores, per-CPU should win, and more so if there are any ancillary threads for logging or whatever. All those profiles exist; what a "typical application" is I dunno.

There's also the concept of "virtual CPU IDs" (the terminology may have changed), which can minimize the downside of per-CPU for small cgroups on huge machines. It really should be better than per-thread, since the number of running threads is at most the number of total threads. Not sure if that's in mainline Linux yet. https://lwn.net/Articles/885818/
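
To make the per-CPU shape concrete without the rseq machinery, here's a sketch (glibc's sched_getcpu() plus a per-slot lock; assumes GCC/Clang on Linux). The point of rseq in real TCMalloc is that the whole pop restarts if the thread migrates, so even this lock goes away:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stddef.h>

    #define MAX_CPUS 256

    /* One free list per CPU slot. Without rseq, a thread can migrate
       between sched_getcpu() and the list op, so each slot keeps a
       (rarely contended) lock. */
    typedef struct {
        pthread_mutex_t lock;
        void *freelist;        /* singly linked list of free blocks */
    } cpu_cache_t;

    static cpu_cache_t caches[MAX_CPUS] = {
        [0 ... MAX_CPUS - 1] = { PTHREAD_MUTEX_INITIALIZER, NULL } /* GNU ext. */
    };

    void *cpu_cache_pop(void) {
        int cpu = sched_getcpu();               /* best-effort CPU id */
        cpu_cache_t *c = &caches[cpu < 0 ? 0 : cpu % MAX_CPUS];
        pthread_mutex_lock(&c->lock);
        void *obj = c->freelist;
        if (obj)
            c->freelist = *(void **)obj;        /* next ptr stored in block */
        pthread_mutex_unlock(&c->lock);
        return obj;                             /* NULL => refill centrally */
    }
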

kccqzy 7 days ago [-]
Ah yes, thanks for the detailed explanation! By "typical application" I mean those applications where there are far more (mostly idle) threads than available cores. I guess I shouldn't have used the phrase "typical application" since this differs so much by people's experience.
gpderetta 7 days ago [-]
If you allow thread migration across CPUs, then per-CPU nurseries become problematic when collecting, unless you stop the world.
bjourne 8 days ago [-]
You have private, fixed-size, copy-collected spaces for each thread.
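
I.e. the fast path needs no atomics at all; only overflow of a thread's private space takes a slow path. Sketch (C11, slow_path is hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    /* Each mutator owns one fixed-size space: allocation is two loads,
       a compare, and a store. No atomics, no shared cache lines. */
    typedef struct { uint8_t *top, *end; } thread_space_t;

    static _Thread_local thread_space_t space;

    void *slow_path(size_t size);   /* hypothetical: copy-collect this space */

    void *thread_alloc(size_t size) {
        if ((size_t)(space.end - space.top) < size)
            return slow_path(size); /* overflow: collect survivors, retry */
        void *obj = space.top;
        space.top += size;
        return obj;
    }
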
bjoli 7 days ago [-]
I am not sure Andy is very present on social media. He would probably appreciate a comment.
chrisjj 8 days ago [-]
> a form of soft reliability

A form of reliability that is unreliable, right? :)
