Copying collectors with block-structured heaps are unreliable (wingolog.org)
bjourne 8 days ago [-]
But this only happens if the copying collector's space is growable and discontinuous. I think a better scheme is to just use copying for the young generation and mark-and-sweep for the older ones. Structuring the heap into blocks seems quite misguided, imo.
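
Roughly the allocation path I have in mind: a sketch in C with made-up names, not any real collector's code.

    #include <stdint.h>
    #include <stdlib.h>

    /* Sketch: bump-pointer copying nursery in front of a mark-sweep old
       space. Stubbed for illustration: a real minor_gc would trace roots
       and copy survivors out; here the "old space" is just malloc. */
    typedef struct { uint8_t *start, *top, *end; } nursery_t;

    static void *old_space_alloc(size_t n) { return malloc(n); } /* stand-in */

    static void minor_gc(nursery_t *n) {
        /* Real version: copy live objects into the old space first. */
        n->top = n->start;
    }

    void *gc_alloc(nursery_t *n, size_t size) {
        if ((size_t)(n->end - n->top) < size) {
            minor_gc(n);                      /* evacuate, reset top */
            if ((size_t)(n->end - n->top) < size)
                return old_space_alloc(size); /* too big for the nursery */
        }
        void *obj = n->top;                   /* fast path: one pointer bump */
        n->top += size;
        return obj;
    }
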
ryanobjc 8 days ago [-]
This is exactly how the JVM's CMS collector works. What happens then is that the old gen becomes too fragmented to allocate from, you hit a concurrent mode failure, and the JVM falls back to a stop-the-world full GC that compacts the entire old heap.

When Hadoop depended on a single NameNode, a 64 GB+ heap would take 5+ minutes to collect in these failure modes. Quite painful.

bjourne 8 days ago [-]
Yes, but there are two easy solutions: 1. Larger nurseries, to minimize churn in the older generations; fragmentation is a symptom of too many "false survivors". 2. Concurrent heap compaction.
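
On HotSpot, (1) is just nursery sizing from the command line, e.g. (real flags, illustrative values, MyApp is a placeholder):

    java -Xmn4g -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=6 MyApp
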
jakewins 7 days ago [-]
The failure mode here is usually caching, in my experience. Lots of data, all dying old.
hinkley 7 days ago [-]
When JVMs started hitting a wall around 1GB of memory, we ended up moving long-lived data out of the old generation and into in-memory databases.
Someone 8 days ago [-]
> Structuring the heap into blocks seems quite misguided, imo.

So, how would you keep allocations fast on multi-core hardware?

FTA: “I am targeting embedders with multiple mutator threads, and a classic semi-space collector doesn’t scale – you could access the allocation pointer atomically, but that would be a bottleneck, serializing mutators, not to mention the cache contention.”
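
Concretely, the quoted bottleneck is a single shared bump pointer: every mutator does an atomic read-modify-write on the same cache line. A minimal C11 sketch:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One allocation pointer shared by all mutator threads: every
       allocation is an atomic RMW on one cache line, so mutators
       serialize on it exactly as the article describes. */
    static _Atomic uintptr_t alloc_ptr;
    static uintptr_t heap_end;

    void *shared_bump_alloc(size_t size) {
        uintptr_t obj = atomic_fetch_add(&alloc_ptr, size);
        if (obj + size > heap_end)
            return NULL;   /* out of space: a real VM would collect here */
        return (void *)obj;
    }
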

neonsunset 7 days ago [-]
Per-core heaps. The count, sizes, and proportions of the generational heaps can also be scaled dynamically according to allocation rate, the target % of CPU time spent in GC, and collection efficiency, as .NET's GC does.

https://devblogs.microsoft.com/dotnet/put-a-dpad-on-that-gc/

https://maoni0.medium.com/dynamically-adapting-to-applicatio...

hinkley 7 days ago [-]
Per-thread nurseries (TLABs) worked just fine for Java for close to a decade. Escape analysis made them more efficient, and escape analysis indirectly led to borrow checking and Rust.
kccqzy 7 days ago [-]
While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

convolvatron 7 days ago [-]
That would almost certainly be true if you could make assumptions about thread-CPU affinity.
scottlamb 7 days ago [-]
> While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

Yeah, that's pretty new relative to when the old gperftools version of tcmalloc hit the scene. It works via rseq (restartable sequences) on Linux; afaik no other OS is supported.

> I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

Both are quite effective at removing the contention, so I think it mostly comes down to which needs fewer allocation caches to do it. In the case of a tie, per-CPU is probably better because if a thread migrates across cores you want to use memory likely to be hot in the CPU cache. If you have far more threads than cores, per-CPU should win. If you're single-threaded, per-thread should win. If you have a thread-per-core reactor, and either use the whole machine, or restrict it with cpuset so it always stays on the same cores, per-CPU should win, and more so if there are any ancillary threads for logging or whatever. All those profiles exist; what a "typical application" is I dunno.

There's also the concept of "virtual CPU IDs" (the terminology may have changed), which can minimize the downside of per-CPU for small cgroups on huge machines. It really should be better than per-thread, since the number of running threads is at most the number of total threads. Not sure if that's in mainline Linux yet. https://lwn.net/Articles/885818/
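
To make the per-CPU shape concrete without the rseq machinery, here's a sketch (glibc's sched_getcpu() plus a per-slot lock; assumes GCC/Clang on Linux). The point of rseq in real TCMalloc is that the whole pop restarts if the thread migrates, so even this lock goes away:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stddef.h>

    #define MAX_CPUS 256

    /* One free list per CPU slot. Without rseq, a thread can migrate
       between sched_getcpu() and the list op, so each slot keeps a
       (rarely contended) lock. */
    typedef struct {
        pthread_mutex_t lock;
        void *freelist;        /* singly linked list of free blocks */
    } cpu_cache_t;

    static cpu_cache_t caches[MAX_CPUS] = {
        [0 ... MAX_CPUS - 1] = { PTHREAD_MUTEX_INITIALIZER, NULL } /* GNU ext. */
    };

    void *cpu_cache_pop(void) {
        int cpu = sched_getcpu();               /* best-effort CPU id */
        cpu_cache_t *c = &caches[cpu < 0 ? 0 : cpu % MAX_CPUS];
        pthread_mutex_lock(&c->lock);
        void *obj = c->freelist;
        if (obj)
            c->freelist = *(void **)obj;        /* next ptr stored in block */
        pthread_mutex_unlock(&c->lock);
        return obj;                             /* NULL => refill centrally */
    }
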

kccqzy 7 days ago [-]
Ah yes, thanks for the detailed explanation! By "typical application" I mean those applications where there are far more (mostly idle) threads than available cores. I guess I shouldn't have used the phrase "typical application" since this differs so much by people's experience.
gpderetta 7 days ago [-]
If you allow thread migration across CPUs, then per-CPU nurseries become problematic when collecting, unless you stop the world.
bjourne 8 days ago [-]
You have private, fixed-size, copy-collected spaces for each thread.
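
I.e. the fast path needs no atomics at all; only overflow of a thread's private space takes a slow path. Sketch (C11, slow_path is hypothetical):

    #include <stddef.h>
    #include <stdint.h>

    /* Each mutator owns one fixed-size space: allocation is two loads,
       a compare, and a store. No atomics, no shared cache lines. */
    typedef struct { uint8_t *top, *end; } thread_space_t;

    static _Thread_local thread_space_t space;

    void *slow_path(size_t size);   /* hypothetical: copy-collect this space */

    void *thread_alloc(size_t size) {
        if ((size_t)(space.end - space.top) < size)
            return slow_path(size); /* overflow: collect survivors, retry */
        void *obj = space.top;
        space.top += size;
        return obj;
    }
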
bjoli 7 days ago [-]
I am not sure Andy is very present on social media. He would probably appreciate a comment.
chrisjj 8 days ago [-]
> a form of soft reliability

A form of reliability that is unreliable, right? :)
