Copying collectors with block-structured heaps are unreliable (wingolog.org)
bjourne 104 days ago [-]
But this only happens if the copying collector's space is growable and discontinuous. I think a better scheme is to just use copying for the young generation and mark-and-sweep for the older ones. Structuring the heap into blocks seems quite misguided, imo.
ryanobjc 104 days ago [-]
This is exactly how the JVM's CMS collector works. Then what happens is the old gen becomes too fragmented to allocate from, you hit a concurrent mode failure, and the JVM falls back to a stop-the-world compaction of the entire old heap.

When Hadoop depended on a single NameNode, a 64 GB+ heap would take 5+ minutes to collect in these failure modes. Quite painful.

bjourne 104 days ago [-]
Yes, but there are two easy solutions: 1. Larger nurseries to minimize churn in older generations. Fragmentation is a symptom of too many "false survivors". 2. Concurrent heap compaction.
jakewins 104 days ago [-]
The failure mode here is usually caching, in my experience. Lots of data, all dying old.
hinkley 104 days ago [-]
When JVMs started hitting a wall around 1GB of memory, we ended up moving long-lived data out of the old generation and into in-memory databases.
Someone 104 days ago [-]
> Structuring the heap into blocks seem quite misguided, imo.

So, how would you keep allocations fast on multi-core hardware?

FTA: “I am targetting embedders with multiple mutator threads, and a classic semi-space collector doesn’t scale – you could access the allocation pointer atomically, but that would be a bottleneck, serializing mutators, not to mention the cache contention.”
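
To make that bottleneck concrete, here's a toy C sketch (my own illustration, not the article's code; function names and the block size are invented). A single shared bump pointer turns every allocation into an atomic RMW on the same cache line, while handing each mutator a private block amortizes that atomic to once per block:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Sketch only: assume heap_cursor points into a big reserved
       region; OOM, alignment, and n <= BLOCK_SIZE are all waved away. */
    static _Atomic(char *) heap_cursor;

    /* Classic semi-space: every allocation is an atomic RMW on the
       same cache line, serializing mutators. */
    void *alloc_shared(size_t n) {
        return atomic_fetch_add(&heap_cursor, n);
    }

    /* Block-structured: one atomic op per block, then bump privately. */
    enum { BLOCK_SIZE = 64 * 1024 };          /* invented block size */
    static _Thread_local char *cur, *end;

    void *alloc_blocked(size_t n) {
        if ((size_t)(end - cur) < n) {        /* slow path, per block */
            cur = atomic_fetch_add(&heap_cursor, BLOCK_SIZE);
            end = cur + BLOCK_SIZE;
        }
        void *p = cur;
        cur += n;                             /* fast path: no atomics */
        return p;
    }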

neonsunset 104 days ago [-]
Per-core heaps. The count, sizes, and proportions of the generational heaps can also be scaled dynamically according to allocation rate, the target % of CPU time spent in GC, and collection efficiency, as .NET's GC does.

https://devblogs.microsoft.com/dotnet/put-a-dpad-on-that-gc/

https://maoni0.medium.com/dynamically-adapting-to-applicatio...

hinkley 104 days ago [-]
Per-thread nurseries worked just fine for Java for close to a decade. They made it more efficient using escape analysis. Escape analysis indirectly led to borrow checking and Rust.
kccqzy 104 days ago [-]
While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

convolvatron 104 days ago [-]
That would almost certainly be true if you could make guarantees about thread-CPU affinity.
scottlamb 104 days ago [-]
> While per-thread nurseries certainly work, I recently became aware that TCMalloc actually supports either per-thread mode or per-CPU mode: https://google.github.io/tcmalloc/overview.html#tcmalloc-cac...

Yeah, that's pretty new relative to when the old gperftools version of tcmalloc hit the scene. It works via rseq (restartable sequences) on Linux; afaik no other OS is supported.

> I have not looked into this very deeply but from first principles it would appear to me that a per-CPU nursery would be strictly better than a per-thread nursery for typical applications.

Both are quite effective at removing the contention, so I think it mostly comes down to which needs fewer allocation caches to do it. In the case of a tie, per-CPU is probably better because if a thread migrates across cores you want to use memory likely to be hot in the CPU cache. If you have far more threads than cores, per-CPU should win. If you're single-threaded, per-thread should win. If you have a thread-per-core reactor, and either use the whole machine, or restrict it with cpuset so it always stays on the same cores, per-CPU should win, and more so if there are any ancillary threads for logging or whatever. All those profiles exist; what a "typical application" is I dunno.

There's also the concept of "virtual CPU IDs" (the language may have changed), which can minimize the downside of per-CPU for small cgroups on huge machines. It really should be better than per-thread, as running threads <= total threads. Not sure if that's even in mainline Linux now or not. https://lwn.net/Articles/885818/
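
For anyone who hasn't seen the per-CPU mode, a toy sketch of the idea in C (my own illustration, not TCMalloc's actual code). The real thing uses rseq so the fast path is safe against migration without any lock; this naive version keeps a per-slot mutex to stay correct:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>           /* sched_getcpu(), glibc/Linux */

    #define MAX_CPUS 256         /* illustrative upper bound */

    struct cpu_cache {
        pthread_mutex_t lock;    /* zero-init is fine on glibc; rseq is
                                    what lets TCMalloc drop this lock */
        void *freelist;          /* next pointer stored in each object */
    };
    static struct cpu_cache caches[MAX_CPUS];

    void *cache_pop(void) {
        /* The CPU id is only a hint: the thread can migrate right
           after reading it, hence the per-slot synchronization.
           The unsigned cast also tames a -1 error return. */
        struct cpu_cache *c = &caches[(unsigned)sched_getcpu() % MAX_CPUS];
        pthread_mutex_lock(&c->lock);
        void *p = c->freelist;
        if (p)
            c->freelist = *(void **)p;
        pthread_mutex_unlock(&c->lock);
        return p;                /* NULL: refill from the central heap */
    }

Contention on a slot is rare because at most a couple of threads ever race for it, but per-CPU only pays off if the cached memory stays warm, which is the affinity point above.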

kccqzy 104 days ago [-]
Ah yes, thanks for the detailed explanation! By "typical application" I mean those applications where there are far more (mostly idle) threads than available cores. I guess I shouldn't have used the phrase "typical application" since this differs so much by people's experience.
gpderetta 104 days ago [-]
If you allow thread migration across CPUs, then per-CPU nurseries become problematic when collecting, unless you stop the world.
bjourne 104 days ago [-]
You have private, fixed-size, copy-collected spaces for each thread.
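
Roughly like this, perhaps (a sketch under heavy assumptions: names are made up, the Cheney copy of live objects is elided, error handling is omitted, and cross-thread pointers, the actually hard part, are ignored):

    #include <stddef.h>
    #include <stdlib.h>

    enum { SPACE_SIZE = 1 << 20 };   /* fixed size, e.g. 1 MiB */

    /* Two private semispaces per thread: no synchronization on the
       allocation fast path, and the space never grows. */
    static _Thread_local struct {
        char *from, *to;             /* the semispace pair */
        char *cur, *end;             /* bump window in from-space */
    } h;

    void heap_init(void) {
        h.from = malloc(SPACE_SIZE);
        h.to   = malloc(SPACE_SIZE);
        h.cur  = h.from;
        h.end  = h.from + SPACE_SIZE;
    }

    static void collect(void) {
        /* Cheney-copy live objects from `from` into `to` (omitted),
           then flip the spaces and reset the bump pointer. */
        char *t = h.from; h.from = h.to; h.to = t;
        h.cur = h.from;              /* + survivor bytes, in a real GC */
        h.end = h.from + SPACE_SIZE;
    }

    void *heap_alloc(size_t n) {     /* assumes n << SPACE_SIZE */
        if (h.end - h.cur < (ptrdiff_t)n)
            collect();               /* fixed-size: collect, don't grow */
        void *p = h.cur;
        h.cur += n;
        return p;
    }

This sidesteps the migration problem because the space belongs to the thread, not the CPU; the cost is that anything reachable from another thread has to be handled specially.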
bjoli 104 days ago [-]
I am not sure Andy is very present on social media. He would probably appreciate a comment.
chrisjj 104 days ago [-]
> a form of soft reliability

A form of reliability that is unreliable, right? :)
