The Slotted Counter Pattern (2020) (planetscale.com)
oftenwrong 11 hours ago [-]
This is similar to how java.util.concurrent.atomic.LongAdder works:

https://github.com/openjdk/jdk/blob/master/src/java.base/sha...
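
For anyone who hasn't read the article: the pattern is N counter rows per logical counter, written via an upsert into a random slot and read back with a SUM. A rough sketch using Python's built-in sqlite3 (the article targets MySQL; the table and column names here are made up):

  import random
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("""CREATE TABLE slotted_counter (
      counter TEXT, slot INTEGER, cnt INTEGER,
      PRIMARY KEY (counter, slot))""")

  SLOT_COUNT = 100  # more slots = less write contention, slower reads

  def increment(name, n=1):
      # upsert into a random slot so concurrent writers rarely collide on a row
      conn.execute(
          "INSERT INTO slotted_counter (counter, slot, cnt) VALUES (?, ?, ?) "
          "ON CONFLICT (counter, slot) DO UPDATE SET cnt = cnt + excluded.cnt",
          (name, random.randrange(SLOT_COUNT), n))

  def read(name):
      # reads pay the price: sum every slot
      return conn.execute(
          "SELECT COALESCE(SUM(cnt), 0) FROM slotted_counter WHERE counter = ?",
          (name,)).fetchone()[0]

  for _ in range(1000):
      increment("page_views")
  print(read("page_views"))  # 1000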

o11c 10 hours ago [-]
Huh, that's ... still quite inefficient. Since multiple variables in a cache line are fine if they will all be accessed from the same CPU, you ideally want a separate allocator for that; then you can avoid all the spooky dynamic growth. And most CPUs offer a direct "atomic add" instruction which is much faster than a CAS loop. For pure stat counters you generally want `relaxed` memory ordering on that; for other cases acquire and/or release may be desired (this is tricky to get performance-optimal, given that some platforms upgrade to a stronger memory ordering for free, so it's best to require it in your algorithm, whereas others are best with an explicit fence in the edge case). I've never found a real-world use for `seq_cst`.

It's unfortunate that per-CPU variables are difficult in userland, but there are at least 2 ways to fully emulate them - rseq and pinning - and you can also just revert to full-blown thread-locals (which have much better tooling support) if you aren't constructing a gratuitously large number of threads, or shared thread-locals if you do have a lot of threads. If you make the wrong choice here, correctness never suffers, only performance.
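
Python can't express the memory-ordering side of this, but the thread-local fallback is easy to sketch: each thread gets its own cell and a read just sums them. The cell registry and names below are made up; writes never take a lock on the hot path:

  import threading

  class PerThreadCounter:
      """Thread-local slots: each thread increments only its own cell,
      so writes never contend; reads sum every cell and may be slightly
      stale, which is fine for stat counters."""

      def __init__(self):
          self._local = threading.local()
          self._cells = []                   # every thread's cell, for reads
          self._register = threading.Lock()  # taken once per thread, at creation

      def _cell(self):
          cell = getattr(self._local, "cell", None)
          if cell is None:
              cell = self._local.cell = [0]
              with self._register:
                  self._cells.append(cell)
          return cell

      def increment(self, n=1):
          self._cell()[0] += n               # no lock on the hot path

      def value(self):
          with self._register:
              return sum(c[0] for c in self._cells)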

BeeOnRope 5 hours ago [-]
An uncontended CAS without carried dependencies on the result (almost always the case in this use case) is similar in performance to an atomic add on most platforms.

The CAS is the price they pay for contention detection, though it would be interesting to consider solutions which usually use unconditional atomics with only the occasional CAS in order to check contention, or which relied on some other contention detection approach (e.g., doing a second read to detect when the value incremented by more than your own increment).

The solution looks reasonable to me given the constraints.

o11c 4 hours ago [-]
Part of my point was that "check for contention" is often a silly thing to do in the first place, when you can just avoid it entirely (and with simpler code too).

Admittedly, Java is fundamentally incapable of half of the solutions, but making a simple bump allocator (called once per statistic at startup) over per-thread arrays is still possible.

BeeOnRope 11 hours ago [-]
Yeah exactly, and this is a commonly used trick in concurrent data structures in general. The Java implementation has the additional twist that they don't use a fixed number of "slots" but rather start with one and use CAS failures as a mechanism to detect contention, growing the number of slots until there are no longer CAS failures.
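
A rough Python analogue of that grow-on-contention behaviour, if it helps: a failed non-blocking lock acquire stands in for LongAdder's failed CAS, and the (arbitrary) 64-slot cap stands in for its CPU-count bound:

  import threading

  class StripedCounter:
      """Pick a slot from the current thread's id; treat a failed
      non-blocking acquire as a contention signal and double the slots."""

      MAX_SLOTS = 64  # arbitrary cap; LongAdder bounds stripes near the CPU count

      def __init__(self):
          self._slots = [[threading.Lock(), 0]]  # [lock, value] cells, start with one
          self._grow_lock = threading.Lock()

      def increment(self, n=1):
          slots = self._slots                     # growth swaps in a longer list
          cell = slots[threading.get_ident() % len(slots)]
          if cell[0].acquire(blocking=False):     # uncontended fast path
              cell[1] += n
              cell[0].release()
              return
          self._grow()                            # contention: add more slots
          with cell[0]:                           # then finish this add the slow way
              cell[1] += n

      def _grow(self):
          with self._grow_lock:
              if len(self._slots) < self.MAX_SLOTS:
                  self._slots = self._slots + [
                      [threading.Lock(), 0] for _ in range(len(self._slots))]

      def value(self):
          return sum(cell[1] for cell in self._slots)  # like sum(): not a strict snapshot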
hassleblad23 12 hours ago [-]
Are there any other approaches to this problem? Seems like there should be.
Squid_Tamer 9 hours ago [-]
I think probabilistic counters are really interesting for this situation, as long as you don't need exact numbers.

  import random

  if random.randint(1, 100) == 1:  # only ~1 in 100 calls touches the database
      db.increment(row)
Then, looking at the database, you can know that the actual count is approximately 100x the number of times the row was incremented.

There's an even more aggressive version that I've seen called a 'Morris Counter' which uses just a few bits to build an order-of-magnitude estimate of the number of events: https://en.wikipedia.org/wiki/Approximate_counting_algorithm
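
The base-2 version is only a few lines if anyone wants to play with it (a sketch, not tuned; real implementations pick a base closer to 1 for better accuracy):

  import random

  class MorrisCounter:
      """Stores only the exponent c; each event bumps c with
      probability 2**-c, so the estimate is 2**c - 1."""

      def __init__(self):
          self.c = 0

      def increment(self):
          if random.random() < 2.0 ** -self.c:
              self.c += 1

      def estimate(self):
          return 2 ** self.c - 1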

coolhand2120 9 hours ago [-]
You could add a row for each count. You could add more associated information with the count too, like date, why, etc. Then that insert would trigger a debounced method that would query the event stack to give you a count. You can make periodic snapshots of the total to avoid O(n) table scans. This allows scale without row locking and a much more accurate accounting of when the counts happen. You can rewind/fast-forward, get a time series, whatever, with the event stack. This is all part of the CQRS pattern.

https://martinfowler.com/bliki/CQRS.html
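
A toy version of that shape using Python's sqlite3, with invented table names; a real system would debounce the snapshot step off the insert or run it on a schedule, as described above:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE count_events (id INTEGER PRIMARY KEY, counter TEXT, at TEXT);
      CREATE TABLE count_snapshots (counter TEXT PRIMARY KEY,
                                    total INTEGER, last_event_id INTEGER);
  """)

  def record(counter):
      # appending never locks an existing row, and you keep the "when/why"
      conn.execute("INSERT INTO count_events (counter, at) "
                   "VALUES (?, datetime('now'))", (counter,))

  def snapshot(counter):
      # fold events newer than the last snapshot into the running total
      prev = conn.execute("SELECT total, last_event_id FROM count_snapshots "
                          "WHERE counter = ?", (counter,)).fetchone() or (0, 0)
      added, max_id = conn.execute(
          "SELECT COUNT(*), COALESCE(MAX(id), ?) FROM count_events "
          "WHERE counter = ? AND id > ?", (prev[1], counter, prev[1])).fetchone()
      conn.execute(
          "INSERT INTO count_snapshots VALUES (?, ?, ?) "
          "ON CONFLICT (counter) DO UPDATE SET total = excluded.total, "
          "last_event_id = excluded.last_event_id",
          (counter, prev[0] + added, max_id))

  record("likes"); record("likes")
  snapshot("likes")
  print(conn.execute("SELECT total FROM count_snapshots").fetchone()[0])  # 2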

aqueueaqueue 10 hours ago [-]
Outside of the DB: some kind of queue.

Could be SQS or an in-memory queue in your app (hand-waving about multiple nodes, but that can be dealt with, or use Redis). Have a single processor that aggregates and does writes. You can use a DLQ if something goes wrong to avoid data loss.
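
The in-process version is tiny. Here the queue and the flush interval stand in for SQS and a real batching policy, and write_batch is a placeholder for one UPDATE per counter:

  import queue
  import threading
  import time
  from collections import Counter

  increments = queue.Queue()          # producers only ever enqueue

  def write_batch(batch):
      # placeholder for one UPDATE ... SET n = n + ? per counter
      print("flushing", dict(batch))

  def aggregator(flush_every=1.0):
      while True:
          batch = Counter()
          deadline = time.monotonic() + flush_every
          while time.monotonic() < deadline:
              try:
                  name, n = increments.get(timeout=0.1)
                  batch[name] += n
              except queue.Empty:
                  pass
          if batch:
              write_batch(batch)      # single writer, so no row contention

  threading.Thread(target=aggregator, daemon=True).start()
  increments.put(("page_views", 1))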

Inside the DB, maybe just append rows for each add to a table without indexes. That way the engine probably just needs to write out a journal record and append, but doesn't spend time updating indexes. Then periodically read everything from the table and aggregate into another. You could cycle between 2 tables.

These are eventually consistent in aggregation but transactional in the source of truth, i.e. if you write, you can be sure the data is stored, but it may take time to aggregate.

Or use a more exotic DB that is distributed but still supports transactions.

smlacy 11 hours ago [-]
Usually for this kind of use case you want both INCREMENT and DECREMENT. Using a random number for the "Slot" (AKA "Shard") seems like an anti-pattern. If the Slot were derived from a hash of the data itself, then DECREMENT would be much more straightforward.
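
If I'm reading the suggestion right, the slot becomes a pure function of the data that caused the count (a user id, say), so an INCREMENT and its later DECREMENT hit the same row. A sketch with a stable hash; table and column names are invented:

  import zlib

  SLOT_COUNT = 100

  def slot_for(data: str) -> int:
      # crc32 rather than hash(): stable across processes and restarts
      return zlib.crc32(data.encode()) % SLOT_COUNT

  def adjust(cursor, counter: str, data: str, delta: int):
      # data is whatever caused the count, so +1 and its matching -1
      # always land on the same slot
      cursor.execute(
          "INSERT INTO slotted_counter (counter, slot, cnt) VALUES (?, ?, ?) "
          "ON CONFLICT (counter, slot) DO UPDATE SET cnt = cnt + excluded.cnt",
          (counter, slot_for(data), delta))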
roywiggins 10 hours ago [-]
Hm, why wouldn't decrements work with random slots, if you're always just summing up all the slots to get the final tally?
immibis 10 hours ago [-]
Yes: use a different tool, such as Redis.

Updating a SQL database row can be expensive because of the transaction overhead. Once you increment this counter, the row with the counter remains locked until the transaction is committed. But the actual useful work of incrementing a number is already quicker than selecting a random slot. If the increment can be a simple ADD instruction, without transactional semantics, then this random-slot-based approach is slower than just incrementing it. Especially if you don't care about missing a few increments and therefore don't need a mutex.
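
A sketch of that split, assuming the redis-py client and Redis >= 6.2 (for GETDEL); the SQL names are placeholders. The hot counter lives in Redis and a periodic job folds it back into the relational table:

  import redis  # assumes redis-py and a local Redis server

  r = redis.Redis()

  def bump(counter: str, n: int = 1):
      r.incrby(counter, n)   # one atomic command, no row locks

  def flush_to_sql(counter: str, cursor):
      # GETDEL returns the current value and resets the key atomically;
      # increments that arrive afterwards start a new tally
      pending = r.getdel(counter)
      if pending:
          cursor.execute("UPDATE counters SET count = count + ? WHERE name = ?",
                         (int(pending), counter))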

For even more scale, don't use Redis but have a shared memory block containing your integer.

For even more scale, you can have one such slot per CPU. Make sure they're in different cache lines, to avoid false sharing. Now you have slots again, but they're per-CPU and so you still avoid all the SQL database overhead.

Performance is all about structuring the task as close to the machine as possible.

jeffrallen 10 hours ago [-]
Some people, when they have the problem of incrementing integers think, "I'll use Redis."

Now they have two problems.

esafak 3 hours ago [-]
Can you go into more detail? Redis supports counters: https://redis.io/glossary/storing-counters-in-redis/
whirlwin 12 hours ago [-]
It's an interesting approach. In similar cases, has anyone got experience using a database more optimized for writes, e.g. Cassandra?
eatenfill 11 hours ago [-]
Phillip, it's called lock striping.
smlacy 11 hours ago [-]
Really just a special case of sharding.