Minimizing correlated failures in distributed systems (aws.amazon.com)
0xbadcafebee 700 days ago [-]
I think the biggest takeaway here is system-wide operations are bad for system-wide reliability.

Also watch out for snowballing. External dependent systems should not rely on your entire system, but a portion of it. Building your system into regions and AZs is a cheap hack to enable you to perform operations on a portion of your system rather than the whole. If external dependent systems also depend on just a portion of your system, you avoid total system collapse and snowballing.
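A minimal sketch of the "depend on a portion, not the whole" idea, assuming each dependent is pinned to one cell (a region, AZ, or smaller partition) by a stable hash. The cell names and the `cell_for` / `affected_clients` helpers are hypothetical, made up for illustration, and not from the article:

```python
import hashlib

# Hypothetical partitions of the system; in practice these might be AZs or cells.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]


def cell_for(client_id: str, cells=CELLS) -> str:
    """Pin a dependent to exactly one cell using a stable hash of its id."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_clients(client_ids, failed_cell, cells=CELLS):
    """Return only the dependents that would see a failure of `failed_cell`."""
    return [c for c in client_ids if cell_for(c, cells) == failed_cell]
```

With this kind of pinning, losing one cell hurts only the dependents mapped to it, rather than every dependent at once. Note that plain modulo hashing reshuffles most assignments when the cell list changes, so a real system would likely want something more stable.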

mjb 700 days ago [-]
> I think the biggest takeaway here is system-wide operations are bad for system-wide reliability.

Yes!

System-wide operations, whether they are human-driven operations ("ssh onto that box"), control-plane operations ("remove all the failed servers"), DI operations ("deploy the new code"), or even basic algorithmic things like replication ("put the same state onto all the servers") are the top causes of correlation that I've seen in the wild. Whether or not this matters to you depends a lot on what you're building, and how often you can tolerate failures. But if you're building something that needs high availability, durability, integrity, etc., it's worth paying a huge amount of attention to the things that can introduce correlation in your systems.
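One way to keep an operation like "deploy the new code" from being system-wide is to apply it one zone at a time and stop at the first sign of trouble. This is a rough sketch under stated assumptions, not how AWS does it: `staged_rollout`, `deploy`, and `healthy` are hypothetical names, and the health check is whatever signal you trust.

```python
from typing import Callable, Dict, List


def staged_rollout(
    zones: Dict[str, List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Apply `deploy` one zone at a time and halt at the first unhealthy zone.

    Returns the zones that were deployed and passed the health check.
    Zones after a failing one are never touched, which bounds the blast radius.
    """
    completed: List[str] = []
    for zone, hosts in zones.items():
        for host in hosts:
            deploy(host)
        # Check the whole zone before moving on to the next one.
        if not all(healthy(host) for host in hosts):
            break
        completed.append(zone)
    return completed
```

A bad change then takes out at most one zone before the rollout halts. A real pipeline would also want bake time between zones and a rollback step, both of which are left out here.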

Joe (the OP, and a colleague of mine at AWS) talks about some methods of avoiding those in the article. If you're interested in reading more beyond that:

* Our "Millions of Tiny Databases" paper goes into a lot of detail on another AWS take on reducing correlated failure (https://www.usenix.org/conference/nsdi20/presentation/brooke...).
* Some AWS folks from the S3 team also touch on correlation in this talk: https://www.youtube.com/watch?v=DzRyrvUF-C0&t=2410s
* I've written in the past about the role of software deployments in correlated failure (https://brooker.co.za/blog/2022/01/31/deployments.html), and about how to think about the role of redundancy (https://brooker.co.za/blog/2021/04/14/redundancy.html).

bvaldivielso 700 days ago [-]
The article mentions the software running in each availability zone as a source of correlation in the system. That correlation can be bad, because it makes the system less resilient (that's the point of the article).

I wonder if Amazon would ever consider having completely independent implementations of the same software (made and maintained by different teams) running in each availability zone. This would reduce the correlation between AZs. Of course, this would be much more expensive, but perhaps worth it if availability was critical.

In fact, I vaguely recall that this was a common practice in the development of some kind of critical system (avionics?).

BaconPackets 700 days ago [-]
I'm really curious about how everyone is approaching scalability/reliability/redundancy at scale.

We are in the midst of a 10k VM migration to AWS and GCP and it's definitely challenging.

Balancing the speed of the migration itself vs. not ending up just lifting and shifting workloads is difficult.
