On the general topic - I’m interested in people sharing their experiences using LLMs in livesite scenarios.
bagels 340 days ago [-]
I found something important was broken at Facebook, created a SEV, but the hardest part was figuring out what team to page. I unfortunately roped in one non-responsible team's oncall who was making dinner at home, but thankfully, he did know what team to reach out to, based on the description of the problem.
Would be nice if there was better tooling for going from observed problem to responsible team.
chronid 340 days ago [-]
Some FAANGs at least (though they may not cover everything) have a "help something is broken but I don't know what to do" team and/or rotation for incident response, staffed on multiple continents to "follow the sun".
But you need to know they exist. :)
ElevenLathe 338 days ago [-]
I've worked on several such teams (not at FANGy places, but some household names), variously called just the NOC or SOC (early on in my career, the role was also a kind of on-duty Linux admin/computer generalist), Command Center, and Mission Control. It was great fun a lot of the time but the hours got to be tiresome.
I would be very surprised if any enterprise of significant size and IT complexity didn't have an IT incident response team. I'm biased but I think they are a necessity in complex environments where oncall engineers can't possibly even keep track of all their integrators and integrators' integrators, etc. It also helps to have incident commanders who do that job multiple times a week instead of a few times a decade.
fma 339 days ago [-]
I never worked at a FAANG... but I have been at a Fortune 20 company for the last 9 years. There is no system of record of applications?
I can go to a website and type in search terms, URLs and pull up exactly who to contact. Even our generic "help something is broken" group relies on this. There are many names listed so even if the on call person listed is "making dinner", you have their backup, their manager, etc.
I can tag my system as dependent on another and if they have issues I get alerted.
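A minimal sketch of the kind of system of record described above, assuming a simple in-memory registry. The `Service` and `Registry` names, the example services, and the URLs are all hypothetical and are not taken from any real tool. It shows two things from the comment: searching by term or URL to get an ordered contact list (on-call, backup, manager), and finding which systems have tagged themselves as dependent on a service so they can be alerted.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    contacts: list[str]  # ordered: on-call first, then backup, then manager
    urls: list[str] = field(default_factory=list)
    depends_on: set[str] = field(default_factory=set)


class Registry:
    """Hypothetical system of record mapping applications to owners."""

    def __init__(self) -> None:
        self._services: dict[str, Service] = {}

    def register(self, service: Service) -> None:
        self._services[service.name] = service

    def search(self, term: str) -> list[Service]:
        # Match the search term against service names and known URLs.
        term = term.lower()
        return [
            s for s in self._services.values()
            if term in s.name.lower() or any(term in u.lower() for u in s.urls)
        ]

    def dependents_of(self, name: str) -> list[str]:
        # Services that tagged themselves as depending on `name`;
        # these are the ones to alert when `name` has an issue.
        return sorted(
            s.name for s in self._services.values() if name in s.depends_on
        )


registry = Registry()
registry.register(
    Service("payments", ["alice", "bob", "carol"], ["https://pay.example.com"])
)
registry.register(
    Service("checkout", ["dave", "erin"],
            ["https://shop.example.com/checkout"], {"payments"})
)

hits = registry.search("pay.example.com")
to_alert = registry.dependents_of("payments")
```

Here `hits[0].contacts` gives the full list to work through if the first person is unavailable, and `to_alert` lists the downstream systems to notify.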
chronid 339 days ago [-]
I am simplifying quite a bit, but you are expected to know your direct dependencies (and normally will), pagers have embedded escalation rules with primaries and secondaries, etc. The tooling once you know what to do is better than anything outside of FAANGs I've seen in terms of integration and reliability.
Escalation teams are usually reserved for the "oh fuck" situations, like "I don't work on this site but I found it broken" or "hey I think we are going to lose this availability zone soon" or "I am panicking and have no idea how to manage this incident, please help me".
They're a glue mechanism to prevent silos and paralysis during an event, usually pretty good engineers too.
jedberg 339 days ago [-]
That was one of the first things we built at Netflix when I got there. We had a paging schedule tied to every micro service. If you knew what service was broken, you could just "page the service" and their current on call would get paged.
If you didn't know what it was, you could page the SRE team and we'd diagnose with you.
Sometimes as SREs we would shortcut the process and just know who the right person is with the answer, but at least this way that tribal knowledge was somewhat encoded.
bagels 339 days ago [-]
Yeah, if you know what service is down, it's also trivial at Facebook to track down the oncall for that service. What isn't trivial is when you get a blank page where there might be dozens or hundreds of teams that might be responsible.
pzh 339 days ago [-]
So if you look at the flip side, the on call engineer is being misled by AI ~60% of the time. The question is does this slow them down more or less than the speedup they get when the AI is right the other 40% of the time.
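One way to frame that question is as a break-even calculation. The sketch below assumes a fixed time saved when the AI is right and a fixed time lost when it is wrong. The 40% comes from the comment above; the minute values are made-up illustration numbers, not measurements.

```python
def net_minutes_saved(p_right: float, saved_when_right: float,
                      lost_when_wrong: float) -> float:
    """Expected minutes saved per incident (negative means a net slowdown)."""
    return p_right * saved_when_right - (1 - p_right) * lost_when_wrong


def break_even_ratio(p_right: float) -> float:
    """Minimum saved/lost ratio at which the AI is a net win."""
    return (1 - p_right) / p_right


ratio = break_even_ratio(0.4)

# Hypothetical numbers: a good suggestion saves 30 minutes,
# a misleading one costs 10 minutes.
net = net_minutes_saved(0.4, saved_when_right=30, lost_when_wrong=10)
```

At 40% accuracy the break-even ratio is 1.5, so the tool helps on average only if a correct suggestion saves at least 1.5 times as much time as a wrong one wastes.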
whiplash451 339 days ago [-]
I would expect HN to flag product marketing posts when the original post is available instead
netik 340 days ago [-]
Great idea, but yet another blog post that is actually marketing and ends with “they did it, buy our product so you can too”, which is probably not what Meta did.
> Humans aren't great at incident response, and we all hate waking up at 2am to resolve an issue.
Agree that most people hate being woken at 2am, but disagree that humans aren't great at incident response. Speaking generally, I think we're about as good as it gets when it comes to adaptability and the kind of reasoning that's necessary to investigate complex issues.
That said, I also think AI can play a massive role aiding humans, especially in undifferentiated tasks like checking deployments, code changes, past incidents, and when it comes to spotting patterns.
IMO the sweet spot is going to come from highly ergonomic AI products that enable collaborative incident response, rather than AI incident management or any other marketing BS.
serious_angel 339 days ago [-]
Frankly, the whole phrase quoted sounds like a personal issue. If it is, let's hope they'll manage it, and get a proper sleep.
wkat4242 339 days ago [-]
The thing is, nobody needs to be woken up at 2am at Meta. They're big enough to have a follow the sun model where they rotate support around the world. They have a footprint almost everywhere.
jolarkin 339 days ago [-]
of course they do, why would this be interesting at all???
trod1234 338 days ago [-]
The interest is in the broad trend.
This clearly telegraphs a few things for those that are observant enough to notice.
Anyone interested in going into that field who has a brain would note that they are removing the entry level jobs using this since those are the simpler jobs.
With less demand for said labor, the more cutthroat the competition to get said jobs will be. If you double the competition, wages must naturally drop and with that so does competency and skill. When proficiency isn't rewarded, those with options go elsewhere.
When any other job with 1/10th the responsibility gets paid the same, it's a simple no-brainer choice.
When industry with large marketshare telegraphs this, it is a strong indicator that IT operations (in this case) will soon be a dead profession (soon being a relative generation), and anyone investing in education for it will get a negative ROI.
Upfront, the companies may get some profits off the bottom-line, but long-term they won't be able to find or keep talent. Said talent will have left and you get Atlas Shrugged dynamics.
Aside from that, that type of strategy also suggests other cascading dynamics that lead to grid/societal collapse following a burning the bridges approach. Those in their ivory towers may not see the consequences of their choices until it is too late for anything except coffin nails.
The fundamental issue of market sector concentration where few parties cooperate is that counter-party systemic risk becomes unmanageable and chaotic, destructive decisions cannot be softened.
In chaotic feedback systems, the general rule of thumb is chaos increases until the underlying imbalance is equalized.
For some, recognizing the inevitable conclusion of a series of events may dictate a radically altered approach towards life planning/survival.
trod1234 338 days ago [-]
Yeah this is just brilliant. /s
Reading between the lines/Translation:
We need ~50% less IT operations staff, we can have LLMs do this instead.
The LLMs are coming for your white-collar work, and this is likely driving the white-collar recession.
Here's a thought: incident response is one of the areas where entry-level SAs cut their teeth and sharpen their skills so they can take on intermediate and senior level roles.
What happens when all the low-hanging fruit, the entry-level jobs, are replaced by LLMs and there is no short-term business need to hire such people?
No jobs means labor pool finds something else outside their profession, or they sit around on food stamps homeless, agitating with the homies, until a critical mass occurs.
Your intermediate and senior people naturally age and die, so how do you find replacements for them? ...
When there is no economic advantage for developing a skill set, no one goes into the field, it acts like a sieve, and eventually the skills involved become lost knowledge. Those who have the skills but are unable to find work seek work elsewhere, and rarely return. They were burned severely, enough times that it becomes a bad bet to try again in that field.
These things aren't rocket science, and yet people seem to be so slothful or greedy, that they are unable to see or act to prevent what naturally happens next.
When people can't find jobs to feed themselves or loved ones, where the only future which has been imposed on them is slavery or death, these people will organize and do the only thing they can once they are desperate enough; and that is violence. These same dynamics were present in the decades leading up to 1776, according to historic record.
https://news.ycombinator.com/item?id=41326039
If you're looking for something open source: https://github.com/robusta-dev/holmesgpt/
It is so extremely short-sighted.
https://www.wildmoose.ai
Edit: Lol at myself, I thought this was a blog post from Meta and I was pointing out that there is a YC company that does this for everyone.
Now I realize that this was an ad for a different YC company that also does this (although WM is a year older).