This pattern has a smell. If you're shipping continuously, then your on-call engineer is going to be fixing the issues the other engineers are shipping, instead of those engineers following up on their deployments and fixing issues caused by their own changes. If you're not shipping continuously, then customer issues can't be fixed continuously anyway, and your list of bugs can be prioritized by management alongside the rest of the work to be done. The author cites maker vs. manager schedules, but one conclusion of following that model is that engineers don't talk directly to customers, because "talking to customers" is another kind of meeting, which is a "manager schedule" kind of thing rather than a "maker schedule" kind of thing.
There's simply no substitute for Kanban processes and for proactive communication from engineers. In a small team without dedicated customer support, a manager takes the customer call, decides whether it's legitimately a bug, creates a ticket to track it and prioritizes it in the Kanban queue. An engineer takes the ticket, fixes it, ships it, communicates that they shipped something to the rest of their team, is responsible for monitoring it in production afterwards, and only takes a new ticket from the queue when they're satisfied that the change is working. But the proactive communication is key: other engineers on the team are also shipping, and everyone needs to understand what production looks like. Management is responsible for balancing support and feature tasks by balancing the priority of tasks in the Kanban queue.
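To make that concrete, here's a minimal sketch of the ticket lifecycle described above; the state names and the take_next helper are illustrative, not any particular team's tooling:

    # Illustrative model of the ticket flow described above (names are made up).
    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import List, Optional

    class State(Enum):
        TRIAGED = auto()      # manager confirmed it's a real bug and set a priority
        IN_PROGRESS = auto()  # an engineer has picked it up
        SHIPPED = auto()      # fix deployed; the engineer announced it to the team
        VERIFIED = auto()     # the engineer confirmed it's healthy in production

    @dataclass
    class Ticket:
        title: str
        priority: int                 # set by management; lower number = more urgent
        state: State = State.TRIAGED

    def take_next(queue: List[Ticket]) -> Optional[Ticket]:
        # An engineer calls this only after their previous ticket reached VERIFIED.
        open_tickets = [t for t in queue if t.state is State.TRIAGED]
        if not open_tickets:
            return None
        ticket = min(open_tickets, key=lambda t: t.priority)
        ticket.state = State.IN_PROGRESS
        return ticket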
thih9 16 days ago [-]
> on-call engineer is going to be fixing the issues the other engineers are shipping, instead of those engineers following up on their deployments and fixing issues caused by those changes
Solution: don’t. If a bug was introduced by the long-running work currently in progress, forward it back to the engineers doing that work. This is not distracting, this is very much on topic.
And if a bug is discovered after the cycle ends - then the teams swap anyway and the person who introduced the issue can still work on the fix.
dakshgupta 16 days ago [-]
This is a real shortcoming, the engineers that ship feature X will not be responsible for the immediate aftermath. Currently we haven’t seen this hurt in practice, probably because we are very small and in-person, but you might be correct and it would then likely be the first thing that breaks about this as our team grows.
safety1st 16 days ago [-]
I commented a while back on another post about a company I worked at which actually made developers spend a few days a year taking tech support calls. This takes their responsibility for and awareness of the aftermath of their work to a whole new level and from my perspective was very effective. Could be an alternate route to address the same problem.
crabmusket 16 days ago [-]
We often plan projects in release stages including a limited alpha to a few customers who are interested in a feature. We expect that during the alpha period, the developer who worked on the feature will need to make changes and address feedback from the users. And the same after a general release. We have longer rotations than yours so there is usually time to schedule this in around feature releases before that developer is responsible for general defensive work.
pnut 16 days ago [-]
We came to this conclusion from a different direction - feature implementation teams are focused on subdomains, but defensive teams are spread across the entire ecosystem.
Additionally, defensive devs have brutal SLAs, and are frequently touching code with no prior exposure to the domain.
They got known as "platform vandals" by the feature teams, & we eventually put an end to the separation.
cutemonster 16 days ago [-]
That sounds interesting ("platform vandals" and your solution). At what type of software company do you work? About how many are you, what type of product, if I can ask?
spease 16 days ago [-]
It really depends on the context. Some types of troubleshooting just involve a lot of time-consuming trial-and-error that doesn’t teach anything; it just rules out possibilities to diagnose the issue. Some products have a long deployment cycle or feedback loop. Some people are just more suited to, or greatly prefer, either low or high context-switched work.
Good management means finding the right balance for the team, product, and business context that you have, rather than inflexibly trying to force one strategy to work because it’s supposedly the best.
WorldWideWebb 15 days ago [-]
What do you do if the manager has no technical skill/knowledge (basically generic middle manager that happens to lead a technical team)?
dakiol 17 days ago [-]
I once worked for a company that required each engineer on the team to do what they called “firefighting” during working hours (so not exactly on-call). So for one week, I was triaging bug tickets and trying to resolve them. These bugs belonged to the area my team was part of, so they affected the same product but a vast number of microservices, most of which I didn’t know much about (besides how to use their APIs). It didn’t make much sense to me. So you have Joe punching out code like there’s no tomorrow and introducing bugs because features must go live asap, and then I’m the one fixing stuff. So unproductive. I always advocated for a slower pace of feature delivery (so more testing and fewer bugs in production) but everyone was like “are you from the 80s or something? We gotta move fast man!”
onion2k 16 days ago [-]
This sort of thing is introduced when the number of bugs in production, especially bugs that aren't user-facing or a danger to data (eg 'just' an unhandled exception or a weird log entry), gets to a peak and someone decides it's important enough to actually do something about it. Those things are always such a low priority that they're rarely dealt with any other way.
In my experience whenever that happens someone always finds an "oh @#$&" case where a bug is actually far more serious than everyone thought.
It is an approach that's less productive than slowing down and delivering quality, but it's also completely inevitable once a team/company grows to a sufficient size.
dakshgupta 17 days ago [-]
This is interesting because it’s what I imagine would happen if we scaled this system to a larger team - offense engineers would get sloppy, defensive engineers would get overwhelmed, even with the rotation cycles.
Small, in-person, high-trust teams have the advantage of not falling into bad offense habits.
Additionally, a slower shipping pace simply isn’t an option, seeing as the only advantage we have over our giant competitors is speed.
jedberg 17 days ago [-]
> offense engineers would get sloppy
Wouldn't they be incentivized to maintain discipline because they will be the defensive engineers next week when their own code breaks?
dakshgupta 17 days ago [-]
I suspect that as the company gets larger, the time between defensive sprints will get longer, but yes, for smaller teams this is what keeps quality high: you’ll have to clean up your own mess next week.
DJBunnies 17 days ago [-]
I think we’ve worked for the same org
resonious 16 days ago [-]
I honestly don't really like the "let's slow down" approach. It's hard for me to buy into the idea that simply slowing down will increase product quality. But I think your comment already contains the key to better quality: close the feedback loop so that engineers are responsible for their own bugs. If I have the option of throwing crap over the wall, I will gravitate towards it. If I have to face all of the consequences of my code, I might behave otherwise.
isametry 16 days ago [-]
Slow is smooth, smooth is fast?
samarthr1 15 days ago [-]
Greetings, Phil Dunphy
smoothisfast2 16 days ago [-]
Slowing down doesn't mean going slow. There's more to software development than vomiting out lines of code as quickly as possible.
no_wizard 16 days ago [-]
>Slowing down doesn't mean going slow. There's more to software development than vomiting out lines of code as quickly as possible.
Tell that to seemingly every engineering manager and product manager coming online over the last 8-10 years.
I first noticed in 2016 there seemed to be a direct correlation between more private equity and MBA's getting into the field and the decline of software quality.
So now you have a generation of managers (and really executives) who know little of the true tradeoffs between quality and quantity because they only ever saw success pushing code as fast as possible regardless of its quality and dealing with the aftermath. This led them to promotions, new jobs, etc.
We did this to ourselves really, by not becoming managers and executives ourselves as engineers.
glenjamin 16 days ago [-]
Having a proportion of the team act as triage for issues / alerts / questions / requests is a generally good pattern that I think is pretty common - especially when aligned with an on-call rotation. I've done it a few times by having a single person in a team of 6 or 7 do it. If you're having to devote 50% of your 4-person team to this sort of work, that suggests your ratios are a bit off imo.
The thing I found most surprising about this article was this phrasing:
> We instruct half the team (2 engineers) at a given point to work on long-running tasks in 2-4 week blocks. This could be refactors, big features, etc. During this time, they don’t have to deal with any support tickets or bugs. Their only job is to focus on getting their big PR out.
This suggests that this pair of people only release 1 big PR for that whole cycle - if that's the case this is an extremely late integration and I think you'd benefit from adopting a much more continuous integration and deployment process.
wavemode 16 days ago [-]
> This suggests that this pair of people only release 1 big PR for that whole cycle
I think that's a too-literal reading of the text.
The way I took it, it was meant to be more of a generalization.
Yes, sometimes it really does take weeks before one can get an initial PR out on a feature, especially when working on something that is new and complex, and especially if it requires some upfront system design and/or requirements gathering.
But other times, surely, one also has the ability to pump out small PRs on a more continuous basis, when the work is more straightforward. I don't think the two possibilities are mutually exclusive.
Kinrany 16 days ago [-]
I thought that at first, but the article literally says "getting their big PR out".
DanHulton 16 days ago [-]
Yeah, but again you might be being too literal. You could get a half dozen "big PRs" out in a month or so, but you'd still want to be able to just focus on "getting your (current) big PR out", you know?
The important part is that you're not interrupted during your large-scale tasks, not the absolute length of those tasks.
Kinrany 16 days ago [-]
That's fair: even if their team has a problem with PR size, this doesn't have all that much to do with the pattern the article describes.
The_Colonel 16 days ago [-]
> This suggests that this pair of people only release 1 big PR for that whole cycle - if that's the case this is an extremely late integration
I don't think it suggests how the time block translates into PRs. It could very well be a series of PRs.
In any case, the nature of the product / features / refactorings usually dictates the minimum size of a PR.
marcinzm 16 days ago [-]
> In any case, the nature of the product / features / refactorings usually dictates the minimum size of a PR.
Why not split the big tickets into smaller tickets which are delivered individually? There's cases where you literally can't but in my experience those are the minority or at least should be assuming a decently designed system.
The_Colonel 16 days ago [-]
> Why not split the big tickets into smaller tickets which are delivered individually?
Because it is already the smallest increment you can make. Or because splitting it further would add a lot of overhead.
> There's cases where you literally can't but in my experience those are the minority
I think in this sentence, there's a hidden assumption that most projects look like your project(s). That's likely false.
marcinzm 16 days ago [-]
> I think in this sentence, there's a hidden assumption that most projects look like your project(s). That's likely false.
You left out the part of that quote where I explained my assumption very clearly: A decently designed system.
In my experience if you cannot split tasks into <1 week the vast majority of the time then your code has massive land mines in it. The design may be too inter-connected, too many assumptions baked too deeply, not enough tests, or various other issues. You should address those landmines before you step on them rather than perpetually trying to walk around them. Then splitting projects down becomes much much easier.
The_Colonel 16 days ago [-]
> You left out the part of that quote where I explained my assumption very clearly: A decently designed system.
That's one possible reason. Sometimes software is designed badly from the ground up, sometimes it accumulates a lot of accidental complexity over years or decades. Solving that problem is usually out of your control in those cases, and only sometimes there's a business driver to fix it.
But there are many other cases. You have software with millions of lines of code, decades of commit history. Even if the design is reasonable, there will be a significant amount of both accidental and essential complexity - from certain size/age you simply won't find any pristine, perfectly clean project. Implementing a relatively simple feature might mean you will need to learn the interacting features you've never dealt with so far, study documentation, talk to people you've never met (no one has a complete understanding either). Your acceptance testing suite runs for 10 hours on a cluster of machines, and you might need several iterations to get them right. You have projects where the trade-off between velocity and tolerance for risk is different from yours, and the processes designed around it are more strict and formal than you're used to.
skydhash 16 days ago [-]
And also you have to backport all the changes that have been made on the main branch. Especially for upgrading or stack switching tasks.
ozim 16 days ago [-]
Ticket can have multiple smaller PRs.
Lots of the time it is true that ticket == PR, but it is not the law.
It sometimes makes sense to separate subtasks under a ticket but that is only if it makes sense in business context.
codemac 16 days ago [-]
> extremely late integration
That's only late if there are other big changes going in at the same time. The vast majority of operational/ticketing issues have few code changes.
I'm glad I had the experience of working on a literal waterfall software project in my life (e.g. plan out the next 2 years first, then we "execute" according to a very detailed plan that entire time). Huge patches were common in this workflow, and only caused chaos when many people were working in the same directory/area. Otherwise it was usually easier on testing/integration - only 1 patch to test.
yayitswei 16 days ago [-]
A PR that moves the needle is worth 2-4 weeks or more. Small improvements or fixes can be handled by the team on the defense rotation.
marcinzm 16 days ago [-]
That's also been my experience. It's part time work for a single on call engineer on a team of 6-8. If it's their full time work for a given sprint then we have an urgent retro item to discuss around bug rates, code quality and so on.
cutemonster 16 days ago [-]
Might be quick nice-to-have features too (not only bugs)
stopachka 17 days ago [-]
> While this is flattering, the truth is that our product is covered in warts, and our “lean” team is more a product of our inability to identify and hire great engineers, rather than an insistence on superhuman efficiency.
> The result is that our product breaks more often than we’d like. The core functionality may remain largely intact but the periphery is often buggy, something we expect will improve only as our engineering headcount catches up to our product scope.
I really resonate with this problem. It was fun to read. We've tried different methods to balance customers and long-term projects too.
Some more ideas that can be useful:
* Make quality projects an explicit monthly goal.
For example, when we noticed the edges of our surface area got too buggy, we started a 'Make X great' goal for the month. This way you don't only have to react to users reporting bugs, but can be proactive.
* Reduce Scope
Sometimes it can help to reduce scope; for example, before adding a new 'nice to have feature', focus on making the core experience really great. We also considered pausing larger enterprise contracts, mainly because it would take away from the core experience.
---
All this to say, I like your approach; I would also consider a few others (make quality projects a goal, and cut scope)
cutemonster 16 days ago [-]
> Make quality projects .. can be proactive
What are some proactive ways? Ideally that cannot easily be gamed?
I suppose test coverage and such things, and an internal QA team. What I thought the article was about (before having read it) was having half of the developers do red team penetration testing, or look for UX bugs, in things the other half had written.
Any more ideas? Do you have any internal definitions of "a quality project"?
Attummm 16 days ago [-]
When you get to that stage, software engineering has failed fundamentally.
This is akin to having a boat that isn't seaworthy, so the suggestion is to have a rowing team and a bucket team.
One rows, and the other scoops the water out.
While missing the actual issue at hand.
Instead, focus on creating a better boat. In this case, that would mean investing in testing: unit tests, integration tests, and QA tests.
Have staff engineers guide the teams and make reducing incidents their KPI.
Increase the quality and reduce the bugs, and there will be fewer outages and issues.
lucasyvas 16 days ago [-]
> When you get to that stage, software engineering has failed fundamentally.
Agreed - this is a survival mode tactic at every company I’ve been at when it’s happened. If you’re permanently in the described mode and you’re small sized, you might as well be dead.
If mid to large and temporary, this might be acceptable to right the ship.
intelVISA 16 days ago [-]
Yep, software is about cohesion. Having one side beloved by product and blessed with 'the offense' racing ahead to create extra work for the other is not the play.
Even when they rotate - who wants to clock in to wade through a fresh swamp they've never seen? Don't make the swamp: if you're moving too slow shipping things without sinking half the ship each PR then raise your budget to better engineers - they exist.
This premise is like advocating for tech debt loan sharks; I really hope TFA was ironic. Sure, it makes sense from a business perspective as a last gasp to sneakily sell off your failed company but you would never blog "hey here at LLM-4-YOU, Inc. we're sinking".
ericmcer 16 days ago [-]
You are viewing it like an engineer. From a business perspective if you can keep a product stable while growing your user base until you become an attractive acquisition target then this is a great strategy.
Sometimes as an engineer I like frantically scooping water while we try to scale rapidly, because it means leadership's vision is to get an exit for everyone as fast as possible. If leadership said "let's take a step back and spend 3 months stabilizing everything and creating a testing/QA framework" I would know they want to ride it out til the end.
Attummm 16 days ago [-]
I think you're not following what I tried to convey.
It shouldn't have ever come to the point where incidents, outages, and bugs have become prominent enough to warrant a team.
Either have kickass dev(s), although that's improbable, hence the second level: implement mitigations, focus on testing, and have staff engineers with KPIs to lower incidents. Give them the space but be prepared to let them go if incidents don't go down.
There is no stopping of development.
Refactoring by itself doesn't guarantee better code or fewer incidents.
But don't allow bugs, or known issues, as they can be death by thousand cuts.
The viewpoint here is not an engineer's. Having constant incidents doesn't show confidence or competence to investors and customers.
As it diverts attention from creating business value into firefighting, which has zero business value and is bad for morale.
Thus, tech investment rather than debt always pays off if implemented right.
kqr 16 days ago [-]
Wait, are you saying well-managed software development has no interrupt-driven work, and still quickly and efficiently delivers value to end users?
How does one get to that state?
fryz 17 days ago [-]
Neat article - I know the author mentioned this in the post, but I only see this working as long as a few assumptions hold:
* avg tenure / skill level of team is relatively uniform
* team is small with high-touch comms (eg: same/near timezone)
* most importantly - everyone feels accountable and has agency for work others do (eg: codebase is small, relatively simple, etc)
Where I would expect to see this fall apart is when these assumptions drift and holding accountability becomes harder. When folks start to specialize, something becomes complex, or work quality is sacrificed for short-term deliverables, the folks that feel the pain are the defense folks, and they don't have the agency to drive the improvements.
The incentives for folks on defense are completely different than folks on offense, which can make conversations about what to prioritize difficult in the long term.
dakshgupta 17 days ago [-]
These assumptions are most likely important and true in our case, we work out of the same room (in fact we also all live together) and 3/4 are equally skilled (I am not as technical)
eschneider 17 days ago [-]
If the event-driven 'fixing problems' part of development gets separated from the long-term 'feature development', you're building a disaster for yourself. Nothing more soul-sucking than fixing other people's bugs while they happily go along and make more of them.
dakshgupta 17 days ago [-]
There is certainly some razor applied on whether a request is unique to one user or is widely requested/likely to improve the experience for many users
jedberg 17 days ago [-]
> this is also a very specific and usually ephemeral situation - a small team running a disproportionately fast growing product in a hyper-competitive and fast-evolving space.
This is basically how we ran things for the reliability team at Netflix. One person was on call for a week at a time. They had to deal with tickets and issues. Everyone else was on backup and only called for a big issue.
The week after you were on call was spent following up on incidents and remediation. But the remaining weeks were for deep work, building new reliability tools.
The tools that allowed us to be resilient enough that being on call for one week straight didn't kill you. :)
dakshgupta 17 days ago [-]
I am surprised and impressed that a company at that scale functions like this. We often discuss internally whether we can still do this when we’re 7-8 engineers.
jedberg 17 days ago [-]
I think you're looking at it backwards. We were only able to do it because we had so many engineers that we had time to write tools to make the system reliable enough.
On call for a week at a time only really works if you only get paged at night once a week max. If you get paged every night, you will die from sleep deprivation.
dmoy 16 days ago [-]
Moving from 24/7 oncall to 12 hour shifts trading off with another continent is really nice
cgearhart 17 days ago [-]
This is often harder at large companies because you very rarely make career progress playing defense, so it becomes very tricky to do it fairly. It can work wonders if you have the right teammates, but it’s almost a prisoners dilemma game that falls apart as soon as one person opts out.
dakshgupta 17 days ago [-]
Good point. We will usually only rotate when the long-running task is done, but eventually we’ll arrive at some feature that takes more than a few weeks to build, so we will need to restructure our methods then.
shalmanese 17 days ago [-]
To the people pooh poohing this, do y’all really work with such terrible coworkers that you can’t imagine an effective version of this?
You need trust in your team to make this work but you also need trust in your team to make any high velocity system work. Personally, I find the ideas here extremely compelling and optimizing for distraction minimization sounds like a really interesting framework to view engineering from.
johnnyanmac 16 days ago [-]
work with terrible management that can't imagine an effective version of this.
jph 17 days ago [-]
Small teams shouldn't split like this IMHO. It's better/smarter/faster IMHO to do "all hands on deck" to get things done.
For prioritization, use a triage queue because it aims the whole team at the most valuable work. This needs to be the mission-critical MVP & PMF work, rather than what the article describes as "event driven" customer requests i.e. interruptions.
dakshgupta 17 days ago [-]
A triage queue makes a lot of sense, only downside being the challenge of getting a lot done without interruption.
bvirb 17 days ago [-]
In a similar boat (small team, have to balance new stuff, maintenance, customer requests, bugs, etc).
We ended up with a system where we break work up into things that take about a day. If someone thinks something is going to take a long time then we try to break it down until some part of it can be done in about a day. So we kinda side-step the problem of having people able to focus on something for weeks by not letting anything take weeks. The same person will probably end up working on the smaller tasks, but they can more easily jump between things as priorities change, and pretty often after doing a few of the smaller tasks either more of us can jump in or we realize we don't actually need to do the rest of it.
It also helps keep PRs reasonably sized (if you do PRs).
Kinrany 16 days ago [-]
You're not addressing the issue of triage also being an interruption.
jph 16 days ago [-]
A triage queue can include how/when to do triage. As a specific example, set 60 minutes each Friday to sift through bug reports together. Small teams with good customers can reply honestly with "Thank you for your bug report. We're tracking it now at $URL. We expect to look at it after we've shipped $FEATURE. If we've misunderstood the urgency or severity please call us directly at $PHONE."
d4nt 16 days ago [-]
I think they’re on to something, but the solution needs more work. Sometimes it’s not just individual engineers who are playing defence, it’s whole departments or whole companies that are set up around “don’t change anything, you might break it”. Then the company creates special “labs” teams to innovate.
To borrow a football term, sometimes company structure seems like it’s playing the “long ball” game. Everyone sitting back in defence, then the occasional hail mary long pass up to the opposite end. I would love to see a more well developed understanding within companies that certain teams, and the processes that they have are defensive, others are attacking, and others are “mid field”, i.e. they’re responsible for developing the foundations on which an attacking team can operate (e.g. longer term refactors, API design, filling in gaps in features that were built to a deadline). To win a game you need a good proportion of defence, mid field and attack, and a good interface between those three groups.
svilen_dobrev 17 days ago [-]
IMO the split, although good (the pattern is "sacrifice one person" as per Coplien/Harrison's Organisational Patterns book [0]), is too drastic. It should not be defense vs offense 100% with a wall in between; rather, for each and every issue (defense) and/or feature (offense), someone has to pick it up and become the responsible person (which may or may not mean completely doing it by hirself). Fixing a bug for an hour or two has sometimes been exactly the break i needed in order to continue digging into some big feature when i feel stuck.
And the team should check the balances once in a while, and maybe rethink the strategy, to avoid overworking someone and underworking someone else, thus creating bottlenecks and vacuums.
At least this is the way i have worked and organised such teams - 2-5 ppl covering everything. Frankly, we never had many customers :/ but even one is enough to generate plenty of "noise" - which sometimes is just noise, but if it's a good customer, will be mostly real defects and generally under-tended parts. Also, good customers accept a NO as an answer. So, do say more NOs.. there is some psychological phenomenon in software engineering of saying yes and promising moonshots when one knows it cannot happen NOW, but it looks good..
Interesting concept. Certainly worth trying, but in the name of offense (read: being proactive):
- "and our “lean” team is more a product of our inability to identify and hire great engineers, rather than an insistence on superhuman efficiency."
Can we all at some point have a serious discussion on hiring and training? It seems that many teams are understaffed or at least not satisfied with the quality and quantity of their team. Why is that? Why does it seem to be the norm?
- what about mitigating bugs in the first place? Shouldn't someone be assigned to that? Yeah, sure, bugs are a given. They are going to happen. But bugs in production are something that real, paying customers shouldn't experience. At the very least, what about feature flags? That is, something new is introduced to a limited number of users. If there's a bug and it's significant enough, the flag is flipped and the new feature withdrawn. Then the bug can be sorted out when someone is available.
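A rough sketch of that flag-flip idea, assuming a hypothetical in-memory flag store (a real setup would keep this in a config service so flipping it doesn't require a redeploy):

    # Hypothetical flag store; the flag name and rollout numbers are illustrative.
    import zlib

    FLAGS = {"new_reporting_ui": {"enabled": True, "rollout_percent": 5}}

    def is_enabled(flag: str, user_id: int) -> bool:
        cfg = FLAGS.get(flag)
        if not cfg or not cfg["enabled"]:
            return False
        # Deterministic bucketing: a given user always lands in the same bucket.
        bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
        return bucket < cfg["rollout_percent"]

    # If the new feature turns out to be buggy enough, withdraw it by flipping:
    # FLAGS["new_reporting_ui"]["enabled"] = False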
Perhaps the profession just is what it is? Some teams are almost miraculously better than others? Maybe that's luck, individuals, product, and/or the stack? Maybe, like plumbers and shit, there are just things that engineering teams can't avoid? I'm not suggesting we surrender, but that we become more realistic about expectations.
philipwhiuk 16 days ago [-]
We have a person who is 'Batman' to triage production issues. Generally they'll pick up smaller sprint tasks. It rotates every week. It's still stuff from the team so they aren't doing stuff unknown (or if they are, it's likely they'll work on it soon).
The aim is generally not to provide a perfect fix but an MVP fix and raise tickets in the queue for regular planning.
It rotates round every week or so.
My company's not very devops so it's not on-call, but it's 'point of contact'.
ryukoposting 16 days ago [-]
I can't be the only one who finds the graphics at the top of this article off-putting. I find it hard to take someone seriously when they plaster GenAI slop across the top of their blog.
That said, there's some credence to what the author is describing. Although I haven't personally worked under the exact system described, I have worked in environments where engineers take turns being the first point of contact for support. In my experience, it worked pretty well. People know your bandwidth is going to be a bit shorter when you're on support, and so your tasks get dialed back a bit during that period.
I think the author, and several people in the comments, make the mistake of assuming that an "engineer on support" necessarily can fix any given problem they are approached with. Larger firms could allocate a complete cross-functional team of support engineers, but this is very costly for small outfits. If you have mobile apps, in-house hardware products and/or integrations with third-party hardware, it's basically guaranteed that your support engineer(s) will eventually be given a problem that they don't have the expertise to solve.
In that situation, the support engineer still has the competencies to figure out who does know how to fix the problem. So, the support engineer often acts more as a dispatcher than a singular fixer of bugs. Their impact is still positive, but more subtle than "they fix the bugs." The support engineer's deep system knowledge allows them to suss out important details before the bug is dispatched to the appropriate dev(s), thereby minimizing downtime for the folks who will actually implement the fix.
jwrallie 16 days ago [-]
I think interruptions damage productivity overall, not only for engineers. Maybe some are unaware of it, and others simply don’t care. They don’t want to sacrifice their own productivity by waiting on someone busy, so they interrupt, and after getting the information they want, they feel good. From their perspective, productivity increased, not decreased.
Some engineers are more likely to avoid interrupting others because they can sympathize.
smugglerFlynn 16 days ago [-]
Constantly working in what OP describes as defence might also be negatively affecting the perception of cause and effect of one's own actions:
Specifically, we show that individuals following clock-time [where tasks are organized based on a clock**] rather than event-time [where tasks are organized based on their order of completion] discriminate less between causally related and causally unrelated events, which in turn increases their belief that the world is controlled by chance or fate. In contrast, individuals following event-time (vs. clock-time) appear to believe that things happen more as a result of their own actions.[0]
** - in my experience, clock-based organisation seems to be very characteristic of what OP describes as defensive, when you become driven by incoming priorities and meetings
Broader article about impact of schedules at [1] is also highly relevant and worth the read.
By "constantly", do you mean for 2-4 weeks in a row?
smugglerFlynn 16 days ago [-]
I was thinking perpetually, which is not unusual for some of the tech companies and/or roles.
october8140 16 days ago [-]
My first job had a huge QA team. It was my job to work quickly and it was their job to find the issues. This actually set me up really poorly because I got in the habit of not doing proper QA.
There were at least 10 people doing it for me. When I left, it took a while for me to learn what properly QAing my own work looked like.
ntarora 16 days ago [-]
Our team ended up having the oncall engineer for the week also work primarily on bug squashing and anything that makes support easier. Over time the support and monitoring becomes better. Basically dedicated tech debt capacity, which has worked well for us.
marcinzm 16 days ago [-]
It feels like having 50% of your team’s time spent on urgent support, triage, and bugs is a lot. That seems like a much better thing to solve versus trying to work around the issue by splitting the team. Having those people fix bugs while a 4-week refactor sits in a secluded branch, constantly in process, probably doesn’t help with efficiency or bug rate either.
Kinrany 16 days ago [-]
It's a team of 4, so the only options are 25% and 50%.
But the fact that this explicit split makes the choice visible is clearly an upside.
JohnMakin 16 days ago [-]
This is a common "pattern" on well-run ops teams. The work of a typical ops team consists of a lot of new work, but tons of interruptions come in as new issues arise and must be dealt with. So we would typically assign 1 engineer (who was also typically on call) a lighter workload, and they would be responsible for triaging most issues that came in.
toolslive 16 days ago [-]
The proposed strategy will work, as will plenty of others, because it's a small team. That is the fundamental reason. Small teams are more efficient. So if you're managing a team of 10+ individuals: split them in 2 teams and keep them out of each other's way/harm.
ozim 16 days ago [-]
I like the approach as it is easy to explain and has catchy names.
But it sounds like there has to be a lot of micromanagement involved. When you have a team of 4 it is easy to keep up, but as soon as you go to 20 - and that increase also means many more customer requests - it will fall apart.
ndndjdjdn 16 days ago [-]
This is probably devops. A single team taking full responsibility and swapping oncall-type shifts. These guys know their dogfood.
You want the defensive team to work on automating away stuff that pays off for itself in the 1-4 week timeframe. If they get any slack to do so!
stronglikedan 17 days ago [-]
Everyone on every team should have something to "own" and feel proud of. You don't "own" anything if you're always on team defense. Following this advice is a sure fire way to have a high churn rate.
FireBeyond 17 days ago [-]
Yup, last place I was at I had engineers begging me (PM) to advocate against this, because leadership was all "We're going to form a SEAL team to blaze out [exciting, interesting, new, fun idea/s]. Another team will be on bug fixes."
My team had a bunch of stability work and bug fixes (and there were a lot of bugs and a lot of tech debt, and very little organizational enthusiasm to fix the latter).
Guess where their morale was, compared to some of the other teams?
000ooo000 16 days ago [-]
Splitting a team by interesting/uninteresting work is a comically bad idea. It's puzzling that it ever gets pitched, let alone adopted.
Edit: I mean an ongoing split, not a rotation
LatticeAnimal 17 days ago [-]
From the post:
> At the end of the cycle, we swap.
They swap teams every 2-4 weeks so nobody will always be on team defense.
ninininino 17 days ago [-]
You didn't read the article did you, they swap every 2 weeks between being on offense and defense.
bsimpson 16 days ago [-]
Ha - I think greptile was my first email address!
Reptile was my favorite Mortal Kombat character, and our ISP added a G before all the sub accounts. They put a P in front of my dad's.
eiathom 16 days ago [-]
And, what else?
Putting a couple of buzzwords on a practice being performed for at least 15 years now doesn't make you clever. Quite the opposite in fact.
Kinrany 16 days ago [-]
Do you have a name for this practice?
bradarner 17 days ago [-]
Don't do this to yourself.
There are 2 fundamental aspects of software engineering:
Get it right
Keep it right
You have only 4 engineers on your team. That is a tiny team. The entire team SHOULD be playing "offense" and "defense" because you are all responsible for getting it right and keeping it right. Part of the challenge sounds like poor engineering practices and shipping junk into production. That is NOT fixed by splitting your small team's cognitive load. If you have warts in your product, then all 4 of you should be aware of it, bothered by it and working to fix it.
Or, if it isn't slowing growth and core metrics, just ignore it.
You've got to be comfortable with painful imperfections early in a product's life.
Product scope is a prioritization activity, not a team organization question. In fact, splitting up your efforts will negatively impact your product scope because you are dividing your time and creating more slack than you would by moving as a small unit in sync.
You've got to get comfortable telling users: "that thing that annoys you, isn't valuable right now for the broader user base. We've got 3 other things that will create WAY MORE value for you and everyone else. So we're going to work on that first."
MattPalmer1086 17 days ago [-]
I have worked in a small team that did exactly this, and it works well.
It's just a support rota at the end of the day. Everyone does it, but not all the time, freeing you up to focus on more challenging things for a period without interruption.
This was an established business (although small), with some big customers, and responsive support was necessary. There was no way we could just say "that thing that annoys you, tough, we are working on something way more exciting." Maybe that works for startups.
bradarner 16 days ago [-]
Yes, very good point. I would argue that what I’m suggesting is particularly well suited to startups. It may be relevant to larger companies as well but I think the politics and risk profile of larger companies makes this nearly impossible to implement.
dakshgupta 17 days ago [-]
All of these are great points. I do want to add we rotate offense and defense every 2-3 weeks, and the act of doing defense which is usually customer facing gives that half of the team a ton of data to base the next move on.
bradarner 17 days ago [-]
The challenge is that you actually want your entire team to benefit from the feedback. The 4 of you are going to benefit IMMENSELY from directly experiencing every single pain point - together.
As developers we like to focus. But there is a vast difference between "manager time" and "builder time" and what you are experiencing.
You are creating immense value with every single customer interaction!
CUSTOMER FACING FIXES ARE NOT 'MANAGER TIME'!!!!!!
They are builder time!!!!
The only reason I'm insisting is because I've lived through it before and made every mistake in the book...it was painful scaling an engineering and product team to >200 people the first time I did it. I made so many mistakes. But at 4 people you are NOT yet facing any real scaling pain. You don't have the team size where you should be solving things with organizational techniques.
I would advise that you have a few columns in a kanban board: Now, Next, Later, Done & Rejected. And communicate it to customers. Pull up the board and say: "here is what we are working on." When you lay out the priorities to customers you'd be surprised how supportive they are, and if they aren't... tough luck.
Plus, 2-3 weeks feels like an eternity when you are on defense. You start to dread defense.
And, it also divorces the core business value into 2 separate outcomes rather than a single outcome. If a bug helps advance your customers to their outcome, then it isn't "defense" it is "offense". If it doesn't advance your customer, why are you doing it? If you succeed, all of your ugly, monkey patched code will be thrown away or phased out within a couple of years anyway.
FridgeSeal 16 days ago [-]
Whilst I very much agree with you, actually doing this properly and pulling this off requires PM’s and/or Account Managers who are willing and capable of _actually managing_ customers.
Many, many people I’ve dealt with in these roles don’t or can’t, and seem to think their sole task is to mainline customer needs into dev teams. The PM’s I’ve had who _actually_ do manage back properly had happier dev teams, and ultimately happier clients, it’s not a mystery, but for some reason it’s a rare skill.
bradarner 16 days ago [-]
Yes completely agree. This is hard for a PM to do.
I’m assuming that the OP is a founder and can actually make these calls.
dijksterhuis 16 days ago [-]
the reasons PM stuff is ‘hard’, in my admittedly limited experience, often seem to come down to
- saying No, and sticking to it when it matters — what you’ve mentioned.
- knowing how the product gets built — knowing *the why behind the no*.
PMs don’t usually have the technical understanding to do the second one. so the first one falls flat because why would someone stick to their guns when they do not understand why they need to say No, and keep saying No.
there are cases where talking to customer highlights a mistaken understanding in the *why we’re saying No*. those moments are gold because they’re challenging crucial assumptions. i love those moments. they’re basically higher level debugging.
but, again, without the technical understanding a PM can’t notice those moments.
they end up just filling up a massive backlog of everything because they don’t know how to filter wants vs. needs and stuff.
—
also i agree with a lot of what you’ve said in this chain of discussion.
get it right first time, then keep it right is so on point these days. especially for smaller teams. 90% of teams are not the next uber and don’t need to worry about massive growth spurts. most users don’t want the frontend changing every single day. they want stability.
worry about getting it right first. be like uber/google if you need to, when you need to.
johnrob 16 days ago [-]
I thought you made the rotation aspect quite clear. Everyone plays both roles and I’m sure when a bigger issue arises everyone becomes aware regardless.
Personally, I like this because as a dev I can set expectations accordingly. Either I plan for minimal disruption and get it, or I take the on-call side, which I’m fine with so long as I’m not asked to do anything else (the frustration is when you’re expected to build features while getting “stuck” fixing prod issues).
rkangel 16 days ago [-]
> You've got to get comfortable telling users: "that thing that annoys you, isn't valuable right now for the broader user base. We've got 3 other things that will create WAY MORE value for you and everyone else. So we're going to work on that first."
Yes, but you've got to spend time talking to users to say that. Many engineering teams have incoming "stuff". Depending on your context that might be bug reports from your customer base, or feature requests from clients etc. You don't want these queries (that take half an hour and are spread out over the week) to be repeatedly interrupting your engineering team, it's not great for getting stuff done and isn't great for getting timely helpful answers back to the people who asked.
There's a few approaches. This post describes one ("take it in turns"). In some organisations, QA is the first line of defence. In my team, I (as the lead) do as much of it as I can because that's valuable to keep the team productive.
ramesh31 17 days ago [-]
To add to this, ego is always a thing among developers. Your defensive players will inevitably end up resenting the offense for 1. leaving so many loose ends to pick up and 2. not getting the opportunity for greenfield themselves. You could try to "fix" that by rotating, but then you're losing context and headed down the road toward man-monthing.
CooCooCaCha 17 days ago [-]
Interesting that you describe it as ego. I don’t think a team shoveling shit onto your plate and disliking it is ego.
I feel similar things about the product and business side, it often feels like people are trying to pass their job off to you and if you push back then you’re the asshole. For example, sending us unfinished designs and requirements that haven’t been fully thought through.
I imagine this is exactly how splitting teams into offense and defense will go.
FridgeSeal 17 days ago [-]
> For example, sending us unfinished designs and requirements that haven’t been fully thought through
Oh man. Once had a founder who did this to the dev team: blurry, pixelated screenshots with 2 or 3 arrows and vague “do something like <massively under specified statement>”.
The team _requested_ that we have a bit more detail and clarity in the designs, because it was causing us significant slowdown and we were told “be quiet, stop complaining, it’s a ‘team effort’ so you’re just as at fault too”.
Unsurprisingly, morale was low and all the good people left quickly.
dakshgupta 17 days ago [-]
To add - I personally enjoy defense more because the quick dopamine loop of user requests a fix -> fix the issue -> tell the user -> user is delighted is pretty addictive. It does get old after a few weeks.
joshhart 17 days ago [-]
[flagged]
dang 16 days ago [-]
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
I personally found the idea inspiring, and the article itself explains it succinctly. Even if it's not completely revolutionary, it's a small, self-contained concept that's actionable.
Lowkey surprised there are so many harsh voices in this thread, but the article definitely has merit, even if it won't be useful/possible to implement for everyone.
thesandlord 17 days ago [-]
Don't do that! This was a great post with a lot to learn from.
The fact you came to a very similar solution from first principles is very interesting (assuming you didn't know about this before!)
stopachka 17 days ago [-]
I resonated with your post Daksh. Keep up the good work
candiddevmike 17 days ago [-]
Or the idea of an "interrupt handler". OP may find other SRE concepts insightful, like error budgets.
wombatpm 16 days ago [-]
Error budget or recovery cost tracking goes a long way towards defeating the "we never have time or money to do it right, but we’ll find time and money to fix it later" mindset.
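For what it's worth, the error-budget arithmetic is just the allowed failure fraction times the window; the 99.9% target and incident minutes below are only examples:

    # Illustrative error-budget math for a 30-day window.
    SLO = 0.999                    # availability target (example)
    WINDOW_MIN = 30 * 24 * 60      # 43,200 minutes in 30 days

    budget_min = (1 - SLO) * WINDOW_MIN   # 43.2 minutes of tolerated downtime
    spent_min = 12.5                      # downtime from incidents so far (example)

    remaining = budget_min - spent_min
    print(f"budget {budget_min:.1f} min, remaining {remaining:.1f} min")
    # When the remaining budget hits zero, feature work yields to reliability work.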
dakshgupta 16 days ago [-]
I’m generally a strong believer in “if it’s not measured it’s not managed” so this seems like it would be useful to explore. I suspect it’s tricky to assign a cost to a bug though.
wombatpm 13 days ago [-]
It’s tricky to assign impact: lost sales, revenue, customer satisfaction. Cost is just development and testing time.
shermantanktop 17 days ago [-]
[flagged]
dang 16 days ago [-]
Please don't do this here.
shermantanktop 15 days ago [-]
Fair enough, and understood - I’d delete it but see no link.
dang 15 days ago [-]
No need - we only care about things going forward. Thanks for the kind reply.
Roelven 16 days ago [-]
Getting so tired of the war metaphors in attempts to describe software development. We solve business problems using code, we don't make a living by role-playing military tactics. Chill out my dudes
madeofpalk 17 days ago [-]
Somewhat random side note - I find it so fascinating that developers invented this myth that they’re the only people who have ‘concentration’ when this is so obviously wrong. Ask any ‘knowledge worker’ or hell, even a physical labourer, and I’m sure they’ll tell you about the productivity of being “in the zone” and of a lack of interruptions. Back in the early 2010s they called it ‘flow’.
dakshgupta 16 days ago [-]
My theory is that to outsiders software development looks closer to other generic computer based desk jobs than to the job of a writer or physical builder, so to them it’s less obvious that programming needs “flow” too.
000ooo000 16 days ago [-]
The article doesn't say or suggest that. It says it applies to engineers.
Towaway69 16 days ago [-]
What's wrong with collaboratively working together? Why is there a need to create an artificial competition between an "offence" and a "defence" team?
And why would team members suddenly be collaborative within their own team? E.g. why should the "offence" team members suddenly help each other if that's not happening generally?
This sounds a lot like JDD - Jock Driven Development.
Perhaps the underlying problems of "don't touch it because we don't understand it" should be solved before engaging in fake competition to increase the stress levels.
megunderstood 16 days ago [-]
Sounds like you didn't read the article.
The idea has nothing to do with creating artificial competition and it is actually designed as a form of collaboration.
Some work requires concentration and the defensive team is there to maintain the conditions for this concentration, i.e. prevent the offensive team from getting interrupted.
Towaway69 16 days ago [-]
Ok, that might well be the case! Many apologies for my mistaken assumptions.
Then perhaps the terminology - for me - has a different meaning.
Kinrany 16 days ago [-]
It's a common and pervasive mistake to assume the meaning of a term by association with the words used in the term. Names of terms are at best mnemonics.
namenotrequired 16 days ago [-]
Many are complaining that this way the engineers are incentivised to carelessly create bugs because they have to ship fast and won’t be responsible for fixing them.
That’s easy to fix with an exception: you won’t have to worry about support for X time unless you’re the one who recently made the bug.
It turns out that once they’re responsible for their bugs, there aren’t actually that many bugs, and so interruptions to a focused engineer are rare.
That's how we do it in my startup. We have six engineers, most are even pretty junior. Only one will be responsible for support in any given sprint and often he’ll have time left over to work on other things e.g. updating dependencies.
Rendered at 10:33:12 GMT+0000 (Coordinated Universal Time) with Vercel.
There's simply no substitute for Kanban processes and for proactive communication from engineers. In a small team without dedicated customer support, a manager takes the customer call, decides whether it's legitimately a bug, creates a ticket to track it and prioritizes it in the Kanban queue. An engineer takes the ticket, fixes it, ships it, communicates that they shipped something to the rest of their team, is responsible for monitoring it in production afterwards, and only takes a new ticket from the queue when they're satisfied that the change is working. But the proactive communication is key: other engineers on the team are also shipping, and everyone needs to understand what production looks like. Management is responsible for balancing support and feature tasks by balancing the priority of tasks in the Kanban queue.
Solution: don’t. If a bug has been introduced by the currently running long process, forward it back. This is not distracting, this is very much on topic.
And if a bug is discovered after the cycle ends - then the teams swap anyway and the person who introduced the issue can still work on the fix.
Additionally, defensive devs have brutal SLAs, and are frequently touching code with no prior exposure to the domain.
They got known as "platform vandals" by the feature teams, & we eventually put an end to the separation.
Good management means finding the right balance for the team, product, and business context that you have, rather than inflexibly trying to force one strategy to work because it’s supposedly the best.
In my experience whenever that happens someone always finds an "oh @#$&" case where a bug is actually far more serious than everyone thought.
It is an approach that's less productive than slowing down and delivering quality, but it's also completely inevitable once a team/company grows to a sufficient size.
Small, in-person, high-trust teams have the advantage of not falling into bad offense habits.
Additionally, a slower shipping pace simply isn’t an option, seeing as the only advantage we have over our giant competitors is speed.
Wouldn't they be incentivized to maintain discipline because they will be the defensive engineers next week when their own code breaks?
Tell that to seemingly every engineering manager and product manager coming online over the last 8-10 years.
I first noticed in 2016 there seemed to be a direct correlation between more private equity and MBA's getting into the field and the decline of software quality.
So now you have a generation of managers (and really executives) who know little of the true tradeoffs between quality and quantity because they only ever saw success pushing code as fast as possible regardless of its quality and dealing with the aftermath. This lead them to promotions, new jobs etc.
We did this to ourselves really, by not becoming managers and executives ourselves as engineers.
The thing I found most surprising about this article was this phrasing:
> We instruct half the team (2 engineers) at a given point to work on long-running tasks in 2-4 week blocks. This could be refactors, big features, etc. During this time, they don’t have to deal with any support tickets or bugs. Their only job is to focus on getting their big PR out.
This suggests that this pair of people only release 1 big PR for that whole cycle - if that's the case this is an extremely late integration and I think you'd benefit from adopting a much more continuous integration and deployment process.
I think that's a too-literal reading of the text.
The way I took it, it was meant to be more of a generalization.
Yes, sometimes it really does take weeks before one can get an initial PR out on a feature, especially when working on something that is new and complex, and especially if it requires some upfront system design and/or requirements gathering.
But other times, surely, one also has the ability to pump out small PRs on a more continuous basis, when the work is more straightforward. I don't think the two possibilities are mutually exclusive.
The important part is that you're not interrupted during your large-scale tasks, not the absolute length of those tasks.
I don't think it suggests how the time block translates into PRs. It could very well be a series of PRs.
In any case, the nature of the product / features / refactorings usually dictates the minimum size of a PR.
Why not split the big tickets into smaller tickets which are delivered individually? There are cases where you literally can't, but in my experience those are the minority, or at least should be, assuming a decently designed system.
Because it is already the smallest increment you can make. Or because splitting it further would add a lot of overhead.
> There's cases where you literally can't but in my experience those are the minority
I think in this sentence, there's a hidden assumption that most projects look like your project(s). That's likely false.
You left out the part of that quote where I explained my assumption very clearly: A decently designed system.
In my experience if you cannot split tasks into <1 week the vast majority of the time then your code has massive land mines in it. The design may be too inter-connected, too many assumptions baked too deeply, not enough tests, or various other issues. You should address those landmines before you step on them rather than perpetually trying to walk around them. Then splitting projects down becomes much much easier.
That's one possible reason. Sometimes software is designed badly from the ground up, sometimes it accumulates a lot of accidental complexity over years or decades. Solving that problem is usually out of your control in those cases, and only sometimes there's a business driver to fix it.
But there are many other cases. You have software with millions of lines of code, decades of commit history. Even if the design is reasonable, there will be a significant amount of both accidental and essential complexity - from certain size/age you simply won't find any pristine, perfectly clean project. Implementing a relatively simple feature might mean you will need to learn the interacting features you've never dealt with so far, study documentation, talk to people you've never met (no one has a complete understanding either). Your acceptance testing suite runs for 10 hours on a cluster of machines, and you might need several iterations to get them right. You have projects where the trade-off between velocity and tolerance for risk is different from yours, and the processes designed around it are more strict and formal than you're used to.
A lot of the time it is true that ticket == PR, but it is not a law.
It sometimes makes sense to separate subtasks under a ticket, but only if it makes sense in the business context.
That's only late if there are other big changes going in at the same time. The vast majority of operational/ticketing issues have few code changes.
I'm glad I had the experience of working on a literal waterfall software project in my life (e.g. plan out the next 2 years first, then we "execute" according to a very detailed plan that entire time). Huge patches were common in this workflow, and only caused chaos when many people were working in the same directory/area. Otherwise it was usually easier on testing/integration - only 1 patch to test.
> The result is that our product breaks more often than we’d like. The core functionality may remain largely intact but the periphery is often buggy, something we expect will improve only as our engineering headcount catches up to our product scope.
I really resonate with this problem. It was fun to read. We've tried different methods to balance customers and long-term projects too.
Some more ideas that can be useful:
* Make quality projects an explicit monthly goal.
For example, when we noticed the edges of our surface area got too buggy, we started a 'Make X great' goal for the month. This way you don't only react to users reporting bugs, but can be proactive.
* Reduce Scope
Sometimes it can help to reduce scope; for example, before adding a new 'nice to have feature', focus on making the core experience really great. We also considered pausing larger enterprise contracts, mainly because it would take away from the core experience.
---
All this to say, I like your approach; I would also consider a few others (make quality projects a goal, and cut scope)
What are some proactive ways? Ideally that cannot easily be gamed?
I suppose test coverage and such things, and an internal QA team. What I thought the article was about (before having read it) was having half of the developers do red team penetration testing, or look for UX bugs, in things the other half had written.
Any more ideas? Do you have any internal definitions of "a quality project"?
This is akin to having a boat that isn't seaworthy, so the suggestion is to have a rowing team and a bucket team. One rows, and the other scoops the water out. While missing the actual issue at hand. Instead, focus on creating a better boat. In this case, that would mean investing in testing: unit tests, integration tests, and QA tests.
Have staff engineers guide the teams and make reducing incidents their KPI. Increase quality and reduce the bugs, and there will be fewer outages and issues.
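To make the "invest in testing" point concrete, here is a minimal sketch of the kind of cheap unit tests being suggested, assuming pytest; calculate_invoice_total is a made-up stand-in for whatever core logic keeps breaking, not anything from the article.

    # Minimal pytest sketch; calculate_invoice_total is hypothetical.
    import pytest

    def calculate_invoice_total(line_items, tax_rate):
        """Sum line-item prices and apply a flat tax rate."""
        if any(item["quantity"] < 0 for item in line_items):
            raise ValueError("quantities must be non-negative")
        subtotal = sum(item["price"] * item["quantity"] for item in line_items)
        return round(subtotal * (1 + tax_rate), 2)

    def test_empty_invoice_is_zero():
        assert calculate_invoice_total([], tax_rate=0.2) == 0

    def test_tax_is_applied_to_subtotal():
        items = [{"price": 10.0, "quantity": 2}]
        assert calculate_invoice_total(items, tax_rate=0.1) == 22.0

    def test_negative_quantities_are_rejected():
        # Edge cases like this are where "periphery" bugs tend to hide.
        with pytest.raises(ValueError):
            calculate_invoice_total([{"price": 10.0, "quantity": -1}], tax_rate=0.1)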
Agreed - this is a survival mode tactic in every company I've been at when it's happened. If you're permanently in the described mode and you're small, you might as well be dead.
If mid to large and temporary, this might be acceptable to right the ship.
Even when they rotate - who wants to clock in to wade through a fresh swamp they've never seen? Don't make the swamp: if shipping things without sinking half the ship on each PR means you're moving too slowly, then raise your budget for better engineers - they exist.
This premise is like advocating for tech debt loan sharks; I really hope TFA was ironic. Sure, it makes sense from a business perspective as a last gasp to sneakily sell off your failed company but you would never blog "hey here at LLM-4-YOU, Inc. we're sinking".
Sometimes as an engineer I like frantically scooping water while we try to scale rapidly, because it means leadership's vision is to get an exit for everyone as fast as possible. If leadership said "let's take a step back and spend 3 months stabilizing everything and creating a testing/QA framework," I would know they want to ride it out til the end.
It shouldn't have ever come to the point where incidents, outages, and bugs have become prominent enough to warrant a team.
Either have kickass dev(s), although that's improbable, or fall back to the second level: implement mitigations, focus on testing, and have staff engineers with KPIs to lower incidents. Give them the space, but be prepared to let them go if incidents don't go down.
There is no stopping of development. Refactoring by itself doesn't guarantee better code or fewer incidents. But don't allow bugs, or known issues, as they can be death by thousand cuts.
This is the viewpoint of a non-engineer: having constant incidents doesn't show confidence or competence to investors and customers, and it diverts attention from creating business value into firefighting, which has zero business value and is bad for morale.
Thus, tech investment rather than debt always pays off if implemented right.
How does one get to that state?
* avg tenure / skill level of team is relatively uniform
* team is small with high-touch comms (eg: same/near timezone)
* most importantly - everyone feels accountable and has agency for work others do (eg: codebase is small, relatively simple, etc)
Where I would expect to see this fall apart is when these assumptions drift and holding accountability becomes harder. When folks start to specialize, something becomes complex, or work quality is sacrificed for short-term deliverables, the folks that feel the pain are the defense folks, and they don't have the agency to drive the improvements.
The incentives for folks on defense are completely different than folks on offense, which can make conversations about what to prioritize difficult in the long term.
This is basically how we ran things for the reliability team at Netflix. One person was on call for a week at a time. They had to deal with tickets and issues. Everyone else was on backup and only called for a big issue.
The week after you were on call was spent following up on incidents and remediation. But the remaining weeks were for deep work, building new reliability tools.
The tools that allowed us to be resilient enough that being on call for one week straight didn't kill you. :)
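For illustration, a toy sketch of the rotation described above (one primary per week, the previous week's primary on follow-up, everyone else on deep work); the names and data shapes are made up, not any team's real tooling.

    # Toy rotation sketch: week k's primary is engineer k mod N, last week's
    # primary handles incident follow-up, and everyone else does deep work.
    def rotation_for_week(engineers, week):
        primary = engineers[week % len(engineers)]
        follow_up = engineers[(week - 1) % len(engineers)]
        deep_work = [e for e in engineers if e not in (primary, follow_up)]
        return {"on_call": primary, "follow_up": follow_up, "deep_work": deep_work}

    team = ["alice", "bob", "carol", "dave"]
    for week in range(4):
        print(week, rotation_for_week(team, week))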
On call for a week at a time only really works if you only get paged at night once a week max. If you get paged every night, you will die from sleep deprivation.
You need trust in your team to make this work but you also need trust in your team to make any high velocity system work. Personally, I find the ideas here extremely compelling and optimizing for distraction minimization sounds like a really interesting framework to view engineering from.
For prioritization, use a triage queue because it aims the whole team at the most valuable work. This needs to be the mission-critical MVP & PMF work, rather than what the article describes as "event driven" customer requests i.e. interruptions.
We ended up with a system where we break work up into things that take about a day. If someone thinks something is going to take a long time then we try to break it down until some part of it can be done in about a day. So we kinda side-step the problem of having people able to focus on something for weeks by not letting anything take weeks. The same person will probably end up working on the smaller tasks, but they can more easily jump between things as priorities change, and pretty often after doing a few of the smaller tasks either more of us can jump in or we realize we don't actually need to do the rest of it.
It also helps keep PRs reasonably sized (if you do PRs).
To borrow a football term, sometimes company structure seems like it’s playing the “long ball” game. Everyone sitting back in defence, then the occasional hail mary long pass up to the opposite end. I would love to see a more well developed understanding within companies that certain teams, and the processes that they have are defensive, others are attacking, and others are “mid field”, i.e. they’re responsible for developing the foundations on which an attacking team can operate (e.g. longer term refactors, API design, filling in gaps in features that were built to a deadline). To win a game you need a good proportion of defence, mid field and attack, and a good interface between those three groups.
And the team should check the balances once in a while, and maybe rethink the strategy, to avoid overworking someone and underworking someone else, thus creating bottlenecks and vacuums.
At least this is the way I have worked and organised such teams - 2-5 people covering everything. Frankly, we never had many customers :/ but even one is enough to generate plenty of "noise" - which sometimes is just noise, but from a good customer will mostly be real defects and generally under-tended parts. Also, good customers accept a NO as an answer. So, do say more NOs.. there is a psychological phenomenon in software engineering of saying yes and promising moonshots when one knows it cannot happen NOW, but it looks good..
have fun!
[0] https://svilendobrev.com/rabota/orgpat/OrgPatterns-patlets.h...
- "and our “lean” team is more a product of our inability to identify and hire great engineers, rather than an insistence on superhuman efficiency."
Can we all at some point have a serious discussion on hiring and training? It seems that many teams are understaffed, or at least not satisfied with the quality and quantity of their team. Why is that? Why does it seem to be the norm?
- what about mitigating bugs in the first place? Shouldn't someone be assigned to that? Yeah, sure, bugs are a given. They are going to happen. But bugs in production are something real, paying customers shouldn't experience. At the very least, what about feature flags? That is, something new is introduced to a limited number of users. If there's a bug and it's significant enough, the flag is flipped and the new feature is withdrawn. Then the bug can be sorted out when someone is available. (A rough sketch of what that could look like is below.)
Perhaps the profession just is what it is? Some teams are almost miraculously better than others? Maybe that's luck, individuals, product, and/or the stack? Maybe, like plumbers and shit, there are just things that engineering teams can't avoid? I'm not suggesting we surrender, but that we become more realistic about expectations.
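A minimal sketch of the feature-flag idea above, with made-up flag names and an in-memory flag store standing in for whatever config or flag service a real team would use:

    import hashlib

    # In-memory flag store for the sketch; a real setup would read this from
    # config or a flag service so support can flip it without a redeploy.
    FLAGS = {
        "new_billing_page": {"enabled": True, "rollout_percent": 5},
    }

    def is_enabled(flag_name, user_id):
        flag = FLAGS.get(flag_name)
        if not flag or not flag["enabled"]:
            return False
        # Hash the user id so each user gets a stable, deterministic bucket.
        digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < flag["rollout_percent"]

    # Callers branch on the flag; if the new path turns out to be buggy,
    # setting enabled=False withdraws it for everyone without a redeploy.
    for uid in ["u1", "u2", "u3"]:
        page = "new" if is_enabled("new_billing_page", uid) else "old"
        print(uid, "sees the", page, "billing page")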
The aim is generally not to provide a perfect fix but an MVP fix and raise tickets in the queue for regular planning.
It rotates round every week or so.
My company's not very devops so it's not on-call, but it's 'point of contact'.
That said, there's some credence to what the author is describing. Although I haven't personally worked under the exact system described, I have worked in environments where engineers take turns being the first point of contact for support. In my experience, it worked pretty well. People know your bandwidth is going to be a bit shorter when you're on support, and so your tasks get dialed back a bit during that period.
I think the author, and several people in the comments, make the mistake of assuming that an "engineer on support" necessarily can fix any given problem they are approached with. Larger firms could allocate a complete cross-functional team of support engineers, but this is very costly for small outfits. If you have mobile apps, in-house hardware products and/or integrations with third-party hardware, it's basically guaranteed that your support engineer(s) will eventually be given a problem that they don't have the expertise to solve.
In that situation, the support engineer still has the competencies to figure out who does know how to fix the problem. So, the support engineer often acts more as a dispatcher than a singular fixer of bugs. Their impact is still positive, but more subtle than "they fix the bugs." The support engineer's deep system knowledge allows them to suss out important details before the bug is dispatched to the appropriate dev(s), thereby minimizing downtime for the folks who will actually implement the fix.
Some engineers are more likely to avoid interrupting others because they can sympathize.
Broader article about impact of schedules at [1] is also highly relevant and worth the read.
But the fact that this explicit split makes the choice visible is clearly an upside.
But it sounds like there has to be a lot of micromanagement involved. When you have a team of 4 it is easy to keep up, but as soon as you go to 20 (and that increase also means many more customer requests) it will fall apart.
You want the defensive team to work on automating away stuff that pays off for itself in the 1-4 week timeframe. If they get any slack to do so!
My team had a bunch of stability work and bug fixes (and there were a lot of bugs and a lot of tech debt, and very little organizational enthusiasm to fix the latter).
Guess where their morale was, compared to some of the other teams?
Edit: I mean an ongoing split, not a rotation
> At the end of the cycle, we swap.
They swap teams every 2-4 weeks so nobody will always be on team defense.
Reptile was my favorite Mortal Kombat character, and our ISP added a G before all the sub accounts. They put a P in front of my dad's.
Putting a couple of buzzwords on a practice being performed for at least 15 years now doesn't make you clever. Quite the opposite in fact.
There are 2 fundamental aspects of software engineering:
Get it right
Keep it right
You have only 4 engineers on your team. That is a tiny team. The entire team SHOULD be playing "offense" and "defense" because you are all responsible for getting it right and keeping it right. Part of the challenge sounds like poor engineering practices and shipping junk into production. That is NOT fixed by splitting your small team's cognitive load. If you have warts in your product, then all 4 of you should be aware of it, bothered by it and working to fix it.
Or, if it isn't slowing growth and core metrics, just ignore it.
You've got to be comfortable with painful imperfections early in a product's life.
Product scope is a prioritization activity, not a team organization question. In fact, splitting up your efforts will negatively impact your product scope, because you are dividing your time and creating more slack than by moving as a small unit in sync.
You've got to get comfortable telling users: "that thing that annoys you, isn't valuable right now for the broader user base. We've got 3 other things that will create WAY MORE value for you and everyone else. So we're going to work on that first."
It's just a support rota at the end of the day. Everyone does it, but not all the time, freeing you up to focus on more challenging things for a period without interruption.
This was an established business (although small), with some big customers, and responsive support was necessary. There was no way we could just say "that thing that annoys you, tough, we are working on something way more exciting." Maybe that works for startups.
As developers we like to focus. But there is a vast difference between "manager time" and "builder time" and what you are experiencing.
You are creating immense value with every single customer interaction!
CUSTOMER FACING FIXES ARE NOT 'MANAGER TIME'!!!!!!
They are builder time!!!!
The only reason I'm insisting is because I've lived through it before and made every mistake in the book...it was painful scaling an engineering and product team to >200 people the first time I did it. I made so many mistakes. But at 4 people you are NOT yet facing any real scaling pain. You don't have the team size where you should be solving things with organizational techniques.
I would advise that you have a few columns in a kanban board: Now, Next, Later, Done & Rejected. And communicate it to customers. Pull up the board and say: "here is what we are working on." When you lay out the priorities to customers you'd be surprised how supportive they are, and if they aren't...tough luck.
Plus, 2-3 weeks feels like an eternity when you are on defense. You start to dread defense.
And, it also divorces the core business value into 2 separate outcomes rather than a single outcome. If a bug helps advance your customers to their outcome, then it isn't "defense" it is "offense". If it doesn't advance your customer, why are you doing it? If you succeed, all of your ugly, monkey patched code will be thrown away or phased out within a couple of years anyway.
Many, many people I've dealt with in these roles don't or can't, and seem to think their sole task is to mainline customer needs into dev teams. The PMs I've had who _actually_ do manage back properly had happier dev teams, and ultimately happier clients. It's not a mystery, but for some reason it's a rare skill.
I’m assuming that the OP is a founder and can actually make these calls.
- saying No, and sticking to it when it matters — what you’ve mentioned.
- knowing how the product gets built — knowing *the why behind the no*.
PMs don’t usually have the technical understanding to do the second one. so the first one falls flat because why would someone stick to their guns when they do not understand why they need to say No, and keep saying No.
there are cases where talking to customer highlights a mistaken understanding in the *why we’re saying No*. those moments are gold because they’re challenging crucial assumptions. i love those moments. they’re basically higher level debugging.
but, again, without the technical understanding a PM can’t notice those moments.
they end up just filling up a massive backlog of everything because they don’t know how to filter wants vs. needs and stuff.
— also i agree with a lot of what you’ve said in this chain of discussion.
get it right first time, then keep it right is so on point these days. especially for smaller teams. 90% of teams are not the next uber and don’t need to worry about massive growth spurts. most users don’t want the frontend changing every single day. they want stability.
worry about getting it right first. be like uber/google if you need to, when you need to.
Yes, but you've got to spend time talking to users to say that. Many engineering teams have incoming "stuff". Depending on your context that might be bug reports from your customer base, or feature requests from clients etc. You don't want these queries (that take half an hour and are spread out over the week) to be repeatedly interrupting your engineering team, it's not great for getting stuff done and isn't great for getting timely helpful answers back to the people who asked.
There's a few approaches. This post describes one ("take it in turns"). In some organisations, QA is the first line of defence. In my team, I (as the lead) do as much of it as I can because that's valuable to keep the team productive.
I feel similar things about the product and business side, it often feels like people are trying to pass their job off to you and if you push back then you’re the asshole. For example, sending us unfinished designs and requirements that haven’t been fully thought through.
I imagine this is exactly how splitting teams into offense and defense will go.
Oh man. Once had a founder who did this to the dev team: blurry, pixelated screenshots with 2 or 3 arrows and vague “do something like <massively under specified statement>”.
The team _requested_ that we have a bit more detail and clarity in the designs, because it was causing us significant slowdown and we were told “be quiet, stop complaining, it’s a ‘team effort’ so you’re just as at fault too”.
Unsurprisingly, morale was low and all the good people left quickly.
I personally found the idea inspiring, and the article itself explains it succinctly. Even if it's not completely revolutionary, it's a small, self-contained concept that's actionable.
Lowkey surprised there are so many harsh voices in this thread, but the article definitely has merit, even if it won't be useful/possible to implement for everyone.
The fact you came to a very similar solution from first principles is very interesting (assuming you didn't know about this before!)
And why should team members be collaborative amongst their team? E.g. why should the "offence" team members suddenly help each other if it's not happening generally?
This sounds a lot like JDD - Jock Driven Development.
Perhaps the underlying problems of "don't touch it because we don't understand it" should be solved before engaging in fake competition to increase the stress levels.
The idea has nothing to do with creating artificial competition and it is actually designed as a form of collaboration.
Some work requires concentration and the defensive team is there to maintain the conditions for this concentration, i.e. prevent the offensive team from getting interrupted.
Then perhaps the terminology - for me - has a different meaning.
That’s easy to fix with an exception: you won’t have to worry about support for X time unless you’re the one who recently made the bug.
It turns out that once they’re responsible for their bugs, there won’t actually be that many bugs and so interruptions to a focused engineer will be rare.
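A toy sketch of that exception, with made-up data shapes: bugs traced to a recent change go back to its author, and everything else goes to whoever is on support.

    from datetime import datetime, timedelta, timezone

    RECENT = timedelta(days=14)  # arbitrary "recently made the bug" window for the sketch

    def assign_ticket(ticket, recent_deploys, support_engineer, now=None):
        """Route a bug back to the author of a recent related change, else to support."""
        now = now or datetime.now(timezone.utc)
        for deploy in recent_deploys:
            same_area = deploy["feature"] == ticket.get("suspected_feature")
            if same_area and now - deploy["shipped_at"] < RECENT:
                return deploy["author"]  # whoever shipped the change handles its fallout
        return support_engineer          # everything else goes to the support rotation

    deploys = [{"feature": "billing", "author": "carol",
                "shipped_at": datetime.now(timezone.utc) - timedelta(days=3)}]
    print(assign_ticket({"suspected_feature": "billing"}, deploys, "dave"))  # -> carol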
That's how we do it in my startup. We have six engineers, most of them pretty junior. Only one will be responsible for support in any given sprint, and often he'll have time left over to work on other things, e.g. updating dependencies.