npm Blog (Archive)

The npm blog has been discontinued.

Updates from the npm team are now published on the GitHub Blog and the GitHub Changelog.

npm On-Call

[Image: Teacup the wombat]

This is Teacup, our adopted wombat and latest on-call engineer.

Just like everyone, Teacup has to take responsibility for what we ship to production. It won’t be smooth sailing on-boarding her: there are the late-night alerts, the fact she doesn’t even own a computer, and just how difficult it is to acknowledge PagerDuty alerts with paws.

Okay, okay, maybe I’ve stretched the truth a teensy bit.

Not everyone is on-call — just most software engineers and some managers.

We also don’t force an adopted wombat to answer alerts, although Teacup is real, utterly adorable, and would rock on-call if she wanted to.

Phew!

So who is on-call and what does that look like at npm?

Je Suis Wombats

When I say ‘most engineers’ are on-call, I mean:

In the past we have also been joined by members of the security team. I’ll cover how other teams are involved in the on-call process later.

Hi, I’m Wes

I’m an SRE wombat, which means I spend my workdays improving and maintaining the reliability of the npm registry.

That might sound like I’m permanently on-call, but really it’s more about preventing incidents and supporting infrastructure changes, such as changes required for product updates, new features, security updates, and whatnot.

Wombat’s Day On-Call

If you reside in a time zone between the Pacific and Atlantic oceans, then an AM shift is just that: a morning.

Yes, there is also a slightly more disruptive PM shift, although for the aforementioned time zones it only runs to around 9-11pm, and a contracted third party takes over low-priority and easy-to-triage pages overnight.

However, some wombats, like myself, are in other time zones, like the UK, where the ‘AM shift’ actually runs from 2pm to 10pm.

If you’re on primary, you are the first to be notified when an automated alert triggers.

On receiving an alert, you acknowledge it (so it doesn’t fall through to the secondary wombat on-call after a while), and get to a computer if you’re not at one already.
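For the curious, the acknowledge-or-fall-through behaviour can be sketched in a few lines. This is only an illustrative model, not how PagerDuty or npm actually implement it: the `Alert` class, the `current_responder` function, and the 15-minute timeout are all hypothetical (the real delay is just "a while").

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value: the post only says an alert falls through "after a while".
ESCALATION_TIMEOUT_MINUTES = 15


@dataclass
class Alert:
    """A page sent to the on-call rotation (illustrative model only)."""
    summary: str
    triggered_at_minute: int
    acknowledged_at_minute: Optional[int] = None

    def acknowledge(self, minute: int) -> None:
        # Only the first acknowledgement is recorded.
        if self.acknowledged_at_minute is None:
            self.acknowledged_at_minute = minute


def current_responder(alert: Alert, now_minute: int) -> str:
    """Return which wombat is responsible for the alert at `now_minute`."""
    deadline = alert.triggered_at_minute + ESCALATION_TIMEOUT_MINUTES
    acked_in_time = (
        alert.acknowledged_at_minute is not None
        and alert.acknowledged_at_minute <= deadline
    )
    # The primary keeps the alert if they acknowledged it before the
    # deadline, or if the deadline has not passed yet.
    if acked_in_time or now_minute < deadline:
        return "primary"
    # Otherwise the alert has fallen through to the secondary.
    return "secondary"
```

In this sketch an unacknowledged alert belongs to the primary until the timeout passes and to the secondary afterwards, while an alert acknowledged in time stays with the primary however long the incident runs.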

Next:

  1. A specific Slack channel is used.
  2. The primary on-call wombat takes point (aka incident manager).
  3. Everyone involved joins a Zoom video call to communicate with as much ‘bandwidth’ as possible.
  4. The marketing team follows their own IR process designed to ensure updates to users are easy to understand, compassionate about user frustrations, and timely.
  5. The secondary on-call wombat takes point publishing any required status updates to our status page.

None of this is set in stone, and one of the reasons I have found npm’s on-call culture to be one of the nicest I’ve been a part of is precisely because everyone is prepared to adapt and go with the flow in an emergency (and out).

Wombats have a high level of empathy, compassion, and respect for each other, so when someone needs to skip being on-call because some life event was just sprung on them, or needs to swap for a holiday, we’re there in the same way as when things are on fire.

Be excellent to each other, and make sure Teacup takes the following day off after her out-of-hours call, ’kay?