npm Blog (Archive)

The npm blog has been discontinued.

Updates from the npm team are now published on the GitHub Blog and the GitHub Changelog.

Four hours of partial outage, 2014-02-24

From 2014-02-24 12:00 UTC (4:00 AM US/Pacific) until about 16:00 UTC (8:00 AM US/Pacific), package tarball downloads were returning 503 response codes.

This affected users in the EU and the eastern US most, but only because of the time of day at which it occurred. There was nothing location-specific about the outage.

The root cause was a bug in HAProxy that causes it to leak file descriptors, resulting in slow or unresponsive connections. As a result, Joyent’s Manta system failed to respond to Fastly’s requests in a timely fashion, and Fastly returned a 503 or 500 error to report that the backend was unavailable.

[Figure: error graph]
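
For readers who want to watch for this kind of file descriptor leak on their own hosts, here is a minimal sketch in Python. It is not what Joyent or npm ran. It assumes a system that exposes `/proc/<pid>/fd`, and the function names and the 80% threshold are made up for illustration. The count is approximate, since listing the directory briefly opens a descriptor of its own.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Approximate number of open file descriptors, read from /proc (assumed available)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def near_fd_limit(open_count, soft_limit, threshold=0.8):
    """Return True when open descriptors reach `threshold` of the soft limit.

    The 0.8 default is an arbitrary illustrative value, not a recommendation.
    """
    return open_count >= threshold * soft_limit


if __name__ == "__main__":
    # Soft limit of the current process; only meaningful for pid "self".
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    count = open_fd_count()
    print(f"{count} descriptors open (soft limit: {soft_limit})")
    if soft_limit != resource.RLIM_INFINITY and near_fd_limit(count, soft_limit):
        print("warning: descriptor usage is close to the limit")
```

A leak shows up as a count that keeps climbing between samples, so in practice the trend over time matters more than any single reading.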

First, we increased the error timeouts in our Fastly configuration. This worked around the immediate problem and ended the outage.

Second, we were already in the process of mirroring our data in Manta to a separate system in a different datacenter, to remove that single point of failure. Manta is lovely, and we will continue to use it for many things, but any single point of failure needs to be eliminated.

Third, this outage lasted so long because we were monitoring the hosts directly, but not monitoring our Fastly logs for errors. We are setting up monitoring on those logs today, so that a flurry of 5xx response codes will wake us up, even if the hosts appear to be functioning.
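
To make "a flurry of 5xx response codes" concrete, here is a rough sketch in Python of a sliding-window counter over response statuses. It is only an illustration, not our actual monitoring. The class name, the window length, and the threshold are all hypothetical, and it assumes that a timestamp and status code have already been parsed out of each log line and arrive in time order.

```python
from collections import deque


class ErrorRateMonitor:
    """Flags when too many 5xx responses land inside a sliding time window."""

    def __init__(self, window_seconds=60, max_errors=50):
        # Both defaults are illustrative values, not tuned numbers.
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self._error_times = deque()

    def record(self, timestamp, status):
        """Record one response; return True if the alert threshold is exceeded.

        Assumes timestamps (in seconds) are non-decreasing.
        """
        if 500 <= status <= 599:
            self._error_times.append(timestamp)
        # Drop errors that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._error_times and self._error_times[0] <= cutoff:
            self._error_times.popleft()
        return len(self._error_times) > self.max_errors
```

The point of counting responses rather than pinging hosts is that it measures what users actually see: the hosts can look healthy while the CDN is serving errors.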


Update: We have Nagios alerts set up on Fastly’s data feed, so any increase in errors will indeed alert whoever is on pager duty.
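
For the curious, a Nagios check is just a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below, in Python, shows the general shape of such a check for a 5xx error rate. It is not our actual plugin: the function names and the thresholds are invented, and it assumes the error and request counts have already been extracted from the data feed and are passed in as arguments.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_error_rate(errors, total, warn=0.01, crit=0.05):
    """Map a 5xx error rate to a Nagios status code and message.

    The warn/crit thresholds are hypothetical illustrative values.
    """
    if total <= 0:
        return UNKNOWN, "UNKNOWN - no requests in sample"
    rate = errors / total
    if rate >= crit:
        return CRITICAL, f"CRITICAL - 5xx rate {rate:.1%}"
    if rate >= warn:
        return WARNING, f"WARNING - 5xx rate {rate:.1%}"
    return OK, f"OK - 5xx rate {rate:.1%}"


def main(argv):
    """Expects two arguments: the error count and the total request count."""
    try:
        errors, total = int(argv[0]), int(argv[1])
    except (IndexError, ValueError):
        print("UNKNOWN - usage: check_5xx ERRORS TOTAL")
        return UNKNOWN
    code, message = check_error_rate(errors, total)
    print(message)
    return code


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        sys.exit(main(sys.argv[1:]))
```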

Joyent has confirmed that they restarted the system that was leaking file descriptors, and they’re in the process of building a more permanent fix.