package tarball read outage today

Earlier today, July 6, 2016, the npm registry experienced a read outage for 0.5% of all package tarballs for all network regions. Not all packages and versions were affected, but the ones that were affected were completely unavailable during the outage for any region of our CDN.

The unavailable tarballs were offline for about 16 hours, from mid-afternoon PDT on July 5 to early morning July 6. All tarballs should now be available for read.

Here’s the outage timeline:

Evening of 2016 July 1: We pushed code that implemented a change in how we generate etags for tarballs for for new publications only.
Midday 2016 July 6: We commenced running a script that updated all existing tarballs on all servers to the new etags scheme. This ran over the course of some hours.
10pm PDT 2016 July 5: We received a report of some 502s being returned by some servers. In this course of investigating this, we observed a known issue with a very slow process leak in an existing service. We cleared up the known issue in the mistaken belief that this was causing the 502s.
6am PDT 2016 July 6: We received more reports of 502s on tarballs, and opened a status incident.
8am PDT: The root cause of the issue was identified as a negative file modification times for the specific affected tarballs. A new script was immediately run to fix mtimes on all tarball files appearing in the logs as producing 502 errors.

Over the next hour 502 rates fell to the normal 0.

What we’re doing to prevent outages like this next time

We’re adding an alert on all 500-class status codes, not just 503s. This alert will catch the category of errors, not simply this specific problem.

We’re also revising our operational playbook to encourage examination of our CDN logs more frequently; we could have caught the problem very soon after introducing it if we had carefully verified that our guess about the source of 502s had resulted in making them vanish from our CDN logging. We can also do better with tools for examining the patterns of errors across POPs, which would have made it clearer to us immediately that the error was not specific to the US East coast and was therefore unlikely to have been caused by an outage in our CDN.

Read on if you would like the details of the bug.

nginx and etags

The root cause for this outage was an interesting interaction of file modification time, nginx’s method of generating etags, and cache headers.

We recently examined our CDN caching strategies and learned that we were not caching as effectively as we might, because of a property of nginx. Nginx’s etags are generated using the file modification time as well as its size, roughly as mtime + '-' + the file size in bytes. This meant that if mtimes for package tarballs varied across our nginx instances, our CDN would treat the files from each server as distinct, and cache them separately. Getting the most from our CDN’s caches and from our users’ local tarball caches is key to good performance on npm installs, so we took steps to make the etags match across all our services.

Our chosen scheme was to set the file modification time to the first 32-bit BE integer from their md5 hash. This was entirely arbitrary but looked sufficient after testing in our staging environment. We produced consistent etags. Unfortunately, the script that applied this change to our production environment failed to clamp the resulting integer, resulting in negative numbers for timestamps. Ordinarily, this would result in a the infamous Dec 31, 1969 date one sees for timestamps before the Unix epoch.

Unfortunately, negative mtimes triggered an nginx bug. Nginx will serve the first request for a file in this state and deliver the negative etag. However, if there is a negative etag in the if-none-match header nginx attempts to serve a 304 but never completes the request. This resulted in the the bad gateway message returned by our CDN to users attempting to fetch a tarball with the bad mtime.

You can observe this behavior yourself with nginx and curl:

The final request never completes even though nginx has correctly given it a 304 status.

Because this only affected a small subset of tarballs, not including the tarball fetched by our smoketest alert, all servers remained in the pool. We have an alert on above-normal 503 error rates served by our CDN, but this error state produced 502s and was not caught.

All the tarballs that were producing a 502 gateway timeout error turned out to have negative timestamps in their file mtimes. The fix was to touch them all so their times were inconsistent across our servers but valid, thus both busting our CDN’s cache and dodging the nginx behavior.

tl;dr

The logs from our CDN are invaluable, because they tell us what quality of service our users are truly experiencing. Sometimes everything looks green on our own monitoring, but it’s not green from our users’ perspective. The logs are how we know.