tl;dr - Things have been happening. Please read this (especially the “Future Plans” part) if you are currently replicating the registry CouchDB.
As of today, Fastly is sitting in front of all requests to
registry.npmjs.org. The good news here is that requests are hitting the cache well over 90% of the time, and especially for those of you who don’t live near us-east, ping times are now much faster.
If you were using Node v0.6, you may have noticed that you started seeing SSL certificate errors. This is because npm originally used a unique Certificate Authority called “npmCA” to sign the SSL cert that it uses for the registry. However, since “npmCA” isn’t trusted by web browsers and curl, that SSL certificate would cause problems for non-npm clients. Very frequently, someone would complain that the registry is “broken” because their browser can’t load it, and we’d have to explain this situation. It was not ideal.
Early in 2013, I changed the npm client to add a second trusted certificate authority, so that it could use the same SSL cert as the npm website. Before switching over, however, we had to wait for all of the prior versions of Node and npm to no longer be supported. Then, we waited another 6 months just to be safe.
If you are still using Node v0.6 for some reason, or a very old version of v0.8, you can work around this by running:

npm config set strict-ssl false

though I’d really recommend that you upgrade, for this and many other reasons.
A big blocker to getting a CDN set up was being able to get at the log data. Logs can often contain private information, so of course we can’t make them public. However, people really like seeing the download counts on the website, and logs can be really handy for catching certain types of problems in production.
The logs from Fastly are feeding into Loggly using the lylog proxy module. I also have lylog configured to echo locally, so that the data can be parsed daily and fed into the download-count database.
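That daily roll-up could look something like this sketch, which pulls the package name out of each tarball request path and tallies it. The log-line format here is made up for illustration; the real Fastly log format differs.

```javascript
// Sketch: roll daily access-log lines up into per-package download
// counts. Assumes tarball requests look like "GET /<pkg>/-/<file>.tgz";
// the actual log format is an assumption, not what Fastly emits.
var TARBALL = /GET \/([^\/]+)\/-\/[^ ]+\.tgz/;

function countDownloads(lines) {
  var counts = {};
  lines.forEach(function (line) {
    var m = TARBALL.exec(line);
    if (m) counts[m[1]] = (counts[m[1]] || 0) + 1;
  });
  return counts;
}

console.log(countDownloads([
  'GET /express/-/express-3.4.4.tgz 200',
  'GET /express/-/express-3.4.3.tgz 200',
  'GET /request/-/request-2.27.0.tgz 200'
]));
// { express: 2, request: 1 }
```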
All npm registry tarballs are being synchronized into Joyent Manta, using the mcouch module. One nice thing about Fastly is that they let you write your own VCL, so it wasn’t too hard to have different rules for fetching JSON from CouchDB or tarballs from Manta.
If you’re interested in running some compute jobs on the npm package data, you can use the Manta CLI tools and point them at the files sitting in
/isaacs/public/npm. They have the form of
/isaacs/public/npm/$package/_attachments/$package-$version.tgz for the tarballs, and
/isaacs/public/npm/$package/doc.json for the root document as it exists in CouchDB.
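If you’re scripting against that layout, the two path shapes above can be built like so. The path structure is exactly as described; the helper function names are just for this example.

```javascript
// Build the Manta object paths described above for a given package
// and version. The path layout comes from the post; the helper names
// are hypothetical.
function tarballPath(pkg, version) {
  return '/isaacs/public/npm/' + pkg + '/_attachments/' +
         pkg + '-' + version + '.tgz';
}

function docPath(pkg) {
  return '/isaacs/public/npm/' + pkg + '/doc.json';
}

console.log(tarballPath('express', '3.4.4'));
// /isaacs/public/npm/express/_attachments/express-3.4.4.tgz
console.log(docPath('express'));
// /isaacs/public/npm/express/doc.json
```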
The vast majority of npm users just use the npm client, and as long as npm install works, they’re happy.
However, a fair number of users are replicating the npm data directly from the CouchDB instance into a local CouchDB replica, and then using that instead of going to the public registry.
If you are one such user, this is for you.
There are two major issues here:
- If you’ve ever tried it, you know that it consumes an enormous amount of disk space, and crashes or randomly uses all your memory from time to time. It’s common practice to set up a cron to just restart the replicator every so often.
- If you have got it working, then you probably rely on those attachments being in the database, so that you don’t have to make an external network call (even a fastly-cached one) to download packages.
Problem (1) is caused by having every known package stored as an attachment in a single giant
registry.couch file. The .couch file is big and unwieldy, and we can’t do anything interesting with the files themselves. A better architecture is to move the tarballs to a comprehensive data warehouse with compute and analytics capabilities, such as Manta, and just let CouchDB do what it’s best at: managing a GB or so of JSON docs.
However, if we move to such an architecture, and thus solve (1), we risk breaking (2) for you, because now you won’t have all of the tarballs stored locally in your replica.
There are a few solutions that come to mind, such as having an attachment-free “registry-v2” database or something. Another option would be to write a replicator script that only downloads packages that you actually care about, since you likely don’t need ALL of them anyway.
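The core of such a selective replicator is just a filtering decision: keep a set of package names you care about, and skip every change that isn’t in it. Here’s a minimal sketch of that decision; the wiring to CouchDB’s _changes feed is omitted, and the function names are made up.

```javascript
// Sketch of the "only replicate what you care about" idea: given a
// whitelist of package names, return a predicate that says whether a
// change from the registry's _changes feed should be copied locally.
// Names here are hypothetical; the feed wiring is left out.
function makeFilter(wanted) {
  var set = {};
  wanted.forEach(function (name) { set[name] = true; });
  return function (change) {
    return set.hasOwnProperty(change.id);
  };
}

var filter = makeFilter(['express', 'request']);
console.log(filter({ id: 'express', seq: 1 }));      // true
console.log(filter({ id: 'coffee-script', seq: 2 })); // false
```

In practice you’d seed the whitelist with your packages plus their transitive dependencies, since those are needed for installs to work offline.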
Rest assured, we’ll be working to ensure continuity and backwards compatibility to the greatest degree possible.
If you are a replication user, it’d be great if you could describe a bit about your use case. Would it be better to make a network request to get public packages, if it meant that you had a much smaller database to deal with?