Zero One Infinity READMEs
npm has only been a company for 3 years, but it has been a code base for around 5–6 years. Much of it has been rewritten, but the cores of the CLI and registry are still the original code. Having only worked at npm for a year at this point, there are still a lot of things left for me to learn about how the whole system works.
Sometimes, a user files a bug which, in the process of debugging it, teaches you some things you didn’t know about your own system. This is the story of one of those bugs.
Over the past week or so, several people filed issues regarding some strange truncating in npm package pages. In one issue, a user reported what appeared to be a broken link in their description. Another user pointed out that the entire end portion of their README was missing!
As a maintainer of npm’s markdown parser, marky-markdown, I was concerned that these issues were a result of a parsing rule gone awry. However, another marky-markdown maintainer, @revin, quickly noted something odd: the description was cut off at exactly 255 characters, and the README was cut off at exactly 64kb. As my colleague @aredridel pointed out: those numbers are smoking guns.
Indeed, an internal npm service called registry-relational-follower was truncating both the READMEs and descriptions of packages published to the npm registry. This was a surprise to me and my colleagues, so I filed an issue on our public registry repo. In nearly no time at all, our CTO @ceejbot responded by saying that this truncation was intended behavior(!) and closed the issue.

“TIL!” I thought. And that’s when I decided to dig into how the registry handles READMEs… and why.
The Zero One Infinity Rule
Before I dive into exactly what happens to your packages’ READMEs between the moment you write and publish them and the moment they’re rendered on the npm website, let’s address the 800-lb gorilla in the room:
When I discovered that the registry was arbitrarily truncating READMEs, I thought: “Seems bad.”
Maybe you thought this, too.
Indeed, at least one other person did, commenting on the closed issue:
This may be desired by npm, but I doubt any package authors desire their descriptions to be truncated. Also, see zero-one-infinity.
I should point out that commenting negatively on an already closed issue isn’t the best move in the world. However, I appreciated this comment, because it gave me new words to explain my own vaguely negative feelings about this truncation situation — fancy words with a nice name: The Zero One Infinity rule.
The Zero One Infinity rule is a guiding principle made popular by Dutch computer scientist Willem van der Poel and goes as follows:
Allow none of foo, one of foo, or any number of foo. —Jargon File
This principle stands to eliminate arbitrary restrictions of any kind. Functionally, it suggests that, if you are going to allow something at all, you should allow exactly one of it or an infinite number of it. It aligns with a symbiotic rule, the Principle of Least Astonishment, which states:
If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature.
In the end, these principles are fancy, important-sounding ways of saying: arbitrary restrictions are surprising, and we shouldn’t be surprising our users.
Now that we can agree that surprising users with strange and seemingly arbitrary restrictions is no bueno … why does the npm registry currently have this restriction? Certainly npm’s developers don’t want to be surprising developers, right?
An Archaeology of Registry Architecture
Indeed, they don’t! The current restriction on description and README size is a Band-Aid that npm’s registry developers were forced to apply as a result of the original architecture of the npm registry: large READMEs were making npm slow.
How the heck…, you might be thinking. Reasonable. Let’s take a look.
How npm Deals with READMEs on Publish
Currently, here is how your READMEs are dealt with by the registry:
When you type npm publish, the CLI tool takes a look at your .npmignore (or your .gitignore, if no .npmignore is present) and the files key of your package.json. Based on what it finds there, the CLI takes the files you intend to publish and runs npm pack, which packs everything up in a tarball, or .tar.gz file. npm doesn’t allow you to ever ignore the README file, so that gets packed up no matter what!
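The selection rules above can be sketched roughly like this. This is a simplified illustration, not the npm CLI’s actual implementation; the function name and parameter shapes are assumptions for the sketch:

```javascript
// Simplified sketch of the file-selection rules on publish; names
// and shapes are illustrative, not the npm CLI's real code.
function filesToPack(allFiles, { npmignore, gitignore, filesKey } = {}) {
  // .npmignore wins; .gitignore is only a fallback when it is absent.
  const ignored = npmignore !== undefined ? npmignore : (gitignore || []);
  return allFiles.filter((file) => {
    // The README can never be ignored: it is always packed.
    if (/^readme(\.|$)/i.test(file)) return true;
    // An explicit "files" whitelist in package.json takes precedence.
    if (filesKey) return filesKey.includes(file);
    return !ignored.includes(file);
  });
}
```

Note how the README check comes first, so even an ignore rule that matches it is overridden.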
When you type npm publish, your README gets packed into a package tarball. This is what gets downloaded when someone npm installs your package. But this is not the only thing that happens with your README. When npm publish runs npm pack, it also runs a script called publish.js that builds an object containing the package’s metadata. Over the course of your package’s life (as you publish new versions), this metadata grows. First, read-package-json is run and grabs the content of your README file based on what you’ve listed in your package.json. Then publish.js adds this README data to the metadata for your package.

You can think of this metadata as a more verbose version of your package.json — if you ever want to check out what it looks like, you can go to http://registry.npmjs.com/. For example, check out http://registry.npmjs.com/marky-markdown. As you’ll see, there’s README data in there for whichever version of your package has the latest tag. Lastly, publish.js sends this metadata, including your README, to a service called validate-and-store… and here is where we bump into our truncation situation.
npm publish sends the entire README data to the registry, but the entire README does not get written to the database. Instead, when the database receives the README, it truncates it at 64kb before inserting.
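The cutoff behaves something like the sketch below. The function and field names are assumptions, not the registry’s actual code, and this sketch slices by characters where the real limits are byte sizes:

```javascript
// Illustrative sketch of the registry-side truncation; names are
// assumptions, not the registry's actual implementation.
const MAX_DESCRIPTION = 255;   // characters kept from the description
const MAX_README = 64 * 1024;  // 64kb kept from the README

function truncateForStorage(doc) {
  // Everything else in the metadata document passes through untouched.
  return {
    ...doc,
    description: (doc.description || '').slice(0, MAX_DESCRIPTION),
    readme: (doc.readme || '').slice(0, MAX_README),
  };
}
```

Anything past those limits, such as the tail of a long README or the second half of a markdown link in a long description, simply never reaches the database, which is exactly the symptom users reported.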
This means: while we talk about a package on the npm registry as a single entity, the truth is that a single package is actually made up of multiple components that are dealt with by the npm registry services differently. Notably, there’s one service for tarballs and another for metadata, and your README is added to both.

This means that the registry has 2 versions of your README:

- The original version, as a file in the package tarball
- A potentially truncated version, in the package metadata
As you may now be guessing, users have been seeing truncated READMEs on the npm website because the npm website uses the README data from package metadata. This makes a fair amount of sense: if we wanted to use the READMEs in the package tarballs, we’d have to unpack every package tarball to retrieve the README, and that would not be super efficient. Reading README data from a JSON response, which is how the npm registry serves package metadata, seems at least a little more reasonable than unpacking over 350,000 tarballs.
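For reference, the JSON document the website reads looks roughly like the object below. The fields are abbreviated and every value is a placeholder invented for illustration, not real registry data:

```javascript
// Roughly the shape of a registry metadata document; fields are
// abbreviated and all values here are placeholders.
const metadata = {
  name: 'some-package',
  description: 'May be cut off at 255 characters.',
  'dist-tags': { latest: '1.0.0' },
  versions: {
    '1.0.0': { /* a verbose package.json for this version */ },
  },
  // The README for the "latest" version, possibly cut off at 64kb.
  readme: '# some-package\n\nUsage, examples, and so on.',
};

// The website renders metadata.readme rather than unpacking the tarball.
const latest = metadata['dist-tags'].latest;
```

One JSON read per package page is cheap; one tarball unpack per page would not be.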
History Lesson Time
So now we know where the READMEs are truncated, and how those truncated READMEs are used — but it’s still not necessarily clear why. Understanding this requires a bit of archaeology.
Like many things about npm, this truncation was not always the case. On January 20, 2014, @isaacs committed the 64kb README truncation to npm-registry-couchapp, and he had several very good reasons for doing so:
First, allowing extremely large READMEs exposed us to a potential DDoS attack. An unsavory actor could automate publishing several packages with epically large READMEs and take down a bunch of npm’s infrastructure.
Second, extremely large READMEs in the package metadata were exploding the file size of that document, which made GET requests to retrieve package data very slow. Requesting the package metadata happens for every package on an npm install, so ostensibly a single npm install could be gummed up by having to read several packages with very long READMEs that wouldn’t even be useful to the end user, who would either use the unpacked README from the tarball or wouldn’t even need the README if, for example, the package was a transitive dependency far down in the dependency tree.
Interestingly enough, the predicament of exploding document size was a problem that npm had dealt with before.
Remember when we pointed out that a single package is actually a set of data managed by several different services? Like many things at npm, this also was not always the case.
CouchDB comes with an out-of-the-box feature called CouchApp: a web application served directly from CouchDB. npm’s registry was originally exclusively a CouchApp: packages were single, document-based entities with the tarballs as attachments on the documents. The simplicity of this architecture made it easy to work with and maintain, i.e., a totally reasonable version 1.
Soon after that, though, npm began to grow extremely quickly — package publishes and downloads exploded — and the original architecture scaled poorly. As packages grew in size and number, and dependency trees grew in length and complexity, performance ground to a halt and npm’s registry would crash often. This was a period of intense growing pains for npm.
To mitigate this situation, @isaacs split the registry into two pieces: a registry that had only metadata (attachments were moved to an object store called Manta and removed from the CouchDB), which he called skim, and another registry that contained both the metadata and the tarball attachments, called full-fat. This splitting was the first of what would be multiple (and ongoing!) refactoring efforts to reduce the size of package metadata documents and distribute how we process packages across multiple services to improve performance.
If you look at the npm registry architecture today, you’ll see the effects of our now-CTO @ceejbot’s effort to continue to split the monolith: slowly separating registry functionality into multiple smaller services, some of which are no longer backed by the original CouchDB but by Postgres instead.
Plans for the Future
Turns out that nobody thinks arbitrarily restricting README length is a good thing. There are plans in the works for a registry version 3, and changing up the README lifecycle is definitely in the cards. Much like the original shift that @isaacs made when he created the full-fat registry services, the team would ideally like to see README data removed from the package metadata document and moved to a service that can render READMEs and serve them statically to the website. This would bring several awesome benefits:
- No more README truncating! Good-bye, arbitrary restrictions!
- Speeding up the website by moving markdown parsing to its own service.
- Speeding up the website even more by pre-parsing READMEs and serving them statically instead of parsing them on request. (Yes, we cache, but still…)
- READMEs for all versions of a package! By lowering the cost of READMEs, we can not only parse more of a single README, but parse more READMEs too! :)
npm cares deeply about backwards compatibility, so all of the endpoints and functionality of our original API will continue to be supported as the npm registry grows out of its CouchApp and CouchDB origins. This means there will always be a service where you can request a package’s metadata and get the README for the latest version. However, npm itself doesn’t have to use that service. Moving on from it towards our vision of registry version 3 will be an awesome improvement, across several axes.
A friend recently tweeted:
systems as designed are great, but systems as found are awful
This is not a shot at npm; this statement is pretty ubiquitously true. Most systems that are of any interest to anyone are the products of a long and likely complicated history of constraints and motivations, and such circumstances often produce strange results. As displeasing as the systems you find might be, there is still a pleasure in finding out how a system “works” (for certain values of “work,” of course).
In the end, the “fix” for the “bug” was “we’ve got a plan for that, but it’s gonna take a while.” That isn’t all that satisfying. However, the process of tracking down a seemingly simple element of the npm registry system and exploring it across services and time was extremely rewarding.
In fact, in the process of writing this post I became aware that Crates.io, the website for the Rust programming language’s package manager Cargo, was dealing with a very similar situation regarding their package READMEs. Instead of trying to remove them from their package metadata like us, they’re considering putting them in! If I hadn’t had the opportunity to dig around in the internals of npm’s registry, I might not have been ready to offer them suggestions with the strength of 5 years of experience.
So — the moral of the story is this: When you can, take the time to dig through the caves of your own software and ask questions about past decisions and lessons. Then, write down what you learn. It might be helpful one day, and probably sooner than you think.