numeric precision matters: how npm download counts work

We frequently get questions about npm’s download stats: what we count, what we don’t, and whether an author’s package download numbers are “real”. I’ve given lots of piecemeal explanations of this on Twitter, but I thought I’d write up the full version so we can point people at it.

npm’s download stats are naïve by design: they are simply a count of the number of HTTP 200 responses we served that were tarball files, i.e. packages. This means the number includes:

automated build servers
downloads by mirrors
robots that download every package for analysis
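
The counting rule described above can be sketched in a few lines. This is only an illustration, not npm’s actual pipeline: the `Response` record shape, the `.tgz` suffix check, and the user-agent labels are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One served HTTP response (hypothetical log record shape)."""
    status: int
    path: str
    user_agent: str = ""


def is_counted_download(resp: Response) -> bool:
    # Counted when the response was an HTTP 200 and the file served was a
    # tarball. Detecting tarballs by the ".tgz" suffix is an assumption.
    return resp.status == 200 and resp.path.endswith(".tgz")


def count_downloads(responses) -> int:
    # Deliberately naive: who (or what) made the request is never inspected.
    return sum(1 for r in responses if is_counted_download(r))


sample = [
    Response(200, "/yourpackage/-/yourpackage-1.0.0.tgz", "human"),
    Response(200, "/yourpackage/-/yourpackage-1.0.0.tgz", "build-server"),
    Response(200, "/yourpackage/-/yourpackage-1.0.0.tgz", "mirror"),
    Response(304, "/yourpackage/-/yourpackage-1.0.0.tgz", "human"),  # not a 200
    Response(200, "/yourpackage", "human"),  # metadata, not a tarball
]

print(count_downloads(sample))  # prints 3
```

Note that the build server and the mirror each add one to the total, exactly like the human does.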

So the count of “downloads” is much larger than the number of people who typed “npm install yourpackage” on any given day. There are some mitigating factors:

Any given mirror will download a given version of a package only once, usually on the day it’s published; mirrors are smart enough not to re-download packages they’ve already seen.
Similarly, build servers will usually not re-download a package they’ve seen before, because it will be in npm’s cache. (Builds that happen in disposable VMs or Docker instances are notable exceptions.)
If you, a human, have installed the package before, it will nearly always be installed from your local npm cache, so that doesn’t get counted either.

Bottom line: most packages get a trickle of downloads every day, and that trickle doesn’t necessarily indicate that they’re being actively used. Only if your package is getting more than 50 downloads per day can you be sure you’re seeing signal instead of noise. You will also get a burst of downloads whenever you publish a new package or a new version of an existing package, because all the mirrors will download it.
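
As a rough sketch of how you might apply that rule of thumb to your own numbers: the 50/day floor comes from the paragraph above, but using the median over a window (so that a one-day publish burst doesn’t skew the result) is an assumption of this example, as are the function name and the sample data.

```python
from statistics import median
from typing import Sequence

NOISE_FLOOR_PER_DAY = 50  # threshold quoted in the text above


def looks_like_signal(daily_downloads: Sequence[int]) -> bool:
    """Return True if a typical day in the window exceeds the noise floor.

    The median is used instead of the mean so that a single burst on
    publish day cannot push a quiet package over the threshold.
    """
    if not daily_downloads:
        return False
    return median(daily_downloads) > NOISE_FLOOR_PER_DAY


# Hypothetical week of counts; day 4 is a publish-day burst in both cases.
trickle_with_burst = [4, 7, 3, 120, 6, 5, 8]
steady_usage = [80, 95, 70, 300, 88, 91, 76]

print(looks_like_signal(trickle_with_burst))  # prints False
print(looks_like_signal(steady_usage))        # prints True
```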

The reasons we don’t filter or discard the automated downloads are:

Bot filtering is really hard and never totally accurate, and it requires constant manual intervention or crazy machine learning to get right. We are not an analytics company, and we’re pretty sure you’d rather we focused on npm itself than on the website stats. Maybe when we’re bigger we’ll do more.
Automated traffic is relevant to some people. Lots of testing and build tool packages are downloaded as part of automation, so automated downloads are a useful indicator of their usage. If we started filtering automated traffic, we’d immediately get asked to create a separate count of “verified bot traffic” or something, and it would quickly spiral out of control.

It’s much easier to say: use these numbers as directional indicators of package popularity. They are not absolute numbers, and they are definitely not the same as the number of “users” of a package.

An aside about automated downloads

Back in February we stopped allowing force-publishing of existing packages. The constant stream of automated downloads by mirrors was part of why: if you publish a package, it is immediately downloaded and copied to dozens of servers worldwide that are outside our control. Allowing you to overwrite that package gave you the false sense that the old version was “gone”, or that if there was sensitive information in the package you had safely erased it. That was never true, and keeping the automated downloads in our counts helps underline that.