Incident report: npm, Inc. operations incident of January 6, 2018

On Saturday, January 6, 2018, we incorrectly removed the user floatdrop and blocked the discovery and download of all 102 of their packages on the public npm Registry. Some of those packages were highly depended on, such as require-from-string, and removal disrupted many users’ installations.

On Sunday, we published an initial blog post to clarify that this issue was an internal operations issue and not a security issue, but at the time of that post we lacked many details because we had not yet conducted npm’s post-incident retrospective process. This disclosure follows our retrospective and goes into detail about how this mistake happened and what actions we’ve already taken and will take to prevent similar incidents.

A full list of the affected packages is at the end of this post.

Root cause

npm’s automated spam analysis process examines every package publication for signals that a package may be spam. These signals include data about the publisher as well as the package’s README.

On this date, a package was published that contained spam content plus the README for floatdrop’s legitimate package timed-out. Because of the matching READMEs, our spam system flagged floatdrop as associated with the spammer. In the course of reviewing and acting on spam reports, an npm staffer acted on this flag without further investigating the user and removed the user and all of their packages from the registry.

Within 60 seconds, it became clear that floatdrop was not a spammer—and that their packages were in heavy use in the npm ecosystem. The staffer notified colleagues and we re-activated the user and began restoring the packages to circulation immediately.

Most of the packages were restored quickly, because the restoration was a matter of unsetting the deleted tombstones in our database, while also restoring package data tarballs and package metadata documents. However, during the time between discovery and restoration, other npm users published a number of new packages that used the names of deleted packages. We locked this down once we discovered it, but cleaning up the overpublished packages and inspecting their contents took additional time.

Background

When are packages and accounts removed?

As a general rule, the npm Registry is and ought to be immutable, just like other package registries such as RubyGems and crates.io. The basis of open source software development requires that developers are able to depend on the code they build into their projects. In addition, a large global network of mirrors and caches means that removing a package from npm’s Registry doesn’t really make it “go away,” anyway.

However, there are legitimate cases for removing a package once it has been published.

In a typical week, most of the npm support team’s work is devoted to handling user requests for package deletion, which is more common than you might expect. Many people publish test packages and then ask to have them deprecated or deleted. There is also a steady flow of requests to remove packages that contain private code that users have published inadvertently or inappropriately.

As of 2016, users are unable to delete packages more than 24 hours after they are published. npm staff must evaluate these requests on a case-by-case basis to assess the risk of negative effects on other developers’ projects if a dependency is removed. In many of these cases, deprecating the package instead of deleting it solves the user’s problem, but for some, removing the package from the registry is the best solution.
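
The 24-hour rule described above comes down to an age check on the publication time. The following is a minimal sketch of that check, not npm's actual implementation; the function name `canSelfUnpublish` and the millisecond timestamps are assumptions made for illustration.

```typescript
// Hypothetical illustration of the 24-hour self-service deletion window.
const SELF_UNPUBLISH_WINDOW_MS = 24 * 60 * 60 * 1000;

// Returns true while the publisher may still delete the package themselves.
// Once the window has closed, the request goes to npm support for review.
function canSelfUnpublish(publishedAtMs: number, nowMs: number): boolean {
  return nowMs - publishedAtMs <= SELF_UNPUBLISH_WINDOW_MS;
}

const publishedAt = Date.UTC(2018, 0, 6, 12, 0, 0);
const twoHoursLater = publishedAt + 2 * 60 * 60 * 1000;
const twoDaysLater = publishedAt + 48 * 60 * 60 * 1000;

console.log(canSelfUnpublish(publishedAt, twoHoursLater)); // true
console.log(canSelfUnpublish(publishedAt, twoDaysLater)); // false
```

Anything that fails this check would fall into the case-by-case review described above, where deprecation is often the better outcome.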

The second broad category of deletions is when npm removes a package from the registry because it contains problematic content. Malware and spam are examples of content we will delete. We are obligated to remove packages that would harm others or violate the law, and we also believe in keeping the registry free from packages that serve no valid purpose.

Spam—packages that are either blank or populated with someone else’s code, with READMEs that attempt to direct traffic to another website—has become a far larger problem in the npm Registry in the last year. This is an unwelcome side effect of our community’s popularity, because packages’ pages on the npmjs.com site have become highly ranked in search engines.

Fortunately, our understanding of the problem has also increased as our tooling gets better at surfacing it to us. Working with Smyte, we have developed systems to analyze package contents as they are published, as well as flag users with problematic posting habits or associations with previously detected spam. These flags are posted in a Slack channel for review by npm support staff. We then take a closer look at the details of why the user or package was flagged, and, when we feel it is appropriate, remove it from the registry.

To support these common workflows, we have internal tools for removing packages and user accounts in one action, steps we have taken many thousands of times in the last few months. Deleting a user’s account is the most common action we take in response to spam and it was this tool that was implicated in Saturday’s incident. Our systems incorrectly flagged floatdrop, and npm personnel mistakenly removed their account.

When are package names reused?

Another general principle, and a corollary to our prohibition against removing packages more than 24 hours after they’re published, is that a package name and version should not be reused on the registry. If I publish foo@1.2.3 and other developers depend on this package in their projects, it is bad for me to remove this package and break their dependencies, but it’s even worse to publish a new foo@1.2.3 that does something else. Everyone whose projects depend on this package would pull in the new code automatically, leading to potentially disastrous results.

In cases where the npm staff accepts a user’s request to delete a package, we publish a replacement package by the same name—a security placeholder. This both alerts those who had depended on it that the original package is no longer available and prevents others from publishing new code using that package name. At the time of Saturday’s incident, however, we did not have a policy to publish placeholders for packages that were deleted if they were spam. This made it possible for other users to publish new versions of eleven of the removed packages.
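
To make the difference between a freed name and a held name concrete, here is a small in-memory sketch. It is an illustration under stated assumptions, not the registry's real data model: the `entries` table, the `publish` and `deleteWithPlaceholder` functions, and the rejection messages are all hypothetical.

```typescript
// Hypothetical in-memory model; names and shapes are illustrative only.
interface RegistryEntry {
  state: "active" | "placeholder";
  owner: string;
  versions: string[]; // every version ever published under this name
}

const entries: Record<string, RegistryEntry | undefined> = {};

function publish(pkg: string, version: string, user: string): string {
  const entry = entries[pkg];
  if (entry === undefined) {
    entries[pkg] = { state: "active", owner: user, versions: [version] };
    return "published";
  }
  if (entry.state === "placeholder") {
    return "rejected: name is held by a security placeholder";
  }
  if (entry.owner !== user) {
    return "rejected: not the owner";
  }
  if (entry.versions.indexOf(version) !== -1) {
    return "rejected: version already used";
  }
  entry.versions.push(version);
  return "published";
}

// Deleting swaps the package for a placeholder instead of freeing the name.
function deleteWithPlaceholder(pkg: string): void {
  const entry = entries[pkg];
  if (entry !== undefined) {
    entry.state = "placeholder";
    entry.owner = "npm";
  }
}

console.log(publish("foo", "1.2.3", "alice")); // published
console.log(publish("foo", "1.2.3", "alice")); // rejected: version already used
deleteWithPlaceholder("foo");
console.log(publish("foo", "1.2.3", "mallory")); // rejected: name is held by a security placeholder
```

In this model, a deletion that skips `deleteWithPlaceholder` and simply drops the entry would leave the name open to the next publisher, which is the gap that applied to spam deletions at the time of the incident.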

After a thorough examination of the replacement packages’ contents, we have confirmed that none was malicious or harmful. Ten were exact replacements of the code that had just been removed, while the eleventh contained strings of text from the Bible—and its publisher immediately contacted npm to advise us of its publication. As many in the npm community have pointed out, however, this oversight could have enabled malicious code to be published and downloaded by users with dependencies on the original packages. We consider this an unacceptable security risk.

Timeline

All times here are in UTC to place them in context with our status incident.

18:36 — floatdrop user deleted

18:43 — floatdrop notified by email

18:58 — first report of require-from-string failing installations

19:17 — user restored; package restoration commences

19:43 — status incident posted to status.npmjs.org

20:12 — all but the 11 overpublished packages are restored

21:58 — overpublished packages are restored, review in progress

22:35 — status incident closed

Steps we’re taking in response

Our first action, which began immediately after the incident concluded, was to implement a 24-hour cooldown on republication of any deleted package name. We make exceptions for npm support staff, who often publish security placeholders as part of their work, and for a package’s original publisher. This work is in testing and will roll out this week. Blocking republication in this way gives us a window of time during which a mistakenly deleted package can be restored without being complicated by external action. It also prevents bad actors from snagging popular package names with malware or useless content.

We have instituted new guidelines about what actions we’re willing to take on weekends or outside of our team’s normal working hours. In particular, we will only delete spam by hand during normal work hours. We will make exceptions for emergencies, such as when phishing content is flagged, but whoever is alerted to the malicious content outside of work hours will consult a second person to assist in review and determine an appropriate course of action.

We will establish a more robust checklist of actions to take when operational incidents happen. Checklists are easy to follow during stressful moments. In particular, posting a status message to promptly alert our users that an incident is in progress should happen almost immediately. In this case, we did not post a status message for over an hour after discovering the mistake and our users were left to speculate on the nature of the incident.

We will improve internal tooling to make it easier for a human being to double-check a lower-confidence spam flag. In this case, providing information like the ages of the packages owned by floatdrop, the number of versions published, and the number of dependents would have instantly made it clear that this user was legitimate and their content was not spam. We also will send these signals to Smyte’s spam analysis system to prevent false positives in the first place. Further automating our spam responses to remove human judgement from the loop in clear-cut cases will avoid the cognitive burden of making repetitive decisions about spam.

To improve the safety of automation like this, we will improve our tools for reverting mistaken deletions. Our tools for restoring package data were workable but we had no tools for restoring the relational database data. The user-team-package relations in our database are difficult to work with by hand because they were designed to be—we have policies against ever running SQL by hand against production databases—but in this case we were forced to do so by the lack of other tools. Restoring data should be as easy as deleting it.

Finally, we’ll work closely with Smyte to adapt to the specific spam technique we observed on Saturday. If spam uses copies of existing packages, this should improve our confidence in the spam rating, not destroy our confidence in the user whose content has been copied. This is ongoing work for us and for Smyte, and we expect our analysis here to continue to improve.
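
The republication cooldown in the first step above, with its two exceptions, can be sketched as a single predicate. This is only an illustration of the rule as stated in this post; the `DeletionRecord` and `Publisher` shapes, the `mayRepublish` function, and the usernames other than floatdrop are hypothetical and do not reflect the registry's actual code.

```typescript
// Hypothetical sketch of a republication cooldown; not npm's actual code.
const COOLDOWN_MS = 24 * 60 * 60 * 1000;

interface DeletionRecord {
  deletedAtMs: number;
  originalPublisher: string;
}

interface Publisher {
  username: string;
  isSupportStaff: boolean;
}

function mayRepublish(
  record: DeletionRecord | undefined,
  publisher: Publisher,
  nowMs: number
): boolean {
  if (record === undefined) return true; // the name was never deleted
  if (publisher.isSupportStaff) return true; // staff publish security placeholders
  if (publisher.username === record.originalPublisher) return true; // original publisher
  return nowMs - record.deletedAtMs >= COOLDOWN_MS; // everyone else waits 24 hours
}

// Times taken from the timeline above: deletion at 18:36 UTC, first reports at 18:58 UTC.
const deletion: DeletionRecord = {
  deletedAtMs: Date.UTC(2018, 0, 6, 18, 36),
  originalPublisher: "floatdrop",
};
const at1858 = Date.UTC(2018, 0, 6, 18, 58);

const stranger: Publisher = { username: "someone-else", isSupportStaff: false };
const author: Publisher = { username: "floatdrop", isSupportStaff: false };

console.log(mayRepublish(deletion, stranger, at1858)); // false
console.log(mayRepublish(deletion, author, at1858)); // true
```

Under a rule like this, the eleven overpublications on Saturday would have been refused, leaving the names free for a clean restore.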

An apology

Our systems and processes balance the need to eliminate spam with the need to reduce false positives. However, we failed to address the need to recover swiftly and cleanly from human error. We will continue to incorporate this understanding in the design and implementation of systems in the future.

We apologize for this mistake. We further apologize to everyone who experienced broken installations during this incident. We know you rely on us to be an invisible and reliable part of your JavaScript development infrastructure, and on Saturday we were not.

Affected packages

These 11 packages were republished by other users:

create-error-class
duplexer3
gulp-plumber
infinity-agent
is-retry-allowed
pinkie
pinkie-promise
read-all-stream
require-from-string
vinyl-git

This is the full list of all 102 affected packages:

@floatdrop/duplexer2
@floatdrop/express-co
after-event
bem-deps
bem-object
bem-pack
bemjson-to-html
bh-property-helpers
cacha
capture-stack-trace
cctz
chnpm
co-with-promise
configs-overload
connect-once
create-error-class
dag
debug-http
dependencies-diff
deps-graph
deps-normalize
dns-graceful-stack-switch
dns-gracefull-stack-switch
duplexer3
each-done
enb-browserify
express-cocaine-service
express-dinja
express-error-with-sources
express-generators
express-mongo-db
express-mongoose-db
express-public-ip
express-real-ip
express-render-jsx
express-request-id
express-stackman
flatit
funsert
get-iterable
glue-streams
got-promise
gulp-batch
gulp-bem
gulp-bem-debug
gulp-bem-js-pack
gulp-bem-pack
gulp-bh
gulp-ext
gulp-grep-stream
gulp-plumber
gulp-reload
gulp-start
gulp-start-process
gulp-watch
hashdir
httpinkie
ignore-middleware-error
infinity-agent
is-retry-allowed
jsot
jsot-bh
memorize-middleware
memorize-promise
migratio
missing-middleware-error
nested-prop
npmup
object-match-statement
p-batch
parse-bem-identifier
parsetrace
path2glob
pff
pg-parade
pinkie
pinkie-promise
plugin-jsx
protoscop
proxy-support
qwish
react-styled
read-all-stream
read-streams
reque
require-from-string
require-or-die
save-stream
sent
snap-context
stream-assert
stream-dirs
tech-deps.js
timed-out
tobe
update-my-deps
vinyl-git
wrap-middleware
y-header
y-tabs
yajob
yandex-photos