Publishing issues 2014-02-12

Last night, from about 9:45pm to 11:00pm Pacific, some publishes to the registry failed. The root cause was a previously undiscovered bug in CouchDB that caused replication to fail, leading to conflicts when users tried to publish.

In our current architecture (which we outlined last week) we have a master database, known as skimdb, which accepts reads and writes, and two replicated copies of it, which we’ll call skimdb-2 and skimdb-3, which serve only reads. The bug caused replication to halt, which meant new data on the master didn’t get to the replicas.
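As a rough illustration of that topology, here is a minimal Python sketch. It is a toy model, not how CouchDB or the registry is implemented: the `Database` class, the `publish` helper, the `example-pkg` name, and the use of a plain integer as a revision are all assumptions made for the example. It only shows how the master moves ahead of the replicas once replication halts.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    name: str
    accepts_writes: bool
    # package name -> latest revision number (a stand-in for a CouchDB revision)
    revs: dict = field(default_factory=dict)


def publish(master, replicas, package, replication_running):
    """Apply a write on the master; copy it to the replicas only if replication runs."""
    if not master.accepts_writes:
        raise ValueError(f"{master.name} is read-only")
    master.revs[package] = master.revs.get(package, 0) + 1
    if replication_running:
        for replica in replicas:
            replica.revs[package] = master.revs[package]


skimdb = Database("skimdb", accepts_writes=True)
replicas = [Database("skimdb-2", False), Database("skimdb-3", False)]

publish(skimdb, replicas, "example-pkg", replication_running=True)   # healthy
publish(skimdb, replicas, "example-pkg", replication_running=False)  # replication halted

print(skimdb.revs["example-pkg"])                 # 2
print([r.revs["example-pkg"] for r in replicas])  # [1, 1]
```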

This meant that if you published after replication had stopped, and then tried to publish again, you could see an error due to conflicts: a read served by a stale replica returned an out-of-date revision, and the master rejects writes that are not based on its current revision. There was, however, a 1 in 3 chance that even with both replicas delayed, you could both read from and write to the master, in which case your publish would have worked (which is why it worked for some people when they retried).
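The 1 in 3 figure can be checked with a small Python sketch. It assumes reads are spread evenly across the three servers and that a publish succeeds only when the revision it read matches the master's current revision. The function names and the integer revisions are hypothetical, chosen for the example.

```python
from fractions import Fraction


def publish_succeeds(read_rev, master_rev):
    """A write is accepted only if it is based on the master's current revision."""
    return read_rev == master_rev


def retry_success_probability(master_rev, replica_revs):
    """Chance that a publish works, assuming the read hits each server equally often."""
    read_sources = [master_rev, *replica_revs]  # master plus every replica
    ok = sum(publish_succeeds(rev, master_rev) for rev in read_sources)
    return Fraction(ok, len(read_sources))


# Both replicas are stuck on revision 1 while the master is at revision 2.
print(retry_success_probability(2, [1, 1]))  # 1/3

# Once replication has caught up, every read is current.
print(retry_success_probability(2, [2, 2]))  # 1
```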

The CouchDB team are already hard at work coming up with a fix for the replication bug. In the meantime, we are implementing better monitoring of the replication status of the two downstream servers, as well as modifying our servers to allow them to accept the gigantic URLs which are being accidentally created by the bug.
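For readers curious what such a replication check might look like, here is a minimal Python sketch. It is not our actual monitoring. It assumes you have already fetched and parsed the JSON from CouchDB's `_active_tasks` endpoint, and that replication tasks carry integer `source_seq` and `checkpointed_source_seq` fields, as they do in CouchDB 1.x. The `check_replication` name, the `max_lag` threshold, and the sample data are hypothetical.

```python
def check_replication(active_tasks, expected_targets, max_lag=1000):
    """Return {target: problem description} for replicas that look unhealthy."""
    problems = {}
    # Index the running replication tasks by the database they write to.
    replications = {
        task["target"]: task
        for task in active_tasks
        if task.get("type") == "replication"
    }
    for target in expected_targets:
        task = replications.get(target)
        if task is None:
            # A halted replication no longer shows up as an active task.
            problems[target] = "no active replication task"
            continue
        lag = task["source_seq"] - task["checkpointed_source_seq"]
        if lag > max_lag:
            problems[target] = f"{lag} updates behind the source"
    return problems


# Hypothetical snapshot of a parsed _active_tasks response.
tasks = [
    {"type": "replication", "target": "skimdb-2",
     "source_seq": 5000, "checkpointed_source_seq": 4990},
    {"type": "indexer", "database": "skimdb"},
]
print(check_replication(tasks, ["skimdb-2", "skimdb-3"]))
# {'skimdb-3': 'no active replication task'}
```

Running a check like this on a schedule and alerting on any non-empty result would catch both a stopped replication and one that is running but falling behind.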

Update: after we posted this post-mortem, several people reported ongoing issues, so we re-checked our servers and discovered that an additional, incomplete replacement replica was accidentally in production rotation. We have now removed it from production, which should resolve the issue for everyone. We apologize for accidentally subscribing to the “move fast and break things” ethos, and will be improving internal communication to make sure that can’t happen again.