[Retrospect] Command time-outs due to API backlog

Incident Report for Flying Sphinx

Resolved

Starting approximately 22 hours ago (01:30 AEDT, 14:30 UTC, 06:30 PST), the Flying Sphinx API came under heavy load from an unprecedented number of command requests (index, stop, start, etc.), which progressively delayed the processing of these commands across the system. Searching was not affected at all, but many customers saw their commands time out. The load ebbed and flowed over the course of six hours, until it was finally resolved at 07:30 AEDT, 20:30 UTC, 12:30 PST.

The initial alert was missed because I am based in Melbourne, Australia, and I slept through the Pingdom notification (normally these wake me - I keep a dedicated phone beside my bed for this purpose alone). Later notifications did wake me, but the worst impacts of this issue came towards the end of the timeframe, so that was not until 06:45 AEDT.

Obviously, this is not acceptable, and I'm very sorry for the major inconvenience. I'll be working to reduce the impact of any one customer's command requests on everyone else's, and I've modified Pingdom to keep sending SMS notifications for as long as any downtime issue remains unresolved - so even if I sleep through one, a later alert will be sure to wake me.

Again, apologies for this significant downtime. I'll be working hard to ensure this problem does not occur again.

Posted Feb 27, 2015 - 23:12 AEDT