The underlying problem was out-of-memory errors from the database, which essentially killed the API. In turn, the Sphinx proxies that authenticate daemon search requests couldn't get updated credential lists, and thus started blocking search requests for some customers. On top of this, it happened while I was sleeping, and because Pingdom didn't consider the API to be offline, I didn't receive any alerts to wake me up. The database had only recently been switched from a legacy plan to a current one, and I believe this is related. So, to address all of this:
* I've lodged a ticket with Heroku to clarify the database change and any associated memory changes.
* I will be updating the proxy to continue with out-of-date credentials if it can't retrieve new ones, instead of blocking *all* access on a given server in the case of an API failure.
* I will be connecting error spikes in Bugsnag to alerts on my dedicated phone, to ensure I'm woken up should similar issues crop up again, rather than responding after a delay of several hours.
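
For anyone curious about the proxy change, here is a rough illustration of the fallback idea in Python. It is only a sketch, not the actual proxy code: the `CredentialCache` class and the `fetch` callable (standing in for the API call that returns the current credential list) are hypothetical names made up for this example.

```python
import time
from typing import Callable, Iterable, Optional, Set


class CredentialCache:
    """Holds the last successfully fetched credential list.

    If a refresh fails, the previous (possibly out-of-date) list stays in
    place, so requests that were already authorised keep working.
    """

    def __init__(self, fetch: Callable[[], Iterable[str]]) -> None:
        # `fetch` is a hypothetical stand-in for the API request.
        self._fetch = fetch
        self._credentials: Optional[Set[str]] = None
        self._fetched_at: Optional[float] = None

    def refresh(self) -> bool:
        """Try to update the list. Returns False and keeps the old list on failure."""
        try:
            fresh = set(self._fetch())
        except Exception:
            # API unreachable or erroring: carry on with what we already have.
            return False
        self._credentials = fresh
        self._fetched_at = time.time()
        return True

    def is_allowed(self, key: str) -> bool:
        """Check a key against the most recent list we managed to fetch."""
        if self._credentials is None:
            # Nothing has ever been fetched, so there is nothing to fall back on.
            return False
        return key in self._credentials

    def age_seconds(self) -> Optional[float]:
        """How stale the current list is, or None if never fetched."""
        if self._fetched_at is None:
            return None
        return time.time() - self._fetched_at
```

The trade-off in a design like this is that a revoked credential keeps working until the API is reachable again, which seems preferable to blocking every customer on a server; `age_seconds` is there so staleness can at least be logged or alerted on.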
I am very sorry for this issue occurring, and greatly appreciate your patience and understanding.
Posted Mar 01, 2016 - 11:23 AEDT
Monitoring
Everything's been functioning fine for a little while now, but I will continue to keep an eye on things and put measures in place to stop this issue from having such a far-reaching impact again.
Posted Mar 01, 2016 - 10:22 AEDT
Update
Still hunting down the finer details, but API and daemon behaviour seems to be returning to normal.
I'm very sorry for this outage. I'm in Australia, and this problem happened overnight. I do have a dedicated phone for Pingdom alerts, but this particular problem didn't flow through to Pingdom - something I'll be remedying as soon as the initial problem is confirmed and fully resolved, so future such issues wake me up and are dealt with far more promptly.
Posted Mar 01, 2016 - 08:58 AEDT
Identified
Major API outage related to an underlying database problem has been fixed. API requests will now work reliably again. Will be following up directly with customers who had reported problems - if you're still seeing problems, do get in touch.
Posted Mar 01, 2016 - 08:26 AEDT
Investigating
Major problems that (so very annoyingly) did not trigger Pingdom and thus earlier escalation; reason as yet unknown. Investigating.