Sphinx daemons not responding on Robigus
Incident Report for Flying Sphinx
Earlier today, Robigus stopped responding to search requests (connections to Sphinx daemons).

The part of the Flying Sphinx infrastructure that failed was the Sphinx proxy on your original server: it stopped responding to any TCP requests, and the logs gave no indication as to why. This is a critical component: if the proxy’s down, you can’t connect to your Sphinx daemon at all, and that connection is essential for both searching and regenerating.

I usually get downtime alerts (via a dedicated phone and SMS messages, which should wake me up if it’s the middle of the night), but the proxy failure didn’t trigger them.

So, I’ve made the following changes:

* If the proxy does not respond to health checks, it’s considered a major server failure, and thus I will get SMS alerts.
* Also, if it’s not responding to health checks, Monit will restart the proxy process, so the proxy should be back up within a minute.
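
For illustration only, here is a minimal Python sketch of the logic behind those two changes: a TCP health check that alerts and restarts on failure. This is not the actual Flying Sphinx setup (that is handled by Monit, and the real health check is not described here). The function names and the `restart` and `alert` callbacks are hypothetical placeholders.

```python
import socket


def proxy_is_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        # A refused or timed-out connection raises OSError (or a subclass of it).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_and_recover(host, port, restart, alert, timeout: float = 2.0) -> bool:
    """Run one health check; on failure, send an alert and restart the proxy.

    `restart` and `alert` are hypothetical callbacks standing in for the
    process restart and the SMS alert. Returns whether the proxy was
    healthy at check time.
    """
    if proxy_is_healthy(host, port, timeout):
        return True
    alert(f"proxy on {host}:{port} is not accepting TCP connections")
    restart()
    return False
```

A check like this only confirms that the proxy accepts TCP connections, which matches the failure seen in this incident; it would not catch a proxy that accepts connections but then fails to forward them.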

This is now all in place, but I’ll continue to think through better ways of handling such situations. I’m very sorry for the downtime!
Posted Oct 03, 2018 - 18:49 AEST
This incident affected: Robigus.