Earlier today, Robigus stopped responding to search requests (connections to Sphinx daemons).
The part of the Flying Sphinx infrastructure that failed was the Sphinx proxy on your original server: it stopped responding to all TCP requests, and the logs gave no indication as to why. The proxy is a critical part of everything. If it’s down, you can’t connect to your Sphinx daemon at all, and that connection is essential for both searching and regenerating.
I usually get downtime alerts (via a dedicated phone and SMS messages, which should wake me up if it’s the middle of the night), but the proxy failure did not trigger them.
So, I’ve made the following changes:
* If the proxy does not respond to health checks, it’s now treated as a major server failure, and thus I will get SMS alerts.
* Also, if it’s not responding to health checks, Monit will restart the proxy process, so the proxy should be back up within a minute.
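
To illustrate the two changes above, here is a minimal Python sketch of a health check that alerts and restarts on failure. It is not the actual Flying Sphinx or Monit configuration: the function names and the `restart` and `alert` callbacks are hypothetical, and the check only tests whether a TCP connection can be opened, which may be weaker than the real health check.

```python
import socket
from typing import Callable


def proxy_is_healthy(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_and_recover(
    host: str,
    port: int,
    restart: Callable[[], None],   # hypothetical: restarts the proxy process
    alert: Callable[[str], None],  # hypothetical: sends the SMS alert
    timeout: float = 5.0,
) -> bool:
    """Run one health check; on failure, alert first and then restart."""
    if proxy_is_healthy(host, port, timeout):
        return True
    alert(f"Sphinx proxy on {host}:{port} is not accepting TCP connections")
    restart()
    return False
```

Running something like `check_and_recover` on a short interval would give the "within a minute" recovery described above. A connect-only check could still miss a process that accepts connections but never answers, so a fuller check would probably send a real request and wait for a reply.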
This is now all in place, but I’ll continue to think through better ways of handling such situations. I’m very sorry for the downtime!
Posted Oct 03, 2018 - 18:49 AEST