(Edited on 2019-12-30 to reflect the availability of icinga2-watchdog.sh and to fix minor copy problems related to the timeline and headlines, edited 2020-07-16 to fix links and minor copy problems.)
Learning the hard way about data loss, backups, and monitoring
Once upon a time, I had a really nicely tuned Nagios 1.x monitoring system for The Obscure Organization. It was in production for 6-7 years, from 2009 or maybe 2010 to 2016. It ran on a CentOS 5 host on the Nexcess VPS service.
I had it tuned so that it would only alert if something was broken, it would shut up in the middle of the night, and realistically it could go weeks or months between alert notifications. The systems it was monitoring were also very stable. I want to emphasize that I had it really, really well tuned – better than any other monitoring system I’ve ever worked with.
Trouble in Monitoring Paradise
But then in the early part of 2016 I got an email from Nexcess saying they were shutting down that line of business. They asked me several times if I had migrated away from their service and I said “No!”. At the time I was working on an extremely demanding customer engagement, putting in 60+ hour weeks, and I could not take the time to migrate this system on their schedule.
After that I heard nothing more from them, but they kept sending bills like clockwork that got paid through a PayPal subscription, so I figured all was well and I still had time.
The only production workload running on that server was the Nagios monitoring system. It monitored all of the Obscure hosts, the services that were running on them, and the Bareos / Bacula backup system jobs that reliably backed up all the servers. It also monitored some odds and ends of information systems I cared about: I’d get a text if my printer ran out of paper.
I’ve had either Bacula or its more libre-focused fork Bareos running in production for Obscure since 2005. I had to use it in anger in 2008, when Obscure suffered a catastrophic failure of the RAID controller on our main server, and it worked flawlessly when it counted. I completed a full restore to a borrowed VM in about 6 hours, almost all of it waiting time for the data to spool off the backup disks. In 2016 I had enough disk space attached to hold 3 months of backups (dailies for a week, differential weeklies for a month, and full monthlies for 3 months).
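For the curious, here is a minimal sketch of the kind of Bareos director configuration that could implement a rotation like that. This is an illustration, not my original config: the pool names, schedule times, and retention values are assumptions.

```
# Hypothetical bareos-dir.conf excerpt; names and times are illustrative.
Schedule {
  Name = "ThreeMonthCycle"
  Run = Level=Full Pool=Monthly 1st sun at 23:05            # full monthlies
  Run = Level=Differential Pool=Weekly 2nd-5th sun at 23:05 # differential weeklies
  Run = Level=Incremental Pool=Daily mon-sat at 23:05       # dailies
}

Pool {
  Name = "Daily"
  Pool Type = Backup
  Volume Retention = 7 days    # keep dailies for a week
  AutoPrune = yes
  Recycle = yes
}

Pool {
  Name = "Weekly"
  Pool Type = Backup
  Volume Retention = 1 month   # keep differentials for a month
  AutoPrune = yes
  Recycle = yes
}

Pool {
  Name = "Monthly"
  Pool Type = Backup
  Volume Retention = 3 months  # keep fulls for three months
  AutoPrune = yes
  Recycle = yes
}
```

Note that AutoPrune and Recycle only reclaim volumes once their retention expires; if you misjudge your volume sizes or data growth, the backup disks can still fill up, which is exactly what bit me.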
When I needed to restore something trivial due to operator error in the waning months of 2016, I discovered to my horror that:
- the backup system had stopped working several months earlier due to a low disk space condition, and
- the Nagios server had had all of its backup files deleted or overwritten months before.
I lost the entire configuration of my finely-tuned monitoring system! 😱😭
Nexcess didn’t have any backups of the decommissioned systems, which seemed kind of reckless. Or maybe their backups had rotated out of their storage pools by that point too.
Aftermath and Recovery
In January of 2017 I got Nexcess to refund Obscure more than 6 months of service charges, but that doesn’t really make up for losing the entire Nagios configuration.
In February of 2017 I started building a new Icinga2 server on a new host, the only one that Obscure currently runs in AWS EC2, with the intent of replacing the old Nagios server.
I had to set the new Icinga2 server aside as I cared for my mother in her last months; she was diagnosed with terminal lung cancer in April of 2017 and died on November 28, 2017.
I didn’t get the new Icinga2 server configured to do anything useful until April of 2019, and I got it to be effectively a superset of what I had in the old Nagios monitor by June of 2019.
You can tune your monitoring system too well! If it is normally silent, it can lull you into a false sense of security.
This is why I have Uptime Robot configured as a secondary monitoring system for the Icinga system that replaced the old, now-lost Nagios system. I also have Uptime Robot monitor the most critical hosts and services for Obscure, in case the new primary monitoring system fails. The Uptime Robot free tier is pretty capable: it does a good enough job where it counts and sends alerts to a non-Obscure email address if things go boom.
I also wrote a watchdog script, icinga2-watchdog.sh, that runs through cron and tickles me (and the Slack #alerts room where these alerts go) once per day. Between Uptime Robot and the watchdog script, it should be hard to ignore a cascading failure that takes out the monitoring system.
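For reference, here is a minimal sketch of what a cron-driven watchdog like that can look like. It is not the actual icinga2-watchdog.sh; the Slack webhook URL and message wording are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical daily heartbeat in the spirit of icinga2-watchdog.sh.
# The webhook URL and message text are placeholders, not the real script.
set -euo pipefail

SLACK_WEBHOOK_URL="https://hooks.slack.com/services/REPLACE/ME"  # incoming webhook for #alerts
HOST="$(hostname -f)"

# Check whether the icinga2 daemon is running and build a status line.
if systemctl is-active --quiet icinga2; then
  TEXT="icinga2 watchdog: icinga2 is alive on ${HOST}"
else
  TEXT="icinga2 watchdog: ALERT, icinga2 is NOT running on ${HOST}"
fi

# Post the heartbeat to Slack; cron will also mail any curl errors.
curl -fsS -X POST -H 'Content-Type: application/json' \
  --data "{\"text\": \"${TEXT}\"}" "${SLACK_WEBHOOK_URL}" > /dev/null
```

Run it from cron once per day, e.g. `0 9 * * * /usr/local/bin/icinga2-watchdog.sh`. The trick is that the message arrives whether or not anything is wrong: the day it stops arriving is itself the alert.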
SIC TRANSIT GLORIA MUNDI