How our new backup system saved 24+ hours of downtime
Remember when we announced our new infrastructure in October last year? Part of the innovation, which we were particularly proud of, was the backup/restore system we built in-house. A few days ago this system was put to its first critical real-life test and the results were impressive. We were able to restore 3 times more data, 7 times faster, than in the previous such event, when we were still using the old backup solution. Here is how we did it.
How often do we need massive backup restores?
The short answer is: very rarely. Having a highly redundant infrastructure with multiple SSDs in RAID almost eliminates the need for such restores. Normally, when an SSD fails, it is seamlessly replaced with a new piece of hardware without any noteworthy downtime or data loss. And disk failures are very common: for a provider of our size, it is normal to see such events on an almost daily basis. However, every now and then, an unfortunate coincidence of several hardware and software failures at once can make the standard hardware replacement impossible. These are the times when we need to restore all the accounts that were on the damaged instance from our backup copies.
Previously, before our new backup system.
The previous time we needed to make a full backup restore of a whole shared hosting server was more than a year ago. Back then we were using R1Soft backup, which is among the most popular in our industry. Hosting providers like us use this software for two main reasons. First, it is quite reliable. We’ve almost never had any serious issues with missing or corrupt backups. And second, it is very lightweight and does not create significant load on the production servers while creating the backups (a resource-intensive process that takes place every day). With these two features R1Soft works perfectly 99% of the time: when it creates the backups and when individual backup copies are needed.
However, on the rare occasions when a full restore of multiple accounts is necessary, R1Soft has one serious drawback: the recovery process is painfully slow and the affected sites can experience a prolonged outage. In the event in question, all our affected accounts were down for 28 hours. It took this long for two reasons. First, R1Soft does not allow simultaneous restores from and to multiple locations. All the data needs to be recovered through one single network interface and this is slow. Second, the recovery cannot be incremental and the server instance is down during the whole restore process. All affected sites can only come back online at the same time, after all the data has been transferred from the backup server to the production machine. Therefore, even the smallest website could not be brought back up until we had restored the full server.
Most shared hosting providers would hardly consider this story a serious problem that requires further action once the restore is over. After all, only a single machine was affected and all customers got their websites back without data loss. The downtime of the sites was also almost negligible on a yearly basis: 28 hours is just 0.3% of the year. However, at SiteGround, we were quite unhappy with the duration of the issue and were determined to prevent this from repeating in the future.
And now, after our new backup system.
That’s how we set our minds on creating our own backup system to guarantee a faster restore process, and our talented DevOps department started working on it. We launched the new solution in October 2015, but it wasn’t until just a few days ago that we had to use it in an event similar to the one described above. Unlike R1Soft, the solution we used back then, our own system makes distributed backups and allows simultaneous restores from multiple backup instances to multiple production servers. This time we were able to recover 4TB of data (nearly three times more than the previous time) in just 4 hours, compared to the 28 hours from the story above. Moreover, our system allows incremental recovery, so the first accounts were up just a few minutes after the issue was identified, and the longest downtime (about 4 hours) affected only a few individual sites. This brought the average downtime for all affected accounts down to less than 2 hours, compared to 28 hours before. Quite an impressive improvement, isn’t it? But…
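To illustrate why incremental, parallel recovery lowers the average downtime so much, here is a minimal Python sketch. It is a toy model and not our actual restore code: the function names, the account sizes, the per-stream throughput and the smallest-first ordering are all assumptions made up for the example.

```python
import heapq


def all_at_once_downtime(sizes_gb, throughput_gb_per_h):
    """Non-incremental restore over one stream: every account stays down
    until the whole data set has been transferred."""
    total_hours = sum(sizes_gb) / throughput_gb_per_h
    return [total_hours] * len(sizes_gb)


def incremental_downtime(sizes_gb, throughput_gb_per_h, streams=1):
    """Incremental restore: each account comes back as soon as its own data
    is in place. Accounts are handed out smallest-first (an assumption) to
    whichever of the parallel streams becomes free first."""
    free_at = [0.0] * streams          # time at which each stream is idle
    heapq.heapify(free_at)
    downtimes = []
    for size in sorted(sizes_gb):
        start = heapq.heappop(free_at)
        end = start + size / throughput_gb_per_h
        downtimes.append(end)          # account is online again at `end`
        heapq.heappush(free_at, end)
    return downtimes


if __name__ == "__main__":
    sizes = [4, 1, 3, 2]               # hypothetical account sizes in GB
    old = all_at_once_downtime(sizes, throughput_gb_per_h=1.0)
    new = incremental_downtime(sizes, throughput_gb_per_h=1.0, streams=2)
    print("average downtime, all at once:", sum(old) / len(old))
    print("average downtime, incremental:", sum(new) / len(new))
    print("longest downtime, incremental:", max(new))
```

In this model the longest downtime shrinks because of the parallel streams, while the average shrinks even further because small accounts no longer wait for large ones.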
Can it get even faster?
Yes, it can! In our latest massive restore case, we actually were not able to use the InfiniBand network connectivity between our backup servers and the production ones, as planned for such cases. Thus the data was transferred through the standard network at 1 Gbit/s, instead of over the 10 Gbit/s InfiniBand connection. This, we found, was due to a dormant hardware issue that we were able to discover only during an actual restore. However, we have already made sure that this will not be an issue next time, which will make the restore even faster.
Another thing is that with the new system we can theoretically restore to an unlimited number of production instances simultaneously. In practice we are limited, not by the backup system itself, but by the way our DNS system works at the moment. We had three instances affected by the issue and each of them had individual DNS. Thus we had to restore to only three new instances using the old IPs, so that the domain names that are not registered with us could continue to work as before and would not experience additional downtime due to DNS propagation time. To avoid this limitation in the future, we plan to work on a brand new central DNS and/or proxy system.
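For a rough sense of scale, the figures above can be turned into a back-of-the-envelope calculation. The Python sketch below is only arithmetic under stated assumptions: decimal units (1 TB = 10^12 bytes), the network as the only bottleneck, and one link per restored instance. The function names are hypothetical and the real restore involves other limits (disks, file counts, protocol overhead) that this ignores.

```python
def effective_gbit_per_s(data_tb, hours):
    """Average aggregate throughput achieved for a restore of `data_tb`
    terabytes that took `hours` hours."""
    bits = data_tb * 1e12 * 8
    return bits / (hours * 3600) / 1e9


def ideal_transfer_hours(data_tb, link_gbit_per_s, links=1):
    """Lower bound on transfer time if `links` parallel links of the given
    speed were fully saturated (assumption: network is the only limit)."""
    bits = data_tb * 1e12 * 8
    return bits / (link_gbit_per_s * 1e9 * links) / 3600


if __name__ == "__main__":
    # 4 TB restored in 4 hours, as described in the article.
    print("achieved Gbit/s:", round(effective_gbit_per_s(4, 4), 2))
    # Assumed setup: three target instances, one link each.
    print("ideal hours at 1 Gbit/s:", round(ideal_transfer_hours(4, 1, links=3), 2))
    print("ideal hours at 10 Gbit/s:", round(ideal_transfer_hours(4, 10, links=3), 2))
```

4TB in 4 hours works out to roughly 2.2 Gbit/s in aggregate, more than a single 1 Gbit/s link can carry, which is consistent with the restore running over several links in parallel.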
Our backup system story is just another example of how we approach problems. We are never satisfied to just fix the immediate issue and forget about it until the next time. We take each problem as a challenge that needs a unique solution. And if such a solution does not exist at that time, we never shy away from inventing it ourselves.