T-SQL Tuesday #35: A Horror Story

T-SQL Tuesday! This year, time seems to be flying by faster than ever. It's October already, and we're up to the 35th installment. This time we're hosted by Nick Haslam (blog | @nhaslam), who asks: "What is your most horrifying discovery from your work with SQL Server?"

The story that scares me the most is one I've actually told a few times lately. It's all about the condition of database backups at my current job right after I started there. As soon as I found out what was going on, I knew resolving this issue would be my top priority.

When I started at my current job, our main production database was 22TB in size. I was of course aware of this going in, and had done plenty of reading about best practices for dealing with very large databases. Backups are a commonly covered topic: many articles point out that native SQL Server backups aren't always the best choice for VLDBs and that SAN snapshots may be more appropriate in many environments. A native backup reads the entire database and writes its contents out to a backup file, which can take many hours, while a SAN snapshot takes only a few seconds to create a read-only image of all database files at a particular point in time. That snapshot can then be mounted on a different server and backed up from there, so the main server doesn't incur any of the processing or I/O load associated with the backup.
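For anyone who hasn't run a native backup on a database anywhere near this size, here's roughly what that long-running command looks like. The database name and backup path are hypothetical, purely for illustration:

```sql
-- A minimal native full backup, for contrast with a SAN snapshot.
-- Database name and path are hypothetical.
BACKUP DATABASE [BigProductionDB]
TO DISK = N'X:\Backups\BigProductionDB_Full.bak'
WITH CHECKSUM, STATS = 5;  -- report progress every 5 percent
```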

While going over all of our systems to get a general lay of the land, I found that backups of the production database were being created not with snapshots but with clones. For those who might not be familiar: a snapshot is basically a "virtual" copy of the data that can be created in seconds and consumes very little disk space (generally only changes made since the snapshot was created are actually written to disk), while a clone is a full physical copy of the data that can take a very long time depending on how much data is being copied. Clones aren't always bad, and in some cases they can be an excellent solution to a problem. In this case, however, they were a disaster waiting to happen, because the clone was taking about 5 days to complete. On top of that, there was only enough disk space to store one clone at a time. So the backup process worked pretty much like this:

Sunday – Start creating clone
Thursday – Clone is finished
Sunday – Delete previous clone and start creating a new one

So there was a chance of everything being restorable on Fridays and Saturdays, but if a failure occurred during the other 5 days of the week while the clone was being created, we would be up a creek with no backup of any kind. As someone who's obsessed with backups and recovery, I found this quite horrifying! Needless to say, improving this process was at the top of my to-do list.

To make a long story of experimentation and optimization short, I ended up leveraging several SQL Server technologies, including data compression, backup compression, and backup striping across multiple drives, to bring the time to create a full backup down to a little over 3 hours. Differential backups are now possible and complete in about 20 minutes. I'm happy to report that backups are now fast enough to be useful, small enough that we have room to store many of them at a time, and easy to test.
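For anyone curious what that looks like in practice, here's a rough sketch of the kinds of commands involved. The table name, database name, drive letters, and file names are all hypothetical, and the right number of stripes and compression settings depend entirely on your own hardware and data:

```sql
-- Page compression on a large table shrinks the data the backup
-- has to read in the first place (table name is hypothetical).
ALTER TABLE dbo.BigFactTable REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Full backup striped across multiple drives, with backup compression.
BACKUP DATABASE [BigProductionDB]
TO DISK = N'F:\Backups\BigProductionDB_Full_1.bak',
   DISK = N'G:\Backups\BigProductionDB_Full_2.bak',
   DISK = N'H:\Backups\BigProductionDB_Full_3.bak',
   DISK = N'I:\Backups\BigProductionDB_Full_4.bak'
WITH COMPRESSION, CHECKSUM, STATS = 5;

-- Differential backup: only extents changed since the last full backup.
BACKUP DATABASE [BigProductionDB]
TO DISK = N'F:\Backups\BigProductionDB_Diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, STATS = 5;
```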

So that's my horror story – I'm glad to say it has a happy ending! Thanks to Nick for the great topic!