T-SQL Tuesday #19 – Disasters & Recovery

This month's T-SQL Tuesday comes to us courtesy of Allen Kinsel, and the topic is disasters. In the spirit of the beginning of hurricane season, Allen's asking for stories related to disasters, either preparing for them or recovering from them. A recent tale I have isn't quite about preparing for a disaster, but rather about a lack of preparation and the disaster that (inevitably) occurred.

When I started my last job, one of the peculiarities I noticed was that the transaction logs on the main production database were extremely large and taking up nearly all of the available disk space. Digging a little deeper, I found it was because transaction log backups were only being taken once daily. Several times I recommended increasing the frequency of the transaction log backups to reduce the log size. As an added bonus, I could shrink the log and partially re-grow it so it would have far fewer than its current 25,000 VLFs.
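For anyone curious, checking the VLF count and taking more frequent log backups are both trivial. A minimal sketch, assuming a database named ProductionDB and a made-up backup path:

-- Check how many VLFs the log currently has (DBCC LOGINFO returns one row per VLF)
USE ProductionDB;
DBCC LOGINFO;

-- The proposed fix: back up the log far more often than once a day,
-- e.g. from an Agent job scheduled every 15-30 minutes
BACKUP LOG ProductionDB
    TO DISK = N'X:\Backups\ProductionDB_log.trn';   -- placeholder path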

Each time I recommended doing this I was told not to, as backing up the log more frequently "would cause blocking". I tried my best to reason and explain that backups don't cause blocking, and even set up a demonstration (sketched below) showing that backups don't block and that the I/O caused by the backup had no observable effect on the system. Still, no dice. Someone had also read that shrinking files was bad, hence it was 100% forbidden. I agree that shrinking files is not something to be done regularly or even often, but it is a tool that serves a purpose and should be used responsibly. I was also told they'd never had any issues taking daily log backups, so why bother changing it? As for the "we've always done things a certain way and have never had a problem" argument, that's one of the stupidest in the book. There's a first time for everything!
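The demonstration was nothing fancy (a minimal sketch; the table and database names here are made up): leave a data modification uncommitted in one session, take a log backup in another, and watch the backup complete without ever waiting on the open transaction.

-- Session 1: open a transaction and leave it uncommitted
BEGIN TRANSACTION;
UPDATE dbo.SomeTable          -- made-up table
SET SomeColumn = SomeColumn
WHERE SomeKey = 1;
-- (don't commit yet)

-- Session 2: the log backup runs and completes anyway;
-- it neither blocks the open transaction nor waits for it
BACKUP LOG ProductionDB
    TO DISK = N'X:\Backups\ProductionDB_log_demo.trn';   -- placeholder path

-- Session 1: clean up
ROLLBACK TRANSACTION;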

Shortly after I left the company, a friend and former co-worker emailed to let me know that my prophecy had indeed come true. Replication got behind on a machine (I don't know all the particulars of why), and the transaction log on the production DB began growing uncontrollably until disk space became an issue. My friend said he advised them that replication would need to be broken in order to shrink the log, but he was told he was "digging too deeply" for a solution, and that bouncing the server would help. They later bounced it, which, as predicted, did nothing. In the end they had to take down the system and break replication in order to back up the log and shrink it to around 30 VLFs. I have no idea whether log backups are being taken more frequently now or not.
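For what it's worth, once the log has been backed up and nothing (like replication) is keeping it active, the shrink and re-grow itself is only a couple of statements. A rough sketch with placeholder file names and sizes:

USE ProductionDB;

-- Shrink the bloated log file down (logical file name and target size in MB are placeholders)
DBCC SHRINKFILE (ProductionDB_log, 1024);

-- Grow it back out in one or two large increments so it ends up with a small
-- number of reasonably sized VLFs instead of tens of thousands
ALTER DATABASE ProductionDB
    MODIFY FILE (NAME = ProductionDB_log, SIZE = 8192MB);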

To me, the moral of this story is that the old adage of "if it's not broken, don't fix it" doesn't always apply. A simple solution that breaks slightly from what's been accepted as correct isn't necessarily wrong, and in this case it could very well have prevented some downtime.