Bigger Databases Aren’t Always Better

I'm fortunate to be able to work with some rather large SQL Server instances on a daily basis, including databases over 25TB in size. Due to their bulk, "VLDBs" (Very Large Databases) present some challenges for DBAs: following best practices for small transactional systems might not be feasible on a VLDB. Regular maintenance processes such as backups and integrity checks take longer when there's more data to scan. You also really need to think twice before doing something like adding an index to a 2TB table! I've learned a ton about working with VLDBs in the time I've been at my current job, and I hope to share a lot of that knowledge here in the coming months.

Let's say you have a database that receives daily imports from your company's accounting system so users can query it and create reports. This setup becomes popular, and management asks you to add weekly loads from the payroll software. You begin importing payroll into the same database and all is good until you get a request to add hourly updates from the inventory application. These requests continue, and after a few additional systems you now have a really large database on your hands. VLDBs are great when they have a purpose and fulfill a need. If a database is huge only because more and more unrelated things kept being added to it, however, that purpose starts going the way of the dodo.

Big
Bigger than necessary

So we now have a database that contains data from multiple systems that are being updated at different frequencies. This is where things have the potential to get hairy. Records in the database's copy of the accounting system, which is loaded daily, may not have corresponding entries in its copy of the payroll system, which is only loaded weekly. Even if all the systems were loaded into the database at the exact same rate there is no guarantee that all the corresponding records would be available at the same time across systems. Backups are also starting to take a considerable amount of time due to the volume of data they contain. At this point it's probably worth taking a step back and looking at how the system is designed. A single database may be meeting our needs, but when manageability starts to become difficult we might be better served by making some changes.

SQL Server implements many features at the database level, however the core functionalities of a database in my opinion are logical consistency and recovery. Logical consistency (the "C" in ACID) guarantees that the database will always be in a consistent state. This state is enforced through transactions made possible by the database's transaction log. In our scenario the database is doing all this work to ensure everything is always in a consistent state, however there's no consistency between the different systems being stored within the database.

Backups, restores and recovery also take place in terms of the entire database. You can recover to any point in time (assuming the database is using the full recovery model), but you're recovering the entire database to that point in time. Restoring our example database may bring us to a point in time where the payroll system was fully loaded, however the accounting system's load was only halfway complete. There's no way to recover only part of the database further ahead in time than the rest.

In this situation since the separate systems within the database will never be consistent with each other I would lean towards splitting this database up into multiple smaller databases, one for each system. There are many pros and cons to this – here are some of the larger points:

Pros:

  • Individual databases are smaller and more manageable than a single large database. Maintenance tasks such as integrity checks will take less time due to the smaller size.
  • Backups and restores will be smaller and will take less time.
  • Databases can be recovered individually. Should one database need to be taken offline the others will be unaffected. All downtimes are bad, but a partial downtime is often better than having everything unavailable!
  • The systems won't have to share a transaction log when in separate databases. This can result in increased performance when the transaction logs are placed on separate disks.

Cons:

  • You now have more databases to maintain.
  • It's more difficult (but not impossible) to figure out the state of all the different systems at a particular point in time when they're in separate databases.
  • Users will most likely be required to connect to multiple databases.
  • Setting up security will involve a bit more work, especially in the likely situation of having to accomodate joins across databases.

In many of the cases I've seen, splitting things up into multiple databases has been worth the trouble. VLDBs that were getting too large for their own good were transformed into several smaller databases that were much easier to manage. Having to connect to multiple databases proves to be trivial for most users, and things can be further simplified by creating views where appropriate.

Final Thoughts

VLDBs aren't a bad thing. As time progresses it will only become more common to have a lot of closely-related data that warrants a very large database. But when a database is huge for no good reason it may be better off being split up. This is especially true in cases where the database contains data from multiple disparate sources. When faced with bringing a new dataset into your environment it's an excellent time to think about whether this data belongs in its own database or not. While it's easy to keep adding to a giant "one stop shop" database, bigger isn't always going to be better.