Feb 20, 2013
 

Being a DBA and data professional doesn’t mean I always work with SQL Server – sometimes I’m not working with databases at all. We’ve recently acquired some new storage at work (aka Daddy Warbucks bought us a new SAN) and I’ve been charged with moving things to it. Some aspects of this are easier than others and there will be a few more posts coming about that in the future.

For now though, let’s talk about copying files. Lots of files. Server logs, audit logs, things like that. Mostly they were small in size but large in quantity. I ended up with a handful of directories with several thousand files in them that needed to move from SAN A to SAN B. Windows gives us several ways to do this:

  • Copy in Windows Explorer (ewww)
  • The good ol’ DOS Copy command
  • Xcopy, which offers a few more features
  • Robocopy, the most advanced of the MS copy utilities (though I believe it prefers to be called Murphy)

All of these will do a fine job of copying your files, though Robocopy will probably be the fastest due to its multithreading capabilities. But how do you know they all reached their destination intact? Copy and Xcopy offer the option of verification (both using the /v parameter), but sadly Robocopy does not. I’m not sure if verification is just built into Robocopy and can’t be disabled, or if it doesn’t exist at all. Either way, I didn’t want to risk errors in moving all this data, so I decided to go the extra mile and use another tool to make sure. It didn’t take me long to find the Microsoft File Checksum Integrity Verifier (“FCIV” for short), a nifty little unsupported command-line utility that does exactly what I was looking for.
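
For reference, here’s roughly what each of those options looks like at the command line. The paths are made up for illustration – substitute your own source and destination (and note that /mt is only available in the newer Robocopy that ships with Windows 7 / Server 2008 R2 and later):

:: Plain old copy with verification (/v)
copy /v D:\SourceLogs\*.log E:\DestLogs\

:: Xcopy with verification (/v), including subdirectories (/e)
xcopy D:\SourceLogs E:\DestLogs /e /v

:: Robocopy with subdirectories (/e), 8 copy threads (/mt:8) and a log file
robocopy D:\SourceLogs E:\DestLogs /e /mt:8 /log:D:\copylog.txt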

FCIV In A Nutshell

Basically, FCIV calculates MD5 or SHA-1 hash values for files and outputs them either to the screen or to an XML file. It can also compare files to those checksums saved in XML and tell you if anything differs or is missing. A demo is worth a lot of words, so let’s see it in action!

  • Download Microsoft FCIV and extract the executable wherever you like – for this demo I put it in G:\
  • Download the demo files and extract them. I put mine in G:\demofiles
  • Use FCIV to generate checksums of all files in the folder and save to an XML file with the following syntax:

fciv.exe -add G:\demofiles -wp -sha1 -xml G:\hashdb.xml

-wp means we’re saving only the file names in the XML file, not their full path
-sha1 specifies to calculate a SHA-1 hash on each file. The default is MD5.
-xml means output the checksums to an XML file, in this case the G:\hashdb.xml that follows it.
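
FCIV has a few other switches worth knowing about: -r recurses into subdirectories, -type filters by extension, -both calculates MD5 and SHA-1 together, and -list prints the contents of an existing XML database back to the screen. Roughly speaking (reusing the demo paths, and assuming a fileA.txt exists in the folder):

:: Hash only the .txt files, including any subdirectories
fciv.exe -add G:\demofiles -r -type *.txt -wp -sha1 -xml G:\hashdb.xml

:: Show both hashes for a single file on the screen
fciv.exe G:\demofiles\fileA.txt -both

:: Display what's already stored in the XML database
fciv.exe -list -sha1 -xml G:\hashdb.xml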

[Screenshot: FCIV generating the checksum XML file]

Let’s open up that XML file and see what it contains:

[Screenshot: contents of hashdb.xml]

As you can see it’s very simple, just the file names and a checksum for each. Now let’s make a few changes.

  • Change the name of the directory the files are in. I changed mine from “demofiles” to “demofiles2”.
  • Delete fileE.txt
  • In fileD.txt, delete the line that says “***DELETE THIS LINE***”

Now let’s use FCIV to verify our files against the checksums we captured in the XML file. Change the current directory to demofiles2 (since we saved only file names, FCIV looks for them in the current directory – it won’t work unless you do this) and then run

G:\fciv.exe -v -sha1 -xml G:\hashdb.xml

-v means we’re now in verification mode, so it will verify checksums in the current directory against those in the XML file
-sha1 again specifies we’re using the SHA-1 hash
-xml is the file we’re comparing our calculated checksums against

Here’s the output it produces:

[Screenshot: FCIV verification output]

As you can see, FCIV is telling us that the contents of fileD have changed and fileE is missing. It’s really that easy!
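
To bring it back to the original problem, the whole copy-and-verify routine can be strung together in a small batch file. This is only a rough sketch – the SAN paths are placeholders and it assumes a single flat folder of files like the demo:

:: 1. Record SHA-1 checksums of the source files (file names only)
G:\fciv.exe -add \\SAN-A\Logs -wp -sha1 -xml G:\hashdb.xml

:: 2. Copy everything to the new SAN with Robocopy
robocopy \\SAN-A\Logs \\SAN-B\Logs /mt:8 /log:G:\copylog.txt

:: 3. Verify from the destination (FCIV checks the current directory, hence the pushd)
pushd \\SAN-B\Logs
G:\fciv.exe -v -sha1 -xml G:\hashdb.xml
popd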

Final Thoughts

I think FCIV is a great utility to keep in your toolbox. Some people may argue that checksum verification isn’t necessary – that Windows does it for you behind the scenes. That may be entirely true, but I wasn’t able to find any concrete documentation proving that it does. The 10 minutes I spent finding this program online and figuring it out were a very small price to pay for the extra peace of mind of knowing that thousands of files made it to their destination intact.

Others may raise the point that MD5 and SHA-1 both suffer from collision vulnerabilities and that there are better alternatives out there that this application doesn’t support. They’re totally correct, but it’s also important to remember that we’re using these checksums to detect changes, not for cryptography or protecting secrets. Any form of verification is better than none, and for my purposes FCIV has proven to be very helpful.

Feb 12, 2013
 

I’m fortunate to be able to work with some rather large SQL Server instances on a daily basis, including databases over 25TB in size. Due to their bulk, “VLDBs” (Very Large Databases) present some challenges for DBAs: following best practices for small transactional systems might not be feasible on a VLDB. Regular maintenance processes such as backups and integrity checks take longer when there’s more data to scan. You also really need to think twice before doing something like adding an index to a 2TB table! I’ve learned a ton about working with VLDBs in the time I’ve been at my current job, and I hope to share a lot of that knowledge here in the coming months.

Let’s say you have a database that receives daily imports from your company’s accounting system so users can query it and create reports. This setup becomes popular, and management asks you to add weekly loads from the payroll software. You begin importing payroll into the same database and all is good until you get a request to add hourly updates from the inventory application. These requests continue, and after a few additional systems you now have a really large database on your hands. VLDBs are great when they have a purpose and fulfill a need. If a database is huge only because more and more unrelated things kept being added to it, however, that purpose starts going the way of the dodo.

[Image: Bigger than necessary]

So we now have a database that contains data from multiple systems that are being updated at different frequencies. This is where things have the potential to get hairy. Records in the database’s copy of the accounting system, which is loaded daily, may not have corresponding entries in its copy of the payroll system, which is only loaded weekly. Even if all the systems were loaded into the database at the exact same rate, there is no guarantee that all the corresponding records would be available at the same time across systems. Backups are also starting to take a considerable amount of time due to the volume of data they contain. At this point it’s probably worth taking a step back and looking at how the system is designed. A single database may be meeting our needs, but when manageability starts to become difficult we might be better served by making some changes.

SQL Server implements many features at the database level, but in my opinion the core functions of a database are logical consistency and recovery. Logical consistency (the “C” in ACID) guarantees that the database will always be in a consistent state. This state is enforced through transactions, made possible by the database’s transaction log. In our scenario the database is doing all this work to ensure everything is always in a consistent state, yet there’s no consistency between the different systems being stored within the database.

Backups, restores and recovery also take place in terms of the entire database. You can recover to any point in time (assuming the database is using the full recovery model), but you’re recovering the entire database to that point in time. Restoring our example database may bring us to a point in time where the payroll system was fully loaded, however the accounting system’s load was only halfway complete. There’s no way to recover only part of the database further ahead in time than the rest.
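
To make that concrete, a point-in-time restore of the whole database might look something like the sketch below (run through sqlcmd; the database name, paths and timestamp are all hypothetical). Whatever time you pick in STOPAT, every system inside that one database comes back to that same moment:

:: Restore the full backup, leaving the database ready for log restores
sqlcmd -S MYSERVER -E -Q "RESTORE DATABASE Reporting FROM DISK = 'X:\Backups\Reporting_full.bak' WITH NORECOVERY, REPLACE"

:: Roll the transaction log forward, stopping at a specific point in time
sqlcmd -S MYSERVER -E -Q "RESTORE LOG Reporting FROM DISK = 'X:\Backups\Reporting_log.trn' WITH STOPAT = '2013-02-12 06:30:00', RECOVERY"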

In this situation, since the separate systems within the database will never be consistent with each other, I would lean towards splitting this database up into multiple smaller databases, one for each system. There are many pros and cons to this – here are some of the larger points:

Pros:

  • Individual databases are smaller and more manageable than a single large database. Maintenance tasks such as integrity checks will take less time due to the smaller size.
  • Backups and restores will be smaller and will take less time (see the sketch after this list).
  • Databases can be recovered individually. Should one database need to be taken offline the others will be unaffected. All downtimes are bad, but a partial downtime is often better than having everything unavailable!
  • The systems won’t have to share a transaction log when in separate databases. This can result in increased performance when the transaction logs are placed on separate disks.
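
For instance, the nightly backups could end up looking something like this hypothetical sketch (database names and paths are made up, and WITH COMPRESSION assumes an edition that supports it) – each backup runs on its own schedule and can be restored on its own:

:: Each system's database gets its own backup, independent of the others
sqlcmd -S MYSERVER -E -Q "BACKUP DATABASE Accounting TO DISK = 'X:\Backups\Accounting.bak' WITH INIT, COMPRESSION"
sqlcmd -S MYSERVER -E -Q "BACKUP DATABASE Payroll TO DISK = 'X:\Backups\Payroll.bak' WITH INIT, COMPRESSION"
sqlcmd -S MYSERVER -E -Q "BACKUP DATABASE Inventory TO DISK = 'X:\Backups\Inventory.bak' WITH INIT, COMPRESSION"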

Cons:

  • You now have more databases to maintain.
  • It’s more difficult (but not impossible) to figure out the state of all the different systems at a particular point in time when they’re in separate databases.
  • Users will most likely be required to connect to multiple databases.
  • Setting up security will involve a bit more work, especially in the likely situation of having to accommodate joins across databases.

In many of the cases I’ve seen, splitting things up into multiple databases has been worth the trouble. VLDBs that were getting too large for their own good were transformed into several smaller databases that were much easier to manage. Having to connect to multiple databases proves to be trivial for most users, and things can be further simplified by creating views where appropriate.
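
As a rough example of what such a view might look like (every server, database, table and column name here is made up), a single view in one database can hide the fact that the data now lives in two:

:: A cross-database view so users can query accounting and payroll data together
sqlcmd -S MYSERVER -E -d Accounting -Q "CREATE VIEW dbo.vEmployeePay AS SELECT e.EmployeeID, e.EmployeeName, p.PayDate, p.GrossPay FROM dbo.Employee AS e JOIN Payroll.dbo.Paycheck AS p ON p.EmployeeID = e.EmployeeID;"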

Final Thoughts

VLDBs aren’t a bad thing. As time progresses it will only become more common to have a lot of closely-related data that warrants a very large database. But when a database is huge for no good reason it may be better off being split up. This is especially true in cases where the database contains data from multiple disparate sources. When faced with bringing a new dataset into your environment it’s an excellent time to think about whether this data belongs in its own database or not. While it’s easy to keep adding to a giant “one stop shop” database, bigger isn’t always going to be better.