Jun 26 2014
 

I’ve been told on two occasions that the absence of the MEDIUMINT datatype is a serious shortcoming of SQL Server. If you’re not familiar with MEDIUMINT, it’s an integer type available in MySQL that consumes 3 Bytes and is capable of storing values ranging from -8,388,608 to 8,388,607 (signed) or from 0 to 16,777,215 (unsigned).
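
For reference, here’s roughly what that looks like on the MySQL side (this is MySQL syntax, not SQL Server, and the table and column names are made up for illustration):

-- MySQL syntax, not SQL Server. 3 Bytes = 24 bits, so 2^24 = 16,777,216 distinct values:
-- -8,388,608 to 8,388,607 signed, or 0 to 16,777,215 unsigned.
CREATE TABLE page_views (
    view_count MEDIUMINT UNSIGNED NOT NULL
);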

While SQL Server has many datatypes, it has nothing that exactly matches MEDIUMINT. In SQL Server, if you wanted to store a value between 32,768 and 2,147,483,647 you would need to use the INT datatype, which takes up 4 Bytes per value. The next smallest integer datatype is SMALLINT, which has a maximum value of 32,767 and needs only 2 Bytes. I’m not sure the lack of a MEDIUMINT datatype is really a shortcoming, but if you find yourself in a situation where it’s necessary to store a significant number of values in the 3 Byte range, I’m going to let you in on a little secret: SQL Server does have 3 Byte integers; they’re just lurking behind the scenes.
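
If you want to see those fixed widths for yourself, DATALENGTH reports how many Bytes a value occupies. A quick sketch (the variable names are just for illustration):

-- Each SQL Server integer type always reports its full fixed width.
DECLARE @tiny  TINYINT  = 5,
        @small SMALLINT = 32767,
        @int   INT      = 40000,
        @big   BIGINT   = 40000;

SELECT DATALENGTH(@tiny)  AS TinyIntBytes,   -- 1
       DATALENGTH(@small) AS SmallIntBytes,  -- 2
       DATALENGTH(@int)   AS IntBytes,       -- 4
       DATALENGTH(@big)   AS BigIntBytes;    -- 8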

While being able to choose a 3 Byte integer may have some value, not having to choose it and reaping the benefits as if you did has even more value. Row Compression can do just that. SQL Server’s integer types (TINYINT, SMALLINT, INT, and BIGINT) are fixed-width, meaning they all consume the same amount of storage space whether or not the value they are storing warrants it. For instance, if a column (perhaps a foreign key reference) is only storing the values 1 through 5, only one Byte is necessary to store those values. TINYINT would be the most appropriate type in this case, as it needs only a single Byte. Were that column defined as an INT, it would consume 4 Bytes, three of which would not be necessary. Enabling Row Compression changes all this, because it can free up the space that isn’t necessary for storing a value in a particular row.
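
To make that concrete, here’s a hypothetical table (the names are invented for this example) where the column only ever holds 1 through 5; declaring it as TINYINT saves 3 Bytes per row up front, no compression required:

CREATE TABLE dbo.OrderStatusDemo (
    OrderID  INT     NOT NULL,
    StatusID TINYINT NOT NULL  -- values 1 through 5: 1 Byte instead of an INT's 4
);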

Row compression lets the storage engine vary the storage format of fixed-width types, effectively treating them as if they were variable-width. This means that if an integer column has a row storing the value 5, only one Byte will be used. Similarly, if the next row stores the value 500, only two Bytes are necessary for that row. But what if an application is storing the value 40,000? The SMALLINT datatype (2 Bytes) can only store values up to 32,767, so a third Byte would be necessary to get to 40,000. Can SQL Server store an integer as 3 Bytes internally, or can it only go up to 4 Bytes, since the next largest datatype available to users is INT? Fortunately, we can experiment and find out.

To begin, let’s create a database

CREATE DATABASE [RowCompressionTest];
GO
USE [RowCompressionTest]
GO

and add a table

CREATE TABLE dbo.NumberData (
Value INT,
CONSTRAINT PK_NumberData PRIMARY KEY (Value)
);

and populate it with some values

INSERT INTO dbo.NumberData (Value) VALUES (0), (5), (500), (5000), (40000);

Now let’s look at how those values are stored. To view this information, trace flag 3604 must be enabled. This trace flag redirects DBCC output to the client instead of the error log, so we can see it.

DBCC TRACEON (3604);
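
If you’d like to confirm the trace flag is actually on for your session before continuing, DBCC TRACESTATUS will tell you:

DBCC TRACESTATUS (3604);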

From here, two undocumented and unsupported commands are needed to find what we’re looking for. DBCC IND and DBCC PAGE are both completely safe and are used internally by the SQL Server team at Microsoft; however, please exercise caution should you decide to use them on a production system. Paul Randal has written great blog posts explaining both DBCC IND and DBCC PAGE in detail, so I won’t duplicate the effort here.

First, use DBCC IND to find the first (and, in this case, the only) data page used by the dbo.NumberData table. The file number is in the PageFID column, the page number is in the PagePID column, and you want the row where PageType = 1. In my case this is page 288, but your page number will probably differ.

DBCC IND ('RowCompressionTest','dbo.NumberData',1);
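
As an aside, if you’re on SQL Server 2012 or later, the (also undocumented) sys.dm_db_database_page_allocations function can return similar information. Something along these lines should list the table’s data pages, but treat it as a sketch rather than gospel:

-- Undocumented DMF (SQL Server 2012+); lists pages allocated to the table.
SELECT allocated_page_file_id AS PageFID,
       allocated_page_page_id AS PagePID
FROM sys.dm_db_database_page_allocations(DB_ID('RowCompressionTest'),
       OBJECT_ID('dbo.NumberData'), NULL, NULL, 'DETAILED')
WHERE page_type_desc = 'DATA_PAGE';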

Next you will use DBCC PAGE to examine the contents of the data page you just identified. To do this, substitute your file number and page number in the query below. Mine are 1 and 288, respectively.

DBCC PAGE ('RowCompressionTest', <File_ID>, <Page_ID>, 3);

There’s a lot of information returned here, and again please refer to Paul Randal’s posts if you are curious about what it all means. For the purpose of this demonstration, scroll down to where it says:

Slot 1 Column 1 Offset 0x4 Length 4 Length (physical) 4

Value = 5

This is the first column of the first row (slot). The line below it shows that the value is 5, which is exactly what we inserted. What’s of interest here is that it says the Length is 4, and the physical length is also 4. The length is in Bytes, and since this column was created with the INT type, the length of 4 is indeed correct. That value 5 is using 4 Bytes of storage, even though it could just as easily fit into a single Byte.

Now scroll down to:

Slot 4 Column 1 Offset 0x4 Length 4 Length (physical) 4

Value = 40000

The value 40,000 also has a (logical) length of 4 and a physical length of 4. Even though this value could be stored in 3 Bytes, choosing an INT means it’s taking up 4 Bytes whether we’d like it to or not.
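
Before we flip the switch, one side note: if you’re wondering whether row compression is worth it for a real table, the documented (and fully supported) sp_estimate_data_compression_savings procedure will sample the data and estimate the size difference for you:

-- Estimates current vs. row-compressed size for the table.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'NumberData',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'ROW';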

Now let’s see what happens when Row Compression enters the game. Run the statement below to enable it.

ALTER INDEX PK_NumberData ON dbo.NumberData REBUILD WITH (DATA_COMPRESSION = ROW);
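
If you want to double-check that the rebuild really did apply row compression, sys.partitions exposes the compression setting:

SELECT partition_number, data_compression_desc  -- should now say ROW
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.NumberData');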

Because the index was rebuilt, all the page numbers change. You’ll need to run DBCC IND again to get the new file and page numbers, and plug them into DBCC PAGE.

DBCC IND ('RowCompressionTest','dbo.NumberData',1);
DBCC PAGE ('RowCompressionTest', <File_ID>, <Page_ID>, 3);

Scroll down to Slot 1 Column 1 again and you’ll see something like this:

Slot 1 Column 1 Offset 0x3 Length 4 Length (physical) 1

Value = 5

Almost identical to what we saw before, but now the physical length is only 1 instead of 4. This means that with row compression enabled, the value 5 is stored physically (on disk) using a single Byte instead of 4. But what about that 40,000 value?

Slot 4 Column 1 Offset 0x3 Length 4 Length (physical) 3

Value = 40000

The value 40,000 consumes 3 Bytes instead of 4 when row compression is enabled, just like the MEDIUMINT datatype. So if anyone ever gives you a hard time for SQL Server not having a 3-Byte integer type, be sure to tell them that a) they really need to get a life, and b) SQL Server can do it very intelligently behind the scenes!

Mar 11 2014
 

This post is part of T-SQL Tuesday #52, which is being hosted this month by Michael J. Swart (@MJSwart). Michael is asking us to argue against a popular opinion, and I’m more than happy to do so, as this is a belief that I’ve kept to myself for quite a while.

SQL Server’s row compression feature can be an amazing tool. Not only is it lightweight on CPU usage, especially when compared to page compression, but it can save a significant amount of disk space as well. Your data also remains compressed while in the buffer pool, meaning more rows can be stored in memory, reducing the need to make slower requests to disk. On top of all that, some queries (especially those involving index scans) can see dramatic performance improvements.

In fact, row compression is so good that Microsoft’s whitepaper actually states “If row compression results in space savings and the system can accommodate a 10 percent increase in CPU usage, all data should be row-compressed.”

Yes, row compression is a wonderful thing, and the databases I maintain frequently benefit from its use.

But I hate it.

Why? Because all too often, features designed to help make things easier also make people lazy.

By far, the biggest issue that row compression addresses is poor data typing, the use of a data type that isn’t appropriate for the values at hand. For example, if a column is only going to store the values 1 through 5, an integer data type is not necessary. A tinyint data type would be just as effective, and would consume only one quarter of the space. However, if you are unable to change the data type, perhaps because the database is part of an application written by a third party, row compression can be a big help.
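
And for what it’s worth, when you do control the schema, fixing the data type directly is usually a one-line change (the table and column names below are made up for illustration, and you’d want to verify the existing values fit first):

-- Shrink an over-sized INT column to TINYINT once you know every value fits in 0-255.
ALTER TABLE dbo.Orders
    ALTER COLUMN StatusID TINYINT NOT NULL;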

Row compression allows the storage engine to treat fixed-width data types as if they were variable-width. This means the disk space that isn’t needed to store a value in a fixed-width column and would typically go unused can be put to work. The savings can be tremendous, and SQL Server’s data compression features are completely transparent to end users or applications – all the magic happens behind the scenes, which is why a third party application would be none the wiser.

But what if you are able to change those data types, and just don’t feel the need to do so anymore? Data compression gives you most of the advantages of proper data typing, but all you have to do to get them is flip a switch – no pesky forethought necessary. And that’s why it’s terrible. Because for every person out there who designs their databases and data types with care, there are many more who aren’t interested, don’t care, or don’t even realize it’s an option. Features like row compression that mask these issues aren’t going to interest anyone in solving them the right way.

So while row compression is a wonderful tool and can do amazing things when used properly, don’t forget it’s also an enabler.