Three-Byte Integers in SQL Server: Fact or Fiction?

I've been told on two occasions that the absence of the MEDIUMINT datatype is a serious shortcoming of SQL Server. If you're not familiar with MEDIUMINT, it's an integer type available in MySQL that consumes 3 Bytes and is capable of storing values ranging from -8,388,608 to 8,388,607 (signed) or from 0 to 15,777,215 (unsigned).

While SQL Server has many datatypes, it has nothing that exactly matches MEDIUMINT. In SQL Server, if you wanted to store a value between 32,768 and 2,147,483,647 you would need to use the INT datatype, each of which takes up 4 Bytes. The next smallest integer datatype is SMALLINT, which has a maximum value of 32,767 and only needs 2 Bytes. I'm not sure the lack of a MEDIUMINT datatype is really a shortcoming, but if you find yourself in a situation where it's necessary to store a significant number of values in the 3 Byte range, I'm going to let you in on a little secret: SQL Server does have 3 Byte integers, they're just lurking behind the scenes.

While being able to choose a 3 Byte integer may have some value, not having to choose it and reaping the benefits as if you did has even more value. Row Compression can do just that. SQL Server's integer types (TINYINT, SMALLINT, INT, and BIGINT) are fixed-width, meaning they all consume the same amount of storage space whether or not the value they are storing warrants it. For instance, if a column (perhaps a foreign key reference) is only storing the values 1 through 5, only one Byte is necessary to store those values. TINYINT would be the most appropriate in this case, as it only stores a single Byte. Were that column defined as an INT, it would consume 4 Bytes, three of which would not be necessary. Enabling Row Compression changes all this, because it can free up space that's not necessary for storing a value in a particular row.

Row compression lets the storage engine vary the storage format of fixed-width types, effectively treating them as if they were variable-width. This means that if an integer column has a row storing the value 5, only one Byte will be used. Similarly if the next row stores the value 500, only two Bytes are necessary for that row. But what if an application is storing the value 40,000? The SMALLINT data type (2 bytes) can only store values up to 32,767, so a third Byte would be necessary to get to 40,000. Can SQL Server store an integer as 3 Bytes internally, or can it only go up to 4 Bytes since the next largest datatype available to users is INT? Fortunately, we can experiment and find out.

To begin, let's create a database

1CREATE DATABASE [RowCompressionTest];
2GO
3USE [RowCompressionTest]
4GO

and add a table

1CREATE TABLE dbo.NumberData (
2Value INT,
3CONSTRAINT PK_NumberData PRIMARY KEY (Value)
4);

and populate it with some values

1INSERT INTO dbo.NumberData (Value) VALUES (0), (5), (500), (5000), (40000);

Now let's look at how those values are stored. To view this information, trace flag 3604 must be enabled. This trace flag redirects some output values to the client so we can see them.

1DBCC TRACEON (3604);

From here, 2 undocumented and unsupported commands are needed to find what we're looking for. DBCC IND and DBCC PAGE are both completely safe and used internally by the SQL Server team at Microsoft, however please exercise caution should you decide to use them on a production system. Paul Randal has written great blog posts explaining both DBCC IND and DBCC PAGE in detail, so I won't duplicate the effort here.

First, use DBCC IND to find the first (and, in this case, the only) data page used by the dbo.NumberData table. The file number is in the PageFID column, the page number is in the PagePID column, and you want the row where PageType = 1. In my case this is page 288, but your page number will probably differ.

1DBCC IND ('RowCompressionTest','dbo.NumberData',1);

Next you will use DBCC PAGE to examine the contents of the data page you just found the number of. To do this, substitute your file number and page number in the query below. Mine are 1 and 288, respectively.

1DBCC PAGE ('RowCompressionTest',<File_ID>,<Page_ID>,3)

There's a lot of information returned here, and again please refer to Paul Randal's posts if you are curious about what it all means. For the purpose of this demonstration, scroll down to where it says:

1Slot 1 Column 1 Offset 0x4 Length 4 Length (physical) 4
2Value = 5

This is the first column of the first row (slot). The next line down you see it has printed that the value is 5, which is exactly what we inserted. What's of interest here is that it says the Length is 4, and the physical length is also 4. The length is in Bytes, and since this column was created with the INT type, the length of 4 is indeed correct. That value 5 is using 4 Bytes of storage, even though it could just as easy fit into a single Byte.

Now scroll down to:

1Slot 4 Column 1 Offset 0x4 Length 4 Length (physical) 4
2Value = 40000

The value 40,000 also has a (logical) length of 4 and a physical length of 4. Even though this value could be stored in 3 Bytes, choosing an INT means it's taking up 4 Bytes whether we'd like it to or not.

Now let's see what happens when Row Compression enters the game. Run the statement below to enable it.

1ALTER INDEX PK_NumberData ON dbo.NumberData REBUILD WITH (Data_Compression = ROW);

Because the index was rebuilt, all the page numbers change. Now you'll need to run DBCC IND again to get the file and page number, and plug that into DBCC PAGE.

1DBCC IND ('RowCompressionTest','dbo.NumberData',1);
1DBCC PAGE ('RowCompressionTest',**File_ID**,**Page_ID**,3)

Scroll down to Slot 1 Column 1 again and you'll see something like this:

Almost identical to what we saw before, but now the physical length is only 1 instead of 4. This means that with row compression enabled, the value 5 is stored physically (on disk) using a single Byte instead of 4. But what about that 40,000 value?

1Slot 4 Column 1 Offset 0x3 Length 4 Length (physical) 3
2Value = 40000

The value 40,000 consumes 3 Bytes instead of 4 when row compression is enabled, just like the MEDIUMINT datatype. So if anyone ever gives you a hard time for SQL Server not having a 3-Byte integer type, be sure to tell them that a) they really need to get a life, and b) SQL Server can do it very intelligently behind the scenes!