Feb 26, 2015
 

Whether you are working in T-SQL, Oracle, MySQL, C#, or Java, the range of possible values for a signed (positive or negative) 32-bit integer is from -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647). The fact that it’s so consistent across so many different platforms (and plenty of others I didn’t list) means there has to be more to it than just the preference of some developers somewhere, right? Exactly right.

Down to the bits

Any data your computer deals with, be it numbers, text, music or videos, all end up in binary at one point or another. Binary means values are in “base 2”, where each digit represents a power of 2, and the possible values for that digit are 0 or 1. A digit capable of storing only the values 0 and 1 is typically referred to as a bit, which is short for “binary digit”. This is in stark contrast to the decimal or “base 10” number system commonly used, where each digit represents a power of 10 and the possible values for that digit are 0 to 9. To show how values are calculated in different number systems, here is the value 37 in both binary and decimal:

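100101 (binary)  = (1 × 32) + (0 × 16) + (0 × 8) + (1 × 4) + (0 × 2) + (1 × 1) = 37
    37 (decimal) = (3 × 10) + (7 × 1) = 37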

I realize that’s an extremely brief example, but if you want to learn more about binary numbers, there are tons of great resources online.

But what about negative values? Negative numbers can be represented in a bunch of different ways. Sometimes it’s done with a minus sign

-150

or with parentheses

(150)

but those are all formats, or different ways of expressing a value. A format can be changed without altering the value itself.

How do you really represent that a value is negative? Since binary consists only of bits storing the values 0 and 1, what if the first bit of a binary number served not as part of the value, but as a flag showing whether it was positive or negative? A “1” in the left-most bit would effectively mean the value is negative, and a “0” would mean it is positive. Giving up a bit to store the sign would effectively cut the largest value you could store in half (since you are losing a power of two), but it would be awfully handy. There’s a little bit more to this method than that, but it’s known as…

Two’s Complement

Two’s Complement is a great way to express signed numbers in binary, for reasons we will see shortly.

Expressing a negative value via two’s complement is a rather simple process. Start with the binary expression of the positive value you want to negate. Next, flip all the bits so all the 0’s become 1’s, and all the 1’s become 0’s. If we were to stop here, this would be the one’s complement of the value. To get the two’s complement, we add 1 to the one’s complement. Here’s an example for the number 19:

010011  (19, unsigned, with leading 0 added)
101100  (bits flipped)
101101  (add 1. This is -19 in two's complement)

We now have the two’s complement of 19, which has the value -19.
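If you’d like to verify this at full 32-bit width, one quick way (in SQL Server, at least) is to cast the values to BINARY and compare the raw bytes. The values in the comments are what I’d expect back:

-- CAST from INT to BINARY(4) preserves the underlying bytes,
-- exposing the two's complement bit pattern.
SELECT CAST(19  AS BINARY(4)) AS Positive19,   -- 0x00000013
       CAST(-19 AS BINARY(4)) AS Negative19;   -- 0xFFFFFFED (flip 0x00000013, then add 1)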

Why did we have to add one? Let’s say we’re in a five-bit environment and want to find the two’s complement of zero (which would be negative zero). The number zero is represented as 00000 in this case. If we flip each bit, we now have 11111, and by adding one to that we now need a sixth bit so we can arrive at 100000. Since we’re still working in a five-bit environment, we ignore the left-most bit, bringing us back to 00000, which represents both zero and negative zero. Having a single value for positive and negative zero is a key advantage that two’s complement has over one’s complement.
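Written out the same way as before:

00000   (zero, in five bits)
11111   (bits flipped)
100000  (add 1, which needs a sixth bit)
00000   (left-most bit ignored, back to zero)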

An even larger advantage of two’s complement is that addition, subtraction, and multiplication work exactly the same as if all values were positive. You can even use the carry method to do this (like you were probably taught in school, but your kids most likely won’t learn thanks to Common Core in the US). As a demonstration, let’s add 24 to the -19 value we just computed. The result should be 5.

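Working in 6 bits:

  101101   (-19 in two's complement)
+ 011000   (24)
--------
 1000101   (raw result, 7 bits)
  000101   (truncated to 6 bits = 5)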

And indeed it is! If you’re paying close attention you will notice that I truncated the result to 6 bits and ignored the left-most carry. Restricting the result to the same number of bits as the values you are adding is a requirement for arriving at the correct answer.

This is really neat, but it still doesn’t answer our original question about the possible range of values. Let’s look at the values you can create with 3 bits.

Three bits allow 8 or 2^3 values to be stored, and in the case of unsigned numbers, those values are from 0 to 7.  The two’s complement values of those same bit arrangements gives a range of -4 to 3.
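Listing out all eight bit patterns makes the split easy to see:

Bits   Unsigned   Two's complement
000        0          0
001        1          1
010        2          2
011        3          3
100        4         -4
101        5         -3
110        6         -2
111        7         -1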

Thinking of it in powers of 2, three bits in two’s complement allows us to store values from -(2^2) to (2^2)-1. Both exponents are 2 instead of 3 because one bit is being used to determine whether the value is positive or negative. If we write that exponent as one less than the number of bits (3), the value range looks like:

-(2^(3-1)) to 2^(3-1)-1

So for n bits, two’s complement lets you express values ranging from -(2^(n-1)) to 2^(n-1)-1. This is exactly why for a 32-bit integer, the range is -(2^31) to 2^31-1.
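Plugging in other common sizes gives the familiar ranges of SMALLINT and BIGINT as well:

16 bits: -(2^15) to 2^15-1, or -32,768 to 32,767
64 bits: -(2^63) to 2^63-1, or -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807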

And there you have it. Two’s complement is the reason why practically every signed integer data type has the range that it does.

 

Jun 26, 2014
 

I’ve been told on two occasions that the absence of the MEDIUMINT datatype is a serious shortcoming of SQL Server. If you’re not familiar with MEDIUMINT, it’s an integer type available in MySQL that consumes 3 Bytes and is capable of storing values ranging from -8,388,608 to 8,388,607 (signed) or from 0 to 16,777,215 (unsigned).

While SQL Server has many datatypes, it has nothing that exactly matches MEDIUMINT. In SQL Server, if you wanted to store a value between 32,768 and 2,147,483,647 you would need to use the INT datatype, which takes up 4 Bytes per value. The next smallest integer datatype is SMALLINT, which has a maximum value of 32,767 and only needs 2 Bytes. I’m not sure the lack of a MEDIUMINT datatype is really a shortcoming, but if you find yourself in a situation where it’s necessary to store a significant number of values in the 3 Byte range, I’m going to let you in on a little secret: SQL Server does have 3 Byte integers; they’re just lurking behind the scenes.

While being able to choose a 3 Byte integer may have some value, not having to choose it and reaping the benefits as if you did has even more value. Row Compression can do just that. SQL Server’s integer types (TINYINT, SMALLINT, INT, and BIGINT) are fixed-width, meaning they all consume the same amount of storage space whether or not the value they are storing warrants it. For instance, if a column (perhaps a foreign key reference) is only storing the values 1 through 5, only one Byte is necessary to store those values. TINYINT would be the most appropriate choice in this case, as it consumes only a single Byte. Were that column defined as an INT, it would consume 4 Bytes, three of which would not be necessary. Enabling Row Compression changes all this, because it can free up space that’s not necessary for storing a value in a particular row.
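Before turning compression on, you can see those fixed widths for yourself with DATALENGTH, which reports how many Bytes an expression consumes. The counts in the comments are what I’d expect to see:

-- The same value 5 consumes a different number of Bytes
-- depending solely on the declared type.
SELECT DATALENGTH(CAST(5 AS TINYINT))  AS TinyIntBytes,   -- 1
       DATALENGTH(CAST(5 AS SMALLINT)) AS SmallIntBytes,  -- 2
       DATALENGTH(CAST(5 AS INT))      AS IntBytes,       -- 4
       DATALENGTH(CAST(5 AS BIGINT))   AS BigIntBytes;    -- 8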

Row compression lets the storage engine vary the storage format of fixed-width types, effectively treating them as if they were variable-width. This means that if an integer column has a row storing the value 5, only one Byte will be used. Similarly if the next row stores the value 500, only two Bytes are necessary for that row. But what if an application is storing the value 40,000? The SMALLINT data type (2 bytes) can only store values up to 32,767, so a third Byte would be necessary to get to 40,000. Can SQL Server store an integer as 3 Bytes internally, or can it only go up to 4 Bytes since the next largest datatype available to users is INT? Fortunately, we can experiment and find out.

To begin, let’s create a database

CREATE DATABASE [RowCompressionTest];
GO
USE [RowCompressionTest]
GO

and add a table

CREATE TABLE dbo.NumberData (
    Value INT,
    CONSTRAINT PK_NumberData PRIMARY KEY (Value)
);

and populate it with some values

INSERT INTO dbo.NumberData (Value) VALUES (0), (5), (500), (5000), (40000);

Now let’s look at how those values are stored. To view this information, trace flag 3604 must be enabled. This trace flag redirects the output of certain DBCC commands to the client so we can see it.

DBCC TRACEON (3604);

From here, two undocumented and unsupported commands are needed to find what we’re looking for. DBCC IND and DBCC PAGE are both completely safe and used internally by the SQL Server team at Microsoft; however, please exercise caution should you decide to use them on a production system. Paul Randal has written great blog posts explaining both DBCC IND and DBCC PAGE in detail, so I won’t duplicate the effort here.

First, use DBCC IND to find the first (and, in this case, the only) data page used by the dbo.NumberData table. The file number is in the PageFID column, the page number is in the PagePID column, and you want the row where PageType = 1. In my case this is page 288, but your page number will probably differ.

DBCC IND ('RowCompressionTest','dbo.NumberData',1);

Next, use DBCC PAGE to examine the contents of the data page whose number you just found. To do this, substitute your file number and page number in the query below. Mine are 1 and 288, respectively.

DBCC PAGE ('RowCompressionTest', <File_ID>, <Page_ID>, 3);

There’s a lot of information returned here, and again please refer to Paul Randal’s posts if you are curious about what it all means. For the purpose of this demonstration, scroll down to where it says:

Slot 1 Column 1 Offset 0x4 Length 4 Length (physical) 4

Value = 5

This is the first column of the first row (slot). On the next line down you can see it has printed that the value is 5, which is exactly what we inserted. What’s of interest here is that it says the Length is 4, and the physical length is also 4. The length is in Bytes, and since this column was created with the INT type, the length of 4 is indeed correct. That value of 5 is using 4 Bytes of storage, even though it could just as easily fit into a single Byte.

Now scroll down to:

Slot 4 Column 1 Offset 0x4 Length 4 Length (physical) 4

Value = 40000

The value 40,000 also has a (logical) length of 4 and a physical length of 4. Even though this value could be stored in 3 Bytes, choosing an INT means it’s taking up 4 Bytes whether we’d like it to or not.

Now let’s see what happens when Row Compression enters the game. Run the statement below to enable it.

ALTER INDEX PK_NumberData ON dbo.NumberData REBUILD WITH (DATA_COMPRESSION = ROW);

Because the index was rebuilt, all the page numbers have changed. You’ll need to run DBCC IND again to get the new file and page number, and plug those into DBCC PAGE.

DBCC IND ('RowCompressionTest','dbo.NumberData',1);
DBCC PAGE ('RowCompressionTest', <File_ID>, <Page_ID>, 3);

Scroll down to Slot 1 Column 1 again. The output is almost identical to what we saw before, but now the physical length is only 1 instead of 4. This means that with row compression enabled, the value 5 is stored physically (on disk) using a single Byte instead of 4. But what about that 40,000 value?

Slot 4 Column 1 Offset 0x3 Length 4 Length (physical) 3

Value = 40000

The value 40,000 consumes 3 Bytes instead of 4 when row compression is enabled, just like the MEDIUMINT datatype. So if anyone ever gives you a hard time for SQL Server not having a 3-Byte integer type, be sure to tell them that a) they really need to get a life, and b) SQL Server can do it very intelligently behind the scenes!