The 1940 Census: A True NoSQL Database!

On April 2 of this year, the National Archives released the complete population schedule of the 1940 census. These records were highly anticipated not only for their genealogical value but also because of their detailed information about an incredibly interesting period of U.S. History. This census captured the point in time where the country was finally starting to come out of the great depression but had not yet entered World War II. Many questions it included were new and designed to gauge the effects of the depression, with topics including income, education, unemployment, and migration. In 1940, millions were employed by the WPA, PWA, and other New Deal agencies, and the Farm Security Administration’s photography program had a small group of photographers traversing the nation capturing images of everyday American life. Some of my favorite photos come from this collection.

Population Schedule Form

Population Schedule Form (Click to Enlarge)

As interesting as the Census is for all its historical and social reasons, there’s an equally awesome tale to be told of all its data and the technologies behind it. Setting up a table in SQL Server to store 310 million rows and aggregate results from them would be pretty easy today – many DBAs deal with tables that are orders of magnitude larger than that, but in 1940 it was a major undertaking involving thousands of workers. Today the census is still a non-trivial task, however I’d imagine most of that work goes into getting data from the population into a database, while calculating the results from that is relatively simple.

ETL: Enumerate, Tabulate, Lock Up
Prior to 1960, censuses weren’t mailed to your house like they are now. Instead every household was visited in person by an enumerator, a single person responsible for an “Enumeration District”, or “ED”. EDs varied wildly in size, and could consist of a single block in a large city, or an entire township in a more rural area. The enumerator would stop by and ask questions about each member of your household while writing the answers onto a population schedule form that measured 23.75″ wide by 18.5″ high. Yep, the entire database was on paper. Torn page detection must have been a very serious issue! When the enumerator had information about every last person in their district, they would send their data to the Census Office via log shipping Registered Mail.

Once in Washington, an army of operators transferred each record from the population schedule forms to a punch card. Punch cards had been used to tabulate the census since 1890 and were still the best technology available fifty years later. The 1880 census was tabulated by hand, which took 7 years to complete. Knowing there had to be a better way to calculate results, former census employee Herman Hollerith set out to create a machine that could count results from data stored on punched cards. He won a contract to tabulate the 1890 census, which was completed in only 1 year. By 1900, he had formed the Tabulating Machine Company and greatly increased his fees, knowing he had a monopoly and the Census Office would have no other option than to pay them. By 1910, census employees had developed and patented their own counting machine to avoid using Hollerith’s. The Tabulating Machine Company, which by then had merged and changed names to the Computer Tabulating Recording Company, was nearly bankrupted by the loss of business. They eventually got their act together and were able to turn a profit. In 1924, Computer Tabulating Recording Company changed its name to International Business Machines Corporation.

Tabulating by Machine

Women in (1940) Technology: Tabulating By Machine

After the records were copied to punch cards and tabulated by machine, the aggregated results were released immediately for uses like determining congressional seats and allocation of public funds. Since the population schedules contain information on individuals, they are held for 72 years before being released for research purposes. Rather than keep all 3.9 million pages of records on paper, the Census Office used the most compressed format available at the time, microfilm. Apparently they had not yet discovered the rather obscure and undocumented BACKUP CENSUS TO TAPE=’MICROFILM’ WITH COMPRESSION; command. The records released this year are images scanned from that microfilm.

Indexing
Since all the data consists of images, there’s no easy way to index them. Optical character recognition software is pretty good these days, but probably not good enough to pick out the handwriting in these images – most of which is in cursive. Instead everything was indexed by enumeration district, meaning you need to know where someone was living during April of 1940 before you can search for them. Many genealogy websites are now working on indexing this data by name, but it is not expected to be completed for a few months.

Finding Your Family
If you had relatives in the US in 1940 and know where they lived at that time, I highly recommend looking for them. Everything can be found for free at http://1940census.archives.gov. The first thing you’ll need to do is find which enumeration district they lived in. If you have an address, you are very much in luck. If you only have a general idea of where they were, then you’ll probably have to do a bit more work to find them. The census site lets you drill down by state, county, and city, and provides a list of EDs that apply. If you’re searching in an urban area you might need to use maps and/or descriptions returned by the search to narrow down exactly which ED they were in. If the official site isn’t finding anything for you, I’ve also had luck using Steve Morse’s 1940 ED Finder. Once armed with a list of relevant enumeration districts, you can view or download the population schedules from each district and look for people you recognize. You’ll probably end up looking through all the sheets because the entries on the forms aren’t always in order. My guess is that enumerators would start going down a street, skip houses where nobody was at home and then come back to them later.

I was fortunate enough to find all of my family, and it’s really neat to be able to see a snapshot of their lives at a time when my grandparents were close to my age. It also gave me great appreciation for what a chore recordkeeping was in that era! Even if you have no relatives in this census I think it’s still worth taking a look at – it was very interesting to see what kinds of jobs people had, their education level, and how much they were paid. My family was in the suburbs of Chicago at that time, and probably 7 out of 10 people in their area worked in “telephone manufacturing” which would have been at Western Electric’s Hawthorne Works. My wife’s family was in a rural area downstate, and practically everyone worked on a farm and the few who didn’t were employed by the WPA. The best job title I saw when searching for her family was “chicken picker”.

Best of luck if you end up searching for your ancestors. If you find any who were employed as a chicken picker, let me know!

Fun Videos

SQL Saturday #119 Update

Chicago’s third annual SQL Saturday will soon be upon us, and as food and beverage coordinator it’s my job to obsess over what’s on the menu.

SQL Saturday food offerings seem to have evolved over time. The first few I attended offered pizza, which is of course a great way to feed a large number of people. While very cost-effective, pizza is also not all that exciting. This did not go unnoticed; it seems there was a shift to more local food specialties that aren’t pizza. Lately I’ve seen more local offerings like amazing barbecue in Kansas City and excellent bratwurst in Wisconsin.

BeefThis year I’m happy to announce that lunch for SQL Saturday Chicago will be Portillo’s! We’ll have Italian Beef sandwiches, Mostaccioli, and Grilled Veggie Sandwiches for those who have requested a vegetarian option. Be sure to bring your appetite!

Also if you haven’t already seen, our schedule has been posted! We have some excellent speakers lined up. Now is a great time to start thinking about what sessions you’d like to attend.

Pre-Cons
If you’re looking to extend your weekend a bit, we have two excellent pre-cons lined up for Friday, May 18:

These pre-cons are priced at $99 each and can be booked by emailing sqlsaturday119@sqlsaturday.com.

We’re looking forward to another wonderful event! Can’t wait to see you there in two weeks!

Compression Estimates That Don't Make Sense

“Statistics means never having to say you’re certain” is a well-known quote that would often come up in my AP Statistics class back in high school. The same can be implied for estimation, and while estimates need not be perfect, they should at least be logical. While estimating the effects of compressing data I recently came across a behavior that I don’t think Mr. Spock would approve of.

The sp_estimate_data_compression_savings procedure is an excellent tool for estimating how compression will affect an index or partition. It works by sampling 5% of the appropriate pages and applies the desired level of compression on them in TempDB. The results are then extrapolated to give an estimated size for the entire index/partition with that level of compression applied. It’s not perfect, but again it’s an estimate. I recently found however that depending on the datatypes present in the table, that estimate can get very bad. I noticed this in SQL Server 2008R2, but it also shows up in 2012.

For this demo, you’ll need the AdventureWorks database or one of its variants. I used AdventureWorks2008R2 and AdventureWorks2012 in my tests. You’ll also need tables a bit larger than AdventureWorks provides, so I use Adam Machanic’s bigAdventure script to create the dbo.bigTransactionHistory table we’ll be working with.

Once dbo.bigTransactionHistory is set up, we’re going to focus on it’s nonclustered index which is created in the script along with the table. Let’s rebuild this index with a fill factor of 100 so there’s as little fragmentation as possible.

ALTER INDEX IX_ProductID_TransactionDate ON dbo.bigTransactionHistory
REBUILD WITH (FILLFACTOR=100);

Now check the estimated size of it using all 3 compression settings: NONE, ROW, and PAGE.

EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, NONE;

EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, ROW;

EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, PAGE;

(Click to enlarge)
Screenshot1

The index is currently not compressed, and you can see that the first result set’s estimated size (with no compression) is close to the current size. It won’t be exactly the same and your values will vary, but the main idea here is that using no compression gives an estimate closest to the current value and row and page compression yield smaller values yet.

Now let’s change things up by adding another column to bigTransactionHistory and giving it a value:

ALTER TABLE dbo.bigTransactionHistory ADD vbm VARBINARY(MAX) NULL;

UPDATE dbo.bigTransactionHistory SET vbm = CONVERT(VARBINARY(MAX), '0xAB');

Then check the estimated sizes again just like before:

EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, NONE;

EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, ROW;

EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, PAGE;

Screenshot2

Look at that first result (with no compression) – it’s now way smaller than the current size. It’s actually even smaller than the estimated sizes for row and page compression, and it all seems to be due to the VARBINARY(MAX) column. If you remove that column and rebuild the table, the estimated size with no compression will be back to where it should be.

Strange, huh? I can only assume this has something to do with VARBINARY being a type that doesn’t get compressed, but I don’t see how that should affect the estimation.

BOL states that “If the requested compression setting is same as the current compression setting, the stored procedure will return the estimated size with no data fragmentation and using the existing fill factor.” I can’t see the assumption of no fragmentation being responsible for the estimated size shrinking so much and being smaller than the estimates using compression. Compression can increase the size of an object in certain cases, but the differences here are just too great.

I have submitted this on MS Connect – feel free to vote up if you can reproduce it!

SQL Saturday #118: Wisconsin-Style!

This past weekend I was very fortunate to be able to attend and speak at SQL Saturday #118 in Madison, Wisconsin. This was a top-notch event; all the hard work and planning that went into it really showed in how smoothly things ran.

My trip began around noon on Friday, which was a great decision because I hit no traffic at all on the way up there. I checked into the Crowne Plaza Madison and ended up with a little time to kill before heading out to help set things up. The hotel was excellent, featuring a pool and free wi-fi that was surprisingly fast. I’ve been gouged at other hotels for a much slower connection!

SQLCowA large group of volunteers converged on the venue around 4pm Friday to help with setup, stuff goodie bags, and make sure everything was ready to go. So many showed up that things got done significantly faster than planned, meaning more downtime before the speaker dinner started. Downtime’s never a bad thing at these events though – we all hung out at Ale Asylum, a brew pub with some very tasty beer!

The speaker dinner was held at Benvenuto’s Italian Grill, where much merriment was had, along with some excellent Italian food. Afterwards a lot of us headed back to the hotel bar before turning in for the night.

Doors opened for registration at 7:00 the next morning. The event was held at Madison Area Technical College, which was a terrific venue. There was excellent classroom space featuring large projector screens, plenty of desk space for laptops, and most rooms had tiered seating to help with visibility. The hallways were wide and there was plenty of open spaces to mingle and network with others. Much like the hotel, it also featured some impressively fast wi-fi.

My session, where I talked about data compression, was at 8:00 am – the first timeslot of the day. This was fine with me as I am very much a morning person. Not only was I awake and ready to go, but I had a pretty lively crowd of attendees that were eager to get the day started as well. My presentation filled the 75 minute timeslot perfectly, with enough time for some great questions afterwards.

Lunch was a delicious Wisconsin picnic featuring bratwurst, hamburgers, baked beans, potato salad, and cookies for dessert. To facilitate conversation on specific topics, this was a “Cows of a Spot” lunch, where speakers were assigned to specific tables marked with topics and attendees could come with their questions. I manned a table on performance and had some great conversations with those who stopped by.

The sessions themselves appeared very well-organized, and those I attended personally were of excellent quality. I was particularly impressed by how room monitors were assigned to each room and evaluation forms were given to speakers immediately following their session. Of all the SQL Saturdays I’ve spoken at this was the first time I’ve gotten to see my evaluations. I can remember one where the evaluation forms got lost before we got them, and another where it wasn’t communicated that speakers had to come pick them up so they were thrown out instead. Giving the forms to the speaker immediately following the session eliminated any confusion and made sure the evals got in the right hands as quickly as possible.

After a full day of training, closing ceremonies and the highly anticipated raffle drawings took place. There were some great prizes, including many books, Amazon Kindles, and even a HP Mini! The after party was held at Wilson’s Sports Bar across town. Unfortunately I don’t have a whole lot to say about this, as I left for home shortly after it started. I’m sure an amazing time was had by all, and you’ll be able to find blog posts by others who stayed.

All in all I thought this was an excellent event. While it was the first SQL Saturday in Wisconsin, it sure felt like it was put on by a bunch of seasoned pros. Many thanks to the organizing team, which consisted of Jes Borland (blog | @grrl_geek), Ted Krueger (blog | @onpnt), Gina Meronek (@equerystrian), Leonard Murphy (blog | @phonetictalk), and the MADPASS Board: Tim Benninghoff (blog), Matt Cherwin (@control_group), Tony Sebion (@tonysebion), and Steve Novoselac (blog | @scaleovenstove). Also a huge thanks to the sponsors – without their support it wouldn’t have been possible.

Preventing SSRS From Changing Text Size

Every once in a while I find myself building a report in SQL Server Reporting Services. I know my way around SSRS enough to get a report looking the way I like, but I’m definitely not an expert in the subject. While developing a report recently I noticed some very strange behavior that I have a hard time believing is intended: SSRS is changing font sizes on me depending on what I’m viewing! I’m happy to provide screenshots and instructions to reproduce this behavior, and also the solutions I found. If you’re curious, I first observed this in SQL Server 2008 R2 SP1 CU2 (10.50.2772), though I’m guessing it appears in earlier builds as well. It is still present in the latest build, SQL Server 2008 R2 SP1 CU5 (10.50.2806).

Let’s start off by creating a very simple table about soda pop soft drinks in a test database (mine is named “Sandbox”) and fill it with some data:

USE [Sandbox]
GO

CREATE TABLE [dbo].[SoftDrinks](
	[Name] [varchar](50) NOT NULL PRIMARY KEY,
	[Style] [varchar](50) NOT NULL,
	[CaffeineLevel] [varchar](10) NULL
);
GO

INSERT INTO [dbo].[SoftDrinks] VALUES (N'50/50', N'Citrus', NULL);
INSERT INTO [dbo].[SoftDrinks] VALUES (N'7 Up', N'Citrus', NULL);
INSERT INTO [dbo].[SoftDrinks] VALUES (N'Coca-Cola Classic', N'Cola', N'Medium');
INSERT INTO [dbo].[SoftDrinks] VALUES (N'Diet Pepsi', N'Diet', N'Medium');
INSERT INTO [dbo].[SoftDrinks] VALUES (N'Diet Rite', N'Diet', N'Medium');
INSERT INTO [dbo].[SoftDrinks] VALUES (N'Jolt', N'Cola', N'High');
INSERT INTO [dbo].[SoftDrinks] VALUES (N'Mello Yello', N'Citrus', N'High');
INSERT INTO [dbo].[SoftDrinks] VALUES (N'Mountain Dew', N'Citrus', N'High');
INSERT INTO [dbo].[SoftDrinks] VALUES (N'Pepsi', N'Cola', N'Medium');
INSERT INTO [dbo].[SoftDrinks] VALUES (N'RC Cola', N'Cola', N'Medium');

Selecting from the table should now get you this:
Query Result

I should add that the values in the CaffeineLevel column are completely made up. I just wanted a few different values with some NULLs mixed in.

Now that the table is set up, we need a report. To briefly describe the report I created, I’m using the List object with text boxes for Name and CaffeineLevel in the detail section. The report is grouped by Style and the groups are expandable and closed by default. The name of each group also displays a count of the number of records it contains. (Click to enlarge).
Visual Studio

To make things a bit quicker, you can download the report project here. It’s setup to use Windows authentication to connect to the Sandbox database on localhost. You may need to make adjustments to the connection information depending on your test environment. I wasn’t able to observe this behavior when previewing in Business Intelligence Development Studio, so you will need to deploy it to a report server.

Once the report is deployed, run it and you should see the 3 Style values with counts, however the details aren’t visible:
Report

Open the Cola and Diet groups and you’ll see the values for Name and CaffeineLevel. Everything looks fine, right?
Report

Now open the Citrus group. Notice how the size of the text in the CaffeineLevel column shrinks?
Report

Closing the Citrus group will return the text sizes for the other groups to normal. For fun, let’s try this again with a non-null value for the first row of the CaffeineLevel column. Add another row to the SoftDrinks table:

INSERT [dbo].[SoftDrinks] ([Name], [Style], [CaffeineLevel])
VALUES (N'123_NotNull', N'Citrus', N'Low');

Refresh the report and open all groups. You’ll see the font size doesn’t change, which should be expected.
Report

It appears that the cause of this is related to the top value in the column. If the top value is null, the font size gets smaller.

Solutions

I found a couple of ways to resolve this. The first is to remove the NULL values from the CaffeineLevel column by adjusting the query. I used the ISNULL() function and tried replacing the NULL values with an empty string, but that had no effect. Replacing null values with a single space did fix the problem, however.

SELECT
   Name,
   Style,
   --CaffeineLevel -- (The original query)
   --ISNULL(CaffeineLevel,'') AS CaffeineLevel -- (This doesn't work)
   ISNULL(CaffeineLevel,' ') AS CaffeineLevel -- (This works)
FROM dbo.SoftDrinks;

Another solution is to filter out the NULL values by using an expression in SSRS. For this I used the IsNothing() function to replace NULL values with a string containing a single space, which resolved the issue. Much like above, replacing NULL values with an empty string had no effect. The expression I used was:
=IIF(IsNothing(Fields!CaffeineLevel.Value)," ",Fields!CaffeineLevel.Value)

Whichever of the above you choose, you’ll arive at the correct (and visually pleasing) result of having the font sizes be the same:
Report

Personally I prefer to alter the query rather than add an SSRS expression, as I am not a fan of seeing <<expr>> in report definitions when it’s not truly necessary.

If anyone else has encountered this situation, I hope you find this solution helpful!