CS 442 - Programming Assignment 4

For this assignment you will write several more queries, but this time with real data and a substantial amount of data at that. This assignment is due Tuesday, October 28.

The first step of this assignment is for you to download some data sets from the Internet. From this U.S. Board on Geographic Names web site you need to download these four zip files and unzip them.

From this U.S. Census Bureau web site you need to download the following CSV (comma separated values) dataset containing the 2000 census results for all counties in the U.S.

CO-EST2007-ALLDATA.csv, County population dataset

Each of the above five files contains a set of data (you can ignore the extra files that contain data for Alaska and Hawaii; those datasets are separate because they use the Unicode character set). In this second step of this assignment you will import each dataset into a SQLite database table. Write an SQL script file, called BuildTables.sql, that defines a schema for each of the above five files. Use the following five names for your tables, NationalFile, IN_Features, US_CONCISE, GOVT_UNITS and CO_EST2007. If you open each of the above data files, you will see that each of them begins with a line that is close to an SQL schema. Make use of these lines to define the attributes in your tables. Also, each of the two web sites referenced above contain descriptions of the data contained in these files. Below are links to these descriptions, but you should go to those two web sites yourself and read them to get an idea of how this data is being distributed. One thing worth mentioning is that your schemas do not need any integrity constraints. We can assume that all of the data in the datasets is appropriate. So you can keep your schemas as simple as possible.

In your sql file, after the sql code for the schemas, you should write five sqlite3 .import commands that import each data set into its associated table. Besides the .import command, you will also need to make use of the .separator command (the four files from the USGS are in "pipe separated" format and the file from the U.S. Census Bureau is in "comma separated" format). Write a one line Windows .cmd file that runs your sql script. When you run the cmd file, it should create a database called CS442Hw4.db, run your sql script that creates the five tables in this database, and then import all of the data into the tables. When the script completes, you will have about a 250 MB database.

Now that you have a real database, you are ready for the third part of the assignment, writing queries. Create five files called query1.sql, query2.sql, query3.sql, query4.sql and query5.sql that contain code to solve each of the following queries.

Which county in Indiana has the most entries in the US_CONCISE table? Your query should return a single tuple containing the name of the county and the number of entries.
Which county in the lower 48 states has the most lakes? Your query should return a single tuple containing the county, state, and number of lakes.
Find all counties in the lower 48 states that have populations greater than 100,000 but no hospitals. List the county name, the state and the county population. (Be sure that your query only returns counties. There are some other kinds of population centers listed in the census dataset.)
This part has two related queries. Find the smallest county (in terms of population) in the lower 48 states that has an airport. Also, find the largest county that doesn't have an airport. Your queries should list the state, the county, and the county population.
Come up with a query of your own.

NOTE: Here are a few issues that you will need to deal with.

The datasets will not import cleanly into the tables. The datasets have a few minor problems that you will need to fix. In the sql file that builds the tables, document any changes that you need to make to a dataset so that it will import cleanly.
The Census table has 132 columns. Here is a hint. SQLite is pretty good about coming up with default data types.
As you work on the queries, you may find that you need to tweak your schemas. If you find that you need to modify a schema to make the tables more compatible with each other, document this in your sql file that builds the tables.
I found that I needed to slightly modify the contents of the U.S. Census dataset to make it work better with the other datasets. If you make any changes to a dataset, document them.
The IN_Features data is a subset of the NationalFile dataset and it is much smaller. This makes the IN_Features table a good substitute for the NationalFile table when you are testing queries. In fact, that is the only reason the IN_Features data is included in this assignment. Testing queries on the NationalFile data can take too long.

Turn in a zip file containing the following files. Turn in your six SQL files and for each SQL file turn in a one line Windows .cmd file that runs the sql script file with SQLite. Remember to put in each of your SQL files your name and assignment number. Please do not send me any of the dataset files or the SQLite database file. This assignment is due Tuesday, October 28.

Return to the main homework page.
Return to the CS 442 home page.

compliments and criticisms