This assignment is due Tuesday, September 24.
This homework assignment is loosely based on this infographic.
Download this zip file. In the zip file you will find a file, nations.json, which is a JSON file holding all the data for the above-mentioned infographic. This JSON file is essentially a small database of information about all the countries in the world. In this assignment you will write some Java programs that work together, using interprocess communication, to analyze this data.
In the zip file there is a second copy of the JSON file, called nations-ver2.json, that is formatted in a way that makes it easier for you to see the structure of the data. Open nations-ver2.json (in a decent text editor) and you will see that the file consists of a sequence of nation records, one for each country in the world. Each nation record consists of the name of the country, the geographic region of the country, and then three sub-records: one for income data, one for population data, and one for life expectancy data. Each of these sub-records consists of its name followed by a list of pairs. Each pair is two numbers, a year and a value.
If you open the file nations.json (again, in a decent text editor) you will see that all of the data for each country is in a single line of the file. This makes nations.json very easy to work with from Java code, but it makes the file very difficult to read in an editor. Your Java programs will work with nations.json, but you will need to read nations-ver2.json to help you understand how to write, and then debug, your code.
Your goal is to write a program that extracts information from this database. You will write a program that lets a user do a "query" that has four parameters to it. The first parameter to a query is a search pattern used to choose a subset of countries from which to extract information. The second parameter of a query is one of three keywords, "income", "population", or "lifeExpectancy", that specifies which sub-record to use for each selected country. The third parameter is a year (between 1800 and 2009) for which to extract information, and the last parameter is an operation to perform on the data selected by the first three parameters. The allowed operations are "total", "avg", "max", "min", and "count".
For example, if a query has the four parameters
"Sub-Saharan" income 1950 avg
then your program will find every line of the file nations.json that contains the string "Sub-Saharan", extract from those lines the income values for the year 1950, and then compute the average of those values. In other words, you will compute the average income in Sub-Saharan Africa in the year 1950.
As another example, if a query has the four parameters
"United" population 1900 max
the query will compute the maximum population in 1900 of all the countries with the word "United" as part of their name (there are just three such countries). Notice that the result of any query is always a single number.
Here is how your program should be structured. You should write a program called Query.java which will create a two-stage pipeline made of two child processes (the two-stage pipeline will do all of the real work).
The Query.java program will get the four parameters for a query from several possible places, checked in the following order, with each later source overriding the earlier ones. There should be default values for each parameter built into Query.java (the default search pattern is "name", the default sub-record is population, the default year is 2008, and the default operation is count). Query.java should look for an optional configuration file called query.cfg. If it exists, it must contain exactly four lines which contain, respectively, the search pattern, sub-record, year, and operation parameters. Next, Query.java should check for the existence of some environment variables. If there is an environment variable named CS404-pattern, then its value becomes the search pattern. If there is an environment variable named CS404-field, then its value denotes the sub-record. If there is an environment variable named CS404-year, then its value becomes the year parameter. If there is an environment variable named CS404-op, then its value denotes the operation to compute. Finally, Query.java should check for the existence of command line parameters. The command line syntax for Query.java is as follows.
Query [-p <pattern>] [-f {income | population | lifeExpectancy}] [-yr <year>] [-op {total | avg | max | min | count}]
Notice that all of the command line parameters to Query.java are optional and can be in any order.
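The lookup order described above could be sketched roughly as follows. This is only an illustration, not the required design: the class name QueryParams, the method resolve, and the idea of passing in the configuration lines and the environment as arguments are all assumptions made here so that the logic is easy to test. It also assumes that every command line flag is followed by exactly one value.

```java
import java.util.List;
import java.util.Map;

class QueryParams {
    // Resolve {pattern, field, year, op}. Each later source overrides the earlier ones.
    // (Hypothetical helper; the assignment does not prescribe this method.)
    static String[] resolve(List<String> cfgLines, Map<String, String> env, String[] args) {
        // 1. Built-in defaults from the assignment text.
        String[] p = {"name", "population", "2008", "count"};

        // 2. query.cfg: exactly four lines, in the same order (pass null if the file is absent).
        if (cfgLines != null && cfgLines.size() == 4) {
            for (int i = 0; i < 4; i++) {
                p[i] = cfgLines.get(i).trim();
            }
        }

        // 3. Environment variables.
        String[] envNames = {"CS404-pattern", "CS404-field", "CS404-year", "CS404-op"};
        for (int i = 0; i < 4; i++) {
            String v = env.get(envNames[i]);
            if (v != null) {
                p[i] = v;
            }
        }

        // 4. Command line flags, in any order; each flag is assumed to take one value.
        String[] flags = {"-p", "-f", "-yr", "-op"};
        for (int a = 0; a + 1 < args.length; a += 2) {
            for (int i = 0; i < 4; i++) {
                if (flags[i].equals(args[a])) {
                    p[i] = args[a + 1];
                }
            }
        }
        return p;
    }
}
```

In Query.java you might call this as QueryParams.resolve(cfgLines, System.getenv(), args), where cfgLines comes from Files.readAllLines(Paths.get("query.cfg")) when the file exists and is null otherwise. The sketch does no validation of the field, year, or operation values.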
After Query.java has determined its four parameter values, Query.java will create the first stage of the pipeline. The child process should be created so that its standard input is the file nations.json. The first stage process is the program grep.exe, which is in the zip file in the subfolder grep. The Query.java program will run grep.exe with a single command line parameter, which is the search pattern of the query (so grep.exe does all of the hard work of searching nations.json). The standard output of grep.exe should be directed to the standard input of the second stage of the pipeline. And the standard output of the second stage should be inherited from the standard output of Query.java.
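One possible way to wire up these two stages is sketched below. It assumes Java 9 or later, where ProcessBuilder.startPipeline connects each stage's standard output to the next stage's standard input for you; with an older JDK you would have to copy the bytes between the two child processes yourself. The class name PipelineSketch and its two methods are made up for this illustration, and the sketch assumes Query is run from the folder that contains nations.json, the grep subfolder, and BackEnd.class.

```java
import java.io.File;
import java.util.List;

class PipelineSketch {
    // Describe the two stages:  grep.exe <pattern>  |  java BackEnd <field> <year> <op>
    static List<ProcessBuilder> buildStages(String pattern, String field, String year, String op) {
        ProcessBuilder grep = new ProcessBuilder("grep\\grep.exe", pattern);
        grep.redirectInput(new File("nations.json"));             // stage 1 reads the data file
        grep.redirectError(ProcessBuilder.Redirect.INHERIT);

        ProcessBuilder backEnd = new ProcessBuilder("java", "BackEnd", field, year, op);
        backEnd.redirectOutput(ProcessBuilder.Redirect.INHERIT);  // stage 2 writes to Query's stdout
        backEnd.redirectError(ProcessBuilder.Redirect.INHERIT);

        return List.of(grep, backEnd);
    }

    // Start both children, with grep's stdout piped into BackEnd's stdin, then wait for them.
    static void run(List<ProcessBuilder> stages) throws Exception {
        for (Process p : ProcessBuilder.startPipeline(stages)) {
            p.waitFor();
        }
    }
}
```

Keeping the construction of the stages separate from starting them makes it possible to inspect the commands and redirections without actually launching grep.exe.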
The second stage of the pipeline is a filter program that you write called BackEnd.java. This program will be run with its standard input connected to grep's output. This program will be run with exactly three command line parameters, like this.
BackEnd {income | population | lifeExpectancy} <year> {total | avg | max | min | count}
This program will read one line at a time of nation data from its standard input (which comes from grep.exe) and search each line for the appropriate value from the chosen year pair in the chosen sub-record. If the operation is count, then BackEnd.java just counts the number of nations that have a value in the chosen sub-record for the chosen year. If the operation is total, then BackEnd.java adds up all of the found values. If the operation is avg, then BackEnd.java computes the average of all the found values. And if the operation is max or min, then BackEnd.java computes the maximum or minimum of the found values. The final computed value should be written by BackEnd.java to its standard output.
One important thing to know about the data in nations.json is that not every country has a value for every year in every sub-record. Some countries actually have very little data. If a data item is missing from a chosen sub-record of a chosen country, then the missing data item should not affect the result being computed. In particular, a missing data item is not counted if the operation is count. The count operation counts the number of countries that have data for the chosen year in the chosen sub-record.
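To illustrate how missing data stays out of every result, here is a small sketch of the five operations applied to a list that holds only the values that were actually found. The class name Aggregate and the method apply are hypothetical. Returning NaN for avg, max, and min when nothing was found is also an assumption made here; the assignment does not say what to print in that case.

```java
import java.util.List;

class Aggregate {
    // 'found' holds only the values that were actually present. A country with no
    // data for the chosen year is never added to the list, so it cannot change
    // the count, total, average, maximum, or minimum.
    static double apply(String op, List<Double> found) {
        switch (op) {
            case "count":
                return found.size();
            case "total":
                return found.stream().mapToDouble(Double::doubleValue).sum();
            case "avg":
                return found.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
            case "max":
                return found.stream().mapToDouble(Double::doubleValue).max().orElse(Double.NaN);
            case "min":
                return found.stream().mapToDouble(Double::doubleValue).min().orElse(Double.NaN);
            default:
                throw new IllegalArgumentException("unknown operation: " + op);
        }
    }
}
```

You do not need to store the values in a list at all; a running count, sum, maximum, and minimum updated as each line is read would give the same results.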
Here are a few remarks about BackEnd.java. It is pretty easy to use Java String methods to find data for a country. For example, if oneNation is a Java String reference to one line from nations.json, then
int index1 = oneNation.indexOf("income");
int index2 = oneNation.indexOf("population");
int index3 = oneNation.indexOf("lifeExpectancy");
will tell you exactly where in that string each of the three sub-records begins. Then you can search for the year and check if the index you get back is within the correct sub-record (if it isn't, that country does not have data for that year in the chosen sub-record). If the year is in the correct sub-record, the associated value is always right after a comma that is right after the year. So you can easily find the value you want by just using a few String methods.
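As a rough sketch of that idea, the method below looks up one value in one line. The names ValueFinder and find are made up for this example. It assumes that the sub-record names appear in double quotes and that each pair is written as [year,value] with nothing between the opening bracket and the year; check nations-ver2.json to confirm this before relying on it. Searching for the bracket, the year, and the comma together (rather than the bare year) avoids matching a value that happens to look like a year.

```java
class ValueFinder {
    // Returns the value paired with 'year' in the sub-record 'field',
    // or null if this country has no data for that year in that sub-record.
    static Double find(String oneNation, String field, int year) {
        int start = oneNation.indexOf("\"" + field + "\"");
        if (start < 0) {
            return null;
        }

        // The chosen sub-record ends where the next sub-record begins
        // (or at the end of the line if it is the last one).
        int end = oneNation.length();
        for (String other : new String[] {"income", "population", "lifeExpectancy"}) {
            int i = oneNation.indexOf("\"" + other + "\"");
            if (i > start && i < end) {
                end = i;
            }
        }

        // Look for "[year," starting inside the chosen sub-record.
        String key = "[" + year + ",";
        int at = oneNation.indexOf(key, start);
        if (at < 0 || at >= end) {
            return null;  // the year is missing from this sub-record
        }

        // The value runs from just after the comma to the closing bracket.
        int from = at + key.length();
        int to = oneNation.indexOf(']', from);
        return Double.parseDouble(oneNation.substring(from, to).trim());
    }
}
```

For a made-up line such as {"name":"Testland","income":[[1950,1200.25]],"population":[[1900,50000]]}, the call find(line, "income", 1950) returns 1200.25, while find(line, "population", 1950) returns null because 1950 does not occur inside the population sub-record.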
You should write and test BackEnd.java before you write Query.java. It is easy to test BackEnd.java from the command line with a command like this
C:\> grep\grep.exe "United" < nations.json | java BackEnd population 1900 max
Use the file nations-ver2.json to give you an idea of some reasonable queries and their results. You don't have to implement every operation right away. Get the basic structure of BackEnd.java working (say with just the count operation), then add in more operations, testing each one as soon as you implement it. Do a lot of testing. My copy of BackEnd.java is just a bit over 100 lines long. After you have BackEnd.java working, write Query.java to set up and run the pipeline.
Turn in a zip file called CS404Hw1Surname.zip containing your versions of Query.java and BackEnd.java. Include in the zip file everything needed to compile and run your programs. Please remember to put your name inside of each of your source files.