What’s this data?
Today, using data from the federal government, we’re going to examine world population by region between 1980 and 2010.
How to download
If you don’t already have University Edition, get it here and follow the instructions in the PDF carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Download the CSV file on this page. Save the file, and it's ready to bring into SAS University Edition.
How to get the data and prep it for analysis
There are several problems with these data. First, the data mix countries and regions, and no variable indicates which row is which; you just have to know your geography. Second, the population values are read in as character strings rather than numbers, because "NA" is used for missing values. So we need to separate the countries from the regions and convert those population fields to numeric.
The primary challenge is separating the regions from the countries. The data come grouped by region: each region's total appears first, followed by the countries in that region in alphabetical order, then the next region's total, and so on. Because the region rows aren't marked any differently, isolating just the countries is hard, but not impossible. Use PROC IMPORT to bring in your file. In your next data step, use a RENAME statement to rename the variable "_" to country.
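As a rough sketch of the import and rename steps just described: the data set names Population_Raw and Population are placeholders chosen for illustration, and the code assumes the unlabeled first column really does arrive as a variable named "_", as the text says. Check the imported variable names in your own session before relying on it.

```sas
/* Read the CSV. Population_Raw is an assumed, illustrative name. */
proc import datafile="/folders/myfolders/my_data/population by country.csv"
    out=Population_Raw
    dbms=csv
    replace;
    getnames=yes;        /* take variable names from the header row */
    guessingrows=max;    /* scan every row before choosing types and lengths */
run;

/* Give the unlabeled first column a meaningful name. */
data Population;
    set Population_Raw;
    rename _ = country;
run;
```

Running PROC CONTENTS on Population afterward is a quick way to confirm that country exists and to see which year columns came in as character.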
To convert the character data to numeric, we're going to use a macro. A macro lets you run the same code repeatedly, changing one key piece each time. To start a macro, you always use the %MACRO statement, followed by the macro's name, in this case "cleanup." In parentheses, you list one or more macro variables (parameters) whose values change on each run. The body of the macro holds everything you want to happen. It can be any code: a data step, a model, or a graph. At the end of that code, close the macro with %MEND.
Our macro has only one macro variable: the_var. Inside the macro it is referenced as &the_var, and it takes a different value every time we run the macro. To run the macro, call it by name with a % sign in front, somewhere below the %MEND statement, and put the value to substitute in the parentheses. You need one call for every value you want to process. In this example, the first call replaces &the_var with _1980. It really is like an automatic copy-paste. Defining the macro does nothing by itself; without a call after %MEND, no code runs.
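A minimal sketch of what such a macro might look like follows. It is not necessarily the original author's code: the data set name Population, the temporary _num suffix, and the choice of the 32. informat are all assumptions made for illustration. It also assumes Population already exists with character year columns such as _1980.

```sas
/* Sketch only: Population and the _num suffix are assumed names. */
%macro cleanup(the_var);
    data Population;
        set Population;
        /* Read the character value as a number; "NA" cannot be read */
        /* and becomes a missing value (the log will mention this).  */
        &the_var._num = input(&the_var, 32.);
        /* Swap the numeric copy in under the original column name. */
        drop &the_var;
        rename &the_var._num = &the_var;
    run;
%mend cleanup;

/* One call per column to convert; nothing runs until these calls. */
%cleanup(_1980)
%cleanup(_2010)
```

Each call rewrites the whole data set, which is fine for a file this small and keeps the macro close to the "automatic copy-paste" idea. For many columns, an array inside a single data step would be a reasonable alternative.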
To remove the rows that are not countries, we are going to use the NOT IN operator in a WHERE statement. You specify your variable, then the keywords not in, and then in parentheses list every value you don't want, here our region names. Since this is a character variable, don't forget to put quotes around each value, with commas in between. One thing to note: the log will report warnings, and you can ignore them. The log is saying that there is non-numeric data in a variable being read as numeric. These are just the "NA" entries, and they have automatically been converted to missing values. The log treats this as a problem, but SAS really just did some of our work for us.
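proc import datafile="/folders/myfolders/my_data/population by country.csv"
    out=Population_Raw dbms=csv replace;   /* out= name is assumed */
run;
/* Assumes Population is Population_Raw with "_" renamed to country and numeric year columns. */
data Countries_Only;
    set Population;
    where country not in ("North America", "Central & South America",
        "Antarctica", "Europe", "Eurasia", "Middle East", "Africa", "Asia & Oceania");
run;   /* extend the list if other aggregate rows (e.g. a world total) remain */
proc univariate data=Countries_Only all; run;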
What does this output mean?
Using proc univariate we can see the distribution of the data. In 2010, two countries are far and away the largest. They aren't labeled in the graphs above, but you can go back to the original data and find which countries have those values: China and India, with China the larger of the two. Each is about four times as big as the third-largest country (the United States).
However, when you go back to the first year recorded, you can see that the second-largest country (India) was not nearly as close to the largest (China). This can perhaps be explained by China’s one-child-per-family policy, introduced at the end of the 1970s. You would expect population to increase exponentially, but China's growth slowed significantly.
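One way to look up which countries sit behind the extreme values is to sort by the 2010 column and print the top few rows. This sketch assumes the filtered data set is named Countries_Only, that the 2010 column is named _2010 (following the _1980 naming used earlier), and that it has already been converted to numeric. Largest_2010 is an illustrative name.

```sas
/* Sort a copy so the largest 2010 populations come first. */
proc sort data=Countries_Only out=Largest_2010;
    by descending _2010;
run;

/* Show the three largest countries with their 2010 values. */
proc print data=Largest_2010 (obs=3);
    var country _2010;
run;
```

Swapping _2010 for _1980 in both steps gives the same view for the first year in the data.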
Compare China's slower growth with India's. India's 2010 population was about 171% of its 1980 level (1173.108 / 684.8877), an increase of about 71%. China's was about 135% of its 1980 level (1330.1412 / 984.73646), an increase of about 35%. China recently repealed the one-child law and will allow couples to have two children. This article has more information about China's policy change.
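The arithmetic behind this comparison can be checked with a short data step. The four population figures are the ones quoted above; the variable names are illustrative. Dividing the 2010 value by the 1980 value gives the ratio (roughly 1.71 for India and 1.35 for China), and subtracting 1 gives the growth over the period (roughly 71% and 35%).

```sas
/* Check the growth figures; results are written to the log. */
data _null_;
    india_ratio  = 1173.108  / 684.8877;    /* 2010 value / 1980 value */
    china_ratio  = 1330.1412 / 984.73646;
    india_growth = (india_ratio - 1) * 100; /* percent increase */
    china_growth = (china_ratio - 1) * 100;
    put india_ratio= 6.3 india_growth= 5.1;
    put china_ratio= 6.3 china_growth= 5.1;
run;
```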
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.