What’s this data?
All right folks, this is the last article of the Free Data Friday series. For the tenth and final post, we’re going to make up our own data (aka simulated data).
This is a handy skill for a SAS learner since collecting real data isn’t always an option. Though simulated data is made up, it’s best to follow a pattern of real-world data so your output makes sense.
Today, we’ll create a dataset to examine height differences between men and women.
How to download
If you don’t already have University Edition, get it here and follow the instructions in the PDF carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
There is no data to download today because we’re making our own!
How to create your data
Something new we’re going to look at today is the "do loop". You start with your keyword do, then you use your count variable. I’ve called mine i but it can be any name, the same as any other variable. After that you set a range of numbers, 1 to 1000 in this case.
The "do loop" lets you run the same code many times. Here we make a variable called x and assign it the value ranuni(1). The number in the parentheses is the seed: you can use any positive number, and a different seed gives a different sequence of generated numbers. On each pass through the loop, ranuni(1) gives x a random number between 0 and 1. Using .5 as my cut point for gender (50% men, 50% women), I assign genders using if then statements.
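As a rough sketch of the loop described above, here is a five-row version you can run on its own to watch it work. The dataset name LoopDemo is made up for this illustration; everything else follows the article's code.

```sas
/* Minimal sketch: LoopDemo is a hypothetical name used only for this example */
data LoopDemo;
    do i = 1 to 5;                     /* i counts from 1 to 5 */
        x = ranuni(1);                 /* uniform random number between 0 and 1; 1 is the seed */
        if x < .5 then gender = 'M';   /* lower half of the range becomes M */
        else gender = 'F';             /* upper half becomes F */
        output;                        /* write the current row to LoopDemo */
    end;
run;

proc print data=LoopDemo;
run;
```

Changing the 5 to a larger number and adding a proc freq on gender is a quick way to check that the split lands near 50/50.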
From there we assign values to height, using 70 inches (5 ft. 10 in.) as the average height for men and 65 inches (5 ft. 5 in.) as the average height for women. We then add rand("Normal"), which draws from a normal distribution with mean 0 and standard deviation 1, so the heights spread in a normal distribution centered at 70 inches for men and 65 inches for women. The 4 and 3.5 are the standard deviations of height for men and women. Multiplying rand("Normal") by them stretches the distribution so that its spread matches the standard deviation.
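To see that shifting and scaling a standard normal draw gives the distribution you want, here is a small sketch that builds the men's heights two ways. It assumes two things that are not in the article: the dataset name NormalDemo, and a call to streaminit with an arbitrary seed (2024) so the rand draws repeat from run to run. The rand function keeps its own random number stream, so the seed passed to ranuni does not control it.

```sas
/* Sketch only: NormalDemo and the seed 2024 are assumptions for illustration */
data NormalDemo;
    call streaminit(2024);                 /* seed the stream used by rand */
    do i = 1 to 1000;
        z = rand("Normal");                /* standard normal: mean 0, SD 1 */
        heightA = 70 + z*4;                /* shift and scale by hand, as in the article */
        heightB = rand("Normal", 70, 4);   /* pass the mean and SD to rand directly */
        output;
    end;
run;

/* Both columns should show a mean near 70 and a standard deviation near 4 */
proc means data=NormalDemo mean std;
    var heightA heightB;
run;
```

The two columns hold different random draws, so the numbers will not match exactly, but both should sit close to the targets.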
From there everything else is just converting values to make things easier to read. The if then statements convert from inches to feet and inches. Then we use the cat function to concatenate the feet and inches variables so we see height in a format we’re more accustomed to. The last part of creating the dataset is to add the heightM and heightF variables, each holding the height only for rows of that gender, so that we can set up direct comparisons between genders. The output statement goes right before you close your "do loop" with an end. You need this output statement inside your "do loop" or the dataset will contain only the last observation. After the data step, sort by gender and throw in a proc freq so you can see the numbers more plainly.
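The inches-to-feet conversion can also be written without a chain of if then statements. The sketch below is one alternative, not the article's method: it rounds to the nearest whole inch first, then uses floor and mod to split the total. The dataset name ConvertDemo, the variable total, and the three test heights are invented for this example.

```sas
/* Alternative sketch: ConvertDemo, total, and the sample heights are made up */
data ConvertDemo;
    input height;
    total   = round(height, 1);          /* nearest whole inch */
    feet    = floor(total / 12);         /* whole feet */
    inches  = mod(total, 12);            /* leftover inches, always 0 to 11 */
    height2 = cat(feet, "' ", inches);   /* cat formats the numeric values itself */
    datalines;
59.6
65.2
71.6
;
run;

proc print data=ConvertDemo;
run;
```

Rounding before splitting means a height such as 71.6 rolls over to the next foot instead of ending up with 12 in the inches column.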
Now for some real comparisons, we’re going back to proc sgplot. This time we’re making overlapping histograms for men and women. The keyword for this graph type is histogram, followed by the name of the variable to analyze.
After that we put in a slash so we can add some options. The fillattrs= option sets the fill of the histogram; graphdata1 is a preset SAS style element that supplies the color. The transparency= option sets how see-through the histogram is. Its values range from 0 (opaque) to 1 (fully transparent).
The scale=count option means that we’re using counts rather than percentages on the y axis. The default for histograms is percentage. The density keyword overlays a density curve on the histogram. Use the same variable and the same style element as the histogram you just made (here through lineattrs=) so that the two match. Then repeat this process, with a different color, for the heights of the other gender.
The keylegend statement adds a key to the graph. The location= option decides whether the key is inside the graph area or outside of it, and position= points to the corner you’d like to put it in. Noborder makes it blend into the background rather than sit in its own box.
The across= option dictates how many columns are in the key, that is, how many entries sit side by side in a row. The yaxis statement lets you set options for the y axis, and the grid option draws grid lines along it. The label= option sets the axis label in the xaxis and yaxis statements. Finally, close it with a run statement and the code is complete.
data Simulated;
    do i=1 to 1000;
        x=ranuni(1);                       /* uniform random number; 1 is the seed */
        if x < .5 then gender='M';
        else gender='F';

        /* mean + standard normal draw * standard deviation */
        if gender='M' then height=70 + rand("Normal")*4;
        if gender='F' then height=65 + rand("Normal")*3.5;
        height=round(height, .1);

        /* convert total inches to feet and inches */
        if 48<=height < 60 then feet=4;
        if 60<=height < 72 then feet=5;
        if 72<=height then feet=6;
        if feet=4 then inches=height - 48;
        if feet=5 then inches=height - 60;
        if feet=6 then inches=height - 72;
        inches=round(inches, 1);
        /* rounding can give 12 inches, so carry it over to the next foot */
        if inches=12 then do;
            feet=feet + 1;
            inches=0;
        end;
        /* cat converts numeric values itself, so trim is not needed */
        height2=cat(feet, "' ", inches);

        /* reset both first: values carry over between passes of the loop */
        heightF=.;
        heightM=.;
        if gender='F' then heightF=height;
        if gender='M' then heightM=height;
        output;
    end;
run;

proc sort data=Simulated;
    by gender;
run;

proc freq data=Simulated;
    by gender;
    tables height2 / missing;
run;

proc sgplot data=Simulated;
    histogram heightM / fillattrs=graphdata1 transparency=0.7 scale=count;
    density heightM / lineattrs=graphdata1;
    histogram heightF / fillattrs=graphdata2 transparency=0.5 scale=count;
    density heightF / lineattrs=graphdata2;
    keylegend / location=inside position=topright noborder across=2;
    yaxis grid label="Count";
    xaxis label="Height";
run;
What does this output mean?
From this output we can see that men and women have different height distributions. Women have a lower mean, but they’re less spread out than men. This is no surprise since we’ve made it this way: we set the standard deviation to be lower for women than for men, and we set the mean to be higher for men.
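If you’d like numbers to go with the picture, one option is a quick proc means on the Simulated dataset. This is a sketch of a check you could add, not part of the original code; the statistics and the maxdec=2 rounding are my own choices.

```sas
/* Sketch: compare the simulated mean and SD with the targets (70/4 for M, 65/3.5 for F) */
proc means data=Simulated n mean std min max maxdec=2;
    class gender;    /* one row of statistics per gender; no sorting required */
    var height;
run;
```

The means and standard deviations should come out close to the values we built in, though not exactly equal, since every run is a random sample.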
These numbers were pulled from a website showing the same type of graph, allowing us to make a pretty direct comparison. The website is here. Looking at our histogram and theirs, they look pretty similar. While the graph used on this site is likely also simulated data, ours should still look pretty much the same since we used the same numbers they did.
You can continue to experiment with simulated data by finding other variables to simulate like weight, age, race, etc.
Final wrap up
Through this series, we have come up against a variety of challenges with freely available data. The main challenge we focused on was converting the data into a SAS standard dataset. Other troubles we encountered were variable naming issues, missing values, formatting, in-line summaries, and others. To solve these challenges, we have come up with a variety of coding methods. Free data is only as useful as your ability to interpret it.
This series was all about interpreting open data with University Edition, which gives you a variety of methods to rename, reorder, reformat variables and values, and much more. The series covered how to bring in a dataset and manipulate the data so that it’s analytics ready. The coding skills you have learned will help you see what the data is trying to tell you in University Edition.
If you forget some of the more complicated things we covered, don’t worry. You can always revisit these articles to refresh your memory. I showed you a few basic procedures you can use, but there’s a lot more code to learn and even more free data to practice with.
Now it’s your turn!
Here’s a comprehensive global source of open data that I just ran across today. Run some of it through University Edition and let me know what you find!
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field.
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U, then select "Subscribe" from the Options menu.