We’re smarter together. Learn from this collection of community knowledge and add your expertise.

How SAS University Edition helps you "see" missing data

by Super Contributor on ‎04-29-2016 07:31 AM - edited on ‎04-29-2016 10:42 AM by Community Manager (417 Views)

I have been spending my time since SAS Global Forum getting back to reading up on missing data.  I intended this to be the first of the articles on predicting and forecasting (I have about six in draft form), but sometimes life throws you a curveball.

 

See_missing_values.jpgIn this case, it was week’s SAS Tech Report. Rick Wicklin's article, Visualize missing data in SAS, which is about using PROC IML to identify your missing data. 

 

I knew that SAS University Edition has IML, but I’ve never used it, so figured this was a good time to explore and see if I could pick up some of the basics.

 

Get the Data

From the first article using the Watermain data, I’m using the Open Data Toronto’s dataset for water main breaks, and from the second, I have created the work.IMPORT2 data set that I’ll be using here.

 

How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

 

Getting the data ready

Beyond the steps outlined above, there’s nothing left to do with the data – we’re ready to go.

 

The Results

PROC IML uses matrices and allows some very powerful and very cool things to be done in SAS (I highly recommend reading Rick’s blog, The DO Loop, as it is packed full of great information!). One of the strengths of IML is the ability to not only manipulate the data, but also use ODS Graphics to create visualizations that are not standard in SAS. FreeDataFriday_graphic.jpg I am going to be taking Rick’s code and modifying it to meet the needs we have with the Water main breaks data. 

 

I’m keeping Rick’s comments in the code in as they are very concise, but I want to explain a couple of them in more depth.

 

The second line below, which starts varNames =, lists all the variables you want included in the analysis; in Rick’s article, he has a number of different ones from the sashelp.Heart dataset.  The water main data only has 2 variables of interest, so that’s what we’ve listed.

 

You’ll note I’m pointing to the work.tsprep dataset that is created in the second article linked above; this is because in order to count missing data, we need well, data that’s missing!  In the original data, February 13, 1990 was not included; once we run the Time Series Preparation task, it adds a row for the date with a “.” indicating no data.

 

 

 

proc iml;
varNames = {break_date Total};
use work.tsprep;                         /* open data set */
read all var varNames into X;              /* create numeric data matrix, X */
close work.tsprep;
 
title "Counts of Rows That Contain a Missing Value";
title2 "work.tsprep Data";
Count = countmiss(X,"row");            /* count missing values for each obs */
call Bar(Count);                       /* create bar chart */

 

 

So when we run this, we get this graph:

 

Image1_a.png

 

This is telling us that we have just shy of 8000 rows of complete data, and about 1800 rows of missing data.  This is helpful, but we need to go a little deeper.  We do this by adding the following code to the code above:

 

 

missRows = loc(Count > 0);                  /* which rows are missing? */
title "Location of Rows That Contain a Missing Value";
call histogram( missRows ) scale="count"    /* plot distribution */
     rebin={125, 250}                       /* bin width = 250 */
     label="Row Number";                    /* label for X axis */
 

 

 

Image2_a.png

 

So what the heck is this telling us?  Using Rick’s explanation, because we’ve set the width of the bins to 250, each bar includes 250 observations. For the first 2000 observations, between 20 and 40 of every 250 observations contain a missing value. Unlike Rick’s example, we have no continuous stretch of data that is complete.

 

One thing that’s kind of cool in the water main data is you can see cycles, which makes sense as the winter months occur in a specific pattern, and we’ve already seen the high number of breaks from November to March.

 

ods graphics / width=400px height=600px;
Y = X[missRows,];              /* extract missing rows   */
call HeatmapDisc( cmiss(Y) )   /* CMISS returns 0/1 matrix */
     displayoutlines=0 
     colorramp={white black}
     xvalues=VarNames          /* variable names along bottom */
     yvalues=missRows          /* use nonmissing rows numbers as labels */
     showlegend=0 title="Missing Value Pattern";

The final graph is called a heat map; I must admit, Rick’s is a lot more interesting than mine, but we can still get something of interest out of this one:

 

Image3_a.png

 

The black lines indicate the TOTAL is missing on that row; because we have no missing dates, that half of the graph is white.  On the TOTAL side, we once again see a pattern of thick black lines with thin white lines; it’s safe to assume the large black lines are the highest volume winter months.

 

Other than the Dataset and the Variables, I did not have to modify Rick’s original code to have it run on my dataset in University Edition. 

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

 

Image11.pngWe publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 IMAGE12.png

 

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

 

Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.