One of the key characteristics of many datasets is the fact that some of your data will be missing. Missing data is not necessarily a bad thing; it can mean any number of things (ineffective process, inadequate training, database problems, etc.) I’ll show you a couple of different ways to check up on your data, and see how much of it is missing.
Get the Data
Because I wanted to have a small dataset with missing data, I ended up creating my own. Don’t worry about what’s going on here, that’ll be covered in another blog. It’s enough to know that this creates a table DUMMY in the WORK library, with the variables IDNum, Gender, Jobcode, Salary, Birth, and Hired (data step taken and modified from SAS Support).
proc sql; create table work.dummy (IdNum char(4), Gender char(1), Jobcode char(3), Salary num, Birth num informat=date7. format=date7., Hired num informat=date7. format=date7.); insert into work.dummy values('1639','F','TA1',42260,'26JUN70'd,'28JAN91'd) values('1065','M','ME3',38090,'26JAN54'd,'07JAN92'd) values('1400','M','ME1',29769.'05NOV67'd,'16OCT90'd) values('1561','M',null,36514,'30NOV63'd,'07OCT87'd) values('1221','F','FA3',.,'22SEP63'd,'04OCT94'd) values('2639','F','TA1',42260,'26JUN70'd,'28JAN91'd) values('2065','M','ME3',38090,'26JAN54'd,'07JAN92'd) values('2400','M','ME1',29769.'05NOV67'd,'16OCT90'd) values('2561','M',null,36514,'30NOV63'd,'07OCT87'd) values('2221','F','FA3',.,'22SEP63'd,'04OCT94'd) values('3639','F','TA1',42260,'26JUN70'd,'28JAN91'd) values('3065','M','ME3',38090,'26JAN54'd,'07JAN92'd) values('3400','M','ME1',29769.'05NOV67'd,'16OCT90'd) values('3561','M',null,36514,'30NOV63'd,'07OCT87'd) values('3221','F','FA3',.,'22SEP63'd,'04OCT94'd) values('4639','F','TA1',42260,'26JUN70'd,'28JAN91'd) values('4065','M','ME3',38090,'26JAN54'd,'07JAN92'd) values('4400','M','ME1',29769.'05NOV67'd,'16OCT90'd) values('4561','M',null,36514,'30NOV63'd,'07OCT87'd) values('4221','F','FA3',.,'22SEP63'd,'04OCT94'd); select * from work.dummy; quit;
How to go about getting SAS University Edition
If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.
Getting the data ready
Because I’ve created the data to fit what I need, I don’t actually need to do anything else. Having said that, I recommend you play around with it using the SAS University Edition TASKS, PROC SQL, etc. Here’s the data as I’ve created it:
There are three methods I typically use to determine the “missingness” of my data. The first is using PROC MEANS, as shown here:
proc means data = work.dummy n nmiss; var _numeric_; run;
I’m using the nmiss option, which will be the way SAS provides the information about the missing data.
SAS is telling me that I have 4 missing Salaries, out of a total of 20 observations.
The next method is using PROC FREQ, which is easier code to run. You don't need to specify the variables you want to be analysed, but it is highly recommended you limit to a few at a time as otherwise it could take a long time to run and the results will be too cumbersome to analyse effectively. Here's the code I used:
proc freq data = work.dummy; tables jobcode gender ; run;
This code runs and produces two different tables (because I specified two variables) and Frequency, Percent, Cumulative Frequency, and Cumulative Percent. At the bottom of the first table SAS tells me I’m missing 4 observations.
The final method is a little more cumbersome, but provides the most flexibility as you can specify complex conditional statements. In this example, I’m picking three variables (Gender, Jobcode and Salary) and asking SAS to provide me the list of observations where one of them is missing.
proc sql; select * from work.dummy where jobcode is null or gender is null or salary is null; quit;
Here’s the output from SAS; because I said SELECT *, all the variables have been pulled. You can specify the exact variables you want to see by using SELECT JOB, GENDER … instead.
Each of these methods has its pros and cons. Depending on your data, and what you want to see, you may use one or a combination of these methods. Although SAS users have their preferences, there really isn’t a “right” way. Play around with them, become comfortable with them, and soon you’ll recognise situations where one is more appropriate than another.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Need data for learning?
The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:
We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:
Click Analytics U, then select "Subscribe" from the Options menu.