Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
One of the key characteristics of many datasets is the fact that some of your data will be missing. Missing data is not necessarily a bad thing; it can mean any number of things (ineffective process, inadequate training, database problems, etc.) I’ll show you a couple of different ways to check up on your data, and see how much of it is missing.
Get the Data
Because I wanted to have a small dataset with missing data, I ended up creating my own. Don’t worry about what’s going on here, that’ll be covered in another blog. It’s enough to know that this creates a table DUMMY in the WORK library, with the variables IDNum, Gender, Jobcode, Salary, Birth, and Hired (data step taken and modified from SAS Support).
proc sql;
create table work.dummy
(IdNum char(4),
Gender char(1),
Jobcode char(3),
Salary num,
Birth num informat=date7.
format=date7.,
Hired num informat=date7.
format=date7.);
insert into work.dummy
values('1639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
values('1065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
values('1400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
values('1561','M',null,36514,'30NOV63'd,'07OCT87'd)
values('1221','F','FA3',.,'22SEP63'd,'04OCT94'd)
values('2639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
values('2065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
values('2400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
values('2561','M',null,36514,'30NOV63'd,'07OCT87'd)
values('2221','F','FA3',.,'22SEP63'd,'04OCT94'd)
values('3639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
values('3065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
values('3400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
values('3561','M',null,36514,'30NOV63'd,'07OCT87'd)
values('3221','F','FA3',.,'22SEP63'd,'04OCT94'd)
values('4639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
values('4065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
values('4400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
values('4561','M',null,36514,'30NOV63'd,'07OCT87'd)
values('4221','F','FA3',.,'22SEP63'd,'04OCT94'd);
select *
from work.dummy;
quit;
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the data ready
Because I’ve created the data to fit what I need, I don’t actually need to do anything else. Having said that, I recommend you play around with it using the SAS University Edition TASKS, PROC SQL, etc. Here’s the data as I’ve created it:
The Results
There are three methods I typically use to determine the “missingness” of my data. The first is using PROC MEANS, as shown here:
proc means data = work.dummy n nmiss;
var _numeric_;
run;
I’m using the nmiss option, which will be the way SAS provides the information about the missing data.
SAS is telling me that I have 4 missing Salaries, out of a total of 20 observations.
The next method is using PROC FREQ, which is easier code to run. You don't need to specify the variables you want to be analysed, but it is highly recommended you limit to a few at a time as otherwise it could take a long time to run and the results will be too cumbersome to analyse effectively. Here's the code I used:
proc freq data = work.dummy;
tables jobcode gender ;
run;
This code runs and produces two different tables (because I specified two variables) and Frequency, Percent, Cumulative Frequency, and Cumulative Percent. At the bottom of the first table SAS tells me I’m missing 4 observations.
The final method is a little more cumbersome, but provides the most flexibility as you can specify complex conditional statements. In this example, I’m picking three variables (Gender, Jobcode and Salary) and asking SAS to provide me the list of observations where one of them is missing.
proc sql;
select *
from work.dummy
where jobcode is null or
gender is null or
salary is null;
quit;
Here’s the output from SAS; because I said SELECT *, all the variables have been pulled. You can specify the exact variables you want to see by using SELECT JOB, GENDER … instead.
Each of these methods has its pros and cons. Depending on your data, and what you want to see, you may use one or a combination of these methods. Although SAS users have their preferences, there really isn’t a “right” way. Play around with them, become comfortable with them, and soon you’ll recognise situations where one is more appropriate than another.
Now it’s your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
... View more