We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Missing data - Not there, but not irrelevant!

by Regular Contributor ‎09-30-2016 01:25 PM - edited ‎09-30-2016 01:52 PM (667 Views)

One of the key characteristics of many datasets is the fact that some of your data will be missing.  Missing data is not necessarily a bad thing; it can mean any number of things (ineffective process, inadequate training, database problems, etc.)  I’ll show you a couple of different ways to check up on your data, and see how much of it is missing. MissingData.jpg

 

Get the Data

Because I wanted to have a small dataset with missing data, I ended up creating my own.  Don’t worry about what’s going on here, that’ll be covered in another blog.  It’s enough to know that this creates a table DUMMY in the WORK library, with the variables IDNum, Gender, Jobcode, Salary, Birth, and Hired (data step taken and modified from SAS Support).  

 

 

 

proc sql;
   create table work.dummy
       (IdNum char(4),
        Gender char(1),
        Jobcode char(3),
        Salary num,
        Birth num informat=date7.
                  format=date7.,
        Hired num informat=date7.
                  format=date7.);
insert into work.dummy
    values('1639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('1065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('1400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('1561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('1221','F','FA3',.,'22SEP63'd,'04OCT94'd)
   
    values('2639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('2065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('2400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('2561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('2221','F','FA3',.,'22SEP63'd,'04OCT94'd)
   
    values('3639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('3065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('3400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('3561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('3221','F','FA3',.,'22SEP63'd,'04OCT94'd)
   
    values('4639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('4065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('4400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('4561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('4221','F','FA3',.,'22SEP63'd,'04OCT94'd);
   
 select *
   from work.dummy;
   quit;

 

How to go about getting SAS University Edition

If you don’t already have University Edition, get it here and follow the instructions from the pdf carefully. If you need help with almost any aspect of using University Edition, check out these video tutorials. Additional resources are available in this article.

 

Getting the data ready FreeDataFriday_graphic.jpg

Because I’ve created the data to fit what I need, I don’t actually need to do anything else.  Having said that, I recommend you play around with it using the SAS University Edition TASKS, PROC SQL, etc. Here’s the data as I’ve created it:

 

 

IMAGE1.png

 

The Results

There are three methods I typically use to determine the “missingness” of my data.  The first is using PROC MEANS, as shown here:

 

 

proc means data = work.dummy n nmiss;
  var _numeric_;
run;

 

I’m using the nmiss option, which will be the way SAS provides the information about the missing data.

 

IMAGE2.png

 

SAS is telling me that I have 4 missing Salaries, out of a total of 20 observations.

 

The next method is using PROC FREQ, which is easier code to run.  You don't need to specify the variables you want to be analysed, but it is highly recommended you limit to a few at a time as otherwise it could take a long time to run and the results will be too cumbersome to analyse effectively.  Here's the code I used:

 

 

proc freq data = work.dummy;
  tables jobcode gender ;
run;

 

 This code runs and produces two different tables (because I specified two variables) and Frequency, Percent, Cumulative Frequency, and Cumulative Percent.  At the bottom of the first table SAS tells me I’m missing 4 observations.

 

IMAGE3.png

 

The final method is a little more cumbersome, but provides the most flexibility as you can specify complex conditional statements.  In this example, I’m picking three variables (Gender, Jobcode and Salary) and asking SAS to provide me the list of observations where one of them is missing.

  

proc sql;
select *
from work.dummy
where jobcode is null or
gender is null or
salary is null;
quit;

 

Here’s the output from SAS; because I said SELECT *, all the variables have been pulled.  You can specify the exact variables you want to see by using SELECT JOB, GENDER … instead. 

 

IMAGE4.png

 

Each of these methods has its pros and cons.  Depending on your data, and what you want to see, you may use one or a combination of these methods.  Although SAS users have their preferences, there really isn’t a “right” way.  Play around with them, become comfortable with them, and soon you’ll recognise situations where one is more appropriate than another.

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

Need data for learning?

 

The SAS Communities Library has a growing supply of free data sources that you can use in your training to become a data scientist. The easiest way to find articles about data sources is to type "Data for learning" in the communities site search field like so:

 

Image11.png

 

We publish all articles about free data sources under the Analytics U label in the SAS Communities Library. Want email notifications when we add new content? Subscribe to the Analytics U label by clicking "Find A Community" in the right nav and selecting SAS Communities Library at the bottom of the list. In the Labels box in the right nav, click Analytics U:

 

IMAGE12.png

 

Click Analytics U, then select "Subscribe" from the Options menu.

 

Happy Learning!

 

Comments
by Super User
on ‎09-30-2016 01:47 PM

I would correct the statement about Proc Freq requiring the variables to be specified.

 

proc freq data = work.dummy;
run

 

Will generate a frequency table of every variable, which admittedly may not be desireable but will work.

If you are only using Proc Freq instead of Means since means won't handle character then

proc freq data = work.dummy;
  tables _character_;
run;

will do the frequencies of the character variables.

 

by Regular Contributor
on ‎09-30-2016 01:53 PM

Thanks @ballardw - fixed / cleaned up the language a bit.  Hopefully that's clearer!

Have a good one

Chris

by SAS Super FREQ
on ‎10-12-2016 10:15 AM

Hi, This is a great article and good for anyone learning SAS, using any interface, to explore their datasets for missing data. They don't have to be using the University Edition. Nothing that you show in the code is specific to the University Edition. "Code is code" and "data is data" -- they don't really have an interface restriction. And the PROC SQL, PROC MEANS, PROC FREQ steps could all be submitted using SAS OnDemand for Academics or SAS Enterprise Guide's editor or even the SAS Display Manager editor on Windows. I could even type that code into and ISPF editor and batch-submit the code on the mainframe. That is the beauty of understanding that the interface you use is not SAS. The interface is how you send your code to SAS. The main investment, what's going to have the biggest payoff down the road, is learning what the code does and how to use PROC MEANS and PROC FREQ to explore data, no matter what interface you use.

 

Thanks for writing the article!

cynthia

by Regular Contributor
on ‎10-12-2016 10:27 AM

Thanks so much @Cynthia_sas for a wonderful comment - it's absolutely true, and one of the reasons why I love SAS University Edition.  I often learn new things at home and send it to myself at work, where I then take my newfound skill and apply it to our data.  The improvements to my efficiency has been dramatic, as i'm not spending long hours at work trying to figure out how to do something.  SAS has truly taken the concept of "transferable skills" to a whole new level!

 

Hoping to see you at SGF - until then, see you around the community!

Chris

Your turn
Sign In!

Want to write an article? Sign in with your profile.