BookmarkSubscribeRSS Feed

Missing data - Not there, but not irrelevant!

Started ‎09-30-2016 by
Modified ‎08-03-2021 by
Views 1,591

Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:

 

Access Now

 

One of the key characteristics of many datasets is the fact that some of your data will be missing.  Missing data is not necessarily a bad thing; it can mean any number of things (ineffective process, inadequate training, database problems, etc.)  I’ll show you a couple of different ways to check up on your data, and see how much of it is missing. MissingData.jpg

 

Get the Data

Because I wanted to have a small dataset with missing data, I ended up creating my own.  Don’t worry about what’s going on here, that’ll be covered in another blog.  It’s enough to know that this creates a table DUMMY in the WORK library, with the variables IDNum, Gender, Jobcode, Salary, Birth, and Hired (data step taken and modified from SAS Support).  

 

proc sql;
   create table work.dummy
       (IdNum char(4),
        Gender char(1),
        Jobcode char(3),
        Salary num,
        Birth num informat=date7.
                  format=date7.,
        Hired num informat=date7.
                  format=date7.);
insert into work.dummy
    values('1639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('1065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('1400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('1561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('1221','F','FA3',.,'22SEP63'd,'04OCT94'd)
   
    values('2639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('2065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('2400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('2561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('2221','F','FA3',.,'22SEP63'd,'04OCT94'd)
   
    values('3639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('3065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('3400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('3561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('3221','F','FA3',.,'22SEP63'd,'04OCT94'd)
   
    values('4639','F','TA1',42260,'26JUN70'd,'28JAN91'd)
    values('4065','M','ME3',38090,'26JAN54'd,'07JAN92'd)
    values('4400','M','ME1',29769.'05NOV67'd,'16OCT90'd)
            values('4561','M',null,36514,'30NOV63'd,'07OCT87'd)
    values('4221','F','FA3',.,'22SEP63'd,'04OCT94'd);
   
 select *
   from work.dummy;
   quit;

 

Get started with SAS OnDemand for Academics

 
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:

Get Started

 

 

Getting the data ready FreeDataFriday_graphic.jpg

Because I’ve created the data to fit what I need, I don’t actually need to do anything else.  Having said that, I recommend you play around with it using the SAS University Edition TASKS, PROC SQL, etc. Here’s the data as I’ve created it:

 

 

IMAGE1.png

 

The Results

There are three methods I typically use to determine the “missingness” of my data.  The first is using PROC MEANS, as shown here:

 

 

proc means data = work.dummy n nmiss;
  var _numeric_;
run;

 

I’m using the nmiss option, which will be the way SAS provides the information about the missing data.

 

IMAGE2.png

 

SAS is telling me that I have 4 missing Salaries, out of a total of 20 observations.

 

The next method is using PROC FREQ, which is easier code to run.  You don't need to specify the variables you want to be analysed, but it is highly recommended you limit to a few at a time as otherwise it could take a long time to run and the results will be too cumbersome to analyse effectively.  Here's the code I used:

 

 

proc freq data = work.dummy;
  tables jobcode gender ;
run;

 

 This code runs and produces two different tables (because I specified two variables) and Frequency, Percent, Cumulative Frequency, and Cumulative Percent.  At the bottom of the first table SAS tells me I’m missing 4 observations.

 

IMAGE3.png

 

The final method is a little more cumbersome, but provides the most flexibility as you can specify complex conditional statements.  In this example, I’m picking three variables (Gender, Jobcode and Salary) and asking SAS to provide me the list of observations where one of them is missing.

  

proc sql;
select *
from work.dummy
where jobcode is null or
gender is null or
salary is null;
quit;

 

Here’s the output from SAS; because I said SELECT *, all the variables have been pulled.  You can specify the exact variables you want to see by using SELECT JOB, GENDER … instead. 

 

IMAGE4.png

 

Each of these methods has its pros and cons.  Depending on your data, and what you want to see, you may use one or a combination of these methods.  Although SAS users have their preferences, there really isn’t a “right” way.  Play around with them, become comfortable with them, and soon you’ll recognise situations where one is more appropriate than another.

 

Now it’s your turn!

 

Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.

 

 

Comments

I would correct the statement about Proc Freq requiring the variables to be specified.

 

proc freq data = work.dummy;
run

 

Will generate a frequency table of every variable, which admittedly may not be desireable but will work.

If you are only using Proc Freq instead of Means since means won't handle character then

proc freq data = work.dummy;
  tables _character_;
run;

will do the frequencies of the character variables.

 

Thanks @ballardw - fixed / cleaned up the language a bit.  Hopefully that's clearer!

Have a good one

Chris

Hi, This is a great article and good for anyone learning SAS, using any interface, to explore their datasets for missing data. They don't have to be using the University Edition. Nothing that you show in the code is specific to the University Edition. "Code is code" and "data is data" -- they don't really have an interface restriction. And the PROC SQL, PROC MEANS, PROC FREQ steps could all be submitted using SAS OnDemand for Academics or SAS Enterprise Guide's editor or even the SAS Display Manager editor on Windows. I could even type that code into and ISPF editor and batch-submit the code on the mainframe. That is the beauty of understanding that the interface you use is not SAS. The interface is how you send your code to SAS. The main investment, what's going to have the biggest payoff down the road, is learning what the code does and how to use PROC MEANS and PROC FREQ to explore data, no matter what interface you use.

 

Thanks for writing the article!

cynthia

Thanks so much @Cynthia_sas for a wonderful comment - it's absolutely true, and one of the reasons why I love SAS University Edition.  I often learn new things at home and send it to myself at work, where I then take my newfound skill and apply it to our data.  The improvements to my efficiency has been dramatic, as i'm not spending long hours at work trying to figure out how to do something.  SAS has truly taken the concept of "transferable skills" to a whole new level!

 

Hoping to see you at SGF - until then, see you around the community!

Chris

Version history
Last update:
‎08-03-2021 01:46 PM
Updated by:

sas-innovate-2024.png

 

Secure your spot at the must-attend AI and analytics event of 2024: SAS Innovate 2024! Get ready for a jam-packed agenda featuring workshops, super demos, breakout sessions, roundtables, inspiring keynotes and incredible networking events.

 

Register by March 1 to snag the Early Bird rate of just $695! Don't miss out on this exclusive offer. 

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags