Why Do I Have Missing Data and How Do I Fix it? Q&A, Slides, and On-Demand Recording

Did you miss the Ask the Expert session on missing data and how to resolve missing values in SAS? Not to worry, you can catch it on-demand at your leisure.

Watch the webinar

Watch this webinar to hear SAS expert Melodie Rush define missing values, why and when they occur and how to manage them. She will discuss functions, procedures and how products like SAS® Enterprise Guide®, SAS® Enterprise Miner™, SAS Studio and SAS® Viya® deal with missing values. During this webinar, you will learn:

The definition of a missing value.
Why missing values happen.
How to manage missing values in SAS.

Please leave a comment on this post and tell us how you handle missing values or how you implemented something new you learned during this webinar. It’s great to learn from fellow SAS users!

Here are the questions from the Q&A segment held at the end of the webinar. The slides from the webinar are attached.

What is the best way to impute categorical data?

Use the mode or a dedicated category. For a variable such as color, you can fill in every missing value with the most frequent color. For a variable such as gender, you could add a third category, “unknown”, for the missing values.
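The two options above can be sketched in a few lines. This is a language-neutral illustration written in Python, not SAS syntax; the function names and sample values are hypothetical, and it assumes missing values are represented as None and that at least one value is observed.

```python
from collections import Counter

def impute_with_mode(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_with_category(values, label="unknown"):
    """Replace None with a dedicated category label."""
    return [label if v is None else v for v in values]

# Hypothetical sample data
colors = ["red", "blue", None, "red", None]
genders = ["F", None, "M", "F"]

colors_filled = impute_with_mode(colors)
genders_filled = impute_with_category(genders)
```

The dedicated category keeps the fact that a value was missing visible in the data, while the mode hides it unless you also keep an indicator.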

Why would I need imputation indicators?

If someone asks whether a value is real, you need to know whether it was observed or imputed. That record also lets you evaluate your imputations. It matters in predictive modeling too, where the indicators can help make your predictions more accurate.

Can you run PROC STDIZE with a BY statement to apply different missing values to different BY-groups?

Yes: https://support.sas.com/documentation/onlinedoc/stat/132/stdize.pdf

For HPIMPUTE, if you use 'random', how can you repeat results when the method is random? Is it a problem that the imputed values will be different each time you repeat the procedure?

It looks like you can specify a SEED value, or the procedure generates one: https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=prochp&docsetTarget=pr...

What is the seed in PROC MI, and which value should we use?

The seed allows you to replicate the answer you get. PROC MI imputes those values randomly, so if you don’t specify a seed you won’t be able to replicate the results. If you run the code later with the same seed, it will start at the same place and give you the same answer. I usually use 1234.

Any positive integer works; the point is to get repeatable results across runs.

SEED=number

Specifies a positive integer to start the pseudo-random number generator. The default is a value generated from reading the time of day from the computer’s clock. However, in order to duplicate the results under identical situations, you must use the same value of the seed explicitly in subsequent runs of the MI procedure.

The seed information is displayed in the "Model Information" table so that the results can be reproduced by specifying this seed with the SEED= option. You need to specify the same seed number in the future to reproduce the results.
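The effect of a fixed seed can be seen with any pseudo-random generator. The sketch below is Python, not PROC MI: it fills missing values by drawing at random from the observed values, which is only a stand-in for the procedure's actual imputation methods. The function name and data are hypothetical.

```python
import random

def impute_random(values, seed):
    """Fill None by drawing from the observed values with a seeded generator."""
    rng = random.Random(seed)  # the same seed gives the same sequence of draws
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical sample data
data = [4.0, None, 7.5, None, 6.1, None]

run_a = impute_random(data, seed=1234)
run_b = impute_random(data, seed=1234)  # identical to run_a
```

Without an explicit seed, each run would start the generator from a different state and the filled-in values could differ between runs.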

Is there an easy way in the DATA step to create an indicator column that denotes whether a column was imputed (like you can in EM)?

You can use the MISSING function. If you have a lot of columns, you can use an array to run through all of them quickly and create a set of indicators. Some people create one indicator for the whole row; others create one for each column.
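The loop-over-columns idea looks roughly like this. It is a Python sketch of the logic rather than DATA step code; the m_ prefix, the m_any row-level flag, and the sample rows are hypothetical naming choices, and None stands in for a missing value.

```python
def add_missing_indicators(rows, columns):
    """Return copies of rows with a 0/1 flag per column plus a row-level flag."""
    flagged = []
    for row in rows:
        new_row = dict(row)
        any_missing = 0
        for col in columns:
            is_missing = 1 if row.get(col) is None else 0
            new_row["m_" + col] = is_missing          # per-column indicator
            any_missing = max(any_missing, is_missing)
        new_row["m_any"] = any_missing                # whole-row indicator
        flagged.append(new_row)
    return flagged

# Hypothetical sample data
rows = [
    {"age": 34, "income": None},
    {"age": None, "income": 52000},
    {"age": 41, "income": 61000},
]
flagged = add_missing_indicators(rows, ["age", "income"])
```

The indicators have to be created before the values are imputed; afterward there is nothing left to detect.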

You can post this question out on the SAS Support Community to get other opinions.

SAS Programming - https://communities.sas.com/t5/SAS-Procedures/Creating-an-indicator-variable-based-on-missing-variab...

Viya Model Studio - https://communities.sas.com/t5/SAS-Communities-Library/Asked-amp-Answered-How-to-create-missing-valu...

An Example on creating missing value indicator variables - https://stats.idre.ucla.edu/sas/seminars/multiple-imputation-in-sas/mi_new_1/

For imputation, do you have any guidelines on whether to choose mean, mode, median, etc...?

It depends on your situation and on what your outcome is. Try several options and see what works best for your data. If the goal is a predictive model, compare the options and see which one gives you the best prediction.

PROC FASTCLUS as a missing value replacement method

Good point. From the documentation: IMPUTE requests imputation of missing values after the final assignment of observations to clusters. If an observation that is assigned (or would have been assigned) to a cluster has a missing value for variables used in the cluster analysis, the missing value is replaced by the corresponding value in the cluster seed to which the observation is assigned (or would have been assigned). If the observation cannot be assigned to a cluster, missing value replacement depends on whether the NOMISS option is specified. If NOMISS is not specified, missing values are replaced by the mean of all observations in the DATA= data set having a value for that variable. If NOMISS is specified, missing values are replaced by the mean of only observations used in the analysis. (A weighted mean is used if a variable is specified in the WEIGHT statement.) For information about cluster assignment, see the section OUT= Data Set. If you specify the IMPUTE option, the imputed values are not used in computing cluster statistics. If you also request an OUT= data set, it contains the imputed values.
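The core idea of replacing a missing value with the corresponding value of the assigned cluster seed can be sketched as follows. This is a simplified Python illustration, not the FASTCLUS implementation: it assumes the seeds are already known, measures squared distance over the observed variables only, and does not cover the unassignable case or the NOMISS and WEIGHT behavior. All names and numbers are hypothetical.

```python
def nearest_seed(obs, seeds):
    """Index of the closest seed, using only the variables observed in obs."""
    def squared_distance(seed):
        return sum((x - s) ** 2 for x, s in zip(obs, seed) if x is not None)
    return min(range(len(seeds)), key=lambda i: squared_distance(seeds[i]))

def impute_from_seed(obs, seeds):
    """Replace each None in obs with the matching coordinate of its nearest seed."""
    seed = seeds[nearest_seed(obs, seeds)]
    return [s if x is None else x for x, s in zip(obs, seed)]

# Hypothetical cluster seeds for two variables
seeds = [[1.0, 1.0], [10.0, 12.0]]

# The first variable is observed and sits close to the second seed
filled = impute_from_seed([9.5, None], seeds)
```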

Which method is best?

Tough question. It depends. Some of the methods can be used interchangeably. Some methods rely on assumptions, so those assumptions need to be validated. Whether to use, say, the mean or the median depends on your knowledge of the data and its distribution.

Any imputation method relies on some heavy assumptions. Could you touch on some of the pitfalls of using imputation for missing data?

You must be careful. There are pitfalls if you delete the row or if you don’t get the imputation right. The most conservative choice is to use the mean or mode, but if a lot of values are missing, that could lead to bad results. If most of the data is missing, you should ask whether you should use that variable at all. Every data problem is unique. You may need to try several options and see what works best for your data.

What features does SAS have to address cleaning up data in terms of reading different formats of inputs?

Please post this question to the SAS Programming Support Community and you’ll be able to get a lot of good opinions and examples.

In what circumstances would you implement replacement of indicators for missing values?

Ask whether, in the business context, a missing value has "value" for an explanatory or predictive model.

What method do you recommend for longitudinal missing data? Multiple imputation? What do you think about refreshment samples?

Each point in longitudinal data is related to other data points, so when you start imputing you need to use specialized imputation methods. I don’t think multiple imputation will serve you well because it won’t consider the time feature of longitudinal data.

What are your thoughts on time series forecasting and missing dates due to COVID business shutdowns? What are the best practices for the missing dates?

SAS/ETS and other time series tools have data imputation methods built in. COVID presents many challenges. We may need to explore the time series. If the series resumes after COVID ends (hopefully), we may need to define an event.

Do you have a threshold of overall missing data for a variable that you have where you would 'throw the variable out' due to low/lack of coverage instead of using imputation?

I use 49%. If you would be imputing over half the data, think carefully about whether to keep the variable, unless you know why the values are missing.
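A threshold like this is easy to apply once the missing rate per variable is computed. The sketch below is Python rather than SAS; the 0.49 default mirrors the figure quoted above, and the function names and sample table are hypothetical.

```python
def missing_rate(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)

def columns_to_drop(table, threshold=0.49):
    """Names of columns whose missing rate exceeds the threshold."""
    return [name for name, values in table.items()
            if missing_rate(values) > threshold]

# Hypothetical sample table, stored as column name -> list of values
table = {
    "age": [34, None, 41, 29],          # 1 of 4 missing
    "fax": [None, None, None, "555"],   # 3 of 4 missing
}
dropped = columns_to_drop(table)
```

The result is a list of candidates to review, not an automatic decision; a variable with a known reason for its missing values may still be worth keeping.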

Which is preferable for imputation, the mean or the median? What is the effect of extreme data points on an imputation method?

If you have extreme values, the median may be preferable because it is affected by them much less than the mean is.
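A small worked example shows the difference. The numbers are made up for illustration, and the code is Python using the standard statistics module.

```python
from statistics import mean, median

# Hypothetical observed values; 500 is an extreme data point
observed = [10, 12, 11, 13, 500]

mean_fill = mean(observed)      # 109.2, pulled far above the typical values
median_fill = median(observed)  # 12, close to the typical values
```

Filling missing entries with 109.2 would place them far from four of the five observed values, while 12 sits in the middle of them.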

How do I start learning SAS Viya?

We have some free training resources available right now: https://www.sas.com/en_us/training/offers/free-training.html

We also have a free trial available: https://www.sas.com/en_us/software/viya.html

We also have a Viya training path: https://support.sas.com/training/us/paths/viya.html

Are there any rules, or methods that can be used that define how much missing data is acceptable? Are there any common tools used to measure how much missing data can be affecting the overall data?

Some practitioners accept up to about 50% missing. Other experts put the limit at 20-30%.

Here are a couple of research articles on the topic

What procedures do you have for predictive analysis for MSRP or mortgage data?

There are lots in SAS/STAT, Enterprise Miner (nodes), or Visual Data Mining and Machine Learning (regression methods, decision trees, GLM models, gradient boosting), depending upon what products you have access to.

So the best way to evaluate an imputation method is to test the predictions against a training dataset?

I would use a training data set to create the model and then use a test or validation data set for testing. I score the validation data set and use it so that neural networks and decision trees don’t overfit. Use the test data set (hold-out data that has not been used to create the model) to show that the model extends to new data coming in.
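A three-way partition of this kind can be sketched as below. This is plain Python, not an Enterprise Miner or VDMML partition; the 60/20/20 proportions, the seed, and the function name are assumptions chosen for illustration.

```python
import random

def split_rows(rows, seed=1234, train=0.6, validate=0.2):
    """Shuffle with a fixed seed, then cut into train, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])  # the remainder is the hold-out test set

# Hypothetical data: 100 row identifiers
rows = list(range(100))
train_set, valid_set, test_set = split_rows(rows)
```

To compare imputation methods fairly, each method would be fitted on the training part only and then judged on the parts it has not seen.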

In EM or VDMML try several options related to imputation methods in the model - what are the results? Do they make sense?

Sometimes I have '0' and missing values where the '0' is also a wrong entry, or I have a character value like 'not applicable' which I would consider a missing value. How do I handle that in SAS EG/SAS code?

If I understand correctly, this calls for a separate data replacement step. To replace one value with another, such as "0" with ".", you can use SAS code or Query Builder in EG.
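The replacement step amounts to mapping the placeholder values to a true missing value before any analysis. The sketch below shows the logic in Python rather than SAS code or Query Builder; the list of placeholder strings, the function names, and the sample data are hypothetical, and None plays the role of the SAS missing value.

```python
# Hypothetical placeholder strings that should count as missing
MISSING_TEXT_CODES = {"not applicable", "n/a", ""}

def recode_text(value):
    """Treat placeholder strings as missing (None); leave real text unchanged."""
    if value is None:
        return None
    return None if value.strip().lower() in MISSING_TEXT_CODES else value

def recode_zero(value):
    """Treat 0 as missing; only valid for columns where 0 cannot be a real value."""
    return None if value == 0 else value

# Hypothetical sample data
statuses = ["Employed", "Not Applicable", None, "retired"]
balances = [120.5, 0, 87.0, None]

clean_statuses = [recode_text(v) for v in statuses]
clean_balances = [recode_zero(v) for v in balances]
```

Recoding 0 is a per-column decision: apply it only where you know a zero is a wrong entry, as in the question above.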

Is there a way to let SAS automatically find subsets of your data where multiple values are not missing like the subset of customers with credit card? They of course have credit card number and type not missing.

Use the Missing Data Pattern Task available in SAS Studio. This will give you the ability to select the columns you are interested in and you can use Group By if you need to create subgroups.
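The idea behind a missing data pattern report can be illustrated as follows. This is a Python sketch, not the SAS Studio task; the pattern notation (X for present, a period for missing), the function names, and the customer rows are hypothetical.

```python
from collections import Counter

def missing_patterns(rows, columns):
    """Count rows by pattern: 'X' = present, '.' = missing, one character per column."""
    return Counter(
        "".join("." if row.get(c) is None else "X" for c in columns)
        for row in rows
    )

def complete_subset(rows, columns):
    """Rows where none of the selected columns is missing."""
    return [row for row in rows if all(row.get(c) is not None for c in columns)]

# Hypothetical sample data
rows = [
    {"customer": 1, "card_number": "4111", "card_type": "visa"},
    {"customer": 2, "card_number": None, "card_type": None},
    {"customer": 3, "card_number": "5500", "card_type": None},
    {"customer": 4, "card_number": "4000", "card_type": "visa"},
]
card_columns = ["card_number", "card_type"]

patterns = missing_patterns(rows, card_columns)
card_holders = complete_subset(rows, card_columns)
```

Here the pattern with both columns present identifies the credit card subset, and the other patterns show where only part of the card information was recorded.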

Recommended Resources

Working with Missing Values Documentation

Managing Missing Data Using SAS® Enterprise Guide®

Video: SAS® Enterprise Miner® Tip: Imputing Missing Values

Multiple Imputation of Missing Data Using SAS®

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.
