ertr
Quartz | Level 8

Hello everyone,

 

I have a sample data set, shown below, and I want to fill in the missing values in a statistically meaningful way. I do not know many methods, but my first thought was to use the Impute node. Can someone tell me how to use the Impute node to fill in missing values in a statistically meaningful way?

 

 

This is just a sample data set; the values themselves are not meaningful.

Data Have;
Length CUST_ID 8 DEFAULT_DATE 8 Risk_1 8 Risk_2 8 Risk_3 8;
Infile Datalines Missover;
Input CUST_ID DEFAULT_DATE  Risk_1 Risk_2 Risk_3;
Format  DEFAULT_DATE date9.;
Datalines;
0001 21000 . 20000 30000
0002 21031 20000 . 30000
0003 21062 10000 20000 .
0004 21092 40000 50000 .
0005 21123 . 50000 60000
0006 21153 40000 . 60000
0007 21184 70000 80000 .
0008 21215 . 80000 90000
0009 21243 70000 . 90000
00010 21274 100000 110000 . 
00011 21304 . 110000 120000
00012 21335 100000 . 120000
;
Run;

Thank you,

1 REPLY 1
DougWielenga
SAS Employee


 

Short answer - If you know what a missing value represents (e.g. a missing gift likely means no gift, which means $0.00), use that value. For critical variables, consider Tree-based imputation in the Impute node. For variables with only slight missingness, consider using the mean for interval variables and the mode for categorical variables. You might also consider creating missing value indicators for variables with non-trivial missingness, so that you can treat the fact that a value was missing as additional information. Please note that replacing missing values in a small data set (like the sample you posted, which has no complete observations) is much riskier than in large data sets, which are less affected by any single imputed value.
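
Outside of Enterprise Miner, the indicator-plus-mean idea can be sketched in ordinary SAS code. This is only a rough illustration, not what the Impute node generates: it assumes the Have data set from the question and that SAS/STAT (for PROC STDIZE) is available, and the names Have_Flagged, Have_Imputed, and M_Risk_1-M_Risk_3 are made up for the example.

```sas
/* Step 1: create a 0/1 missing value indicator for each risk variable. */
/* Have_Flagged and the M_Risk_ names are illustrative assumptions.     */
data Have_Flagged;
    set Have;
    array risk{3}   Risk_1-Risk_3;
    array m_risk{3} M_Risk_1-M_Risk_3;
    do i = 1 to 3;
        m_risk{i} = missing(risk{i});   /* 1 if the value is missing, else 0 */
    end;
    drop i;
run;

/* Step 2: replace missing values with the variable mean.               */
/* REPONLY tells PROC STDIZE to replace missing values only, without    */
/* standardizing the nonmissing values.                                 */
proc stdize data=Have_Flagged out=Have_Imputed
            method=mean reponly;
    var Risk_1 Risk_2 Risk_3;
run;
```

The indicators are created before the imputation step on purpose: once the missing values are replaced, there is no way to tell which values were observed and which were filled in.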

 

Longer answer - The first time I had to use imputation, I had a data set with 25,000 observations and needed to run a regression model. I did so, obtaining an analysis of exactly 25 observations, since only 25 observations were complete. Unfortunately, imputing missing values is essentially pretending you have data that you do not, which is the opposite of doing something statistically meaningful. Imputing missing values increases observation counts and often reduces variability (e.g. when the mean is used for imputation), making the error estimates and any associated traditional statistics, such as confidence intervals, less meaningful. As a result, many classical statistics are not reported or are altered in some way (e.g. you lose the notion of degrees of freedom).
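
To see how many observations a complete-case analysis would actually use before deciding whether to impute, a quick check along these lines may help. It is a minimal sketch that assumes the Have data set from the question; n_complete is just an illustrative counter name.

```sas
/* Count observations with no missing values among the risk variables. */
data _null_;
    set Have end=last;
    /* NMISS returns the number of missing numeric arguments */
    if nmiss(of Risk_1-Risk_3) = 0 then n_complete + 1;
    if last then put "Complete observations: " n_complete "of " _n_;
run;
```

In the posted sample, every row is missing exactly one of the three risk values, so the count written to the log should be zero.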

 

Having said that, in large data sets like those used in Data Mining, you can actually use modeling to predict the missing value from other values that are known. Tree-based imputation methods that do not require complete data are available in SAS Enterprise Miner. This is still 'guessing', but you are using other information in the observation to inform that guess. Of course, you can't do this for every variable in typical data mining data sets, because each imputation represents a different model, and fitting a model for every input variable could generate thousands of models and ultimately very slow score code. If you are limited to SAS Enterprise Guide, you must resort to more manual methods of doing imputation. In either case, the analyst then relies on hold-out data to validate the fitted model rather than making assumptions about error distributions. In general, you can still generate useful predictions without drawing statistical inferences like those done in classical regression.
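
As one example of a more manual, model-based approach, the sketch below predicts Risk_1 from Risk_2 with an ordinary regression and fills in only the missing Risk_1 values. Please note that this swaps in linear regression for the Tree-based method in the Impute node, it assumes SAS/STAT is available, and the names Scored, Pred_Risk_1, M_Risk_1, and Have_Reg_Imputed are illustrative. Risk_2 is used alone because no row in the sample has all three risk values present.

```sas
/* Fit Risk_1 = f(Risk_2) on the rows where both values are present.   */
/* Rows with a missing Risk_1 but a nonmissing Risk_2 are not used in  */
/* the fit, but they still receive a predicted value in the OUT= data. */
proc reg data=Have noprint;
    model Risk_1 = Risk_2;
    output out=Scored p=Pred_Risk_1;
run;
quit;

/* Keep observed values; use the prediction only where Risk_1 is missing. */
data Have_Reg_Imputed;
    set Scored;
    M_Risk_1 = missing(Risk_1);              /* flag imputed rows */
    if M_Risk_1 then Risk_1 = Pred_Risk_1;
    drop Pred_Risk_1;
run;
```

Each variable imputed this way needs its own model, which is the cost mentioned above, and on a data set as small as the sample the fitted relationship should not be trusted.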

 

Hope this helps!

Doug
