Lucassss
Obsidian | Level 7

Hello,

I am working with a data set that has many missing values. For example, variable A is defined as "months since most recent finance installment trade delinquency." It takes integer values from 1 to 80, but 80% of its values are missing. A missing value means the observation has no finance installment trade delinquencies.

I want to replace those missing values because the Variable Clustering node in SAS EM requires non-missing inputs.

My question is: should I replace these missing values, and if so, what value should I replace them with?

Thanks for any help.

Lucas

Reeza
Super User
Wouldn't it be 0, because these customers are not delinquent? It also means your data lean heavily towards the 'non-outcome' of whatever you're trying to determine, so you may want to oversample if needed.
Lucassss
Obsidian | Level 7
Thanks for the reply. However, I don't think missing means 0. For example, a value of 5 means 5 months have passed since this customer's most recent delinquency. A value of 0 would mean the customer became delinquent this month.
I think a missing value is more like a very large number than a zero.
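
One way to act on the "very large number" idea is to fill the missing values with a sentinel just above the observed range and keep a separate flag, so a model can still tell "never delinquent" apart from a real duration. A minimal Python sketch, assuming missing values arrive as `None`; the function name, the sentinel of 81, and the sample data are illustrative and not taken from the actual data set:

```python
from typing import Optional, Tuple

MAX_OBSERVED = 80  # upper end of the range stated in the question


def encode_months_since(value: Optional[int],
                        sentinel: int = MAX_OBSERVED + 1) -> Tuple[int, int]:
    """Return (filled_value, never_delinquent_flag) for one observation."""
    if value is None:
        # No delinquency on record: use the sentinel and raise the flag.
        return sentinel, 1
    return value, 0


raw = [5, None, 80, None, 1]  # made-up sample, None = missing
encoded = [encode_months_since(v) for v in raw]
filled = [f for f, _ in encoded]
flags = [k for _, k in encoded]

print(filled)  # [5, 81, 80, 81, 1]
print(flags)   # [0, 1, 0, 1, 0]
```

The sentinel itself is a modeling choice. With 80% of rows sharing it, the flag column probably carries most of the information, and the filled column will have a large spike at 81 that affects any later standardization.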
Reeza
Super User
You said the values ranged from 1 to 80, so I assumed 0 would mean none, but that was an assumption. Either way, this is a definition problem. You definitely cannot impute in this case, because the values are systematically missing, not missing at random.

What are you trying to model? Is this a variable in the model or the outcome?
Lucassss
Obsidian | Level 7
This is an independent (input) variable. The target (dependent) variable is binary: default or no default.

So in this case, do I leave this variable as it is?

My concern is that when I standardize interval variables, does it make sense to exclude these kinds of variables?
Reeza
Super User
You have to figure out a way to code them; otherwise that variable will most likely be excluded from the analysis. As far as I understand, most algorithms do that.

Given that the duration of default exists only for those who default, isn't including it also problematic? Or at least it will be very highly correlated with the outcome, i.e. an account that's 36 months delinquent is more likely to default than one that's not delinquent.

This is a methodology question, by the way, not specifically a coding question. I still think coding these as 0 and keeping everything else as 1 to 80 makes sense. But hopefully someone else has a better answer for you.
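
For comparison, the zero coding suggested here can be sketched in a few lines of Python. This assumes missing values arrive as `None` and that 0 never occurs as an observed value (the question gives the range as 1 to 80); the function name and sample data are hypothetical. The guard raises an error if a real 0 is present, since that would collide with the "no delinquency" code:

```python
from typing import List, Optional


def recode_missing_as_zero(values: List[Optional[int]]) -> List[int]:
    """Code missing (no delinquency) as 0 and keep observed values as they are."""
    if any(v == 0 for v in values):
        # An observed 0 would be indistinguishable from the "none" code.
        raise ValueError("0 already occurs as an observed value")
    return [0 if v is None else v for v in values]


months = [12, None, 36, None]  # made-up sample, None = missing
coded = recode_missing_as_zero(months)
print(coded)  # [12, 0, 36, 0]
```

Note that this coding places "never delinquent" right next to "delinquent one month ago" on the numeric scale, which is the objection raised earlier in the thread.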
