Satori
Quartz | Level 8

I chose Var3 randomly. It could be any of the variables. What I want is to add the variable that keeps the highest number of observations. The order only matters in the sense that I want to keep the highest possible number of obs., and the objective is to see how the number of observations decreases as variables are added.

PaigeMiller
Diamond | Level 26

@Satori wrote:

I chose Var3 randomly. It could be any of the variables. What I want is to add the variable that keeps the highest number of observations. The order only matters in the sense that I want to keep the highest possible number of obs., and the objective is to see how the number of observations decreases as variables are added.


Are we to assume that all these variables are numeric? Or are we to assume that all these variables are character? Or are we to assume that some are numeric and some are character?

--
Paige Miller
Satori
Quartz | Level 8
The variables are a mix of numeric and character, but they can all be converted to numeric.
PaigeMiller
Diamond | Level 26

I'm going to take a step back now and try to look at the big picture.

 

Over the years, many statistical methods have been developed to handle missing data. SAS has two PROCs, PROC MI and PROC MIANALYZE, which handle missing values in a statistically appropriate way (depending on certain assumptions) by imputing values to use in place of the missing values. There is also the EM (expectation maximization) algorithm in PROC PLS (and possibly in other PROCs, I'm not sure) that will handle missing values.
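
For readers unfamiliar with these two PROCs, here is a minimal sketch of the usual impute, analyze, combine workflow. The data set name HAVE, the variables y, x1 and x2, the seed, and the choice of PROC REG as the analysis step are all assumptions made for illustration; nothing in this thread specifies them, and the variables must be numeric for this default setup.

```sas
/* Assumed for illustration: data set HAVE with numeric variables y, x1, x2 */

/* Step 1: create 5 imputed copies of the data, stacked in MI_OUT
   and distinguished by the automatic variable _Imputation_ */
proc mi data=have nimpute=5 seed=12345 out=mi_out;
  var y x1 x2;
run;

/* Step 2: run the analysis of interest once per imputed copy,
   saving parameter estimates and their covariance matrices */
proc reg data=mi_out outest=est covout noprint;
  model y = x1 x2;
  by _Imputation_;
run;

/* Step 3: combine the per-imputation results into one set of inferences */
proc mianalyze data=est;
  modeleffects Intercept x1 x2;
run;
```

Whether imputation is appropriate here still depends on the missing-data assumptions discussed in this thread.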

 

The method you are proposing does not sound like any of these previously developed tools for handling missing values. You might want to consider using one of the methods already in SAS rather than trying to invent your own. In fact, I would argue against inventing your own method unless none of the SAS methods already mentioned gives you what you want, which is a way to handle missing values in a specific analysis.

 

In addition, I think your method might be useful if the missing values occur at random; but if they are not random, I highly doubt that it will be a good method.

 

All of this hinges on you explaining what you are going to do once you have your missing analysis completed, and despite several people asking what you plan to do with the results of your missing analysis, you have not yet explained. So before I put more energy into helping you out with this method, I need to understand the next steps, once you have completed this accounting of missing values.

--
Paige Miller
Satori
Quartz | Level 8
Actually, I already explained the objective: get the highest number of observations with information on all (or most) variables. Doing this dynamic counting will allow me to see the decay in the number of observations as each extra variable is added. The goal is to decide where a cutoff is appropriate. For example, if at variable 25 I have 1000 observations and adding variable 26 (the next one with the most non-missing values given the previous 25 vars) drops that to 100, I will choose not to add variable 26 (or any of the following vars).
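
A minimal sketch of the counting part of this idea, for one fixed candidate order only: it tabulates how many observations remain complete as each variable is added in the listed order. It does not perform the greedy search for the best next variable; that would require re-evaluating the remaining candidates at every step. The data set name HAVE, the variable list var1-var5, and the output names DECAY, step, varname and n_complete are assumptions made for illustration, and all variables are assumed to have been converted to numeric.

```sas
/* Assumed for illustration: data set HAVE with numeric variables var1-var5,
   listed in the candidate order to evaluate */
data decay (keep=step varname n_complete);
  set have end=last;
  array v{*} var1-var5;
  /* Running complete-case counts; temporary arrays keep their values across rows */
  array cnt{5} _temporary_ (5*0);
  length varname $32;

  complete = 1;
  do step = 1 to dim(v);
    /* Once one variable is missing, the row is lost for all later steps */
    if missing(v{step}) then complete = 0;
    cnt{step} = cnt{step} + complete;
  end;

  /* After the last row, write one output row per step */
  if last then do step = 1 to dim(v);
    varname = vname(v{step});
    n_complete = cnt{step};
    output;
  end;
run;

proc print data=decay noobs;
run;
```

A sharp drop in n_complete from one step to the next marks the kind of cutoff described above, for that particular order.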
PaigeMiller
Diamond | Level 26

@Satori wrote:
actually I already explained the objective: 

No, that's not what I was asking. I am (and others are) specifically asking about what happens NEXT, after you complete this missing value accounting. What are you going to do with this result? What is the next analysis?

--
Paige Miller
Satori
Quartz | Level 8

After I get this counting, I will choose which variables to include in my analysis, and I will do entropy balancing for matching treatment and control observations

PaigeMiller
Diamond | Level 26

@Satori wrote:

After I get this counting, I will choose which variables to include in my analysis, and I will do entropy balancing for matching treatment and control observations


Okay, so you will choose variables based upon the pattern of missing values, rather than how good the predictors are? And are these values missing at random, or not missing at random?

--
Paige Miller
Satori
Quartz | Level 8
Yes. Missing at random.
Tom
Super User

@Satori wrote:
Actually, I already explained the objective: get the highest number of observations with information on all (or most) variables. Doing this dynamic counting will allow me to see the decay in the number of observations as each extra variable is added. The goal is to decide where a cutoff is appropriate. For example, if at variable 25 I have 1000 observations and adding variable 26 (the next one with the most non-missing values given the previous 25 vars) drops that to 100, I will choose not to add variable 26 (or any of the following vars).

What if you added a numeric variable to your data that counts how many missing values each observation has?

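data want;
  set have;
  /* Initialize NMISS first so it is non-missing when _ALL_ (which includes it) is evaluated */
  nmiss=0;
  /* CMISS counts missing values across both numeric and character variables */
  nmiss = cmiss(of _all_);
run;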

Now do a regression to see which variables best predict NMISS.

 

Would that help?
