Re: dynamic counting of non missing observations - Page 2

Satori · Posted 02-14-2023 09:58 AM

I chose Var3 randomly. It could be any of the variables. What I want is to add the variable that keeps the highest number of observations. The order only matters in the sense that I want to keep the highest possible number of obs., and the objective is to see how the number of observations decreases as variables are added.

PaigeMiller · Posted 02-14-2023 10:45 AM

@Satori wrote:

I chose Var3 randomly. It could be any of the variables. What I want is to add the variable that keeps the highest number of observations. The order only matters in the sense that I want to keep the highest possible number of obs., and the objective is to see how the number of observations decreases as variables are added.

Are we to assume that all these variables are numeric? Or are we to assume that all these variables are character? Or are we to assume that some are numeric and some are character?

--
Paige Miller

Satori · Posted 02-14-2023 10:50 AM

Variables are numeric and character, but can be converted to be all numeric

PaigeMiller · Posted 02-14-2023 11:21 AM

I'm going to take a step back now and try to look at the big picture.

Over the years, many statistical methods have been developed to handle missing data. SAS has two PROCs, PROC MI and PROC MIANALYZE, which handle missing values in a statistically appropriate way (depending on certain assumptions) by imputing values to use in place of the missing values. There is also the EM (expectation maximization) algorithm in PROC PLS (and possibly in other PROCs, I'm not sure) that will handle missing values.
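For readers who have not used these procedures, here is a minimal sketch of the usual workflow. The data set HAVE, the variables y and x1-x3, the seed, and the choice of PROC REG as the analysis step are all assumptions made for illustration; none of them come from this thread. The first step uses NIMPUTE=0 so that PROC MI only reports the missing data patterns and imputes nothing.

```sas
/* Hypothetical names: HAVE, y, x1-x3. All VAR variables are assumed numeric. */

/* Step 1: display only the missing data patterns (no imputation). */
ods select MissPattern;
proc mi data=have nimpute=0;
  var y x1 x2 x3;
run;

/* Step 2: create 5 imputed copies of the data (the seed is arbitrary). */
proc mi data=have out=mi_out nimpute=5 seed=12345;
  var y x1 x2 x3;
run;

/* Step 3: run the analysis once per imputed copy. */
proc reg data=mi_out outest=est covout noprint;
  model y = x1 x2 x3;
  by _Imputation_;
run;

/* Step 4: combine the per-imputation estimates into one set of results. */
proc mianalyze data=est;
  modeleffects Intercept x1 x2 x3;
run;
```

The pattern table from step 1 also shows how many observations share each combination of missing and nonmissing variables, which may already answer part of the counting question.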

The method you are proposing does not sound like any of these previously developed tools for handling missing values. You might want to consider using one of the methods already in SAS rather than trying to invent your own. In fact, I would argue against inventing your own method unless none of the SAS methods already mentioned gives you what you want, which is a way to handle missing values in a specific analysis.

In addition, I think your method might be useful if the missing values occur at random; but if they are not random, I doubt that it will be a good method.

All of this hinges on you explaining what you are going to do once you have your missing analysis completed, and despite several people asking what you plan to do with the results of your missing analysis, you have not yet explained. So before I put more energy into helping you out with this method, I need to understand the next steps, once you have completed this accounting of missing values.

Satori · Posted 02-14-2023 11:28 AM

actually I already explained the objective: get the highest number of observations with information on all (or most) variables. Doing this dynamic counting will allow me to see the decay in the number of observations by adding an extra variable. The goal is to decide where a cutoff is appropriate. For example if on variable 25 I have 1000 observations and by adding variable 26 (the next one with most non missing for all the previous 25 vars) it drops to 100, I will choose to not add variable 26 (as well as the following vars)
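The counting described here amounts to a greedy forward selection: at each step, add the remaining variable that leaves the most observations complete across everything chosen so far. A rough sketch follows, assuming a data set HAVE and a space-separated variable list. The macro name %greedy_complete, its parameters, and the output columns are hypothetical, and ties go to the variable listed first. It makes one pass over the data per candidate per step, so it may be slow for wide or large tables.

```sas
%macro greedy_complete(data=have, vars=, out=decay);
  %local remaining chosen step i k v n best bestn rest;
  %let remaining = &vars;
  %let chosen = ;
  %let step = 0;

  /* Empty shell that collects one row per step */
  data &out;
    length step 8 variable $32 n_complete 8;
    stop;
  run;

  %do %while (%length(&remaining) > 0);
    %let step = %eval(&step + 1);
    %let k = %sysfunc(countw(&remaining, %str( )));
    %let best = ;
    %let bestn = 0;

    /* Try each remaining variable together with those already chosen */
    %do i = 1 %to &k;
      %let v = %scan(&remaining, &i, %str( ));
      %let n = 0;
      data _null_;
        set &data end=_eof;
        /* CMISS accepts numeric and character variables alike */
        if cmiss(of &chosen &v) = 0 then _cnt + 1;
        if _eof then call symputx('n', _cnt, 'L');
      run;
      %if %length(&best) = 0 or &n > &bestn %then %do;
        %let best = &v;
        %let bestn = &n;
      %end;
    %end;

    /* Record the winner of this step */
    data _row;
      length step 8 variable $32 n_complete 8;
      step = &step;
      variable = "&best";
      n_complete = &bestn;
    run;
    proc append base=&out data=_row;
    run;

    /* Move the winner from the remaining list to the chosen list */
    %let chosen = &chosen &best;
    %let rest = ;
    %do i = 1 %to &k;
      %let v = %scan(&remaining, &i, %str( ));
      %if %upcase(&v) ne %upcase(&best) %then %let rest = &rest &v;
    %end;
    %let remaining = &rest;
  %end;

  proc datasets lib=work nolist;
    delete _row;
  quit;
%mend greedy_complete;

/* Example call with hypothetical variable names */
%greedy_complete(data=have, vars=var1 var2 var3 var4, out=decay);

proc print data=decay;
run;
```

The DECAY data set then lists, in order of entry, each variable and the number of observations that are still complete once it is added, so a sharp drop such as the 1000 to 100 example would be visible directly. Variable names that collide with macro operators (for example GT or OR) would need extra quoting in this sketch.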

PaigeMiller · Posted 02-14-2023 11:40 AM

@Satori wrote:
actually I already explained the objective:

No, that's not what I was asking. I am (and others are) specifically asking about what happens NEXT, after you complete this missing value accounting. What are you going to do with this result? What is the next analysis?

Satori · Posted 02-14-2023 02:41 PM

After I get this counting, I will choose which variables to include in my analysis, and I will do entropy balancing for matching treatment and control observations

PaigeMiller · Posted 02-14-2023 03:35 PM

@Satori wrote:

After I get this counting, I will choose which variables to include in my analysis, and I will do entropy balancing for matching treatment and control observations

Okay, so you will choose variables based upon the pattern of missing values, rather than how good the predictors are? And are these values missing at random, or not missing at random?

Satori · Posted 02-14-2023 03:54 PM

yes. Missing at random.

Tom · Posted 02-14-2023 12:37 PM

@Satori wrote:
actually I already explained the objective: get the highest number of observations with information on all (or most) variables. Doing this dynamic counting will allow me to see the decay in the number of observations by adding an extra variable. The goal is to decide where a cutoff is appropriate. For example if on variable 25 I have 1000 observations and by adding variable 26 (the next one with most non missing for all the previous 25 vars) it drops to 100, I will choose to not add variable 26 (as well as the following vars)

What if you added a numeric variable to your data that counts how many missing values each observation has?

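data want;
  set have;
  /* Initialize NMISS first so that it is not itself counted as missing by _ALL_ */
  nmiss=0;
  /* CMISS counts missing values across both numeric and character variables */
  nmiss = cmiss(of _all_);
run;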

Now do a regression to see which variables best predict NMISS.

Would that help?
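As a quick first look before any modeling, one could tabulate NMISS from the WANT data set above. This is only a sketch of one possible check, not part of the suggestion itself.

```sas
/* Distribution of the per-row missing count created in WANT above */
proc freq data=want;
  tables nmiss / nocum;
run;
```

Rows with NMISS=0 are complete on every variable, so the first line of the table gives the number of observations that would survive if all variables were kept.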
