BookmarkSubscribeRSS Feed
Satori
Quartz | Level 8

I have a dataset with about 50 variables. I want to do a dynamic counting of non missing values for these variables, meaning that I start with the variable with the most non missing observations (var1), then I want to add a second variable (var2) and count the number of non missing observations for both var1 and var2, whereby the number of non missing is the highest possible among the existing variables.

 

For example, if I have just five variables:

var1 non missing 1000

var2 non missing 800

var3 non missing 400

var4 non missing 850

var5 non missing 330

 

I would start with var1, and then count the number of observations with non missing of two variables.

var1+var2 non missing 720

var1+var3 non missing 780

var1+var4 non missing 620

var1+var5 non missing 150

 

Then var1+var3 give me the most observations, so I would keep them and continue with the most number of observations with non missing for three variables.

var1+var3+var2

var1+var3+var4

var1+var3+var5

etc. until I have the result for all the variables (5 in this example). A possible result being var1+var3+var2+var5+var4

 

 

 

24 REPLIES 24
PaigeMiller
Diamond | Level 26

Are we to assume that all these variables are numeric? Or are we to assume that all these variables are character? Or are we to assume that some are numeric and some are character?

 

There are 50C1 + 50C2 + ... + 50C50 (where the C indicates combinations, 50C2 is the number of ways that 50 variables can be combined 2 at a time) possible combinations, that's 1,125,899,906,842,618, for more than is realistically possible to compute. So now what?

--
Paige Miller
Quentin
Super User

If I'm reading correctly the OP wants 50C1 + 49C1 + 48C1 ... 1C1

 

That is, 

Run PROC MEANS on all 50 variables, select the variable with the highest N.  Call that variable X.

Then run PROC MEANS on 49 variables where n(X)=1, and select the variable with the highest N.  Call that variable Y.

Then run PROC MEANS on 48 variables where n(X,Y)=2, and select the variable with the highest N  Call that variable Z.

Then run PROC MEANS on 47 variables where n(X,Y,Z)=3, and select the variable with the highest N....

 

So it could be done with 50 PROC MEANS steps in a macro or whatever.  

 

Is that what you want, @Satori ?

 

I'm curious to the big picture as to why you want to do this.  I don't think this sort of stepwise approach will necessarily give you an optimal combination of variables.

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Satori
Quartz | Level 8

yes, that is exactly what I want.

 

The objective of doing this is to get the highest number of observations with information on all (or most) variables, so I can decide where to make a cutoff on the extra variables. For example if on variable 25 I have 1000 observations and by adding variable 26 (the next one with most non missing) it drops to 100, I will choose to not add variable 26 and the following ones (var26 to var50)

Quentin
Super User

I'm not sure this method works.

 

As an example, with this step-wise approach imagine you decide to keep three variables, and select Var1 Var2 Var3.

 

It's possible that Var1 Var3 Var4 would give you more non-missing records.  It all depends on how missing-ness is correlated across the variables.

 

I think to find an optimal solution, you would have to test all the combinations, as Paige mentioned.

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Satori
Quartz | Level 8
Yes I'm aware of that, so what I would like is the simpler one path version. What I really want is help in writng the macro that does this as I'm not familiar with macros. Basically what I want is to count the number of non missing observations, get the variable with the highest number of non missing observations, then remove all missing observations for this highest variable, then again count the number of non missing, get the variable with highest nonmiss, remove all obs with missing in this var and do this for all 50 vars.
Quentin
Super User

Are you comfortable enough with macros to write a macro that will:

1) Run PROC MEANS on a list of variables

2) Identify the variable with highest number of non-missing variables.

 

So I'm imagining:

%macro FindVarWithGreatestN(
  data=/*name of dataset*/
 ,list= /*space-delimited list of variables*/
  );


I would start by writing that macro.  Then after you have that, you can write an outer macro that would iterate, removing one variable at a time.

 

If you can't write that macro, can you make some test data, and write the non-macro SAS code you would use to do one iteration?  (I'm thinking it's PROC means to get N for each var, then select one variable with the max N (handling ties)).  

 

This doesn't necessarily feel like the right thing to do, but if you want to learn the macro language it could be a useful exercise, so I'd be happy to help.

 

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Satori
Quartz | Level 8

I have been using proc freq to get non missing counts, as below:

proc format; value missfmt .,0 ='Missing' other='Not Missing';
proc freq data=want; format var1--var50 missfmt.; tables var1--var50;

I also did a static non missing count for all 50 vars, as below:

data work.nm; set have end=eof; array vars var1-var50;
	array nonmiss[50]; do i=1 to dim(vars) while (vars[i]); nonmiss[i]+1; end; if eof; keep nonmiss:;
proc print data=work.nm;

 

Quentin
Super User

Here's an attempt at an approach. It will probably error for some edge cases (e.g. a variable with all missing values.  I'm not convinced that this is a good idea, but it's still an interesting algorithm to implement.

 

Make some test data:

data have ;
  call streaminit(123) ;

  array x {*} x1-x100 ;

  do id=1 to 100 ;
    do i=1 to dim(x) ;
      if rand("Uniform")<i/100 then x{i}=1 ;
      else call missing(x{i}) ;
    end ;

    output ;
  end ;

  drop i ;
run ;

This is the main macro. It uses PROC MEANS to count the N for a a list of variables.  The records read by PROC MEANS are subset to be records where a list of selected variables has no missing values:

 

%macro FindVarWithGreatestN(
  data=             /*dataset*/
 ,CandidateVarlist= /*space-delimited list of candidate variables*/
 ,SelectedVarlist=  /*space-delimited list of previously selected variables*/
  );

ods listing close ;
ods output summary=Ncount;
proc means data=have n stackods ;
  var &CandidateVarlist;
  %if %length(&SelectedVarList) %then %do ;
    where nmiss(%sysfunc(translate(&SelectedVarlist,%str(,),%str( ))))=0 ;
  %end ;
run ;
ods output close ;
ods listing ;

%*Scope of BestVar is not declared, so if exists in outer scope will write it there ;
proc sql outobs=1 nowarn noprint;
  select variable into :BestVar
  from nCount
  having N=max(N) 
   ;
quit ;

%put the variable with most non-missing values is: &BestVar ;

proc delete data=work.Ncount ;
run ;

%mend FindVarWithGreatestN ;

You can call that macro like:

options mprint ;
%FindVarWithGreatestN(data=have,CandidateVarlist=x1-x100)
%FindVarWithGreatestN(data=have,CandidateVarlist=x1-x97,SelectedVarList=x98 x99 x100)

 

I suggest you start by trying that macro on your real data, and see if it makes sense to you. Once you're happy with that macro, then you need another macro which will loop over a variable list, calling %FindVarWithGreatestN inside the loop, and building the list of selected variables.  Something like:

 


%macro MakeVarList(
  data=
 ,CandidateVarlist=
  ) ;

  %local SelectedVarlist BestVar ;
  %do %until(%sysfunc(countw(&CandidateVarlist,%str( )))=0) ;
    %FindVarWithGreatestN(data=&data
                         ,CandidateVarlist=&CandidateVarlist
                         ,SelectedVarlist=&SelectedVarlist
                          )

    %let SelectedVarlist=&SelectedVarlist &BestVar ;
    %let CandidateVarlist=%sysfunc(compbl(%str()%sysfunc(tranwrd(&CandidateVarlist,&BestVar,%str( ))))) ;
  %end ;

  %put the generated variable list is: &SelectedVarList ;
%mend MakeVarList;

You can call that macro like:

%MakeVarList(data=have,CandidateVarlist=x96 x97 x98 x99 x100)
BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
PaigeMiller
Diamond | Level 26

I get the feeling this is the XY Problem all over again. The question (asked in several threads now) about how to sequentially count missings in combination across several variables seems strange and unusual to many of us, but the focus of all of the questions from the OP is tightly on the mechanics of counting missings rather than the underlying problem.

 

@Satori tells us:

 

After I get this counting, I will choose which variables to include in my analysis, and I will do entropy balancing for matching treatment and control observations

and so now we see that the real question involves fitting a model to data with lots of missings. The problem with such a tight focus on the mechanics of counting missings is that it provides a solution that will select variables that may or may not be good predictors. Such a method of counting missings may cause the best predictors to be thrown out.

 

So how to fit a predictive model when there are lots of missing values? That's what we should be answering, but since we haven't heard @Satori say this, we go down his rabbit hole of producing the sequential accounting of missing values by several variables. I'll stay out of the rabbit hole. There are solutions to handle missing values. I advise a possible method of fitting a model to data that has a lot of missings is to impute values for the missings; there are several methods to do this, which I have mentioned already, PROC MI, PROC MIANALYZE and the EM algorithm in PROC PLS, and probably others that I am not aware of. For example: please see the responses from @SteveDenham and @IanWakeling here: Solved: Factor Analysis Multiple Imputation - SAS Support Communities

 

--
Paige Miller
Reeza
Super User
Agreed, but also echo my comment of the Missing Pattern report from PROC MI which gives the OP pretty much what wants with all combos of variables to determine the best variables to drop to retain the best set of observations. It even shows how various means will be affected by these drops.
PaigeMiller
Diamond | Level 26

Yes, that sounds like a good idea. I was not aware of (or forgot about) this feature of PROC MI. It would be much easier to use this than to create a macro (or otherwise) that counts missings for combinations of variables.

--
Paige Miller
Reeza
Super User
Can you provide some sample data to work with? This is doable, probably inadvisable due to statistical methodology but doable. It also doesn't consider other combinations that may reduce your observations the minimum amount, as you need to consider all combos not just in order of highest to lowest.
ballardw
Super User

@Satori wrote:

I have a dataset with about 50 variables. I want to do a dynamic counting of non missing values for these variables, meaning that I start with the variable with the most non missing observations (var1), then I want to add a second variable (var2) and count the number of non missing observations for both var1 and var2, whereby the number of non missing is the highest possible among the existing variables.

 

For example, if I have just five variables:

var1 non missing 1000

var2 non missing 800

var3 non missing 400

var4 non missing 850

var5 non missing 330

 

I would start with var1, and then count the number of observations with non missing of two variables.

var1+var2 non missing 720

var1+var3 non missing 780

var1+var4 non missing 620

var1+var5 non missing 150

 

Then var1+var3 give me the most observations, so I would keep them and continue with the most number of observations with non missing for three variables.

var1+var3+var2

var1+var3+var4

var1+var3+var5

etc. until I have the result for all the variables (5 in this example). A possible result being var1+var3+var2+var5+var4

 

 

 


Can you show us an example of how you use the Var3 as the second largest instead, for discussion, the 5th variable? I am having a hard time seeing where order of variable actually comes into this if the objective is to determine numbers of nonmissing.

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 24 replies
  • 1677 views
  • 8 likes
  • 6 in conversation