I have a dataset with about 50 variables. I want to do a dynamic counting of non missing values for these variables, meaning that I start with the variable with the most non missing observations (var1), then I want to add a second variable (var2) and count the number of non missing observations for both var1 and var2, whereby the number of non missing is the highest possible among the existing variables.
For example, if I have just five variables:
var1 non missing 1000
var2 non missing 800
var3 non missing 400
var4 non missing 850
var5 non missing 330
I would start with var1, and then count the number of observations with non missing of two variables.
var1+var2 non missing 720
var1+var3 non missing 780
var1+var4 non missing 620
var1+var5 non missing 150
Then var1+var3 give me the most observations, so I would keep them and continue with the most number of observations with non missing for three variables.
var1+var3+var2
var1+var3+var4
var1+var3+var5
etc. until I have the result for all the variables (5 in this example). A possible result being var1+var3+var2+var5+var4
Are we to assume that all these variables are numeric? Or are we to assume that all these variables are character? Or are we to assume that some are numeric and some are character?
There are 50C1 + 50C2 + ... + 50C50 (where the C indicates combinations, 50C2 is the number of ways that 50 variables can be combined 2 at a time) possible combinations, that's 1,125,899,906,842,618, for more than is realistically possible to compute. So now what?
If I'm reading correctly the OP wants 50C1 + 49C1 + 48C1 ... 1C1
That is,
Run PROC MEANS on all 50 variables, select the variable with the highest N. Call that variable X.
Then run PROC MEANS on 49 variables where n(X)=1, and select the variable with the highest N. Call that variable Y.
Then run PROC MEANS on 48 variables where n(X,Y)=2, and select the variable with the highest N Call that variable Z.
Then run PROC MEANS on 47 variables where n(X,Y,Z)=3, and select the variable with the highest N....
So it could be done with 50 PROC MEANS steps in a macro or whatever.
Is that what you want, @Satori ?
I'm curious to the big picture as to why you want to do this. I don't think this sort of stepwise approach will necessarily give you an optimal combination of variables.
yes, that is exactly what I want.
The objective of doing this is to get the highest number of observations with information on all (or most) variables, so I can decide where to make a cutoff on the extra variables. For example if on variable 25 I have 1000 observations and by adding variable 26 (the next one with most non missing) it drops to 100, I will choose to not add variable 26 and the following ones (var26 to var50)
I'm not sure this method works.
As an example, with this step-wise approach imagine you decide to keep three variables, and select Var1 Var2 Var3.
It's possible that Var1 Var3 Var4 would give you more non-missing records. It all depends on how missing-ness is correlated across the variables.
I think to find an optimal solution, you would have to test all the combinations, as Paige mentioned.
Are you comfortable enough with macros to write a macro that will:
1) Run PROC MEANS on a list of variables
2) Identify the variable with highest number of non-missing variables.
So I'm imagining:
%macro FindVarWithGreatestN(
data=/*name of dataset*/
,list= /*space-delimited list of variables*/
);
I would start by writing that macro. Then after you have that, you can write an outer macro that would iterate, removing one variable at a time.
If you can't write that macro, can you make some test data, and write the non-macro SAS code you would use to do one iteration? (I'm thinking it's PROC means to get N for each var, then select one variable with the max N (handling ties)).
This doesn't necessarily feel like the right thing to do, but if you want to learn the macro language it could be a useful exercise, so I'd be happy to help.
I have been using proc freq to get non missing counts, as below:
proc format; value missfmt .,0 ='Missing' other='Not Missing';
proc freq data=want; format var1--var50 missfmt.; tables var1--var50;
I also did a static non missing count for all 50 vars, as below:
data work.nm; set have end=eof; array vars var1-var50;
array nonmiss[50]; do i=1 to dim(vars) while (vars[i]); nonmiss[i]+1; end; if eof; keep nonmiss:;
proc print data=work.nm;
Here's an attempt at an approach. It will probably error for some edge cases (e.g. a variable with all missing values. I'm not convinced that this is a good idea, but it's still an interesting algorithm to implement.
Make some test data:
data have ;
call streaminit(123) ;
array x {*} x1-x100 ;
do id=1 to 100 ;
do i=1 to dim(x) ;
if rand("Uniform")<i/100 then x{i}=1 ;
else call missing(x{i}) ;
end ;
output ;
end ;
drop i ;
run ;
This is the main macro. It uses PROC MEANS to count the N for a a list of variables. The records read by PROC MEANS are subset to be records where a list of selected variables has no missing values:
%macro FindVarWithGreatestN(
data= /*dataset*/
,CandidateVarlist= /*space-delimited list of candidate variables*/
,SelectedVarlist= /*space-delimited list of previously selected variables*/
);
ods listing close ;
ods output summary=Ncount;
proc means data=have n stackods ;
var &CandidateVarlist;
%if %length(&SelectedVarList) %then %do ;
where nmiss(%sysfunc(translate(&SelectedVarlist,%str(,),%str( ))))=0 ;
%end ;
run ;
ods output close ;
ods listing ;
%*Scope of BestVar is not declared, so if exists in outer scope will write it there ;
proc sql outobs=1 nowarn noprint;
select variable into :BestVar
from nCount
having N=max(N)
;
quit ;
%put the variable with most non-missing values is: &BestVar ;
proc delete data=work.Ncount ;
run ;
%mend FindVarWithGreatestN ;
You can call that macro like:
options mprint ;
%FindVarWithGreatestN(data=have,CandidateVarlist=x1-x100)
%FindVarWithGreatestN(data=have,CandidateVarlist=x1-x97,SelectedVarList=x98 x99 x100)
I suggest you start by trying that macro on your real data, and see if it makes sense to you. Once you're happy with that macro, then you need another macro which will loop over a variable list, calling %FindVarWithGreatestN inside the loop, and building the list of selected variables. Something like:
%macro MakeVarList(
data=
,CandidateVarlist=
) ;
%local SelectedVarlist BestVar ;
%do %until(%sysfunc(countw(&CandidateVarlist,%str( )))=0) ;
%FindVarWithGreatestN(data=&data
,CandidateVarlist=&CandidateVarlist
,SelectedVarlist=&SelectedVarlist
)
%let SelectedVarlist=&SelectedVarlist &BestVar ;
%let CandidateVarlist=%sysfunc(compbl(%str()%sysfunc(tranwrd(&CandidateVarlist,&BestVar,%str( ))))) ;
%end ;
%put the generated variable list is: &SelectedVarList ;
%mend MakeVarList;
You can call that macro like:
%MakeVarList(data=have,CandidateVarlist=x96 x97 x98 x99 x100)
I get the feeling this is the XY Problem all over again. The question (asked in several threads now) about how to sequentially count missings in combination across several variables seems strange and unusual to many of us, but the focus of all of the questions from the OP is tightly on the mechanics of counting missings rather than the underlying problem.
@Satori tells us:
After I get this counting, I will choose which variables to include in my analysis, and I will do entropy balancing for matching treatment and control observations
and so now we see that the real question involves fitting a model to data with lots of missings. The problem with such a tight focus on the mechanics of counting missings is that it provides a solution that will select variables that may or may not be good predictors. Such a method of counting missings may cause the best predictors to be thrown out.
So how to fit a predictive model when there are lots of missing values? That's what we should be answering, but since we haven't heard @Satori say this, we go down his rabbit hole of producing the sequential accounting of missing values by several variables. I'll stay out of the rabbit hole. There are solutions to handle missing values. I advise a possible method of fitting a model to data that has a lot of missings is to impute values for the missings; there are several methods to do this, which I have mentioned already, PROC MI, PROC MIANALYZE and the EM algorithm in PROC PLS, and probably others that I am not aware of. For example: please see the responses from @SteveDenham and @IanWakeling here: Solved: Factor Analysis Multiple Imputation - SAS Support Communities
Yes, that sounds like a good idea. I was not aware of (or forgot about) this feature of PROC MI. It would be much easier to use this than to create a macro (or otherwise) that counts missings for combinations of variables.
Look at PROC MI missing pattern report.
https://blogs.sas.com/content/iml/2016/04/18/patterns-of-missing-data-in-sas.html
@Satori wrote:
I have a dataset with about 50 variables. I want to do a dynamic counting of non missing values for these variables, meaning that I start with the variable with the most non missing observations (var1), then I want to add a second variable (var2) and count the number of non missing observations for both var1 and var2, whereby the number of non missing is the highest possible among the existing variables.
For example, if I have just five variables:
var1 non missing 1000
var2 non missing 800
var3 non missing 400
var4 non missing 850
var5 non missing 330
I would start with var1, and then count the number of observations with non missing of two variables.
var1+var2 non missing 720
var1+var3 non missing 780
var1+var4 non missing 620
var1+var5 non missing 150
Then var1+var3 give me the most observations, so I would keep them and continue with the most number of observations with non missing for three variables.
var1+var3+var2
var1+var3+var4
var1+var3+var5
etc. until I have the result for all the variables (5 in this example). A possible result being var1+var3+var2+var5+var4
Can you show us an example of how you use the Var3 as the second largest instead, for discussion, the 5th variable? I am having a hard time seeing where order of variable actually comes into this if the objective is to determine numbers of nonmissing.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.