my code:
proc means noprint data=bv31 n; output out=counts(drop=_type_ _freq_) n=;
data high; set counts; array vars(*) _numeric_; max = max(of vars(*));
do i = 1 to dim(vars);
if vars(i) = max then do;
varname = vname(vars(i));
countval = vars(i);
output;
end;
end;
keep varname countval;
data test; set high; call symputx('varname', varname);
data dyna; set bvd.bv31; if missing (&varname) then delete; drop &varname;
How can I automate this code to do the following:
1) run the proc means precedure again (now with one variable less)
2) select the variable with the highest number of non-missing observations, and add it to the dataset high (add the var name and its value)
3) remove all missing observations for last variable added to the high dataset, and drop the variable
4) then continue doing this until there are no variables left.
sample data:
data have;
input obs var1 var2 var3 var4 var5 var6 $1.;
cards;
1 5 0 5 0 3 7
2 1 0 1 0 . 1
3 7 1 7 0 3 1
4 0 . . . 2 0
5 4 0 0 4 1 6
6 4 0 7 3 7 2
7 5 0 8 5 1 0
8 1 7 5 4 9 3
9 2 0 0 2 1 0
10 0 . . . 3 0
11 4 0 0 4 1 0
12 1 1 3 2 . 6
13 3 1 1 6 1 1
14 0 . . . 9 0
15 2 1 2 4 1 0
16 4 1 4 4 1 0
17 . 1 1 4 1 0
18 2 0 0 2 1 0
19 1 1 1 1 2 1
20 0 . . . 7 0
21 0 3 1 . 1 0
22 2 0 2 0 2 0
23 0 . . . 2 0
24 0 . . . 5 0
25 9 0 8 1 1 0
26 0 . . . 1 0
27 0 . 2 . 1 0
28 2 0 2 0 6 2
29 2 0 2 0 4 1
30 2 0 0 2 2 0
;
Based on your code posted, this the requested solution.
You need to loop n-1 otherwise you drop the last variable and have an empty data set at the end, unless that's what you want.
data have;
input obs var1 var2 var3 var4 var5 var6 $1.;
cards;
1 5 0 5 0 3 7
2 1 0 1 0 . 1
3 7 1 7 0 3 1
4 0 . . . 2 0
5 4 0 0 4 1 6
6 4 0 7 3 7 2
7 5 0 8 5 1 0
8 1 7 5 4 9 3
9 2 0 0 2 1 0
10 0 . . . 3 0
11 4 0 0 4 1 0
12 1 1 3 2 . 6
13 3 1 1 6 1 1
14 0 . . . 9 0
15 2 1 2 4 1 0
16 4 1 4 4 1 0
17 . 1 1 4 1 0
18 2 0 0 2 1 0
19 1 1 1 1 2 1
20 0 . . . 7 0
21 0 3 1 . 1 0
22 2 0 2 0 2 0
23 0 . . . 2 0
24 0 . . . 5 0
25 9 0 8 1 1 0
26 0 . . . 1 0
27 0 . 2 . 1 0
28 2 0 2 0 6 2
29 2 0 2 0 4 1
30 2 0 0 2 2 0
;
option mprint symbolgen;
%let nvars = 6;
%macro iterate(dsnin=);
ods select none;
proc means data=&dsnin. n stackods;
ods output summary = counts;
run;
ods select all;
proc sort data=counts;
by descending n;
run;
data highest;
set counts(obs=1);
call symputx('varname', variable);
run;
data &dsnin.;
set &dsnin.;
if missing (&varname) then delete;
drop &varname;
run;
%mend;
%macro loop(n=);
%do i=1 %to &n;
%iterate(dsnin=have);
%end;
%mend;
%loop(n=&nvars.);
@Satori wrote:
Using this code, how can I store the variable names and its values, before they are deleted, so that in the end of the process I have a table with two columns, the first column being the variable name and the second column being its value before deletion?
That does not make any sense. A variable has MANY values. Each observation in a dataset as a value for every variable in the dataset.
data have;
input obs var1 var2 var3 var4 var5 var6 $1.;
cards;
1 5 0 5 0 3 7
2 1 0 1 0 . 1
3 7 1 7 0 3 1
4 0 . . . 2 0
5 4 0 0 4 1 6
6 4 0 7 3 7 2
7 5 0 8 5 1 0
8 1 7 5 4 9 3
9 2 0 0 2 1 0
10 0 . . . 3 0
11 4 0 0 4 1 0
12 1 1 3 2 . 6
13 3 1 1 6 1 1
14 0 . . . 9 0
15 2 1 2 4 1 0
16 4 1 4 4 1 0
17 . 1 1 4 1 0
18 2 0 0 2 1 0
19 1 1 1 1 2 1
20 0 . . . 7 0
21 0 3 1 . 1 0
22 2 0 2 0 2 0
23 0 . . . 2 0
24 0 . . . 5 0
25 9 0 8 1 1 0
26 0 . . . 1 0
27 0 . 2 . 1 0
28 2 0 2 0 6 2
29 2 0 2 0 4 1
30 2 0 0 2 2 0
;
option mprint symbolgen;
%let nvars = 6;
*drop table of variable list before run;
proc sql;
drop table highest_list;
quit;
%macro iterate(dsnin=);
ods select none;
proc means data=&dsnin. n stackods;
ods output summary = counts;
run;
ods select all;
proc sort data=counts;
by descending n;
run;
data highest;
set counts(obs=1);
call symputx('varname', variable);
run;
proc append base = drop_list data=highest force;
run;
data &dsnin.;
set &dsnin.;
if missing (&varname) then delete;
drop &varname;
run;
%mend;
%macro loop(n=);
%do i=1 %to &n;
%iterate(dsnin=have);
%end;
%mend;
%loop(n=&nvars.);
Assuming you mean the # of non missing observation as your 'values' see above.
You really aren't tracking much in this process so I'm assuming it's a skeleton code that you're going to expand otherwise, it doesn't seem to be providing much information at all.
Weeks have gone by since you started discussing this problem, and you are still trying to eliminate variables that have lots of missings, except now you have rephrased the problem to keep variables that have lots of non-missings.
You have avoided (apparently) other suggestions that might work here, such as @Reeza's suggestion to use the missing pattern report from PROC MI, and my suggestion about fitting models when there is lots of missing data. https://communities.sas.com/t5/SAS-Programming/write-macro-or-automate-process/m-p/863200#M340998
People give you advice here because it appears you are working very hard to come up with a not very good solution. We want to show you a smarter path. We know that smart people have worked on the problem of missing values, and come up with some smart solutions. You should probably use those smart solutions instead of inventing your own.
Suggestion:
ACTUALLY walk through, showing the result of each manual step, NOT just the "highest count added" to a set. Show the actually removal of observations, the new set and then followup, for all the variables in your example data.
I am not sure I exactly which observations, variables or values are deleted or dropped.
I also wonder about rules for tie breaking. What if 3 variables have the same number of non-missing (or missing ) values (especially if on different observations) at the first pass. What rule is to be used for selecting the variable? Or is more than one variable to be selected?
This may be moderately critical in one way of determining variable and removing observations if these tied variables are often having the missing on the same observations it is quite possible that of 3 variables tied for most missing in the first pass that none of them have the most missing on the second pass.
I get from interpretting "3) remove all missing observations for last variable added to the high dataset, and drop the variable" as "remove the observation with missing values for the selected variable". Because you cannot have any "missing observations" in a SAS data set.
So for NUMERIC variables you can use PROC MEANS to calculate the N statistic to count the number of non-missing values.
proc summary data=have;
output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
run;
Now you can use the MAX() function to find the maximum value of those N statistics. And then the WHICHN() function to determine which of the variables has that value (this is take the first one with that count). Once you that index number you can use it with the VNAME() function to find the NAME of the variable. You can use the NLITERAL() function to convert the name to a string you could use in SAS code to refer to the variable. You can use CALL SYMPUTX() to write the name into a macro variable. While we are at it let's see how many variables we have left after we drop the selected variable.
data _null_;
set statistics;
array __n _numeric_ ;
call symputx("next",nliteral(vname(__n[whichn(MAX(of __n[*]),of __n[*])])));
call symputx('nvars',dim(__n)-1);
run;
Now that we know the name of the variable to drop we can use it to make a new copy of the dataset without the observations that have missing values of that variable, and also remove that variable. While we are at it let's also see how many observations are now left. There no sense continuing if there are no more observations.
data want(drop=&next);
if eof then call symputx('nobs',_n_-1);
set want end=eof;
where not missing(&next);
run;
We can then put it together in a macro definition so that we can loop until we run out of observations to delete or variables to drop.
But you have asked this question multiple times using different selection criteria, so let's make more parameters to the macro to control that also. We could use the NMISS() function instead of the N() function with PROC SUMMARY. (or really any statistic that PROC MEANS can calculate). And we could select the variable with MIN() value instead of the MAX() value. So let's add the STAT parameter and the CRITERIA parameter to the macro definition.
Here is the result:
%macro drop_vars
(inds /* input dataset name */
,outds /* output dataset name */
,mvar /* macro variable to store the dropped variable */
,stat=N /* What statistic to calculate */
,criteria=max /* What criteria to use for picking variable */
);
%local next nvars nobs;
%if not %symexist(&mvar) %then %global &mvar;
%let &mvar=;
data &outds;
set &inds;
run;
%do %until(0=&nvars or 0=&nobs);
proc summary data=&outds ;
output out=statistics(drop=_type_ _freq_) &stat(_numeric_)= ;
run;
data _null_;
set statistics;
array __n _numeric_ ;
call symputx("next",nliteral(vname(__n[whichn(&criteria(of __n[*]),of __n[*])])));
call symputx('nvars',dim(__n)-1);
run;
%let &mvar=&&&mvar &next ;
data &outds(drop=&next);
if eof then call symputx('nobs',_n_-1);
set &outds end=eof;
where not missing(&next);
run;
%end;
%mend drop_vars;
If we run it for your test dataset (click spoiler to expand);
options mprint nosymbolgen;
%drop_vars(inds=have,outds=want,mvar=varlist)
%put &=varlist;
we get this log:
4103 options mprint nosymbolgen; 4104 %drop_vars(inds=have,outds=want,mvar=varlist) MPRINT(DROP_VARS): data want; MPRINT(DROP_VARS): set have; MPRINT(DROP_VARS): run; NOTE: The data set WORK.WANT has 30 observations and 7 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): proc summary data=want ; MPRINT(DROP_VARS): output out=statistics(drop=_type_ _freq_) N(_numeric_)= ; MPRINT(DROP_VARS): run; NOTE: The data set WORK.STATISTICS has 1 observations and 6 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data _null_; MPRINT(DROP_VARS): set statistics; MPRINT(DROP_VARS): array __n _numeric_ ; MPRINT(DROP_VARS): call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])]))); MPRINT(DROP_VARS): call symputx('nvars',dim(__n)-1); MPRINT(DROP_VARS): run; NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data want(drop=obs); MPRINT(DROP_VARS): if eof then call symputx('nobs',_n_-1); MPRINT(DROP_VARS): set want end=eof; MPRINT(DROP_VARS): where not missing(obs); MPRINT(DROP_VARS): run; NOTE: The data set WORK.WANT has 30 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): proc summary data=want ; MPRINT(DROP_VARS): output out=statistics(drop=_type_ _freq_) N(_numeric_)= ; MPRINT(DROP_VARS): run; NOTE: The data set WORK.STATISTICS has 1 observations and 5 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.01 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data _null_; MPRINT(DROP_VARS): set statistics; MPRINT(DROP_VARS): array __n _numeric_ ; MPRINT(DROP_VARS): call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])]))); MPRINT(DROP_VARS): call symputx('nvars',dim(__n)-1); MPRINT(DROP_VARS): run; NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data want(drop=var1); MPRINT(DROP_VARS): if eof then call symputx('nobs',_n_-1); MPRINT(DROP_VARS): set want end=eof; MPRINT(DROP_VARS): where not missing(var1); MPRINT(DROP_VARS): run; NOTE: The data set WORK.WANT has 29 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): proc summary data=want ; MPRINT(DROP_VARS): output out=statistics(drop=_type_ _freq_) N(_numeric_)= ; MPRINT(DROP_VARS): run; NOTE: The data set WORK.STATISTICS has 1 observations and 4 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data _null_; MPRINT(DROP_VARS): set statistics; MPRINT(DROP_VARS): array __n _numeric_ ; MPRINT(DROP_VARS): call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])]))); MPRINT(DROP_VARS): call symputx('nvars',dim(__n)-1); MPRINT(DROP_VARS): run; NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data want(drop=var5); MPRINT(DROP_VARS): if eof then call symputx('nobs',_n_-1); MPRINT(DROP_VARS): set want end=eof; MPRINT(DROP_VARS): where not missing(var5); MPRINT(DROP_VARS): run; NOTE: The data set WORK.WANT has 27 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): proc summary data=want ; MPRINT(DROP_VARS): output out=statistics(drop=_type_ _freq_) N(_numeric_)= ; MPRINT(DROP_VARS): run; NOTE: The data set WORK.STATISTICS has 1 observations and 3 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data _null_; MPRINT(DROP_VARS): set statistics; MPRINT(DROP_VARS): array __n _numeric_ ; MPRINT(DROP_VARS): call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])]))); MPRINT(DROP_VARS): call symputx('nvars',dim(__n)-1); MPRINT(DROP_VARS): run; NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds MPRINT(DROP_VARS): data want(drop=var3); MPRINT(DROP_VARS): if eof then call symputx('nobs',_n_-1); MPRINT(DROP_VARS): set want end=eof; MPRINT(DROP_VARS): where not missing(var3); MPRINT(DROP_VARS): run; NOTE: The data set WORK.WANT has 20 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): proc summary data=want ; MPRINT(DROP_VARS): output out=statistics(drop=_type_ _freq_) N(_numeric_)= ; MPRINT(DROP_VARS): run; NOTE: The data set WORK.STATISTICS has 1 observations and 2 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data _null_; MPRINT(DROP_VARS): set statistics; MPRINT(DROP_VARS): array __n _numeric_ ; MPRINT(DROP_VARS): call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])]))); MPRINT(DROP_VARS): call symputx('nvars',dim(__n)-1); MPRINT(DROP_VARS): run; NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data want(drop=var2); MPRINT(DROP_VARS): if eof then call symputx('nobs',_n_-1); MPRINT(DROP_VARS): set want end=eof; MPRINT(DROP_VARS): where not missing(var2); MPRINT(DROP_VARS): run; NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data set. NOTE: The data set WORK.WANT has 19 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds MPRINT(DROP_VARS): proc summary data=want ; MPRINT(DROP_VARS): output out=statistics(drop=_type_ _freq_) N(_numeric_)= ; MPRINT(DROP_VARS): run; NOTE: Compression was disabled for data set WORK.STATISTICS because compression overhead would increase the size of the data set. NOTE: The data set WORK.STATISTICS has 1 observations and 1 variables. NOTE: PROCEDURE SUMMARY used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data _null_; MPRINT(DROP_VARS): set statistics; MPRINT(DROP_VARS): array __n _numeric_ ; MPRINT(DROP_VARS): call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])]))); MPRINT(DROP_VARS): call symputx('nvars',dim(__n)-1); MPRINT(DROP_VARS): run; NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds MPRINT(DROP_VARS): data want(drop=var4); MPRINT(DROP_VARS): if eof then call symputx('nobs',_n_-1); MPRINT(DROP_VARS): set want end=eof; MPRINT(DROP_VARS): where not missing(var4); MPRINT(DROP_VARS): run; NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data set. NOTE: The data set WORK.WANT has 18 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 4105 %put &=varlist; VARLIST=obs var1 var5 var3 var2 var4
So you can see that the variables were dropped in this order:
VARLIST=obs var1 var5 var3 var2 var4
If instead we run it to drop the variable with most missing values:
%drop_vars(inds=have,outds=want,mvar=varlist,stat=nmiss)
We get the variables dropped in this order:
VARLIST=var4 var5 var1 obs var2 var3
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.