BookmarkSubscribeRSS Feed
Satori
Quartz | Level 8

my code:

 

proc means noprint data=bv31 n; output out=counts(drop=_type_ _freq_) n=;

data high; set counts; array vars(*) _numeric_; max = max(of vars(*));
do i = 1 to dim(vars);
  if vars(i) = max then do;
  varname = vname(vars(i));
  countval = vars(i);
  output;
  end;
end;
keep varname countval;

data test; set high; call symputx('varname', varname);

data dyna; set bvd.bv31; if missing (&varname) then delete; drop &varname;

How can I automate this code to do the following:

1) run the proc means precedure again (now with one variable less)

2) select the variable with the highest number of non-missing observations, and add it to the dataset high (add the var name and its value)

3) remove all missing observations for last variable added to the high dataset, and drop the variable

4) then continue doing this until there are no variables left.

 

sample data:

data have;
input obs var1 var2 var3 var4 var5 var6 $1.;
cards;
1  5 0 5 0 3 7
2  1 0 1 0 . 1
3  7 1 7 0 3 1
4  0 . . . 2 0
5  4 0 0 4 1 6
6  4 0 7 3 7 2
7  5 0 8 5 1 0
8  1 7 5 4 9 3
9  2 0 0 2 1 0
10 0 . . . 3 0
11 4 0 0 4 1 0
12 1 1 3 2 . 6
13 3 1 1 6 1 1
14 0 . . . 9 0
15 2 1 2 4 1 0
16 4 1 4 4 1 0
17 . 1 1 4 1 0
18 2 0 0 2 1 0
19 1 1 1 1 2 1
20 0 . . . 7 0
21 0 3 1 . 1 0
22 2 0 2 0 2 0
23 0 . . . 2 0
24 0 . . . 5 0
25 9 0 8 1 1 0
26 0 . . . 1 0
27 0 . 2 . 1 0
28 2 0 2 0 6 2
29 2 0 2 0 4 1
30 2 0 0 2 2 0
;
13 REPLIES 13
Reeza
Super User
Please provide some fake data for testing.
Satori
Quartz | Level 8
fake date added
Reeza
Super User
You're aware that your code description and requirements/task list don't align? Which one is correct, the code or your requirements?
Satori
Quartz | Level 8
Requirements
Reeza
Super User

Based on your code posted, this the requested solution.

You need to loop n-1 otherwise you drop the last variable and have an empty data set at the end, unless that's what you want. 

 

data have;
input obs var1 var2 var3 var4 var5 var6 $1.;
cards;
1  5 0 5 0 3 7
2  1 0 1 0 . 1
3  7 1 7 0 3 1
4  0 . . . 2 0
5  4 0 0 4 1 6
6  4 0 7 3 7 2
7  5 0 8 5 1 0
8  1 7 5 4 9 3
9  2 0 0 2 1 0
10 0 . . . 3 0
11 4 0 0 4 1 0
12 1 1 3 2 . 6
13 3 1 1 6 1 1
14 0 . . . 9 0
15 2 1 2 4 1 0
16 4 1 4 4 1 0
17 . 1 1 4 1 0
18 2 0 0 2 1 0
19 1 1 1 1 2 1
20 0 . . . 7 0
21 0 3 1 . 1 0
22 2 0 2 0 2 0
23 0 . . . 2 0
24 0 . . . 5 0
25 9 0 8 1 1 0
26 0 . . . 1 0
27 0 . 2 . 1 0
28 2 0 2 0 6 2
29 2 0 2 0 4 1
30 2 0 0 2 2 0
;

option mprint symbolgen;

%let nvars = 6;

%macro iterate(dsnin=);
ods select none;
proc means data=&dsnin. n stackods; 
ods output summary = counts;
run;
ods select all;

proc sort data=counts;
by descending n;
run;

data highest;
set counts(obs=1);
call symputx('varname', variable);
run;


data &dsnin.; 
set &dsnin.; 
if missing (&varname) then delete; 
drop &varname;
run;

%mend;

%macro loop(n=);
%do i=1 %to &n;
%iterate(dsnin=have);
%end;
%mend;

%loop(n=&nvars.);

Satori
Quartz | Level 8
Using this code, how can I store the variable names and its values, before they are deleted, so that in the end of the process I have a table with two columns, the first column being the variable name and the second column being its value before deletion?
Tom
Super User Tom
Super User

@Satori wrote:
Using this code, how can I store the variable names and its values, before they are deleted, so that in the end of the process I have a table with two columns, the first column being the variable name and the second column being its value before deletion?

That does not make any sense.  A variable has MANY values.  Each observation in a dataset as a value for every variable in the dataset.

Reeza
Super User
What "values"?
Reeza
Super User
data have;
input obs var1 var2 var3 var4 var5 var6 $1.;
cards;
1  5 0 5 0 3 7
2  1 0 1 0 . 1
3  7 1 7 0 3 1
4  0 . . . 2 0
5  4 0 0 4 1 6
6  4 0 7 3 7 2
7  5 0 8 5 1 0
8  1 7 5 4 9 3
9  2 0 0 2 1 0
10 0 . . . 3 0
11 4 0 0 4 1 0
12 1 1 3 2 . 6
13 3 1 1 6 1 1
14 0 . . . 9 0
15 2 1 2 4 1 0
16 4 1 4 4 1 0
17 . 1 1 4 1 0
18 2 0 0 2 1 0
19 1 1 1 1 2 1
20 0 . . . 7 0
21 0 3 1 . 1 0
22 2 0 2 0 2 0
23 0 . . . 2 0
24 0 . . . 5 0
25 9 0 8 1 1 0
26 0 . . . 1 0
27 0 . 2 . 1 0
28 2 0 2 0 6 2
29 2 0 2 0 4 1
30 2 0 0 2 2 0
;

option mprint symbolgen;

%let nvars = 6;

*drop table of variable list before run;
proc sql;
drop table highest_list;
quit;

%macro iterate(dsnin=);
ods select none;
proc means data=&dsnin. n stackods; 
ods output summary = counts;
run;
ods select all;

proc sort data=counts;
by descending n;
run;

data highest;
set counts(obs=1);
call symputx('varname', variable);
run;

proc append base = drop_list data=highest force;
run;

data &dsnin.; 
set &dsnin.; 
if missing (&varname) then delete; 
drop &varname;
run;

%mend;

%macro loop(n=);
%do i=1 %to &n;
%iterate(dsnin=have);
%end;
%mend;

%loop(n=&nvars.);

Assuming you mean the # of non missing observation as your 'values' see above. 

You really aren't tracking much in this process so I'm assuming it's a skeleton code that you're going to expand otherwise, it doesn't seem to be providing much information at all. 

PaigeMiller
Diamond | Level 26

Weeks have gone by since you started discussing this problem, and you are still trying to eliminate variables that have lots of missings, except now you have rephrased the problem to keep variables that have lots of non-missings.

 

You have avoided (apparently) other suggestions that might work here, such as @Reeza's suggestion to use the missing pattern report from PROC MI, and my suggestion about fitting models when there is lots of missing data. https://communities.sas.com/t5/SAS-Programming/write-macro-or-automate-process/m-p/863200#M340998

 

People give you advice here because it appears you are working very hard to come up with a not very good solution. We want to show you a smarter path. We know that smart people have worked on the problem of missing values, and come up with some smart solutions. You should probably use those smart solutions instead of inventing your own.

 

 

--
Paige Miller
Satori
Quartz | Level 8
Honestly, I don't understand why you keep replying if you don't have any intention of helping. This issue has nothing to do with smart solutions. I have explained before that it has to be done this way, if nothing else, because that is what my boss wants me to do. Also, you don't have the full picture (and neither do I by the way), but I know at least that the final goal has nothing to do with eliminating variables with lots of missings. The idea for now is just to see how the number of non-missing values decreases as variables are added.

So please if don't want to help me do this, I would appreciate it if you refrain from posting patronizing speeches because that does not help me at all.
ballardw
Super User

Suggestion:

ACTUALLY walk through, showing the result of each manual step, NOT just the "highest count added" to a set. Show the actually removal of observations, the new set and then followup, for all the variables in your example data.

 

I am not sure I exactly which observations, variables or values are deleted or dropped.

 

I also wonder about rules for tie breaking. What if 3 variables have the same number of non-missing (or missing ) values (especially if on different observations) at the first pass. What rule is to be used for selecting the variable? Or is more than one variable to be selected?

This may be moderately critical in one way of determining  variable and removing observations if these tied variables are often having the missing on the same observations it is quite possible that of 3 variables tied for most missing in the first pass that none of them have the most missing on the second pass.

I get from interpretting "3) remove all missing observations for last variable added to the high dataset, and drop the variable" as "remove the observation with missing values for the selected variable". Because you cannot have any "missing observations" in a SAS data set.

Tom
Super User Tom
Super User

So for NUMERIC variables you can use PROC MEANS to calculate the N statistic to count the number of non-missing values.

proc summary data=have;
  output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
run;

Now you can use the MAX() function to find the maximum value of those N statistics.  And then the WHICHN() function to determine which of the variables has that value (this is take the first one with that count).  Once you that index number you can use it with the VNAME() function to find the NAME of the variable.  You can use the NLITERAL() function to convert the name to a string you could use in SAS code to refer to the variable.  You can use CALL SYMPUTX() to write the name into a macro variable.  While we are at it let's see how many variables we have left after we drop the selected variable.

data _null_;
  set statistics;
  array __n _numeric_ ;
  call symputx("next",nliteral(vname(__n[whichn(MAX(of __n[*]),of __n[*])])));
  call symputx('nvars',dim(__n)-1);
run;

Now that we know the name of the variable to drop we can use it to make a new copy of the dataset without the observations that have missing values of that variable, and also remove that variable. While we are at it let's also see how many observations are now left. There no sense continuing if there are no more observations.

data want(drop=&next);
  if eof then call symputx('nobs',_n_-1);
  set want end=eof;
  where not missing(&next);
run;

We can then put it together in a macro definition so that we can loop until we run out of observations to delete or variables to drop.

 

But you have asked this question multiple times using different selection criteria, so let's make more parameters to the macro to control that also.  We could use the NMISS() function instead of the N() function with PROC SUMMARY.  (or really any statistic that PROC MEANS can calculate).  And we could select the variable with MIN() value instead of the MAX() value.  So let's add the STAT parameter and the CRITERIA parameter to the macro definition.

 

Here is the result:

%macro drop_vars
(inds /* input dataset name */
,outds /* output dataset name */
,mvar  /* macro variable to store the dropped variable */
,stat=N /* What statistic to calculate */
,criteria=max /* What criteria to use for picking variable */
);
%local next nvars nobs;
%if not %symexist(&mvar) %then %global &mvar;
%let &mvar=;
data &outds;
 set &inds;
run;

%do %until(0=&nvars or 0=&nobs);

proc summary data=&outds ;
  output out=statistics(drop=_type_ _freq_) &stat(_numeric_)= ;
run;
data _null_;
  set statistics;
  array __n _numeric_ ;
  call symputx("next",nliteral(vname(__n[whichn(&criteria(of __n[*]),of __n[*])])));
  call symputx('nvars',dim(__n)-1);
run;
%let &mvar=&&&mvar &next ;
data &outds(drop=&next);
  if eof then call symputx('nobs',_n_-1);
  set &outds end=eof;
  where not missing(&next);
run;
%end;
%mend drop_vars;

If we run it for your test dataset (click spoiler to expand);

 

Spoiler

 

options mprint nosymbolgen;
%drop_vars(inds=have,outds=want,mvar=varlist)
%put &=varlist;

we get this log:

4103  options mprint nosymbolgen;
4104  %drop_vars(inds=have,outds=want,mvar=varlist)
MPRINT(DROP_VARS):   data want;
MPRINT(DROP_VARS):   set have;
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.WANT has 30 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   proc summary data=want ;
MPRINT(DROP_VARS):   output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.STATISTICS has 1 observations and 6 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data _null_;
MPRINT(DROP_VARS):   set statistics;
MPRINT(DROP_VARS):   array __n _numeric_ ;
MPRINT(DROP_VARS):   call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])])));
MPRINT(DROP_VARS):   call symputx('nvars',dim(__n)-1);
MPRINT(DROP_VARS):   run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data want(drop=obs);
MPRINT(DROP_VARS):   if eof then call symputx('nobs',_n_-1);
MPRINT(DROP_VARS):   set want end=eof;
MPRINT(DROP_VARS):   where not missing(obs);
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.WANT has 30 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   proc summary data=want ;
MPRINT(DROP_VARS):   output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.STATISTICS has 1 observations and 5 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data _null_;
MPRINT(DROP_VARS):   set statistics;
MPRINT(DROP_VARS):   array __n _numeric_ ;
MPRINT(DROP_VARS):   call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])])));
MPRINT(DROP_VARS):   call symputx('nvars',dim(__n)-1);
MPRINT(DROP_VARS):   run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data want(drop=var1);
MPRINT(DROP_VARS):   if eof then call symputx('nobs',_n_-1);
MPRINT(DROP_VARS):   set want end=eof;
MPRINT(DROP_VARS):   where not missing(var1);
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.WANT has 29 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   proc summary data=want ;
MPRINT(DROP_VARS):   output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.STATISTICS has 1 observations and 4 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data _null_;
MPRINT(DROP_VARS):   set statistics;
MPRINT(DROP_VARS):   array __n _numeric_ ;
MPRINT(DROP_VARS):   call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])])));
MPRINT(DROP_VARS):   call symputx('nvars',dim(__n)-1);
MPRINT(DROP_VARS):   run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data want(drop=var5);
MPRINT(DROP_VARS):   if eof then call symputx('nobs',_n_-1);
MPRINT(DROP_VARS):   set want end=eof;
MPRINT(DROP_VARS):   where not missing(var5);
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.WANT has 27 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   proc summary data=want ;
MPRINT(DROP_VARS):   output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.STATISTICS has 1 observations and 3 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data _null_;
MPRINT(DROP_VARS):   set statistics;
MPRINT(DROP_VARS):   array __n _numeric_ ;
MPRINT(DROP_VARS):   call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])])));
MPRINT(DROP_VARS):   call symputx('nvars',dim(__n)-1);
MPRINT(DROP_VARS):   run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


MPRINT(DROP_VARS):   data want(drop=var3);
MPRINT(DROP_VARS):   if eof then call symputx('nobs',_n_-1);
MPRINT(DROP_VARS):   set want end=eof;
MPRINT(DROP_VARS):   where not missing(var3);
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.WANT has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   proc summary data=want ;
MPRINT(DROP_VARS):   output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
MPRINT(DROP_VARS):   run;

NOTE: The data set WORK.STATISTICS has 1 observations and 2 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data _null_;
MPRINT(DROP_VARS):   set statistics;
MPRINT(DROP_VARS):   array __n _numeric_ ;
MPRINT(DROP_VARS):   call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])])));
MPRINT(DROP_VARS):   call symputx('nvars',dim(__n)-1);
MPRINT(DROP_VARS):   run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data want(drop=var2);
MPRINT(DROP_VARS):   if eof then call symputx('nobs',_n_-1);
MPRINT(DROP_VARS):   set want end=eof;
MPRINT(DROP_VARS):   where not missing(var2);
MPRINT(DROP_VARS):   run;

NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data set.
NOTE: The data set WORK.WANT has 19 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


MPRINT(DROP_VARS):   proc summary data=want ;
MPRINT(DROP_VARS):   output out=statistics(drop=_type_ _freq_) N(_numeric_)= ;
MPRINT(DROP_VARS):   run;

NOTE: Compression was disabled for data set WORK.STATISTICS because compression overhead would increase the size of the data set.
NOTE: The data set WORK.STATISTICS has 1 observations and 1 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data _null_;
MPRINT(DROP_VARS):   set statistics;
MPRINT(DROP_VARS):   array __n _numeric_ ;
MPRINT(DROP_VARS):   call symputx("next",nliteral(vname(__n[whichn(max(of __n[*]),of __n[*])])));
MPRINT(DROP_VARS):   call symputx('nvars',dim(__n)-1);
MPRINT(DROP_VARS):   run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


MPRINT(DROP_VARS):   data want(drop=var4);
MPRINT(DROP_VARS):   if eof then call symputx('nobs',_n_-1);
MPRINT(DROP_VARS):   set want end=eof;
MPRINT(DROP_VARS):   where not missing(var4);
MPRINT(DROP_VARS):   run;

NOTE: Compression was disabled for data set WORK.WANT because compression overhead would increase the size of the data set.
NOTE: The data set WORK.WANT has 18 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


4105  %put &=varlist;
VARLIST=obs var1 var5 var3 var2 var4

 

So you can see that the variables were dropped in this order: 

VARLIST=obs var1 var5 var3 var2 var4

 

If instead we run it to drop the variable with most missing values:

%drop_vars(inds=have,outds=want,mvar=varlist,stat=nmiss)

We get the variables dropped in this order:

VARLIST=var4 var5 var1 obs var2 var3

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 738 views
  • 1 like
  • 5 in conversation