fatemeh
Quartz | Level 8

Hello,

I have a multivariable data set and I need to subset it so that, in each retained row, the value of every variable is larger than the 75th percentile of its own column. I would appreciate any help.

***********;
/*Calculate 75 percentile*/
proc means data=mydata  noprint;
  var varb varc varg varh vark    ;
  output out=p75dataset  P75=  / autoname;
run;

proc print data=p75dataset;run;

/*Store 75 percentile in a macro variable*/
data _null_;
  set p75dataset;
  call symputx('p75Mvalue', autoname);
run;



/*Find the subset that values of variables are larger than 75 percentiles*/
data subset;
  set mydata ;
  array var{5} varb varc varg varh vark;
	do i = 1 to 5;
	where var(i)>=&p75Mvalue	;
	end;
  
run;

 

 

Accepted Solutions
mkeintz
PROC Star

Let's work backwards:

 

Your last step has this code:



/*Find the subset that values of variables are larger than 75 percentiles*/
data subset;
  set mydata ;
  array var{5} varb varc varg varh vark;
	do i = 1 to 5;
	where var(i)>=&p75Mvalue	;
	end;
  
run;

You have a where statement with a component specifying an array element.  This has two problems:

  1. You are comparing each variable to the same macro value.  But the variables each probably have their own 75th percentile.
  2. Far more problematic: the WHERE statement is always "outsourced" by the data step to the data engine (that's why you can use a WHERE statement in a PROC as well as a DATA step).  But because it is outsourced, it is not informed of the array definition, so you can't pass expressions like "var{i}" to it.
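
To illustrate the second point, here is a minimal sketch of WHERE filtering expressed only in terms of data set variable names, which is the form the engine can evaluate. It assumes the `mydata` data set from the question; the threshold 10 is an arbitrary placeholder, not a value from the thread.

```sas
/* WHERE in a PROC: the engine filters rows before the procedure sees them */
proc print data=mydata;
  where varb >= 10;   /* 10 is a placeholder threshold */
run;

/* The same filter as a data set option on the SET statement */
data subset_b;
  set mydata(where=(varb >= 10));
run;
```

In both forms the condition names `varb` directly. An array reference such as `var{i}` exists only inside the DATA step, so it cannot appear in either one.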

So, if you want to use macro variables, you probably want something like

where varb >= &varb_p75 and varc >= &varc_p75 and 
      varg >= &varg_p75 and varh >= &varh_p75 and 
      vark >= &vark_p75 ;

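That, in turn, means you have to modify your middle step to create those 5 macrovars.  It currently creates only a single macrovar, p75Mvalue. And because AUTONAME is an option on the OUTPUT statement rather than a variable in P75DATASET, that macrovar receives a missing value instead of a percentile.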

 

So take a look at the output of the proc means, and see how you can loop over the 5 values for the 75th percentiles, writing a single distinctly-named macrovar in each iteration.
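
One possible way to write that loop is sketched below. It assumes `p75dataset` was produced by the PROC MEANS step in the question, so that AUTONAME has named the percentile columns `varb_P75`, `varc_P75`, and so on; check the PROC PRINT output to confirm the names in your session.

```sas
/* Sketch: create one macro variable per percentile column */
data _null_;
  set p75dataset;
  array p{*} varb_P75 varc_P75 varg_P75 varh_P75 vark_P75;
  do i = 1 to dim(p);
    /* vname() returns the column name, which is reused as the macrovar name */
    call symputx(vname(p{i}), p{i});
  end;
run;

/* Use the five macro variables in a WHERE statement */
data subset;
  set mydata;
  where varb >= &varb_p75 and varc >= &varc_p75 and
        varg >= &varg_p75 and varh >= &varh_p75 and
        vark >= &vark_p75;
run;
```

Macro variable names are not case sensitive, so `&varb_p75` resolves to the value stored under `varb_P75`.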

 

BTW, you could avoid macrovars entirely by using an IF statement (instead of WHERE) in the last data step.  You would then need only the PROC MEANS step and the DATA SUBSET step, with an additional "IF _N_=1 then SET P75DATASET;" statement.
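
A minimal sketch of that IF-based alternative follows. It assumes the same `mydata` data set and the AUTONAME column names `varb_P75` ... `vark_P75`. The array names `v` and `p` and the DROP list are illustrative choices, not part of the original posts.

```sas
proc means data=mydata noprint;
  var varb varc varg varh vark;
  output out=p75dataset (drop=_type_ _freq_) p75= / autoname;
run;

data subset;
  /* Read the single row of percentiles once; variables read by SET are retained */
  if _n_ = 1 then set p75dataset;
  set mydata;
  array v{5} varb varc varg varh vark;
  array p{5} varb_P75 varc_P75 varg_P75 varh_P75 vark_P75;
  do i = 1 to dim(v);
    /* Drop the row as soon as one variable falls below its own percentile.
       A missing value sorts below any number, so it is dropped as well. */
    if v{i} < p{i} then delete;
  end;
  drop i varb_P75 varc_P75 varg_P75 varh_P75 vark_P75;
run;
```

Because IF is evaluated inside the DATA step, array references are allowed here, and each variable is compared with its own percentile. This keeps rows where every value is greater than or equal to its percentile, matching the `>=` used earlier in the thread; change the test to `v{i} <= p{i}` if strictly larger values are required.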
