Help using Base SAS procedures

how to subset efficiently based on cut off value from other dataset?

Reply
Regular Contributor
Posts: 190

how to subset efficiently based on cut off value from other dataset?

I have a dataset A that contains some 600+ variables.

I have another dataset B which contains variable name and corresponding percentile values. I want to check say 90percentile value from this dataset B for a variable and remove all observations from dataset A which are greater than that value for same variable. Basically outlier removal.

But doing this manually for so many variables using if condition or doing macro call for 600 variables will be very time consuming.

Is there any efficient way of doing this?

Super User
Super User
Posts: 7,942

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

HI,

Well, guessing as no test data (in form of dataset) or output is provided:

merge dataaset B to A based on key identifiers.

then do something like:

data want;

     set have;

     array all_my_vars{600};  /* Assumes they all have prefix of all_my_vars */

     do i=1 to 600;

          if all_my_vars{i} > percentile_merged_on_from_b then all_my_vars{i}=.;

     end;

run;

Regular Contributor
Posts: 190

Re: how to subset efficiently based on cut off value from other dataset?

the problem is all my variables have customized names. and for merging based on variable names, i believe require transpose?

Super User
Super User
Posts: 7,942

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

Yes, you will have to normalise your data then.  In all fairness, unless there is a specific reason to have 600+ variables (which is very rare), its best to keep them normalised anyways.  Something like:

proc transpose data=have out=want;

     by <id_vars>;

     var  var1 varx vat...;

run;

If there is too many to type by hand, then you can use the to modifier if your on later SAS versions:

     var var1-vat;

Super User
Posts: 10,018

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

Check IML code:

https://communities.sas.com/message/282057#282057

Code: Program

data have;
call streaminit(1234);
do i=1 to 100;
  a=ceil(rand('uniform')*100);
  b=ceil(rand('uniform')*100);
  output;
end;
drop i;
run;


%let high=0.9;

proc iml;
use have;
read all var _num_ into x[c=vname];
close have;

call qntl(q,x,{&high});
do i=1 to ncol(x);
x[loc(x[,i]>q[i]),i]=.;
end;

create want from x[c=vname];
append from x;
close want;
quit;


Xia Keshan

Regular Contributor
Posts: 190

Re: how to subset efficiently based on cut off value from other dataset?

Thanks Xia, your inputs have always been great. But IML comes as separate package for SAS and we are using enterprise guide.

Respected Advisor
Posts: 4,173

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

If you have IML licensed (on the server) then you can run such code out of SAS Enterprise Guide (the client). The client sends code to the server for execution.

Regular Contributor
Posts: 190

Re: how to subset efficiently based on cut off value from other dataset?

NO, they don't have IML.

Super User
Posts: 10,018

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

How about download SAS University Edition at sas.com . It is totally free.

I still remember Reeza write a macro to winsorize a data set . Maybe you can call her to change her code to suit your requirement?

Search :  winsorize + Reeza  at this forum .

Xia Keshan

Regular Contributor
Posts: 190

Re: how to subset efficiently based on cut off value from other dataset?

I tried creating all the values in macro variable for 600 variable names and 600 values and tried running the do loop:

DATA LIB.SUB_FINAL1;

SET LIB.SUB_FINAL;

  %do i=1 %to 600 %by 1;

  %IF &&VARNAME&I=. %THEN &&VARNAME&I=&&PERC_MEDN&I;

%end;

RUN;

Thought values are resolving correctly, but its not doing, what it is intended to do.

e.g. log varname344 resolved to amount_ex

%IF condition &&VARNAME&I=. is FALSE

now amount_ex is present in the dataset, but why it is saying false? does it run only over one row? do i need to use some other counter like j=1 to 80000 to run over all rows?

Regular Contributor
Posts: 190

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

It is interesting, i removed the % from if condition and now, it seems to be working fine within macro. I also created 600 macro values each for upper and lower percentile and used the same condition, so basically it writes those many if statements with loop.

Super User
Posts: 19,770

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

Here's the macro that Xia was referring to, depending on how big your dataset is this may not be efficient, but its simple to understand and limits the number of macro variables so it's easy to debug.

SAS Box-Plot/Tukey Method of Capping Outliers · GitHub

Super User
Super User
Posts: 7,039

Re: how to subset efficiently based on cut off value from other dataset?

Posted in reply to munitech4u

Use dataset B to generate the code to subset dataset A.

filename code temp;


data _null_;

  set b;

  put 'if ' var '>' value 'then delete;' ;

run;


data want ;

  set A;

  %include code / source2 ;

run;

Ask a Question
Discussion stats
  • 12 replies
  • 497 views
  • 0 likes
  • 6 in conversation