Help using Base SAS procedures

how to subset efficiently based on cut off value from other dataset?

Reply
Regular Contributor
Posts: 188

how to subset efficiently based on cut off value from other dataset?

I have a dataset A that contains some 600+ variables.

I have another dataset B which contains variable name and corresponding percentile values. I want to check say 90percentile value from this dataset B for a variable and remove all observations from dataset A which are greater than that value for same variable. Basically outlier removal.

But doing this manually for so many variables using if condition or doing macro call for 600 variables will be very time consuming.

Is there any efficient way of doing this?

Super User
Super User
Posts: 7,417

Re: how to subset efficiently based on cut off value from other dataset?

HI,

Well, guessing as no test data (in form of dataset) or output is provided:

merge dataaset B to A based on key identifiers.

then do something like:

data want;

     set have;

     array all_my_vars{600};  /* Assumes they all have prefix of all_my_vars */

     do i=1 to 600;

          if all_my_vars{i} > percentile_merged_on_from_b then all_my_vars{i}=.;

     end;

run;

Regular Contributor
Posts: 188

Re: how to subset efficiently based on cut off value from other dataset?

the problem is all my variables have customized names. and for merging based on variable names, i believe require transpose?

Super User
Super User
Posts: 7,417

Re: how to subset efficiently based on cut off value from other dataset?

Yes, you will have to normalise your data then.  In all fairness, unless there is a specific reason to have 600+ variables (which is very rare), its best to keep them normalised anyways.  Something like:

proc transpose data=have out=want;

     by <id_vars>;

     var  var1 varx vat...;

run;

If there is too many to type by hand, then you can use the to modifier if your on later SAS versions:

     var var1-vat;

Super User
Posts: 9,687

Re: how to subset efficiently based on cut off value from other dataset?

Check IML code:

https://communities.sas.com/message/282057#282057

Code: Program

data have;
call streaminit(1234);
do i=1 to 100;
  a=ceil(rand('uniform')*100);
  b=ceil(rand('uniform')*100);
  output;
end;
drop i;
run;


%let high=0.9;

proc iml;
use have;
read all var _num_ into x[c=vname];
close have;

call qntl(q,x,{&high});
do i=1 to ncol(x);
x[loc(x[,i]>q[i]),i]=.;
end;

create want from x[c=vname];
append from x;
close want;
quit;


Xia Keshan

Regular Contributor
Posts: 188

Re: how to subset efficiently based on cut off value from other dataset?

Thanks Xia, your inputs have always been great. But IML comes as separate package for SAS and we are using enterprise guide.

Respected Advisor
Posts: 3,901

Re: how to subset efficiently based on cut off value from other dataset?

If you have IML licensed (on the server) then you can run such code out of SAS Enterprise Guide (the client). The client sends code to the server for execution.

Regular Contributor
Posts: 188

Re: how to subset efficiently based on cut off value from other dataset?

NO, they don't have IML.

Super User
Posts: 9,687

Re: how to subset efficiently based on cut off value from other dataset?

How about download SAS University Edition at sas.com . It is totally free.

I still remember Reeza write a macro to winsorize a data set . Maybe you can call her to change her code to suit your requirement?

Search :  winsorize + Reeza  at this forum .

Xia Keshan

Regular Contributor
Posts: 188

Re: how to subset efficiently based on cut off value from other dataset?

I tried creating all the values in macro variable for 600 variable names and 600 values and tried running the do loop:

DATA LIB.SUB_FINAL1;

SET LIB.SUB_FINAL;

  %do i=1 %to 600 %by 1;

  %IF &&VARNAME&I=. %THEN &&VARNAME&I=&&PERC_MEDN&I;

%end;

RUN;

Thought values are resolving correctly, but its not doing, what it is intended to do.

e.g. log varname344 resolved to amount_ex

%IF condition &&VARNAME&I=. is FALSE

now amount_ex is present in the dataset, but why it is saying false? does it run only over one row? do i need to use some other counter like j=1 to 80000 to run over all rows?

Regular Contributor
Posts: 188

Re: how to subset efficiently based on cut off value from other dataset?

It is interesting, i removed the % from if condition and now, it seems to be working fine within macro. I also created 600 macro values each for upper and lower percentile and used the same condition, so basically it writes those many if statements with loop.

Super User
Posts: 17,912

Re: how to subset efficiently based on cut off value from other dataset?

Here's the macro that Xia was referring to, depending on how big your dataset is this may not be efficient, but its simple to understand and limits the number of macro variables so it's easy to debug.

SAS Box-Plot/Tukey Method of Capping Outliers · GitHub

Super User
Super User
Posts: 6,502

Re: how to subset efficiently based on cut off value from other dataset?

Use dataset B to generate the code to subset dataset A.

filename code temp;


data _null_;

  set b;

  put 'if ' var '>' value 'then delete;' ;

run;


data want ;

  set A;

  %include code / source2 ;

run;

Ask a Question
Discussion stats
  • 12 replies
  • 493 views
  • 0 likes
  • 6 in conversation