Good day!
It seems that my first post was somehow marked as SPAM so I apologize if you are seeing this post a second time.
I have the following problem: I want to remove variables in a data set that contain more than 90% of the following values: -1,-2,-3,-4,-5,-6,-7. Stated otherwise, If a variable (of which there are 675) is made up of more than 90% of the above listed values, it should be removed from the data set. Below is an example of the data set I am working with.
ID | Target Variable | Var1 | Var2 | ... | Var675 |
001 | 1 | -2 | -2 | ... | -2 |
002 | 1 | -2 | -2 | ... | -3 |
003 | 0 | -2 | -2 | ... | 1500 |
004 | 1 | 5 | -3 | ... | -3 |
005 | 0 | -3 | 18 | ... | -2 |
Since there is a large number of columns in the data set I figured that I will probably need a do loop to do this. Therefore, I created the following macro to remove variables for any given data set that contain more than 90% of these special values.
%macro control(dsn, method = Quantile);
%let lib = %upcase(%scan(&dsn,1,.));
%let dsn = %upcase(%scan(&dsn,2,.));
%let method = %upcase(&method);
/* Creatign macro's for each column in the data set */
data _null_;
set sashelp.vcolumn(keep=libname memname name type
where=(libname = "&lib" AND memname = "&dsn" and type = "num")) end=eof;
call symputx('col'||strip(_n_),strip(name));
if eof then call symputx('max',strip(_n_));
run;
%do i = 1 %to &max;
I assume that this is were the code will be written to determine the percentage of special
values in a variable and then drop it from the original data set if it exceeds 90%.
%mend control;
Thank you in advance!
Thank you @sbxkoenk for the assistance.
I also managed to write the following code and it seems to work/provide the desired output.
Firstly, I created an empty table with the necessary specification before the macro as follows:
proc sql;
Create table Datasets.Special_Perc_&Split
(Variable char length = 23 format=$50.,
Sum_Special num length = 8 format = best12.,
Sum_positive num length = 8 format = best12.,
Percentage num length = 8 format = Percent6.2);
quit;
Secondly, I wrote a macro to calculate the percentage of special values for each column in any given data set. This is then added to the table that I created above for each iteration of the loop.
%macro Special(dsn);
%let lib = %upcase(%scan(&dsn,1,.));
%let dsn = %upcase(%scan(&dsn,2,.));
/* Creatign macro's for each column in the data set */
data _null_;
set sashelp.vcolumn(keep=libname memname name type
where=(libname = "&lib" AND memname = "&dsn" and type = "num")) end=eof;
call symputx('col'||strip(_n_),strip(name));
if eof then call symputx('max',strip(_n_));
run;
%do i = 1 %to &max;
proc sql noprint;
select sum(&&col&i in (-1,-2,-3,-4,-5,-6,-7)) as NegS,
sum(&&col&i not in (-1,-2,-3,-4,-5,-6,-7)) as Pos,
round(calculated NegS/(calculated NegS + calculated Pos), .01) as Perct
into :SN, :SP, :Perc
from &lib..&dsn;
quit;
data b;
Variable = "&&col&i";
Sum_Special = &sn;
Sum_positive = &sp;
Percentage = &Perc;
run;
proc append base = Datasets.Special_Perc_&split data = b force;
run;
proc sql;
drop table b;
quit;
%end;
%mend special;
%special(Datasets.Longevity_Numeric_&split);
Lastly, i use a proc sql to exclude those variables that have a special value percentage of more than 90%.
It may not be the best way to do it, but it got the job done. 😀
Hello @EduanR ,
Something like the below :
data work.have;
input ID $ Target_Variable Var1 Var2 Var3;
cards;
001 1 -2 -2 -2
002 1 -2 -2 -3
003 0 -2 -2 1500
004 1 5 -3 -3
005 0 -3 18 -2
;
run;
data work.have1(drop=i);
set work.have;
array myvars{3} Var1- Var3;
array countvar{3} c_Var1-c_Var3;
do i=1 to dim(myvars);
if myvars(i) in (-1,-2,-3,-4,-5,-6,-7) then countvar(i)=1; else countvar(i)=0;
end;
run;
ods trace off;
ods output OneWayFreqs=work.OneWayFreqs;
PROC FREQ data=work.have1 /*noprint*/;
tables c_Var1-c_Var3 / list /* missing out=work.myds */;
run;
data work.vars_to_throw_away;
set work.OneWayFreqs;
where c_Var1=1 OR c_Var2=1 OR c_Var3=1;
where also Percent >= 80;
run;
PROC SQL noprint;
select scan(Table,2,'_') into :Lijst separated by ' '
from work.vars_to_throw_away;
QUIT;
%PUT &=Lijst;
data work.want;
set work.have(drop=&Lijst.);
run;
/* end of program */
Koen
Thank you @sbxkoenk for the assistance.
I also managed to write the following code and it seems to work/provide the desired output.
Firstly, I created an empty table with the necessary specification before the macro as follows:
proc sql;
Create table Datasets.Special_Perc_&Split
(Variable char length = 23 format=$50.,
Sum_Special num length = 8 format = best12.,
Sum_positive num length = 8 format = best12.,
Percentage num length = 8 format = Percent6.2);
quit;
Secondly, I wrote a macro to calculate the percentage of special values for each column in any given data set. This is then added to the table that I created above for each iteration of the loop.
%macro Special(dsn);
%let lib = %upcase(%scan(&dsn,1,.));
%let dsn = %upcase(%scan(&dsn,2,.));
/* Creatign macro's for each column in the data set */
data _null_;
set sashelp.vcolumn(keep=libname memname name type
where=(libname = "&lib" AND memname = "&dsn" and type = "num")) end=eof;
call symputx('col'||strip(_n_),strip(name));
if eof then call symputx('max',strip(_n_));
run;
%do i = 1 %to &max;
proc sql noprint;
select sum(&&col&i in (-1,-2,-3,-4,-5,-6,-7)) as NegS,
sum(&&col&i not in (-1,-2,-3,-4,-5,-6,-7)) as Pos,
round(calculated NegS/(calculated NegS + calculated Pos), .01) as Perct
into :SN, :SP, :Perc
from &lib..&dsn;
quit;
data b;
Variable = "&&col&i";
Sum_Special = &sn;
Sum_positive = &sp;
Percentage = &Perc;
run;
proc append base = Datasets.Special_Perc_&split data = b force;
run;
proc sql;
drop table b;
quit;
%end;
%mend special;
%special(Datasets.Longevity_Numeric_&split);
Lastly, i use a proc sql to exclude those variables that have a special value percentage of more than 90%.
It may not be the best way to do it, but it got the job done. 😀
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.