I have a dataset with 644 variables and around 1.2 mln rows. All the variables are numeric except the id column. My task is to apply a WOE transformation to every variable in the dataset. In other words I need to combine continuous variables into groups by using predefined min max ranges.Consider the following example:
We have a continuous variable X that ranges from 0 - 1 and takes on values such as 0.26.Our transformed variable has two groups and a breaking point at 0.5. Based on these inputs all the values X that are below 0.5 should be converted to 1 and all the values above that to 2.
I wrote the following program to do this:
data have1;
Call streaminit(1);
do id = 1 to 20;
x1 = rand('uniform');
x2 = rand('uniform');
x3 = rand('uniform');
output;
end;
run;
data have2;
infile datalines delimiter=',';
input var $ group min max;
datalines;
x1,1,0,0.4
x1,2,0.4,1
x2,1,0,0.6
x2,2,0.6,1
x3,1,0,0.8
x3,2,0.8,1
;
%Macro example_transformation();
Proc contents data=have1(drop=id) out=var_list(keep=name);
Run;
Proc sql;
Select count(*) into:n_vars from var_list;
Quit;
Data results(index=(id));
Set have1(keep=id);
Run;
%Let var_x = 1;
%Do %while (&var_x <= &n_vars);
Data _null_;
Set var_list;
If _n_ = &var_x then call symputx('var',name);
Run;
%Put &var;
Data var_params;
Set have2;
Where var = "&var";
Key = 1;
Run;
Data var_transform(rename=group=&var._new drop=rc &var key min max index=(id) );
If _n_ = 0 then set var_params(keep=group min max);
If _n_ = 1 then do;
Dcl hash h(dataset:"var_params", multidata:"Y");
h.definekey('key');
h.definedata('min','max','group');
h.definedone();
End;
Set have1(keep=id &var);
key = 1;
rc = h.find();
Do while (rc = 0);
If min <= &var < max or (min = &var = max) then output;
rc = h.find_next();
End;
Run;
Data results(index=(id));
Merge results var_transform;
By id;
Run;
%Let var_x = %eval(&var_x + 1);
%End;
%Mend;
%example_transformation()
The problem is that when this macro is applied to my full dataset it takes around 18 hours to execute. The most time consuming step is the merge datastep, because it iteratively adds more data to the results table. My question is whether there is any way to speed up this process? For example if the merge was vertical instead of horizontal then the PROC APPEND method would speed things up quite significantly. However as far as I know SAS does not have a horizontal equivalent of PROC APPEND, or does it?
... View more