Solved: Re: How to improve speed of macro with iterative merging of dataset

kashlik123 · Posted 12-04-2021 12:07 PM

I have a dataset with 644 variables and around 1.2 mln rows. All the variables are numeric except the id column. My task is to apply a WOE transformation to every variable in the dataset. In other words I need to combine continuous variables into groups by using predefined min max ranges.Consider the following example:

We have a continuous variable X that ranges from 0 - 1 and takes on values such as 0.26.Our transformed variable has two groups and a breaking point at 0.5. Based on these inputs all the values X that are below 0.5 should be converted to 1 and all the values above that to 2.

I wrote the following program to do this:

data have1;
Call streaminit(1);
do id = 1 to 20;
	x1 = rand('uniform');
	x2 = rand('uniform');
	x3 = rand('uniform');
	output;
end;
run;

data have2;
infile datalines delimiter=',';
input var $ group min max;
datalines;
x1,1,0,0.4
x1,2,0.4,1
x2,1,0,0.6
x2,2,0.6,1
x3,1,0,0.8
x3,2,0.8,1
;

%Macro example_transformation();
	Proc contents data=have1(drop=id) out=var_list(keep=name);
	Run;

	Proc sql;
	Select count(*) into:n_vars from var_list;
	Quit;

	Data results(index=(id));
	Set have1(keep=id);
	Run;

	%Let var_x = 1;
	%Do %while (&var_x <= &n_vars);
		Data _null_;
		Set var_list;
		If _n_ = &var_x then call symputx('var',name);
		Run;
		%Put &var;

		Data var_params;
		Set have2;
		Where var = "&var";
		Key = 1;
		Run;

		Data var_transform(rename=group=&var._new drop=rc &var key min max index=(id) );
		If _n_ = 0 then set var_params(keep=group min max);
		If _n_ = 1 then do;
			Dcl hash h(dataset:"var_params", multidata:"Y");
			h.definekey('key');
			h.definedata('min','max','group');
			h.definedone();
		End;
		Set have1(keep=id &var);
		key = 1;

		rc = h.find();
		Do while (rc = 0);
			If min <= &var < max or (min = &var = max) then output;
			rc = h.find_next();
		End;
		Run;	

		Data results(index=(id));
		Merge results var_transform;
		By id;
		Run;

		%Let var_x = %eval(&var_x + 1);
	%End;
%Mend;

%example_transformation()

The problem is that when this macro is applied to my full dataset it takes around 18 hours to execute. The most time consuming step is the merge datastep, because it iteratively adds more data to the results table. My question is whether there is any way to speed up this process? For example if the merge was vertical instead of horizontal then the PROC APPEND method would speed things up quite significantly. However as far as I know SAS does not have a horizontal equivalent of PROC APPEND, or does it?

Tom · Posted 12-04-2021 02:41 PM

So HAVE1 is the actual data?

And HAVE2 is the metadata about what transformations you want to do?

I kind of looks to me like you have the data you need to generate the code to process HAVE1 in a single pass in the HAVE2 dataset without any need for macro code. So given your HAVE2 dataset

Obs    var    group    min    max

 1     x1       1      0.0    0.4
 2     x1       2      0.4    1.0
 3     x2       1      0.0    0.6
 4     x2       2      0.6    1.0
 5     x3       1      0.0    0.8
 6     x3       2      0.8    1.0

what code would you generate from that?

Perhaps something like:

data want1;
  set have1;
  if 0 <= x1 < 0.4 then new_x1 =1 ;
  else if 0.4 <= x1 <= 1 then new_x1 =2 ;
  if 0 <= x2 < 0.6 then new_x2 =1 ;
  else if 0.6 <= x2 <= 1 then new_x2 =2 ;
  if 0 <= x3 < 0.8 then new_x3 =1 ;
  else if 0.8 <= x3 <= 1 then new_x3 =2 ;
run;

Which seems pretty easy to do.

filename code temp;
data _null_;
  set have2 ;
  by var ;
  file code ;
  if not first.var then put 'else ' @;
  put 'if ' min '<= ' var @;
  if not last.var then put '< ' @;
  else put '<= ' @;
  put max 'then new_' var '=' group ';' ;
run;

data want2;
  set have1;
%include code / source2;
run;

And yields the same result

The COMPARE Procedure
Comparison of WORK.WANT1 with WORK.WANT2
(Method=EXACT)

Data Set Summary

Dataset              Created          Modified  NVar    NObs

WORK.WANT1  04DEC21:14:37:23  04DEC21:14:37:23     7      20
WORK.WANT2  04DEC21:14:37:23  04DEC21:14:37:23     7      20


Variables Summary

Number of Variables in Common: 7.


Observation Summary

Observation      Base  Compare

First Obs           1        1
Last  Obs          20       20

Number of Observations in Common: 20.
Total Number of Observations Read from WORK.WANT1: 20.
Total Number of Observations Read from WORK.WANT2: 20.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 20.

NOTE: No unequal values were found. All values compared are exactly equal.

Results:

Obs    id       x1         x2         x3      new_x1    new_x2    new_x3

  1     1    0.88387    0.97382    0.50758       2         2         1
  2     2    0.88694    0.68936    0.94325       2         2         2
  3     3    0.92879    0.48766    0.77989       2         1         1
  4     4    0.87714    0.56683    0.66561       2         1         1
  5     5    0.82344    0.92802    0.79010       2         2         1
  6     6    0.11429    0.89059    0.54490       1         2         1
  7     7    0.28105    0.93168    0.61406       1         2         1
  8     8    0.85643    0.61497    0.46557       2         2         1
  9     9    0.59010    0.77492    0.44266       2         2         1
 10    10    0.07624    0.90939    0.55137       1         2         1
 11    11    0.66631    0.23700    0.14195       2         1         1
 12    12    0.33216    0.33066    0.75438       1         1         1
 13    13    0.68821    0.61991    0.78624       2         2         1
 14    14    0.98633    0.15846    0.38494       2         1         1
 15    15    0.32056    0.07447    0.48063       1         1         1
 16    16    0.12128    0.39084    0.70005       1         1         1
 17    17    0.66958    0.48373    0.91975       2         1         2
 18    18    0.05295    0.40973    0.13176       1         1         1
 19    19    0.52615    0.55648    0.63083       2         1         1
 20    20    0.70816    0.35837    0.29789       2         1         1

View solution in original post

Tom · Posted 12-04-2021 12:30 PM

Macro code runs really quickly since all it does is generate SAS code.

Your problem is that the SAS code the macro is generating takes a long time to run.

Please explain what the SAS code the macro is generating is trying to do to get help in doing it more efficiently.

Why are you processing each variable with a different data step? That means if you have 20 variables you have to process the whole dataset 20 times. Why not perform the transformations you want to perform all in one data step?

kashlik123 · Posted 12-04-2021 01:26 PM

Not sure what exactly is unclear in my initial explanation. I have several continuous variables that I need to transform by binning them based on min max ranges with a specific example given in the first post.

How exactly could all the variables be processed in one datatep, if each of one them has different binning settings?

Tom · Posted 12-04-2021 02:41 PM

So HAVE1 is the actual data?

And HAVE2 is the metadata about what transformations you want to do?

I kind of looks to me like you have the data you need to generate the code to process HAVE1 in a single pass in the HAVE2 dataset without any need for macro code. So given your HAVE2 dataset

Obs    var    group    min    max

 1     x1       1      0.0    0.4
 2     x1       2      0.4    1.0
 3     x2       1      0.0    0.6
 4     x2       2      0.6    1.0
 5     x3       1      0.0    0.8
 6     x3       2      0.8    1.0

what code would you generate from that?

Perhaps something like:

data want1;
  set have1;
  if 0 <= x1 < 0.4 then new_x1 =1 ;
  else if 0.4 <= x1 <= 1 then new_x1 =2 ;
  if 0 <= x2 < 0.6 then new_x2 =1 ;
  else if 0.6 <= x2 <= 1 then new_x2 =2 ;
  if 0 <= x3 < 0.8 then new_x3 =1 ;
  else if 0.8 <= x3 <= 1 then new_x3 =2 ;
run;

Which seems pretty easy to do.

filename code temp;
data _null_;
  set have2 ;
  by var ;
  file code ;
  if not first.var then put 'else ' @;
  put 'if ' min '<= ' var @;
  if not last.var then put '< ' @;
  else put '<= ' @;
  put max 'then new_' var '=' group ';' ;
run;

data want2;
  set have1;
%include code / source2;
run;

And yields the same result

The COMPARE Procedure
Comparison of WORK.WANT1 with WORK.WANT2
(Method=EXACT)

Data Set Summary

Dataset              Created          Modified  NVar    NObs

WORK.WANT1  04DEC21:14:37:23  04DEC21:14:37:23     7      20
WORK.WANT2  04DEC21:14:37:23  04DEC21:14:37:23     7      20


Variables Summary

Number of Variables in Common: 7.


Observation Summary

Observation      Base  Compare

First Obs           1        1
Last  Obs          20       20

Number of Observations in Common: 20.
Total Number of Observations Read from WORK.WANT1: 20.
Total Number of Observations Read from WORK.WANT2: 20.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 20.

NOTE: No unequal values were found. All values compared are exactly equal.

Results:

Obs    id       x1         x2         x3      new_x1    new_x2    new_x3

  1     1    0.88387    0.97382    0.50758       2         2         1
  2     2    0.88694    0.68936    0.94325       2         2         2
  3     3    0.92879    0.48766    0.77989       2         1         1
  4     4    0.87714    0.56683    0.66561       2         1         1
  5     5    0.82344    0.92802    0.79010       2         2         1
  6     6    0.11429    0.89059    0.54490       1         2         1
  7     7    0.28105    0.93168    0.61406       1         2         1
  8     8    0.85643    0.61497    0.46557       2         2         1
  9     9    0.59010    0.77492    0.44266       2         2         1
 10    10    0.07624    0.90939    0.55137       1         2         1
 11    11    0.66631    0.23700    0.14195       2         1         1
 12    12    0.33216    0.33066    0.75438       1         1         1
 13    13    0.68821    0.61991    0.78624       2         2         1
 14    14    0.98633    0.15846    0.38494       2         1         1
 15    15    0.32056    0.07447    0.48063       1         1         1
 16    16    0.12128    0.39084    0.70005       1         1         1
 17    17    0.66958    0.48373    0.91975       2         1         2
 18    18    0.05295    0.40973    0.13176       1         1         1
 19    19    0.52615    0.55648    0.63083       2         1         1
 20    20    0.70816    0.35837    0.29789       2         1         1

kashlik123 · Posted 12-04-2021 02:58 PM

This is some magic right here. Could you give some resources on where I can read up on this type of syntax? I never saw anyone use @ in SAS.

Tom · Posted 12-04-2021 03:14 PM

It is a basic function of the PUT statement.

Or look for examples of using the data step to write a report, such as: https://v8doc.sas.com/sashtml/lrcon/z0736133.htm

ballardw · Posted 12-04-2021 12:58 PM

I ran your example code and for every "group" or whatever I see:

NOTE: Variable min is uninitialized.
NOTE: Variable max is uninitialized.

Which makes me suspect everything else because you are comparing values to MISSING.

You might want to look at https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/prochp/prochp_hpbin_examples05.htm

Which used Proc HBBIN to do at least one form of WOE.

I suggest than instead of naming variables in the X1_new X2_new (suffix to group like variables) that SAS works much nicer with prefixed values like New_x1 New_x2 etc. For one thing if you need to reference many related variables then you can use a list shorthand such as New_x: which would reference all variable names that start with New_x. Or a subset New_x1 - New_x25 would only use the ones ending in numeric order from 1 to 25 (assumes no gaps in the name numeric order).

Since you are using 644 variables that is a considerable number of variables that would require a LOT of typing to do a keep or drop list.

kashlik123 · Posted 12-04-2021 01:37 PM

1) Yes, I forgot to initialize the min max variables in the datastep with the hash (code has been edited). However, this did not have an effect on the results of the macro.

2) Unfortunately, my company's SAS license does not include PROC HPBIN, which is why I am forced to do this manually.

3) I am aware that naming variables with prefixes is much better, however for the purposes of my task adding suffixes was a deliberate decision. Besides, as far as I can tell, this does not affect the problem I am trying to solve.

sbxkoenk · Posted 12-04-2021 01:54 PM

Hello,

You say :

Unfortunately, my company's SAS license does not include PROC HPBIN, which is why I am forced to do this manually.

I think that's not possible :

PROC HPBIN is a high-performance statistical procedure from Base SAS 9.4.

And all licenses have Base SAS obviously.

Koen

kashlik123 · Posted 12-04-2021 01:58 PM

This is my version of SAS 9.03.01M1P110211. I realize how silly this sounds but my company really is using a 10 year old version of SAS and they are refusing to upgrade it because it might break some of the existing processes.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!