topic What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE in SAS Programming

What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

ertr — Wed, 03 Apr 2019 15:49:33 GMT

Hello everybody,

I have sample code as below,

Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ;
	BY ID;
RUN;
PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_)
	PREFIX=Test;
	BY ID;
	VAR Variable;
RUN; QUIT;

But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?

Thanks

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

novinosrin — Wed, 03 Apr 2019 16:11:48 GMT

Just a million?or over a million?

Or millions?

Or 100's of millions or billion?

And is it always sets of 3 for each BY group like what your sample suggests?

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

mkeintz — Wed, 03 Apr 2019 16:27:38 GMT

The only reason you are sorting is because dataset HAVE is not sorted. If it's a big data set then a single DATA step using a hash object may be faster than SORT then TRANSPOSE:


data template;
  set have (keep=id);
  length test1-test30 $32;
  t=0;
  stop;
run;
data _null_;
  retain max_t;
  if _n_=1 then do;
   if 0 then set template;
   declare hash h (dataset:'template',ordered:'A');
     h.definekey('id');
     h.definedata(all:'Y');
     h.definedone();
  end;
  array tst {*} $32 test:;
  set have  end=end_of_have;
  if h.find()=0 then t=t+1;
  else t=1;
  tst{t}=variable;
  h.replace();
  max_t=max(max_t,t);
  if end_of_have then do;
    want_descriptor=cats('WANT (keep=ID test1-test',max_t,')');
    rc=h.output(dataset:want_descriptor);
  end;
run;

The data set TEMPLATE is made just to avoid a lot of typing when declaring the hash object. Typing all:'Y' is a lot simpler than typing 'TEST1','TEST2',.....,'TEST30'.
Put a number of TEST variables in TEMPLATE that is sure to exceed the number of tests for any ID.
Note that the order of test values within an id will be preserved, even if they are initially separated by other ID's.
The want_descriptor is the means of dynaimically controlling the number of variables in WANT. It will be the minimum necessary, just like proc transpose.

The value here is to save on disk input/output, at the expense of memory for the hash object.

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

ertr — Thu, 04 Apr 2019 11:43:51 GMT

Thank you for your invaluable response. It seems work, however, I am not familiar with Hash code, can you please explain more deeply(step by step maybe)? @mkeintz @novinosrin On the other hand, our data sets 50 million rows. On the other hand, are there any alternative methods except from Hash?

Thanks

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Ksharp — Thu, 04 Apr 2019 12:22:35 GMT

Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

proc sql noprint;
create index id on have;
select max(n) into : n
 from (select count(*) as n from have group by id);
quit;
data want;
 do i=1 by 1 until(last.id);
  set have;
  by id;
  array x{*} $ 80 test1-test%left(&n);
  x{i}=variable;
 end;
 drop i variable;
run;

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

ballardw — Thu, 04 Apr 2019 15:10:58 GMT

@ertr wrote:

Hello everybody,

I have sample code as below,
Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ;
	BY ID;
RUN;
PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_)
	PREFIX=Test;
	BY ID;
	VAR Variable;
RUN; QUIT;
But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?

Thanks

Your example data implies that the values of variable may repeat for all your ID values (though possibly in a different order). Is that actually the case? If so selecting the distinct values of the variable, transposing and join to the id list might be another approach.