What Makes Performance Better - Alternative Methods for PROC SORT & P...

ertr · Posted 04-03-2019 11:49 AM

Hello everybody,

I have sample code as below,

Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ;
	BY ID;
RUN;
PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_)
	PREFIX=Test;
	BY ID;
	VAR Variable;
RUN; QUIT;

But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?

Thanks

novinosrin · Posted 04-03-2019 12:08 PM

Just a million?or over a million?

Or millions?

Or 100's of millions or billion?

And is it always sets of 3 for each BY group like what your sample suggests?

mkeintz · Posted 04-03-2019 12:27 PM

The only reason you are sorting is because dataset HAVE is not sorted. If it's a big data set then a single DATA step using a hash object may be faster than SORT then TRANSPOSE:


data template;
  set have (keep=id);
  length test1-test30 $32;
  t=0;
  stop;
run;
data _null_;
  retain max_t;
  if _n_=1 then do;
   if 0 then set template;
   declare hash h (dataset:'template',ordered:'A');
     h.definekey('id');
     h.definedata(all:'Y');
     h.definedone();
  end;
  array tst {*} $32 test:;
  set have  end=end_of_have;
  if h.find()=0 then t=t+1;
  else t=1;
  tst{t}=variable;
  h.replace();
  max_t=max(max_t,t);
  if end_of_have then do;
    want_descriptor=cats('WANT (keep=ID test1-test',max_t,')');
    rc=h.output(dataset:want_descriptor);
  end;
run;

The data set TEMPLATE is made just to avoid a lot of typing when declaring the hash object. Typing all:'Y' is a lot simpler than typing 'TEST1','TEST2',.....,'TEST30'.
Put a number of TEST variables in TEMPLATE that is sure to exceed the number of tests for any ID.
Note that the order of test values within an id will be preserved, even if they are initially separated by other ID's.
The want_descriptor is the means of dynaimically controlling the number of variables in WANT. It will be the minimum necessary, just like proc transpose.

The value here is to save on disk input/output, at the expense of memory for the hash object.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ertr · Posted 04-04-2019 07:43 AM

Thank you for your invaluable response. It seems work, however, I am not familiar with Hash code, can you please explain more deeply(step by step maybe)? @mkeintz @novinosrin On the other hand, our data sets 50 million rows. On the other hand, are there any alternative methods except from Hash?

Thanks

Ksharp · Posted 04-04-2019 08:22 AM

Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

proc sql noprint;
create index id on have;
select max(n) into : n
 from (select count(*) as n from have group by id);
quit;
data want;
 do i=1 by 1 until(last.id);
  set have;
  by id;
  array x{*} $ 80 test1-test%left(&n);
  x{i}=variable;
 end;
 drop i variable;
run;

ballardw · Posted 04-04-2019 11:10 AM

@ertr wrote:

Hello everybody,

I have sample code as below,
Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ;
	BY ID;
RUN;
PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_)
	PREFIX=Test;
	BY ID;
	VAR Variable;
RUN; QUIT;
But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?

Thanks

Your example data implies that the values of variable may repeat for all your ID values (though possibly in a different order). Is that actually the case? If so selecting the distinct values of the variable, transposing and join to the id list might be another approach.

What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Re: What Makes Performance Better - Alternative Methods for PROC SORT & PROC TRANSPOSE

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!