BookmarkSubscribeRSS Feed
venkibhu
Calcite | Level 5

Dear All,

 

I have come across a paper "Sorting big datasets. Do we really need it?" which do a comparison of "Execution" time of sorting a datasets with the approaches like proc sort, indexing and Hash.

 

It is shown that, the HASH is more effective to sort a dataset when compared to the other techniques. I have tried (dataset contains 10L observations with 10 columns). I am surprised to see that the HASH is taking more time when compared to the PROC SORT.

 

Can some one suggest me a best way to sort the dataset in SAS?

 

Is it possible to re-write the below code using the Hash objects (Transposing and Merging ? in a single datastep with multiple objects or Multiple datasteps one for transposing, other is for merging?)

 

data long;
input id varnm $ value;
cards;
1 a3 10
2 a3 20
4 a3 40
3 a3 30
1 a4 10
4 a4 40
3 a4 30
2 a4 20
;
run;

data short;
input id a1 a2 a5;
cards;
1 10 10 10
2 20 20 20
4 40 40 40
3 30 30 30
;
run;

proc sort data = long;
by id;
run;

proc transpose data = long out = long_tra (drop = _name);
by id;
var value;
id varnm;
run;

proc sort data = short;
by id;
run;

data final;
merge short long_tra;
by id;
run;

 

Thanks in Advance,

Venkat.

1 REPLY 1
PeterClemmensen
Tourmaline | Level 20

You're asking two questions, so I am going to answer them separately.

 

"Can some one suggest me a best way to sort the dataset in SAS?"

 

Yes, but the correct, but boring answer is that it depends largely on the specific data and circumstances. Yes, Hash Objects are sometimes faster. However a hash object is an in-memory construct. So if the size of your data exceeds the available memory allocated, then what? Then the PROC SORT may be a better choice, because it can create utility files on disk to overcome the problem. Or maybe it is a better choice to create an appropriate index and exploit it with a by statement to sort the data? However, creating the index may be both time and CPU costly. And what will you do with the index file afterwards? Delete it? And if not, how often do you want the index centiles to update, which is also CPU costly. 

 

The short answer is: It depends. If you want a nice comparison of the different ways to sort data in SAS and what is going on under the hood, check out chapter 4 in the book High-Performance SAS Coding. It is well worth the few bucks. 

 

Is it possible to re-write the below code using the Hash objects?

 

 

You can do something like the below, though I am not fund of the approach.

 

proc sql noprint;
   select distinct varnm into :vars separated by ' '
   from long
   order by varnm;
quit;

%put &vars.;

data want(drop=i);
   if 0 then set long;
   attrib &vars. length=8;
   array vars{*} &vars.;

   if _N_ = 1 then do;
      declare hash h(dataset:'long', multidata:'Y', ordered:'Y');
      h.defineKey('id');
      h.defineData('varnm', 'value');
      h.defineDone();
   end;

   set short;

   do i=1 by 1 while (h.do_over()=0);
      vars[i]=value;
   end;
run;

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Discussion stats
  • 1 reply
  • 931 views
  • 0 likes
  • 2 in conversation