topic Re: Proc sort or Proc sql in SAS Programming

Proc sort or Proc sql

tommy81 — Fri, 08 Oct 2010 05:32:28 GMT

Hi

I have a situation where I have to sort a 25 gb dataset and do the below .

proc sort data=sale;by sale;run;

data temp1 temp2;
set sale;
by sale;
if first.sale=last.sale then output temp2; else output temp1;
run;

Since this is taking almost 2hrs could somebody suggest an alternative using proc sql if you think that might take even less time.

Re: Proc sort or Proc sql

Robert_Bardos — Fri, 08 Oct 2010 06:42:55 GMT

As there are operating systems that allow for quite some SORT tweaking please post what operating system you are using and also what version of SAS you are using. And also how many CPUs you have available and whether option THREADS is active or not.

Please post the result of
[pre]
%put OS &sysscp &sysscpl ;
%put SAS &sysver ;
%put CPUs &sysncpu ;
%put THREADS: &sysfunc(getoption(threads)) ;
proc options group=sort; run;
[/pre]

Re: Proc sort or Proc sql

tommy81 — Fri, 08 Oct 2010 08:22:45 GMT

15 %put OS &sysscp &sysscpl
16 ;%put SAS &sysver ;
OS SUN 64 SunOS
SAS 9.2
17 %put CPUs &sysncpu ;
CPUs 4

proc options group=sort;
20 run;
SAS (r) Proprietary Software Release 9.2 TS2M3 SORTDUP=PHYSICAL SORT applies NODUP option to physical or logical records?
SORTEQUALS Maintain order for the input data set in the output data set, when processing identical BY-variable values with
Proc Sort
SORTSEQ= Collating sequence for sorting
SORTSIZE=83886080 Size parameter for sort
NOSORTVALIDATE Do not use automatic sort order validation to determine assertion
SORTANOM= Host sort option
SORTCUT=0 Specifies the number of observations above which the host sort program is used instead of the SAS sort program
SORTCUTP=0 Specifies the number of bytes above which the host sort program should be used instead of the SAS sort program
SORTDEV= Specifies the pathname used for temporary host sort files
SORTNAME= Specifies the name of the host sort utility
SORTPARM= Specifies the host sort parameters
SORTPGM=BEST Specifies the name of the sort utility

Re: Proc sort or Proc sql

Robert_Bardos — Fri, 08 Oct 2010 09:10:29 GMT

Thanks Tommy,

so you have a SUN 64 system, 4 CPUs, running SAS 9.2.

Coming from the mainframe side I can only say that the SORTSIZE value seems rather low to me. Do not these SUN systems have an overwhelming amount of memory available? If 83886080 actually means somewhat around 80MB then I think that value is extremely low causing SORT to do some heavy disk I/O. But then, I'm a mainframer, so we better wait for the operating system specific experts to wake up and get in front of their keyboards.

Re: Proc sort or Proc sql

sbb — Fri, 08 Oct 2010 09:21:22 GMT

I'm not so sure PROC SQL is the answer, however you could look at building an index or using a SAS VIEW -- or at least subsetting your input file to only choose SAS variables that your program will require.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search arguments, this topic / post:

solaris sort performance site:sas.com

sort performance index site:sas.com

Re: Proc sort or Proc sql

Patrick — Fri, 08 Oct 2010 10:19:56 GMT

Hi Tommy

Following Scott's suggestion to reduce the variables before sorting an approach like below could work.
The assumption is that after reducing the variables the sort can be done fully in memory - you will have to give it a try.

data massive(drop=i);
array ManyVars {10} (10*1);
do i=1 to 1000;
sale=ceil(ranuni(1)*700);
output;
end;
run;

data V_PrepareSort /view=V_PrepareSort;
set massive(keep=sale);
obsID=_n_;
run;

proc sql;
create view V_SortedMassiveSale as
select sale, obsID
from V_PrepareSort
order by sale
;
quit;

options sortsize=max;
data OneObs(compress=yes) MultiObs(compress=yes);
set V_SortedMassiveSale;
by sale;

set massive point=obsID;

if first.sale and last.sale then output OneObs;
else output MultiObs;
run;

HTH
Patrick

Re: Proc sort or Proc sql

Peter_C — Fri, 08 Oct 2010 11:38:56 GMT

tommy81

since it is awkward to sort, I would try not to need the sort.
What are you doing with the sorted data?
Some things might be accomplished without the sort .....

hth
peterC

Re: Proc sort or Proc sql

tommy81 — Fri, 08 Oct 2010 12:50:59 GMT

Peter

Not much .

Just check if the there are more than one repeating value for sale , and if so output all those records to another dataset.

Since the input dataset is not sorted , we could not go ahead with the first.sale=last.sale

Re: Proc sort or Proc sql

Patrick — Sat, 09 Oct 2010 00:57:14 GMT

Tommy
If it would be sufficient to write only the 2nd to n-th record for repeated sales to a new data set then an approach using a hash table would work (creating kind of a black list on-the-fly).

Re: Proc sort or Proc sql

sbb — Sat, 09 Oct 2010 05:25:42 GMT

PROC SORT option DUPOUT= can be used, if the additional duplicates are to be split out.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search argument, this topic / post:

proc sort dupout site:sas.com

Re: Proc sort or Proc sql

tommy81 — Sat, 09 Oct 2010 17:55:53 GMT

Hi SBB,

Exactly waht was needed..Simple but brilliant

Thank you so much.

Re: Proc sort or Proc sql

sbb — Sat, 09 Oct 2010 19:17:15 GMT

Also, one additional feature/benefit is where you can use two back-to-back PROC SORT executions, and only the second would have the DUPOUT=, possibly using the NODUPKEY option additionally to control SAS behavior -- but also investigate the EQUALS option which is sometimes needed in the last SORT execution. With EQUALS you can influence how your sort package (or SAS internal sort) sequences or leaves alone the incoming observation order, again, from a prior SORT.

For example, say you want to eliminate all duplicates and send them to another file (using DUPOUT= with NODUPKEY specified), however you also may want to have a previous SORT with a more granular BY variable list, one that ensure some additional sort-order ahead of the DUPOUT= / NODUPKEY sort execution.

And, also to remember (discussed in these forums before), it's up to the programmer to ensure that the PROC SORT BY variable list is granular enough to ensure that "duplicate observations" (achieved by either NODUPS or NODUPKEY) end up being adjacent, otherwise the sort-operation will not necessarily remove the duplicates -- they must be adjacent to achieve duplicate-deletion.

Happy Sorting People!

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search arguments, this topic / post discussion:

+"proc sort" +equals +sortequals behavior site:sas.com

+"proc sort" by variable list remove duplicates site:sas.com