BookmarkSubscribeRSS Feed
tommy81
Obsidian | Level 7
Hi

I have a situation where I have to sort a 25 gb dataset and do the below .

proc sort data=sale;by sale;run;

data temp1 temp2;
set sale;
by sale;
if first.sale=last.sale then output temp2; else output temp1;
run;

Since this is taking almost 2hrs could somebody suggest an alternative using proc sql if you think that might take even less time.
11 REPLIES 11
Robert_Bardos
Fluorite | Level 6
As there are operating systems that allow for quite some SORT tweaking please post what operating system you are using and also what version of SAS you are using. And also how many CPUs you have available and whether option THREADS is active or not.

Please post the result of
[pre]
%put OS &sysscp &sysscpl ;
%put SAS &sysver ;
%put CPUs &sysncpu ;
%put THREADS: &sysfunc(getoption(threads)) ;
proc options group=sort; run;
[/pre]
tommy81
Obsidian | Level 7
15 %put OS &sysscp &sysscpl
16 ;%put SAS &sysver ;
OS SUN 64 SunOS
SAS 9.2
17 %put CPUs &sysncpu ;
CPUs 4


proc options group=sort;
20 run;
SAS (r) Proprietary Software Release 9.2 TS2M3 SORTDUP=PHYSICAL SORT applies NODUP option to physical or logical records?
SORTEQUALS Maintain order for the input data set in the output data set, when processing identical BY-variable values with
Proc Sort
SORTSEQ= Collating sequence for sorting
SORTSIZE=83886080 Size parameter for sort
NOSORTVALIDATE Do not use automatic sort order validation to determine assertion
SORTANOM= Host sort option
SORTCUT=0 Specifies the number of observations above which the host sort program is used instead of the SAS sort program
SORTCUTP=0 Specifies the number of bytes above which the host sort program should be used instead of the SAS sort program
SORTDEV= Specifies the pathname used for temporary host sort files
SORTNAME= Specifies the name of the host sort utility
SORTPARM= Specifies the host sort parameters
SORTPGM=BEST Specifies the name of the sort utility
Robert_Bardos
Fluorite | Level 6
Thanks Tommy,

so you have a SUN 64 system, 4 CPUs, running SAS 9.2.

Coming from the mainframe side I can only say that the SORTSIZE value seems rather low to me. Do not these SUN systems have an overwhelming amount of memory available? If 83886080 actually means somewhat around 80MB then I think that value is extremely low causing SORT to do some heavy disk I/O. But then, I'm a mainframer, so we better wait for the operating system specific experts to wake up and get in front of their keyboards.
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
I'm not so sure PROC SQL is the answer, however you could look at building an index or using a SAS VIEW -- or at least subsetting your input file to only choose SAS variables that your program will require.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search arguments, this topic / post:

solaris sort performance site:sas.com

sort performance index site:sas.com
Patrick
Opal | Level 21
Hi Tommy

Following Scott's suggestion to reduce the variables before sorting an approach like below could work.
The assumption is that after reducing the variables the sort can be done fully in memory - you will have to give it a try.

data massive(drop=i);
array ManyVars {10} (10*1);
do i=1 to 1000;
sale=ceil(ranuni(1)*700);
output;
end;
run;

data V_PrepareSort /view=V_PrepareSort;
set massive(keep=sale);
obsID=_n_;
run;

proc sql;
create view V_SortedMassiveSale as
select sale, obsID
from V_PrepareSort
order by sale
;
quit;

options sortsize=max;
data OneObs(compress=yes) MultiObs(compress=yes);
set V_SortedMassiveSale;
by sale;

set massive point=obsID;

if first.sale and last.sale then output OneObs;
else output MultiObs;
run;


HTH
Patrick
Peter_C
Rhodochrosite | Level 12
tommy81

since it is awkward to sort, I would try not to need the sort.
What are you doing with the sorted data?
Some things might be accomplished without the sort .....

hth
peterC
tommy81
Obsidian | Level 7
Peter

Not much .

Just check if the there are more than one repeating value for sale , and if so output all those records to another dataset.

Since the input dataset is not sorted , we could not go ahead with the first.sale=last.sale
Patrick
Opal | Level 21
Tommy
If it would be sufficient to write only the 2nd to n-th record for repeated sales to a new data set then an approach using a hash table would work (creating kind of a black list on-the-fly).
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
PROC SORT option DUPOUT= can be used, if the additional duplicates are to be split out.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search argument, this topic / post:

proc sort dupout site:sas.com
tommy81
Obsidian | Level 7
Hi SBB,

Exactly waht was needed..Simple but brilliant

Thank you so much.
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
Also, one additional feature/benefit is where you can use two back-to-back PROC SORT executions, and only the second would have the DUPOUT=, possibly using the NODUPKEY option additionally to control SAS behavior -- but also investigate the EQUALS option which is sometimes needed in the last SORT execution. With EQUALS you can influence how your sort package (or SAS internal sort) sequences or leaves alone the incoming observation order, again, from a prior SORT.

For example, say you want to eliminate all duplicates and send them to another file (using DUPOUT= with NODUPKEY specified), however you also may want to have a previous SORT with a more granular BY variable list, one that ensure some additional sort-order ahead of the DUPOUT= / NODUPKEY sort execution.

And, also to remember (discussed in these forums before), it's up to the programmer to ensure that the PROC SORT BY variable list is granular enough to ensure that "duplicate observations" (achieved by either NODUPS or NODUPKEY) end up being adjacent, otherwise the sort-operation will not necessarily remove the duplicates -- they must be adjacent to achieve duplicate-deletion.

Happy Sorting People!

Scott Barry
SBBWorks, Inc.


Suggested Google advanced search arguments, this topic / post discussion:

+"proc sort" +equals +sortequals behavior site:sas.com

+"proc sort" by variable list remove duplicates site:sas.com

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1132 views
  • 0 likes
  • 5 in conversation