SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
mariopellegrini
Pyrite | Level 9

I don't quite understand why a proc sort applied to a large table worsens performance if applied to the same dataset, but compressed.
The table in question weighs about 29Gb (305819687 observations and 16 variables, of which 13 variables used in the "by"), and by doing the COMPRESS the weight is reduced to only 26Gb. A first doubt I have is that usually compress reduces the occupied memory much more.
I wish I could optimize this proc sort but I didn't get good results, neither with the use of compress for the table, nor with the Bufsize= and Bufno = options (I tried them for all possible combinations), nor with the TAGSORT option .
With the SORTSIZE option I think the situation does not change because this is set to 32Gb, therefore a value already higher than the table size.
Here is the outcome of the log of the starting sort:

NOTE: There were 305819687 observations read from the data set ODS_VT.T_F_PTFVT_RISERVA_AWARDS.
NOTE: The data set WORK.ETLS_SORTEDXREF has 305819687 observations and 13 variables.
NOTES: PROCEDURE SORT used (total processing time):
real-time 8:29.88
user cpu time 2:35.53
system cpu time 53.34 seconds
memory 31708967.57k
OS Memory 31729708.00k
Timestamp 04/05/2023 08:18:18 m.
Step Count 15 Switch Count 659

and here the result of the log after the application of the proc sort on the compressed table with the use of the TAGSORT option:

NOTE: Tagsort reads each observation of the input data set twice.
NOTE: The data set WORK.ETLS_SORTEDXREF has 305819687 observations and 13 variables.
NOTES: Compressing data set WORK.ETLS_SORTEDXREF decreased size by 7.66 percent.
Compressed is 311022 pages; uncompressed would require 336806 pages.
NOTES: PROCEDURE SORT used (total processing time):
real time 12:35.03
user cpu time 6:48.25
system cpu time 1:22.71
memory 20403259.93k
OS Memory 20426884.00k
Timestamp 05/04/2023 08:35:33m.
Step Count 19 Switch Count 1205

18 REPLIES 18
Kurt_Bremser
Super User

With a compression rate of only 10%, COMPRESS is not really useful, and the overhead of TAGSORT (reading the input twice) is not compensated by having a smaller utility file.

Sorting 29 G in 8.5 minutes is not that bad, but a real time of 8.5 minutes vs. CPU time of 3.5 minutes points to your WORK storage being the bottleneck.

LinusH
Tourmaline | Level 20

Agree with @Kurt_Bremser , compression in this scenario will only slow down your process.

The only quick solution I can think of if there's more available memory, raise MEMESIZE and SORTSIZE.

I guess this is ajob that will run regularly. There might be a way yo could redesign your ETL not have to resort the whole table each time - but that needs in depth knowledge of your business rules, source system contents etc.

Data never sleeps
mariopellegrini
Pyrite | Level 9

L'opzione SORTSIZE è impostata su 32Gb, mentre la tabella da ordinare è 29Gb, quindi che senso ha aumentare il valore SORTSIZE?

LinusH
Tourmaline | Level 20

PROC SORT can use up to 3-4 times the size of the input table in spill files etc.

Data never sleeps
mariopellegrini
Pyrite | Level 9

what does this mean? that you can set a SORTSIZE for example to 60Gb?

LinusH
Tourmaline | Level 20

Yes, as long it's equal to or smaller than your MEMSIZE setting.

But don't set it to a higher value than you have actual memory availble to you. Otherwise the OS will do the swapping. You want to SAS do the swapping since it know its internal processing better than the OS:

Data never sleeps
mariopellegrini
Pyrite | Level 9

So I think I changed the SORTSIZE option. Could you tell me how to restore the previous option? Thank you...

 

Ksharp
Super User

I agree with Kurt's "Sorting 29 G in 8.5 minutes is not that bad".

If you want faster ,here is an example but need more code .

 

data Female Male;
 set sashelp.heart;
 if sex='Female' then output Female;
  else if sex='Male' then output Male;
run;

data want;
 set Female Male;
run;
Kurt_Bremser
Super User

@Ksharp wrote:

I agree with Kurt's "Sorting 29 G in 8.5 minutes is not that bad".

If you want faster ,here is an example but need more code .

 

data Female Male;
 set sashelp.heart;
 if sex='Female' then output Female;
  else if sex='Male' then output Male;
run;

data want;
 set Female Male;
run;

While this avoids the actual sort, it will still run the whole amount of data 4 times through I/O (same as PROC SORT without TAGSORT), and if the bottleneck is there, the gain will be minimal.

mariopellegrini
Pyrite | Level 9

I don't quite understand what the code you put means, it doesn't seem like a substitute for a proc sort

this is my starting proc sort:

 

proc sort data=_16_variables out=test1;
   by list of _13_ variables;
run;

 

Astounding
PROC Star

This is a proc sort. 

 

I didn't know the names of the 13 variables in your BY statement, so you would have to replace the BY statement with the actual variables you are using.

 

Rather than typing the name of your incoming data set, I used a made-up name _16_variables.  I expect you would use your actual data set name.

 

I used out= to avoid replacing the original data set.  If you replace the original data set, the second PROC SORT would be skipped since SAS detects that the data set is already in sorted order.

 

So the idea is to run PROC SORT twice on the same data set, using out= to name the sorted data set.  One PROC SORT uses TAGSORT and one doesn't.  

 

If that's not clear, please ask again.

Reeza
Super User

If you were sorting by sex (male/female in this data), that example code splits the data set into one for each sex and then stacks them. A manual sort essentially, which is the equivalent of

 

proc sort data=class;

by sex;

run;

 


@mariopellegrini wrote:

I don't quite understand what the code you put means, it doesn't seem like a substitute for a proc sort

this is my starting proc sort:

 

proc sort data=_16_variables out=test1;
   by list of _13_ variables;
run;

 


 

mariopellegrini
Pyrite | Level 9
but I sort by 13 variables so it's more complex
Reeza
Super User
Not really, your combinations are more but the principle is the same and if it's automated it doesn't matter. Do you have indexes on any of those 13 variables?

SAS Innovate 2025: Register Today!

 

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 1804 views
  • 3 likes
  • 6 in conversation