I don't quite understand why a PROC SORT on a large table performs worse when applied to the same data set after compression.
The table in question is about 29 GB (305819687 observations and 16 variables, 13 of which are used in the BY statement), and applying COMPRESS reduces it to only 26 GB. A first doubt I have is that COMPRESS usually reduces the occupied space much more than that.
I would like to optimize this PROC SORT, but I did not get good results: not with compressing the table, not with the BUFSIZE= and BUFNO= options (I tried all the combinations I could), and not with the TAGSORT option.
With the SORTSIZE option I think the situation does not change, because it is already set to 32 GB, a value higher than the table size.
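For reference, a typical attempt looked like this (the table and variable names here are placeholders, not my real ones):

```sas
/* Placeholder names: lib.have and var1-var13 stand in for the real table and BY variables */
options sortsize=32G bufsize=128k bufno=10;  /* one of the buffer combinations tried */

proc sort data=lib.have out=work.sorted tagsort;
    by var1-var13;
run;
```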
Here is the log output of the original sort:
NOTE: There were 305819687 observations read from the data set ODS_VT.T_F_PTFVT_RISERVA_AWARDS.
NOTE: The data set WORK.ETLS_SORTEDXREF has 305819687 observations and 13 variables.
NOTE: PROCEDURE SORT used (total processing time):
real time 8:29.88
user cpu time 2:35.53
system cpu time 53.34 seconds
memory 31708967.57k
OS Memory 31729708.00k
Timestamp 04/05/2023 08:18:18 m.
Step Count 15 Switch Count 659
and here is the log after running PROC SORT on the compressed table with the TAGSORT option:
NOTE: Tagsort reads each observation of the input data set twice.
NOTE: The data set WORK.ETLS_SORTEDXREF has 305819687 observations and 13 variables.
NOTE: Compressing data set WORK.ETLS_SORTEDXREF decreased size by 7.66 percent.
Compressed is 311022 pages; uncompressed would require 336806 pages.
NOTE: PROCEDURE SORT used (total processing time):
real time 12:35.03
user cpu time 6:48.25
system cpu time 1:22.71
memory 20403259.93k
OS Memory 20426884.00k
Timestamp 05/04/2023 08:35:33m.
Step Count 19 Switch Count 1205
With a compression rate of only 10%, COMPRESS is not really useful, and the overhead of TAGSORT (reading the input twice) is not compensated by having a smaller utility file.
Sorting 29 G in 8.5 minutes is not that bad, but a real time of 8.5 minutes vs. CPU time of 3.5 minutes points to your WORK storage being the bottleneck.
Agree with @Kurt_Bremser, compression in this scenario will only slow down your process.
The only quick solution I can think of, if there is more memory available, is to raise MEMSIZE and SORTSIZE.
I guess this is a job that will run regularly. There might be a way you could redesign your ETL so you don't have to re-sort the whole table each time, but that needs in-depth knowledge of your business rules, source system contents, etc.
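A minimal sketch of that adjustment (the 48G/40G values are illustrative, assuming the host actually has that much RAM):

```sas
/* MEMSIZE can only be set at SAS invocation (command line or config file), e.g.:
     sas -memsize 48G -sortsize 40G myjob.sas
   SORTSIZE can also be raised mid-session, as long as it stays below MEMSIZE: */
options sortsize=40G;

proc options option=(memsize sortsize);  /* verify the values in effect */
run;
```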
The SORTSIZE option is set to 32GB, while the table to be sorted is 29GB, so what is the point of increasing the SORTSIZE value?
PROC SORT can use up to 3-4 times the size of the input table in spill files etc.
What does this mean? That you could set SORTSIZE, for example, to 60GB?
Yes, as long as it's equal to or smaller than your MEMSIZE setting.
But don't set it to a higher value than the actual memory available to you. Otherwise the OS will do the swapping. You want SAS to do the swapping, since it knows its internal processing better than the OS.
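A quick way to check the two values against each other inside a session; this sketch assumes GETOPTION returns the plain byte counts for both options, so the INPUT conversion is an assumption about the returned format:

```sas
/* Compare the in-effect SORTSIZE against MEMSIZE (both read as byte counts) */
data _null_;
    memsize  = input(getoption('memsize'),  best32.);
    sortsize = input(getoption('sortsize'), best32.);
    if sortsize > memsize then
        put 'SORTSIZE exceeds MEMSIZE';
    else
        put 'SORTSIZE fits within MEMSIZE';
run;
```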
So I think I changed the SORTSIZE option. Could you tell me how to restore the previous option? Thank you...
I agree with Kurt's "Sorting 29 G in 8.5 minutes is not that bad".
If you want it faster, here is an example, but it needs more code.
data Female Male;          /* split into one data set per SEX value in a single pass */
    set sashelp.heart;
    if sex='Female' then output Female;
    else if sex='Male' then output Male;
run;

data want;                 /* stack the pieces back together in BY order */
    set Female Male;
run;
@Ksharp wrote:
I agree with Kurt's "Sorting 29 G in 8.5 minutes is not that bad".
If you want it faster, here is an example, but it needs more code.
data Female Male;
    set sashelp.heart;
    if sex='Female' then output Female;
    else if sex='Male' then output Male;
run;
data want;
    set Female Male;
run;
While this avoids the actual sort, it still pushes the whole amount of data through I/O four times (same as PROC SORT without TAGSORT), and if the bottleneck is there, the gain will be minimal.
I don't quite understand what the code you posted means; it doesn't seem like a substitute for a PROC SORT.
This is my starting PROC SORT:
proc sort data=_16_variables out=test1;
by list of _13_ variables;
run;
This is a proc sort.
I didn't know the names of the 13 variables in your BY statement, so you would have to replace the BY statement with the actual variables you are using.
Rather than typing the name of your incoming data set, I used a made-up name _16_variables. I expect you would use your actual data set name.
I used out= to avoid replacing the original data set. If you replace the original data set, the second PROC SORT would be skipped since SAS detects that the data set is already in sorted order.
So the idea is to run PROC SORT twice on the same data set, using out= to name the sorted data set. One PROC SORT uses TAGSORT and one doesn't.
If that's not clear, please ask again.
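A sketch of that comparison, with placeholder names throughout (your_table and var1-var13 are stand-ins for your real data set and the 13 BY variables):

```sas
/* One sort without TAGSORT, one with; compare the real/CPU times in the log */
proc sort data=your_table out=test_plain;
    by var1-var13;
run;

proc sort data=your_table out=test_tagsort tagsort;
    by var1-var13;
run;
```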
If you were sorting by sex (male/female in this data), that example code splits the data set into one piece for each sex and then stacks them. A manual sort, essentially, which is equivalent to
proc sort data=class;
by sex;
run;
@mariopellegrini wrote:
I don't quite understand what the code you put means, it doesn't seem like a substitute for a proc sort
this is my starting proc sort:
proc sort data=_16_variables out=test1;
by list of _13_ variables;
run;