I don't quite understand why a PROC SORT on a large table performs worse when applied to the same data set after compression.
The table in question is about 29 GB (305819687 observations and 16 variables, 13 of which are used in the BY statement), and applying COMPRESS reduces it to only 26 GB. A first doubt I have is that COMPRESS usually reduces the occupied space much more than that.
I would like to optimize this PROC SORT, but I did not get good results: not with compressing the table, not with the BUFSIZE= and BUFNO= options (I tried all the combinations I could), and not with the TAGSORT option.
With the SORTSIZE option I think the situation does not change, because it is already set to 32 GB, a value higher than the table size.
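For reference, a typical attempt looked like this (the table and variable names here are placeholders, not my real ones):

```sas
/* Placeholder names: lib.have and var1-var13 stand in for the real table and BY variables */
options sortsize=32G bufsize=128k bufno=10;  /* one of the buffer combinations tried */

proc sort data=lib.have out=work.sorted tagsort;
    by var1-var13;
run;
```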
Here is the log output of the original sort:
NOTE: There were 305819687 observations read from the data set ODS_VT.T_F_PTFVT_RISERVA_AWARDS.
NOTE: The data set WORK.ETLS_SORTEDXREF has 305819687 observations and 13 variables.
NOTE: PROCEDURE SORT used (total processing time):
real time 8:29.88
user cpu time 2:35.53
system cpu time 53.34 seconds
memory 31708967.57k
OS Memory 31729708.00k
Timestamp 04/05/2023 08:18:18 m.
Step Count 15 Switch Count 659
and here is the log after running PROC SORT on the compressed table with the TAGSORT option:
NOTE: Tagsort reads each observation of the input data set twice.
NOTE: The data set WORK.ETLS_SORTEDXREF has 305819687 observations and 13 variables.
NOTE: Compressing data set WORK.ETLS_SORTEDXREF decreased size by 7.66 percent.
Compressed is 311022 pages; uncompressed would require 336806 pages.
NOTE: PROCEDURE SORT used (total processing time):
real time 12:35.03
user cpu time 6:48.25
system cpu time 1:22.71
memory 20403259.93k
OS Memory 20426884.00k
Timestamp 05/04/2023 08:35:33m.
Step Count 19 Switch Count 1205
With a compression rate of only 10%, COMPRESS is not really useful, and the overhead of TAGSORT (reading the input twice) is not compensated by having a smaller utility file.
Sorting 29 G in 8.5 minutes is not that bad, but a real time of 8.5 minutes vs. CPU time of 3.5 minutes points to your WORK storage being the bottleneck.
Agree with @Kurt_Bremser, compression in this scenario will only slow down your process.
The only quick solution I can think of, if there is more memory available, is to raise MEMSIZE and SORTSIZE.
I guess this is a job that will run regularly. There might be a way you could redesign your ETL so you don't have to re-sort the whole table each time, but that needs in-depth knowledge of your business rules, source system contents, etc.
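A minimal sketch of that adjustment (the 48G/40G values are illustrative, assuming the host actually has that much RAM):

```sas
/* MEMSIZE can only be set at SAS invocation (command line or config file), e.g.:
     sas -memsize 48G -sortsize 40G myjob.sas
   SORTSIZE can also be raised mid-session, as long as it stays below MEMSIZE: */
options sortsize=40G;

proc options option=(memsize sortsize);  /* verify the values in effect */
run;
```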
The SORTSIZE option is set to 32GB, while the table to be sorted is 29GB, so what is the point of increasing the SORTSIZE value?
PROC SORT can use up to 3-4 times the size of the input table in spill files etc.
What does this mean? That you could set SORTSIZE, for example, to 60GB?
Yes, as long as it's equal to or smaller than your MEMSIZE setting.
But don't set it to a higher value than the actual memory available to you. Otherwise the OS will do the swapping. You want SAS to do the swapping, since it knows its internal processing better than the OS.
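A quick way to check the two values against each other inside a session; this sketch assumes GETOPTION returns the plain byte counts for both options, so the INPUT conversion is an assumption about the returned format:

```sas
/* Compare the in-effect SORTSIZE against MEMSIZE (both read as byte counts) */
data _null_;
    memsize  = input(getoption('memsize'),  best32.);
    sortsize = input(getoption('sortsize'), best32.);
    if sortsize > memsize then
        put 'SORTSIZE exceeds MEMSIZE';
    else
        put 'SORTSIZE fits within MEMSIZE';
run;
```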
So I think I changed the SORTSIZE option. Could you tell me how to restore the previous option? Thank you...
I agree with Kurt's "Sorting 29 G in 8.5 minutes is not that bad".
If you want it faster, here is an example, but it needs more code.
data Female Male;          /* split into one data set per SEX value in a single pass */
    set sashelp.heart;
    if sex='Female' then output Female;
    else if sex='Male' then output Male;
run;

data want;                 /* stack the pieces back together in BY order */
    set Female Male;
run;
@Ksharp wrote:
I agree with Kurt's "Sorting 29 G in 8.5 minutes is not that bad".
If you want it faster, here is an example, but it needs more code.
data Female Male;
    set sashelp.heart;
    if sex='Female' then output Female;
    else if sex='Male' then output Male;
run;
data want;
    set Female Male;
run;
While this avoids the actual sort, it still pushes the whole amount of data through I/O four times (same as PROC SORT without TAGSORT), and if the bottleneck is there, the gain will be minimal.
I don't quite understand what the code you posted means; it doesn't seem like a substitute for a PROC SORT.
This is my starting PROC SORT:
proc sort data=_16_variables out=test1;
by list of _13_ variables;
run;
This is a proc sort.
I didn't know the names of the 13 variables in your BY statement, so you would have to replace the BY statement with the actual variables you are using.
Rather than typing the name of your incoming data set, I used a made-up name _16_variables. I expect you would use your actual data set name.
I used out= to avoid replacing the original data set. If you replace the original data set, the second PROC SORT would be skipped since SAS detects that the data set is already in sorted order.
So the idea is to run PROC SORT twice on the same data set, using out= to name the sorted data set. One PROC SORT uses TAGSORT and one doesn't.
If that's not clear, please ask again.
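A sketch of that comparison, with placeholder names throughout (your_table and var1-var13 are stand-ins for your real data set and the 13 BY variables):

```sas
/* One sort without TAGSORT, one with; compare the real/CPU times in the log */
proc sort data=your_table out=test_plain;
    by var1-var13;
run;

proc sort data=your_table out=test_tagsort tagsort;
    by var1-var13;
run;
```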
If you were sorting by sex (male/female in this data), that example code splits the data set into one piece for each sex and then stacks them. A manual sort, essentially, which is equivalent to
proc sort data=class;
by sex;
run;
@mariopellegrini wrote:
I don't quite understand what the code you put means, it doesn't seem like a substitute for a proc sort
this is my starting proc sort:
proc sort data=_16_variables out=test1;
by list of _13_ variables;
run;