I have a 42 GB table (44 variables, 34 million observations), and a PROC SORT by 7 variables takes about 44 minutes.
I wonder if anyone has specific advice for reducing processing times in cases like this. Thanks
Accepted Solutions
32212254720 B = 31457280 KB = 30720 MB = 30 GB. Nice setup 🙂
@mariopellegrini wrote:
I have a 42 GB table (44 variables, 34 million observations), and a PROC SORT by 7 variables takes about 44 minutes.
I wonder if anyone has specific advice for reducing processing times in cases like this. Thanks
First, make sure that you are using the fastest disk on your system. Typically the WORK library is assigned to a fast disk, so try sorting from/to WORK.
Also, since you seem to have long records, the COMPRESS= dataset option may help. You might use more CPU time, but it should save on disk I/O operations, which are orders of magnitude slower.
If you have access to it, you might also try the SPDE engine.
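A minimal sketch of both suggestions; the library, dataset, BY variables, and path are placeholders, not the poster's actual names:

```sas
/* Placeholder names: BIGLIB.BIG_TABLE and KEY1-KEY3 stand in for the real ones */

/* Sort into WORK with a compressed output dataset to cut the bytes written */
proc sort data=biglib.big_table out=work.big_sorted (compress=yes);
    by key1 key2 key3;
run;

/* Or target an SPDE library, which partitions the data and multi-threads I/O */
libname spdewk spde '/fastdisk/spde' partsize=2g;
proc sort data=biglib.big_table out=spdewk.big_sorted;
    by key1 key2 key3;
run;
```

COMPRESS=YES trades CPU for I/O, so it pays off mainly when the records compress well (many character variables, repeated values).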
Hi.
Consider trying the BUFSIZE= and BUFNO= options for the output table. Please see the online documentation:
https://go.documentation.sas.com/doc/en/vdmmlcdc/8.1/lesysoptsref/p1d8hx95jb53wqn0zzvawxw94nvi.htm
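A sketch of how those options might be applied; the sizes are illustrative guesses to be tuned, and the dataset names are placeholders:

```sas
/* BUFSIZE= sets the page size of the output dataset (fixed at creation);
   BUFNO= sets how many buffers are held in memory per I/O request. */
proc sort data=biglib.big_table
          out=work.big_sorted (bufsize=128k bufno=20);
    by key1 key2 key3;
run;
```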
Sometimes the TAGSORT option can help. Read the documentation on the option, as one of the key bits is "
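For illustration (placeholder names), TAGSORT stores only the BY variables plus observation numbers ("tags") in the temporary files, then uses the tags to retrieve the full records in order. This can reduce temporary disk usage when the BY keys are small relative to the record, but often increases elapsed time:

```sas
/* Temp files hold only the sort keys and observation numbers,
   not the full 40-plus-variable records */
proc sort data=biglib.big_table out=work.big_sorted tagsort;
    by key1 key2 key3;
run;
```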
Thanks. I tried to find documentation on the SORTSIZE option, but I don't know what values to set it to in order to see if it improves performance. Do you have any examples?
BEFORE you start tinkering, analyze.
What does the log of the PROC SORT say?
Run it with
options fullstimer;
set, and then post the log here. This will give us a picture of the bottleneck.
The procedure will read and write the whole data set 4 times (read the input, write the temporary file, read the temporary file, write the result), so you move ~170 GB through your storage. That translates to ~4 GB/min, or ~65 MB/s, which is far below what current disks can deliver (~200 MB/s for a single, non-striped SSD).
While the procedure runs, look at the log to see when the first message (observations read) appears. If your source library in fact sits behind a slow network, this may explain the poor performance.
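The back-of-envelope arithmetic above can be checked in a few lines (assuming the 4-pass model and the stated 42 GB / 44 min figures):

```python
# Throughput estimate: 4 full passes over a 42 GB table in 44 minutes
table_gb = 42          # table size
passes = 4             # read input, write temp, read temp, write output
minutes = 44           # observed elapsed time

total_gb = table_gb * passes                   # ~168 GB moved in total
gb_per_min = total_gb / minutes                # ~3.8 GB/min
mb_per_sec = total_gb * 1024 / (minutes * 60)  # ~65 MB/s effective throughput
```

Comparing ~65 MB/s against what the storage should sustain is what points to I/O (or contention) as the bottleneck.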
I report here the log with the FULLSTIMER option (the time is lower, perhaps because I'm using another server, the development one; in the production logs it took 43 minutes. That is another thing I need to understand.)
proc sort data = TAB_INP out = work.etls_sortedxref(keep=COD_TIPO_MOV DT_CONTABILE GARANZIA
ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA TS_INIZIO_VALIDITA TS_FINE_VALIDITA ANTIRACKET BANCA CODICEFILIALE CONT_FRAZ
CONT_RINNOVO DT_EFFETTO DT_RIATTIVAZIONE DT_SCADENZA DT_SOSPENSIONE DES_GARANZIA EMS_MOVIMENTO GESTOCOMMERCIALE INTERMEDIARIO2
INVOICETYPE EU_IMP_LORDO EU_PREMIO_NETTO PRODOTTO FAM_PRODOTTO EU_PROV_ACQ_INT2 PROVINCIA EU_PROVV_ACQ EU_PROVV_GG_PRECONT
RAMOBILANCIO EU_SSN EU_TASSE_PREMI TIPOLOGIA_PAGAMENTO TIPOLOGIA_PAGMOVIMENTO EU_TOT_TASSE EU_TXCOMM EU_TXTASSA);
by COD_TIPO_MOV DT_CONTABILE GARANZIA ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA
TS_FINE_VALIDITA;
run;
NOTE: There were 35800340 observations read from the data set TAB_INP.
NOTE: The data set WORK.ETLS_SORTEDXREF has 35800340 observations and 40 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 7:34.39
user cpu time 1:11.46
system cpu time 1:14.73
memory 31614078.68k
OS Memory 31632696.00k
Timestamp 22/02/2023 08:07:18 m.
Step Count 6 Switch Count 565
Your CPU needs roughly 2.5 minutes, while in real time you spend three times that. This means that, as expected, I/O is your bottleneck (if that CPU time is actually the sum of several threads running in parallel, the disproportion is even stronger). But 7 minutes for a 42 GB sort isn't bad at all.
So it looks as if your production server (if it has the same CPU hardware) is congested by multiple processes running in parallel; if these are SAS processes using the same WORK disks, that is where you need to invest.
Think of adding disks (same size as WORK) and using them for UTILLOC to split the load.
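UTILLOC is a system option that can only be set at SAS invocation, so the split would be configured in the configuration file or on the command line; the paths below are hypothetical and should point at hardware separate from WORK:

```sas
/* In sasv9.cfg (or as command-line options at SAS startup) -
   example paths only; they must exist and sit on separate devices */
-work    /saswork
-utilloc ('/sasutil1' '/sasutil2')
```

With this split, PROC SORT's utility (temporary) files land on the UTILLOC devices while the input/output datasets stay on WORK, spreading the I/O load.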
Thanks for the very useful information. What exactly do you mean by "adding disks"? Do you always add disks under SAS WORK? I didn't quite understand what to do.
"Disk" depends on how your storage is set up; in today's world, this often means fully virtualized devices. Talk with your sysadmins about how you can improve your I/O throughput, or how you can schedule jobs in a way that reduces contention for available resources.
A separate UTILLOC (this is where the temporary files created during sorting reside) only makes sense if you can add devices that do not share hardware with the existing WORK.
Thanks, Kurt. I am also attaching the log from the production server. You can see that the gap between real time (44:07.42) and CPU time (3:27.46) is even greater. Does this lend further support to your thesis?
proc sort data = TAB_INP out = work.etls_sortedxref(keep=COD_TIPO_MOV DT_CONTABILE GARANZIA
ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA TS_INIZIO_VALIDITA TS_FINE_VALIDITA ANTIRACKET BANCA CODICEFILIALE CONT_FRAZ
CONT_RINNOVO DT_EFFETTO DT_RIATTIVAZIONE DT_SCADENZA DT_SOSPENSIONE DES_GARANZIA EMS_MOVIMENTO GESTOCOMMERCIALE INTERMEDIARIO2
INVOICETYPE EU_IMP_LORDO EU_PREMIO_NETTO PRODOTTO FAM_PRODOTTO EU_PROV_ACQ_INT2 PROVINCIA EU_PROVV_ACQ EU_PROVV_GG_PRECONT
RAMOBILANCIO EU_SSN EU_TASSE_PREMI TIPOLOGIA_PAGAMENTO TIPOLOGIA_PAGMOVIMENTO EU_TOT_TASSE EU_TXCOMM EU_TXTASSA);
by COD_TIPO_MOV DT_CONTABILE GARANZIA ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA
TS_FINE_VALIDITA;
run;
NOTE: There were 34159292 observations read from the data set TAB_INP.
NOTE: The data set WORK.ETLS_SORTEDXREF has 34159292 observations and 40 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 44:07.42
cpu time 3:27.46
To add to what @Kurt_Bremser wrote:
The table name work.etls_sortedxref makes me think this is a step of DIS-generated SCD2 loader code. If so, and you're actually trying to improve performance of the whole load process, then I've ended up more than once creating a custom transformation. This is especially true for target tables in a database.
The SAS log you shared doesn't show it, but do you also get compression messages in the log? If not, setting OPTIONS COMPRESS=YES would likely improve elapsed times.
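For reference, the session-wide setting is a single statement; when compression takes effect, the log then prints notes about the compressed dataset (the exact wording varies by release):

```sas
/* Compress all datasets created in this session; the log will show
   a NOTE for each compressed dataset reporting the size reduction */
options compress=yes;
```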
Yes, of course, it is a Data Integration process. Does that mean anything in particular?
Yes, I confirm that the COMPRESS option will be introduced to improve performance.
SORTSIZE=n | nK | nM | nG | nT | hexX | MIN | MAX
Note: You can also use the KB, MB, GB, and TB syntax notations.
Specifies the amount of memory that is available to the SORT procedure.
There is also SORTCUTP:
SORTCUTP=n | nK | nM | nG | MIN | MAX | hexX
Note: You can also use the KB, MB, and GB syntax notation.
Specifies the size of the data in bytes above which the host sort is likely to perform more efficiently than the internal sort. SORTCUTP is used only when SORTCUT=0.
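A sketch of how SORTSIZE= might be applied; the 16G value is an illustrative guess (it must fit within MEMSIZE), and the dataset names are placeholders:

```sas
/* Give PROC SORT more memory so the sort needs fewer passes;
   16G is a placeholder value to tune against available MEMSIZE */
options sortsize=16G;

/* SORTSIZE= can also be set on the PROC SORT statement itself */
proc sort data=biglib.big_table out=work.big_sorted sortsize=16G;
    by key1 key2 key3;
run;
```

As a rough rule, if SORTSIZE is large enough to hold the whole table, the sort can complete in memory without utility-file passes; otherwise more memory still means fewer, larger merge passes.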
How are you creating or updating this table in the first place? If you are updating it, it may be possible to keep the sort order intact and avoid sorting entirely. If the table is sourced from an external database, sort the data there while reading it, as that will most likely be faster than sorting later in SAS.
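One way to push the sort to the database is an ORDER BY in PROC SQL, which SAS can pass through to the database engine; the libref DBLIB and the names below are hypothetical:

```sas
/* DBLIB is a placeholder libref pointing at the source database;
   with implicit pass-through, the ORDER BY runs in the database */
proc sql;
    create table work.big_sorted as
        select *
        from dblib.big_table
        order by key1, key2, key3;
quit;
```

Whether this wins depends on the database's sorting resources and the network link, so it is worth benchmarking against the in-SAS sort.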