
I have a 42 GB table with 44 variables and 34 million observations, and a PROC SORT by 7 variables takes about 44 minutes.
I wonder if anyone has any specific advice for reducing processing times in cases like this. Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

32212254720 B = 31457280 KB = 30720 MB = 30 GB, nice setup 🙂




29 REPLIES
Tom
Super User

@mariopellegrini wrote:

I have a 42 GB table with 44 variables and 34 million observations, and a PROC SORT by 7 variables takes about 44 minutes.
I wonder if anyone has any specific advice for reducing processing times in cases like this. Thanks


First, make sure that you are using the fastest disk on your system. Typically the WORK library is assigned to a fast disk, so try sorting from/to WORK.

 

Also, since you seem to have long records, the COMPRESS= dataset option may help.  You might end up using more CPU time, but it should save on disk I/O operations, which are orders of magnitude slower.
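A minimal sketch of how that could look (dataset and BY-variable names below are placeholders, not from this thread):

/* Compress the sorted output to cut disk I/O; COMPRESS=BINARY can work
   better when the records are mostly numeric. */
proc sort data=mylib.big_table
          out=work.big_sorted (compress=yes);
  by key1 key2;
run;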

 

If you have access to it, you might try using the SPDE engine.
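For example, a sketch of an SPDE library spread across several paths (all paths are placeholders):

/* SPDE partitions the data across the DATAPATH locations and can
   read/write them in parallel. */
libname speedy spde '/sas/spde/meta'
        datapath=('/disk1/spde' '/disk2/spde')
        partsize=2g;

proc sort data=speedy.big_table out=speedy.big_sorted;
  by key1 key2;
run;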

 

john_mccall
SAS Employee

Hi.

 

Consider trying the BUFSIZE= and BUFNO= options for the output table.  Please see the online documentation:

https://go.documentation.sas.com/doc/en/vdmmlcdc/8.1/lesysoptsref/p1d8hx95jb53wqn0zzvawxw94nvi.htm
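A sketch of applying them as dataset options on the output (the values shown are illustrative only, not recommendations; tune them for your system):

/* Larger pages (BUFSIZE=) and more buffers (BUFNO=) reduce the number
   of I/O calls at the cost of memory. */
proc sort data=have out=want (bufsize=128k bufno=10);
  by key1 key2;
run;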

ballardw
Super User

Sometimes the TAGSORT option can help. READ the documentation on the option, as one of the key bits is: "in cases where the total number of bytes of the sort keys is small compared with the length of the record, temporary disk use is reduced considerably".
 
PROC SORT in effect makes three copies of the data set, so limitations of disk space come up and can result in a lot of extra disk accesses in temporary swap files. So where your temp space resides can make a difference. The TAGSORT option reduces how much is swapped, but working on a faster disk makes a difference.
If you are working in a networked environment on a server your allocated work space may be a choke point.
 
You might also check your setting for the SORTSIZE system option. If you have more memory available than that, you could set the SORTSIZE option in the PROC SORT code to use more memory for that run and reduce the amount of disk swapping likely to occur.
 
If there is any chance that the data may already be in sort order, you might also use the PRESORTED option, which checks the data set and only sorts it if it is not already in order by the BY variables.
 
If your data is coming from an external source, such as a database, you might have that source order the records before they are used by SAS.
 
Also consider if you really need to sort by all 7 variables.
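A sketch of how a couple of these options look in code (dataset and key names are placeholders, and the SORTSIZE value is illustrative):

/* TAGSORT sorts only the keys plus record tags, then retrieves the
   records; SORTSIZE= raises the memory ceiling for this run only. */
proc sort data=have out=want tagsort sortsize=4g;
  by key1 key2 key3;
run;

/* PRESORTED verifies the order first and skips the sort if the data
   are already in BY order. */
proc sort data=have out=want presorted;
  by key1 key2 key3;
run;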
 
 
mariopellegrini
Pyrite | Level 9

Thanks. I tried to look for documentation on the SORTSIZE option, but I don't know what value to set it to in order to see if it improves performance. Do you have any examples?

Kurt_Bremser
Super User

BEFORE you start tinkering, analyze.

What does the log of the PROC SORT say?

Run it with

options fullstimer;

set and then post the log here. This will give us a picture of the bottleneck.

 

The procedure will read and write the whole data 4 times (read the dataset, write the temporary file, read the temporary file, write the result), so you move ~170 GB through your storage, which translates to ~4 GB/min, or ~67 MB/sec, which is well below what proper current disks can do (200 MB/sec for a single, non-striped SSD).

While you run the procedure, take a look at the log to see when the first message (observations read) appears. If your source library is in fact something which goes over a slow network, this may explain the bad performance.

mariopellegrini
Pyrite | Level 9

 

I report here the log with the FULLSTIMER option (the time is lower, perhaps because I'm using another server, the development one, but in the production logs it took 43 minutes. This is another thing I need to understand)

 

39 proc sort data = TAB_INP out = work.etls_sortedxref(keep=COD_TIPO_MOV DT_CONTABILE GARANZIA
40 ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA TS_INIZIO_VALIDITA TS_FINE_VALIDITA ANTIRACKET BANCA CODICEFILIALE
40 ! CONT_FRAZ
41 CONT_RINNOVO DT_EFFETTO DT_RIATTIVAZIONE DT_SCADENZA DT_SOSPENSIONE DES_GARANZIA EMS_MOVIMENTO GESTOCOMMERCIALE
41 ! INTERMEDIARIO2
42 INVOICETYPE EU_IMP_LORDO EU_PREMIO_NETTO PRODOTTO FAM_PRODOTTO EU_PROV_ACQ_INT2 PROVINCIA EU_PROVV_ACQ
42 ! EU_PROVV_GG_PRECONT
43 RAMOBILANCIO EU_SSN EU_TASSE_PREMI TIPOLOGIA_PAGAMENTO TIPOLOGIA_PAGMOVIMENTO EU_TOT_TASSE EU_TXCOMM EU_TXTASSA)
44 ;
45 by COD_TIPO_MOV DT_CONTABILE GARANZIA ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA
46 TS_FINE_VALIDITA;
47 run;

NOTE: There were 35800340 observations read from the data set TAB_INP.
NOTE: The data set WORK.ETLS_SORTEDXREF has 35800340 observations and 40 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 7:34.39
user cpu time 1:11.46
system cpu time 1:14.73
memory 31614078.68k
OS Memory 31632696.00k
Timestamp 22/02/2023 08:07:18 m.
Step Count 6 Switch Count 565

Kurt_Bremser
Super User

Your CPU needs roughly 2.5 minutes, while in real time you spend three times that. This means that (as expected) I/O is your bottleneck (if that CPU time is actually that of several threads running in parallel, the disproportion is even stronger). But 7 minutes for a 42 GB sort isn't bad at all.

So it looks like your production server (if it has the same CPU hardware) is congested by multiple processes running in parallel; if these are SAS processes using the same WORK disks, you need to invest there.

Consider adding disks (same size as WORK) and using them for UTILLOC to split the load.

mariopellegrini
Pyrite | Level 9

Thanks for the very useful information. What exactly do you mean by "adding disks"? Do you mean adding disks to the SAS WORK location? I didn't quite understand what to do.

Kurt_Bremser
Super User

"Disk" depends on how your storage is set up. In today's world, this often means fully virtualized devices. Talk with your sysadmins how you can improve your I/O throughput, or how you can schedule jobs in a way that reduces contention for available resources.

A separate UTILLOC (this is where the temporary files during sorting reside) only makes sense if you can add devices that do not share hardware with the existing WORK.
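For reference, UTILLOC can only be set at SAS startup, so it goes into the config file or onto the command line. A sketch, with both paths as placeholders that should sit on devices separate from WORK:

-utilloc ("/sasutil1" "/sasutil2")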

mariopellegrini
Pyrite | Level 9

Thanks Kurt. I am also attaching the log for the production server. It can be seen that the gap between real time (44:07.42) and CPU time (3:27.46) is even greater. Does this further support your hypothesis?

 

proc sort data = TAB_INP out = work.etls_sortedxref(keep=COD_TIPO_MOV DT_CONTABILE GARANZIA
ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA TS_INIZIO_VALIDITA TS_FINE_VALIDITA ANTIRACKET BANCA CODICEFILIALE CONT_FRAZ
CONT_RINNOVO DT_EFFETTO DT_RIATTIVAZIONE DT_SCADENZA DT_SOSPENSIONE DES_GARANZIA EMS_MOVIMENTO GESTOCOMMERCIALE INTERMEDIARIO2
INVOICETYPE EU_IMP_LORDO EU_PREMIO_NETTO PRODOTTO FAM_PRODOTTO EU_PROV_ACQ_INT2 PROVINCIA EU_PROVV_ACQ EU_PROVV_GG_PRECONT
RAMOBILANCIO EU_SSN EU_TASSE_PREMI TIPOLOGIA_PAGAMENTO TIPOLOGIA_PAGMOVIMENTO EU_TOT_TASSE EU_TXCOMM EU_TXTASSA);
by COD_TIPO_MOV DT_CONTABILE GARANZIA ID COD_MOVIMENTO NUM_COMPONENTE num_rev_contr POLIZZA
TS_FINE_VALIDITA;
run;

NOTE: There were 34159292 observations read from the data set TAB_INP.
NOTE: The data set WORK.ETLS_SORTEDXREF has 34159292 observations and 40 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 44:07.42
cpu time 3:27.46

 

 

Patrick
Opal | Level 21

To add to what @Kurt_Bremser wrote:

The table name work.etls_sortedxref makes me think that this is a step of DIS-generated SCD2 loader code. If so, and you're actually trying to improve performance of the whole load process, then I've ended up more than once creating a custom transformation. This is especially true for target tables in a database.

 

The SAS log you shared doesn't show it, but do you also get the compression messages in the log? If not, then setting OPTIONS COMPRESS=YES would likely improve elapsed times.

mariopellegrini
Pyrite | Level 9

Yes, of course, it is a Data Integration process. Does this mean anything in particular?
Yes, I confirm that the COMPRESS option will be introduced to improve performance.

john_mccall
SAS Employee
Here are the options:
SORTSIZE=n | nK | nM | nG | nT | hexX | MIN | MAX
Note: You can also use the KB, MB, GB, and TB syntax notations.
Specifies the amount of memory that is available to the SORT procedure.

There is also SORTCUTP:

SORTCUTP=n | nK | nM | nG | MIN | MAX | hexX
Note: You can also use the KB, MB, and GB syntax notation.
Specifies the size of the data in bytes above which the host sort is likely to perform more efficiently than the internal sort. SORTCUTP is used only when SORTCUT=0.
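Since an example was asked for above, here is a sketch (the 8g value is purely illustrative; keep it below MEMSIZE and the memory actually free on your server):

/* For a single sort only: */
proc sort data=have out=want sortsize=8g;
  by key1 key2;
run;

/* Or session-wide: */
options sortsize=8g;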

SASKiwi
PROC Star

How are you creating or updating this table in the first place? If you are updating it, it may be possible to keep the sort order intact and avoid sorting it entirely. If the table is being sourced from an external database, sort the data there while reading it, as that will most likely be faster than sorting later in SAS.
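A sketch of pushing the sort into the database with implicit pass-through (the libref, connection details, and the shortened BY list are placeholders):

/* SAS/ACCESS can pass the ORDER BY to the DBMS, so the rows arrive
   at SAS already sorted. */
libname db oracle user=myuser password=mypw path=mydb;

proc sql;
  create table work.sorted as
  select *
  from db.tab_inp
  order by cod_tipo_mov, dt_contabile, garanzia, id;
quit;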

