BookmarkSubscribeRSS Feed
Daniele_R
Calcite | Level 5

Hi !

 

It’s my first thread 😀

I’m in the following situation:

 

Left join on five tables (first table: 100 million rows / 15 columns / about 15GB

                                     second table: 300 million rows / 15 columns / about 80GB

                                     third table: 10 000 rows / 10 columns / about 2MB

                                     fourth table: 100 million rows / 40 columns / about 40GB

                                     fifth table: 50 rows / 5columns / about 150KB)

Filtering with IN clauses and Date intervals

Selecting multiple variables and applying some Text functions and CASE WHEN statements

 

I’d like to improve the performances of my code, especially the speed.

 

I tried filtering the largest tables before joining them and also to compress them with a binary option, but my performances didn’t improve so much.

 

I’d like to know if you have any advice to suggest me to make it better ( I’ve already read this interesting topic and tried some methods from Making SAS® Tables Smaller and Faster Without Data Loss - SAS Support Communities, but without meaningful results).

 

Hope you can help me and in any case thanks in advance!

13 REPLIES 13
bpowell
Obsidian | Level 7
Are the tables native sas, are they indexed, do you have any I/O constraints e.g. network, disk?
Sometimes with SAS I've found multi-table joins don't seem to be well optimised and I break the query into smaller chunks. At least I do this to get a baseline.
yabwon
Onyx | Level 15

0) it's hard to optimise if there is no code... 

1) does it have to be always "left join"?

2) is it "many-to-many" join?

3) are you adding those two little table, or are they just for filtering? 

4) what was the result of the filtering, were those tables much "smaller"? 

5) how long does it run ( options fullstimer; )? 

6) what is the size of your session (RAM available)?

 

Bart

 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



bpowell
Obsidian | Level 7
I was also wondering about RAM but I expect you'd need, what, 128GB to make a difference on this query? Most will be done in I/O. I don't use more than 16-24GB and the 80GB dataset I normalize can't be done much in memory obviously.
AhmedAl_Attar
Rhodochrosite | Level 12

Hi @Daniele_R 

Here is my first stab at this

 

Left join on five tables (first table: 100 million rows / 15 columns / about 15GB

                                     second table: 300 million rows / 15 columns / about 80GB

                                     third table: 10 000 rows / 10 columns / about 2MB /* This can be loaded into a Hash Object */

                                     fourth table: 100 million rows / 40 columns / about 40GB

                                     fifth table: 50 rows / 5columns / about 150KB) /* This can be loaded into a Hash Object */

Filtering with IN clauses and Date intervals

Selecting multiple variables and applying some Text functions and CASE WHEN statements

 

Ensure you have the Proper Simple(Single column based) / Complex (Muliple columns based) Indexes defined for the tables below, evaluate based on filtering criteria 

  • 100 million rows / 15 columns / about 15GB
  • 300 million rows / 15 columns / about 80GB
  • 100 million rows / 40 columns / about 40GB

Here couple of papers to get you started on SAS Indexes

Depending on the nature of your tables:

  • External Relational Database Management System (RDBMS) ---> Most have partitioning features, see if you can utilize them if applicable
  • SAS Data sets. I would investigate using SPDE to store such large tables.

One other suggestion I would recommend,

  • Consider creating temporary tables (if possible) by filtering these large tables before your query, then use them in your query. (This is a workaround for using CTEs in SQL)

Hope this helps

Daniele_R
Calcite | Level 5

Thank you very much for all these tips to improve our data processing.

 

I will start reading the articles and experimenting with the suggested optimizations.

AhmedAl_Attar
Rhodochrosite | Level 12
In addition, I would highly recommend looking into your SAS environment and invocation settings.
Options that can enhance performance include:
-memsize (Default: 2G)
-sortsize (Default: 1G)
-CPUCOUNT (Default: 4)
-UTILLOC (Default: work)
Changing these defaults will help with the performance issues.
Here is another paper to read 😉
https://support.sas.com/resources/papers/proceedings09/333-2009.pdf
bpowell
Obsidian | Level 7

On the CPU count option, I find CPU use is relatively low in SAS, especially with large datasets. There is loads of room for SAS to use higher compression on datasets and hence more CPU and I don't know why this hasn't been done. I found during lockdown compressing a compressed SAS dataset with 7z on fast could reduce filesize by ~90%. Depending how the compressed dataset is unpacked in the computation this may yield some benefit. It certainly would in writing output datasets.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @bpowell 

 

There are ways to reduce the SAS data set size and optimize it for processing without compression.

This paper Twenty Ways to Run Your SAS® Program Faster and Use Less Space 

I personally follow and advise my users to follow these points.

 

6. Numeric variables in SAS data sets have a default length of 8. If the values of the
numeric variable are all integers, you can reduce the space by using the following table.

 

7. Sometimes character variables imported into SAS from other systems, like Oracle or Excel, have very large lengths. You can use the following procedure to get the shortest possible length for your character variable, although you might want to allow room for growth:
a. Use the LENGTH function to calculate the actual length of the variable in each observation in the data set.
b. Use the MAX option in PROC SUMMARY to get the largest value of the length.
c. Use the LENGTH statement to shorten the length of the character variable to the maximum length.

 

8. Switch variables from numeric to character if they are integers and range in value from -9 to 99. The minimum length for numeric variables is 3, so you can save space if the variable can fit into one or two characters.


9. Switch variables from character to numeric if they are all integers and occupy more than 3 bytes. For example, the number 1234 would occupy 4 bytes as a character variable but item 6 above shows it would only occupy 3 bytes as a numeric variable.

 

Note: When the data set record length is reduced, that means more records can fit into the buffer, which in turn means less time to read the data

SASKiwi
PROC Star

"There is loads of room for SAS to use higher compression on datasets and hence more CPU and I don't know why this hasn't been done." It has been done. Check out the COMPRESS option.

 

I can get compression ratios with some of my tables approaching 90% if character columns have a lot of blank space.

bpowell
Obsidian | Level 7
Well, excuse me but "duh".

40 data test;
41 length test $1000.;
42 do i=1 to 100000000;
43 test="
43 ! ....
43 ! ...";
NOTE: The quoted string currently being processed has become more than 262 characters long.
You might have unbalanced quotation marks.
44 output;
45 end;
46 drop i;
47 run;

NOTE: The data set WORK.TEST has 100000000 observations and 1 variables.
NOTE: Compressing data set WORK.TEST decreased size by 97.82 percent.
Compressed is 33607 pages; un-compressed would require 1538462 pages.
NOTE: DATA statement used (Total process time):
real time 56.63 seconds
cpu time 48.06 seconds

Output dataset size is 2gb *compressed*. SAS would have written 100s of GB without the compress option being set. That's a 1000 character string about 99% whitespace.

Now, if, as I said, you compress the output dataset with 7z, on fast, the resulting output dataset is <200mb. Like I said, SAS has * a lot * of scope for additional compression.
Kurt_Bremser
Super User

The 7z compression compresses everything, which also means that one would need to uncompress everything before use.

SAS compresses observations, which means that individual observations can still be found within a dataset page without having to uncompress the whole dataset first.

And you need to run the same test with COMPRESS=BINARY. COMPRESS=YES only does RLE (run length encoding) compression (which is almost undetectable in terms of CPU usage).

bpowell
Obsidian | Level 7
Yes - I know SAS can't read a zipped dataset. My point is the overall compression is poor. Does anything you say mean SAS could not use greater compression using more CPU? No. I would be surprised if there wasn't at least some scope to further leverage the current multi-cpu universe we inhabit. I don't believe compress=YES has been updated since 7bdat which is what, 30 years old? Then multi-cpu was mainframe and cpu time was costly. No longer the case. Time for an upgrade.
Kurt_Bremser
Super User

It's been my experience that multi-table joins in SQL tend to be very slow and consume excessive resources (disk space in WORK).

  • Split the process into several steps
  • Use SORT and DATA steps rather than SQL (unless you explicitly need a cartesian product)
  • Do lookups with hash objects, which can be integrated in a DATA step which does another MERGE

We need to see your current code (accompanied by some example data in DATA steps with DATALINES, so we know the variable attributes) to throw the whole weight of our experience at your problem.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2147 views
  • 6 likes
  • 6 in conversation