Hi !
It’s my first thread 😀
I’m in the following situation:
Left join on five tables (first table: 100 million rows / 15 columns / about 15GB
second table: 300 million rows / 15 columns / about 80GB
third table: 10 000 rows / 10 columns / about 2MB
fourth table: 100 million rows / 40 columns / about 40GB
fifth table: 50 rows / 5columns / about 150KB)
Filtering with IN clauses and Date intervals
Selecting multiple variables and applying some Text functions and CASE WHEN statements
I’d like to improve the performances of my code, especially the speed.
I tried filtering the largest tables before joining them and also to compress them with a binary option, but my performances didn’t improve so much.
I’d like to know if you have any advice to suggest me to make it better ( I’ve already read this interesting topic and tried some methods from Making SAS® Tables Smaller and Faster Without Data Loss - SAS Support Communities, but without meaningful results).
Hope you can help me and in any case thanks in advance!
0) it's hard to optimise if there is no code...
1) does it have to be always "left join"?
2) is it "many-to-many" join?
3) are you adding those two little table, or are they just for filtering?
4) what was the result of the filtering, were those tables much "smaller"?
5) how long does it run ( options fullstimer; )?
6) what is the size of your session (RAM available)?
Bart
Hi @Daniele_R
Here is my first stab at this
Left join on five tables (first table: 100 million rows / 15 columns / about 15GB
second table: 300 million rows / 15 columns / about 80GB
third table: 10 000 rows / 10 columns / about 2MB /* This can be loaded into a Hash Object */
fourth table: 100 million rows / 40 columns / about 40GB
fifth table: 50 rows / 5columns / about 150KB) /* This can be loaded into a Hash Object */
Filtering with IN clauses and Date intervals
Selecting multiple variables and applying some Text functions and CASE WHEN statements
Ensure you have the Proper Simple(Single column based) / Complex (Muliple columns based) Indexes defined for the tables below, evaluate based on filtering criteria
Here couple of papers to get you started on SAS Indexes
Depending on the nature of your tables:
One other suggestion I would recommend,
Hope this helps
Thank you very much for all these tips to improve our data processing.
I will start reading the articles and experimenting with the suggested optimizations.
On the CPU count option, I find CPU use is relatively low in SAS, especially with large datasets. There is loads of room for SAS to use higher compression on datasets and hence more CPU and I don't know why this hasn't been done. I found during lockdown compressing a compressed SAS dataset with 7z on fast could reduce filesize by ~90%. Depending how the compressed dataset is unpacked in the computation this may yield some benefit. It certainly would in writing output datasets.
Hi @bpowell
There are ways to reduce the SAS data set size and optimize it for processing without compression.
This paper Twenty Ways to Run Your SAS® Program Faster and Use Less Space
I personally follow and advise my users to follow these points.
6. Numeric variables in SAS data sets have a default length of 8. If the values of the
numeric variable are all integers, you can reduce the space by using the following table.
7. Sometimes character variables imported into SAS from other systems, like Oracle or Excel, have very large lengths. You can use the following procedure to get the shortest possible length for your character variable, although you might want to allow room for growth:
a. Use the LENGTH function to calculate the actual length of the variable in each observation in the data set.
b. Use the MAX option in PROC SUMMARY to get the largest value of the length.
c. Use the LENGTH statement to shorten the length of the character variable to the maximum length.
8. Switch variables from numeric to character if they are integers and range in value from -9 to 99. The minimum length for numeric variables is 3, so you can save space if the variable can fit into one or two characters.
9. Switch variables from character to numeric if they are all integers and occupy more than 3 bytes. For example, the number 1234 would occupy 4 bytes as a character variable but item 6 above shows it would only occupy 3 bytes as a numeric variable.
Note: When the data set record length is reduced, that means more records can fit into the buffer, which in turn means less time to read the data
"There is loads of room for SAS to use higher compression on datasets and hence more CPU and I don't know why this hasn't been done." It has been done. Check out the COMPRESS option.
I can get compression ratios with some of my tables approaching 90% if character columns have a lot of blank space.
The 7z compression compresses everything, which also means that one would need to uncompress everything before use.
SAS compresses observations, which means that individual observations can still be found within a dataset page without having to uncompress the whole dataset first.
And you need to run the same test with COMPRESS=BINARY. COMPRESS=YES only does RLE (run length encoding) compression (which is almost undetectable in terms of CPU usage).
It's been my experience that multi-table joins in SQL tend to be very slow and consume excessive resources (disk space in WORK).
We need to see your current code (accompanied by some example data in DATA steps with DATALINES, so we know the variable attributes) to throw the whole weight of our experience at your problem.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.