Hi!
It’s my first thread 😀
I’m in the following situation:
Left join on five tables:
- first table: 100 million rows / 15 columns / about 15 GB
- second table: 300 million rows / 15 columns / about 80 GB
- third table: 10,000 rows / 10 columns / about 2 MB
- fourth table: 100 million rows / 40 columns / about 40 GB
- fifth table: 50 rows / 5 columns / about 150 KB
Filtering with IN clauses and Date intervals
Selecting multiple variables and applying some Text functions and CASE WHEN statements
I'd like to improve the performance of my code, especially its speed.
I tried filtering the largest tables before joining them, and also compressing them with the BINARY option, but performance didn't improve much.
I'd like to know if you have any advice on how to make it better (I've already read the interesting topic Making SAS® Tables Smaller and Faster Without Data Loss on SAS Support Communities and tried some of its methods, but without meaningful results).
Hope you can help me; in any case, thanks in advance!
Sometimes with SAS I've found that multi-table joins don't seem to be well optimised, so I break the query into smaller chunks. At the very least, that gives me a baseline.
0) it's hard to optimise if there is no code...
1) does it always have to be a "left join"?
2) is it a "many-to-many" join?
3) are you joining those two little tables to add columns, or are they just for filtering?
4) what was the result of the filtering; were those tables much "smaller"?
5) how long does it run (check with options fullstimer;)?
6) what is the size of your session (RAM available)?
Bart
Hi @Daniele_R
Here is my first stab at this:
Left join on five tables:
- first table: 100 million rows / 15 columns / about 15 GB
- second table: 300 million rows / 15 columns / about 80 GB
- third table: 10,000 rows / 10 columns / about 2 MB /* this can be loaded into a hash object */
- fourth table: 100 million rows / 40 columns / about 40 GB
- fifth table: 50 rows / 5 columns / about 150 KB /* this can be loaded into a hash object; see the sketch after this list */
Filtering with IN clauses and date intervals
Selecting multiple variables and applying some text functions and CASE WHEN statements
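To make the hash-object idea concrete, here is a minimal sketch; the library, table, and variable names (lib.small_table, lib.big_table, id, attr1, attr2) are placeholders, not your actual data. The small lookup table is loaded into memory once, and each row of the big table is matched without sorting either side:

data work.joined;
    /* make the lookup table's variables known to the PDV */
    if 0 then set lib.small_table;
    if _n_ = 1 then do;
        /* load the 10,000-row table into memory once */
        declare hash h(dataset: "lib.small_table");
        h.defineKey("id");                 /* join key           */
        h.defineData("attr1", "attr2");    /* columns to pick up */
        h.defineDone();
    end;
    set lib.big_table;                     /* the 100-million-row driver */
    /* FIND copies attr1/attr2 on a match; clearing them on a miss
       reproduces left-join behaviour */
    if h.find() ne 0 then call missing(attr1, attr2);
run;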
Ensure you have the proper simple (single-column) or complex (multiple-column) indexes defined for the tables below; evaluate based on your filtering criteria:
- 100 million rows / 15 columns / about 15GB
- 300 million rows / 15 columns / about 80GB
- 100 million rows / 40 columns / about 40GB
Here are a couple of papers to get you started on SAS indexes; a small example follows below.
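For example, indexes can be created with PROC DATASETS; the table and column names below are placeholders, so index whatever columns actually appear in your WHERE and ON clauses:

proc datasets library=lib nolist;
    modify big_table1;
    index create cust_id;                       /* simple (single-column) index */
    index create cust_dt = (cust_id tx_date);   /* complex (multi-column) index */
quit;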
Depending on the nature of your tables:
- External Relational Database Management System (RDBMS) ---> Most have partitioning features, see if you can utilize them if applicable
- SAS data sets ---> I would investigate using SPDE (the Scalable Performance Data Engine) to store such large tables; a sketch follows below.
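A minimal SPDE sketch, assuming you have more than one file system to spread the I/O over (all paths are made up):

/* SPDE partitions a table across paths and can read them in parallel */
libname speedy spde '/sas/meta'                /* metadata location */
    datapath=('/disk1/spde' '/disk2/spde')     /* data partitions   */
    indexpath=('/disk3/spde')                  /* index files       */
    partsize=2g;                               /* partition size    */

data speedy.big_table2;     /* copy a large table into the SPDE library */
    set lib.big_table2;
run;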
One other suggestion I would recommend:
- Consider creating temporary tables (if possible) by filtering these large tables before your query, then use them in your query; a sketch follows below. (This is a workaround for the lack of CTEs in PROC SQL.)
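A sketch of that pre-filtering step, with made-up table, column, and filter names; the point is to apply the WHERE clause and drop unneeded columns before the join ever sees the table:

data work.t2_small;
    set lib.big_table2;
    where status in ('A', 'B')
      and '01JAN2023'd <= event_date <= '31DEC2023'd;
    keep id event_date status amount;   /* only what the join needs */
run;

Then reference work.t2_small instead of lib.big_table2 in the join.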
Hope this helps
Thank you very much for all these tips to improve our data processing.
I will start reading the articles and experimenting with the suggested optimizations.
Options that can enhance performance include:
-memsize (Default: 2G)
-sortsize (Default: 1G)
-CPUCOUNT (Default: 4)
-UTILLOC (Default: work)
Changing these defaults can help with the performance issues.
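One caveat: MEMSIZE and UTILLOC are invocation-only options, so they have to be set at start-up (command line or config file), while SORTSIZE and CPUCOUNT can be changed mid-session. A sketch with example values only; size them to your own machine:

/* in sasv9.cfg or on the command line (cannot be changed mid-session):
     -memsize 16g
     -utilloc "/fastdisk/sasutil"
*/

/* in-session: */
options sortsize=4g        /* memory PROC SORT may use          */
        cpucount=actual    /* use all available cores           */
        fullstimer;        /* detailed timing notes in the log  */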
Here is another paper to read 😉
https://support.sas.com/resources/papers/proceedings09/333-2009.pdf
On the CPUCOUNT option: I find CPU use is relatively low in SAS, especially with large datasets. There is plenty of room for SAS to use higher compression on datasets (and hence more CPU), and I don't know why this hasn't been done. I found during lockdown that compressing an already-compressed SAS dataset with 7z on the fast setting could reduce the file size by about 90%. Depending on how the compressed dataset is unpacked during computation, this may yield some benefit; it certainly would when writing output datasets.
Hi @bpowell
There are ways to reduce SAS data set size and optimize it for processing without compression.
The paper Twenty Ways to Run Your SAS® Program Faster and Use Less Space covers several of them. I personally follow, and advise my users to follow, these points:
6. Numeric variables in SAS data sets have a default length of 8. If the values of the numeric variable are all integers, you can reduce the space by shortening the variable's length; the paper includes a table of the largest integer each length can hold exactly.
7. Sometimes character variables imported into SAS from other systems, like Oracle or Excel, have very large lengths. You can use the following procedure to get the shortest possible length for your character variable, although you might want to allow room for growth:
a. Use the LENGTH function to calculate the actual length of the variable in each observation in the data set.
b. Use the MAX option in PROC SUMMARY to get the largest value of the length.
c. Use the LENGTH statement to shorten the length of the character variable to the maximum length.
8. Switch variables from numeric to character if they are integers and range in value from -9 to 99. The minimum length for numeric variables is 3, so you can save space if the variable can fit into one or two characters.
9. Switch variables from character to numeric if they are all integers and occupy more than 3 bytes. For example, the number 1234 would occupy 4 bytes as a character variable but item 6 above shows it would only occupy 3 bytes as a numeric variable.
Note: when the data set record length is reduced, more records fit into each buffer, which in turn means less time to read the data.
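A sketch of the procedure in item 7 for a single hypothetical character variable NAME (using PROC SQL instead of PROC SUMMARY for brevity; the idea is the same):

/* a + b: find the longest value actually stored in NAME */
proc sql noprint;
    select max(length(name)) into :maxlen trimmed
    from lib.big_table;
quit;

/* c: rebuild the table with the shortened length */
data lib.big_table_slim;
    length name $ &maxlen;   /* LENGTH must come before SET */
    set lib.big_table;
run;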
"There is loads of room for SAS to use higher compression on datasets and hence more CPU and I don't know why this hasn't been done." It has been done. Check out the COMPRESS option.
I can get compression ratios with some of my tables approaching 90% if character columns have a lot of blank space.
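For reference, compression can be requested per data set or globally; COMPRESS=YES applies run-length encoding and COMPRESS=BINARY applies Ross Data Compression:

/* per data set: */
data lib.big_copy(compress=yes);    /* or compress=binary */
    set lib.big_table;
run;

/* or globally, for every data set created from here on: */
options compress=binary;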
40 data test;
41 length test $1000.;
42 do i=1 to 100000000;
43 test="
43 ! ....
43 ! ...";
NOTE: The quoted string currently being processed has become more than 262 characters long.
You might have unbalanced quotation marks.
44 output;
45 end;
46 drop i;
47 run;
NOTE: The data set WORK.TEST has 100000000 observations and 1 variables.
NOTE: Compressing data set WORK.TEST decreased size by 97.82 percent.
Compressed is 33607 pages; un-compressed would require 1538462 pages.
NOTE: DATA statement used (Total process time):
real time 56.63 seconds
cpu time 48.06 seconds
The output dataset size is 2 GB *compressed*; SAS would have written hundreds of GB without the COMPRESS option being set. That's a 1000-character string that is about 99% whitespace.
Now if, as I said, you compress the output dataset with 7z on the fast setting, the resulting file is under 200 MB. Like I said, SAS has *a lot* of scope for additional compression.
The 7z compression compresses everything, which also means that one would need to uncompress everything before use.
SAS compresses observations, which means that individual observations can still be found within a dataset page without having to uncompress the whole dataset first.
And you need to run the same test with COMPRESS=BINARY: COMPRESS=YES only does RLE (run-length encoding) compression, which is almost undetectable in terms of CPU usage.
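That is, the same test with only the data set option changed (sketch; the long, mostly blank string is abbreviated here):

data test2(compress=binary);       /* RDC instead of RLE        */
    length test $1000;
    do i = 1 to 100000000;
        test = "...";              /* same mostly-blank string  */
        output;
    end;
    drop i;
run;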
It's been my experience that multi-table joins in SQL tend to be very slow and consume excessive resources (disk space in WORK).
- Split the process into several steps (see the sketch below)
- Use SORT and DATA steps rather than SQL (unless you explicitly need a Cartesian product)
- Do lookups with hash objects, which can be integrated into a DATA step that also performs a MERGE
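A minimal sketch of the SORT/MERGE approach (all names are placeholders); hash lookups against the two small tables can be added inside the same DATA step:

proc sort data=lib.big_table1 out=work.a; by id; run;
proc sort data=lib.big_table2 out=work.b; by id; run;

data work.joined;
    merge work.a (in=in_a)
          work.b;
    by id;
    if in_a;    /* keep every row of A: left-join behaviour */
run;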
We need to see your current code (accompanied by some example data in DATA steps with DATALINES, so we know the variable attributes) to throw the whole weight of our experience at your problem.