<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimizing a join with large datasets in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869739#M343549</link>
    <description>&lt;P&gt;0) it's hard to optimise if there is no code...&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1) does it always have to be a "left join"?&lt;/P&gt;
&lt;P&gt;2) is it a "many-to-many" join?&lt;/P&gt;
&lt;P&gt;3) are you adding those two little tables, or are they just for filtering?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;4) what was the result of the filtering, were those tables much "smaller"?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;5) how long does it run (options fullstimer;)?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;6) what is the size of your session (RAM available)?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Bart&lt;/P&gt;</description>
    <pubDate>Fri, 14 Apr 2023 10:32:39 GMT</pubDate>
    <dc:creator>yabwon</dc:creator>
    <dc:date>2023-04-14T10:32:39Z</dc:date>
    <item>
      <title>Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869737#M343547</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It’s my first thread&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":grinning_face:"&gt;😀&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I’m in the following situation:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Left join on five tables (first table: 100 million rows / 15 columns / about 15GB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;second table: 300 million rows / 15 columns / about 80GB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;third table: 10,000 rows / 10 columns / about 2MB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;fourth table: 100 million rows / 40 columns / about 40GB;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;fifth table: 50 rows / 5 columns / about 150KB)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Filtering with IN clauses and date intervals&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Selecting multiple variables and applying some text functions and CASE WHEN statements&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’d like to improve the performance of my code, especially the speed.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried filtering the largest tables before joining them, and also compressing them with the binary option, but performance didn’t improve much.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I’d like to know if you have any advice to make it better (I’ve already read this interesting topic and tried some of its methods: &lt;A href="https://communities.sas.com/t5/SAS-Global-Forum-Proceedings/Making-SAS-Tables-Smaller-and-Faster-Without-Data-Loss/ta-p/726323" target="_blank"&gt;Making SAS® Tables Smaller and Faster Without Data Loss - SAS Support Communities&lt;/A&gt;, but without meaningful results).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Hope you can help me; in any case, thanks in advance!&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 10:13:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869737#M343547</guid>
      <dc:creator>Daniele_R</dc:creator>
      <dc:date>2023-04-14T10:13:02Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869738#M343548</link>
      <description>Are the tables native SAS? Are they indexed? Do you have any I/O constraints, e.g. network or disk?&lt;BR /&gt;Sometimes with SAS I've found that multi-table joins don't seem to be well optimised, so I break the query into smaller chunks. At least I do this to get a baseline.</description>
      <pubDate>Fri, 14 Apr 2023 10:24:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869738#M343548</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-14T10:24:28Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869739#M343549</link>
      <description>&lt;P&gt;0) it's hard to optimise if there is no code...&amp;nbsp;&lt;/P&gt;
&lt;P&gt;1) does it always have to be a "left join"?&lt;/P&gt;
&lt;P&gt;2) is it a "many-to-many" join?&lt;/P&gt;
&lt;P&gt;3) are you adding those two little tables, or are they just for filtering?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;4) what was the result of the filtering, were those tables much "smaller"?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;5) how long does it run (options fullstimer;)?&amp;nbsp;&lt;/P&gt;
&lt;P&gt;6) what is the size of your session (RAM available)?&lt;/P&gt;
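&lt;P&gt;For point 5, a minimal sketch of how to capture the timings (wrap it around your existing join):&lt;/P&gt;
&lt;PRE&gt;options fullstimer;    /* log detailed real/CPU time and memory stats */
/* ... run your join here ... */
options nofullstimer;  /* back to the default, shorter notes */&lt;/PRE&gt;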
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Bart&lt;/P&gt;
</description>
      <pubDate>Fri, 14 Apr 2023 10:32:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869739#M343549</guid>
      <dc:creator>yabwon</dc:creator>
      <dc:date>2023-04-14T10:32:39Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869742#M343551</link>
      <description>I was also wondering about RAM, but I expect you'd need, what, 128GB to make a difference on this query? Most of the time will be spent in I/O. I don't use more than 16-24GB, and the 80GB dataset I normalize obviously can't be handled mostly in memory.</description>
      <pubDate>Fri, 14 Apr 2023 10:40:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869742#M343551</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-14T10:40:06Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869747#M343554</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/440567"&gt;@Daniele_R&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is my first stab at this&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Left join on five tables (first table: 100 million rows / 15 columns / about 15GB&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;second table: 300 million rows / 15 columns / about 80GB&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;third table: 10 000 rows / 10 columns / about 2MB &lt;FONT color="#0000FF"&gt;&lt;STRONG&gt;/* This can be loaded into a Hash Object */&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;fourth table: 100 million rows / 40 columns / about 40GB&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;fifth table: 50 rows / 5 columns / about 150KB) &lt;FONT color="#0000FF"&gt;&lt;STRONG&gt;/* This can be loaded into a Hash Object */&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Filtering with IN clauses and Date intervals&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Selecting multiple variables and applying some Text functions and CASE WHEN statements&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
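&lt;P&gt;A hash-object lookup could look roughly like this (a minimal sketch; the dataset, key, and column names are placeholders for yours):&lt;/P&gt;
&lt;PRE&gt;data work.joined;
  /* make the lookup columns known to the compiler without reading rows */
  if 0 then set work.small_lookup;
  if _n_ = 1 then do;
    declare hash h(dataset: "work.small_lookup");
    h.defineKey("id");          /* join key           */
    h.defineData("lookup_val"); /* column(s) to fetch */
    h.defineDone();
  end;
  set work.big_table;
  call missing(lookup_val);
  rc = h.find();  /* left-join semantics: lookup_val stays missing on no match */
  drop rc;
run;&lt;/PRE&gt;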
&lt;P&gt;&lt;EM&gt;Ensure you have the proper simple (single-column) &lt;STRONG&gt;/&lt;/STRONG&gt; complex (multiple-column) indexes defined for the tables below; evaluate based on your filtering criteria.&amp;nbsp;&lt;/EM&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;EM&gt;100 million rows / 15 columns / about 15GB&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;300 million rows / 15 columns / about 80GB&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;EM&gt;100 million rows / 40 columns / about 40GB&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;EM&gt;Here are a couple of papers to get you started on SAS indexes:&lt;/EM&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A title="The Basics of Using SAS® Indexes" href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/247-30.pdf" target="_blank" rel="noopener"&gt;&lt;EM&gt;The Basics of Using SAS® Indexes&lt;/EM&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A title="Frequently Asked Questions about SAS® Indexes" href="https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/008-30.pdf" target="_blank" rel="noopener"&gt;&lt;EM&gt;Frequently Asked Questions about SAS® Indexes&lt;/EM&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
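&lt;P&gt;Creating them might look like this (a sketch; library, table, and column names are placeholders):&lt;/P&gt;
&lt;PRE&gt;proc datasets library=mylib nolist;
  modify big_table1;
    index create cust_id;                  /* simple index    */
  modify big_table2;
    index create ck = (cust_id tx_date);   /* composite index */
quit;&lt;/PRE&gt;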
&lt;P&gt;Depending on the nature of your tables:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;External Relational Database Management System (RDBMS) ---&amp;gt; Most have partitioning features, see if you can utilize them if applicable&lt;/LI&gt;
&lt;LI&gt;SAS Data sets. I would investigate using &lt;A title="SAS® 9.4 Scalable Performance Data Engine: Reference, Fourth Edition" href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/engspde/titlepage.htm" target="_blank" rel="noopener"&gt;SPDE&lt;/A&gt; to store such large tables.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;One other suggestion:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Consider creating temporary tables (if possible) by filtering these large tables before your query, then use them in your query. (This is a workaround for &lt;A title="What Is a Common Table Expression (CTE) in SQL?" href="https://learnsql.com/blog/what-is-common-table-expression/" target="_blank" rel="noopener"&gt;CTEs&lt;/A&gt;, which PROC SQL does not support.)&lt;/LI&gt;
&lt;/UL&gt;
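&lt;P&gt;For example (a sketch; library, table, and column names are placeholders):&lt;/P&gt;
&lt;PRE&gt;/* keep only the rows and columns the join actually needs */
data work.t2_small;
  set mylib.big_table2(keep=cust_id tx_date amount);
  where tx_date between '01JAN2022'd and '31DEC2022'd;
run;&lt;/PRE&gt;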
&lt;P&gt;Hope this helps&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 11:24:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869747#M343554</guid>
      <dc:creator>AhmedAl_Attar</dc:creator>
      <dc:date>2023-04-14T11:24:17Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869809#M343576</link>
      <description>&lt;P&gt;Thank you very much for all these tips to improve our data processing.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will start reading the articles and experimenting with the suggested optimizations.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 14:43:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869809#M343576</guid>
      <dc:creator>Daniele_R</dc:creator>
      <dc:date>2023-04-14T14:43:51Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869825#M343585</link>
      <description>In addition, I would highly recommend looking into your SAS environment and invocation settings.&lt;BR /&gt;Options that can enhance performance include:&lt;BR /&gt;-memsize (Default: 2G)&lt;BR /&gt;-sortsize (Default: 1G)&lt;BR /&gt;-CPUCOUNT (Default: 4)&lt;BR /&gt;-UTILLOC (Default: work)&lt;BR /&gt;Changing these defaults will help with the performance issues.&lt;BR /&gt;Here is another paper to read &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;BR /&gt;&lt;A href="https://support.sas.com/resources/papers/proceedings09/333-2009.pdf" target="_blank"&gt;https://support.sas.com/resources/papers/proceedings09/333-2009.pdf&lt;/A&gt;</description>
      <pubDate>Fri, 14 Apr 2023 15:09:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869825#M343585</guid>
      <dc:creator>AhmedAl_Attar</dc:creator>
      <dc:date>2023-04-14T15:09:26Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869845#M343593</link>
      <description>&lt;P&gt;On the CPU count option: I find CPU use is relatively low in SAS, especially with large datasets. There is plenty of room for SAS to use higher compression on datasets, and hence more CPU, and I don't know why this hasn't been done. During lockdown I found that compressing an already-compressed SAS dataset with 7z on the fast setting could reduce the file size by ~90%. Depending on how the compressed dataset is unpacked during computation, this may yield some benefit. It certainly would when writing output datasets.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 16:10:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869845#M343593</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-14T16:10:28Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869849#M343594</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/12843"&gt;@bpowell&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There are ways to reduce the SAS data set size and optimize it for processing without compression.&lt;/P&gt;
&lt;P&gt;This paper sums them up: &lt;A title="Twenty Ways to Run Your SAS® Program Faster and Use Less  Space" href="https://support.sas.com/resources/papers/proceedings19/3684-2019.pdf" target="_blank" rel="noopener"&gt;Twenty Ways to Run Your SAS® Program Faster and Use Less Space&lt;/A&gt;.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I personally follow and advise my users to follow these points.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;6. Numeric variables in SAS data sets have a default length of 8. If the values of the &lt;BR /&gt;numeric variable are all integers, you can reduce the space by using the &lt;A title="Maximum Integer Size" href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p0dv87zb3bnse6n1mqo360be70qr.htm#p1euam1x1o1mjmn17qwpo67796qj" target="_blank" rel="noopener"&gt;following&amp;nbsp;&lt;/A&gt;table.&lt;/P&gt;
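&lt;P&gt;For example (a sketch; variable names are placeholders, and the safe maximum integer per length is platform-dependent, so check the linked table first):&lt;/P&gt;
&lt;PRE&gt;data work.slim;
  length flag 3 qty 4;   /* instead of the default 8 bytes each */
  set work.wide;
run;&lt;/PRE&gt;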
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;7. Sometimes character variables imported into SAS from other systems, like Oracle or Excel, have very large lengths. You can use the following procedure to get the&amp;nbsp;shortest possible length for your character variable, although you might want to&amp;nbsp;allow room for growth:&lt;BR /&gt;a. Use the LENGTH function to calculate the actual length of the variable in each&amp;nbsp;observation in the data set.&lt;BR /&gt;b. Use the MAX option in PROC SUMMARY to get the largest value of the length.&lt;BR /&gt;c. Use the LENGTH statement to shorten the length of the character variable to&amp;nbsp;the maximum length.&lt;/P&gt;
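&lt;P&gt;Steps a-c could be sketched like this (dataset and variable names are hypothetical):&lt;/P&gt;
&lt;PRE&gt;/* a+b. largest actual length of the character variable */
proc sql noprint;
  select max(length(city)) into :maxlen trimmed
  from work.imported;
quit;

/* c. shorten the variable to that maximum */
data work.slim;
  length city $ &amp;amp;maxlen;
  set work.imported;
run;&lt;/PRE&gt;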
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;8. Switch variables from numeric to character if they are integers and range in value&amp;nbsp;from -9 to 99. The minimum length for numeric variables is 3, so you can save space if the variable can fit into one or two characters.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;9. Switch variables from character to numeric if they are all integers and occupy more than 3 bytes. For example, the number 1234 would occupy 4 bytes as a character&amp;nbsp;variable but item 6 above shows it would only occupy 3 bytes as a numeric variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Note: When the data set record length is reduced, that means more records can fit into the buffer, which in turn means less time to read the data&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 16:45:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869849#M343594</guid>
      <dc:creator>AhmedAl_Attar</dc:creator>
      <dc:date>2023-04-14T16:45:52Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869919#M343620</link>
      <description>&lt;P&gt;"There is loads of room for SAS to use higher compression on datasets and hence more CPU and I don't know why this hasn't been done." It has been done. Check out the &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/engsas7bdat/p1vhgpuz9l5350n1t0w4zc4v7izz.htm" target="_blank" rel="noopener"&gt;COMPRESS option&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I can get compression ratios with some of my tables approaching 90% if character columns have a lot of blank space.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2023 22:33:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869919#M343620</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2023-04-14T22:33:54Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869948#M343629</link>
      <description>&lt;P&gt;It's been my experience that multi-table joins in SQL tend to be very slow and consume excessive resources (disk space in WORK).&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Split the process into several steps&lt;/LI&gt;
&lt;LI&gt;Use SORT and DATA steps rather than SQL (unless you explicitly need a cartesian product)&lt;/LI&gt;
&lt;LI&gt;Do lookups with hash objects, which can be integrated into a DATA step that also does a MERGE&lt;/LI&gt;
&lt;/UL&gt;
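&lt;P&gt;The SORT/DATA-step variant might look like this (a sketch with placeholder names and keys):&lt;/P&gt;
&lt;PRE&gt;proc sort data=mylib.big1 out=work.s1; by cust_id; run;
proc sort data=mylib.big2 out=work.s2; by cust_id; run;

data work.step1;
  merge work.s1(in=a) work.s2;
  by cust_id;
  if a;   /* keep left-join semantics */
run;&lt;/PRE&gt;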
&lt;P&gt;We need to see your current code (accompanied by some example data in DATA steps with DATALINES, so we know the variable attributes) to throw the whole weight of our experience at your problem.&lt;/P&gt;</description>
      <pubDate>Sat, 15 Apr 2023 07:31:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/869948#M343629</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2023-04-15T07:31:09Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870101#M343688</link>
      <description>Well, excuse me but "duh".&lt;BR /&gt;&lt;BR /&gt;40     data test;&lt;BR /&gt;41     length test $1000.;&lt;BR /&gt;42     do i=1 to 100000000;&lt;BR /&gt;43     test="&lt;BR /&gt;43   !       ....&lt;BR /&gt;43   !     ...";&lt;BR /&gt;NOTE: The quoted string currently being processed has become more than 262 characters long.&lt;BR /&gt;      You might have unbalanced quotation marks.&lt;BR /&gt;44     output;&lt;BR /&gt;45     end;&lt;BR /&gt;46     drop i;&lt;BR /&gt;47     run;&lt;BR /&gt;&lt;BR /&gt;NOTE: The data set WORK.TEST has 100000000 observations and 1 variables.&lt;BR /&gt;NOTE: Compressing data set WORK.TEST decreased size by 97.82 percent.&lt;BR /&gt;      Compressed is 33607 pages; un-compressed would require 1538462 pages.&lt;BR /&gt;NOTE: DATA statement used (Total process time):&lt;BR /&gt;      real time           56.63 seconds&lt;BR /&gt;      cpu time            48.06 seconds&lt;BR /&gt;&lt;BR /&gt;Output dataset size is 2GB *compressed*. SAS would have written hundreds of GB without the compress option being set. That's a 1,000-character string that is about 99% whitespace.&lt;BR /&gt;&lt;BR /&gt;Now, if, as I said, you compress the output dataset with 7z, on fast, the resulting output dataset is &amp;lt;200MB. Like I said, SAS has * a lot * of scope for additional compression.</description>
      <pubDate>Mon, 17 Apr 2023 08:27:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870101#M343688</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-17T08:27:10Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870113#M343692</link>
      <description>&lt;P&gt;The 7z compression compresses&amp;nbsp;&lt;EM&gt;everything&lt;/EM&gt;, which also means that one would need to uncompress&amp;nbsp;&lt;EM&gt;everything&lt;/EM&gt; before use.&lt;/P&gt;
&lt;P&gt;SAS compresses &lt;EM&gt;observations&lt;/EM&gt;, which means that individual observations can still be found within a dataset page without having to uncompress the whole dataset first.&lt;/P&gt;
&lt;P&gt;And you need to run the same test with COMPRESS=BINARY. COMPRESS=YES only does RLE (run length encoding) compression (which is almost undetectable in terms of CPU usage).&lt;/P&gt;</description>
      <pubDate>Mon, 17 Apr 2023 09:52:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870113#M343692</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2023-04-17T09:52:04Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing a join with large datasets</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870142#M343699</link>
      <description>Yes - I know SAS can't read a zipped dataset. My point is that the overall compression is poor. Does anything you say mean SAS could not achieve greater compression by using more CPU? No. I would be surprised if there wasn't at least some scope to further leverage the multi-CPU universe we now inhabit. I don't believe COMPRESS=YES has been updated since the sas7bdat format appeared, which is what, 30 years old? Back then multi-CPU meant mainframes, and CPU time was costly. That's no longer the case. Time for an upgrade.</description>
      <pubDate>Mon, 17 Apr 2023 11:50:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Optimizing-a-join-with-large-datasets/m-p/870142#M343699</guid>
      <dc:creator>bpowell</dc:creator>
      <dc:date>2023-04-17T11:50:44Z</dc:date>
    </item>
  </channel>
</rss>

