I was joining 2 tables, 5 GB and 1GB in size respectively (5g left join 1g). (Both had around 1 million rows)
Took around 13 minutes, which moderate compared to queries I generally run.
During the join a file called "sastmp-000000111.sas7butl" was created which was 127 GB in size.
My access was blocked and now I have to raise a ticket to get it unlocked, with an optimised query to perform the same task.
But I have no clear idea on what would be an optimal query.
I've heard about indexes, but not sure if they would be useful, or how to use them here.
You would have to show the code for suggestions that apply to your job.
Generic advice: reduce the size as early as possible. One example is to only select variables needed early on.
If data xxx has 400 variables and you only need 4 instead of using
from xxx
use:
from (select var1, var2, var3, var4 from xxx) as x
If you know that only need some of the records apply where clauses before joining for example
from (select var1, var2, var3, var4 from xxx where varxxx in (<list of values>) and varyyy ge '01JAN2022'd ) as x
The best way to optimize a SQL query is to not use SQL. Unless you need to do a many-to-many join, a data step MERGE will outperform SQL and consume considerably less space during the necessary sorts. The MERGE itself needs only space for the resulting dataset.
Depending on the number (and type and size) of variables taken from the 1G dataset, you might even be able to do it with no preceding sort and a hash object.
If you show us your SQL code, we can suggest how to translate it to data step code.
SELECT * with at least one variable in common between the two datasets will result in a WARNING, so it would be sloppy programming at best.
Please post your REAL code, so we can see which variables from which dataset will contribute to the final result. If sensitive variable names are in there, replace them with consistent XXX, YYY and so on. But it is imperative to know from where the variables come from.
You DO NOT WANT WARNINGs in your LOG, ever. PERIOD. Good code runs without ERRORs, WARNINGs or any NOTEs beyond those telling you how long a step took and which datasets it read and/or created. So you do not use a global asterisk in a SQL join. You may use a.* to select all variables from one dataset, but you must be selective for the variables coming from other datasets to avoid name collisions. In case of a FULL/OUTER join, you must take care to COALESCE the ON variable(s), or you'll get unwanted missing values.
Any code that throws a WARNING (or an ERROR) will not run successfully from a scheduler, so it's unusable in a professional context.
If you have a one-to-one, a one-to-many, or a many-to-one relationship, these SORT/DATA steps can replace your SQL:
proc sort data=table1;
by key;
run;
proc sort data=table2;
by key;
run;
data ab;
merge
table1 (in=a)
table2 (drop=....) /* drop the variables that are common, except key */
;
by key;
if a; /* keep only observations coming from table1 */
run;
Depending on the state of your datasets, the SORT steps may not be necessary.
@Rohit_Rai_1996 wrote:
I actually used SELECT* in the real code.
2 or 3 are common. AFAIK, if the variable is already present in left, then it isn't picked up from right.
...or may-be not and you could end-up with an undesired result.
data table1;
set sashelp.class(drop=age rename=(name=key));
run;
data table2;
set sashelp.class(rename=(name=key));
if mod(_n_,2)=0;
run;
proc sql feedback;
create table AB as
select*
from Table1 a
left join Table2 b
on a.key=b.key;
quit;
Even though Sex, Age, Height and Weight also exist in Table1 FEEDBACK tells us they get picked up from TABLE2.
But then looking at the result table it seems only AGE gets really picked-up from TABLE2 - at least with my version of SAS.
That FEEDBACK doesn't tell us the "truth" is ugly and looks like an undocumented feature. If I would have a say then SAS wouldn't just throw a WARNING but an ERROR for such cases. That's what all the databases I know would be doing.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.