How to reduce space utilisation during joins in proc sql?

Rohit_Rai_1996 · Posted 04-11-2022 11:11 AM

I was joining 2 tables, 5 GB and 1GB in size respectively (5g left join 1g). (Both had around 1 million rows)

Took around 13 minutes, which moderate compared to queries I generally run.

During the join a file called "sastmp-000000111.sas7butl" was created which was 127 GB in size.

My access was blocked and now I have to raise a ticket to get it unlocked, with an optimised query to perform the same task.

But I have no clear idea on what would be an optimal query.

I've heard about indexes, but not sure if they would be useful, or how to use them here.

ballardw · Posted 04-11-2022 11:26 AM

You would have to show the code for suggestions that apply to your job.

Generic advice: reduce the size as early as possible. One example is to only select variables needed early on.

If data xxx has 400 variables and you only need 4 instead of using

from xxx

use:

from (select var1, var2, var3, var4 from xxx) as x

If you know that only need some of the records apply where clauses before joining for example

from (select var1, var2, var3, var4 from xxx
         where varxxx in (<list of values>) and varyyy ge '01JAN2022'd ) as x

Rohit_Rai_1996 · Posted 04-11-2022 12:01 PM

I understand.

In this case all variables are needed (45 on one side, 20 on the other, total 67).

Kurt_Bremser · Posted 04-11-2022 12:02 PM

The best way to optimize a SQL query is to not use SQL. Unless you need to do a many-to-many join, a data step MERGE will outperform SQL and consume considerably less space during the necessary sorts. The MERGE itself needs only space for the resulting dataset.

Depending on the number (and type and size) of variables taken from the 1G dataset, you might even be able to do it with no preceding sort and a hash object.

If you show us your SQL code, we can suggest how to translate it to data step code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Rohit_Rai_1996 · Posted 04-12-2022 01:03 AM

Thanks, I didn't realise merge was better.

Code was :
proc sql;
create table AB as
select*
from Table1 a
left join Table2 b
on a.key=b.key;
quit;

Kurt_Bremser · Posted 04-12-2022 02:29 AM

SELECT * with at least one variable in common between the two datasets will result in a WARNING, so it would be sloppy programming at best.

Please post your REAL code, so we can see which variables from which dataset will contribute to the final result. If sensitive variable names are in there, replace them with consistent XXX, YYY and so on. But it is imperative to know from where the variables come from.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Rohit_Rai_1996 · Posted 04-12-2022 02:50 AM

I actually used SELECT* in the real code.

There are 45 variables in the left table, 20 variables in the right table. We need all the variables in the final table

2 or 3 are common. AFAIK, if the variable is already present in left, then it isn't picked up from right.

Kurt_Bremser · Posted 04-12-2022 03:14 AM

You DO NOT WANT WARNINGs in your LOG, ever. PERIOD. Good code runs without ERRORs, WARNINGs or any NOTEs beyond those telling you how long a step took and which datasets it read and/or created. So you do not use a global asterisk in a SQL join. You may use a.* to select all variables from one dataset, but you must be selective for the variables coming from other datasets to avoid name collisions. In case of a FULL/OUTER join, you must take care to COALESCE the ON variable(s), or you'll get unwanted missing values.

Any code that throws a WARNING (or an ERROR) will not run successfully from a scheduler, so it's unusable in a professional context.

If you have a one-to-one, a one-to-many, or a many-to-one relationship, these SORT/DATA steps can replace your SQL:

proc sort data=table1;
by key;
run;

proc sort data=table2;
by key;
run;

data ab;
merge
  table1 (in=a)
  table2 (drop=....) /* drop the variables that are common, except key */
;
by key;
if a; /* keep only observations coming from table1 */
run;

Depending on the state of your datasets, the SORT steps may not be necessary.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Rohit_Rai_1996 · Posted 04-12-2022 07:02 AM

Thanks. I'll try this

Patrick · Posted 04-12-2022 05:14 AM

@Rohit_Rai_1996 wrote:
I actually used SELECT* in the real code.
2 or 3 are common. AFAIK, if the variable is already present in left, then it isn't picked up from right.

...or may-be not and you could end-up with an undesired result.

data table1;
  set sashelp.class(drop=age rename=(name=key));
run;

data table2;
  set sashelp.class(rename=(name=key));
  if mod(_n_,2)=0;
run;

proc sql feedback;
  create table AB as
    select*
  from Table1 a
    left join Table2 b
      on a.key=b.key;
quit;

Even though Sex, Age, Height and Weight also exist in Table1 FEEDBACK tells us they get picked up from TABLE2.

But then looking at the result table it seems only AGE gets really picked-up from TABLE2 - at least with my version of SAS.

That FEEDBACK doesn't tell us the "truth" is ugly and looks like an undocumented feature. If I would have a say then SAS wouldn't just throw a WARNING but an ERROR for such cases. That's what all the databases I know would be doing.

Rohit_Rai_1996 · Posted 04-12-2022 07:15 AM

Oh

I'll try to pay attention to this

How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Re: How to reduce space utilisation during joins in proc sql?

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!