BookmarkSubscribeRSS Feed
Rohit_Rai_1996
Fluorite | Level 6

I was joining 2 tables, 5 GB and 1GB in size respectively (5g left join 1g). (Both had around 1 million rows)

Took around 13 minutes, which moderate compared to queries I generally run.

During the join a file called "sastmp-000000111.sas7butl" was created which was 127 GB in size.

My access was blocked and now I have to raise a ticket to get it unlocked, with an optimised query to perform the same task.

But I have no clear idea on what would be an optimal query.

I've heard about indexes, but not sure if they would be useful, or how to use them here.

10 REPLIES 10
ballardw
Super User

You would have to show the code for suggestions that apply to your job.

 

Generic advice: reduce the size as early as possible. One example is to only select variables needed early on.

If data xxx has 400 variables and you only need 4 instead of using

 

from xxx

use:

from (select var1, var2, var3, var4 from xxx) as x

If you know that only need some of the records apply where clauses before joining for example

from (select var1, var2, var3, var4 from xxx
         where varxxx in (<list of values>) and varyyy ge '01JAN2022'd ) as x
Rohit_Rai_1996
Fluorite | Level 6
I understand.

In this case all variables are needed (45 on one side, 20 on the other, total 67).
Kurt_Bremser
Super User

The best way to optimize a SQL query is to not use SQL. Unless you need to do a many-to-many join, a data step MERGE will outperform SQL and consume considerably less space during the necessary sorts. The MERGE itself needs only space for the resulting dataset.

Depending on the number (and type and size) of variables taken from the 1G dataset, you might even be able to do it with no preceding sort and a hash object.

If you show us your SQL code, we can suggest how to translate it to data step code.

Rohit_Rai_1996
Fluorite | Level 6
Thanks, I didn't realise merge was better.

Code was :
proc sql;
create table AB as
select*
from Table1 a
left join Table2 b
on a.key=b.key;
quit;
Kurt_Bremser
Super User

SELECT * with at least one variable in common between the two datasets will result in a WARNING, so it would be sloppy programming at best.

Please post your REAL code, so we can see which variables from which dataset will contribute to the final result. If sensitive variable names are in there, replace them with consistent XXX, YYY and so on. But it is imperative to know from where the variables come from.

Rohit_Rai_1996
Fluorite | Level 6
I actually used SELECT* in the real code.

There are 45 variables in the left table, 20 variables in the right table. We need all the variables in the final table

2 or 3 are common. AFAIK, if the variable is already present in left, then it isn't picked up from right.
Kurt_Bremser
Super User

You DO NOT WANT WARNINGs in your LOG, ever. PERIOD. Good code runs without ERRORs, WARNINGs or any NOTEs beyond those telling you how long a step took and which datasets it read and/or created. So you do not use a global asterisk in a SQL join. You may use a.* to select all variables from one dataset, but you must be selective for the variables coming from other datasets to avoid name collisions. In case of a FULL/OUTER join, you must take care to COALESCE the ON variable(s), or you'll get unwanted missing values.

Any code that throws a WARNING (or an ERROR) will not run successfully from a scheduler, so it's unusable in a professional context.

 

If you have a one-to-one, a one-to-many, or a many-to-one relationship, these SORT/DATA steps can replace your SQL:

proc sort data=table1;
by key;
run;

proc sort data=table2;
by key;
run;

data ab;
merge
  table1 (in=a)
  table2 (drop=....) /* drop the variables that are common, except key */
;
by key;
if a; /* keep only observations coming from table1 */
run;

Depending on the state of your datasets, the SORT steps may not be necessary.

Patrick
Opal | Level 21

@Rohit_Rai_1996 wrote:
I actually used SELECT* in the real code.
2 or 3 are common. AFAIK, if the variable is already present in left, then it isn't picked up from right.

...or may-be not and you could end-up with an undesired result.

data table1;
  set sashelp.class(drop=age rename=(name=key));
run;

data table2;
  set sashelp.class(rename=(name=key));
  if mod(_n_,2)=0;
run;

proc sql feedback;
  create table AB as
    select*
  from Table1 a
    left join Table2 b
      on a.key=b.key;
quit;

Even though Sex, Age, Height and Weight also exist in Table1 FEEDBACK tells us they get picked up from TABLE2.

Patrick_0-1649754558201.png

 

But then looking at the result table it seems only AGE gets really picked-up from TABLE2 - at least with my version of SAS.

Patrick_1-1649754747139.png

 

That FEEDBACK doesn't tell us the "truth" is ugly and looks like an undocumented feature. If I would have a say then SAS wouldn't just throw a WARNING but an ERROR for such cases. That's what all the databases I know would be doing. 

Rohit_Rai_1996
Fluorite | Level 6
Oh

I'll try to pay attention to this

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1350 views
  • 5 likes
  • 4 in conversation