SAS seems to run out of work disc space (sorting space) when I run the proc sql, even if I specify the SQL Optimizer to use a hash join (magic=103), so I'm looking to transform the following proc sql into a hash object approach. How I go about doing this:
proc sql magic=103;
create table final as
select *
from have a
inner join ids b on (a.id = b.id)
left join nad g on (a.fpc = g.fpc)
left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
left join network f on (a.id = f.id)
left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and
(a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
left join instate c on (a.ncp_id = c.ncp_id);
quit;
Don't use the asterisk, use an exhaustive list of wanted variables with correct table aliases. Otherwise you will at least get WARNINGs.
What are the sizes (in terms of observation numbers) and the sizes of the variables you want to join to your main table? This will determine if you can fit the tables into memory.
Hey Kurt
The size of the base table is approx. 58,000,000 records
The sizes of the join fields range from a length of 2 to 14. Fore example, client is 2, gpi is 14, fpc is 11, ncpdp is 5...
We need to know the numbers of obs in all the tables you want to join to the base, and the lengths of the variables you want from there.
@PegaZeus - 58 million rows isn't large. How about trying - options compress = yes; - to see if that helps. If your base table contains lots of long character variables then compressing will save a lot of space.
If you are using a SAS server, then I'd suggest it doesn't have enough SAS WORK space configured. Talk to your SAS administrator about this. You may be able to fix this program to use less WORK space, but then you'll run out again in the next big job. Fix the problem, not the symptom.
@SASKiwi Utility files are not compressed.
@PegaZeus You could try 2 steps: Join a and b (and maybe f) on id. This will reduce the number of rows, and the sql parser will know what it's dealing with and will adopt the most efficient method. Then do the other joins against that first table. Don't forget the _method option to monitor the parser's choices.
@ChrisNZ - True, but the code as posted contains temporary SAS datasets where compression may help.
@SASKiwi I don't see temporary data sets. Can you tell em more?
Also, one thing that sometimes help is using a compressed folder for utility files.
This join will require a lot of sorting of your Have table/intermediary result sets and unless the inner join doesn't reduce the volumes considerably that's going to be sorting 58M rows multiple times.
Sorting will create intermediary files on disk so as @SASKiwi suggests using option compress=yes should help to save disk space and I/O and though also improve performance.
You could avoid at least one sort by moving join...
left join network f on (a.id = f.id)
...right after the inner join which uses the same key column.
I believe option magic=103 only works for inner joins.
A data step hash approach could eventually improve performance but you would need to have sufficient memory for loading all the lookup tables.
If you run a Proc Contents over your tables then you'll get the information how many rows a table has and the names and lengths of all variables in the table. Sum the lengths of all variables you want, multiply by the number of rows in the table, add 17MB to the result and you'll get an estimate how much memory a hash table will consume.
Using a Select * when joining tables will create warnings if the same variable name exists in more than one table. You should list your variables explicitly.
You can save yourself some typing by executing below - and then use the expanded variable list in the SAS log as starting point for your select statement.
proc sql feedback inobs=1;
/* create table final as */
select *
from have a
inner join ids b on (a.id = b.id)
left join nad g on (a.fpc = g.fpc)
left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
left join network f on (a.id = f.id)
left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and
(a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
left join instate c on (a.ncp_id = c.ncp_id);
quit;
You can also pre-sort your lookup tables. This won't avoid the sorting of the Have table and the intermediary results, but it will avoid sorting of the lookup tables as part of the SQL.
So you could use a Proc Sort first for you lookup tables. The SQL won't sort the tables again if it has it is already sorted AND has the sortedby=<variable name(s)) property set (which Proc Sort will do).
If you know that a table is already sorted the way you need it but the table metadata doesn't have the sortedby attribute set (proc contents will show you) then you can also formulate it explicitly when using the table in the SQL; i.e. left join network(sortedby=id) f on (a.id = f.id)
And last but not least:
Make sure the relationships between table Have and any of the other tables is M:(0 or 1). Any other relationship could increase the row count dramatically and explain why you're running out of workspace.
>You could avoid at least one sort by moving join... ...right after the inner join which uses the same key column.
Oh wow. I'm so disappointed, I was certain the SQL parser did that. You are right it does not.
Indexes haven't been mentioned yet.
It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.
@ChrisNZ wrote:
Indexes haven't been mentioned yet.
It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.
Thought about it and actually played around with some mock-up data but couldn't make the left joins use the index. Then I found below paper:
"In a LEFT join PROC SQL must pick every row in the LEFT table. If there are any indexes built on the LEFT table,
SQL cannot use them for a LEFT join. It can make use of indexes on the right table, because only certain rows are
selected from the RIGHT table – those that match what are in the LEFT table. But, there is no way for PROC SQL to
use an index built on the LEFT table. So, do not waste your time building and maintaining an index on the LEFT
table for the purpose of doing a LEFT join."
@Patrick This makes sense. Hence my suggestion that the right tables be indexed.
You can force the using index by using option (idxname=).
I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.
@ChrisNZ wrote:
@Patrick This makes sense. Hence my suggestion that the right tables be indexed.
You can force the using index by using option (idxname=).
I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.
Hmm... I can see how an index on the right table could get used for the first inner join. But isn't then already the first left join executing against the result set of the inner join and not against the right table? So how could an index on the right table get used here? Can you show me using below sample code?
data
Main(keep=m id id2)
Inner(keep=i id)
L1(keep=l1 id)
L2(keep=l2 id2)
;
length id id2 8;
m='master';
i='inner';
l1='left 1';
l2='left 2';
do id=1 to 1000;
id2=100*id;
output main;
if mod(id,2)=0 then output inner;
if mod(id,4)=0 then output l1;
if mod(id,8)=0 then output l2;
end;
run;
proc sql;
create index id on main(id);
quit;
proc sql _method;
create table want as
select m.id, m.id2,m.m, i.i, l1.l1, l2.l2
from main(idxname=id) m
inner join inner i on m.id=i.id
left join l1 on m.id=l1.id
left join l2 on m.id2=l2.id2
;
quit;
sqxcrta sqxjm sqxsort sqxsrc( WORK.L2 ) sqxsort sqxjm sqxsort sqxsrc( WORK.L1 ) sqxsort sqxjndx sqxsrc( WORK.INNER(alias = I) ) sqxsrc( WORK.MAIN(alias = M) )
@Patrick Not sure I understand the question.
The index is used to retrieve date from the right table B.
Once that data is there, there's nothing else to retrieve from B.
However the B data can be used to fetch the data from C, using an index again.
data CLASS(index=(NAME));
set SASHELP.CLASS;
run;
options msglevel=i;
proc sql _method;
select a.NAME, b.SEX, c.AGE
from CLASS a
left join CLASS(idxname=NAME) b on a.NAME=b.NAME
left join CLASS(idxname=NAME) c on b.NAME=c.NAME;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.