BookmarkSubscribeRSS Feed
PegaZeus
Obsidian | Level 7

SAS seems to run out of work disc space (sorting space) when I run the proc sql, even if I specify the SQL Optimizer to use a hash join (magic=103), so I'm looking to transform the following proc sql into a hash object approach. How I go about doing this:

 

proc sql magic=103;
  create table final as 
  select *
  from have a
  inner join ids b on (a.id = b.id)
  left join nad g on (a.fpc = g.fpc)
  left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
  left join network f on (a.id = f.id)
  left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and 
           (a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
  left join instate c on (a.ncp_id = c.ncp_id);
quit;
18 REPLIES 18
Kurt_Bremser
Super User

Don't use the asterisk, use an exhaustive list of wanted variables with correct table aliases. Otherwise you will at least get WARNINGs.

 

What are the sizes (in terms of observation numbers) and the sizes of the variables you want to join to your main table? This will determine if you can fit the tables into memory.

PegaZeus
Obsidian | Level 7

Hey Kurt

 

The size of the base table is approx. 58,000,000 records

 

The sizes of the join fields range from a length of 2 to 14. Fore example, client is 2, gpi is 14, fpc is 11, ncpdp is 5...

 

 

SASKiwi
PROC Star

@PegaZeus  - 58 million rows isn't large. How about trying - options compress = yes; - to see if that helps. If your base table contains lots of long character variables then compressing will save a lot of space.

 

If you are using a SAS server, then I'd suggest it doesn't have enough SAS WORK space configured. Talk to your SAS administrator about this. You may be able to fix this program to use less WORK space, but then you'll  run out again in the next big job. Fix the problem, not the symptom. 

ChrisNZ
Tourmaline | Level 20

@SASKiwi Utility files are not compressed.

@PegaZeus You could try 2 steps: Join a and b (and maybe f) on id. This will reduce the number of rows, and the sql parser will know what it's dealing with and will adopt the most efficient method. Then do the other joins against that first table. Don't forget the _method option to monitor the parser's choices.

SASKiwi
PROC Star

@ChrisNZ - True, but the code as posted contains temporary SAS datasets where compression may help.

ChrisNZ
Tourmaline | Level 20

@SASKiwi I don't see temporary data sets. Can you tell em more?

Also, one thing that sometimes help is using a compressed folder for utility files.

Patrick
Opal | Level 21

This join will require a lot of sorting of your Have table/intermediary result sets and unless the inner join doesn't reduce the volumes considerably that's going to be sorting 58M rows multiple times. 

Sorting will create intermediary files on disk so as @SASKiwi suggests using option compress=yes should help to save disk space and I/O and though also improve performance.

 

You could avoid at least one sort by moving join...

 

left join network f on (a.id = f.id)

 

...right after the inner join which uses the same key column.

 

I believe option magic=103 only works for inner joins. 

A data step hash approach could eventually improve performance but you would need to have sufficient memory for loading all the lookup tables. 

If you run a Proc Contents over your tables then you'll get the information how many rows a table has and the names and lengths of all variables in the table. Sum the lengths of all variables you want, multiply by the number of rows in the table, add 17MB to the result and you'll get an estimate how much memory a hash table will consume.

 

Using a Select * when joining tables will create warnings if the same variable name exists in more than one table. You should list your variables explicitly. 

You can save yourself some typing by executing below - and then use the expanded variable list in the SAS log as starting point for your select statement.

proc sql feedback inobs=1;
/*  create table final as */
  select *
  from have a
  inner join ids b on (a.id = b.id)
  left join nad g on (a.fpc = g.fpc)
  left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
  left join network f on (a.id = f.id)
  left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and 
           (a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
  left join instate c on (a.ncp_id = c.ncp_id);
quit;

You can also pre-sort your lookup tables. This won't avoid the sorting of the Have table and the intermediary results, but it will avoid sorting of the lookup tables as part of the SQL.

So you could use a Proc Sort first for you lookup tables. The SQL won't sort the tables again if it has it is already sorted AND has the sortedby=<variable name(s)) property set (which Proc Sort will do).

If you know that a table is already sorted the way you need it but the table metadata doesn't have the sortedby attribute set (proc contents will show you) then you can also formulate it explicitly when using the table in the SQL; i.e. left join network(sortedby=id) f on (a.id = f.id)

 

And last but not least:

Make sure the relationships between table Have and any of the other tables is M:(0 or 1). Any other relationship could increase the row count dramatically and explain why you're running out of workspace.

ChrisNZ
Tourmaline | Level 20

@Patrick 

>You could avoid at least one sort by moving join...  ...right after the inner join which uses the same key column.

Oh wow. I'm so disappointed, I was certain the SQL parser did that. You are right it does not.

ChrisNZ
Tourmaline | Level 20

Indexes haven't been mentioned yet.

It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.

 

Patrick
Opal | Level 21

@ChrisNZ wrote:

Indexes haven't been mentioned yet.

It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.

 


Thought about it and actually played around with some mock-up data but couldn't make the left joins use the index. Then I found below paper:

"In a LEFT join PROC SQL must pick every row in the LEFT table. If there are any indexes built on the LEFT table,
SQL cannot use them for a LEFT join. It can make use of indexes on the right table, because only certain rows are
selected from the RIGHT table – those that match what are in the LEFT table. But, there is no way for PROC SQL to
use an index built on the LEFT table. So, do not waste your time building and maintaining an index on the LEFT
table for the purpose of doing a LEFT join."

ChrisNZ
Tourmaline | Level 20

@Patrick This makes sense. Hence my suggestion that the right tables be indexed.

You can force the using index by using option (idxname=).

I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.

Patrick
Opal | Level 21

@ChrisNZ wrote:

@Patrick This makes sense. Hence my suggestion that the right tables be indexed.

You can force the using index by using option (idxname=).

I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.


Hmm... I can see how an index on the right table could get used for the first inner join. But isn't then already the first left join executing against the result set of the inner join and not against the right table? So how could an index on the right table get used here? Can you show me using below sample code?

data 
  Main(keep=m id id2) 
  Inner(keep=i id) 
  L1(keep=l1 id) 
  L2(keep=l2 id2) 
  ;
  length id id2 8;
  m='master';
  i='inner';
  l1='left 1';
  l2='left 2';
  do id=1 to 1000;
    id2=100*id;
    output main;
    if mod(id,2)=0 then output inner;
    if mod(id,4)=0 then output l1;
    if mod(id,8)=0 then output l2;
  end;
run;

proc sql;
  create index id on main(id);
quit;

proc sql _method;
  create table want as
    select m.id, m.id2,m.m, i.i, l1.l1, l2.l2 
    from main(idxname=id) m
    inner join inner i on m.id=i.id
    left join l1 on m.id=l1.id
    left join l2 on m.id2=l2.id2
  ;
quit; 
      sqxcrta
          sqxjm
              sqxsort
                  sqxsrc( WORK.L2 )
              sqxsort
                  sqxjm
                      sqxsort
                          sqxsrc( WORK.L1 )
                      sqxsort
                          sqxjndx
                              sqxsrc( WORK.INNER(alias = I) )
                              sqxsrc( WORK.MAIN(alias = M) )
ChrisNZ
Tourmaline | Level 20

@Patrick Not sure I understand the question.

The index is used to retrieve date from the right table B.

Once that data is there, there's nothing else to retrieve from B.

However the B data can be used to fetch the data from C, using an index again.

data CLASS(index=(NAME));
  set SASHELP.CLASS;
run;
options msglevel=i;
proc sql _method;
  select a.NAME, b.SEX, c.AGE
  from        CLASS               a
    left join CLASS(idxname=NAME) b on a.NAME=b.NAME
    left join CLASS(idxname=NAME) c on b.NAME=c.NAME;      
76 options msglevel=i;
77 proc sql _method;
78 select a.NAME, b.SEX, c.AGE
79 from CLASS a
80 left join CLASS(idxname=NAME) b on a.NAME=b.NAME
81 left join CLASS(idxname=NAME) c on b.NAME=c.NAME;
INFO: Index Name of SQL table WORK.CLASS (alias = B) selected for SQL WHERE clause (outer join) optimization.
INFO: Index Name of SQL table WORK.CLASS (alias = C) selected for SQL WHERE clause (outer join) optimization.
 
NOTE: SQL execution methods chosen are:
 
sqxslct
sqxjndx
sqxjndx
sqxsrc( WORK.CLASS(alias = A) )
sqxsrc( WORK.CLASS(alias = B) )
sqxsrc( WORK.CLASS(alias = C) )

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 1283 views
  • 11 likes
  • 6 in conversation