Transform Proc SQL to a Hash Object

PegaZeus · Posted 04-10-2020 12:01 PM

SAS seems to run out of work disc space (sorting space) when I run the proc sql, even if I specify the SQL Optimizer to use a hash join (magic=103), so I'm looking to transform the following proc sql into a hash object approach. How I go about doing this:

proc sql magic=103;
  create table final as 
  select *
  from have a
  inner join ids b on (a.id = b.id)
  left join nad g on (a.fpc = g.fpc)
  left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
  left join network f on (a.id = f.id)
  left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and 
           (a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
  left join instate c on (a.ncp_id = c.ncp_id);
quit;

Kurt_Bremser · Posted 04-10-2020 12:08 PM

Don't use the asterisk, use an exhaustive list of wanted variables with correct table aliases. Otherwise you will at least get WARNINGs.

What are the sizes (in terms of observation numbers) and the sizes of the variables you want to join to your main table? This will determine if you can fit the tables into memory.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PegaZeus · Posted 04-10-2020 01:28 PM

Hey Kurt

The size of the base table is approx. 58,000,000 records

The sizes of the join fields range from a length of 2 to 14. Fore example, client is 2, gpi is 14, fpc is 11, ncpdp is 5...

Kurt_Bremser · Posted 04-10-2020 01:49 PM

We need to know the numbers of obs in all the tables you want to join to the base, and the lengths of the variables you want from there.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SASKiwi · Posted 04-10-2020 06:27 PM

@PegaZeus - 58 million rows isn't large. How about trying - options compress = yes; - to see if that helps. If your base table contains lots of long character variables then compressing will save a lot of space.

If you are using a SAS server, then I'd suggest it doesn't have enough SAS WORK space configured. Talk to your SAS administrator about this. You may be able to fix this program to use less WORK space, but then you'll run out again in the next big job. Fix the problem, not the symptom.

ChrisNZ · Posted 04-11-2020 04:23 AM

@SASKiwi Utility files are not compressed.

@PegaZeus You could try 2 steps: Join a and b (and maybe f) on id. This will reduce the number of rows, and the sql parser will know what it's dealing with and will adopt the most efficient method. Then do the other joins against that first table. Don't forget the _method option to monitor the parser's choices.

High-Performance SAS Coding - Third Edition

SASKiwi · Posted 04-11-2020 07:27 PM

@ChrisNZ - True, but the code as posted contains temporary SAS datasets where compression may help.

ChrisNZ · Posted 04-11-2020 07:38 PM

@SASKiwi I don't see temporary data sets. Can you tell em more?

Also, one thing that sometimes help is using a compressed folder for utility files.

High-Performance SAS Coding - Third Edition

Patrick · Posted 04-10-2020 10:15 PM

This join will require a lot of sorting of your Have table/intermediary result sets and unless the inner join doesn't reduce the volumes considerably that's going to be sorting 58M rows multiple times.

Sorting will create intermediary files on disk so as @SASKiwi suggests using option compress=yes should help to save disk space and I/O and though also improve performance.

You could avoid at least one sort by moving join...

left join network f on (a.id = f.id)

...right after the inner join which uses the same key column.

I believe option magic=103 only works for inner joins.

A data step hash approach could eventually improve performance but you would need to have sufficient memory for loading all the lookup tables.

If you run a Proc Contents over your tables then you'll get the information how many rows a table has and the names and lengths of all variables in the table. Sum the lengths of all variables you want, multiply by the number of rows in the table, add 17MB to the result and you'll get an estimate how much memory a hash table will consume.

Using a Select * when joining tables will create warnings if the same variable name exists in more than one table. You should list your variables explicitly.

You can save yourself some typing by executing below - and then use the expanded variable list in the SAS log as starting point for your select statement.

proc sql feedback inobs=1;
/*  create table final as */
  select *
  from have a
  inner join ids b on (a.id = b.id)
  left join nad g on (a.fpc = g.fpc)
  left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
  left join network f on (a.id = f.id)
  left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and 
           (a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
  left join instate c on (a.ncp_id = c.ncp_id);
quit;

You can also pre-sort your lookup tables. This won't avoid the sorting of the Have table and the intermediary results, but it will avoid sorting of the lookup tables as part of the SQL.

So you could use a Proc Sort first for you lookup tables. The SQL won't sort the tables again if it has it is already sorted AND has the sortedby=<variable name(s)) property set (which Proc Sort will do).

If you know that a table is already sorted the way you need it but the table metadata doesn't have the sortedby attribute set (proc contents will show you) then you can also formulate it explicitly when using the table in the SQL; i.e. left join network(sortedby=id) f on (a.id = f.id)

And last but not least:

Make sure the relationships between table Have and any of the other tables is M:(0 or 1). Any other relationship could increase the row count dramatically and explain why you're running out of workspace.

ChrisNZ · Posted 04-11-2020 08:20 PM

@Patrick

>You could avoid at least one sort by moving join... ...right after the inner join which uses the same key column.

Oh wow. I'm so disappointed, I was certain the SQL parser did that. You are right it does not.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-11-2020 08:37 PM

Indexes haven't been mentioned yet.

It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.

High-Performance SAS Coding - Third Edition

Patrick · Posted 04-11-2020 09:38 PM

@ChrisNZ wrote:

Indexes haven't been mentioned yet.

It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.

Thought about it and actually played around with some mock-up data but couldn't make the left joins use the index. Then I found below paper:

"In a LEFT join PROC SQL must pick every row in the LEFT table. If there are any indexes built on the LEFT table,
SQL cannot use them for a LEFT join. It can make use of indexes on the right table, because only certain rows are
selected from the RIGHT table – those that match what are in the LEFT table. But, there is no way for PROC SQL to
use an index built on the LEFT table. So, do not waste your time building and maintaining an index on the LEFT
table for the purpose of doing a LEFT join."

ChrisNZ · Posted 04-11-2020 11:04 PM

@Patrick This makes sense. Hence my suggestion that the right tables be indexed.

You can force the using index by using option (idxname=).

I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.

High-Performance SAS Coding - Third Edition

Patrick · Posted 04-12-2020 12:33 AM

@ChrisNZ wrote:

@Patrick This makes sense. Hence my suggestion that the right tables be indexed.

You can force the using index by using option (idxname=).

I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.

Hmm... I can see how an index on the right table could get used for the first inner join. But isn't then already the first left join executing against the result set of the inner join and not against the right table? So how could an index on the right table get used here? Can you show me using below sample code?

data 
  Main(keep=m id id2) 
  Inner(keep=i id) 
  L1(keep=l1 id) 
  L2(keep=l2 id2) 
  ;
  length id id2 8;
  m='master';
  i='inner';
  l1='left 1';
  l2='left 2';
  do id=1 to 1000;
    id2=100*id;
    output main;
    if mod(id,2)=0 then output inner;
    if mod(id,4)=0 then output l1;
    if mod(id,8)=0 then output l2;
  end;
run;

proc sql;
  create index id on main(id);
quit;

proc sql _method;
  create table want as
    select m.id, m.id2,m.m, i.i, l1.l1, l2.l2 
    from main(idxname=id) m
    inner join inner i on m.id=i.id
    left join l1 on m.id=l1.id
    left join l2 on m.id2=l2.id2
  ;
quit;

      sqxcrta
          sqxjm
              sqxsort
                  sqxsrc( WORK.L2 )
              sqxsort
                  sqxjm
                      sqxsort
                          sqxsrc( WORK.L1 )
                      sqxsort
                          sqxjndx
                              sqxsrc( WORK.INNER(alias = I) )
                              sqxsrc( WORK.MAIN(alias = M) )

ChrisNZ · Posted 04-12-2020 03:04 AM

@Patrick Not sure I understand the question.

The index is used to retrieve date from the right table B.

Once that data is there, there's nothing else to retrieve from B.

However the B data can be used to fetch the data from C, using an index again.

data CLASS(index=(NAME));
  set SASHELP.CLASS;
run;
options msglevel=i;
proc sql _method;
  select a.NAME, b.SEX, c.AGE
  from        CLASS               a
    left join CLASS(idxname=NAME) b on a.NAME=b.NAME
    left join CLASS(idxname=NAME) c on b.NAME=c.NAME;

76 options msglevel=i;

77 proc sql _method;

78 select a.NAME, b.SEX, c.AGE

79 from CLASS a

80 left join CLASS(idxname=NAME) b on a.NAME=b.NAME

81 left join CLASS(idxname=NAME) c on b.NAME=c.NAME;

INFO: Index Name of SQL table WORK.CLASS (alias = B) selected for SQL WHERE clause (outer join) optimization.

INFO: Index Name of SQL table WORK.CLASS (alias = C) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

sqxslct

sqxjndx

sqxjndx

sqxsrc( WORK.CLASS(alias = A) )

sqxsrc( WORK.CLASS(alias = B) )

sqxsrc( WORK.CLASS(alias = C) )

High-Performance SAS Coding - Third Edition

Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

Re: Transform Proc SQL to a Hash Object

SAS Innovate 2025: Call for Content

Classroom Training Available!