topic Re: Transform Proc SQL to a Hash Object in SAS Programming

Transform Proc SQL to a Hash Object

PegaZeus — Fri, 10 Apr 2020 16:01:54 GMT

SAS seems to run out of work disc space (sorting space) when I run the proc sql, even if I specify the SQL Optimizer to use a hash join (magic=103), so I'm looking to transform the following proc sql into a hash object approach. How I go about doing this:

proc sql magic=103;
  create table final as 
  select *
  from have a
  inner join ids b on (a.id = b.id)
  left join nad g on (a.fpc = g.fpc)
  left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
  left join network f on (a.id = f.id)
  left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and 
           (a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
  left join instate c on (a.ncp_id = c.ncp_id);
quit;

Re: Transform Proc SQL to a Hash Object

Kurt_Bremser — Fri, 10 Apr 2020 16:08:31 GMT

Don't use the asterisk, use an exhaustive list of wanted variables with correct table aliases. Otherwise you will at least get WARNINGs.

What are the sizes (in terms of observation numbers) and the sizes of the variables you want to join to your main table? This will determine if you can fit the tables into memory.

Re: Transform Proc SQL to a Hash Object

PegaZeus — Fri, 10 Apr 2020 17:28:41 GMT

Hey Kurt

The size of the base table is approx. 58,000,000 records

The sizes of the join fields range from a length of 2 to 14. Fore example, client is 2, gpi is 14, fpc is 11, ncpdp is 5...

Re: Transform Proc SQL to a Hash Object

Kurt_Bremser — Fri, 10 Apr 2020 17:49:42 GMT

We need to know the numbers of obs in all the tables you want to join to the base, and the lengths of the variables you want from there.

Re: Transform Proc SQL to a Hash Object

SASKiwi — Fri, 10 Apr 2020 22:27:04 GMT

@PegaZeus - 58 million rows isn't large. How about trying - options compress = yes; - to see if that helps. If your base table contains lots of long character variables then compressing will save a lot of space.

If you are using a SAS server, then I'd suggest it doesn't have enough SAS WORK space configured. Talk to your SAS administrator about this. You may be able to fix this program to use less WORK space, but then you'll run out again in the next big job. Fix the problem, not the symptom.

Re: Transform Proc SQL to a Hash Object

Patrick — Sun, 12 Apr 2020 04:00:10 GMT

This join will require a lot of sorting of your Have table/intermediary result sets and unless the inner join doesn't reduce the volumes considerably that's going to be sorting 58M rows multiple times.

Sorting will create intermediary files on disk so as @SASKiwi suggests using option compress=yes should help to save disk space and I/O and though also improve performance.

You could avoid at least one sort by moving join...

left join network f on (a.id = f.id)

...right after the inner join which uses the same key column.

I believe option magic=103 only works for inner joins.

A data step hash approach could eventually improve performance but you would need to have sufficient memory for loading all the lookup tables.

If you run a Proc Contents over your tables then you'll get the information how many rows a table has and the names and lengths of all variables in the table. Sum the lengths of all variables you want, multiply by the number of rows in the table, add 17MB to the result and you'll get an estimate how much memory a hash table will consume.

Using a Select * when joining tables will create warnings if the same variable name exists in more than one table. You should list your variables explicitly.

You can save yourself some typing by executing below - and then use the expanded variable list in the SAS log as starting point for your select statement.

proc sql feedback inobs=1;
/*  create table final as */
  select *
  from have a
  inner join ids b on (a.id = b.id)
  left join nad g on (a.fpc = g.fpc)
  left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
  left join network f on (a.id = f.id)
  left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and 
           (a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
  left join instate c on (a.ncp_id = c.ncp_id);
quit;

You can also pre-sort your lookup tables. This won't avoid the sorting of the Have table and the intermediary results, but it will avoid sorting of the lookup tables as part of the SQL.

So you could use a Proc Sort first for you lookup tables. The SQL won't sort the tables again if it has it is already sorted AND has the sortedby=<variable name(s)) property set (which Proc Sort will do).

If you know that a table is already sorted the way you need it but the table metadata doesn't have the sortedby attribute set (proc contents will show you) then you can also formulate it explicitly when using the table in the SQL; i.e. left join network(sortedby=id) f on (a.id = f.id)

And last but not least:

Make sure the relationships between table Have and any of the other tables is M:(0 or 1). Any other relationship could increase the row count dramatically and explain why you're running out of workspace.

Re: Transform Proc SQL to a Hash Object

ChrisNZ — Sat, 11 Apr 2020 08:23:40 GMT

@SASKiwi Utility files are not compressed.

@PegaZeus You could try 2 steps: Join a and b (and maybe f) on id. This will reduce the number of rows, and the sql parser will know what it's dealing with and will adopt the most efficient method. Then do the other joins against that first table. Don't forget the _method option to monitor the parser's choices.

Re: Transform Proc SQL to a Hash Object

SASKiwi — Sat, 11 Apr 2020 23:27:13 GMT

@ChrisNZ - True, but the code as posted contains temporary SAS datasets where compression may help.

Re: Transform Proc SQL to a Hash Object

ChrisNZ — Sat, 11 Apr 2020 23:38:44 GMT

@SASKiwi I don't see temporary data sets. Can you tell em more?

Also, one thing that sometimes help is using a compressed folder for utility files.

Re: Transform Proc SQL to a Hash Object

ChrisNZ — Sun, 12 Apr 2020 00:20:00 GMT

@Patrick

>You could avoid at least one sort by moving join... ...right after the inner join which uses the same key column.

Oh wow. I'm so disappointed, I was certain the SQL parser did that. You are right it does not.

Re: Transform Proc SQL to a Hash Object

ChrisNZ — Sun, 12 Apr 2020 00:37:39 GMT

Indexes haven't been mentioned yet.

It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.

Re: Transform Proc SQL to a Hash Object

Patrick — Sun, 12 Apr 2020 01:38:24 GMT

@ChrisNZ wrote:

Indexes haven't been mentioned yet.

It may well be, depending on their size, that tables such as pricing and instate just have to be indexed, to avoid sorting the main table.

Thought about it and actually played around with some mock-up data but couldn't make the left joins use the index. Then I found below paper:

"In a LEFT join PROC SQL must pick every row in the LEFT table. If there are any indexes built on the LEFT table,
SQL cannot use them for a LEFT join. It can make use of indexes on the right table, because only certain rows are
selected from the RIGHT table – those that match what are in the LEFT table. But, there is no way for PROC SQL to
use an index built on the LEFT table. So, do not waste your time building and maintaining an index on the LEFT
table for the purpose of doing a LEFT join."

Re: Transform Proc SQL to a Hash Object

ChrisNZ — Sun, 12 Apr 2020 03:04:59 GMT

@Patrick This makes sense. Hence my suggestion that the right tables be indexed.

You can force the using index by using option (idxname=).

I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.

Re: Transform Proc SQL to a Hash Object

ScottBass — Sun, 12 Apr 2020 03:42:39 GMT

@PegaZeus wrote:

proc sql magic=103;
  create table final as 
  select *
  from have a
  inner join ids b on (a.id = b.id)
  left join nad g on (a.fpc = g.fpc)
  left join pricing e on (a.gpi = e.gpi) and (d.list_id = e.list_id)
  left join network f on (a.id = f.id)
  left join client_segment d on (a.client = d.client) and (a.bob = d.bob) and 
           (a.seg = d.seg) and (a.ex_id = d.ex_id) and (a.aids = d.aids)
  left join instate c on (a.ncp_id = c.ncp_id);
quit;

Have a look at:

https://github.com/scottbass/SAS/blob/master/Macro/hash_define.sas

https://github.com/scottbass/SAS/blob/master/Macro/hash_lookup.sas

You'll also need the utility macros:

https://github.com/scottbass/SAS/blob/master/Macro/parmv.sas

Read the hash_define macro header thoroughly. Run the self-contained examples.

You can simulate an inner join by filtering (subsetting if statement) on the return code from the hash object lookup.

Given your join criteria the order of your lookups is important.

Hopefully your dimension tables will fit into the memory available to your SAS session, otherwise you'll need to split the code into multiple datasteps. Alternatively, you can create indexes on your dimension tables and use an index key lookup ("double set statement" - see the documentation on the SET statement, esp. the examples showing transaction datasets).

Give it a go and post followup questions.

Re: Transform Proc SQL to a Hash Object

Patrick — Sun, 12 Apr 2020 04:33:54 GMT

@ChrisNZ wrote:

@Patrick This makes sense. Hence my suggestion that the right tables be indexed.

You can force the using index by using option (idxname=).

I expect that the tables have to be large (especially wide), and the proportion of retrieved rows low, before the index is faster than a sort.

Hmm... I can see how an index on the right table could get used for the first inner join. But isn't then already the first left join executing against the result set of the inner join and not against the right table? So how could an index on the right table get used here? Can you show me using below sample code?

data 
  Main(keep=m id id2) 
  Inner(keep=i id) 
  L1(keep=l1 id) 
  L2(keep=l2 id2) 
  ;
  length id id2 8;
  m='master';
  i='inner';
  l1='left 1';
  l2='left 2';
  do id=1 to 1000;
    id2=100*id;
    output main;
    if mod(id,2)=0 then output inner;
    if mod(id,4)=0 then output l1;
    if mod(id,8)=0 then output l2;
  end;
run;

proc sql;
  create index id on main(id);
quit;

proc sql _method;
  create table want as
    select m.id, m.id2,m.m, i.i, l1.l1, l2.l2 
    from main(idxname=id) m
    inner join inner i on m.id=i.id
    left join l1 on m.id=l1.id
    left join l2 on m.id2=l2.id2
  ;
quit;

      sqxcrta
          sqxjm
              sqxsort
                  sqxsrc( WORK.L2 )
              sqxsort
                  sqxjm
                      sqxsort
                          sqxsrc( WORK.L1 )
                      sqxsort
                          sqxjndx
                              sqxsrc( WORK.INNER(alias = I) )
                              sqxsrc( WORK.MAIN(alias = M) )

Re: Transform Proc SQL to a Hash Object

ChrisNZ — Sun, 12 Apr 2020 08:03:03 GMT

@Patrick Not sure I understand the question.

The index is used to retrieve date from the right table B.

Once that data is there, there's nothing else to retrieve from B.

However the B data can be used to fetch the data from C, using an index again.

data CLASS(index=(NAME));
  set SASHELP.CLASS;
run;
options msglevel=i;
proc sql _method;
  select a.NAME, b.SEX, c.AGE
  from        CLASS               a
    left join CLASS(idxname=NAME) b on a.NAME=b.NAME
    left join CLASS(idxname=NAME) c on b.NAME=c.NAME;

76 options msglevel=i;

77 proc sql _method;

78 select a.NAME, b.SEX, c.AGE

79 from CLASS a

80 left join CLASS(idxname=NAME) b on a.NAME=b.NAME

81 left join CLASS(idxname=NAME) c on b.NAME=c.NAME;

INFO: Index Name of SQL table WORK.CLASS (alias = B) selected for SQL WHERE clause (outer join) optimization.

INFO: Index Name of SQL table WORK.CLASS (alias = C) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

sqxslct

sqxjndx

sqxjndx

sqxsrc( WORK.CLASS(alias = A) )

sqxsrc( WORK.CLASS(alias = B) )

sqxsrc( WORK.CLASS(alias = C) )

Re: Transform Proc SQL to a Hash Object

Kurt_Bremser — Mon, 13 Apr 2020 10:11:40 GMT

Assuming there is a column "name" in dataset "ids", this does the first part of your SQL:

data final;
set have;
if _n_ = 1
then do;
  length name $20; /* set attributes of data variables in the hash */
  declare hash ids(dataset:"ids");
  ids.definekey('id');
  ids.definedata('name'); /* insert more columns as needed */
  ids.definedone();
  call missing(name); /* prevents "uninitialized" NOTE */
end;
if ids.find() = 0; /* does two things:
  - fetches data coming from dataset ids
  - proceeds with the data step iteration only if match is found
    (does the "inner join")
*/
run;

Expand this as needed.

For more detailed assistance, show the structures of all the datasets in your SQL.

Edited according to answer by @ScottBass

Re: Transform Proc SQL to a Hash Object

ScottBass — Mon, 13 Apr 2020 09:01:46 GMT

There is an implied retain statement for data set variables. Therefore the "if 0 then set ids" construct will cause a retain on all variables in ids. Of particular interest are your definedata() variables, eg. "name".

Since ids is never read by a set statement, any failed find() method will cause the value from the previous successful find to be carried forward in the PDV. Probably not what you want.

I recommend a call missing(of _all_) or at least call missing(of list,of,hash,object,satellite,variables) at the appropriate point in your data step.

Re: Transform Proc SQL to a Hash Object

Kurt_Bremser — Mon, 13 Apr 2020 10:08:59 GMT

You are absolutely right. In this particular case, though, it would not be of consequence, as the subsetting if lets only observations through where a value is found.

But I will edit my post, as the other ("left join") hashes would suffer from this effect.