topic hash merge in SAS Programming

hash merge

Batman — Fri, 03 Mar 2023 00:19:28 GMT

I'm trying to find which records in a small dataset ('COHORT.HICS') are not in the larger one (ABCD.DATA). Trying use merge via hash rather than sorting and then a data step, but it seems to run out resources. Any suggestions how to make it work.

2 OPTIONS NOCENTER OBS= MAX ;
3
4 DATA HICDIFF0(KEEP=HIC);
5 IF 0 THEN SET ABCD.DATA;
6 IF _N_ = 1 THEN DO;
7 DECLARE HASH MB(DATASET:'ABCD.DATA');
8 MB.DEFINEKEY('HIC');
9 MB.DEFINEDONE();
10 END;
11 SET COHORT.HICS;
12 IF MB.FIND() NE 0 THEN OUTPUT;
13 RUN;

NOTE: THERE WERE 12354503 OBSERVATIONS READ FROM THE DATA SET ABCD.DATA.
ERROR: HASH DATA SET LOAD FAILED AT LINE 9 COLUMN 5.
ERROR: HASH OBJECT ADDED 12354498 ITEMS WHEN MEMORY FAILURE OCCURRED.
FATAL: INSUFFICIENT MEMORY TO EXECUTE DATA STEP PROGRAM. ABORTED DURING THE EXEC
ERROR: THE SAS SYSTEM STOPPED PROCESSING THIS STEP BECAUSE OF INSUFFICIENT MEMOR
WARNING: THE DATA SET WORK.HICDIFF0 MAY BE INCOMPLETE. WHEN THIS STEP WAS STOPP
NOTE: THE DATA STATEMENT USED 12.64 CPU SECONDS AND 18741K.

NOTE: THE ADDRESS SPACE HAS USED A MAXIMUM OF 920K BELOW THE LINE AND 1555896K A

Re: hash merge

SASKiwi — Fri, 03 Mar 2023 00:26:56 GMT

Hash tables have to fit entirely into memory. Why not swap the tables so the small one COHORT.HICS is the hash table?

Re: hash merge

Batman — Fri, 03 Mar 2023 00:31:39 GMT

would I need to change to code in row 12?

Re: hash merge

SASKiwi — Fri, 03 Mar 2023 01:53:05 GMT

How many rows are in your small table and in your large table? A hash merge might not be the best approach for comparing a very small table with a very large one.

Re: hash merge

Tom — Fri, 03 Mar 2023 04:30:28 GMT

Does the LARGE one have an INDEX on HIC? If so then use that.

data want;
  set SMALL;
  set LARGE key=HIC;
  if _error_ then do;
     put 'NOTE: ' HIC= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Example:

data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

data want;
  set SMALL;
  set LARGE key=NAME ;
  if _error_ then do;
     put 'NOTE: ' NAME= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Result:

Re: hash merge

Kurt_Bremser — Fri, 03 Mar 2023 07:39:03 GMT

Read the small dataset into a hash where you define an additional flag variable (you need to use a DO loop and the ADD() method to fill the hash).

Then read the large dataset and update the flag in the hash if a particular HIC is found. At the end, write out only those members from the hash which have the flag set.

Just an idea, will test it later when I'm in front of my desktop.

Re: hash merge

Kurt_Bremser — Fri, 03 Mar 2023 08:12:39 GMT

Here a code example for my suggestion:

data small;
set sashelp.class (obs=10);
run;

data large;
set sashelp.class (firstobs=6);
run;

data _null_;
set large end=done;
if _n_ = 1
then do;
  length flag 8;
  declare hash s (ordered:"yes");
  s.definekey("name");
  s.definedata("name","sex","age","height","weight","flag");
  s.definedone();
  do until(done_s);
    set small end=done_s;
    rc = s.add();
  end;
end;
if s.find() = 0
then do;
  flag = 1;
  rc = s.replace();
end;
if done
then do;
  rc = s.output(dataset:"want (where=(flag = 1))");
end;
run;

Re: hash merge

s_lassen — Fri, 03 Mar 2023 08:58:45 GMT

You may run into trouble with that code, if the small dataset has repeating values on the NAME variable (the first Alfred will be found, the second will not).

To avoid that, use KEY=/UNIQUE:

data want;
  set SMALL;
  set LARGE key=NAME  /unique;
  if _error_ then do;
     put 'NOTE: ' NAME= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Re: hash merge

Ksharp — Fri, 03 Mar 2023 12:46:38 GMT

/*
There is no need Hash at all.
It is a piece of cake for SQL
*/
data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

proc sql;
create table want as
select name from small
except
select name from large;
quit;

Re: hash merge

Batman — Fri, 03 Mar 2023 13:59:11 GMT

Yes, I normally don't expect sql to run run faster than hash, but it does in this case, apparently because no sort is being done.

Re: hash merge

SASKiwi — Fri, 03 Mar 2023 22:19:51 GMT

That's why it's a good idea not to be fixated on any particular methodology when you have tricky high data volume use case.

Most common hash question titles: How do I use hash to do XYZ?

A better title is: What is the best coding method to do XYZ?

Also @Ksharp 's SQL solution is a lot simpler so it's a win-win...