Solved: Re: hash merge

Batman · Posted 03-02-2023 07:19 PM

I'm trying to find which records in a small dataset ('COHORT.HICS') are not in the larger one (ABCD.DATA). Trying use merge via hash rather than sorting and then a data step, but it seems to run out resources. Any suggestions how to make it work.

2 OPTIONS NOCENTER OBS= MAX ;
3
4 DATA HICDIFF0(KEEP=HIC);
5 IF 0 THEN SET ABCD.DATA;
6 IF _N_ = 1 THEN DO;
7 DECLARE HASH MB(DATASET:'ABCD.DATA');
8 MB.DEFINEKEY('HIC');
9 MB.DEFINEDONE();
10 END;
11 SET COHORT.HICS;
12 IF MB.FIND() NE 0 THEN OUTPUT;
13 RUN;

NOTE: THERE WERE 12354503 OBSERVATIONS READ FROM THE DATA SET ABCD.DATA.
ERROR: HASH DATA SET LOAD FAILED AT LINE 9 COLUMN 5.
ERROR: HASH OBJECT ADDED 12354498 ITEMS WHEN MEMORY FAILURE OCCURRED.
FATAL: INSUFFICIENT MEMORY TO EXECUTE DATA STEP PROGRAM. ABORTED DURING THE EXEC
ERROR: THE SAS SYSTEM STOPPED PROCESSING THIS STEP BECAUSE OF INSUFFICIENT MEMOR
WARNING: THE DATA SET WORK.HICDIFF0 MAY BE INCOMPLETE. WHEN THIS STEP WAS STOPP
NOTE: THE DATA STATEMENT USED 12.64 CPU SECONDS AND 18741K.

NOTE: THE ADDRESS SPACE HAS USED A MAXIMUM OF 920K BELOW THE LINE AND 1555896K A

Ksharp · Posted 03-03-2023 07:46 AM

/*
There is no need Hash at all.
It is a piece of cake for SQL
*/
data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

proc sql;
create table want as
select name from small
except
select name from large;
quit;

View solution in original post

SASKiwi · Posted 03-02-2023 07:26 PM

Hash tables have to fit entirely into memory. Why not swap the tables so the small one COHORT.HICS is the hash table?

Batman · Posted 03-02-2023 07:31 PM

would I need to change to code in row 12?

SASKiwi · Posted 03-02-2023 08:53 PM

How many rows are in your small table and in your large table? A hash merge might not be the best approach for comparing a very small table with a very large one.

Tom · Posted 03-02-2023 11:30 PM

Does the LARGE one have an INDEX on HIC? If so then use that.

data want;
  set SMALL;
  set LARGE key=HIC;
  if _error_ then do;
     put 'NOTE: ' HIC= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Example:

data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

data want;
  set SMALL;
  set LARGE key=NAME ;
  if _error_ then do;
     put 'NOTE: ' NAME= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Result:

s_lassen · Posted 03-03-2023 03:58 AM

You may run into trouble with that code, if the small dataset has repeating values on the NAME variable (the first Alfred will be found, the second will not).

To avoid that, use KEY=/UNIQUE:

data want;
  set SMALL;
  set LARGE key=NAME  /unique;
  if _error_ then do;
     put 'NOTE: ' NAME= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Kurt_Bremser · Posted 03-03-2023 02:39 AM

Read the small dataset into a hash where you define an additional flag variable (you need to use a DO loop and the ADD() method to fill the hash).

Then read the large dataset and update the flag in the hash if a particular HIC is found. At the end, write out only those members from the hash which have the flag set.

Just an idea, will test it later when I'm in front of my desktop.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 03-03-2023 03:12 AM

Here a code example for my suggestion:

data small;
set sashelp.class (obs=10);
run;

data large;
set sashelp.class (firstobs=6);
run;

data _null_;
set large end=done;
if _n_ = 1
then do;
  length flag 8;
  declare hash s (ordered:"yes");
  s.definekey("name");
  s.definedata("name","sex","age","height","weight","flag");
  s.definedone();
  do until(done_s);
    set small end=done_s;
    rc = s.add();
  end;
end;
if s.find() = 0
then do;
  flag = 1;
  rc = s.replace();
end;
if done
then do;
  rc = s.output(dataset:"want (where=(flag = 1))");
end;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 03-03-2023 07:46 AM

/*
There is no need Hash at all.
It is a piece of cake for SQL
*/
data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

proc sql;
create table want as
select name from small
except
select name from large;
quit;

Batman · Posted 03-03-2023 08:59 AM

Yes, I normally don't expect sql to run run faster than hash, but it does in this case, apparently because no sort is being done.

SASKiwi · Posted 03-03-2023 05:19 PM

That's why it's a good idea not to be fixated on any particular methodology when you have tricky high data volume use case.

Most common hash question titles: How do I use hash to do XYZ?

A better title is: What is the best coding method to do XYZ?

Also @Ksharp 's SQL solution is a lot simpler so it's a win-win...

Classroom Training Available!