I'm trying to find which records in a small dataset ('COHORT.HICS') are not in the larger one (ABCD.DATA). Trying use merge via hash rather than sorting and then a data step, but it seems to run out resources. Any suggestions how to make it work.
2 OPTIONS NOCENTER OBS= MAX ;
3
4 DATA HICDIFF0(KEEP=HIC);
5 IF 0 THEN SET ABCD.DATA;
6 IF _N_ = 1 THEN DO;
7 DECLARE HASH MB(DATASET:'ABCD.DATA');
8 MB.DEFINEKEY('HIC');
9 MB.DEFINEDONE();
10 END;
11 SET COHORT.HICS;
12 IF MB.FIND() NE 0 THEN OUTPUT;
13 RUN;
NOTE: THERE WERE 12354503 OBSERVATIONS READ FROM THE DATA SET ABCD.DATA.
ERROR: HASH DATA SET LOAD FAILED AT LINE 9 COLUMN 5.
ERROR: HASH OBJECT ADDED 12354498 ITEMS WHEN MEMORY FAILURE OCCURRED.
FATAL: INSUFFICIENT MEMORY TO EXECUTE DATA STEP PROGRAM. ABORTED DURING THE EXEC
ERROR: THE SAS SYSTEM STOPPED PROCESSING THIS STEP BECAUSE OF INSUFFICIENT MEMOR
WARNING: THE DATA SET WORK.HICDIFF0 MAY BE INCOMPLETE. WHEN THIS STEP WAS STOPP
NOTE: THE DATA STATEMENT USED 12.64 CPU SECONDS AND 18741K.
NOTE: THE ADDRESS SPACE HAS USED A MAXIMUM OF 920K BELOW THE LINE AND 1555896K A
/*
There is no need Hash at all.
It is a piece of cake for SQL
*/
data small (keep=name) large(index=(name)) ;
set sashelp.class end=eof;
if _n_ in (1,5,7) then output small;
output large;
if eof then do;
name = 'ZZZ'; output small;
end;
run;
proc sql;
create table want as
select name from small
except
select name from large;
quit;
Hash tables have to fit entirely into memory. Why not swap the tables so the small one COHORT.HICS is the hash table?
How many rows are in your small table and in your large table? A hash merge might not be the best approach for comparing a very small table with a very large one.
Does the LARGE one have an INDEX on HIC? If so then use that.
data want;
set SMALL;
set LARGE key=HIC;
if _error_ then do;
put 'NOTE: ' HIC= 'not found.';
_error_=0;
end;
output;
call missing(of _all_);
run;
Example:
data small (keep=name) large(index=(name)) ;
set sashelp.class end=eof;
if _n_ in (1,5,7) then output small;
output large;
if eof then do;
name = 'ZZZ'; output small;
end;
run;
data want;
set SMALL;
set LARGE key=NAME ;
if _error_ then do;
put 'NOTE: ' NAME= 'not found.';
_error_=0;
end;
output;
call missing(of _all_);
run;
Result:
You may run into trouble with that code, if the small dataset has repeating values on the NAME variable (the first Alfred will be found, the second will not).
To avoid that, use KEY=/UNIQUE:
data want;
set SMALL;
set LARGE key=NAME /unique;
if _error_ then do;
put 'NOTE: ' NAME= 'not found.';
_error_=0;
end;
output;
call missing(of _all_);
run;
Read the small dataset into a hash where you define an additional flag variable (you need to use a DO loop and the ADD() method to fill the hash).
Then read the large dataset and update the flag in the hash if a particular HIC is found. At the end, write out only those members from the hash which have the flag set.
Just an idea, will test it later when I'm in front of my desktop.
Here a code example for my suggestion:
data small;
set sashelp.class (obs=10);
run;
data large;
set sashelp.class (firstobs=6);
run;
data _null_;
set large end=done;
if _n_ = 1
then do;
length flag 8;
declare hash s (ordered:"yes");
s.definekey("name");
s.definedata("name","sex","age","height","weight","flag");
s.definedone();
do until(done_s);
set small end=done_s;
rc = s.add();
end;
end;
if s.find() = 0
then do;
flag = 1;
rc = s.replace();
end;
if done
then do;
rc = s.output(dataset:"want (where=(flag = 1))");
end;
run;
/*
There is no need Hash at all.
It is a piece of cake for SQL
*/
data small (keep=name) large(index=(name)) ;
set sashelp.class end=eof;
if _n_ in (1,5,7) then output small;
output large;
if eof then do;
name = 'ZZZ'; output small;
end;
run;
proc sql;
create table want as
select name from small
except
select name from large;
quit;
That's why it's a good idea not to be fixated on any particular methodology when you have tricky high data volume use case.
Most common hash question titles: How do I use hash to do XYZ?
A better title is: What is the best coding method to do XYZ?
Also @Ksharp 's SQL solution is a lot simpler so it's a win-win...
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.