BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Batman
Quartz | Level 8

I'm trying to find which records in a small dataset ('COHORT.HICS') are not in the larger one (ABCD.DATA).  Trying use merge via hash rather than sorting and then a data step, but it seems to run out resources.   Any suggestions how to make it work.

 

2 OPTIONS NOCENTER OBS= MAX ;
3
4 DATA HICDIFF0(KEEP=HIC);
5 IF 0 THEN SET ABCD.DATA;
6 IF _N_ = 1 THEN DO;
7 DECLARE HASH MB(DATASET:'ABCD.DATA');
8 MB.DEFINEKEY('HIC');
9 MB.DEFINEDONE();
10 END;
11 SET COHORT.HICS;
12 IF MB.FIND() NE 0 THEN OUTPUT;
13 RUN;

NOTE: THERE WERE 12354503 OBSERVATIONS READ FROM THE DATA SET ABCD.DATA.
ERROR: HASH DATA SET LOAD FAILED AT LINE 9 COLUMN 5.
ERROR: HASH OBJECT ADDED 12354498 ITEMS WHEN MEMORY FAILURE OCCURRED.
FATAL: INSUFFICIENT MEMORY TO EXECUTE DATA STEP PROGRAM. ABORTED DURING THE EXEC
ERROR: THE SAS SYSTEM STOPPED PROCESSING THIS STEP BECAUSE OF INSUFFICIENT MEMOR
WARNING: THE DATA SET WORK.HICDIFF0 MAY BE INCOMPLETE. WHEN THIS STEP WAS STOPP
NOTE: THE DATA STATEMENT USED 12.64 CPU SECONDS AND 18741K.

NOTE: THE ADDRESS SPACE HAS USED A MAXIMUM OF 920K BELOW THE LINE AND 1555896K A

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
/*
There is no need Hash at all.
It is a piece of cake for SQL
*/
data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

proc sql;
create table want as
select name from small
except
select name from large;
quit;

View solution in original post

10 REPLIES 10
SASKiwi
PROC Star

Hash tables have to fit entirely into memory. Why not swap the tables so the small one COHORT.HICS is the hash table?

Batman
Quartz | Level 8
would I need to change to code in row 12?
SASKiwi
PROC Star

How many rows are in your small table and in your large table? A hash merge might not be the best approach for comparing a very small table with a very large one.

Tom
Super User Tom
Super User

Does the LARGE one have an INDEX on HIC?  If so then use that.

data want;
  set SMALL;
  set LARGE key=HIC;
  if _error_ then do;
     put 'NOTE: ' HIC= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Example:

data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

data want;
  set SMALL;
  set LARGE key=NAME ;
  if _error_ then do;
     put 'NOTE: ' NAME= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;

Result:

Tom_0-1677817821174.png

 

 

s_lassen
Meteorite | Level 14

You may run into trouble with that code, if the small dataset has repeating values on the NAME variable (the first Alfred will be found, the second will not).

 

To avoid that, use KEY=/UNIQUE:

data want;
  set SMALL;
  set LARGE key=NAME  /unique;
  if _error_ then do;
     put 'NOTE: ' NAME= 'not found.';
     _error_=0;
  end;
  output;
  call missing(of _all_);
run;
Kurt_Bremser
Super User

Read the small dataset into a hash where you define an additional flag variable (you need to use a DO loop and the ADD() method to fill the hash).

Then read the large dataset and update the flag in the hash if a particular HIC is found. At the end, write out only those members from the hash which have the flag set.

 

Just an idea, will test it later when I'm in front of my desktop.

Kurt_Bremser
Super User

Here a code example for my suggestion:

data small;
set sashelp.class (obs=10);
run;

data large;
set sashelp.class (firstobs=6);
run;

data _null_;
set large end=done;
if _n_ = 1
then do;
  length flag 8;
  declare hash s (ordered:"yes");
  s.definekey("name");
  s.definedata("name","sex","age","height","weight","flag");
  s.definedone();
  do until(done_s);
    set small end=done_s;
    rc = s.add();
  end;
end;
if s.find() = 0
then do;
  flag = 1;
  rc = s.replace();
end;
if done
then do;
  rc = s.output(dataset:"want (where=(flag = 1))");
end;
run;
Ksharp
Super User
/*
There is no need Hash at all.
It is a piece of cake for SQL
*/
data small (keep=name) large(index=(name)) ;
  set sashelp.class end=eof;
  if _n_ in (1,5,7) then output small;
  output large;
  if eof then do;
    name = 'ZZZ'; output small;
  end;
run;

proc sql;
create table want as
select name from small
except
select name from large;
quit;
Batman
Quartz | Level 8
Yes, I normally don't expect sql to run run faster than hash, but it does in this case, apparently because no sort is being done.
SASKiwi
PROC Star

That's why it's a good idea not to be fixated on any particular methodology when you have tricky high data volume use case.

 

Most common hash question titles: How do I use hash to do XYZ?

 

A better title is: What is the best coding method to do XYZ?

 

Also @Ksharp 's SQL solution is a lot simpler so it's a win-win...

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 719 views
  • 5 likes
  • 6 in conversation