BookmarkSubscribeRSS Feed
benbuck
Obsidian | Level 7

Hi all,

 

I am trying to hash multiple data sets simultaneously. I have three datasets and want to hash them together to get one final data set. Set 'in' is my base set, 's1' is one set to merge and 's2' is the second to merge. Set 1 will be merged using 'key' as the key variable and set 2 will be merged using 'idno' as the key.

 

%macro merger(year);
	DATA Hashed_IN;
    
	set IN&year.;

	%IF _N_ = 1 %THEN %DO;
	  %IF 0 %THEN SET s1;
	  DCL hash h1(dataset:"s1");
	    h1.DefineKey("key");
		h1.DefineData(All:'Y');
		h1.defineDone();

	  %IF 0 %THEN SET s2;
	  DCL hash h2(dataset:"s2");
	    h2.DefineKey("idno");
		h2.DefineData(All:'Y');
		h2.defineDone();
	%END;
	%if h1.find()=0 AND h2.find()=0 %THEN output;
	RUN;
%mend;

Is this even possible? Or must I merge data sets in and S1, then the resulting set with S2?

 

When I do this, it seems to merge just the 'in' and 'S2' sets

 

Thanks for your help!

Ben

4 REPLIES 4
Kurt_Bremser
Super User

You can try to do it in one SQL, or you can do two data steps. Personally I prefer the data step method, as SQL can perform very badly when joining large tables.

Another point to consider is the relationship between the tables with respect to the used key variables. When there's a many-to-many relationship, SQL will be the way to go.

With the hash method, you need to keep in mind that your success depends on RAM size, while data steps and SQL can scale virtually indefinitely (until you run out of disk space).

Ksharp
Super User

Your syntax is not right. Post the sample data and output, that would be good.

 

%macro merger(year);
	DATA Hashed_IN;
    
	set IN&year.;

	IF _N_ = 1 THEN DO;
	  IF 0 THEN SET s1;
	  DCL hash h1(dataset:"s1");
	    h1.DefineKey("key");
		h1.DefineData(All:'Y');
		h1.defineDone();

	  IF 0 THEN SET s2;
	  DCL hash h2(dataset:"s2");
	    h2.DefineKey("idno");
		h2.DefineData(All:'Y');
		h2.defineDone();
	END;
	if h1.find()=0 AND h2.find()=0 THEN output;
	RUN;
%mend;
DanielSantos
Barite | Level 11

Hi.

 

Surely it's doable, question is the real necessity for that.

 

Hashing works well in a large/small dataset scenario.

 

Has with Data/Merge you are forced to sort both datasets, sorting a big chunk of data is a costly operation.

 

Hashing avoids that by reading sequentially the big table and loading the small table into memory (hash).

 

So like this you avoid sorting data... Is that the case?

 

If it's not that kind of situation, than there is not much on using Hash.

 

If both tables are small, SQL will handle that fast, no problem, or even data/merge as the sort is done fast.

If both table are large, then as @Kurt_Bremser said, you might get into memory size problems.

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt

benbuck
Obsidian | Level 7

The data are in the following format (I import from SAS7bdat's):

Set 1:
Key var_1 var_2 ... HospID
1 0 1 3
2 1 0 49
...
7000000 etc.

Set 2:
Key Var_1
1 0
2 1
...
7000000 1


Set 3:
HospID Var_3
1 0
2 1
...
1000 1

 

I am using a Hash statement partially since I wanted to use them to learn-by-doing and partially since the datasets I am merging contain about 7,000,000 observations each, so I thought it was a good way to save time. When I use the Hash statement, I don't get any errors, it just only adds either set1+set2 or set1+set3

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1343 views
  • 0 likes
  • 4 in conversation