DATA Step, Macro, Functions and more

Hash three data sets

Reply
Occasional Contributor
Posts: 16

Hash three data sets

[ Edited ]

Hi all,

 

I am trying to hash multiple data sets simultaneously. I have three datasets and want to hash them together to get one final data set. Set 'in' is my base set, 's1' is one set to merge and 's2' is the second to merge. Set 1 will be merged using 'key' as the key variable and set 2 will be merged using 'idno' as the key.

 

%macro merger(year);
	DATA Hashed_IN;
    
	set IN&year.;

	%IF _N_ = 1 %THEN %DO;
	  %IF 0 %THEN SET s1;
	  DCL hash h1(dataset:"s1");
	    h1.DefineKey("key");
		h1.DefineData(All:'Y');
		h1.defineDone();

	  %IF 0 %THEN SET s2;
	  DCL hash h2(dataset:"s2");
	    h2.DefineKey("idno");
		h2.DefineData(All:'Y');
		h2.defineDone();
	%END;
	%if h1.find()=0 AND h2.find()=0 %THEN output;
	RUN;
%mend;

Is this even possible? Or must I merge data sets in and S1, then the resulting set with S2?

 

When I do this, it seems to merge just the 'in' and 'S2' sets

 

Thanks for your help!

Ben

Super User
Posts: 7,762

Re: Hash three data sets

You can try to do it in one SQL, or you can do two data steps. Personally I prefer the data step method, as SQL can perform very badly when joining large tables.

Another point to consider is the relationship between the tables with respect to the used key variables. When there's a many-to-many relationship, SQL will be the way to go.

With the hash method, you need to keep in mind that your success depends on RAM size, while data steps and SQL can scale virtually indefinitely (until you run out of disk space).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 10,020

Re: Hash three data sets

[ Edited ]

Your syntax is not right. Post the sample data and output, that would be good.

 

%macro merger(year);
	DATA Hashed_IN;
    
	set IN&year.;

	IF _N_ = 1 THEN DO;
	  IF 0 THEN SET s1;
	  DCL hash h1(dataset:"s1");
	    h1.DefineKey("key");
		h1.DefineData(All:'Y');
		h1.defineDone();

	  IF 0 THEN SET s2;
	  DCL hash h2(dataset:"s2");
	    h2.DefineKey("idno");
		h2.DefineData(All:'Y');
		h2.defineDone();
	END;
	if h1.find()=0 AND h2.find()=0 THEN output;
	RUN;
%mend;
Super Contributor
Posts: 474

Re: Hash three data sets

Hi.

 

Surely it's doable, question is the real necessity for that.

 

Hashing works well in a large/small dataset scenario.

 

Has with Data/Merge you are forced to sort both datasets, sorting a big chunk of data is a costly operation.

 

Hashing avoids that by reading sequentially the big table and loading the small table into memory (hash).

 

So like this you avoid sorting data... Is that the case?

 

If it's not that kind of situation, than there is not much on using Hash.

 

If both tables are small, SQL will handle that fast, no problem, or even data/merge as the sort is done fast.

If both table are large, then as @KurtBremser said, you might get into memory size problems.

 

Hope it helps.

 

Daniel Santos @ www.cgd.pt

Occasional Contributor
Posts: 16

Re: Hash three data sets

Posted in reply to DanielSantos

The data are in the following format (I import from SAS7bdat's):

Set 1:
Key var_1 var_2 ... HospID
1 0 1 3
2 1 0 49
...
7000000 etc.

Set 2:
Key Var_1
1 0
2 1
...
7000000 1


Set 3:
HospID Var_3
1 0
2 1
...
1000 1

 

I am using a Hash statement partially since I wanted to use them to learn-by-doing and partially since the datasets I am merging contain about 7,000,000 observations each, so I thought it was a good way to save time. When I use the Hash statement, I don't get any errors, it just only adds either set1+set2 or set1+set3

Ask a Question
Discussion stats
  • 4 replies
  • 175 views
  • 0 likes
  • 4 in conversation