02-20-2017 04:54 AM - edited 02-20-2017 05:06 AM
I am trying to hash multiple data sets simultaneously. I have three datasets and want to hash them together to get one final data set. Set 'in' is my base set, 's1' is one set to merge and 's2' is the second to merge. Set 1 will be merged using 'key' as the key variable and set 2 will be merged using 'idno' as the key.
%macro merger(year); DATA Hashed_IN; set IN&year.; %IF _N_ = 1 %THEN %DO; %IF 0 %THEN SET s1; DCL hash h1(dataset:"s1"); h1.DefineKey("key"); h1.DefineData(All:'Y'); h1.defineDone(); %IF 0 %THEN SET s2; DCL hash h2(dataset:"s2"); h2.DefineKey("idno"); h2.DefineData(All:'Y'); h2.defineDone(); %END; %if h1.find()=0 AND h2.find()=0 %THEN output; RUN; %mend;
Is this even possible? Or must I merge data sets in and S1, then the resulting set with S2?
When I do this, it seems to merge just the 'in' and 'S2' sets
Thanks for your help!
02-20-2017 05:02 AM
You can try to do it in one SQL, or you can do two data steps. Personally I prefer the data step method, as SQL can perform very badly when joining large tables.
Another point to consider is the relationship between the tables with respect to the used key variables. When there's a many-to-many relationship, SQL will be the way to go.
With the hash method, you need to keep in mind that your success depends on RAM size, while data steps and SQL can scale virtually indefinitely (until you run out of disk space).
02-20-2017 05:31 AM - edited 02-20-2017 05:33 AM
Your syntax is not right. Post the sample data and output, that would be good.
%macro merger(year); DATA Hashed_IN; set IN&year.; IF _N_ = 1 THEN DO; IF 0 THEN SET s1; DCL hash h1(dataset:"s1"); h1.DefineKey("key"); h1.DefineData(All:'Y'); h1.defineDone(); IF 0 THEN SET s2; DCL hash h2(dataset:"s2"); h2.DefineKey("idno"); h2.DefineData(All:'Y'); h2.defineDone(); END; if h1.find()=0 AND h2.find()=0 THEN output; RUN; %mend;
02-20-2017 06:50 AM
Surely it's doable, question is the real necessity for that.
Hashing works well in a large/small dataset scenario.
Has with Data/Merge you are forced to sort both datasets, sorting a big chunk of data is a costly operation.
Hashing avoids that by reading sequentially the big table and loading the small table into memory (hash).
So like this you avoid sorting data... Is that the case?
If it's not that kind of situation, than there is not much on using Hash.
If both tables are small, SQL will handle that fast, no problem, or even data/merge as the sort is done fast.
If both table are large, then as @KurtBremser said, you might get into memory size problems.
Hope it helps.
Daniel Santos @ www.cgd.pt
02-23-2017 06:43 AM
The data are in the following format (I import from SAS7bdat's):
Key var_1 var_2 ... HospID
1 0 1 3
2 1 0 49
I am using a Hash statement partially since I wanted to use them to learn-by-doing and partially since the datasets I am merging contain about 7,000,000 observations each, so I thought it was a good way to save time. When I use the Hash statement, I don't get any errors, it just only adds either set1+set2 or set1+set3