Solved: Re: Dynamic Multi-Table Merge Using Hash of Hashes

theponcer · Posted 07-13-2020 01:09 PM

I have been reading a lot about using the Hash of Hashes technique to reduce hard-coding and increase the flexibility of your program. I wanted to try it out by defining a program that takes an unknown number of tables, as well as a distinct list of members from those tables, as an input and spits out a left join on member_id for all tables. However, I can't quite get it to work. My questions are, how do I make the following code work, and what concept/concepts am I misunderstanding below in the "try" data set? I think the problem is that I am not iterating through the hash object correctly - it looks to me like I am defining the hash of hashes as I intend to (using the HoH.output statement), but the merge only seems to work on one instance of hash table h.

data have1;
   infile datalines delimiter=','; 
   input member_id $ results_measure_1 $;
   datalines;                      
164-111,.8
297-888,.8
495-111,1
562-888,.7
697-111,.7
833-888,.5
;
run;

data have2;
   infile datalines delimiter=','; 
   input member_id $ results_measure_2 $;
   datalines;                      
164-111,1
297-888,1
562-888,.2
697-111,.7
;
run;

data have3;
   infile datalines delimiter=','; 
   input member_id $ results_measure_3 $;
   datalines;                      
344-111,1
562-888,.5
697-111,.9
;
run;

data have4;
   infile datalines delimiter=','; 
   input table $ ;
   datalines;                      
have1
have2
have3
;
run;

data have5;
   infile datalines delimiter=','; 
   input member_id $ ;
   datalines;                      
164-111
297-888
495-111
562-888
697-111
833-888
344-111
;
run;
		

data want;
	infile datalines delimiter=",";
	input member_id $ results_measure_1 $ results_measure_2 $ results_measure_3 $;
	datalines;
344-111,0,0,1
562-888,.7,.2,.5
697-111,.7,.7,.9
164-111,.8,1,0
297-888,.8,1,0
495-111,1,0,0
833-888,.5,0,0
run;

data try;

if 0 then set have:;

if _n_ = 1 then do;

dcl hash HoH();
HoH.definekey("table");
HoH.definedata("h","table");
HoH.definedone();
dcl hiter HoHi("HoH");

do until (eof);
set have4 end=eof;
dcl hash h(dataset:table);
h.definekey("member_id");
h.definedata(all:"yes");
h.definedone();
HoH.add();
end;
end;

set have5;


do while (HoHi.next() = 0);
/*	rc_iter = HoHi.next();*/
	rc_h = h.find();
end;

HoH.output(dataset:"test");
h.output(dataset:"test_h");

run;

mkeintz · Posted 07-13-2020 09:07 PM

First and foremost - thank you for providing a usable program, with data, that we can use for our suggested modifications.

You didn't put it in words, but apparently you want one output record for each record in have5, collecting data from the datasets named in have4, using variables from the named datasets (HAVE1, HAVE2, HAVE3).

So a single change to your program would work:

But first, I think you are assuming that the result_measure variables are not retained in the datastep as it iterates through have5 (i.e. they are reset to missing values automatically). But this is wrong, for two reasons:

Variables implied in any SET statement are retained. One rarely notices this because ordinarily every iteration of the data step has a SET statement which would overwrite the preceding values of variables like result_measure1, results_measure2, and results_measure3. But you have no such SET statement. Instead you have
IF 0 then SET HAVE:
which (because it is a SET) tells SAS to retain the variables in the datasets, but (because it is "IF 0 THEN ...") it is never acted on, so the values are not replaced.
Variables retrieved in a hash FIND method are not modified until the next SUCCESSFUL find (or a SET or other explicit modification of the variables). So if a given member_id has successful find's only for HAVE1 and HAVE3, it will be updated for measure1 and measure3, but keep the old (previous successful) value for measure2.

This means that, before you iterate through the hash objects for each table value, you should set the result_measures to a default value (I use missing values below, because it's easy to code "call missing ...").

So the only required change is the call missing statement in the last data step:

data have1;
   infile datalines delimiter=','; 
   input member_id $ results_measure_1 $;
   datalines;                      
164-111,.8
297-888,.8
495-111,1
562-888,.7
697-111,.7
833-888,.5
;
run;

data have2;
   infile datalines delimiter=','; 
   input member_id $ results_measure_2 $;
   datalines;                      
164-111,1
297-888,1
562-888,.2
697-111,.7
;
run;

data have3;
   infile datalines delimiter=','; 
   input member_id $ results_measure_3 $;
   datalines;                      
344-111,1
562-888,.5
697-111,.9
;
run;

data have4;
   infile datalines delimiter=','; 
   input table $ ;
   datalines;                      
have1
have2
have3
;
run;

data have5;
   infile datalines delimiter=','; 
   input member_id $ ;
   datalines;                      
164-111
297-888
495-111
562-888
697-111
833-888
344-111
;
run;
		

data want0;
	infile datalines delimiter=",";
	input member_id $ results_measure_1 $ results_measure_2 $ results_measure_3 $;
	datalines;
344-111,0,0,1
562-888,.7,.2,.5
697-111,.7,.7,.9
164-111,.8,1,0
297-888,.8,1,0
495-111,1,0,0
833-888,.5,0,0
run;

data try;
  if 0 then set have:;

  if _n_ = 1 then do;

    dcl hash HoH();
      HoH.definekey("table");
      HoH.definedata("h","table");
      HoH.definedone();
    dcl hiter HoHi("HoH");

    do until (eof);
      set have4 end=eof;
      dcl hash h(dataset:table);
        h.definekey("member_id");
        h.definedata(all:"yes");
        h.definedone();
      HoH.add();
   end;
  end;

  set have5;
  call missing(of results_:); /*Add this: Initialize these variables */
  do while (HoHi.next() = 0);
	rc_h = h.find();
  end;

  /** HoH.output(dataset:"test");  Drop this **/
  /** h.output(dataset:"test_h");  Drop this **/
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 07-13-2020 09:07 PM

First and foremost - thank you for providing a usable program, with data, that we can use for our suggested modifications.

You didn't put it in words, but apparently you want one output record for each record in have5, collecting data from the datasets named in have4, using variables from the named datasets (HAVE1, HAVE2, HAVE3).

So a single change to your program would work:

But first, I think you are assuming that the result_measure variables are not retained in the datastep as it iterates through have5 (i.e. they are reset to missing values automatically). But this is wrong, for two reasons:

Variables implied in any SET statement are retained. One rarely notices this because ordinarily every iteration of the data step has a SET statement which would overwrite the preceding values of variables like result_measure1, results_measure2, and results_measure3. But you have no such SET statement. Instead you have
IF 0 then SET HAVE:
which (because it is a SET) tells SAS to retain the variables in the datasets, but (because it is "IF 0 THEN ...") it is never acted on, so the values are not replaced.
Variables retrieved in a hash FIND method are not modified until the next SUCCESSFUL find (or a SET or other explicit modification of the variables). So if a given member_id has successful find's only for HAVE1 and HAVE3, it will be updated for measure1 and measure3, but keep the old (previous successful) value for measure2.

This means that, before you iterate through the hash objects for each table value, you should set the result_measures to a default value (I use missing values below, because it's easy to code "call missing ...").

So the only required change is the call missing statement in the last data step:

data have1;
   infile datalines delimiter=','; 
   input member_id $ results_measure_1 $;
   datalines;                      
164-111,.8
297-888,.8
495-111,1
562-888,.7
697-111,.7
833-888,.5
;
run;

data have2;
   infile datalines delimiter=','; 
   input member_id $ results_measure_2 $;
   datalines;                      
164-111,1
297-888,1
562-888,.2
697-111,.7
;
run;

data have3;
   infile datalines delimiter=','; 
   input member_id $ results_measure_3 $;
   datalines;                      
344-111,1
562-888,.5
697-111,.9
;
run;

data have4;
   infile datalines delimiter=','; 
   input table $ ;
   datalines;                      
have1
have2
have3
;
run;

data have5;
   infile datalines delimiter=','; 
   input member_id $ ;
   datalines;                      
164-111
297-888
495-111
562-888
697-111
833-888
344-111
;
run;
		

data want0;
	infile datalines delimiter=",";
	input member_id $ results_measure_1 $ results_measure_2 $ results_measure_3 $;
	datalines;
344-111,0,0,1
562-888,.7,.2,.5
697-111,.7,.7,.9
164-111,.8,1,0
297-888,.8,1,0
495-111,1,0,0
833-888,.5,0,0
run;

data try;
  if 0 then set have:;

  if _n_ = 1 then do;

    dcl hash HoH();
      HoH.definekey("table");
      HoH.definedata("h","table");
      HoH.definedone();
    dcl hiter HoHi("HoH");

    do until (eof);
      set have4 end=eof;
      dcl hash h(dataset:table);
        h.definekey("member_id");
        h.definedata(all:"yes");
        h.definedone();
      HoH.add();
   end;
  end;

  set have5;
  call missing(of results_:); /*Add this: Initialize these variables */
  do while (HoHi.next() = 0);
	rc_h = h.find();
  end;

  /** HoH.output(dataset:"test");  Drop this **/
  /** h.output(dataset:"test_h");  Drop this **/
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Dynamic Multi-Table Merge Using Hash of Hashes

Re: Dynamic Multi-Table Merge Using Hash of Hashes

Re: Dynamic Multi-Table Merge Using Hash of Hashes

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!