Solved: Re: How do I do merging with changing variables

jungjein · Posted 07-08-2018 07:07 AM

I have 2 datasets with the following variables:

DatasetA - ID, Name

DatasetB - ID, Name, Parent_ID, Parent_Name

I want to keep merging and tracing the relationship between the ID and Parent_ID up to 5 times, like a family tree.

My output dataset should have columns like this:

ID, Name, Parent_ID, Parent_Name, Parent_ID1, Parent_Name1, Parent_ID2, Parent_Name2, Parent_ID3, Parent_Name3, Parent_ID4, Parent_Name4

Currently I create multiple data steps.

Step 1: Match ID (dataset A) with ID (dataset B).

Step 2: Match Parent_ID(dataset A) with ID (dataset B) and rename Parent_ID from dataset B to Parent_ID1.

Step 2: Match Parent_ID1with ID (dataset B)

etc.

FreelanceReinh · Posted 07-08-2018 11:02 AM

Hello @jungjein and welcome to the SAS Support Communities!

Try this:

/* Create test data */

data have_a;
input id name :$40.;
cards;
123456 Detroit
134567 Bordeaux
;

data have_b;
input id name :&$40. parent_id parent_name :&$40.;
cards;
123456  Detroit             123450  Wayne County
123450  Wayne County        123400  Michigan
123400  Michigan            123000  USA
123000  USA                 120000  North America
120000  North America       100000  Earth
134567  Bordeaux            134560  Gironde
134560  Gironde             134500  Nouvelle-Aquitaine
134500  Nouvelle-Aquitaine  134000  France
134000  France              130000  Europe
130000  Europe              100000  Earth
;

/* Perform look-ups */

data want;
if _n_=1 then do;
  dcl hash h(dataset: 'have_b');
  h.definekey('id');
  h.definedata('parent_id', 'parent_name');
  h.definedone();
  if 0 then set have_b;
end;
array pid    parent_id1-parent_id4;
array pn $40 parent_name1-parent_name4;
set have_a;
call missing(of parent:);
rc=h.find();
p_id=parent_id;
p_name=parent_name;
do i=1 to dim(pn) while(h.find(key: parent_id)=0);
  pid[i]=parent_id;
  pn[i]=parent_name;
end;
parent_id=p_id;
parent_name=p_name;
drop rc i p_:;
run;

View solution in original post

FreelanceReinh · Posted 07-08-2018 11:02 AM

Hello @jungjein and welcome to the SAS Support Communities!

Try this:

/* Create test data */

data have_a;
input id name :$40.;
cards;
123456 Detroit
134567 Bordeaux
;

data have_b;
input id name :&$40. parent_id parent_name :&$40.;
cards;
123456  Detroit             123450  Wayne County
123450  Wayne County        123400  Michigan
123400  Michigan            123000  USA
123000  USA                 120000  North America
120000  North America       100000  Earth
134567  Bordeaux            134560  Gironde
134560  Gironde             134500  Nouvelle-Aquitaine
134500  Nouvelle-Aquitaine  134000  France
134000  France              130000  Europe
130000  Europe              100000  Earth
;

/* Perform look-ups */

data want;
if _n_=1 then do;
  dcl hash h(dataset: 'have_b');
  h.definekey('id');
  h.definedata('parent_id', 'parent_name');
  h.definedone();
  if 0 then set have_b;
end;
array pid    parent_id1-parent_id4;
array pn $40 parent_name1-parent_name4;
set have_a;
call missing(of parent:);
rc=h.find();
p_id=parent_id;
p_name=parent_name;
do i=1 to dim(pn) while(h.find(key: parent_id)=0);
  pid[i]=parent_id;
  pn[i]=parent_name;
end;
parent_id=p_id;
parent_name=p_name;
drop rc i p_:;
run;

jungjein · Posted 07-09-2018 04:48 AM

Thanks, but I had this problem when I tried it on my data.

call missing (of parent:);
ERROR 252-185: The MISSING subroutine call does not have enough arguments.

Not all my id in have_a can be matched to a id in have_b.

The output I want could have some columns left blank if not traced but those that could be traced should be traced further to a parent_id

FreelanceReinh · Posted 07-09-2018 05:13 AM

@jungjein wrote:

call missing (of parent:);
ERROR 252-185: The MISSING subroutine call does not have enough arguments.

This error message means that no variable named "parent..." existed when CALL MISSING was executed. However, using my code this situation cannot occur. As you can see in my program, the ARRAY statements create plenty of "parent..." variables and dataset HAVE_B (see the first SET statement) contributes some as well. So, please post test data in the form of data steps (use my first two data steps as a template) and the program code which produced the error message. Then we will see what the issue with your program is.

@jungjein wrote:

Not all my id in have_a can be matched to a id in have_b.

The output I want could have some columns left blank if not traced but those that could be traced should be traced further to a parent_id

This is no problem. I had tested my code with non-matching IDs as you describe them. Again, please provide test data where my code does not produce the results you expected and describe clearly what the output dataset in this case should look like.

jungjein · Posted 07-09-2018 10:58 PM

Thanks for the explanation, I have figured it out.

But is there a way for me to do a one -to-many tracing?

i.e. if the have_b dataset has more than 1 record for each id...

I tried it on your code but it only show one record per id.

FreelanceReinh · Posted 07-10-2018 05:34 AM

@jungjein wrote:

But is there a way for me to do a one -to-many tracing?

i.e. if the have_b dataset has more than 1 record for each id...

In this case you'd have to declare the hash object like

dcl hash h(dataset: 'have_b', multidata: 'y');

and use the FIND_NEXT method to retrieve the individual parent IDs and names (after finding the first of the group with FIND). The recursive nature of this search will add complexity to the code, though.

I suggest you give it a try, using my code and the above hints as a basis, and come back with more specific questions if you get stuck. It's certainly an interesting programming task.

I've just searched the archive of the SAS Support Communities for "recursive search" and found this article in the SAS Communities Library. It seems that Ksharp addressed a similar problem. So, maybe you can get additional ideas from his code.

jungjein · Posted 07-10-2018 08:31 PM

I tried it on your dataset but I'm not sure how the FIND_NEXT part fits in.

Appreciate your help on this.

/* Create test data */

 

data have_a;

input id name :$40.;

cards;

123456 Detroit

134567 Bordeaux

472831 Lille

;

 

data have_b;

input id name :&$40. parent_id parent_name :&$40.;

cards;

123456  Detroit             123450  Wayne County

123456  Detroit             319484  Bastille

123450  Wayne County        123400  Michigan

123400  Michigan            123000  USA

123000  USA                 120000  North America

120000  North America       100000  Earth

134567  Bordeaux            134560  Gironde

134560  Gironde             134500  Nouvelle-Aquitaine

134500  Nouvelle-Aquitaine  134000  France

134000  France              130000  Europe

130000  Europe              100000  Earth

319484  Bastille                100001  Mars

319484  Bastille                100002  Venus

;

 

/* Perform look-ups */

 

data want;

if _n_=1 then do;

  dcl hash h(dataset: 'have_b', multidata: 'y');

  h.definekey('id');

  h.definedata('parent_id', 'parent_name');

  h.definedone();

  if 0 then set have_b;

end;

array pid    parent_id1-parent_id4;

array pn $40 parent_name1-parent_name4;

set have_a;

call missing(of parent:);

rc=h.find();

p_id=parent_id;

p_name=parent_name;

do i=1 to dim(pn) while(h.find(key: parent_id)=0);

  pid[i]=parent_id;

  pn[i]=parent_name;

end;

parent_id=p_id;

parent_name=p_name;

drop rc i p_:;

run;

FreelanceReinh · Posted 07-11-2018 02:59 PM

Okay, the task was too interesting anyway to refuse it. 🙂

data _null_;
if _n_=1 then do;
  dcl hash h(dataset: 'have_b(rename=(parent_id=p_id parent_name=p_name))', multidata: 'y');
  h.definekey('id');
  h.definedata('p_id', 'p_name');
  h.definedone();

  dcl hash ho(ordered: 'y');
  ho.definekey('id', 'name', 'parent_id',  'parent_name',
                             'parent_id1', 'parent_name1',
                             'parent_id2', 'parent_name2',
                             'parent_id3', 'parent_name3',
                             'parent_id4', 'parent_name4');
  ho.definedone();

  dcl hiter hoi('ho');
end;

array pid[5] parent_id parent_id1-parent_id4;
array pn[5] $40 parent_name parent_name1-parent_name4;
length p_name $40;

set have_a end=last;

if 0 then call missing(p_id, p_name);

/* Find "parents" of IDs from HAVE_A */

rc=h.find();
if rc=0 then do until(rc);
  parent_id=p_id;
  parent_name=p_name;
  rca=ho.add();
  rc=h.find_next();
end;
else rca=ho.add();

/* Find "grandparents", "great-grandparents" etc. (four iterations) */

if last then do;
  do i=1 to 4;
    do while(hoi.next()=0);
      if not missing(pid[i]) then do;
        rc=h.find(key: pid[i]);
        rcflag=(rc=0);
        do while(rc=0);
          pid[i+1]=p_id;
          pn[i+1]=p_name;
          rca=ho.add();
          if rca=0 then hoi.next();
          if rcflag then do;
            call missing(pid[i+1], pn[i+1]);
            ho.remove();
            rcflag=0;
          end;
          rc=h.find_next();
        end;
      end;
    end;
  end;
  ho.output(dataset: 'want');
end;
run;

Please note that dataset WANT is automatically sorted by the variables listed in the argument of ho.definekey and free of duplicate observations (even in case of duplicates in the input datasets).

jungjein · Posted 07-12-2018 04:16 AM

Hey thanks!

I was playing around with it but when I tried changing the id and parent_id to character in both have_a and have_b, it showed an error message.

input id :$6. name :$40.;
input id :&$6. name :&$40. parent_id :&$6. parent_name :&$40.;

I have also amended the array part

array pid[5] $6 parent_id parent_id1-parent_id4;
array pn[5] $40 parent_name parent_name1-parent_name4;
length p_name $40;

Error:

ERROR: Type mismatch for data variable p_id at line 54 and column 3.
ERROR: Hash data set load failed
ERROR: Data step component object failure. Aborted during the Execution phase.

FreelanceReinh · Posted 07-12-2018 04:48 AM

@jungjein wrote:

I have also amended the array part

array pid[5] $6 parent_id parent_id1-parent_id4;
array pn[5] $40 parent_name parent_name1-parent_name4;
length p_name $40;

Good, but you forgot to amend the LENGTH statement accordingly.

length p_id $6 p_name $40;

Registration is open

SAS Training: Just a Click Away