I have 2 datasets with the following variables:
DatasetA - ID, Name
DatasetB - ID, Name, Parent_ID, Parent_Name
I want to keep merging and tracing the relationship between the ID and Parent_ID up to 5 times, like a family tree.
My output dataset should have columns like this:
ID, Name, Parent_ID, Parent_Name, Parent_ID1, Parent_Name1, Parent_ID2, Parent_Name2, Parent_ID3, Parent_Name3, Parent_ID4, Parent_Name4
Currently I create multiple data steps.
Step 1: Match ID (dataset A) with ID (dataset B).
Step 2: Match Parent_ID(dataset A) with ID (dataset B) and rename Parent_ID from dataset B to Parent_ID1.
Step 2: Match Parent_ID1with ID (dataset B)
etc.
Hello @jungjein and welcome to the SAS Support Communities!
Try this:
/* Create test data */
data have_a;
input id name :$40.;
cards;
123456 Detroit
134567 Bordeaux
;
data have_b;
input id name :&$40. parent_id parent_name :&$40.;
cards;
123456 Detroit 123450 Wayne County
123450 Wayne County 123400 Michigan
123400 Michigan 123000 USA
123000 USA 120000 North America
120000 North America 100000 Earth
134567 Bordeaux 134560 Gironde
134560 Gironde 134500 Nouvelle-Aquitaine
134500 Nouvelle-Aquitaine 134000 France
134000 France 130000 Europe
130000 Europe 100000 Earth
;
/* Perform look-ups */
data want;
if _n_=1 then do;
dcl hash h(dataset: 'have_b');
h.definekey('id');
h.definedata('parent_id', 'parent_name');
h.definedone();
if 0 then set have_b;
end;
array pid parent_id1-parent_id4;
array pn $40 parent_name1-parent_name4;
set have_a;
call missing(of parent:);
rc=h.find();
p_id=parent_id;
p_name=parent_name;
do i=1 to dim(pn) while(h.find(key: parent_id)=0);
pid[i]=parent_id;
pn[i]=parent_name;
end;
parent_id=p_id;
parent_name=p_name;
drop rc i p_:;
run;
Hello @jungjein and welcome to the SAS Support Communities!
Try this:
/* Create test data */
data have_a;
input id name :$40.;
cards;
123456 Detroit
134567 Bordeaux
;
data have_b;
input id name :&$40. parent_id parent_name :&$40.;
cards;
123456 Detroit 123450 Wayne County
123450 Wayne County 123400 Michigan
123400 Michigan 123000 USA
123000 USA 120000 North America
120000 North America 100000 Earth
134567 Bordeaux 134560 Gironde
134560 Gironde 134500 Nouvelle-Aquitaine
134500 Nouvelle-Aquitaine 134000 France
134000 France 130000 Europe
130000 Europe 100000 Earth
;
/* Perform look-ups */
data want;
if _n_=1 then do;
dcl hash h(dataset: 'have_b');
h.definekey('id');
h.definedata('parent_id', 'parent_name');
h.definedone();
if 0 then set have_b;
end;
array pid parent_id1-parent_id4;
array pn $40 parent_name1-parent_name4;
set have_a;
call missing(of parent:);
rc=h.find();
p_id=parent_id;
p_name=parent_name;
do i=1 to dim(pn) while(h.find(key: parent_id)=0);
pid[i]=parent_id;
pn[i]=parent_name;
end;
parent_id=p_id;
parent_name=p_name;
drop rc i p_:;
run;
Thanks, but I had this problem when I tried it on my data.
call missing (of parent:);
ERROR 252-185: The MISSING subroutine call does not have enough arguments.
Not all my id in have_a can be matched to a id in have_b.
The output I want could have some columns left blank if not traced but those that could be traced should be traced further to a parent_id
@jungjein wrote:
call missing (of parent:); ERROR 252-185: The MISSING subroutine call does not have enough arguments.
This error message means that no variable named "parent..." existed when CALL MISSING was executed. However, using my code this situation cannot occur. As you can see in my program, the ARRAY statements create plenty of "parent..." variables and dataset HAVE_B (see the first SET statement) contributes some as well. So, please post test data in the form of data steps (use my first two data steps as a template) and the program code which produced the error message. Then we will see what the issue with your program is.
@jungjein wrote:
Not all my id in have_a can be matched to a id in have_b.
The output I want could have some columns left blank if not traced but those that could be traced should be traced further to a parent_id
This is no problem. I had tested my code with non-matching IDs as you describe them. Again, please provide test data where my code does not produce the results you expected and describe clearly what the output dataset in this case should look like.
Thanks for the explanation, I have figured it out.
But is there a way for me to do a one -to-many tracing?
i.e. if the have_b dataset has more than 1 record for each id...
I tried it on your code but it only show one record per id.
@jungjein wrote:
But is there a way for me to do a one -to-many tracing?
i.e. if the have_b dataset has more than 1 record for each id...
In this case you'd have to declare the hash object like
dcl hash h(dataset: 'have_b', multidata: 'y');
and use the FIND_NEXT method to retrieve the individual parent IDs and names (after finding the first of the group with FIND). The recursive nature of this search will add complexity to the code, though.
I suggest you give it a try, using my code and the above hints as a basis, and come back with more specific questions if you get stuck. It's certainly an interesting programming task.
I've just searched the archive of the SAS Support Communities for "recursive search" and found this article in the SAS Communities Library. It seems that Ksharp addressed a similar problem. So, maybe you can get additional ideas from his code.
I tried it on your dataset but I'm not sure how the FIND_NEXT part fits in.
Appreciate your help on this.
/* Create test data */
data have_a;
input id name :$40.;
cards;
123456 Detroit
134567 Bordeaux
472831 Lille
;
data have_b;
input id name :&$40. parent_id parent_name :&$40.;
cards;
123456 Detroit 123450 Wayne County
123456 Detroit 319484 Bastille
123450 Wayne County 123400 Michigan
123400 Michigan 123000 USA
123000 USA 120000 North America
120000 North America 100000 Earth
134567 Bordeaux 134560 Gironde
134560 Gironde 134500 Nouvelle-Aquitaine
134500 Nouvelle-Aquitaine 134000 France
134000 France 130000 Europe
130000 Europe 100000 Earth
319484 Bastille 100001 Mars
319484 Bastille 100002 Venus
;
/* Perform look-ups */
data want;
if _n_=1 then do;
dcl hash h(dataset: 'have_b', multidata: 'y');
h.definekey('id');
h.definedata('parent_id', 'parent_name');
h.definedone();
if 0 then set have_b;
end;
array pid parent_id1-parent_id4;
array pn $40 parent_name1-parent_name4;
set have_a;
call missing(of parent:);
rc=h.find();
p_id=parent_id;
p_name=parent_name;
do i=1 to dim(pn) while(h.find(key: parent_id)=0);
pid[i]=parent_id;
pn[i]=parent_name;
end;
parent_id=p_id;
parent_name=p_name;
drop rc i p_:;
run;
Okay, the task was too interesting anyway to refuse it. 🙂
data _null_;
if _n_=1 then do;
dcl hash h(dataset: 'have_b(rename=(parent_id=p_id parent_name=p_name))', multidata: 'y');
h.definekey('id');
h.definedata('p_id', 'p_name');
h.definedone();
dcl hash ho(ordered: 'y');
ho.definekey('id', 'name', 'parent_id', 'parent_name',
'parent_id1', 'parent_name1',
'parent_id2', 'parent_name2',
'parent_id3', 'parent_name3',
'parent_id4', 'parent_name4');
ho.definedone();
dcl hiter hoi('ho');
end;
array pid[5] parent_id parent_id1-parent_id4;
array pn[5] $40 parent_name parent_name1-parent_name4;
length p_name $40;
set have_a end=last;
if 0 then call missing(p_id, p_name);
/* Find "parents" of IDs from HAVE_A */
rc=h.find();
if rc=0 then do until(rc);
parent_id=p_id;
parent_name=p_name;
rca=ho.add();
rc=h.find_next();
end;
else rca=ho.add();
/* Find "grandparents", "great-grandparents" etc. (four iterations) */
if last then do;
do i=1 to 4;
do while(hoi.next()=0);
if not missing(pid[i]) then do;
rc=h.find(key: pid[i]);
rcflag=(rc=0);
do while(rc=0);
pid[i+1]=p_id;
pn[i+1]=p_name;
rca=ho.add();
if rca=0 then hoi.next();
if rcflag then do;
call missing(pid[i+1], pn[i+1]);
ho.remove();
rcflag=0;
end;
rc=h.find_next();
end;
end;
end;
end;
ho.output(dataset: 'want');
end;
run;
Please note that dataset WANT is automatically sorted by the variables listed in the argument of ho.definekey and free of duplicate observations (even in case of duplicates in the input datasets).
Hey thanks!
I was playing around with it but when I tried changing the id and parent_id to character in both have_a and have_b, it showed an error message.
input id :$6. name :$40.;
input id :&$6. name :&$40. parent_id :&$6. parent_name :&$40.;
I have also amended the array part
array pid[5] $6 parent_id parent_id1-parent_id4;
array pn[5] $40 parent_name parent_name1-parent_name4;
length p_name $40;
Error:
ERROR: Type mismatch for data variable p_id at line 54 and column 3.
ERROR: Hash data set load failed
ERROR: Data step component object failure. Aborted during the Execution phase.
@jungjein wrote:
I have also amended the array part
array pid[5] $6 parent_id parent_id1-parent_id4; array pn[5] $40 parent_name parent_name1-parent_name4; length p_name $40;
Good, but you forgot to amend the LENGTH statement accordingly.
length p_id $6 p_name $40;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.