- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello Experts,
Hope all are doing fine. Please apologize to bother you with my question.
I am trying to assign the initial ID to the other ID rows based on Prior ID column. Can you please help on this to figure the way to do it.
I have
data have;
input ID $ p_id $ ;
datalines;
006 .
007 006
008 .
009 .
010 007
011 010
;
run;
I want
Here, I am trying to create the 'NEED' column with first ID number value based on P_ID. If you see row#2 has P_ID as '006' , since I have record with ID = '006' so I need to get NEED = '006' for that record. Row # 5 has P_ID = '007' which has inital P_ID is '006', so I want '006' in NEED column for rows # 5. Likewise, For Row # 6, based on all previous P_ID values, we should get NEED = '006' for that column. Can you please help on this.
Thanks all and Have a great day.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If there are no circular references, then this task can be simplified:
data want;
set have;
if _n_=1 then do;
declare hash h (dataset:'have (where=(not missing (p_id)))');
h.definekey('id');
h.definedata('p_id');
h.definedone();
end;
if not missing(p_id) then do until(h.find()^=0);
id=p_id;
end;
need=p_id;
set have;
run;
The h.find() method for the hash object h is a way to step from ID to P_ID, then to the P_ID of the P_ID, etc., until no further P_ID can be found. The only problem is that the loop using the h.find() method, keeps modifying P_ID and ID until the find failure, losing the original ID and P_ID.. But this is enough to get the NEED value.
Then just re-read the original observation (via the second SET statement) to restore the original ID and P_ID values. The point is that the two SET statement are a way to read each observation twice, once before establishing NEED, and once after.
Additional material below:
PS: Here's what the program would look like if you add a circularity check. It establishes an _orig_id=id and adds a test for "id=_orig_id" in the UNTIL clause:
data want (drop=_:);
set have;
if _n_=1 then do;
declare hash h (dataset:'have (where=(not missing (p_id)))');
h.definekey('id');
h.definedata('p_id');
h.definedone();
end;
_orig_id=id; /* For circularity check */
if not missing(p_id) then do until(h.find()^=0 or id=_orig_id);
id=p_id;
end;
need=p_id;
set have;
run;
However, this code is NOT robust against having a given P_ID show up twice, when at least one instance is in a circular reference. Although multiple instances of a given P_ID in which none of them occur is a circle does not present a problem.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So you want to reference the original parent ID? And I suppose you have more than one original parent ID?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes, that is correct
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Mr_sassy_sug Something like below should do the job.
data have;
input ID $ p_id $;
datalines;
006 .
007 006
008 .
009 .
010 007
011 010
012 008
013 012
014 015
015 014
;
proc sql;
create view hierarchy as
select
id as _child,
p_id as _parent
from have
where p_id is not missing
;
quit;
%let max_depth=99;
data want(drop=_:);
if _n_=1 then
do;
if 0 then set hierarchy;
dcl hash h1(dataset:'hierarchy');
h1.defineKey('_child');
h1.defineData('_child','_parent');
h1.defineDone();
end;
call missing(_child,_parent);
set have;
if not missing(p_id) then
do;
_parent=p_id;
do _i=1 to &max_depth until(h1.find(key:_parent) ne 0);
end;
need=_parent;
/* investigate data if _i > &max_depth */
if _i>&max_depth then circular_reference_flg=1;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you so much, this is really helpful and it works to my situation.
Thank you so much again.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If there are no circular references, then this task can be simplified:
data want;
set have;
if _n_=1 then do;
declare hash h (dataset:'have (where=(not missing (p_id)))');
h.definekey('id');
h.definedata('p_id');
h.definedone();
end;
if not missing(p_id) then do until(h.find()^=0);
id=p_id;
end;
need=p_id;
set have;
run;
The h.find() method for the hash object h is a way to step from ID to P_ID, then to the P_ID of the P_ID, etc., until no further P_ID can be found. The only problem is that the loop using the h.find() method, keeps modifying P_ID and ID until the find failure, losing the original ID and P_ID.. But this is enough to get the NEED value.
Then just re-read the original observation (via the second SET statement) to restore the original ID and P_ID values. The point is that the two SET statement are a way to read each observation twice, once before establishing NEED, and once after.
Additional material below:
PS: Here's what the program would look like if you add a circularity check. It establishes an _orig_id=id and adds a test for "id=_orig_id" in the UNTIL clause:
data want (drop=_:);
set have;
if _n_=1 then do;
declare hash h (dataset:'have (where=(not missing (p_id)))');
h.definekey('id');
h.definedata('p_id');
h.definedone();
end;
_orig_id=id; /* For circularity check */
if not missing(p_id) then do until(h.find()^=0 or id=_orig_id);
id=p_id;
end;
need=p_id;
set have;
run;
However, this code is NOT robust against having a given P_ID show up twice, when at least one instance is in a circular reference. Although multiple instances of a given P_ID in which none of them occur is a circle does not present a problem.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you so much and this is really simple and straight to the point. I really appreciate your help.
Thanks again for your valuable on this
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@mkeintz This lookup will in case of a match overwrite p_id in the base table with p_id from the hash. This could lead to a wrong outcome in case of DQ issues in the base table.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Patrick wrote:
@mkeintz This lookup will in case of a match overwrite p_id in the base table with p_id from the hash. This could lead to a wrong outcome in case of DQ issues in the base table.
Yes. I believe that issue is addressed by the second SET statement, that follows the calculation of NEED.
"DQ issues". Is that "Data Quality"? I qualified my code offering as not suitable for data having circular references (although a check for circularity could be easily put in). Is there something else I missed (won't be the first time)?
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------