SAS Programming

Mr_sassy_sug · Posted 07-23-2021 07:44 PM

Hello Experts,

Hope all are doing fine. Please apologize to bother you with my question.

I am trying to assign the initial ID to the other ID rows based on Prior ID column. Can you please help on this to figure the way to do it.

I have

data have;
input ID $ p_id $ ;
datalines;
006 .
007 006
008 .
009 .
010 007
011 010
;
run;

I want

Here, I am trying to create the 'NEED' column with first ID number value based on P_ID. If you see row#2 has P_ID as '006' , since I have record with ID = '006' so I need to get NEED = '006' for that record. Row # 5 has P_ID = '007' which has inital P_ID is '006', so I want '006' in NEED column for rows # 5. Likewise, For Row # 6, based on all previous P_ID values, we should get NEED = '006' for that column. Can you please help on this.

Thanks all and Have a great day.

mkeintz · Posted 07-24-2021 01:46 AM

If there are no circular references, then this task can be simplified:

data want;
  set have;
  if _n_=1 then do;
    declare hash h (dataset:'have (where=(not missing (p_id)))');
     h.definekey('id');
     h.definedata('p_id');
     h.definedone();
  end;
  if not missing(p_id) then do until(h.find()^=0);
    id=p_id;
  end;
  need=p_id;
  set have;
run;

The h.find() method for the hash object h is a way to step from ID to P_ID, then to the P_ID of the P_ID, etc., until no further P_ID can be found. The only problem is that the loop using the h.find() method, keeps modifying P_ID and ID until the find failure, losing the original ID and P_ID.. But this is enough to get the NEED value.

Then just re-read the original observation (via the second SET statement) to restore the original ID and P_ID values. The point is that the two SET statement are a way to read each observation twice, once before establishing NEED, and once after.

Additional material below:

PS: Here's what the program would look like if you add a circularity check. It establishes an _orig_id=id and adds a test for "id=_orig_id" in the UNTIL clause:

data want (drop=_:);
  set have;
  if _n_=1 then do;
    declare hash h (dataset:'have (where=(not missing (p_id)))');
     h.definekey('id');
     h.definedata('p_id');
     h.definedone();
  end;

  _orig_id=id;     /* For circularity check */

  if not missing(p_id) then do until(h.find()^=0 or id=_orig_id);
    id=p_id;
  end;
  need=p_id;
  set have;
run;

However, this code is NOT robust against having a given P_ID show up twice, when at least one instance is in a circular reference. Although multiple instances of a given P_ID in which none of them occur is a circle does not present a problem.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ChrisNZ · Posted 07-23-2021 08:48 PM

So you want to reference the original parent ID? And I suppose you have more than one original parent ID?

High-Performance SAS Coding - Third Edition

Mr_sassy_sug · Posted 07-24-2021 07:06 PM

Yes, that is correct

Patrick · Posted 07-23-2021 09:53 PM

@Mr_sassy_sug Something like below should do the job.

data have;
  input ID $ p_id $;
  datalines;
006 .
007 006
008 .
009 .
010 007
011 010
012 008
013 012
014 015
015 014
;

proc sql;
  create view hierarchy as
    select
      id as _child,
      p_id as _parent
    from have
    where p_id is not missing
    ;
quit;

%let max_depth=99;
data want(drop=_:);
  if _n_=1 then
    do;
      if 0 then set hierarchy;
      dcl hash h1(dataset:'hierarchy');
      h1.defineKey('_child');
      h1.defineData('_child','_parent');
      h1.defineDone();
    end;
  call missing(_child,_parent);

  set have;

  if not missing(p_id) then
    do; 
      _parent=p_id;
      do _i=1 to &max_depth until(h1.find(key:_parent) ne 0);
      end;
      need=_parent;
      /* investigate data if _i > &max_depth */
      if _i>&max_depth then circular_reference_flg=1;
    end;
run;

Mr_sassy_sug · Posted 07-24-2021 07:07 PM

Thank you so much, this is really helpful and it works to my situation.

Thank you so much again.

mkeintz · Posted 07-24-2021 01:46 AM

If there are no circular references, then this task can be simplified:

data want;
  set have;
  if _n_=1 then do;
    declare hash h (dataset:'have (where=(not missing (p_id)))');
     h.definekey('id');
     h.definedata('p_id');
     h.definedone();
  end;
  if not missing(p_id) then do until(h.find()^=0);
    id=p_id;
  end;
  need=p_id;
  set have;
run;

The h.find() method for the hash object h is a way to step from ID to P_ID, then to the P_ID of the P_ID, etc., until no further P_ID can be found. The only problem is that the loop using the h.find() method, keeps modifying P_ID and ID until the find failure, losing the original ID and P_ID.. But this is enough to get the NEED value.

Then just re-read the original observation (via the second SET statement) to restore the original ID and P_ID values. The point is that the two SET statement are a way to read each observation twice, once before establishing NEED, and once after.

Additional material below:

PS: Here's what the program would look like if you add a circularity check. It establishes an _orig_id=id and adds a test for "id=_orig_id" in the UNTIL clause:

data want (drop=_:);
  set have;
  if _n_=1 then do;
    declare hash h (dataset:'have (where=(not missing (p_id)))');
     h.definekey('id');
     h.definedata('p_id');
     h.definedone();
  end;

  _orig_id=id;     /* For circularity check */

  if not missing(p_id) then do until(h.find()^=0 or id=_orig_id);
    id=p_id;
  end;
  need=p_id;
  set have;
run;

However, this code is NOT robust against having a given P_ID show up twice, when at least one instance is in a circular reference. Although multiple instances of a given P_ID in which none of them occur is a circle does not present a problem.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Mr_sassy_sug · Posted 07-24-2021 07:08 PM

Thank you so much and this is really simple and straight to the point. I really appreciate your help.

Thanks again for your valuable on this

Patrick · Posted 07-25-2021 01:32 AM

@mkeintz This lookup will in case of a match overwrite p_id in the base table with p_id from the hash. This could lead to a wrong outcome in case of DQ issues in the base table.

mkeintz · Posted 07-25-2021 02:04 PM

@Patrick wrote:

@mkeintz This lookup will in case of a match overwrite p_id in the base table with p_id from the hash. This could lead to a wrong outcome in case of DQ issues in the base table.

Yes. I believe that issue is addressed by the second SET statement, that follows the calculation of NEED.

"DQ issues". Is that "Data Quality"? I qualified my code offering as not suitable for data having circular references (although a check for circularity could be easily put in). Is there something else I missed (won't be the first time)?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Programming

Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

Re: Assign first ID based on Prior ID values

How to reduce collinearity prior to modeling using PROC VARCLUS and PR...

Value assignment based on multiple if conditions

Assign value based on status

Incorporate prior knowledge (code a Bayesian prior for Proc BGLIMM)

Wide to Long and calculate 12 months prior to base month

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...