Solved: Re: How do I create an ID variable accounting for duplicates?

deshies · Posted 10-04-2017 04:42 PM

I am trying to create an new ID for repeated measures data with a replicate variable for data organize as such:

ID ID_Before ID_After

04 19 10

07 32

10 04 178

19 04

32 07

178 10 187

187 178

For the data above, the new variables should end up as:

ID ID_Before ID_After New_ID Replicate

04 19 10 01 2

07 32 02 2

10 04 178 01 3

19 04 01 1

32 07 02 1

178 10 187 01 4

187 178 01 5

Patrick · Posted 10-07-2017 07:59 AM

@deshies

Here a way to go:

data have;
  infile datalines dsd truncover;
  input ID ID_Before ID_After;
datalines;
04,19,10
07,32,
10,04,178
19,,04
32,,07
178,10,187
187,178,
901,,902
902,901,903
903,902
911,901,912
912,911,913
913,912,912
921,,922
922,921,923
923,922,922
924,923
;
run;

data want;

  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(keep=id id_before id_after where=(not missing(id_before))');
      h1.defineKey('id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;

  set have (where=(missing(id_before))) end=last;
  New_id+1;
  Replicate=1;
  output;

  do while(h1.find(key:id_after)=0);
    replicate+1;
    output;
    /* remove record from hash to avoid cycling */
    h1.remove(key:id);
  end;
  
  if last then 
    do;
      /* not linked records due to dq issues */
      h1.output(dataset:'not_linked_records');
    end;
run;

data want;
  set want (in=inds1) not_linked_records;
  length linked_flg 3;
  if inds1 then linked_flg=1;
  else linked_flg=0;
run;

8 Oct 2017:

- Added line h1.remove(key:id); to avoid infinite looping in case of a cycle as mentioned by @Tom

- If there are DQ issues causing records in hash not being used for linking then append these records to table WANT and flag as records with DQ issues.

View solution in original post

ballardw · Posted 10-06-2017 01:02 PM

Please describe how you get the replicate value.

And is Id_after, for example in the first row, mean that the id value of 04 was changed or associated to 10? Or that 19 was changed or associated with 10 and now the id is 04?

deshies · Posted 10-06-2017 02:02 PM

The variable, ID, represents a unique identification number assigned to each observation or visit regardless of whether it is the same subject. Unfortunately the data does not provide a variable that identifies each subject and instead provides ID_Before and ID_After so you can determine which observations are linked to the same subject. ID_Before provides the unique identification number (i.e. the ID variable) for the previous visit, while ID_After provides the ID for the last visit.

Based on the data I've provided here, you can see that there are only two different subjects. The first visit for each subject will always have ID_Before as a missing value and some value for ID_After. So we know the first visit for subject 01 is ID = 19 and the first visit for subject 02 is ID = 32. To determine the rest of the replicate values we would look at the ID_After variable. For subject 01, ID = 19 points to ID = 04 as the ID_after. When you find ID = 04, you can see that ID = 19 was the ID before and ID = 10 is the ID after. You can continue to do this until you arrive at an observation in which ID_After is missing. If there is no ID_After, that observation represents the most recent visit for that subject.

I hope this is clear. Any help is appreciated. I am trying to figure out what code will help me create a unique subject ID and a replicate variable for the ordered visits. I imagine the data would need to be sorted a different way than this, but I have just provided the data in its original format (sorted by ID).

Patrick · Posted 10-07-2017 07:59 AM

@deshies

Here a way to go:

data have;
  infile datalines dsd truncover;
  input ID ID_Before ID_After;
datalines;
04,19,10
07,32,
10,04,178
19,,04
32,,07
178,10,187
187,178,
901,,902
902,901,903
903,902
911,901,912
912,911,913
913,912,912
921,,922
922,921,923
923,922,922
924,923
;
run;

data want;

  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(keep=id id_before id_after where=(not missing(id_before))');
      h1.defineKey('id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;

  set have (where=(missing(id_before))) end=last;
  New_id+1;
  Replicate=1;
  output;

  do while(h1.find(key:id_after)=0);
    replicate+1;
    output;
    /* remove record from hash to avoid cycling */
    h1.remove(key:id);
  end;
  
  if last then 
    do;
      /* not linked records due to dq issues */
      h1.output(dataset:'not_linked_records');
    end;
run;

data want;
  set want (in=inds1) not_linked_records;
  length linked_flg 3;
  if inds1 then linked_flg=1;
  else linked_flg=0;
run;

8 Oct 2017:

- Added line h1.remove(key:id); to avoid infinite looping in case of a cycle as mentioned by @Tom

- If there are DQ issues causing records in hash not being used for linking then append these records to table WANT and flag as records with DQ issues.

Tom · Posted 10-07-2017 03:07 PM

@Patrick

Cool.

Do you know if that code will properly handle cycles?

What would prevent it from running forever if the graph has a cycle?

deshies · Posted 10-17-2017 11:40 AM

Thank you @Patrick and @Tom for your comments. Patrick your solution gives me the majority of the data in the format that I want.

Tom · Posted 10-07-2017 01:52 PM

So you basically have a graph in the form of a linked list. So you want to find the number of distinct subgraphs. If you have a license for SAS/OR then you can use PROC OPTNET. https://support.sas.com/documentation/cdl/en/ornoaug/65289/HTML/default/viewer.htm#ornoaug_optnet_de...

Here is code to convert your data into the FROM,TO data that OPTNET needs and then uses the example OPTNET call.

data have ;
  input ID ID_Before ID_After ;
cards;
 04   19   10
 07   32    .
 10   04  178
 19    .   04
 32    .   07
178   10  187
187  178    .
;

data LinkSetIn;
 set have ;
 if not missing(id_before) then do;
    from =id_before ; to=id ; output;
 end;
 if not missing(id_after) then do;
    from =id ; to=id_after ; output;
 end;
 keep from to;
run;
proc sort nodupkey ;
 by from to ;
run;
proc optnet
 data_links = LinkSetIn
 out_nodes  = NodeSetOut
;
 concomp;
run;

proc sort data=nodesetout;
 by concomp node ;
run;

proc print;
run;

If you don't have OPTNET then look for macros/programs that can traverse your tree, like this one: https://communities.sas.com/t5/tkb/articleprintpage/tkb-id/library/article-id/1045

Registration is open