SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

How do I create an ID variable accounting for duplicates?

Accepted Solution Solved
Reply
New Contributor
Posts: 3
Accepted Solution

How do I create an ID variable accounting for duplicates?

 

 

I am trying to create an new ID for repeated measures data with a replicate variable for data organize as such:

ID                    ID_Before         ID_After

04                     19                     10

07                     32                   

10                     04                     178

19                                              04

32                                              07

178                   10                     187

187                   178                 

 

For the data above, the new variables should end up as:

 

ID                    ID_Before         ID_After           New_ID            Replicate

04                     19                     10                     01                     2

07                     32                                              02                     2

10                     04                     178                   01                     3

19                                              04                     01                     1

32                                              07                     02                     1

178                   10                     187                   01                     4

187                   178                                            01                     5

 

 

 


Accepted Solutions
Highlighted
Solution
Tuesday
Respected Advisor
Posts: 4,137

Re: How do I create an ID variable accounting for duplicates?

[ Edited ]

@deshies

Here a way to go:

data have;
  infile datalines dsd truncover;
  input ID ID_Before ID_After;
datalines;
04,19,10
07,32,
10,04,178
19,,04
32,,07
178,10,187
187,178,
901,,902
902,901,903
903,902
911,901,912
912,911,913
913,912,912
921,,922
922,921,923
923,922,922
924,923
;
run;

data want;

  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(keep=id id_before id_after where=(not missing(id_before))');
      h1.defineKey('id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;

  set have (where=(missing(id_before))) end=last;
  New_id+1;
  Replicate=1;
  output;

  do while(h1.find(key:id_after)=0);
    replicate+1;
    output;
    /* remove record from hash to avoid cycling */
    h1.remove(key:id);
  end;
  
  if last then 
    do;
      /* not linked records due to dq issues */
      h1.output(dataset:'not_linked_records');
    end;
run;

data want;
  set want (in=inds1) not_linked_records;
  length linked_flg 3;
  if inds1 then linked_flg=1;
  else linked_flg=0;
run;

 

 

8 Oct 2017:

- Added line h1.remove(key:id); to avoid infinite looping in case of a cycle as mentioned by @Tom

- If there are DQ issues causing records in hash not being used for linking then append these records to table WANT and flag as records with DQ issues.

 

View solution in original post


All Replies
Super User
Posts: 11,134

Re: How do I create an ID variable accounting for duplicates?

Please describe how you get the replicate value.

 

And is Id_after,  for example in the first row, mean that the id value of 04 was changed or associated to 10? Or that 19 was changed or associated with 10 and now the id is 04?

New Contributor
Posts: 3

Re: How do I create an ID variable accounting for duplicates?

The variable, ID, represents a unique identification number assigned to each observation or visit regardless of whether it is the same subject. Unfortunately the data does not provide a variable that identifies each subject and instead provides ID_Before and ID_After so you can determine which observations are linked to the same subject. ID_Before provides the unique identification number (i.e. the ID variable) for the previous visit, while ID_After provides the ID for the last visit.

 

Based on the data I've provided here, you can see that there are only two different subjects. The first visit for each subject will always have ID_Before as a missing value and some value for ID_After. So we know the first visit for subject 01 is ID = 19 and the first visit for subject 02 is ID = 32. To determine the rest of the replicate values we would look at the ID_After variable. For subject 01, ID = 19 points to ID = 04 as the ID_after. When you find ID = 04, you can see that ID = 19 was the ID before and ID = 10 is the ID after. You can continue to do this until you arrive at an observation in which ID_After is missing. If there is no ID_After, that observation represents the most recent visit for that subject. 

 

I hope this is clear. Any help is appreciated. I am trying to figure out what code will help me create a unique subject ID and a replicate variable for the ordered visits. I imagine the data would need to be sorted a different way than this, but I have just provided the data in its original format (sorted by ID).

Highlighted
Solution
Tuesday
Respected Advisor
Posts: 4,137

Re: How do I create an ID variable accounting for duplicates?

[ Edited ]

@deshies

Here a way to go:

data have;
  infile datalines dsd truncover;
  input ID ID_Before ID_After;
datalines;
04,19,10
07,32,
10,04,178
19,,04
32,,07
178,10,187
187,178,
901,,902
902,901,903
903,902
911,901,912
912,911,913
913,912,912
921,,922
922,921,923
923,922,922
924,923
;
run;

data want;

  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(keep=id id_before id_after where=(not missing(id_before))');
      h1.defineKey('id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;

  set have (where=(missing(id_before))) end=last;
  New_id+1;
  Replicate=1;
  output;

  do while(h1.find(key:id_after)=0);
    replicate+1;
    output;
    /* remove record from hash to avoid cycling */
    h1.remove(key:id);
  end;
  
  if last then 
    do;
      /* not linked records due to dq issues */
      h1.output(dataset:'not_linked_records');
    end;
run;

data want;
  set want (in=inds1) not_linked_records;
  length linked_flg 3;
  if inds1 then linked_flg=1;
  else linked_flg=0;
run;

 

 

8 Oct 2017:

- Added line h1.remove(key:id); to avoid infinite looping in case of a cycle as mentioned by @Tom

- If there are DQ issues causing records in hash not being used for linking then append these records to table WANT and flag as records with DQ issues.

 

Super User
Super User
Posts: 6,849

Re: How do I create an ID variable accounting for duplicates?

@Patrick 

Cool. 

Do you know if that code will properly handle cycles?

What would prevent it from running forever if the graph has a cycle?

New Contributor
Posts: 3

Re: How do I create an ID variable accounting for duplicates?

Thank you @Patrick and @Tom for your comments. Patrick your solution gives me the majority of the data in the format that I want.
Super User
Super User
Posts: 6,849

Re: How do I create an ID variable accounting for duplicates?

[ Edited ]

So you basically have a graph in the form of a linked list. So you want to find the number of distinct subgraphs. If you have a license for SAS/OR then you can use PROC OPTNET.  https://support.sas.com/documentation/cdl/en/ornoaug/65289/HTML/default/viewer.htm#ornoaug_optnet_de...

 

Here is code to convert your data into the FROM,TO data that OPTNET needs and then uses the example OPTNET call.

data have ;
  input ID ID_Before ID_After ;
cards;
 04   19   10
 07   32    .
 10   04  178
 19    .   04
 32    .   07
178   10  187
187  178    .
;

data LinkSetIn;
 set have ;
 if not missing(id_before) then do;
    from =id_before ; to=id ; output;
 end;
 if not missing(id_after) then do;
    from =id ; to=id_after ; output;
 end;
 keep from to;
run;
proc sort nodupkey ;
 by from to ;
run;
proc optnet
 data_links = LinkSetIn
 out_nodes  = NodeSetOut
;
 concomp;
run;

proc sort data=nodesetout;
 by concomp node ;
run;

proc print;
run;

image.png

If you don't have OPTNET then look for macros/programs that can traverse your tree, like this one: https://communities.sas.com/t5/tkb/articleprintpage/tkb-id/library/article-id/1045

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 189 views
  • 4 likes
  • 4 in conversation