I am trying to create an new ID for repeated measures data with a replicate variable for data organize as such:
ID ID_Before ID_After
04 19 10
07 32
10 04 178
19 04
32 07
178 10 187
187 178
For the data above, the new variables should end up as:
ID ID_Before ID_After New_ID Replicate
04 19 10 01 2
07 32 02 2
10 04 178 01 3
19 04 01 1
32 07 02 1
178 10 187 01 4
187 178 01 5
Here a way to go:
data have;
infile datalines dsd truncover;
input ID ID_Before ID_After;
datalines;
04,19,10
07,32,
10,04,178
19,,04
32,,07
178,10,187
187,178,
901,,902
902,901,903
903,902
911,901,912
912,911,913
913,912,912
921,,922
922,921,923
923,922,922
924,923
;
run;
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'have(keep=id id_before id_after where=(not missing(id_before))');
h1.defineKey('id');
h1.defineData(all:'y');
h1.defineDone();
end;
set have (where=(missing(id_before))) end=last;
New_id+1;
Replicate=1;
output;
do while(h1.find(key:id_after)=0);
replicate+1;
output;
/* remove record from hash to avoid cycling */
h1.remove(key:id);
end;
if last then
do;
/* not linked records due to dq issues */
h1.output(dataset:'not_linked_records');
end;
run;
data want;
set want (in=inds1) not_linked_records;
length linked_flg 3;
if inds1 then linked_flg=1;
else linked_flg=0;
run;
8 Oct 2017:
- Added line h1.remove(key:id); to avoid infinite looping in case of a cycle as mentioned by @Tom
- If there are DQ issues causing records in hash not being used for linking then append these records to table WANT and flag as records with DQ issues.
Please describe how you get the replicate value.
And is Id_after, for example in the first row, mean that the id value of 04 was changed or associated to 10? Or that 19 was changed or associated with 10 and now the id is 04?
The variable, ID, represents a unique identification number assigned to each observation or visit regardless of whether it is the same subject. Unfortunately the data does not provide a variable that identifies each subject and instead provides ID_Before and ID_After so you can determine which observations are linked to the same subject. ID_Before provides the unique identification number (i.e. the ID variable) for the previous visit, while ID_After provides the ID for the last visit.
Based on the data I've provided here, you can see that there are only two different subjects. The first visit for each subject will always have ID_Before as a missing value and some value for ID_After. So we know the first visit for subject 01 is ID = 19 and the first visit for subject 02 is ID = 32. To determine the rest of the replicate values we would look at the ID_After variable. For subject 01, ID = 19 points to ID = 04 as the ID_after. When you find ID = 04, you can see that ID = 19 was the ID before and ID = 10 is the ID after. You can continue to do this until you arrive at an observation in which ID_After is missing. If there is no ID_After, that observation represents the most recent visit for that subject.
I hope this is clear. Any help is appreciated. I am trying to figure out what code will help me create a unique subject ID and a replicate variable for the ordered visits. I imagine the data would need to be sorted a different way than this, but I have just provided the data in its original format (sorted by ID).
Here a way to go:
data have;
infile datalines dsd truncover;
input ID ID_Before ID_After;
datalines;
04,19,10
07,32,
10,04,178
19,,04
32,,07
178,10,187
187,178,
901,,902
902,901,903
903,902
911,901,912
912,911,913
913,912,912
921,,922
922,921,923
923,922,922
924,923
;
run;
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'have(keep=id id_before id_after where=(not missing(id_before))');
h1.defineKey('id');
h1.defineData(all:'y');
h1.defineDone();
end;
set have (where=(missing(id_before))) end=last;
New_id+1;
Replicate=1;
output;
do while(h1.find(key:id_after)=0);
replicate+1;
output;
/* remove record from hash to avoid cycling */
h1.remove(key:id);
end;
if last then
do;
/* not linked records due to dq issues */
h1.output(dataset:'not_linked_records');
end;
run;
data want;
set want (in=inds1) not_linked_records;
length linked_flg 3;
if inds1 then linked_flg=1;
else linked_flg=0;
run;
8 Oct 2017:
- Added line h1.remove(key:id); to avoid infinite looping in case of a cycle as mentioned by @Tom
- If there are DQ issues causing records in hash not being used for linking then append these records to table WANT and flag as records with DQ issues.
Cool.
Do you know if that code will properly handle cycles?
What would prevent it from running forever if the graph has a cycle?
So you basically have a graph in the form of a linked list. So you want to find the number of distinct subgraphs. If you have a license for SAS/OR then you can use PROC OPTNET. https://support.sas.com/documentation/cdl/en/ornoaug/65289/HTML/default/viewer.htm#ornoaug_optnet_de...
Here is code to convert your data into the FROM,TO data that OPTNET needs and then uses the example OPTNET call.
data have ;
input ID ID_Before ID_After ;
cards;
04 19 10
07 32 .
10 04 178
19 . 04
32 . 07
178 10 187
187 178 .
;
data LinkSetIn;
set have ;
if not missing(id_before) then do;
from =id_before ; to=id ; output;
end;
if not missing(id_after) then do;
from =id ; to=id_after ; output;
end;
keep from to;
run;
proc sort nodupkey ;
by from to ;
run;
proc optnet
data_links = LinkSetIn
out_nodes = NodeSetOut
;
concomp;
run;
proc sort data=nodesetout;
by concomp node ;
run;
proc print;
run;
If you don't have OPTNET then look for macros/programs that can traverse your tree, like this one: https://communities.sas.com/t5/tkb/articleprintpage/tkb-id/library/article-id/1045
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.