Hello,
I was wondering if someone can help me with this problem I have. Based on the dataset "have", I want to obtain the dataset "want" illustrated below.
Specifically, for the cases where I have multiple records per visit number, I want to retain only one record per visit number, as follows:
1. if both duplicate records have VISTYP=1 then I want to retain the record with the smallest AWTDIFF (for example we see in dataset that usubjid#XXX001 has two records with AVISITN=6. Since both have VISTYP=1, we want to set flag=Yes only for the record with AWTDIFF=9). Or for cases that might have the same AWTDIFF(example usubjid#XXX002), pick either one, it doesn't matter.
2. if VISTYP NE 1 for both records, then again pick either one to have flag=Yes.
3. if one VISTYP=1 but the other VISTYP NE 1, then we want to retain the record with VISTYP=1 regardless of AWTDIFF (example, usubjid#XXX004 where we want flag=Yes for the record with VISTYP=1 eventhough the other record with VISTYP=3 has the smaller AWTDIFF).
I have been trying to crack this by sorting many times and using multiple IF then statements but can't seem to get it right. Any help would be greatly appreciated.
Thanks!
data have;
infile datalines;
input usubjid$ AVISITN vistyp AWTDIFF visdup;
datalines;
XXX001 0 1 0 1
XXX001 6 1 9 1
XXX001 6 1 79 2
XXX002 0 1 0 1
XXX002 6 1 59 1
XXX002 6 3 59 2
XXX002 12 1 31 1
XXX003 0 1 0 1
XXX003 6 1 9 1
XXX003 6 3 6 2
XXX003 12 1 20 1
XXX003 18 3 36 1
XXX004 0 1 0 1
XXX004 6 3 50 1
XXX004 12 3 72 1
XXX004 24 3 36 1
XXX004 24 1 57 2
XXX005 0 1 0 1
XXX005 6 2 73 1
XXX005 12 2 75 1
XXX005 24 2 20 1
XXX005 24 1 43 2
XXX006 0 1 0 1
XXX006 12 1 77 1
XXX006 12 3 77 2
XXX006 12 1 4 3
XXX006 18 1 0 1
;
run;
proc print data=have; run;
data want;
infile datalines;
input usubjid$ AVISITN vistyp AWTDIFF visdup flag$;
datalines;
XXX001 0 1 0 1 Yes
XXX001 6 1 9 1 Yes
XXX001 6 1 79 2 No
XXX002 0 1 0 1 Yes
XXX002 6 1 59 1 Yes
XXX002 6 3 59 2 No
XXX002 12 1 31 1 Yes
XXX003 0 1 0 1 Yes
XXX003 6 1 9 1 Yes
XXX003 6 3 6 2 No
XXX003 12 1 20 1 Yes
XXX003 18 3 36 1 Yes
XXX004 0 1 0 1 Yes
XXX004 6 3 50 1 Yes
XXX004 12 3 72 1 Yes
XXX004 24 3 36 1 No
XXX004 24 1 57 2 Yes
XXX005 0 1 0 1 Yes
XXX005 6 2 73 1 Yes
XXX005 12 2 75 1 Yes
XXX005 24 2 20 1 No
XXX005 24 1 43 2 Yes
XXX006 0 1 0 1 Yes
XXX006 12 1 77 1 No
XXX006 12 3 77 2 No
XXX006 12 1 4 3 Yes
XXX006 18 1 0 1 Yes
;
run;
proc print data=want; run;
Hello @Merdock,
This type of problem can be solved in three steps:
Often PROC SQL is particularly useful for step 1 because its ORDER BY clause can sort by expressions, not only variables, and it can create a view instead of a physical dataset. Both of these advantages apply to your task:
/* Create a suitably sorted view of the data */
proc sql;
create view tmp as
select * from have
order by usubjid, avisitn, vistyp ne 1, awtdiff, visdup;
quit;
/* Flag the first observation of each BY group */
data want;
set tmp;
by usubjid avisitn;
if first.avisitn then flag='Yes';
else flag='No';
run;
/* Restore the original sort order */
proc sort data=want;
by usubjid avisitn visdup;
run;
(Personally I would prefer a numeric 0/1 flag, which could simply be defined as flag=first.avisitn.)
So within the USUBJID-AVISITN BY-groups we sort first by the Boolean expression vistyp ne 1 so that observations with VISTYP=1 come first (the expression returns 0 for them and 1 for the others). Within the subgroups "VISTYP=1" and "VISTYP ne 1" we sort by AWTDIFF so that the observation with the smallest AWTDIFF comes first. (If missings of AWTDIFF, which are always smaller than non-missing values, should be sorted differently, the code can be adapted.) The final "tie-breaker" is variable VISDUP, so the observation with the smallest VISDUP comes first if all previous criteria result in a tie.
Hello @Merdock,
This type of problem can be solved in three steps:
Often PROC SQL is particularly useful for step 1 because its ORDER BY clause can sort by expressions, not only variables, and it can create a view instead of a physical dataset. Both of these advantages apply to your task:
/* Create a suitably sorted view of the data */
proc sql;
create view tmp as
select * from have
order by usubjid, avisitn, vistyp ne 1, awtdiff, visdup;
quit;
/* Flag the first observation of each BY group */
data want;
set tmp;
by usubjid avisitn;
if first.avisitn then flag='Yes';
else flag='No';
run;
/* Restore the original sort order */
proc sort data=want;
by usubjid avisitn visdup;
run;
(Personally I would prefer a numeric 0/1 flag, which could simply be defined as flag=first.avisitn.)
So within the USUBJID-AVISITN BY-groups we sort first by the Boolean expression vistyp ne 1 so that observations with VISTYP=1 come first (the expression returns 0 for them and 1 for the others). Within the subgroups "VISTYP=1" and "VISTYP ne 1" we sort by AWTDIFF so that the observation with the smallest AWTDIFF comes first. (If missings of AWTDIFF, which are always smaller than non-missing values, should be sorted differently, the code can be adapted.) The final "tie-breaker" is variable VISDUP, so the observation with the smallest VISDUP comes first if all previous criteria result in a tie.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.