Solved: Re: Picking unique visit among multiple based on specific condition

Merdock · Posted 02-24-2024 12:01 AM

Hello,

I was wondering if someone can help me with this problem I have. Based on the dataset "have", I want to obtain the dataset "want" illustrated below.

Specifically, for the cases where I have multiple records per visit number, I want to retain only one record per visit number, as follows:

1. if both duplicate records have VISTYP=1 then I want to retain the record with the smallest AWTDIFF (for example we see in dataset that usubjid#XXX001 has two records with AVISITN=6. Since both have VISTYP=1, we want to set flag=Yes only for the record with AWTDIFF=9). Or for cases that might have the same AWTDIFF(example usubjid#XXX002), pick either one, it doesn't matter.

2. if VISTYP NE 1 for both records, then again pick either one to have flag=Yes.

3. if one VISTYP=1 but the other VISTYP NE 1, then we want to retain the record with VISTYP=1 regardless of AWTDIFF (example, usubjid#XXX004 where we want flag=Yes for the record with VISTYP=1 eventhough the other record with VISTYP=3 has the smaller AWTDIFF).

I have been trying to crack this by sorting many times and using multiple IF then statements but can't seem to get it right. Any help would be greatly appreciated.

Thanks!

data have;
infile datalines;
input usubjid$ AVISITN vistyp AWTDIFF visdup;
datalines;
XXX001	0	1	0	1
XXX001	6	1	9	1
XXX001	6	1	79	2
XXX002	0	1	0	1
XXX002	6	1	59	1
XXX002	6	3	59	2
XXX002	12	1	31	1
XXX003	0	1	0	1
XXX003	6	1	9	1
XXX003	6	3	6	2
XXX003	12	1	20	1
XXX003	18	3	36	1
XXX004	0	1	0	1
XXX004	6	3	50	1
XXX004	12	3	72	1
XXX004	24	3	36	1
XXX004	24	1	57	2
XXX005	0	1	0	1
XXX005	6	2	73	1
XXX005	12	2	75	1
XXX005	24	2	20	1
XXX005	24	1	43	2
XXX006	0	1	0	1
XXX006	12	1	77	1
XXX006	12	3	77	2
XXX006	12	1	4	3
XXX006	18	1	0	1
;
run;
proc print data=have; run;

data want;
infile datalines;
input usubjid$ AVISITN vistyp AWTDIFF visdup flag$;
datalines;
XXX001	0	1	0	1	Yes
XXX001	6	1	9	1	Yes
XXX001	6	1	79	2	No
XXX002	0	1	0	1	Yes
XXX002	6	1	59	1	Yes
XXX002	6	3	59	2	No
XXX002	12	1	31	1	Yes
XXX003	0	1	0	1	Yes
XXX003	6	1	9	1	Yes
XXX003	6	3	6	2	No
XXX003	12	1	20	1	Yes
XXX003	18	3	36	1	Yes
XXX004	0	1	0	1	Yes
XXX004	6	3	50	1	Yes
XXX004	12	3	72	1	Yes
XXX004	24	3	36	1	No
XXX004	24	1	57	2	Yes
XXX005	0	1	0	1	Yes
XXX005	6	2	73	1	Yes
XXX005	12	2	75	1	Yes
XXX005	24	2	20	1	No
XXX005	24	1	43	2	Yes
XXX006	0	1	0	1	Yes
XXX006	12	1	77	1	No
XXX006	12	3	77	2	No
XXX006	12	1	4	3	Yes
XXX006	18	1	0	1	Yes
;
run;
proc print data=want; run;

FreelanceReinh · Posted 02-24-2024 07:03 AM

Hello @Merdock,

This type of problem can be solved in three steps:

Sort the input dataset in such a way that the observation to be flagged within a group of observations is always the first (or always the last) of that group and the resulting dataset (or view) is sorted by those groups.
Use a DATA step with BY-group processing to flag the first (or last, respectively) observation of each BY group.
If needed, sort the dataset to restore the original sort order.

Often PROC SQL is particularly useful for step 1 because its ORDER BY clause can sort by expressions, not only variables, and it can create a view instead of a physical dataset. Both of these advantages apply to your task:

/* Create a suitably sorted view of the data */

proc sql;
create view tmp as
select * from have
order by usubjid, avisitn, vistyp ne 1, awtdiff, visdup;
quit;

/* Flag the first observation of each BY group */

data want;
set tmp;
by usubjid avisitn;
if first.avisitn then flag='Yes';
else flag='No';
run;

/* Restore the original sort order */

proc sort data=want;
by usubjid avisitn visdup;
run;

(Personally I would prefer a numeric 0/1 flag, which could simply be defined as flag=first.avisitn.)

So within the USUBJID-AVISITN BY-groups we sort first by the Boolean expression vistyp ne 1 so that observations with VISTYP=1 come first (the expression returns 0 for them and 1 for the others). Within the subgroups "VISTYP=1" and "VISTYP ne 1" we sort by AWTDIFF so that the observation with the smallest AWTDIFF comes first. (If missings of AWTDIFF, which are always smaller than non-missing values, should be sorted differently, the code can be adapted.) The final "tie-breaker" is variable VISDUP, so the observation with the smallest VISDUP comes first if all previous criteria result in a tie.

View solution in original post

shivagujjaru · Posted 02-24-2024 02:16 AM

proc sort data=have;
by usubjid AVISITN vistyp;
run;
data want;
set have;
by usubjid AVISITN;

if first.usubjid then flag = ""
if first.AVISITN then flag = "";
if vistyp = 1 and first.AVISITN then flag = "Yes";
else if vistyp ne 1 and first.AVISITN then flag = "Yes";
else if vistyp = 1 and flag = "" then flag = "Yes";
if last.AVISITN then output;
drop visdup;
run;

proc print data=want; run;

Hope this should work

This code first sorts the have dataset by usubjid, AVISITN, and vistyp. Then, it processes the data by groups of usubjid and AVISITN. Within each group, it sets the flag variable based on the specified conditions. Finally, it keeps only the first record per visit (AVISITN) and drops the unnecessary

Try and let me know if it doesn't work

Merdock · Posted 02-25-2024 12:16 PM

thank you for your help! This code didn't work as expected but the code from FreelanceReinh below did the trick so it's all good. Appreciate your input!

FreelanceReinh · Posted 02-24-2024 07:03 AM

Hello @Merdock,

This type of problem can be solved in three steps:

Sort the input dataset in such a way that the observation to be flagged within a group of observations is always the first (or always the last) of that group and the resulting dataset (or view) is sorted by those groups.
Use a DATA step with BY-group processing to flag the first (or last, respectively) observation of each BY group.
If needed, sort the dataset to restore the original sort order.

Often PROC SQL is particularly useful for step 1 because its ORDER BY clause can sort by expressions, not only variables, and it can create a view instead of a physical dataset. Both of these advantages apply to your task:

/* Create a suitably sorted view of the data */

proc sql;
create view tmp as
select * from have
order by usubjid, avisitn, vistyp ne 1, awtdiff, visdup;
quit;

/* Flag the first observation of each BY group */

data want;
set tmp;
by usubjid avisitn;
if first.avisitn then flag='Yes';
else flag='No';
run;

/* Restore the original sort order */

proc sort data=want;
by usubjid avisitn visdup;
run;

(Personally I would prefer a numeric 0/1 flag, which could simply be defined as flag=first.avisitn.)

So within the USUBJID-AVISITN BY-groups we sort first by the Boolean expression vistyp ne 1 so that observations with VISTYP=1 come first (the expression returns 0 for them and 1 for the others). Within the subgroups "VISTYP=1" and "VISTYP ne 1" we sort by AWTDIFF so that the observation with the smallest AWTDIFF comes first. (If missings of AWTDIFF, which are always smaller than non-missing values, should be sorted differently, the code can be adapted.) The final "tie-breaker" is variable VISDUP, so the observation with the smallest VISDUP comes first if all previous criteria result in a tie.

Merdock · Posted 02-25-2024 12:15 PM

thank you so much, this is super helpful and works perfectly!

Picking unique visit among multiple based on specific condition

Re: Picking unique visit among multiple based on specific condition

Re: Picking unique visit among multiple based on specific condition

Re: Picking unique visit among multiple based on specific condition

Re: Picking unique visit among multiple based on specific condition

Re: Picking unique visit among multiple based on specific condition

Registration is open

SAS Training: Just a Click Away