BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Merdock
Quartz | Level 8

Hello,

I was wondering if someone can help me with this problem I have. Based on the dataset "have", I want to obtain the dataset "want" illustrated below.

Specifically, for the cases where I have multiple records per visit number, I want to retain only one record per visit number, as follows:

1. if both duplicate records have VISTYP=1 then I want to retain the record with the smallest AWTDIFF (for example we see in dataset that usubjid#XXX001 has two records with AVISITN=6. Since both have VISTYP=1, we want to set flag=Yes only for the record with AWTDIFF=9). Or for cases that might have the same AWTDIFF(example usubjid#XXX002), pick either one, it doesn't matter.

2. if VISTYP NE 1 for both records, then again pick either one to have flag=Yes.

3. if one VISTYP=1 but the other VISTYP NE 1, then we want to retain the record with VISTYP=1 regardless of AWTDIFF (example, usubjid#XXX004 where we want flag=Yes for the record with VISTYP=1 eventhough the other record with VISTYP=3 has the smaller AWTDIFF).

 

I have been trying to crack this by sorting many times and using multiple IF then statements but can't seem to get it right. Any help would be greatly appreciated. 

 

Thanks!

 

data have;
infile datalines;
input usubjid$ AVISITN vistyp AWTDIFF visdup;
datalines;
XXX001	0	1	0	1
XXX001	6	1	9	1
XXX001	6	1	79	2
XXX002	0	1	0	1
XXX002	6	1	59	1
XXX002	6	3	59	2
XXX002	12	1	31	1
XXX003	0	1	0	1
XXX003	6	1	9	1
XXX003	6	3	6	2
XXX003	12	1	20	1
XXX003	18	3	36	1
XXX004	0	1	0	1
XXX004	6	3	50	1
XXX004	12	3	72	1
XXX004	24	3	36	1
XXX004	24	1	57	2
XXX005	0	1	0	1
XXX005	6	2	73	1
XXX005	12	2	75	1
XXX005	24	2	20	1
XXX005	24	1	43	2
XXX006	0	1	0	1
XXX006	12	1	77	1
XXX006	12	3	77	2
XXX006	12	1	4	3
XXX006	18	1	0	1
;
run;
proc print data=have; run;

data want;
infile datalines;
input usubjid$ AVISITN vistyp AWTDIFF visdup flag$;
datalines;
XXX001	0	1	0	1	Yes
XXX001	6	1	9	1	Yes
XXX001	6	1	79	2	No
XXX002	0	1	0	1	Yes
XXX002	6	1	59	1	Yes
XXX002	6	3	59	2	No
XXX002	12	1	31	1	Yes
XXX003	0	1	0	1	Yes
XXX003	6	1	9	1	Yes
XXX003	6	3	6	2	No
XXX003	12	1	20	1	Yes
XXX003	18	3	36	1	Yes
XXX004	0	1	0	1	Yes
XXX004	6	3	50	1	Yes
XXX004	12	3	72	1	Yes
XXX004	24	3	36	1	No
XXX004	24	1	57	2	Yes
XXX005	0	1	0	1	Yes
XXX005	6	2	73	1	Yes
XXX005	12	2	75	1	Yes
XXX005	24	2	20	1	No
XXX005	24	1	43	2	Yes
XXX006	0	1	0	1	Yes
XXX006	12	1	77	1	No
XXX006	12	3	77	2	No
XXX006	12	1	4	3	Yes
XXX006	18	1	0	1	Yes
;
run;
proc print data=want; run;
1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @Merdock,

 

This type of problem can be solved in three steps:

  1. Sort the input dataset in such a way that the observation to be flagged within a group of observations is always the first (or always the last) of that group and the resulting dataset (or view) is sorted by those groups.
  2. Use a DATA step with BY-group processing to flag the first (or last, respectively) observation of each BY group.
  3. If needed, sort the dataset to restore the original sort order.

 

Often PROC SQL is particularly useful for step 1 because its ORDER BY clause can sort by expressions, not only variables, and it can create a view instead of a physical dataset. Both of these advantages apply to your task:

/* Create a suitably sorted view of the data */

proc sql;
create view tmp as
select * from have
order by usubjid, avisitn, vistyp ne 1, awtdiff, visdup;
quit;

/* Flag the first observation of each BY group */

data want;
set tmp;
by usubjid avisitn;
if first.avisitn then flag='Yes';
else flag='No';
run;

/* Restore the original sort order */

proc sort data=want;
by usubjid avisitn visdup;
run;

(Personally I would prefer a numeric 0/1 flag, which could simply be defined as flag=first.avisitn.)

 

So within the USUBJID-AVISITN BY-groups we sort first by the Boolean expression vistyp ne 1 so that observations with VISTYP=1 come first (the expression returns 0 for them and 1 for the others). Within the subgroups "VISTYP=1" and "VISTYP ne 1" we sort by AWTDIFF so that the observation with the smallest AWTDIFF comes first. (If missings of AWTDIFF, which are always smaller than non-missing values, should be sorted differently, the code can be adapted.) The final "tie-breaker" is variable VISDUP, so the observation with the smallest VISDUP comes first if all previous criteria result in a tie.

View solution in original post

4 REPLIES 4
shivagujjaru
Calcite | Level 5
proc sort data=have;
by usubjid AVISITN vistyp;
run;
data want;
set have;
by usubjid AVISITN;

if first.usubjid then flag = ""
if first.AVISITN then flag = "";
if vistyp = 1 and first.AVISITN then flag = "Yes";
else if vistyp ne 1 and first.AVISITN then flag = "Yes";
else if vistyp = 1 and flag = "" then flag = "Yes";
if last.AVISITN then output;
drop visdup;
run;

proc print data=want; run;

Hope this should work

This code first sorts the have dataset by usubjid, AVISITN, and vistyp. Then, it processes the data by groups of usubjid and AVISITN. Within each group, it sets the flag variable based on the specified conditions. Finally, it keeps only the first record per visit (AVISITN) and drops the unnecessary

Try and let me know if it doesn't work
Merdock
Quartz | Level 8
thank you for your help! This code didn't work as expected but the code from FreelanceReinh below did the trick so it's all good. Appreciate your input!
FreelanceReinh
Jade | Level 19

Hello @Merdock,

 

This type of problem can be solved in three steps:

  1. Sort the input dataset in such a way that the observation to be flagged within a group of observations is always the first (or always the last) of that group and the resulting dataset (or view) is sorted by those groups.
  2. Use a DATA step with BY-group processing to flag the first (or last, respectively) observation of each BY group.
  3. If needed, sort the dataset to restore the original sort order.

 

Often PROC SQL is particularly useful for step 1 because its ORDER BY clause can sort by expressions, not only variables, and it can create a view instead of a physical dataset. Both of these advantages apply to your task:

/* Create a suitably sorted view of the data */

proc sql;
create view tmp as
select * from have
order by usubjid, avisitn, vistyp ne 1, awtdiff, visdup;
quit;

/* Flag the first observation of each BY group */

data want;
set tmp;
by usubjid avisitn;
if first.avisitn then flag='Yes';
else flag='No';
run;

/* Restore the original sort order */

proc sort data=want;
by usubjid avisitn visdup;
run;

(Personally I would prefer a numeric 0/1 flag, which could simply be defined as flag=first.avisitn.)

 

So within the USUBJID-AVISITN BY-groups we sort first by the Boolean expression vistyp ne 1 so that observations with VISTYP=1 come first (the expression returns 0 for them and 1 for the others). Within the subgroups "VISTYP=1" and "VISTYP ne 1" we sort by AWTDIFF so that the observation with the smallest AWTDIFF comes first. (If missings of AWTDIFF, which are always smaller than non-missing values, should be sorted differently, the code can be adapted.) The final "tie-breaker" is variable VISDUP, so the observation with the smallest VISDUP comes first if all previous criteria result in a tie.

Merdock
Quartz | Level 8
thank you so much, this is super helpful and works perfectly!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 340 views
  • 2 likes
  • 3 in conversation