BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sas_student1
Quartz | Level 8

 


Hello,

I have a hospital dataset and I want to be able to flag rows that are duplicates but meeting different conditions.

I need to have four flags each meeting a different condition of duplicates.

 

The dataset that I have has the clients ID, hospital ID, date of service, a service code and a procedure code.

The four flags I need are By clients

1. If the hospital ID, date of service, service code and procedure codes have duplicate rows then flag (flag1)

2. If the hospital ID is different but date of service, service code and procedure codes have duplicate rows then have another flag (flag2)

3. If the hospital ID and procedure codes are different but date of service and service code have duplicate rows then have another flag (flag3)

4. If the hospital ID and service codes are different but date of service and procedure code have duplicate rows then have another flag (flag4)

 

I  have the below code as a sample. NOTE the 1, 2,3 and 4 in the data set cards are examples of flags that meet the above 4 criteria. Ideally I would need four columns for each criteria and flagging 1 or 0 that meet each criteria. one would have to remove the numbers at the end to run the data step below.

 

data have;

input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;

format date yymmdd10.;

cards;

 

C001 hsp001 2015/06/03 1568 ABDC

C001 hsp001 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1568 SEWE

C002 hsp002 2015/06/08 1568 RWRW

C002 hsp002 2015/06/08 1567 EWRE

C002 hsp001 2015/06/08 1567 RRRR 3

C002 hsp003 2015/06/08 1568 RRRR 3

C003 hsp001 2015/06/08 1567 RRRR

C003 hsp001 2015/06/08 1567 RRRR 2

C003 hsp003 2015/06/08 1567 EWRE 2

C004 hsp001 2015/06/29 1567 RRRR 4

C004 hsp001 2015/06/29 1567 RRRR 4

C005 hsp001 2015/06/29 1567 RRRR

C005 hsp001 2015/06/30 1567 RRRR

C005 hsp002 2015/06/30 1567 RRRR

;

 


run;

 

Update below with what I would like to see the output as:

This is how I would like to see the output.

Each flag column should correspond to each point above.

 

Client_idHosp_itsDateservice_codeProcdure_codeFlag1flag2flag3flag4
C001hsp0016/3/20151568ABDC0000
C001hsp0016/8/20151400ASDF1000
C001hsp0026/8/20151400ASDF1000
C001hsp0026/8/20151568SEWE0000
C002hsp0026/8/20151568RWRW0000
C002hsp0026/8/20151567EWRE0010
C002hsp0016/8/20151567RRRR0011
C002hsp0036/8/20151568RRRR0001
C003hsp0016/8/20151567RRRR1000
C003hsp0016/8/20151567RRRR1100
C003hsp0036/8/20151567EWRE0100
C004hsp0016/29/20151567RRRR1000
C004hsp0016/29/20151567RRRR1000
C005hsp0016/29/20151567RRRR0000
C005hsp0016/30/20151567RRRR0100
C005hsp0026/30/20151567RRRR0100

Any assistance would be greatly appreciated!!!

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Except for the first flag, this is a more difficult problem than it appears.  For the remaining flags, each requires separate processing.  Here is an approach, to get FLAG2.  You will need similar processing for the remaining flags.

 

The idea is to have the top loop count how many hospitals appear for the same combination of DATE / Service Code / Procedure Code, and compute FLAG2 accordingly.  That requires going past the initial observations for the combination, so the bottom loop reads the same observations and outputs (with the calculated values for FLAG2).

 

proc sort data=have;
by date service_cd procedure_cd Hospitalid;
run;

data want_with_flag2;
count = 0;
do until (last.procedure_cd);
   set have;
   by date service_cd procedure_cd Hospitalid;
   if first.Hospitalid then count + 1;
end;
if count > 1 then flag2=1;
else flag2=0;
do until (last.procedure_cd);
   set have
   by date service_cd procedure_cd Hospitalid;
   output;
end;
drop count;
run;

View solution in original post

7 REPLIES 7
Reeza
Super User

1. Show what your expected output is.

2. Have you looked into BY groups and using FIRST/LAST? You can use that type of method to identify duplicates relatively easily. 

 

Here's the first one for you. The rest are variations on this approach.

 

proc sort data=have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;

data dup1;
set have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;
if not (first.procedure_cd and last.procedure_cd) then duplicate=1;
else duplicate=0;

run;

@sas_student1 wrote:

 


Hello,

I have a hospital dataset and I want to be able to flag rows that are duplicates but meeting different conditions.

I need to have four flags each meeting a different condition of duplicates.

 

The dataset that I have has the clients ID, hospital ID, date of service, a service code and a procedure code.

The four flags I need are By clients

1. If the hospital ID, date of service, service code and procedure codes have duplicate rows then flag (flag1)

2. If the hospital ID is different but date of service, service code and procedure codes have duplicate rows then have another flag (flag2)

3. If the hospital ID and procedure codes are different but date of service and service code have duplicate rows then have another flag (flag3)

4. If the hospital ID and service codes are different but date of service and procedure code have duplicate rows then have another flag (flag4)

 

I  have the below code as a sample. NOTE the 1, 2,3 and 4 in the data set cards are examples of flags that meet the above 4 criteria. Ideally I would need four columns for each criteria and flagging 1 or 0 that meet each criteria. one would have to remove the numbers at the end to run the data step below.

 

data have;

input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;

format date yymmdd10.;

cards;

 

C001 hsp001 2015/06/03 1568 ABDC

C001 hsp001 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1568 SEWE

C002 hsp002 2015/06/08 1568 RWRW

C002 hsp002 2015/06/08 1567 EWRE

C002 hsp001 2015/06/08 1567 RRRR 3

C002 hsp003 2015/06/08 1568 RRRR 3

C003 hsp001 2015/06/08 1567 RRRR

C003 hsp001 2015/06/08 1567 RRRR 2

C003 hsp003 2015/06/08 1567 EWRE 2

C004 hsp001 2015/06/29 1567 RRRR 4

C004 hsp001 2015/06/29 1567 RRRR 4

C005 hsp001 2015/06/29 1567 RRRR

C005 hsp001 2015/06/30 1567 RRRR

C005 hsp002 2015/06/30 1567 RRRR

;

 


run;

 

Any assistance would be greatly appreciated!!!


 

sas_student1
Quartz | Level 8

Thank you @Reeza. This almost gets to what I was looking for. As per your suggestion I edited my original post to show how I was needing the output to be. In the example you gave me it seems like where the procedure code is not the same it only flags one row and two or more rows that would be associated with the rule is not flagged.

If you can further assist given the example I posted, I would appreciate it!

Shmuel
Garnet | Level 18

@Reeza gave you the hint: "Here's the first one for you. The rest are variations on this approach."

 

For each flag you need to run the code with different group of variables in the BY statements, then

use the last variable of the list in the FIRST. and LAST. where statement.

The output of each run is the input of next run while the first input is the data you have.

 

proc sort data=<dataset in>;
by variable_1 variable_2 ... variable-last ; /* list of same duplicate values */
run;

data <dataset_out>;
set <dataset in>;
by variable_1 variable_2 ... variable_last ;
if not (first.variable_last and last.variable_last) 
   then flag_1=1;      /* adapt to flag_2 or flag_3 ... */
   else flag_1=0;      /* adapt to flag_2 or flag_3 ... */

run;

 

sas_student1
Quartz | Level 8

@ @Reeza

apologies for not getting it. I used the first hint that @Reeza gave but the "not (first.id and last.id)" doesn't seem to work properly. I think I understand what the "not (first.id and last.id)" is supposed to do, that is in this case if the id are different when all the other elements in the by statement are same then to flag =1. But with the by statement of variables that we are conditioning to state are the same, I get rows flagged that are duplicates and not rows where all the elements are same expect for the one that is different.

 

Shmuel
Garnet | Level 18

I hope I got your point. Please check next code.

If you still have issues (meanwhile two flags only) please post the output

and show what are the issues - how to correct it:

data have;
input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;
format date yymmdd10.;
cards;
C001 hsp001 2015/06/03 1568 ABDC
C001 hsp001 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1568 SEWE
C002 hsp002 2015/06/08 1568 RWRW
C002 hsp002 2015/06/08 1567 EWRE
C002 hsp001 2015/06/08 1567 RRRR 3
C002 hsp003 2015/06/08 1568 RRRR 3
C003 hsp001 2015/06/08 1567 RRRR
C003 hsp001 2015/06/08 1567 RRRR 2
C003 hsp003 2015/06/08 1567 EWRE 2
C004 hsp001 2015/06/29 1567 RRRR 4
C004 hsp001 2015/06/29 1567 RRRR 4
C005 hsp001 2015/06/29 1567 RRRR
C005 hsp001 2015/06/30 1567 RRRR
C005 hsp002 2015/06/30 1567 RRRR
;
run;
*================================================*;
proc sort data=have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;
run;
data dup1;
   format client_ID  HospitalID  Date  Service_cd procedure_cd;
set have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;
if not (first.procedure_cd and last.procedure_cd) 
   then flag1=1;
   else flag1=0;
run;
*================================================*;
proc sort data=dup1;
by client_ID  Date  Service_cd procedure_cd HospitalID;
run;
proc sql;
     create table dup1a as select *,
     count (distinct HospitalID) as HospitalCount
     from dup1 group by client_ID, Date, Service_cd, procedure_cd
   ; quit;
data dup2;
  format client_ID  Date  Service_cd procedure_cd HospitalID;
set dup1a;
by client_ID  Date  Service_cd procedure_cd HospitalID;
if not (first.procedure_cd and last.procedure_cd) and HospitalCount > 1
   then flag2=1;
   else flag2=0;
drop HospitalCount;
run;

If it is OK continue with similar method to create flag_3 and flag_4, as needed.

Reeza
Super User

@sas_student1 wrote:

@ @Reeza

apologies for not getting it. I used the first hint that @Reeza gave but the "not (first.id and last.id)" doesn't seem to work properly. I think I understand what the "not (first.id and last.id)" is supposed to do, that is in this case if the id are different when all the other elements in the by statement are same then to flag =1. But with the by statement of variables that we are conditioning to state are the same, I get rows flagged that are duplicates and not rows where all the elements are same expect for the one that is different.

 


Show the code you ran and the log showing no errors or warnings. 

Astounding
PROC Star

Except for the first flag, this is a more difficult problem than it appears.  For the remaining flags, each requires separate processing.  Here is an approach, to get FLAG2.  You will need similar processing for the remaining flags.

 

The idea is to have the top loop count how many hospitals appear for the same combination of DATE / Service Code / Procedure Code, and compute FLAG2 accordingly.  That requires going past the initial observations for the combination, so the bottom loop reads the same observations and outputs (with the calculated values for FLAG2).

 

proc sort data=have;
by date service_cd procedure_cd Hospitalid;
run;

data want_with_flag2;
count = 0;
do until (last.procedure_cd);
   set have;
   by date service_cd procedure_cd Hospitalid;
   if first.Hospitalid then count + 1;
end;
if count > 1 then flag2=1;
else flag2=0;
do until (last.procedure_cd);
   set have
   by date service_cd procedure_cd Hospitalid;
   output;
end;
drop count;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 5878 views
  • 2 likes
  • 4 in conversation