Hello,
I have a hospital dataset and I want to be able to flag rows that are duplicates but meeting different conditions.
I need to have four flags each meeting a different condition of duplicates.
The dataset that I have has the clients ID, hospital ID, date of service, a service code and a procedure code.
The four flags I need are By clients
1. If the hospital ID, date of service, service code and procedure codes have duplicate rows then flag (flag1)
2. If the hospital ID is different but date of service, service code and procedure codes have duplicate rows then have another flag (flag2)
3. If the hospital ID and procedure codes are different but date of service and service code have duplicate rows then have another flag (flag3)
4. If the hospital ID and service codes are different but date of service and procedure code have duplicate rows then have another flag (flag4)
I have the below code as a sample. NOTE the 1, 2,3 and 4 in the data set cards are examples of flags that meet the above 4 criteria. Ideally I would need four columns for each criteria and flagging 1 or 0 that meet each criteria. one would have to remove the numbers at the end to run the data step below.
data have;
input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;
format date yymmdd10.;
cards;
C001 hsp001 2015/06/03 1568 ABDC
C001 hsp001 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1568 SEWE
C002 hsp002 2015/06/08 1568 RWRW
C002 hsp002 2015/06/08 1567 EWRE
C002 hsp001 2015/06/08 1567 RRRR 3
C002 hsp003 2015/06/08 1568 RRRR 3
C003 hsp001 2015/06/08 1567 RRRR
C003 hsp001 2015/06/08 1567 RRRR 2
C003 hsp003 2015/06/08 1567 EWRE 2
C004 hsp001 2015/06/29 1567 RRRR 4
C004 hsp001 2015/06/29 1567 RRRR 4
C005 hsp001 2015/06/29 1567 RRRR
C005 hsp001 2015/06/30 1567 RRRR
C005 hsp002 2015/06/30 1567 RRRR
;
run;
Update below with what I would like to see the output as:
This is how I would like to see the output.
Each flag column should correspond to each point above.
Client_id | Hosp_its | Date | service_code | Procdure_code | Flag1 | flag2 | flag3 | flag4 |
C001 | hsp001 | 6/3/2015 | 1568 | ABDC | 0 | 0 | 0 | 0 |
C001 | hsp001 | 6/8/2015 | 1400 | ASDF | 1 | 0 | 0 | 0 |
C001 | hsp002 | 6/8/2015 | 1400 | ASDF | 1 | 0 | 0 | 0 |
C001 | hsp002 | 6/8/2015 | 1568 | SEWE | 0 | 0 | 0 | 0 |
C002 | hsp002 | 6/8/2015 | 1568 | RWRW | 0 | 0 | 0 | 0 |
C002 | hsp002 | 6/8/2015 | 1567 | EWRE | 0 | 0 | 1 | 0 |
C002 | hsp001 | 6/8/2015 | 1567 | RRRR | 0 | 0 | 1 | 1 |
C002 | hsp003 | 6/8/2015 | 1568 | RRRR | 0 | 0 | 0 | 1 |
C003 | hsp001 | 6/8/2015 | 1567 | RRRR | 1 | 0 | 0 | 0 |
C003 | hsp001 | 6/8/2015 | 1567 | RRRR | 1 | 1 | 0 | 0 |
C003 | hsp003 | 6/8/2015 | 1567 | EWRE | 0 | 1 | 0 | 0 |
C004 | hsp001 | 6/29/2015 | 1567 | RRRR | 1 | 0 | 0 | 0 |
C004 | hsp001 | 6/29/2015 | 1567 | RRRR | 1 | 0 | 0 | 0 |
C005 | hsp001 | 6/29/2015 | 1567 | RRRR | 0 | 0 | 0 | 0 |
C005 | hsp001 | 6/30/2015 | 1567 | RRRR | 0 | 1 | 0 | 0 |
C005 | hsp002 | 6/30/2015 | 1567 | RRRR | 0 | 1 | 0 | 0 |
Any assistance would be greatly appreciated!!!
Except for the first flag, this is a more difficult problem than it appears. For the remaining flags, each requires separate processing. Here is an approach, to get FLAG2. You will need similar processing for the remaining flags.
The idea is to have the top loop count how many hospitals appear for the same combination of DATE / Service Code / Procedure Code, and compute FLAG2 accordingly. That requires going past the initial observations for the combination, so the bottom loop reads the same observations and outputs (with the calculated values for FLAG2).
proc sort data=have;
by date service_cd procedure_cd Hospitalid;
run;
data want_with_flag2;
count = 0;
do until (last.procedure_cd);
set have;
by date service_cd procedure_cd Hospitalid;
if first.Hospitalid then count + 1;
end;
if count > 1 then flag2=1;
else flag2=0;
do until (last.procedure_cd);
set have
by date service_cd procedure_cd Hospitalid;
output;
end;
drop count;
run;
1. Show what your expected output is.
2. Have you looked into BY groups and using FIRST/LAST? You can use that type of method to identify duplicates relatively easily.
Here's the first one for you. The rest are variations on this approach.
proc sort data=have;
by client_ID HospitalID Date Service_cd procedure_cd ;
data dup1;
set have;
by client_ID HospitalID Date Service_cd procedure_cd ;
if not (first.procedure_cd and last.procedure_cd) then duplicate=1;
else duplicate=0;
run;
@sas_student1 wrote:
Hello,I have a hospital dataset and I want to be able to flag rows that are duplicates but meeting different conditions.
I need to have four flags each meeting a different condition of duplicates.
The dataset that I have has the clients ID, hospital ID, date of service, a service code and a procedure code.
The four flags I need are By clients
1. If the hospital ID, date of service, service code and procedure codes have duplicate rows then flag (flag1)
2. If the hospital ID is different but date of service, service code and procedure codes have duplicate rows then have another flag (flag2)
3. If the hospital ID and procedure codes are different but date of service and service code have duplicate rows then have another flag (flag3)
4. If the hospital ID and service codes are different but date of service and procedure code have duplicate rows then have another flag (flag4)
I have the below code as a sample. NOTE the 1, 2,3 and 4 in the data set cards are examples of flags that meet the above 4 criteria. Ideally I would need four columns for each criteria and flagging 1 or 0 that meet each criteria. one would have to remove the numbers at the end to run the data step below.
data have;
input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;
format date yymmdd10.;
cards;
C001 hsp001 2015/06/03 1568 ABDC
C001 hsp001 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1568 SEWE
C002 hsp002 2015/06/08 1568 RWRW
C002 hsp002 2015/06/08 1567 EWRE
C002 hsp001 2015/06/08 1567 RRRR 3
C002 hsp003 2015/06/08 1568 RRRR 3
C003 hsp001 2015/06/08 1567 RRRR
C003 hsp001 2015/06/08 1567 RRRR 2
C003 hsp003 2015/06/08 1567 EWRE 2
C004 hsp001 2015/06/29 1567 RRRR 4
C004 hsp001 2015/06/29 1567 RRRR 4
C005 hsp001 2015/06/29 1567 RRRR
C005 hsp001 2015/06/30 1567 RRRR
C005 hsp002 2015/06/30 1567 RRRR
;
run;
Any assistance would be greatly appreciated!!!
Thank you @Reeza. This almost gets to what I was looking for. As per your suggestion I edited my original post to show how I was needing the output to be. In the example you gave me it seems like where the procedure code is not the same it only flags one row and two or more rows that would be associated with the rule is not flagged.
If you can further assist given the example I posted, I would appreciate it!
@Reeza gave you the hint: "Here's the first one for you. The rest are variations on this approach."
For each flag you need to run the code with different group of variables in the BY statements, then
use the last variable of the list in the FIRST. and LAST. where statement.
The output of each run is the input of next run while the first input is the data you have.
proc sort data=<dataset in>;
by variable_1 variable_2 ... variable-last ; /* list of same duplicate values */
run;
data <dataset_out>;
set <dataset in>;
by variable_1 variable_2 ... variable_last ;
if not (first.variable_last and last.variable_last)
then flag_1=1; /* adapt to flag_2 or flag_3 ... */
else flag_1=0; /* adapt to flag_2 or flag_3 ... */
run;
apologies for not getting it. I used the first hint that @Reeza gave but the "not (first.id and last.id)" doesn't seem to work properly. I think I understand what the "not (first.id and last.id)" is supposed to do, that is in this case if the id are different when all the other elements in the by statement are same then to flag =1. But with the by statement of variables that we are conditioning to state are the same, I get rows flagged that are duplicates and not rows where all the elements are same expect for the one that is different.
I hope I got your point. Please check next code.
If you still have issues (meanwhile two flags only) please post the output
and show what are the issues - how to correct it:
data have;
input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;
format date yymmdd10.;
cards;
C001 hsp001 2015/06/03 1568 ABDC
C001 hsp001 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1568 SEWE
C002 hsp002 2015/06/08 1568 RWRW
C002 hsp002 2015/06/08 1567 EWRE
C002 hsp001 2015/06/08 1567 RRRR 3
C002 hsp003 2015/06/08 1568 RRRR 3
C003 hsp001 2015/06/08 1567 RRRR
C003 hsp001 2015/06/08 1567 RRRR 2
C003 hsp003 2015/06/08 1567 EWRE 2
C004 hsp001 2015/06/29 1567 RRRR 4
C004 hsp001 2015/06/29 1567 RRRR 4
C005 hsp001 2015/06/29 1567 RRRR
C005 hsp001 2015/06/30 1567 RRRR
C005 hsp002 2015/06/30 1567 RRRR
;
run;
*================================================*;
proc sort data=have;
by client_ID HospitalID Date Service_cd procedure_cd ;
run;
data dup1;
format client_ID HospitalID Date Service_cd procedure_cd;
set have;
by client_ID HospitalID Date Service_cd procedure_cd ;
if not (first.procedure_cd and last.procedure_cd)
then flag1=1;
else flag1=0;
run;
*================================================*;
proc sort data=dup1;
by client_ID Date Service_cd procedure_cd HospitalID;
run;
proc sql;
create table dup1a as select *,
count (distinct HospitalID) as HospitalCount
from dup1 group by client_ID, Date, Service_cd, procedure_cd
; quit;
data dup2;
format client_ID Date Service_cd procedure_cd HospitalID;
set dup1a;
by client_ID Date Service_cd procedure_cd HospitalID;
if not (first.procedure_cd and last.procedure_cd) and HospitalCount > 1
then flag2=1;
else flag2=0;
drop HospitalCount;
run;
If it is OK continue with similar method to create flag_3 and flag_4, as needed.
@sas_student1 wrote:
apologies for not getting it. I used the first hint that @Reeza gave but the "not (first.id and last.id)" doesn't seem to work properly. I think I understand what the "not (first.id and last.id)" is supposed to do, that is in this case if the id are different when all the other elements in the by statement are same then to flag =1. But with the by statement of variables that we are conditioning to state are the same, I get rows flagged that are duplicates and not rows where all the elements are same expect for the one that is different.
Show the code you ran and the log showing no errors or warnings.
Except for the first flag, this is a more difficult problem than it appears. For the remaining flags, each requires separate processing. Here is an approach, to get FLAG2. You will need similar processing for the remaining flags.
The idea is to have the top loop count how many hospitals appear for the same combination of DATE / Service Code / Procedure Code, and compute FLAG2 accordingly. That requires going past the initial observations for the combination, so the bottom loop reads the same observations and outputs (with the calculated values for FLAG2).
proc sort data=have;
by date service_cd procedure_cd Hospitalid;
run;
data want_with_flag2;
count = 0;
do until (last.procedure_cd);
set have;
by date service_cd procedure_cd Hospitalid;
if first.Hospitalid then count + 1;
end;
if count > 1 then flag2=1;
else flag2=0;
do until (last.procedure_cd);
set have
by date service_cd procedure_cd Hospitalid;
output;
end;
drop count;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.