Solved: Re: Flag multiple duplicate rows with different conditons

sas_student1 · Posted 02-04-2019 06:47 PM

Hello,

I have a hospital dataset and I want to be able to flag rows that are duplicates but meeting different conditions.

I need to have four flags each meeting a different condition of duplicates.

The dataset that I have has the clients ID, hospital ID, date of service, a service code and a procedure code.

The four flags I need are By clients

1. If the hospital ID, date of service, service code and procedure codes have duplicate rows then flag (flag1)

2. If the hospital ID is different but date of service, service code and procedure codes have duplicate rows then have another flag (flag2)

3. If the hospital ID and procedure codes are different but date of service and service code have duplicate rows then have another flag (flag3)

4. If the hospital ID and service codes are different but date of service and procedure code have duplicate rows then have another flag (flag4)

I have the below code as a sample. NOTE the 1, 2,3 and 4 in the data set cards are examples of flags that meet the above 4 criteria. Ideally I would need four columns for each criteria and flagging 1 or 0 that meet each criteria. one would have to remove the numbers at the end to run the data step below.

data have;

input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;

format date yymmdd10.;

cards;

C001 hsp001 2015/06/03 1568 ABDC

C001 hsp001 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1568 SEWE

C002 hsp002 2015/06/08 1568 RWRW

C002 hsp002 2015/06/08 1567 EWRE

C002 hsp001 2015/06/08 1567 RRRR 3

C002 hsp003 2015/06/08 1568 RRRR 3

C003 hsp001 2015/06/08 1567 RRRR

C003 hsp001 2015/06/08 1567 RRRR 2

C003 hsp003 2015/06/08 1567 EWRE 2

C004 hsp001 2015/06/29 1567 RRRR 4

C005 hsp001 2015/06/29 1567 RRRR

C005 hsp001 2015/06/30 1567 RRRR

C005 hsp002 2015/06/30 1567 RRRR

;

run;

Update below with what I would like to see the output as:

This is how I would like to see the output.

Each flag column should correspond to each point above.

Client_id	Hosp_its	Date	service_code	Procdure_code	Flag1	flag2	flag3	flag4
C001	hsp001	6/3/2015	1568	ABDC	0	0	0	0
C001	hsp001	6/8/2015	1400	ASDF	1	0	0	0
C001	hsp002	6/8/2015	1400	ASDF	1	0	0	0
C001	hsp002	6/8/2015	1568	SEWE	0	0	0	0
C002	hsp002	6/8/2015	1568	RWRW	0	0	0	0
C002	hsp002	6/8/2015	1567	EWRE	0	0	1	0
C002	hsp001	6/8/2015	1567	RRRR	0	0	1	1
C002	hsp003	6/8/2015	1568	RRRR	0	0	0	1
C003	hsp001	6/8/2015	1567	RRRR	1	0	0	0
C003	hsp001	6/8/2015	1567	RRRR	1	1	0	0
C003	hsp003	6/8/2015	1567	EWRE	0	1	0	0
C004	hsp001	6/29/2015	1567	RRRR	1	0	0	0
C004	hsp001	6/29/2015	1567	RRRR	1	0	0	0
C005	hsp001	6/29/2015	1567	RRRR	0	0	0	0
C005	hsp001	6/30/2015	1567	RRRR	0	1	0	0
C005	hsp002	6/30/2015	1567	RRRR	0	1	0	0

Any assistance would be greatly appreciated!!!

Astounding · Posted 02-06-2019 10:17 AM

Except for the first flag, this is a more difficult problem than it appears. For the remaining flags, each requires separate processing. Here is an approach, to get FLAG2. You will need similar processing for the remaining flags.

The idea is to have the top loop count how many hospitals appear for the same combination of DATE / Service Code / Procedure Code, and compute FLAG2 accordingly. That requires going past the initial observations for the combination, so the bottom loop reads the same observations and outputs (with the calculated values for FLAG2).

proc sort data=have;
by date service_cd procedure_cd Hospitalid;
run;

data want_with_flag2;
count = 0;
do until (last.procedure_cd);
   set have;
   by date service_cd procedure_cd Hospitalid;
   if first.Hospitalid then count + 1;
end;
if count > 1 then flag2=1;
else flag2=0;
do until (last.procedure_cd);
   set have
   by date service_cd procedure_cd Hospitalid;
   output;
end;
drop count;
run;

View solution in original post

Reeza · Posted 02-04-2019 06:59 PM

1. Show what your expected output is.

2. Have you looked into BY groups and using FIRST/LAST? You can use that type of method to identify duplicates relatively easily.

Here's the first one for you. The rest are variations on this approach.

proc sort data=have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;

data dup1;
set have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;
if not (first.procedure_cd and last.procedure_cd) then duplicate=1;
else duplicate=0;

run;

@sas_student1 wrote:

Hello,

I have a hospital dataset and I want to be able to flag rows that are duplicates but meeting different conditions.

I need to have four flags each meeting a different condition of duplicates.

The dataset that I have has the clients ID, hospital ID, date of service, a service code and a procedure code.

The four flags I need are By clients

1. If the hospital ID, date of service, service code and procedure codes have duplicate rows then flag (flag1)

2. If the hospital ID is different but date of service, service code and procedure codes have duplicate rows then have another flag (flag2)

3. If the hospital ID and procedure codes are different but date of service and service code have duplicate rows then have another flag (flag3)

4. If the hospital ID and service codes are different but date of service and procedure code have duplicate rows then have another flag (flag4)

I have the below code as a sample. NOTE the 1, 2,3 and 4 in the data set cards are examples of flags that meet the above 4 criteria. Ideally I would need four columns for each criteria and flagging 1 or 0 that meet each criteria. one would have to remove the numbers at the end to run the data step below.

data have;

input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;

format date yymmdd10.;

cards;

C001 hsp001 2015/06/03 1568 ABDC

C001 hsp001 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1400 ASDF 1

C001 hsp002 2015/06/08 1568 SEWE

C002 hsp002 2015/06/08 1568 RWRW

C002 hsp002 2015/06/08 1567 EWRE

C002 hsp001 2015/06/08 1567 RRRR 3

C002 hsp003 2015/06/08 1568 RRRR 3

C003 hsp001 2015/06/08 1567 RRRR

C003 hsp001 2015/06/08 1567 RRRR 2

C003 hsp003 2015/06/08 1567 EWRE 2

C004 hsp001 2015/06/29 1567 RRRR 4

C004 hsp001 2015/06/29 1567 RRRR 4

C005 hsp001 2015/06/29 1567 RRRR

C005 hsp001 2015/06/30 1567 RRRR

C005 hsp002 2015/06/30 1567 RRRR

;

run;

Any assistance would be greatly appreciated!!!

sas_student1 · Posted 02-04-2019 07:52 PM

Thank you @Reeza. This almost gets to what I was looking for. As per your suggestion I edited my original post to show how I was needing the output to be. In the example you gave me it seems like where the procedure code is not the same it only flags one row and two or more rows that would be associated with the rule is not flagged.

If you can further assist given the example I posted, I would appreciate it!

Shmuel · Posted 02-04-2019 09:57 PM

@Reeza gave you the hint: "Here's the first one for you. The rest are variations on this approach."

For each flag you need to run the code with different group of variables in the BY statements, then

use the last variable of the list in the FIRST. and LAST. where statement.

The output of each run is the input of next run while the first input is the data you have.

proc sort data=<dataset in>;
by variable_1 variable_2 ... variable-last ; /* list of same duplicate values */
run;

data <dataset_out>;
set <dataset in>;
by variable_1 variable_2 ... variable_last ;
if not (first.variable_last and last.variable_last) 
   then flag_1=1;      /* adapt to flag_2 or flag_3 ... */
   else flag_1=0;      /* adapt to flag_2 or flag_3 ... */

run;

sas_student1 · Posted 02-04-2019 10:23 PM

@Shmuel @Reeza

apologies for not getting it. I used the first hint that @Reeza gave but the "not (first.id and last.id)" doesn't seem to work properly. I think I understand what the "not (first.id and last.id)" is supposed to do, that is in this case if the id are different when all the other elements in the by statement are same then to flag =1. But with the by statement of variables that we are conditioning to state are the same, I get rows flagged that are duplicates and not rows where all the elements are same expect for the one that is different.

Shmuel · Posted 02-05-2019 01:26 AM

I hope I got your point. Please check next code.

If you still have issues (meanwhile two flags only) please post the output

and show what are the issues - how to correct it:

data have;
input client_ID $ HospitalID $ Date : yymmdd10. Service_cd procedure_cd $;
format date yymmdd10.;
cards;
C001 hsp001 2015/06/03 1568 ABDC
C001 hsp001 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1400 ASDF 1
C001 hsp002 2015/06/08 1568 SEWE
C002 hsp002 2015/06/08 1568 RWRW
C002 hsp002 2015/06/08 1567 EWRE
C002 hsp001 2015/06/08 1567 RRRR 3
C002 hsp003 2015/06/08 1568 RRRR 3
C003 hsp001 2015/06/08 1567 RRRR
C003 hsp001 2015/06/08 1567 RRRR 2
C003 hsp003 2015/06/08 1567 EWRE 2
C004 hsp001 2015/06/29 1567 RRRR 4
C004 hsp001 2015/06/29 1567 RRRR 4
C005 hsp001 2015/06/29 1567 RRRR
C005 hsp001 2015/06/30 1567 RRRR
C005 hsp002 2015/06/30 1567 RRRR
;
run;
*================================================*;
proc sort data=have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;
run;
data dup1;
   format client_ID  HospitalID  Date  Service_cd procedure_cd;
set have;
by client_ID  HospitalID  Date  Service_cd procedure_cd ;
if not (first.procedure_cd and last.procedure_cd) 
   then flag1=1;
   else flag1=0;
run;
*================================================*;
proc sort data=dup1;
by client_ID  Date  Service_cd procedure_cd HospitalID;
run;
proc sql;
     create table dup1a as select *,
     count (distinct HospitalID) as HospitalCount
     from dup1 group by client_ID, Date, Service_cd, procedure_cd
   ; quit;
data dup2;
  format client_ID  Date  Service_cd procedure_cd HospitalID;
set dup1a;
by client_ID  Date  Service_cd procedure_cd HospitalID;
if not (first.procedure_cd and last.procedure_cd) and HospitalCount > 1
   then flag2=1;
   else flag2=0;
drop HospitalCount;
run;

If it is OK continue with similar method to create flag_3 and flag_4, as needed.

Reeza · Posted 02-05-2019 03:32 PM

@sas_student1 wrote:

@Shmuel @Reeza

apologies for not getting it. I used the first hint that @Reeza gave but the "not (first.id and last.id)" doesn't seem to work properly. I think I understand what the "not (first.id and last.id)" is supposed to do, that is in this case if the id are different when all the other elements in the by statement are same then to flag =1. But with the by statement of variables that we are conditioning to state are the same, I get rows flagged that are duplicates and not rows where all the elements are same expect for the one that is different.

Show the code you ran and the log showing no errors or warnings.

Astounding · Posted 02-06-2019 10:17 AM

Except for the first flag, this is a more difficult problem than it appears. For the remaining flags, each requires separate processing. Here is an approach, to get FLAG2. You will need similar processing for the remaining flags.

The idea is to have the top loop count how many hospitals appear for the same combination of DATE / Service Code / Procedure Code, and compute FLAG2 accordingly. That requires going past the initial observations for the combination, so the bottom loop reads the same observations and outputs (with the calculated values for FLAG2).

proc sort data=have;
by date service_cd procedure_cd Hospitalid;
run;

data want_with_flag2;
count = 0;
do until (last.procedure_cd);
   set have;
   by date service_cd procedure_cd Hospitalid;
   if first.Hospitalid then count + 1;
end;
if count > 1 then flag2=1;
else flag2=0;
do until (last.procedure_cd);
   set have
   by date service_cd procedure_cd Hospitalid;
   output;
end;
drop count;
run;

Registration is open

SAS Training: Just a Click Away