Solved: Remove duplicate IDs with similar characters

Ken_nuga · Posted 08-12-2022 11:41 PM

I have duplicates like CT001 and ct001 under "Pepid" but SAS does not recognize them as duplicates, how do I remove one and keep the most recent of the two?

Pepid	patient_id	Sex	LastPickupDate
CT08662	8480	Male	05-Apr-22
ct08662	8480	Male	11-Jan-22
IHSD/CT00071	4883	Male	27-Apr-22
IHSD/CT00073	4892	Male	10-May-22
IHSD/CT00079	4859	Male	27-Apr-22
ihsd/ct00071	4883	Male	26-Jan-22
ihsd/ct00073	4892	Male	09-Feb-22
ihsd/ct00079	4859	Male	26-Jan-22

Patrick · Posted 08-13-2022 12:27 AM

Like this?

data have;
  infile datalines truncover dlm='|' dsd;
  input Pepid :$20. patient_id:best32. Sex:$10. LastPickupDate :date9.;
  format LastPickupDate date9.;
  datalines;
CT08662|8480|Male|05-Apr-22
ct08662|8480|Male|11-Jan-22
IHSD/CT00071|4883|Male|27-Apr-22
IHSD/CT00073|4892|Male|10-May-22
IHSD/CT00079|4859|Male|27-Apr-22
ihsd/ct00071|4883|Male|26-Jan-22
ihsd/ct00073|4892|Male|09-Feb-22
ihsd/ct00079|4859|Male|26-Jan-22
;

proc sql;
  create table want as
  select *
  from have
  group by patient_id, sex, upcase(Pepid)
  having max(LastPickupDate)=LastPickupDate
  ;
quit;

View solution in original post

mkeintz · Posted 08-12-2022 11:46 PM

Why don't you use patient_id as your duplicate finder?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ken_nuga · Posted 08-13-2022 12:43 AM

That would have worked fine, however, the data set has some facilities mapped under another (a hub and spoke approach) and running an independent database but having the same facility ID, therefore, using that erroneously deletes the records of patients from the 'spoke' see below: 'VHCT00001' from 'spoke' facility has the same Patient_id with TB143 from the 'hub' facility

FacilityCode	Pepid_fac	patient_id	Pepid	Sex	LastPickupDate	AgeAtStartofART
i6Rcojd7d9S	TB143-i6Rcojd7d9S	2	TB143	Male	12-Apr-22	40
i6Rcojd7d9S	VHCT00001-i6Rcojd7d9S	2	VHCT00001	Female	30-Jun-22	25
i6Rcojd7d9S	OP00002-i6Rcojd7d9S	3	OP00002	Female	22-Feb-18	3
i6Rcojd7d9S	PI1825-i6Rcojd7d9S	3	PI1825	Female	13-May-22	38
i6Rcojd7d9S	PI0964-i6Rcojd7d9S	4	PI0964	Female	03-Feb-22	28
i6Rcojd7d9S	VHCT00002-i6Rcojd7d9S	4	VHCT00002	Male	30-May-22	31
i6Rcojd7d9S	CT138-i6Rcojd7d9S	5	CT138	Female	06-Apr-22	48
i6Rcojd7d9S	VHCT00003B-i6Rcojd7d9S	5	VHCT00003B	Female	20-Jun-22	31
i6Rcojd7d9S	CT06237-i6Rcojd7d9S	6	CT06237	Female	11-Feb-22	52
i6Rcojd7d9S	VHOPI00009B-i6Rcojd7d9S	6	VHOPI00009B	Female	08-Aug-19	10
i6Rcojd7d9S	CT04218-i6Rcojd7d9S	7	CT04218	Male	25-Jan-22	64
i6Rcojd7d9S	VHOPI00004B-i6Rcojd7d9S	7	VHOPI00004B	Female	11-Apr-22	3
i6Rcojd7d9S	CT144-i6Rcojd7d9S	8	CT144	Female	11-Feb-19	32
i6Rcojd7d9S	VHOPI00005-i6Rcojd7d9S	8	VHOPI00005	Male	13-Jan-18	2
i6Rcojd7d9S	CT00272-i6Rcojd7d9S	9	CT00272	Female	25-Mar-22	39
i6Rcojd7d9S	VHOP00007-i6Rcojd7d9S	9	VHOP00007	Female	28-Jun-22	7
i6Rcojd7d9S	PM01485-i6Rcojd7d9S	10	PM01485	Male	04-Feb-22	40
i6Rcojd7d9S	VHCT00009A-i6Rcojd7d9S	10	VHCT00009A	Female	05-May-22	37

Patrick · Posted 08-13-2022 12:27 AM

Like this?

data have;
  infile datalines truncover dlm='|' dsd;
  input Pepid :$20. patient_id:best32. Sex:$10. LastPickupDate :date9.;
  format LastPickupDate date9.;
  datalines;
CT08662|8480|Male|05-Apr-22
ct08662|8480|Male|11-Jan-22
IHSD/CT00071|4883|Male|27-Apr-22
IHSD/CT00073|4892|Male|10-May-22
IHSD/CT00079|4859|Male|27-Apr-22
ihsd/ct00071|4883|Male|26-Jan-22
ihsd/ct00073|4892|Male|09-Feb-22
ihsd/ct00079|4859|Male|26-Jan-22
;

proc sql;
  create table want as
  select *
  from have
  group by patient_id, sex, upcase(Pepid)
  having max(LastPickupDate)=LastPickupDate
  ;
quit;

Ken_nuga · Posted 08-13-2022 05:32 AM

Many thanks, but I had to do as shown below:

DATA have;
set have;
Pepid2=upcase(Pepid);
Pepid_fac2=catx('-',Pepid2,FacilityCode);
run;

proc sql;
create table want1 as
select *
from have
group by Pepid_fac2, patient_id
having max(LastPickupDate)=LastPickupDate
;
quit;

DATA want2;
set want1;
by Pepid_fac2;
if first.Pepid_fac2;
run;

Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Re: Remove duplicate IDs with similar characters

Registration is open

SAS Training: Just a Click Away