Input data named pat_detail
Pat_ID Dis_Date Linker Pat_Age Zip_code
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 39 000
A12 04/19/13 A12 04/19/13 43 111
A13 05/20/14 A13 05/20/14 44 222
I have a table with the following variables and several more observations but some of them are duplicates as shown in example for Pat_ID A11. I did this to check for the Pat_ID's that had duplicates and if so, how many.
proc sql;
 create table duplicate as
select distinct PAT_ID, dis_date, linker, count(*) as count
from pat_detail
group by recip_id, discharge_date
Having count>=2;
quit;
duplicate table data:
Pat_ID Dis_Date Linker Count
A11 03/28/12 A11 03/28/12 2
How do I delete the observations in Pat_detail that are duplicates based on checking Pat_ID, dis_date and linker?
Hi,
The NODUPKEY option in PROC SORT should do the trick. For example:
proc sort data=pat_detail out=pat_detail_new nodupkey;
by pat_ID dis_date linker;
run;
The OUT= option creates a new data set with only the unique combinations of the BY variables.
Hope this helps,
M.
proc sql;
select distinct * from pat_detail;
quit;
Hi,
The NODUPKEY option in PROC SORT should do the trick. For example:
proc sort data=pat_detail out=pat_detail_new nodupkey;
by pat_ID dis_date linker;
run;
The OUT= option creates a new data set with only the unique combinations of the BY variables.
Hope this helps,
M.
What if you have for some data quality issues duplicate keys for "Pat_ID, dis_date and linker" but then also different values in these records for "Pat_Age"? To be on the safe side I would only remove duplicate records (all variables having the same value).
data have;
infile datalines truncover;
input Pat_ID $ Dis_Date:mmddyy. Linker $ Pat_Age:mmddyy. Zip_code $;
format Dis_Date date9.;
datalines;
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 49 000
A12 04/19/13 A12 04/19/13 43 111
A13 05/20/14 A13 05/20/14 44 222
;
run;
/* remove duplicates */
proc sort data=have out=want dupout=duplicates nodupkey;
by _all_;
run;
/* DQ issues */
proc sql;
create table DQ_issue as
select * from want
group by Pat_ID, dis_date, linker
having count(*)>=2
;
quit;
Hello,
Depending on your interest you may choose the appropriate calculation for the variables not included in the key.
data have;
input Pat_ID $ Dis_Date $ Linker & $12. Pat_Age Zip_code ;
datalines;
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 43 000
A11 03/28/12 A11 03/28/12 44 001
A12 04/19/13 A12 04/19/13 43 111
A13 05/20/14 A13 05/20/14 44 222
;
proc sql;
select pat_ID, Dis_Date, Linker, max(Pat_Age) , count(Zip_code)
from have
group by pat_ID, Dis_Date, Linker
;
quit;
proc sort data=pat_detail out=pat_detail_new nodupkey;
by pat_ID dis_date linker;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
