Input data named pat_detail
Pat_ID Dis_Date Linker Pat_Age Zip_code
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 39 000
A12 04/19/13 A12 04/19/13 43 111
A13 05/20/14 A13 05/20/14 44 222
I have a table with the following variables and several more observations but some of them are duplicates as shown in example for Pat_ID A11. I did this to check for the Pat_ID's that had duplicates and if so, how many.
proc sql;
create table duplicate as
select distinct PAT_ID, dis_date, linker, count(*) as count
from pat_detail
group by recip_id, discharge_date
Having count>=2;
quit;
duplicate table data:
Pat_ID Dis_Date Linker Count
A11 03/28/12 A11 03/28/12 2
How do I delete the observations in Pat_detail that are duplicates based on checking Pat_ID, dis_date and linker?
Hi,
The NODUPKEY option in PROC SORT should do the trick. For example:
proc sort data=pat_detail out=pat_detail_new nodupkey;
by pat_ID dis_date linker;
run;
The OUT= option creates a new data set with only the unique combinations of the BY variables.
Hope this helps,
M.
proc sql;
select distinct * from pat_detail;
quit;
Hi,
The NODUPKEY option in PROC SORT should do the trick. For example:
proc sort data=pat_detail out=pat_detail_new nodupkey;
by pat_ID dis_date linker;
run;
The OUT= option creates a new data set with only the unique combinations of the BY variables.
Hope this helps,
M.
What if you have for some data quality issues duplicate keys for "Pat_ID, dis_date and linker" but then also different values in these records for "Pat_Age"? To be on the safe side I would only remove duplicate records (all variables having the same value).
data have;
infile datalines truncover;
input Pat_ID $ Dis_Date:mmddyy. Linker $ Pat_Age:mmddyy. Zip_code $;
format Dis_Date date9.;
datalines;
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 49 000
A12 04/19/13 A12 04/19/13 43 111
A13 05/20/14 A13 05/20/14 44 222
;
run;
/* remove duplicates */
proc sort data=have out=want dupout=duplicates nodupkey;
by _all_;
run;
/* DQ issues */
proc sql;
create table DQ_issue as
select * from want
group by Pat_ID, dis_date, linker
having count(*)>=2
;
quit;
Hello,
Depending on your interest you may choose the appropriate calculation for the variables not included in the key.
data have;
input Pat_ID $ Dis_Date $ Linker & $12. Pat_Age Zip_code ;
datalines;
A11 03/28/12 A11 03/28/12 39 000
A11 03/28/12 A11 03/28/12 43 000
A11 03/28/12 A11 03/28/12 44 001
A12 04/19/13 A12 04/19/13 43 111
A13 05/20/14 A13 05/20/14 44 222
;
proc sql;
select pat_ID, Dis_Date, Linker, max(Pat_Age) , count(Zip_code)
from have
group by pat_ID, Dis_Date, Linker
;
quit;
proc sort data=pat_detail out=pat_detail_new nodupkey;
by pat_ID dis_date linker;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.