Help using Base SAS procedures

Deleting duplicates based on three variables

Accepted Solution Solved
Reply
Contributor
Posts: 28
Accepted Solution

Deleting duplicates based on three variables


Input data named pat_detail

Pat_ID     Dis_Date     Linker               Pat_Age     Zip_code

A11          03/28/12     A11 03/28/12     39               000

A11          03/28/12     A11 03/28/12     39               000    

A12          04/19/13     A12 04/19/13     43               111

A13          05/20/14     A13 05/20/14     44               222

I have a table with the following variables and several more observations but some of them are duplicates as shown in example for Pat_ID A11.  I did this to check for the Pat_ID's that had duplicates and if so, how many.

proc sql;

create table duplicate as

select distinct PAT_ID, dis_date, linker, count(*) as count

from pat_detail

group by recip_id, discharge_date

Having count>=2;

quit;

duplicate table data:

Pat_ID    Dis_Date     Linker          Count

A11       03/28/12     A11 03/28/12     2

How do I delete the observations in Pat_detail that are duplicates based on checking Pat_ID, dis_date and linker?


Accepted Solutions
Solution
‎01-09-2015 04:49 PM
Contributor
Posts: 45

Re: Deleting duplicates based on three variables

Hi,

The NODUPKEY option in PROC SORT should do the trick. For example:

proc sort data=pat_detail out=pat_detail_new nodupkey;

by pat_ID dis_date linker;

run;

The OUT= option creates a new data set with only the unique combinations of the BY variables.

Hope this helps,

M.

View solution in original post


All Replies
Trusted Advisor
Posts: 1,204

Re: Deleting duplicates based on three variables

proc sql;

select distinct * from pat_detail;

quit;

Solution
‎01-09-2015 04:49 PM
Contributor
Posts: 45

Re: Deleting duplicates based on three variables

Hi,

The NODUPKEY option in PROC SORT should do the trick. For example:

proc sort data=pat_detail out=pat_detail_new nodupkey;

by pat_ID dis_date linker;

run;

The OUT= option creates a new data set with only the unique combinations of the BY variables.

Hope this helps,

M.

Respected Advisor
Posts: 3,887

Re: Deleting duplicates based on three variables

What if you have for some data quality issues duplicate keys for "Pat_ID, dis_date and linker" but then also different values in these records for "Pat_Age"? To be on the safe side I would only remove duplicate records (all variables having the same value).

data have;

  infile datalines truncover;

  input Pat_ID $ Dis_Date:mmddyy. Linker $ Pat_Age:mmddyy. Zip_code $;

  format Dis_Date date9.;

  datalines;

A11 03/28/12 A11 03/28/12 39 000

A11 03/28/12 A11 03/28/12 39 000

A11 03/28/12 A11 03/28/12 49 000

A12 04/19/13 A12 04/19/13 43 111

A13 05/20/14 A13 05/20/14 44 222

;

run;

/* remove duplicates */

proc sort data=have out=want dupout=duplicates nodupkey;

  by _all_;

run;

/* DQ issues */

proc sql;

  create table DQ_issue as

    select * from want

    group by Pat_ID, dis_date, linker

    having count(*)>=2

  ;

quit;

Super Contributor
Posts: 305

Re: Deleting duplicates based on three variables

Hello,

Depending on your interest you may choose the appropriate calculation for the variables not included in the key.

data have;

input Pat_ID   $  Dis_Date $     Linker    & $12.     Pat_Age     Zip_code ;

datalines;

A11          03/28/12     A11 03/28/12     39               000

A11          03/28/12     A11 03/28/12     43               000   

A11          03/28/12     A11 03/28/12     44               001

A12          04/19/13     A12 04/19/13     43               111

A13          05/20/14     A13 05/20/14     44               222

;

proc sql;

select pat_ID, Dis_Date,  Linker, max(Pat_Age) , count(Zip_code)

from have

group by pat_ID, Dis_Date,  Linker

;

quit;

Occasional Contributor
Posts: 5

Re: Deleting duplicates based on three variables

proc sort data=pat_detail out=pat_detail_new nodupkey;

by pat_ID dis_date linker;

run;

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 532 views
  • 3 likes
  • 6 in conversation