Solved: Re: Deleting duplicates based on three variables

pinkyc · Posted 01-09-2015 04:23 PM

Input data named pat_detail

Pat_ID Dis_Date Linker Pat_Age Zip_code

A11 03/28/12 A11 03/28/12 39 000

A12 04/19/13 A12 04/19/13 43 111

A13 05/20/14 A13 05/20/14 44 222

I have a table with the following variables and several more observations but some of them are duplicates as shown in example for Pat_ID A11. I did this to check for the Pat_ID's that had duplicates and if so, how many.

proc sql;

create table duplicate as

select distinct PAT_ID, dis_date, linker, count(*) as count

from pat_detail

group by recip_id, discharge_date

Having count>=2;

quit;

duplicate table data:

Pat_ID Dis_Date Linker Count

A11 03/28/12 A11 03/28/12 2

How do I delete the observations in Pat_detail that are duplicates based on checking Pat_ID, dis_date and linker?

MaikH_Schutze · Posted 01-09-2015 04:49 PM

Hi,

The NODUPKEY option in PROC SORT should do the trick. For example:

proc sort data=pat_detail out=pat_detail_new nodupkey;

by pat_ID dis_date linker;

run;

The OUT= option creates a new data set with only the unique combinations of the BY variables.

Hope this helps,

M.

View solution in original post

stat_sas · Posted 01-09-2015 04:47 PM

proc sql;

select distinct * from pat_detail;

quit;

MaikH_Schutze · Posted 01-09-2015 04:49 PM

Hi,

The NODUPKEY option in PROC SORT should do the trick. For example:

proc sort data=pat_detail out=pat_detail_new nodupkey;

by pat_ID dis_date linker;

run;

The OUT= option creates a new data set with only the unique combinations of the BY variables.

Hope this helps,

M.

Patrick · Posted 01-09-2015 05:12 PM

What if you have for some data quality issues duplicate keys for "Pat_ID, dis_date and linker" but then also different values in these records for "Pat_Age"? To be on the safe side I would only remove duplicate records (all variables having the same value).

data have;

infile datalines truncover;

input Pat_ID $ Dis_Date:mmddyy. Linker $ Pat_Age:mmddyy. Zip_code $;

format Dis_Date date9.;

datalines;

A11 03/28/12 A11 03/28/12 39 000

A11 03/28/12 A11 03/28/12 49 000

A12 04/19/13 A12 04/19/13 43 111

A13 05/20/14 A13 05/20/14 44 222

;

run;

/* remove duplicates */

proc sort data=have out=want dupout=duplicates nodupkey;

by _all_;

run;

/* DQ issues */

proc sql;

create table DQ_issue as

select * from want

group by Pat_ID, dis_date, linker

having count(*)>=2

;

quit;

Loko · Posted 01-10-2015 04:18 AM

Hello,

Depending on your interest you may choose the appropriate calculation for the variables not included in the key.

data have;

input Pat_ID $ Dis_Date $ Linker & $12. Pat_Age Zip_code ;

datalines;

A11 03/28/12 A11 03/28/12 39 000

A11 03/28/12 A11 03/28/12 43 000

A11 03/28/12 A11 03/28/12 44 001

A12 04/19/13 A12 04/19/13 43 111

A13 05/20/14 A13 05/20/14 44 222

;

proc sql;

select pat_ID, Dis_Date, Linker, max(Pat_Age) , count(Zip_code)

from have

group by pat_ID, Dis_Date, Linker

;

quit;

Tanmay · Posted 01-14-2015 03:22 AM

proc sort data=pat_detail out=pat_detail_new nodupkey;

by pat_ID dis_date linker;

run;

Deleting duplicates based on three variables

Re: Deleting duplicates based on three variables

Re: Deleting duplicates based on three variables

Re: Deleting duplicates based on three variables

Re: Deleting duplicates based on three variables

Re: Deleting duplicates based on three variables

Re: Deleting duplicates based on three variables

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away