topic Re: Delete duplicate records. in SAS Programming

Delete duplicate records.

deleted_user — Thu, 22 Apr 2010 18:36:50 GMT

proc sort data=mydataset nodupkey;
by _all_;
run;

Q1: Will there be no identical records in mydataset after execution of the above code; without any exception?

Q2: Is there any procedure to get a report on how many identical records there is for each ID, where ID is a primary key in mydataset? (Yes, of course run before the above code.)

Re: Delete duplicate records.

sbb — Thu, 22 Apr 2010 20:32:52 GMT

My observation is to say "yes" to Q1. As for Q2, I suggest you review the use of DUPOUT= (with PROC SORT), if you need to do some analysis, whether you need to generate a report - there is no built-in SAS "report" for analyzing with a duplicate-data condition.

Scott Barry
SBBWorks, Inc.

Re: Delete duplicate records.

Patrick — Sun, 25 Apr 2010 03:45:55 GMT

For Q2:

proc sql;
select id,count(*) as Duplicates
from mydataset
group by id
having count(*)>1
;
quit;

Re: Delete duplicate records.

Doc_Duke — Sun, 25 Apr 2010 13:37:56 GMT

Patrick,

Your code will display the cases with duplicate IDs, but does not address duplicate content, which is what I read Q2 to ask.

Doc Muhlbaier
Duke

Re: Delete duplicate records.

Patrick — Mon, 26 Apr 2010 02:42:16 GMT

Ernesto

Duke is of course right. The code I provided counts only duplicate ID's.

The code below shows/counts now duplicate records.

What confuses me:
ID is defined as primary key and should therefore be unique.
Just removing all duplicate records won't solve all possible issues with duplicate ID's. There could even be different sets of duplicate records per id as
the sample data illustrates.

data mydataset;
keep id var1-var5;
length id 8;
array var {5} 8;
do id=1 to 2;
do i=1 to 20;
do j=1 to dim(var);
var{j}=round(ranuni(1));
end;
output;
end;
end;
run;

/* variant1: count before duplicates got removed */
proc sql;
select NAME INTO :col_list separated by ','
from dictionary.columns
where libname='WORK' and memname='MYDATASET';
;
select count(*) as Duplicates,*
from mydataset
group by &col_list
having count(*)>1
;
quit;

/* variant2: count after duplicates got removed */
proc sort data=mydataset out=Unique_mydataset equals nodupkey dupout=Dup_mydataset;
by _all_;
run;

proc sql;
select name into :col_list separated by ','
from dictionary.columns
where libname='WORK' and memname='Dup_mydataset';
;
select count(*)+1 as Duplicates,*
from Dup_mydataset
group by &col_list
;
quit;

HTH
Patrick

Re: Delete duplicate records.

deleted_user — Mon, 26 Apr 2010 19:34:53 GMT

Correction: ID is not a primary key.

Thank's for all answers.