DATA Step, Macro, Functions and more

Delete duplicate records.

Reply
N/A
Posts: 0

Delete duplicate records.

proc sort data=mydataset nodupkey;
by _all_;
run;

Q1: Will there be no identical records in mydataset after execution of the above code; without any exception?

Q2: Is there any procedure to get a report on how many identical records there is for each ID, where ID is a primary key in mydataset? (Yes, of course run before the above code.)
Super Contributor
Super Contributor
Posts: 3,174

Re: Delete duplicate records.

Posted in reply to deleted_user
My observation is to say "yes" to Q1. As for Q2, I suggest you review the use of DUPOUT= (with PROC SORT), if you need to do some analysis, whether you need to generate a report - there is no built-in SAS "report" for analyzing with a duplicate-data condition.

Scott Barry
SBBWorks, Inc.
Respected Advisor
Posts: 4,173

Re: Delete duplicate records.

Posted in reply to deleted_user
For Q2:

proc sql;
select id,count(*) as Duplicates
from mydataset
group by id
having count(*)>1
;
quit;
Trusted Advisor
Posts: 2,116

Re: Delete duplicate records.

Posted in reply to deleted_user
Patrick,

Your code will display the cases with duplicate IDs, but does not address duplicate content, which is what I read Q2 to ask.

Doc Muhlbaier
Duke
Respected Advisor
Posts: 4,173

Re: Delete duplicate records.

Posted in reply to deleted_user
Ernesto

Duke is of course right. The code I provided counts only duplicate ID's.

The code below shows/counts now duplicate records.

What confuses me:
ID is defined as primary key and should therefore be unique.
Just removing all duplicate records won't solve all possible issues with duplicate ID's. There could even be different sets of duplicate records per id as
the sample data illustrates.


data mydataset;
keep id var1-var5;
length id 8;
array var {5} 8;
do id=1 to 2;
do i=1 to 20;
do j=1 to dim(var);
var{j}=round(ranuni(1));
end;
output;
end;
end;
run;

/* variant1: count before duplicates got removed */
proc sql;
select NAME INTO :col_list separated by ','
from dictionary.columns
where libname='WORK' and memname='MYDATASET';
;
select count(*) as Duplicates,*
from mydataset
group by &col_list
having count(*)>1
;
quit;


/* variant2: count after duplicates got removed */
proc sort data=mydataset out=Unique_mydataset equals nodupkey dupout=Dup_mydataset;
by _all_;
run;

proc sql;
select name into :col_list separated by ','
from dictionary.columns
where libname='WORK' and memname='Dup_mydataset';
;
select count(*)+1 as Duplicates,*
from Dup_mydataset
group by &col_list
;
quit;


HTH
Patrick
N/A
Posts: 0

Re: Delete duplicate records.

Correction: ID is not a primary key.

Thank's for all answers.
Ask a Question
Discussion stats
  • 5 replies
  • 254 views
  • 0 likes
  • 4 in conversation