Identify duplicates in a file

avatar · Posted 03-22-2013 10:43 AM

Is there a way to identify duplicate IDs in a file with 40,000 records?

Any suggestion would be greatly appreciated.

Thanks

Reeza · Posted 03-22-2013 10:44 AM

Task > Data > Sort. Under Options, look for the first and the duplicate options.

Or look at proc sort.
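
For the proc sort route, a minimal sketch might look like the following. The data set names work.have, work.first_per_id, and work.extras are placeholders, not from the original post, and id is assumed to be the key variable. NODUPKEY keeps the first record for each id, and DUPOUT= collects the records that were dropped.

```sas
/* Assumed names: work.have is the input, id is the key variable */
proc sort data=work.have
          out=work.first_per_id   /* one record per id */
          nodupkey
          dupout=work.extras;     /* second and later records per id */
    by id;
run;
```

Note that work.extras holds only the second and later records for each repeated id, so the first record of a duplicated id stays in work.first_per_id.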

avatar · Posted 03-22-2013 11:45 AM

Thanks a lot. I have a few methods to try now.

avatar · Posted 03-22-2013 01:18 PM

A data step using if not last.id then output dups; else output unique; worked for me.

Thanks

Reeza · Posted 03-22-2013 01:33 PM

It should be if not (last.id and first.id) to get both of the duplicate observations. With if not last.id, the last record of each duplicated ID goes to unique instead of dups.
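
Spelled out as a full data step, the idea could be sketched like this. The input name work.have and the sorted copy work.sorted are assumptions for illustration; dups and unique follow the names used earlier in the thread. The data must be sorted by id before the BY statement is used.

```sas
/* Assumed input: work.have with key variable id */
proc sort data=work.have out=work.sorted;
    by id;
run;

data work.dups work.unique;
    set work.sorted;
    by id;
    /* first.id and last.id are both 1 only when the id occurs exactly once */
    if not (first.id and last.id) then output work.dups;
    else output work.unique;
run;
```

With this condition, every record of a repeated id lands in work.dups, and work.unique holds only ids that occur once.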

avatar · Posted 03-22-2013 01:43 PM

Interesting. I wanted both records. Now I've got it. Thanks for the correction. I never knew about this code: if not (last.id and first.id).

Haris · Posted 03-22-2013 01:51 PM

This would print all the records for IDs with multiple entries:

proc sql;
select *
from data
group by ID
having count(*) GT 1;
quit;

Haikuo · Posted 03-22-2013 01:49 PM

40,000 does not sound like too much for Proc SQL. The following code may also help you reach your goal:

proc sql;
select * from yourdata group by ID having count(*)>1;
quit;

Haikuo
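
If only a summary is needed rather than the full records, a variation on the same query could list each repeated ID once with its record count. This is a sketch under assumed names: work.have, work.dup_counts, and n_records are placeholders that do not appear in the thread.

```sas
proc sql;
    /* One row per repeated id, with the number of records it has */
    create table work.dup_counts as
    select id, count(*) as n_records
    from work.have
    group by id
    having count(*) > 1
    order by n_records desc;
quit;
```

Because only id and a summary column are selected, SAS does not need to remerge the counts back onto the detail rows.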
