How to remove duplicate records and place them into a new data set simultaneously

Vish33 · Posted 02-07-2012 12:06 PM

Hi,

I have a data set with millions of records, and I want to remove the duplicates from it and place them into a new data set.

I would like to see different approaches, such as proc sort and other techniques.

Thanks in advance,

vishnu

art297 · Posted 02-07-2012 12:10 PM

Depends upon what you want to remove and if you want to have total control over what is removed.

My own preference is to use proc sort and then, in a datastep, take advantage of the first. and last. boolean variables.
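As a minimal sketch of the sort-then-data-step approach described above: the data set names (have, sorted, unique, dups) and the key variable id are hypothetical placeholders, not names from the original question. The rule for what counts as a duplicate (here, every row after the first within an id) is also an assumption and is the part you control.

```sas
/* Hypothetical input: work.have, with a key variable named id */
proc sort data=have out=sorted;
   by id;
run;

data unique dups;
   set sorted;
   by id;                            /* makes first.id and last.id available */
   if first.id then output unique;   /* first row of each id group */
   else output dups;                 /* every later row with the same id */
run;
```

Changing the condition (for example, testing last.id, or first.id and last.id together) changes which rows are treated as duplicates, which is where the extra control comes from.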

FriedEgg · Posted 02-07-2012 12:33 PM

Look at the dupout option in proc sort to see if it meets your needs. Note that it does not offer the same level of control as Art's recommendation.

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146878.htm
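A short sketch of the dupout option in a single step, assuming a hypothetical input data set have keyed by a variable id (both names are placeholders). It is paired here with NODUPKEY; whether that is the right definition of a duplicate depends on your data.

```sas
/* Hypothetical names: have (input), unique and dups (outputs), id (key) */
proc sort data=have out=unique dupout=dups nodupkey;
   by id;   /* one row per id is kept in unique; the removed rows go to dups */
run;
```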

ArtC · Posted 02-07-2012 01:22 PM

Remember that if you use SORT with NODUPLICATES (or SORT followed by FIRST. and LAST.), you must have a sufficient key to bring the duplicate records next to each other, or they will not be eliminated. This paper has been around a while but is still important today: http://www2.sas.com/proceedings/sugi25/25/po/25p221.pdf
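One way to make the key sufficient, sketched here under the assumption that a duplicate means a row identical on every variable, is to sort by all variables so that identical rows are guaranteed to be adjacent. The data set names have, unique and dups are hypothetical; on millions of records, sorting on every variable may be expensive.

```sas
/* Hypothetical names: have (input), unique and dups (outputs) */
proc sort data=have out=unique dupout=dups nodupkey;
   by _all_;   /* every variable is part of the key, so fully identical rows sort together */
run;
```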

art297 · Posted 02-07-2012 01:26 PM

Art,

I would add that I have seen few, if any, good uses for the noduplicates option. If I were to use the dupout option, I would always use it with NODUPKEY.

Vish33 · Posted 02-08-2012 04:47 AM

Thanks, Art.

This was really helpful.

Hima · Posted 02-07-2012 03:19 PM

Please try this code:

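data person;
   /* The :$12. informat keeps "Marketing" (9 characters) from being */
   /* truncated to the default character length of 8.                */
   input id $ name $ dept :$12.;
   datalines;
1 John Sales
2 Mary Acctng
3 Tom Marketing
1 John Sales
3 Tom Marketing
;
run;

/* Sort by id (PROC SQL below does not require sorted input). */
PROC SORT DATA = person; BY id;
RUN;

/* TEST gets one copy of each row whose id occurs more than once. */
PROC SQL;
   CREATE TABLE test AS
   SELECT DISTINCT * FROM person GROUP BY id HAVING COUNT(id) GT 1;
QUIT;

/* Keep a single row per id in PERSON. */
PROC SORT DATA = person NODUPKEY; BY id;
RUN;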
