DATA Step, Macro, Functions and more

How to remove duplicate records and place them into a new data set simultaneously

Reply
Frequent Contributor
Posts: 117

How to remove duplicate records and place them into a new data set simultaneously

Hi,

I have a data set with millions of records and i want to remove the duplicates from that and place them into a new data set.

I need different scenarios like using proc sort and other techniques.

Thanks in advance,

vishnu

PROC Star
Posts: 7,362

How to remove duplicate records and place them into a new data set simultaneously

Depends upon what you want to remove and if you want to have total control over what is removed.

My own preference is to use proc sort and then, in a datastep, take advantage of the first. and last. boolean variables.

Trusted Advisor
Posts: 1,300

How to remove duplicate records and place them into a new data set simultaneously

Look at the dupout option in proc sort to see if it meets your needs, it does not offer the same level of control as Art's recommendation.

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146878.htm

Valued Guide
Posts: 632

How to remove duplicate records and place them into a new data set simultaneously

Remember if you use SORT with NODUPLICATES (or SORT followed by FIRST. and LAST.) that you must have a sufficient key to bring the duplicate records next to each other or they will not be eliminated. This paper has been around a while but is still important today:  http://www2.sas.com/proceedings/sugi25/25/po/25p221.pdf

PROC Star
Posts: 7,362

How to remove duplicate records and place them into a new data set simultaneously

Art,

I would add that there aren't many (if any) use(s) I've ever seen for the noduplicates option.  If I were to use the dupout option, I would always use it with NODUPKEY

Frequent Contributor
Posts: 117

How to remove duplicate records and place them into a new data set simultaneously

Thanks Art,

this was really helpful..

Regular Contributor
Posts: 233

Re: How to remove duplicate records and place them into a new data set simultaneously

Please try this code:

data person;
   input id $ name $ dept $;
   datalines;
1 John Sales
2 Mary Acctng
3 Tom  Marketing
1 John Sales
3 Tom  Marketing
;

PROC SORT DATA = person; BY id;
RUN;

PROC SQL;

CREATE TABLE TEST AS
SELECT DISTINCT * FROM person GROUP BY ID HAVING COUNT(ID) GT 1;
QUIT;

PROC SORT DATA = person NODUPKEY; BY id;
RUN;

Ask a Question
Discussion stats
  • 6 replies
  • 357 views
  • 6 likes
  • 5 in conversation