Hi,
I have a data set with millions of records and i want to remove the duplicates from that and place them into a new data set.
I need different scenarios like using proc sort and other techniques.
Thanks in advance,
vishnu
Depends upon what you want to remove and if you want to have total control over what is removed.
My own preference is to use proc sort and then, in a datastep, take advantage of the first. and last. boolean variables.
Look at the dupout option in proc sort to see if it meets your needs, it does not offer the same level of control as Art's recommendation.
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146878.htm
Remember if you use SORT with NODUPLICATES (or SORT followed by FIRST. and LAST.) that you must have a sufficient key to bring the duplicate records next to each other or they will not be eliminated. This paper has been around a while but is still important today: http://www2.sas.com/proceedings/sugi25/25/po/25p221.pdf
Art,
I would add that there aren't many (if any) use(s) I've ever seen for the noduplicates option. If I were to use the dupout option, I would always use it with NODUPKEY
Thanks Art,
this was really helpful..
Please try this code:
data person;
input id $ name $ dept $;
datalines;
1 John Sales
2 Mary Acctng
3 Tom Marketing
1 John Sales
3 Tom Marketing
;
PROC SORT DATA = person; BY id;
RUN;
PROC SQL;
CREATE TABLE TEST AS
SELECT DISTINCT * FROM person GROUP BY ID HAVING COUNT(ID) GT 1;
QUIT;
PROC SORT DATA = person NODUPKEY; BY id;
RUN;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.