BookmarkSubscribeRSS Feed
Obsidian | Level 7


I have the following code

proc sort data= dataset1 out=dataset2 NODUPKEY dupout=duprecords;
by DOB Gender admitdate memberid Facility_2 ;


There are other fields but I want to delete duplicates only based on those. 

dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset.  I can't figure out why.


I'm  using SAS Enterprise Guide 8.3




Super User Tom
Super User

I doubt it isn't working as designed.  Perhaps you are not clear of the design?


You should have unique keys in DATASET2.  In DUPRECORDS you will have the observations that did not make it into DATASET2.  So if you started with 100 observations you will have 100 observations split between the two datasets.


Try replicating the divide yourself using separate sort and data step;

proc sort data= dataset1 out=dataset1_sorted;
  by DOB Gender admitdate memberid Facility_2 ;

data dataset2 duprecords;
  set dataset1_sorted ;
  by DOB Gender admitdate memberid Facility_2 ;
  if first.facility_2 then output dataset2;
  else output duprecords;
Obsidian | Level 7


Thanks for all the quick responses.  I figured out my problem.  I was checking the result based on a member's last name which she had changed during the time span of the observations.

It's nice to have some new ideas on how to check things.

I appreciate the suggestions.




You'll have to overcome decades of PROC SORT use for me to accept that NODUPKEY deletes ALL obs with duplicate BY values.  


Please show the log of your PROC SORT, and any key values (DOB Gender admitdate memberid Facility_2) in dataset1 that does not appear in dataset2.




The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

Jade | Level 19

Hi @abqdiane ,

@abqdiane wrote:

dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset.

This is the expected behavior of PROC SORT, if further up in dataset1 there's a third record with the same values of the BY variables.

Super User

You say that you have two duplicate RECORDS, but since you are using NODUPEKEY does another record have the same value for the BY variables but a different value for one or more other not listed variables? The first record with the values of the by variables encountered is usually the one kept.


Sort your data by those variables without the nodupekey and dupout dataset.

Print it

Show use the values of all the variables for the records with that sort by statement that match your "duplicate records"



Barite | Level 11
DUPOUT scans dataset1 and outputs all duplicate records, so duprecords data set has double records, because record 1 is a dup of record 2, and the reverse is also true. Then NODUPKEY eliminates all the duplicate records from dataset1, so dataset2 will not have any duplicate records.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 6 in conversation