BookmarkSubscribeRSS Feed
abqdiane
Obsidian | Level 7

Hi,

I have the following code


proc sort data= dataset1 out=dataset2 NODUPKEY dupout=duprecords;
by DOB Gender admitdate memberid Facility_2 ;
run;

 

There are other fields but I want to delete duplicates only based on those. 

dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset.  I can't figure out why.

 

I'm  using SAS Enterprise Guide 8.3

Thanks,

Diane

 

6 REPLIES 6
Tom
Super User Tom
Super User

I doubt it isn't working as designed.  Perhaps you are not clear of the design?

 

You should have unique keys in DATASET2.  In DUPRECORDS you will have the observations that did not make it into DATASET2.  So if you started with 100 observations you will have 100 observations split between the two datasets.

 

Try replicating the divide yourself using separate sort and data step;

proc sort data= dataset1 out=dataset1_sorted;
  by DOB Gender admitdate memberid Facility_2 ;
run;

data dataset2 duprecords;
  set dataset1_sorted ;
  by DOB Gender admitdate memberid Facility_2 ;
  if first.facility_2 then output dataset2;
  else output duprecords;
run;
abqdiane
Obsidian | Level 7

Hi,

Thanks for all the quick responses.  I figured out my problem.  I was checking the result based on a member's last name which she had changed during the time span of the observations.

It's nice to have some new ideas on how to check things.

I appreciate the suggestions.

Diane

 

mkeintz
PROC Star

You'll have to overcome decades of PROC SORT use for me to accept that NODUPKEY deletes ALL obs with duplicate BY values.  

 

Please show the log of your PROC SORT, and any key values (DOB Gender admitdate memberid Facility_2) in dataset1 that does not appear in dataset2.

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FreelanceReinh
Jade | Level 19

Hi @abqdiane ,


@abqdiane wrote:

dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset.

This is the expected behavior of PROC SORT, if further up in dataset1 there's a third record with the same values of the BY variables.

ballardw
Super User

You say that you have two duplicate RECORDS, but since you are using NODUPEKEY does another record have the same value for the BY variables but a different value for one or more other not listed variables? The first record with the values of the by variables encountered is usually the one kept.

 

Sort your data by those variables without the nodupekey and dupout dataset.

Print it

Show use the values of all the variables for the records with that sort by statement that match your "duplicate records"

 

 

pink_poodle
Barite | Level 11
DUPOUT scans dataset1 and outputs all duplicate records, so duprecords data set has double records, because record 1 is a dup of record 2, and the reverse is also true. Then NODUPKEY eliminates all the duplicate records from dataset1, so dataset2 will not have any duplicate records.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1546 views
  • 0 likes
  • 6 in conversation