Re: Proc Sort data = nodupkey is deleting both records

abqdiane · Posted 03-08-2022 05:42 PM

Hi,

I have the following code

proc sort data= dataset1 out=dataset2 NODUPKEY dupout=duprecords;
by DOB Gender admitdate memberid Facility_2 ;
run;

There are other fields but I want to delete duplicates only based on those.

dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset. I can't figure out why.

I'm using SAS Enterprise Guide 8.3

Thanks,

Diane

Tom · Posted 03-08-2022 06:06 PM

I doubt it isn't working as designed. Perhaps you are not clear of the design?

You should have unique keys in DATASET2. In DUPRECORDS you will have the observations that did not make it into DATASET2. So if you started with 100 observations you will have 100 observations split between the two datasets.

Try replicating the divide yourself using separate sort and data step;

proc sort data= dataset1 out=dataset1_sorted;
  by DOB Gender admitdate memberid Facility_2 ;
run;

data dataset2 duprecords;
  set dataset1_sorted ;
  by DOB Gender admitdate memberid Facility_2 ;
  if first.facility_2 then output dataset2;
  else output duprecords;
run;

abqdiane · Posted 03-08-2022 06:21 PM

Hi,

Thanks for all the quick responses. I figured out my problem. I was checking the result based on a member's last name which she had changed during the time span of the observations.

It's nice to have some new ideas on how to check things.

I appreciate the suggestions.

Diane

mkeintz · Posted 03-08-2022 06:15 PM

You'll have to overcome decades of PROC SORT use for me to accept that NODUPKEY deletes ALL obs with duplicate BY values.

Please show the log of your PROC SORT, and any key values (DOB Gender admitdate memberid Facility_2) in dataset1 that does not appear in dataset2.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 03-08-2022 06:17 PM

Hi @abqdiane ,

@abqdiane wrote:

dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset.

This is the expected behavior of PROC SORT, if further up in dataset1 there's a third record with the same values of the BY variables.

ballardw · Posted 03-08-2022 06:25 PM

You say that you have two duplicate RECORDS, but since you are using NODUPEKEY does another record have the same value for the BY variables but a different value for one or more other not listed variables? The first record with the values of the by variables encountered is usually the one kept.

Sort your data by those variables without the nodupekey and dupout dataset.

Print it

Show use the values of all the variables for the records with that sort by statement that match your "duplicate records"

pink_poodle · Posted 03-08-2022 10:02 PM

DUPOUT scans dataset1 and outputs all duplicate records, so duprecords data set has double records, because record 1 is a dup of record 2, and the reverse is also true. Then NODUPKEY eliminates all the duplicate records from dataset1, so dataset2 will not have any duplicate records.

Proc Sort data = nodupkey is deleting both records