Hi,
I have the following code
proc sort data= dataset1 out=dataset2 NODUPKEY dupout=duprecords;
by DOB Gender admitdate memberid Facility_2 ;
run;
There are other fields but I want to delete duplicates only based on those.
dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset. I can't figure out why.
I'm using SAS Enterprise Guide 8.3
Thanks,
Diane
I doubt it isn't working as designed. Perhaps you are not clear of the design?
You should have unique keys in DATASET2. In DUPRECORDS you will have the observations that did not make it into DATASET2. So if you started with 100 observations you will have 100 observations split between the two datasets.
Try replicating the divide yourself using separate sort and data step;
proc sort data= dataset1 out=dataset1_sorted;
by DOB Gender admitdate memberid Facility_2 ;
run;
data dataset2 duprecords;
set dataset1_sorted ;
by DOB Gender admitdate memberid Facility_2 ;
if first.facility_2 then output dataset2;
else output duprecords;
run;
Hi,
Thanks for all the quick responses. I figured out my problem. I was checking the result based on a member's last name which she had changed during the time span of the observations.
It's nice to have some new ideas on how to check things.
I appreciate the suggestions.
Diane
You'll have to overcome decades of PROC SORT use for me to accept that NODUPKEY deletes ALL obs with duplicate BY values.
Please show the log of your PROC SORT, and any key values (DOB Gender admitdate memberid Facility_2) in dataset1 that does not appear in dataset2.
Hi @abqdiane ,
@abqdiane wrote:
dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset.
This is the expected behavior of PROC SORT, if further up in dataset1 there's a third record with the same values of the BY variables.
You say that you have two duplicate RECORDS, but since you are using NODUPEKEY does another record have the same value for the BY variables but a different value for one or more other not listed variables? The first record with the values of the by variables encountered is usually the one kept.
Sort your data by those variables without the nodupekey and dupout dataset.
Print it
Show use the values of all the variables for the records with that sort by statement that match your "duplicate records"
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.