- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have the following code
proc sort data= dataset1 out=dataset2 NODUPKEY dupout=duprecords;
by DOB Gender admitdate memberid Facility_2 ;
run;
There are other fields but I want to delete duplicates only based on those.
dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset. I can't figure out why.
I'm using SAS Enterprise Guide 8.3
Thanks,
Diane
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I doubt it isn't working as designed. Perhaps you are not clear of the design?
You should have unique keys in DATASET2. In DUPRECORDS you will have the observations that did not make it into DATASET2. So if you started with 100 observations you will have 100 observations split between the two datasets.
Try replicating the divide yourself using separate sort and data step;
proc sort data= dataset1 out=dataset1_sorted;
by DOB Gender admitdate memberid Facility_2 ;
run;
data dataset2 duprecords;
set dataset1_sorted ;
by DOB Gender admitdate memberid Facility_2 ;
if first.facility_2 then output dataset2;
else output duprecords;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Thanks for all the quick responses. I figured out my problem. I was checking the result based on a member's last name which she had changed during the time span of the observations.
It's nice to have some new ideas on how to check things.
I appreciate the suggestions.
Diane
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You'll have to overcome decades of PROC SORT use for me to accept that NODUPKEY deletes ALL obs with duplicate BY values.
Please show the log of your PROC SORT, and any key values (DOB Gender admitdate memberid Facility_2) in dataset1 that does not appear in dataset2.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @abqdiane ,
@abqdiane wrote:
dataset1 has two identical records; instead of leaving one of the records in dataset2 and putting the other in duprecords, both records are in the duprecords dataset.
This is the expected behavior of PROC SORT, if further up in dataset1 there's a third record with the same values of the BY variables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You say that you have two duplicate RECORDS, but since you are using NODUPEKEY does another record have the same value for the BY variables but a different value for one or more other not listed variables? The first record with the values of the by variables encountered is usually the one kept.
Sort your data by those variables without the nodupekey and dupout dataset.
Print it
Show use the values of all the variables for the records with that sort by statement that match your "duplicate records"
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content