Can someone help me understand the difference between removing duplicates using noduprecs and select distinct? In below code, why I'm getting different record count?
How to write the proc sort with noduprecs in proc sql?
148 proc sort data=BOMSING out=BOMSING_no_dup_ps dupout=removed_records_sas NODUPRECS;
149 BY LVL1_MATRL;
150 run;
NOTE: There were 226526 observations read from the data set WORK.BOMSING.
NOTE: 206 duplicate observations were deleted.
NOTE: The data set WORK.BOMSING_NO_DUP_PS has 226320 observations and 109 variables.
NOTE: The data set WORK.REMOVED_RECORDS_SAS has 206 observations and 109 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.38 seconds
cpu time 0.39 seconds
151
152 proc sql;
153 create table work.BOMSING_no_dup_sd as
154 select distinct * from
155 BOMSING
156 order by LVL1_MATRL;
NOTE: Table WORK.BOMSING_NO_DUP_SD created, with 226283 rows and 109 columns.
157 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.96 seconds
cpu time 1.47 seconds
Yes, but it does not do so reliably and consistently. That's why it is no longer documented.
To reliably get rid of duplicates, run
proc sort data=data nodupkey;
by _all_;
run;
which is in essence what SQL does when DISTINCT is used.
You positively must read the paper (https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf) mentioned in the thread linked by @andreas_lds.
Quote from there:
One thing to beware of with both options is that they both compare the previous observation written to the output data set. So, if the observations that you want eliminated are not adjacent in the data set after the sort, they will not be eliminated.
Example code that demonstrates the shortcoming of NODUPREC:
data have;
input key data;
datalines;
1 1
1 2
1 3
1 1
;
proc sort
data=have
out=want1
noduprec
;
by key;
run;
title "SORT with NODUPREC";
proc print data=want1 noobs;
run;
proc sql;
create table want2 as
select distinct *
from have
;
quit;
title "SQL with DISTINCT";
proc print data=want2 noobs;
run;
Result:
SORT with NODUPREC key data 1 1 1 2 1 3 1 1 SQL with DISTINCT key data 1 1 1 2 1 3
Please have a look at https://communities.sas.com/t5/New-SAS-User/Difference-between-NODUPKEY-and-NODUPRECS/td-p/536246 especially at the comment by @Reeza
@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql?
@David_Billa wrote:
@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql?
Sorry, can't help you with this task, i hardly use proc sql, because in my eyes sql code is incredible ugly.
@David_Billa wrote:
@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql?
You can't. You would have to re-engineer what is, in essence, a bug.
@Kurt_Bremser Noduprecs will remove the duplicate records based on all by variables which is similar to select distinct. Is my understanding right?
Yes, but it does not do so reliably and consistently. That's why it is no longer documented.
To reliably get rid of duplicates, run
proc sort data=data nodupkey;
by _all_;
run;
which is in essence what SQL does when DISTINCT is used.
You positively must read the paper (https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf) mentioned in the thread linked by @andreas_lds.
Quote from there:
One thing to beware of with both options is that they both compare the previous observation written to the output data set. So, if the observations that you want eliminated are not adjacent in the data set after the sort, they will not be eliminated.
Example code that demonstrates the shortcoming of NODUPREC:
data have;
input key data;
datalines;
1 1
1 2
1 3
1 1
;
proc sort
data=have
out=want1
noduprec
;
by key;
run;
title "SORT with NODUPREC";
proc print data=want1 noobs;
run;
proc sql;
create table want2 as
select distinct *
from have
;
quit;
title "SQL with DISTINCT";
proc print data=want2 noobs;
run;
Result:
SORT with NODUPREC key data 1 1 1 2 1 3 1 1 SQL with DISTINCT key data 1 1 1 2 1 3
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.