BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
David_Billa
Rhodochrosite | Level 12

Can someone help me understand the difference between removing duplicates using noduprecs and select distinct? In below code, why I'm getting different record count?

 

How to write the proc sort with noduprecs in proc sql?

 

148        proc sort data=BOMSING out=BOMSING_no_dup_ps dupout=removed_records_sas NODUPRECS;
149        	BY LVL1_MATRL;
150        run;

NOTE: There were 226526 observations read from the data set WORK.BOMSING.
NOTE: 206 duplicate observations were deleted.
NOTE: The data set WORK.BOMSING_NO_DUP_PS has 226320 observations and 109 variables.

NOTE: The data set WORK.REMOVED_RECORDS_SAS has 206 observations and 109 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.38 seconds
      cpu time            0.39 seconds
      

151        
152        proc sql;
153        	create table work.BOMSING_no_dup_sd as
154        	select distinct * from
155        	BOMSING
156        	order by LVL1_MATRL;
NOTE: Table WORK.BOMSING_NO_DUP_SD created, with 226283 rows and 109 columns.

157        quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.96 seconds
      cpu time            1.47 seconds
      
1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Yes, but it does not do so reliably and consistently. That's why it is no longer documented.

 

To reliably get rid of duplicates, run

proc sort data=data nodupkey;
by _all_;
run;

which is in essence what SQL does when DISTINCT is used.

 

You positively must read the paper (https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf) mentioned in the thread linked by @andreas_lds.

Quote from there:

One thing to beware of with both options is that they both compare the previous observation written to the output data set. So, if the observations that you want eliminated are not adjacent in the data set after the sort, they will not be eliminated.

 

Example code that demonstrates the shortcoming of NODUPREC:

data have;
input key data;
datalines;
1 1
1 2
1 3
1 1
;

proc sort
  data=have
  out=want1
  noduprec
;
by key;
run;

title "SORT with NODUPREC";

proc print data=want1 noobs;
run;

proc sql;
create table want2 as
  select distinct *
  from have
;
quit;

title "SQL with DISTINCT";

proc print data=want2 noobs;
run;

Result:

SORT with NODUPREC
key	data
1	1
1	2
1	3
1	1
SQL with DISTINCT
key	data
1	1
1	2
1	3

View solution in original post

6 REPLIES 6
David_Billa
Rhodochrosite | Level 12

@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql? 

andreas_lds
Jade | Level 19

@David_Billa wrote:

@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql? 


Sorry, can't help you with this task, i hardly use proc sql, because in my eyes sql code is incredible ugly.

David_Billa
Rhodochrosite | Level 12

@Kurt_Bremser Noduprecs will remove the duplicate records based on all by variables which is similar to select distinct. Is my understanding right?

Kurt_Bremser
Super User

Yes, but it does not do so reliably and consistently. That's why it is no longer documented.

 

To reliably get rid of duplicates, run

proc sort data=data nodupkey;
by _all_;
run;

which is in essence what SQL does when DISTINCT is used.

 

You positively must read the paper (https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf) mentioned in the thread linked by @andreas_lds.

Quote from there:

One thing to beware of with both options is that they both compare the previous observation written to the output data set. So, if the observations that you want eliminated are not adjacent in the data set after the sort, they will not be eliminated.

 

Example code that demonstrates the shortcoming of NODUPREC:

data have;
input key data;
datalines;
1 1
1 2
1 3
1 1
;

proc sort
  data=have
  out=want1
  noduprec
;
by key;
run;

title "SORT with NODUPREC";

proc print data=want1 noobs;
run;

proc sql;
create table want2 as
  select distinct *
  from have
;
quit;

title "SQL with DISTINCT";

proc print data=want2 noobs;
run;

Result:

SORT with NODUPREC
key	data
1	1
1	2
1	3
1	1
SQL with DISTINCT
key	data
1	1
1	2
1	3

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 2628 views
  • 6 likes
  • 3 in conversation