Solved: Re: Remove duplicate records using noduprecs and select distinct

David_Billa · Posted 07-07-2022 02:30 AM

Can someone help me understand the difference between removing duplicates using noduprecs and select distinct? In below code, why I'm getting different record count?

How to write the proc sort with noduprecs in proc sql?

148        proc sort data=BOMSING out=BOMSING_no_dup_ps dupout=removed_records_sas NODUPRECS;
149        	BY LVL1_MATRL;
150        run;

NOTE: There were 226526 observations read from the data set WORK.BOMSING.
NOTE: 206 duplicate observations were deleted.
NOTE: The data set WORK.BOMSING_NO_DUP_PS has 226320 observations and 109 variables.

NOTE: The data set WORK.REMOVED_RECORDS_SAS has 206 observations and 109 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.38 seconds
      cpu time            0.39 seconds
      

151        
152        proc sql;
153        	create table work.BOMSING_no_dup_sd as
154        	select distinct * from
155        	BOMSING
156        	order by LVL1_MATRL;
NOTE: Table WORK.BOMSING_NO_DUP_SD created, with 226283 rows and 109 columns.

157        quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.96 seconds
      cpu time            1.47 seconds

Kurt_Bremser · Posted 07-07-2022 04:09 AM

Yes, but it does not do so reliably and consistently. That's why it is no longer documented.

To reliably get rid of duplicates, run

proc sort data=data nodupkey;
by _all_;
run;

which is in essence what SQL does when DISTINCT is used.

You positively must read the paper (https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf) mentioned in the thread linked by @andreas_lds.

Quote from there:

One thing to beware of with both options is that they both compare the previous observation written to the output data set. So, if the observations that you want eliminated are not adjacent in the data set after the sort, they will not be eliminated.

Example code that demonstrates the shortcoming of NODUPREC:

data have;
input key data;
datalines;
1 1
1 2
1 3
1 1
;

proc sort
  data=have
  out=want1
  noduprec
;
by key;
run;

title "SORT with NODUPREC";

proc print data=want1 noobs;
run;

proc sql;
create table want2 as
  select distinct *
  from have
;
quit;

title "SQL with DISTINCT";

proc print data=want2 noobs;
run;

Result:

SORT with NODUPREC
key	data
1	1
1	2
1	3
1	1
SQL with DISTINCT
key	data
1	1
1	2
1	3

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

andreas_lds · Posted 07-07-2022 02:48 AM

Please have a look at https://communities.sas.com/t5/New-SAS-User/Difference-between-NODUPKEY-and-NODUPRECS/td-p/536246 especially at the comment by @Reeza

David_Billa · Posted 07-07-2022 03:02 AM

@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql?

andreas_lds · Posted 07-07-2022 03:05 AM

@David_Billa wrote:

@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql?

Sorry, can't help you with this task, i hardly use proc sql, because in my eyes sql code is incredible ugly.

Kurt_Bremser · Posted 07-07-2022 03:09 AM

@David_Billa wrote:

@andreas_lds I already gone through that article and comments. My Question is how to write the noduprecs in proc sql?

You can't. You would have to re-engineer what is, in essence, a bug.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

David_Billa · Posted 07-07-2022 03:45 AM

@Kurt_Bremser Noduprecs will remove the duplicate records based on all by variables which is similar to select distinct. Is my understanding right?

Kurt_Bremser · Posted 07-07-2022 04:09 AM

Yes, but it does not do so reliably and consistently. That's why it is no longer documented.

To reliably get rid of duplicates, run

proc sort data=data nodupkey;
by _all_;
run;

which is in essence what SQL does when DISTINCT is used.

You positively must read the paper (https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf) mentioned in the thread linked by @andreas_lds.

Quote from there:

One thing to beware of with both options is that they both compare the previous observation written to the output data set. So, if the observations that you want eliminated are not adjacent in the data set after the sort, they will not be eliminated.

Example code that demonstrates the shortcoming of NODUPREC:

data have;
input key data;
datalines;
1 1
1 2
1 3
1 1
;

proc sort
  data=have
  out=want1
  noduprec
;
by key;
run;

title "SORT with NODUPREC";

proc print data=want1 noobs;
run;

proc sql;
create table want2 as
  select distinct *
  from have
;
quit;

title "SQL with DISTINCT";

proc print data=want2 noobs;
run;

Result:

SORT with NODUPREC
key	data
1	1
1	2
1	3
1	1
SQL with DISTINCT
key	data
1	1
1	2
1	3

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Registration is open

SAS Training: Just a Click Away