DATA Step, Macro, Functions and more

Difference between NOdup and NoDupkey..??

Accepted Solution Solved
Reply
N/A
Posts: 0
Accepted Solution

Difference between NOdup and NoDupkey..??

[ Edited ]

Hi,

Can any one please let me know the difference between NoDup and NoDupKey in PROC SORT?

Thanks,


Accepted Solutions
Solution
‎10-31-2016 01:06 PM
Contributor
Posts: 43

Re: Difference between NOdup and NoDupkey..??

[ Edited ]

NODUP will delete duplicated observations (records that are identical to each other) while NODUPKEY will delete those observations that have duplicate BY values (the sort BY variables that you name in PROC SORT).

 

From @Cynthia_sas:

 

Here are just a few of the papers and Tech Support notes on the subject:

Also: 

To find these, I went to support.sas.com and entered
  NODUP NODUPKEY
in the search box.

View solution in original post


All Replies
Solution
‎10-31-2016 01:06 PM
Contributor
Posts: 43

Re: Difference between NOdup and NoDupkey..??

[ Edited ]

NODUP will delete duplicated observations (records that are identical to each other) while NODUPKEY will delete those observations that have duplicate BY values (the sort BY variables that you name in PROC SORT).

 

From @Cynthia_sas:

 

Here are just a few of the papers and Tech Support notes on the subject:

Also: 

To find these, I went to support.sas.com and entered
  NODUP NODUPKEY
in the search box.

N/A
Posts: 0

Re: Difference between NOdup and NoDupkey..??

Hi Kevin,

Thank you for the info.

Regards,
Kamal.
Contributor
Posts: 42

Re: Difference between NOdup and NoDupkey..??

When I have an issue with a SAS option, I usually take a data set from the SASHELP library and experiment a little. These are great datasets because everybody has access to them.

People are here to be helpful, and they can be more helpful if you showed some initiative.
.
Occasional Contributor
Posts: 17

Re: Difference between NOdup and NoDupkey..??

HOWEVER, NODUP (or NODUPRECS) may mean two different things depending upon your options settings.

OPTIONS SORTDUP=  has two options: PHYSICAL and LOGICAL.  They perform differently. 

 

PHYSICAL removes duplicates based on all the variables that are present in the data set. This is the default.
LOGICAL removes duplicates based on only the variables remaining after the DROP= and KEEP= data set options are processed.

 

The SORTDUP= option specifies what variables to sort to remove duplicate observations when the SORT procedure NODUPRECS option is specified.

When SORTDUP= is set to LOGICAL and NODUPRECS is specified in the SORT procedure, duplicate observations are removed based on the variables that remain after a DROP or KEEP operation on the input data set. Setting SORTDUP=LOGICAL increases the number of duplicate observations that are removed because it eliminates variables before observations are compared. Setting SORTDUP=LOGICAL might improve performance.  [This is a very awkward explanation from SAS documentation.  LOGICAL may eliiminate more observations because observations are deduped AFTER variables have been eliminated.]

When SORTDUP= is set to PHYSICAL and NODUPRECS is specified in the SORT procedure, duplicate observations are removed based on all of the variables in the input data set.

 

The SORTDUP option does not affect the behavior of NODUPKEY in the SORT procedure.  With NODUPKEY you specify the variables to be used to elimnate duplicates.

 

ALSO note that when you use EITHER the NODUP/NODUPRECS or the NODUPKEY modifying option, the "first" record in the sort order is maintained and all subsequent observations are eliminated.  Therefore the order in which variables are sorted will make a difference.

 

Example:

OPTIONS SORTDUP=PHYSICAL ;

PROC SORT DATA=dataset NODUPRECS ; BY ID LAST_NAME FIRST_NAME DATE ; RUN ;

     *I sort the data and all duplicate records based on all variables are eliminated.

PROC SORT DATA=dataset NODUPKEY ; BY ID LAST_NAME FIRST_NAME ; RUN ;

     *The data are not really resorted this time, but they are "deduped" so that I have only 1 record per last-name/first-name AND I retain the record with the earliest date.

 

But maybe I wanted the latest record, i.e. the record for each person with the most recent date.  Then I would have to do the following:

PROC SORT DATA=dataset NODUPRECS ; BY ID LAST_NAME FIRST_NAME  DATE  DESCENDING ; RUN ;

PROC SORT DATA=dataset NODUPKEY ; BY ID LAST_NAME FIRST_NAME ; RUN ;

     Because I sorted the data so that the most recent date comes first in the sort order, when I "dedupe" I retain the record with the msot recent date.

 

SAS Super FREQ
Posts: 8,742

Re: Difference between NOdup and NoDupkey..??

Hi:
Here are just a few of the papers and Tech Support notes on the subject:
http://support.sas.com/kb/1/566.html
http://support.sas.com/kb/22/382.html
http://www2.sas.com/proceedings/forum2007/069-2007.pdf
http://www2.sas.com/proceedings/sugi30/037-30.pdf
http://www2.sas.com/proceedings/sugi31/164-31.pdf

To find these, I went to support.sas.com and entered
NODUP NODUPKEY
in the search box.

cynthia
N/A
Posts: 0

Re: Difference between NOdup and NoDupkey..??

Hi Cynthia,

Thank you for your help. Given me a valuable information with examples.

Kind Regards,
Kamal.
Super User
Super User
Posts: 6,499

Re: Difference between NOdup and NoDupkey..??

[ Edited ]

No one seems to have mentioned the biggest difference (although it probably is in one of the articles linked by Cynthia). 

The NODUP option will sometimes NOT remove all of the duplicates.

This is because it compares ADJACENT records and only keeps the first.  But if there are two duplicate records separated from each other by one or more records that are different then both are kept.  Try this simple example.

data have ;
  input a b ;
cards;
1 1
1 2
1 1
2 1
;

proc sort data=have nodup out=nodup ;
  by a ;
run;

proc sort data=have nodupkey out=nodupkey;
  by a ;
run;

The NODUP output dataset will have all four observations since each observation is different than the one next to it.  The NODUPKEY output dataset will have only two observations since there are only two distinct values of A. 

Occasional Contributor
Posts: 17

Re: Difference between NOdup and NoDupkey..??

Tom makes a good point about the way NODUP works.  The key is to remember that when you do deduping you may have forgotten a detail.

 

Look at Tom's code (which illustrates his point very well) and my code above which solved the problem of deduping without pointing out the details that Tom stressed.

 

Tom's structure is to use similar SORT procedures, the first time using NODUP and the second time using NODUPKEY:

     proc sort data=data NODUP ;         by A  ; run ;

     proc sort data=data NODUPKEY ;  by A  ; run ;

My structure was slightly -- but significantly different for precisely the reason Tom pointed out:

     proc sort data=data NODUP ;         by A B C ; run ;

     proc sort data=data NODUPKEY ;  by A B    ; run ;

 

     Adding the 2nd variable in the first sort makes no difference in what records are removed.  But it does sort the variables A and B and C so that I can choose which record I want to keep.  I sorted on more variables than I was interested in deduping in the first sort.

 

If I want to dedupe on ALL variables, then I have to do one of the following:

1) Sort on all of the variables.

2) Use PROC SQL ; SELECT DISTINCT * FROM DATA ; QUIT;

    or

    OPTIONS SORTDUP=LOGICAL;

    PROC SORT DATA=data NODUP ; BY _ALL_ ; RUN ;

    /* and if my I want SORTDUP=PHYSICAL to be my default, then reset it */

   PROC OPTIONS SORTDUP=PHYSICAL ;

 

 

Contributor SJN
Contributor
Posts: 21

Re: Difference between NOdup and NoDupkey..??

Hi All,

 

I have one more question on sorting.

 

data want;
input a b c;
cards;
1 2 3
4 5 6
1 2 3
1 4 5
4 3 2
;
run;

proc sort data=want nodup out= have;
by a ;
run;

I have 5 observations and when I sort I get only 4 observations only. Can any one explain why there is decrease in row count

Occasional Contributor
Posts: 17

Re: Difference between NOdup and NoDupkey..??

Both NODUP/NODUPRECS and NODUPKEY delete duplicates.

If you don't want duplicate records deleted, don't ask SAS to delete records.

 

NODUP (aka NODUPRECS) deletes dulicate records.  

NODUPKEY deletes records based on the BY variables.  If two records have the same values on all of the BY variables only one record will be kept even if these records have different values on other variables.  If it matters to you which record is kept, then you need to do some additional processing before running PROC SORT.

 

(What constitutes a duplicate record depends upon the SORTDUP system options setting.. SORTDUP specifies whether the SORT procedure removes duplicate variables based on all variables in a data set or the variables that remain after the DROP or KEEP data set options have been applied.)

 

data want;
input a b c;
cards;
1 2 3
4 5 6
1 2 3
1 4 5
4 3 2
;
run;

 

proc sort data=want out= have;
by a ;
run;
*NOTE: There were 5 observations read from the data set WORK.WANT.
NOTE: The data set WORK.HAVE has 5 observations and 3 variables. ;

 

proc sort data=want nodupkey out= have;
by a ;
run;
*NOTE: There were 5 observations read from the data set WORK.WANT.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.HAVE has 2 observations and 3 variables.;

 

proc sort data=want nodup out= have;
by a ;
run;
*NOTE: There were 5 observations read from the data set WORK.WANT.
NOTE: 1 duplicate observations were deleted.
NOTE: The data set WORK.HAVE has 4 observations and 3 variables. ;

Super User
Super User
Posts: 6,499

Re: Difference between NOdup and NoDupkey..??

[ Edited ]

There is a decrease in the row count because your first and third rows are the same and you told PROC SORT to delete duplicate rows.


SJN wrote:

Hi All,

 

I have one more question on sorting.

 

data want;
input a b c;
cards;
1 2 3
4 5 6
1 2 3
1 4 5
4 3 2
;
run;

proc sort data=want nodup out= have;
by a ;
run;

I have 5 observations and when I sort I get only 4 observations only. Can any one explain why there is decrease in row count


 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 50352 views
  • 8 likes
  • 7 in conversation