BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Hi,

Can any one please let me know the difference between NoDup and NoDupKey in PROC SORT?

Thanks,

1 ACCEPTED SOLUTION

Accepted Solutions
KevinQin
Obsidian | Level 7

NODUP will delete duplicated observations (records that are identical to each other) while NODUPKEY will delete those observations that have duplicate BY values (the sort BY variables that you name in PROC SORT).

 

From @Cynthia_sas:

 

Here are just a few of the papers and Tech Support notes on the subject:

Also: 

To find these, I went to support.sas.com and entered
  NODUP NODUPKEY
in the search box.

View solution in original post

10 REPLIES 10
KevinQin
Obsidian | Level 7

NODUP will delete duplicated observations (records that are identical to each other) while NODUPKEY will delete those observations that have duplicate BY values (the sort BY variables that you name in PROC SORT).

 

From @Cynthia_sas:

 

Here are just a few of the papers and Tech Support notes on the subject:

Also: 

To find these, I went to support.sas.com and entered
  NODUP NODUPKEY
in the search box.

deleted_user
Not applicable
Hi Kevin,

Thank you for the info.

Regards,
Kamal.
bbenbaruch
Quartz | Level 8

HOWEVER, NODUP (or NODUPRECS) may mean two different things depending upon your options settings.

OPTIONS SORTDUP=  has two options: PHYSICAL and LOGICAL.  They perform differently. 

 

PHYSICAL removes duplicates based on all the variables that are present in the data set. This is the default.
LOGICAL removes duplicates based on only the variables remaining after the DROP= and KEEP= data set options are processed.

 

The SORTDUP= option specifies what variables to sort to remove duplicate observations when the SORT procedure NODUPRECS option is specified.

When SORTDUP= is set to LOGICAL and NODUPRECS is specified in the SORT procedure, duplicate observations are removed based on the variables that remain after a DROP or KEEP operation on the input data set. Setting SORTDUP=LOGICAL increases the number of duplicate observations that are removed because it eliminates variables before observations are compared. Setting SORTDUP=LOGICAL might improve performance.  [This is a very awkward explanation from SAS documentation.  LOGICAL may eliiminate more observations because observations are deduped AFTER variables have been eliminated.]

When SORTDUP= is set to PHYSICAL and NODUPRECS is specified in the SORT procedure, duplicate observations are removed based on all of the variables in the input data set.

 

The SORTDUP option does not affect the behavior of NODUPKEY in the SORT procedure.  With NODUPKEY you specify the variables to be used to elimnate duplicates.

 

ALSO note that when you use EITHER the NODUP/NODUPRECS or the NODUPKEY modifying option, the "first" record in the sort order is maintained and all subsequent observations are eliminated.  Therefore the order in which variables are sorted will make a difference.

 

Example:

OPTIONS SORTDUP=PHYSICAL ;

PROC SORT DATA=dataset NODUPRECS ; BY ID LAST_NAME FIRST_NAME DATE ; RUN ;

     *I sort the data and all duplicate records based on all variables are eliminated.

PROC SORT DATA=dataset NODUPKEY ; BY ID LAST_NAME FIRST_NAME ; RUN ;

     *The data are not really resorted this time, but they are "deduped" so that I have only 1 record per last-name/first-name AND I retain the record with the earliest date.

 

But maybe I wanted the latest record, i.e. the record for each person with the most recent date.  Then I would have to do the following:

PROC SORT DATA=dataset NODUPRECS ; BY ID LAST_NAME FIRST_NAME  DATE  DESCENDING ; RUN ;

PROC SORT DATA=dataset NODUPKEY ; BY ID LAST_NAME FIRST_NAME ; RUN ;

     Because I sorted the data so that the most recent date comes first in the sort order, when I "dedupe" I retain the record with the msot recent date.

 

Cynthia_sas
SAS Super FREQ
Hi:
Here are just a few of the papers and Tech Support notes on the subject:
http://support.sas.com/kb/1/566.html
http://support.sas.com/kb/22/382.html
http://www2.sas.com/proceedings/forum2007/069-2007.pdf
http://www2.sas.com/proceedings/sugi30/037-30.pdf
http://www2.sas.com/proceedings/sugi31/164-31.pdf

To find these, I went to support.sas.com and entered
NODUP NODUPKEY
in the search box.

cynthia
deleted_user
Not applicable
Hi Cynthia,

Thank you for your help. Given me a valuable information with examples.

Kind Regards,
Kamal.
Tom
Super User Tom
Super User

No one seems to have mentioned the biggest difference (although it probably is in one of the articles linked by Cynthia). 

The NODUP option will sometimes NOT remove all of the duplicates.

This is because it compares ADJACENT records and only keeps the first.  But if there are two duplicate records separated from each other by one or more records that are different then both are kept.  Try this simple example.

data have ;
  input a b ;
cards;
1 1
1 2
1 1
2 1
;

proc sort data=have nodup out=nodup ;
  by a ;
run;

proc sort data=have nodupkey out=nodupkey;
  by a ;
run;

The NODUP output dataset will have all four observations since each observation is different than the one next to it.  The NODUPKEY output dataset will have only two observations since there are only two distinct values of A. 

bbenbaruch
Quartz | Level 8

Tom makes a good point about the way NODUP works.  The key is to remember that when you do deduping you may have forgotten a detail.

 

Look at Tom's code (which illustrates his point very well) and my code above which solved the problem of deduping without pointing out the details that Tom stressed.

 

Tom's structure is to use similar SORT procedures, the first time using NODUP and the second time using NODUPKEY:

     proc sort data=data NODUP ;         by A  ; run ;

     proc sort data=data NODUPKEY ;  by A  ; run ;

My structure was slightly -- but significantly different for precisely the reason Tom pointed out:

     proc sort data=data NODUP ;         by A B C ; run ;

     proc sort data=data NODUPKEY ;  by A B    ; run ;

 

     Adding the 2nd variable in the first sort makes no difference in what records are removed.  But it does sort the variables A and B and C so that I can choose which record I want to keep.  I sorted on more variables than I was interested in deduping in the first sort.

 

If I want to dedupe on ALL variables, then I have to do one of the following:

1) Sort on all of the variables.

2) Use PROC SQL ; SELECT DISTINCT * FROM DATA ; QUIT;

    or

    OPTIONS SORTDUP=LOGICAL;

    PROC SORT DATA=data NODUP ; BY _ALL_ ; RUN ;

    /* and if my I want SORTDUP=PHYSICAL to be my default, then reset it */

   PROC OPTIONS SORTDUP=PHYSICAL ;

 

 

SJN
Fluorite | Level 6 SJN
Fluorite | Level 6

Hi All,

 

I have one more question on sorting.

 

data want;
input a b c;
cards;
1 2 3
4 5 6
1 2 3
1 4 5
4 3 2
;
run;

proc sort data=want nodup out= have;
by a ;
run;

I have 5 observations and when I sort I get only 4 observations only. Can any one explain why there is decrease in row count

bbenbaruch
Quartz | Level 8

Both NODUP/NODUPRECS and NODUPKEY delete duplicates.

If you don't want duplicate records deleted, don't ask SAS to delete records.

 

NODUP (aka NODUPRECS) deletes dulicate records.  

NODUPKEY deletes records based on the BY variables.  If two records have the same values on all of the BY variables only one record will be kept even if these records have different values on other variables.  If it matters to you which record is kept, then you need to do some additional processing before running PROC SORT.

 

(What constitutes a duplicate record depends upon the SORTDUP system options setting.. SORTDUP specifies whether the SORT procedure removes duplicate variables based on all variables in a data set or the variables that remain after the DROP or KEEP data set options have been applied.)

 

data want;
input a b c;
cards;
1 2 3
4 5 6
1 2 3
1 4 5
4 3 2
;
run;

 

proc sort data=want out= have;
by a ;
run;
*NOTE: There were 5 observations read from the data set WORK.WANT.
NOTE: The data set WORK.HAVE has 5 observations and 3 variables. ;

 

proc sort data=want nodupkey out= have;
by a ;
run;
*NOTE: There were 5 observations read from the data set WORK.WANT.
NOTE: 3 observations with duplicate key values were deleted.
NOTE: The data set WORK.HAVE has 2 observations and 3 variables.;

 

proc sort data=want nodup out= have;
by a ;
run;
*NOTE: There were 5 observations read from the data set WORK.WANT.
NOTE: 1 duplicate observations were deleted.
NOTE: The data set WORK.HAVE has 4 observations and 3 variables. ;

Tom
Super User Tom
Super User

There is a decrease in the row count because your first and third rows are the same and you told PROC SORT to delete duplicate rows.


@SJN wrote:

Hi All,

 

I have one more question on sorting.

 

data want;
input a b c;
cards;
1 2 3
4 5 6
1 2 3
1 4 5
4 3 2
;
run;

proc sort data=want nodup out= have;
by a ;
run;

I have 5 observations and when I sort I get only 4 observations only. Can any one explain why there is decrease in row count


 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 176664 views
  • 18 likes
  • 6 in conversation