DATA Step, Macro, Functions and more

question about selectively removing duplicates

Contributor
Posts: 47

question about selectively removing duplicates

Hello,

 

I'm using SAS 9.4. I'm trying to selectively remove duplicate observations, but I'm not sure how to tell SAS which rows to keep. For example, given the two rows below, I'd like to keep the one that has a value in the State field. My code is below and it's not working. Thanks for the help!

 

CASE_ID    State
1          MA
1

 

proc sort data=casemergedata out=newdata nodupkey;
  by CASE_ID;
run;

proc freq data=newdata;
  tables CASE_CONTROL;
run;

  


Accepted Solutions
Solution
01-21-2018 09:46 AM
Super User
Posts: 8,220

Re: question about selectively removing duplicates

The way I like to do things like that is to use two sorts. First one, to order the records so that the record with the maximum value of state comes first within each id. i.e.:

proc sort data=casemergedata out=newdata;
  by CASE_ID descending state;
run;

Then a second sort to get rid of the duplicate caseids. i.e.:

proc sort data=newdata nodupkey;
  by CASE_ID;
run;

Art, CEO, AnalystFinder.com

 



All Replies
Super User
Posts: 2,516

Re: question about selectively removing duplicates


Are you only looking at one variable?

In that case just also sort by descending STATE.
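A minimal sketch of that suggestion, using the data set and variable names from the original post (the descending sort on State puts the populated row first within each CASE_ID, and a follow-up NODUPKEY sort then keeps it):

```sas
/* Put the row with a non-missing State first within each CASE_ID */
proc sort data=casemergedata out=newdata;
  by CASE_ID descending State;
run;

/* Keep only the first row of each CASE_ID */
proc sort data=newdata nodupkey;
  by CASE_ID;
run;
```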

 

PROC Star
Posts: 1,410

Re: question about selectively removing duplicates

data have;
  infile datalines missover;
  input CASE_ID State $;
  datalines;
1 MA
1
;

proc sort data=have;
  by CASE_ID descending State;
run;

proc sort data=have out=want nodupkey;
  by CASE_ID;
run;

 

Super User
Posts: 2,516

Re: question about selectively removing duplicates

You can add the presorted option to the second sort to speed things up.
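As a sketch, the suggestion would add PRESORTED to the second sort of the accepted answer (note that later replies in this thread find PRESORTED has no effect when combined with NODUPKEY):

```sas
/* Input is already in CASE_ID order after the first sort, so
   PRESORTED should let PROC SORT verify the order rather than
   re-sort while NODUPKEY removes the duplicate CASE_IDs */
proc sort data=newdata nodupkey presorted;
  by CASE_ID;
run;
```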
Respected Advisor
Posts: 4,802

Re: question about selectively removing duplicates


@ChrisNZ

I really liked your presorted proposal. Unfortunately this option doesn't appear to have an effect if combined with nodupkey.

 

28         proc sort data=sashelp.class out=work.class(sortedby=age);
29           by age name;
30         run;
         
33         proc sort data=work.class out=test1 presorted nodupkey;
34           by age;
35         run;

NOTE: Input data set is not in sorted order.
NOTE: ...
         
37         proc sort data=work.class out=test1 presorted;
38           by age;
39         run;
NOTE: Sort order of input data set has been verified.
NOTE: Input data set is already sorted; it has been copied to the output data set.
NOTE: ...

 

Super User
Posts: 2,516

Re: question about selectively removing duplicates

@Patrick 

There is no reason PRESORTED should not work with NODUPKEY.

Reporting this as a bug now.

Super User
Posts: 2,516

Re: question about selectively removing duplicates

@Patrick @art297 Update: This is not a bug; it is a planned shortcoming.

 

Option presorted is only there to set flags (that's a very expensive flag when you copy a whole table to set it!).

 

If any processing is required (such as removing duplicates) this option cannot be used.

 

What a wasted opportunity!

 

It is all described here in its flawed glory.

 

All the more reason to avoid such wastage by creating a new data step option.

 

 

I have asked for the log NOTE and the presorted documentation to be corrected. 

Super User
Posts: 8,220

Re: question about selectively removing duplicates

@ChrisNZ: Thanks for looking into it! I suppose the explanation made sense to somebody (at SAS) but I, for one, don't understand what benefit they gained by excluding PRESORTED when NODUPKEY is used.

 

Art, CEO, AnalystFinder.com

Super User
Posts: 2,516

Re: question about selectively removing duplicates

[ Edited ]

@art297 This option only helps to set metadata flags. So it's not really about excluding other options.

If you run 

proc sort data=SASHELP.CLASS out=CLASS presorted nodupkey danish;
  by NAME;
run;

then option PRESORTED will trigger the validation of the NODUPKEY and DANISH options, and add the corresponding metadata.

 

In this case PROC CONTENTS shows the updated values for flags SORTED, NODUPKEY and COLLATE.
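A quick way to look at those flags is a plain PROC CONTENTS run (a sketch; the Sort Information section of the output lists Sortedby, Validated, and related sort metadata):

```sas
/* Inspect the sort metadata that PRESORTED validated and set */
proc contents data=CLASS;
run;
```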

 

That's all PRESORTED does: set the metadata. This option excludes any processing. What a wasted opportunity.

Super User
Posts: 8,220

Re: question about selectively removing duplicates

@ChrisNZ: Not totally true. Take a look at the log after running the following, particularly the log for the last two sorts.

The PRESORTED option cuts the sort time by about two-thirds.

 

data have;
  set sashelp.class;
  do i=1 to 100000;
    output;
  end;
run;

proc sort data=have;
  by sex;
run;

data need;
  set have;
run;
 
proc sort data=need out=want1;
  by sex;
run;

proc sort data=need out=want2 presorted;
  by sex;
run;

Art, CEO, AnalystFinder.com

 

Super User
Posts: 2,516

Re: question about selectively removing duplicates

Yes PRESORTED avoids sorting, and just copies the table instead, thereby saving time.

But that's only true if no other processing, such as removing duplicates, is required.

That's where the wasted opportunity is: why not allow that removal? A simple sequential read would then suffice, instead of sorting.
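That sequential-read deduplication is already easy to express in a DATA step, assuming the data are sorted by CASE_ID with the preferred row first (as after the first sort in the accepted answer):

```sas
/* One sequential pass: keep only the first row of each CASE_ID group */
data want;
  set newdata;
  by CASE_ID;
  if first.CASE_ID;
run;
```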

 

 

Note that PRESORTED is even more efficient if you don't create another table.  

proc sort data=HAVE presorted; by NAME; run;

data _null_; set HAVE; by NAME; run;

 

Both these steps take about the same time, as no data is copied and the data set is just read sequentially.

 

I am not saying this is a useless option, quite the contrary.

I am saying it could easily have been more useful.

As in the case of this thread's question.

Super User
Posts: 8,220

Re: question about selectively removing duplicates

Then we totally agree!

 

Art, CEO, AnalystFinder.com

 

Super User
Posts: 2,516

Re: question about selectively removing duplicates

@art297 Worth noting, as it is the first time I have received this feedback after reporting countless issues:

 

The developer will: 

  1. Look at changing the wording of the Notes generated with the various options in PROC SORT  
  2. Work with Publications to ensure that the documentation is updated to clarify everything related to the options used and the effect they have on each other

So it seems this one will not be shelved! Yay!

 

Super User
Posts: 8,220

Re: question about selectively removing duplicates

@ChrisNZ: What's the saying? If at first you don't succeed, try, try again?

 

Hopefully, in trying to explain it, they'll realize that the option should simply perform as a typical user would expect!

 

Art, CEO, AnalystFinder.com

 

☑ This topic is solved.


Discussion stats
  • 18 replies
  • 402 views
  • 7 likes
  • 7 in conversation