BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lmyers2
Obsidian | Level 7

Hello,

 

I'm using SAS 9.4 edition. I'm trying to selectively remove duplicate observations but I'm not sure how to tell SAS to keep certain rows. I'd like to keep the row with more data in the state field, for example. My code is below and it's not working. Thanks for the help!

 

CASE_ID    State

1                 MA   

1              

 

proc sort data=casemergedata out=newdata nodupkey; by CASE_ID; run;
proc freq data=newdata; tables CASE_CONTROL; run;

  

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

The way I like to do things like that is to use two sorts. First one, to order the records so that the record with the maximum value of state comes first within each id. i.e.:

proc sort data=casemergedata out=newdata;
  by CASE_ID descending state;
run;

Then a second sort to get rid of the duplicate caseids. i.e.:

proc sort data=newdata nodupkey;
  by CASE_ID;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post

18 REPLIES 18
ChrisNZ
Tourmaline | Level 20

Are you only looking at one variable?

In that case just also sort by descending STATE.

 

PeterClemmensen
Tourmaline | Level 20
data have;
input CASE_ID State$;
infile datalines missover;
datalines;
1 MA   
1  
; 

proc sort data=have out=want nodupkey;
   by descending CASE_ID;
run;
art297
Opal | Level 21

The way I like to do things like that is to use two sorts. First one, to order the records so that the record with the maximum value of state comes first within each id. i.e.:

proc sort data=casemergedata out=newdata;
  by CASE_ID descending state;
run;

Then a second sort to get rid of the duplicate caseids. i.e.:

proc sort data=newdata nodupkey;
  by CASE_ID;
run;

Art, CEO, AnalystFinder.com

 

ChrisNZ
Tourmaline | Level 20
You can add the presorted option to the second sort to speed things up.
Patrick
Opal | Level 21

@ChrisNZ

I really liked your presorted proposal. Unfortunately this option doesn't appear to have an effect if combined with nodupkey.

 

28         proc sort data=sashelp.class out=work.class(sortedby=age);
29           by age name;
30         run;
         
33         proc sort data=work.class out=test1 presorted nodupkey;
34           by age;
35         run;

NOTE: Input data set is not in sorted order.
NOTE: ...
         
37         proc sort data=work.class out=test1 presorted;
38           by age;
39         run;
NOTE: Sort order of input data set has been verified.
NOTE: Input data set is already sorted; it has been copied to the output data set.
NOTE: ...

 

ChrisNZ
Tourmaline | Level 20

@Patrick 

There is no reason presorted  should not work with nodupkey.

Reporting this as a bug now.

ChrisNZ
Tourmaline | Level 20

@Patrick@art297 Update: This is not a bug, it is a planned shortcoming.

 

Option presorted is only there to set flags (that's a very expensive flag when you copy a whole table to set it!).

 

If any processing is required (such as removing duplicates) this option cannot be used.

 

What a wasted opportunity!

 

It is all described here is its flawed glory.

 

All the more reason to avoid such wastage by creating a new data step option.

 

 

I have asked for the log NOTE and the presorted documentation to be corrected. 

 

 

 

 

 

art297
Opal | Level 21

@ChrisNZ: Thanks for looking into it! I suppose the explanation made sense to somebody (at SAS) but I, for one, don't understand what benefit they gained by excluding PRESORTED when NODUPKEY is used.

 

Art, CEO, AnalystFinder.com

ChrisNZ
Tourmaline | Level 20

@art297 This option only helps to set metadata flags. So it's not really about excluding other options.

If you run 

proc sort data=SASHELP.CLASS out=CLASS presorted nodupkey danish;
  by NAME;
run;

then option PRESORTED will trigger the validation of the NODUPKEY and DANISH options, and add the corresponding metadata.

 

In this case PROC CONTENTS shows the updated values for flags SORTED, NODUPKEY and COLLATE.

 

That's all PRESORTED does: set the metadata. This option excludes any processing. What a wasted opportunity.

art297
Opal | Level 21

@ChrisNZ: Not totally true. Take a look at the log after running the following .. particularly the log for the last two sorts.

The presorted option cuts the sort down by 2/3rds of the time.

 

data have;
  set sashelp.class;
  do i=1 to 100000;
    output;
  end;
run;

proc sort data=have;
  by sex;
run;

data need;
  set have;
run;
 
proc sort data=need out=want1;
  by sex;
run;

proc sort data=need out=want2 presorted;
  by sex;
run;

Art, CEO, AnalystFinder.com

 

ChrisNZ
Tourmaline | Level 20

Yes PRESORTED avoids sorting, and just copies the table instead, thereby saving time.

But that's only true if no other processing, such as removing duplicates, is required.

That's where the wasted opportunity is: why not allow that removal? A simple sequential read would then suffice, instead of sorting.

 

 

Note that PRESORTED is even more efficient if you don't create another table.  

proc sort data=HAVE presorted; by NAME; run;

data _null_; set HAVE; by NAME; run;

 

Both these steps take about the same time, as no data is copied and the data set is just read sequentially.

 

I am not saying this is a useless option, quite the contrary.

I am saying it could easily have been more useful.

As in the case of this thread's question.

 

 

 

 

art297
Opal | Level 21

Then we totally agree!

 

Art, CEO, AnalystFinder.com

 

ChrisNZ
Tourmaline | Level 20

@art297 Worth noting as it is the first time I have this feedback after reporting countless issues:

 

The developer will: 

  1. Look at changing the wording of the Notes generated with the various options in PROC SORT  
  2. Work with Publications to ensure that the documentation is updated to clarify everything related to the options used and the effect they have on each other

So it seems this one will not be shelved! Yay! 🙂

 

art297
Opal | Level 21

@ChrisNZ: What's the saying?: if you don't succeed the first time, try, try again?

 

Hopefully, in trying to explain it, they'll realize that the option should simply perform as a typical user would expect!

 

Art, CEO, AnalystFinder.com

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 1381 views
  • 7 likes
  • 7 in conversation