Desktop productivity for business analysts and programmers

how to remove duplicates in SAS 4.3 EG

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

how to remove duplicates in SAS 4.3 EG

Hello all

Got a query on how to remove dulicates while using SAS 4.3 EG

Any pointers would be of help.

Thanks

Sundari


Accepted Solutions
Solution
‎08-31-2012 06:48 AM
Contributor
Posts: 44

Re: how to remove duplicates in SAS 4.3 EG

You can also your the sorttask and select no dublicated keys or obs in the option panel.

View solution in original post


All Replies
Contributor
Posts: 30

Re: how to remove duplicates in SAS 4.3 EG

Assuming you want to remove duplicate observations with the same values?

data have;

input x y;

cards;

1 2

1 4

1 8

2 3

2 4

2 4

2 5

1 7

1 8

;

run;

/* PROC SORT method */

proc sort data=have nodup out=want1;

by x y;

quit;

/* PROC SQL method */

proc sql;

create table want2 as select distinct * from have;

quit;

If you want to delete duplicates that are not 100% identical (e.g. all units with the same ID variable regardless of whether they differ in other values) look at the NODUPKEY option in PROC SORT.

Respected Advisor
Posts: 3,831

Re: how to remove duplicates in SAS 4.3 EG

Use the query builder, select all columns, tick box "distinct rows"

Solution
‎08-31-2012 06:48 AM
Contributor
Posts: 44

Re: how to remove duplicates in SAS 4.3 EG

You can also your the sorttask and select no dublicated keys or obs in the option panel.

Respected Advisor
Posts: 3,831

Re: how to remove duplicates in SAS 4.3 EG

Don't use the sort task to get rid of duplicate records (=all variable values equal). This will only give you the expected result if you also sort by ALL variables in the data set (and then you could also choose "Keep only the first record per by group").

Selecting option "Keep only one of each record..." in the sort task creates a Proc Sort statement using NODUP (which is NODUPRECS). This option is imho one of the less lucky implementations SAS did. It's quite deceptive and I've seen already a lot of people stumbling over it.

What NODUP does is removing identical rows (all variable values the same) only if they follow right one after the other - and you have no control if this happens this way if you don't sort by all variables (and then NODUPKEY gives you the same result).

Here an example to showcase the issue

data sample;

  input key var $;

  datalines;

1 a

1 b

1 b

1 a

;

run;

Capture1.PNG

Capture.PNG

This generates the following Proc Sort code (in EG4.3):

PROC SORT DATA=WORK.SAMPLE
   OUT=WORK.SORTSortedSAMPLE(LABEL="Sorted WORK.SAMPLE")
   NODUP
   ;
   BY key;

RUN;

One would now expect to get 2 rows as result

1 a

1 b

BUT that's what you will get:

Capture.PNG

Selection of "Maintaining original data order..." will also influence the result - but with the simple sample data provided the result will be wrong as well.

Conclusion

Use the query builder with "select distinct rows only" for de-dupping.

Contributor
Posts: 30

Re: how to remove duplicates in SAS 4.3 EG

Patrick is correct, I was aware of this rather important detail but completely neglected to mention it in my response :-( In hindsight, my PROC SORT example should have used "by _all_".

New Contributor
Posts: 2

Re: how to remove duplicates in SAS 4.3 EG

Thank you guys...

New Contributor
Posts: 2

Re: how to remove duplicates in SAS 4.3 EG

I work with pharmacy claims, in which we often want to have a final report with only the most current fill reported.  We remove the duplicate claims, that have some different data and some matching, through using group filters in EG.   We also are non-coders. 

The process is to pull the full data set first.  Then query off it, for our example, we first select to group by the Member ID, and GPI (unique drug identifier) - therefore keeping the combination of member + drug, regardless of the other information on the claim )such as costs, pharmacy, etc) and then we use the MAX summary function on the date field. 

Once you have selected to do the MAX of date, on the filter tab, the "filter the summarized data" option will show up at the bottom half of the screen.   We then enter in the filter to this section where (CALCULATED MAX DATE) = DATE.

This is probably a bit confusing to read through, but I think it's a really good way to get to the most recent data out of millions of rows of multiple transactions.  I would assume this would be helpful further than the Pharmacy industry.

9-19-2012 2-59-41 PM.png

Community Manager
Posts: 2,693

Re: how to remove duplicates in SAS 4.3 EG

By chance, I just added a blog post with a simplified example of what EmilyC describes.

HAVING (clause) fun with SAS Enterprise Guide - The SAS Dummy

Chris

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 69333 views
  • 6 likes
  • 6 in conversation