08-30-2012 10:20 PM
Assuming you want to remove duplicate observations with the same values?
input x y;
/* PROC SORT method */
proc sort data=have nodup out=want1;
by x y;
/* PROC SQL method */
create table want2 as select distinct * from have;
If you want to delete duplicates that are not 100% identical (e.g. all units with the same ID variable regardless of whether they differ in other values) look at the NODUPKEY option in PROC SORT.
08-31-2012 10:54 PM
Don't use the sort task to get rid of duplicate records (=all variable values equal). This will only give you the expected result if you also sort by ALL variables in the data set (and then you could also choose "Keep only the first record per by group").
Selecting option "Keep only one of each record..." in the sort task creates a Proc Sort statement using NODUP (which is NODUPRECS). This option is imho one of the less lucky implementations SAS did. It's quite deceptive and I've seen already a lot of people stumbling over it.
What NODUP does is removing identical rows (all variable values the same) only if they follow right one after the other - and you have no control if this happens this way if you don't sort by all variables (and then NODUPKEY gives you the same result).
Here an example to showcase the issue
input key var $;
This generates the following Proc Sort code (in EG4.3):
PROC SORT DATA=WORK.SAMPLE
One would now expect to get 2 rows as result
BUT that's what you will get:
Selection of "Maintaining original data order..." will also influence the result - but with the simple sample data provided the result will be wrong as well.
Use the query builder with "select distinct rows only" for de-dupping.
09-02-2012 07:59 PM
Patrick is correct, I was aware of this rather important detail but completely neglected to mention it in my response :-( In hindsight, my PROC SORT example should have used "by _all_".
09-19-2012 04:00 PM
I work with pharmacy claims, in which we often want to have a final report with only the most current fill reported. We remove the duplicate claims, that have some different data and some matching, through using group filters in EG. We also are non-coders.
The process is to pull the full data set first. Then query off it, for our example, we first select to group by the Member ID, and GPI (unique drug identifier) - therefore keeping the combination of member + drug, regardless of the other information on the claim )such as costs, pharmacy, etc) and then we use the MAX summary function on the date field.
Once you have selected to do the MAX of date, on the filter tab, the "filter the summarized data" option will show up at the bottom half of the screen. We then enter in the filter to this section where (CALCULATED MAX DATE) = DATE.
This is probably a bit confusing to read through, but I think it's a really good way to get to the most recent data out of millions of rows of multiple transactions. I would assume this would be helpful further than the Pharmacy industry.
09-19-2012 04:31 PM
By chance, I just added a blog post with a simplified example of what EmilyC describes.
Need further help from the community? Please ask a new question.