topic Re: Removing Duplicate Observations in SAS Procedures

Removing Duplicate Observations

OS2Rules — Tue, 22 Dec 2009 19:57:03 GMT

Hi All:

I'm trying to get my brain bent around this and I know I could write a mile of code, but I thought someone could find an easier solution.

I have a ton of records (nearly 2 million) that are sequenced by date. For some dates there is a single record, for some there are 2 records. There are never more than 2 for any date.

When the 2 values for a selected date are the same I want to set the value for both to missing ("."), but I want to keep the records in the table. I know I could sort and use "first." and "last.", or use a "lag", but I'm thinkin' there is a more elegant solution. Most solutions I come up with only get rid of the duplicate but I need to get rid of them both.

Thanks in advance.

Re: Removing Duplicate Observations

Bill — Tue, 22 Dec 2009 20:41:01 GMT

One way would be to run proc summary against the data set (http://support.sas.com/kb/37/581.html) to find the duplicates. Merge back to the original file to update the duplicate records to missing.

Re: Removing Duplicate Observations

sbb — Tue, 22 Dec 2009 22:04:26 GMT

Elegant? After your PROC SORT step, you will have less than 10 SAS statements for the DATA step executino.

So, one DATA step pass and test for FIRST. AND LAST. -- if it's not the condition, then set your to missing values until you get to LAST..

Scott Barry
SBBWorks, Inc.

Re: Removing Duplicate Observations

deleted_user — Wed, 23 Dec 2009 07:58:59 GMT

Maybe this example might help, but take into account that you have a ton of records.
Program finds duplicates, keep them in the table and sets values to missing.

DATA Temp;
INPUT date data;
CARDS;
1 5000
1 6000
1 2100
2 1110
2 5000
2 5000
3 5000
;
RUN;

proc sql;
create table temp1
as
select a.date,
a.data,
(CASE
when b.date is not null then .
else a.data
END) as new_data_value
from temp a
left outer join (
select date
from temp
group by date
having count(*)>1) b on a.date=b.date
;
quit;

Re: Removing Duplicate Observations

Patrick — Wed, 23 Dec 2009 15:08:19 GMT

I agree with Scott: What's so bad about a first-last approach? It's not a lot of code and easy to understand and maintain.

Of course: If your data is stored in a database and you want to alter the values in this database table then a SQL approach would be better.

DATA Temp;
INPUT date data;
CARDS;
1 5000
1 6000
1 2100
2 1110
3 5000
3 5000
;
RUN;

proc sql noprint;
create view SortedTemp as
select *
from temp
order by date
;
quit;

data want;
set temp;
by date;
if not (first.date and last.date) then call missing(data);
run;

proc print data=want;
run;