Help using Base SAS procedures

Removing Duplicate Observations

Reply
Super Contributor
Posts: 358

Removing Duplicate Observations

Hi All:

I'm trying to get my brain bent around this and I know I could write a mile of code, but I thought someone could find an easier solution.

I have a ton of records (nearly 2 million) that are sequenced by date. For some dates there is a single record, for some there are 2 records. There are never more than 2 for any date.

When the 2 values for a selected date are the same I want to set the value for both to missing ("."), but I want to keep the records in the table. I know I could sort and use "first." and "last.", or use a "lag", but I'm thinkin' there is a more elegant solution. Most solutions I come up with only get rid of the duplicate but I need to get rid of them both.

Thanks in advance.
Super Contributor
Posts: 292

Re: Removing Duplicate Observations

One way would be to run proc summary against the data set (http://support.sas.com/kb/37/581.html) to find the duplicates. Merge back to the original file to update the duplicate records to missing.
Super Contributor
Super Contributor
Posts: 3,174

Re: Removing Duplicate Observations

Elegant? After your PROC SORT step, you will have less than 10 SAS statements for the DATA step executino.

So, one DATA step pass and test for FIRST. AND LAST. -- if it's not the condition, then set your to missing values until you get to LAST..

Scott Barry
SBBWorks, Inc.
N/A
Posts: 0

Re: Removing Duplicate Observations

Maybe this example might help, but take into account that you have a ton of records.
Program finds duplicates, keep them in the table and sets values to missing.

DATA Temp;
INPUT date data;
CARDS;
1 5000
1 6000
1 2100
2 1110
2 5000
2 5000
3 5000
;
RUN;

proc sql;
create table temp1
as
select a.date,
a.data,
(CASE
when b.date is not null then .
else a.data
END) as new_data_value
from temp a
left outer join (
select date
from temp
group by date
having count(*)>1) b on a.date=b.date
;
quit;
Respected Advisor
Posts: 4,173

Re: Removing Duplicate Observations

I agree with Scott: What's so bad about a first-last approach? It's not a lot of code and easy to understand and maintain.

Of course: If your data is stored in a database and you want to alter the values in this database table then a SQL approach would be better.

DATA Temp;
INPUT date data;
CARDS;
1 5000
1 6000
1 2100
2 1110
3 5000
3 5000
;
RUN;

proc sql noprint;
create view SortedTemp as
select *
from temp
order by date
;
quit;

data want;
set temp;
by date;
if not (first.date and last.date) then call missing(data);
run;

proc print data=want;
run;
Ask a Question
Discussion stats
  • 4 replies
  • 160 views
  • 0 likes
  • 5 in conversation