Hello!
I am new to SAS and even though I use it daily, I still haven't been able to have the basics down. I have a data set with repetitive values and I would like to get rid of them. For example, I would like to transform the first table below into the second table below. I am trying to count each employee only one time. Does anybody know how to delete observations vertically?
observation | employee | YSBS | Pay |
1 | 1 | 7 | 100000 |
---|---|---|---|
2 | 1 | 7 | 100000 |
3 | 1 | 7 | 100000 |
4 | 2 | 8 | 120000 |
5 | 2 | 8 | 120000 |
6 | 2 | 8 | 120000 |
7 | 2 | 8 | 120000 |
8 | 2 | 8 | 120000 |
9 | 3 | 11 | 110000 |
10 | 4 | 11 | 123000 |
observation | employee | YSBS | Pay |
1 | 1 | 7 | 100000 |
---|---|---|---|
2 | 2 | 8 | 120000 |
3 | 3 | 11 | 110000 |
4 | 4 | 11 | 123000 |
the 2 simple approaches when ALL variables of records are idential to kill such duplicates are
proc sql;
create table want as
select distinct *
from have;
quit;
distinct * only selects the first occurence of fully identical records, sorted or not.
A sas procedure alternative is
proc sort data=have out=want nodups;
by employee ysbs;
run;
If you don't use the out= newdataset option, this procedure will alter your existing dataset and remove the duplicates in it making it unable to recover. If it's an intermediate dataset it can be worthwhile to do so to save memory. The SQL approach has the advantage of not wasting computing time on re-sorting the data.
I'm sure other people will propose different approaches, this is how I do it.
Vincent
the 2 simple approaches when ALL variables of records are idential to kill such duplicates are
proc sql;
create table want as
select distinct *
from have;
quit;
distinct * only selects the first occurence of fully identical records, sorted or not.
A sas procedure alternative is
proc sort data=have out=want nodups;
by employee ysbs;
run;
If you don't use the out= newdataset option, this procedure will alter your existing dataset and remove the duplicates in it making it unable to recover. If it's an intermediate dataset it can be worthwhile to do so to save memory. The SQL approach has the advantage of not wasting computing time on re-sorting the data.
I'm sure other people will propose different approaches, this is how I do it.
Vincent
It worked!
Thank you so much!
Note that NODUPS does NOT do the same thing as DISTINCT * in SQL unless you include ALL variables in the sort order.
PROC SORT only compares adjacent records for duplication, so if there is an intervening record that is different for one of the non BY variables then the non adjacent duplicate records will not be removed.
If you do not care about the non BY variables and want to just get a single observation per by group then use the NODUPKEY option instead.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.