Hi
I am new at working with SAS 9.4. I am having problems deleting observations based on other observations.
I have people with an ID, year of study start, a score and the year the score was taken (score_year).
People with multiple scores have multiple observations.
I would like to keep all the observations where the score_year is after study start.
I would also like to keep one observation (if there is one) where score_year is before study start. When there are multiple observations with score_year before study start I only want to keep the one closest to study start (and still keep all the observations with score_year after study start). Can anyone help me with that?
data have;
input ID start score_year score;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1;
run;
data want;
input ID start score_year score;
1 2015 2007 0
2 2012 2005 0
2 2012 2013 1
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1;
run;
Thank you in advance.
Regards,
RTN
This should work.
data have;
input ID 2. start 4. score_year 5. score 2.;
cards;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
;
run;
proc sql;
create table temp as
select *
from have
where start >= score_year
order by id, start, score_year
;
create table keep1 as
select *
from have
where start < score_year
;
quit;
data keep2;
set temp;
by id;
if not last.id then delete;
run;
data want;
set keep1 keep2;
run;
proc sort data=want;
by id start score_year;
run;
This should work.
data have;
input ID 2. start 4. score_year 5. score 2.;
cards;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
;
run;
proc sql;
create table temp as
select *
from have
where start >= score_year
order by id, start, score_year
;
create table keep1 as
select *
from have
where start < score_year
;
quit;
data keep2;
set temp;
by id;
if not last.id then delete;
run;
data want;
set keep1 keep2;
run;
proc sort data=want;
by id start score_year;
run;
Do like this
data have;
input ID start score_year score;
datalines;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
run;
data want(drop=latest);
do until (last.id);
set have;
by id;
if score_year<start & score_year>latest then latest=score_year;
end;
do until (last.id);
set have;
by id;
if start<score_year | score_year=latest then output;
end;
run;
Result:
ID start score_year score 1 2015 2007 0 2 2012 2005 0 2 2012 2013 1 3 2000 1999 6 3 2000 2002 4 3 2000 2005 6 4 2000 1998 0 4 2000 2008 1
not run but will test later. The idea is to separate your concerns first (into before and after) then with the before set keep only the date closest to the score year. Stack this set with the after set and I think you will have what you want.
data before after;set have; if score_year ge start then output after; else output before; run; proc sort data=before; by ID start descending score_year; run; proc sort data=before nodupkey out=before2; by id start; run; data want; set after before2; run
-unison
SQL fun:
data have;
input ID start score_year score;
cards;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
;
run;
proc sql;
create table want(drop=t:) as
select *,score_year>start as t, (start-score_year) as t1
from have
group by id,t
having t or min(t1)=t1 and not t
order by id, start, score_year;
quit;
@RTN:
This is a variation on the double DoW-loop scheme offered by @PeterClemmensen. The only essential difference is that the first DoW-loop reads only the records needed to determine the minimal difference in case WHERE score_year < start.
data have ;
input id start score_year score ;
cards ;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
;
run ;
data want (drop = _:) ;
do until (last.id) ;
set have (where = (score_year < start)) ;
by id ;
_m = min (_m, start - score_year) ;
end ;
do until (last.id) ;
set have ;
by id ;
if start <= score_year or start - score_year = _m then output ;
end ;
run ;
An interesting aspect for those who like think in DoW terms is that though the two loops read different number of records from their respective independent input streams, they remain in sync thanks to the second BY.
Kind regards
Paul D.
An interesting aspect for those who like think in DoW terms is that though the two loops read different number of records from their respective independent input streams, they remain in sync thanks to the second BY.
Unless there is a BY group with NO observations that meet the WHERE condition in the first DO loop. If that happens you will start comparing apples to oranges.
@Tom: A keen observation ... that's what I call think.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.