Solved: Re: How do I delete observations based on other observations

RTN · Posted 11-12-2019 07:33 AM

Hi

I am new at working with SAS 9.4. I am having problems deleting observations based on other observations.

I have people with an ID, year of study start, a score and the year the score was taken (score_year).

People with multiple scores have multiple observations.

I would like to keep all the observations where the score_year is after study start.

I would also like to keep one observation (if there is one) where score_year is before study start. When there are multiple observations with score_year before study start I only want to keep the one closest to study start (and still keep all the observations with score_year after study start). Can anyone help me with that?

data have;
input ID start score_year score;

1 2015 2003 2
1 2015 2007 0

2 2012 2004 2
2 2012 2005 0
2 2012 2013 1

3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6

4 2000 1998 0
4 2000 2008 1;

run;

data want;
input ID start score_year score;

1 2015 2007 0

2 2012 2005 0
2 2012 2013 1

3 2000 1999 6
3 2000 2002 4
3 2000 2005 6

4 2000 1998 0
4 2000 2008 1;

run;

Thank you in advance.

Regards,

RTN

rudfaden · Posted 11-12-2019 07:50 AM

This should work.

data have;
input ID 2. start 4. score_year 5. score 2.;
cards;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
;

run;

proc sql;
create table temp as
select *
from have 
where start >= score_year
order by id, start, score_year
;
create table keep1 as
select *
from have 
where start < score_year
;
quit;



data keep2;
set temp;
by id;
if not last.id then delete;
run;

data want;
set keep1 keep2;
run;

proc sort data=want;
by id start score_year;
run;

View solution in original post

rudfaden · Posted 11-12-2019 07:50 AM

This should work.

data have;
input ID 2. start 4. score_year 5. score 2.;
cards;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
;

run;

proc sql;
create table temp as
select *
from have 
where start >= score_year
order by id, start, score_year
;
create table keep1 as
select *
from have 
where start < score_year
;
quit;



data keep2;
set temp;
by id;
if not last.id then delete;
run;

data want;
set keep1 keep2;
run;

proc sort data=want;
by id start score_year;
run;

PeterClemmensen · Posted 11-12-2019 07:51 AM

Do like this

data have;
input ID start score_year score;
datalines;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
run;

data want(drop=latest);
    do until (last.id);
        set have;
        by id;
        if score_year<start & score_year>latest then latest=score_year;
    end;
    do until (last.id);
        set have; 
        by id;
        if start<score_year | score_year=latest then output;
    end;
run;

Result:

ID  start  score_year  score
1   2015   2007        0
2   2012   2005        0
2   2012   2013        1
3   2000   1999        6
3   2000   2002        4
3   2000   2005        6
4   2000   1998        0
4   2000   2008        1

The DATA to DATA Step Macro
Blog: SASnrd

unison · Posted 11-12-2019 07:56 AM

not run but will test later. The idea is to separate your concerns first (into before and after) then with the before set keep only the date closest to the score year. Stack this set with the after set and I think you will have what you want.

data before after;
set have;
if score_year ge start then output after;
else output before;
run;


proc sort data=before;
by ID start descending score_year;
run;

proc sort data=before nodupkey out=before2;
by id start;
run;

data want;
set after before2;
run

-unison

novinosrin · Posted 11-12-2019 08:05 AM

SQL fun:


data have;
input ID start score_year score;
cards;
1 2015 2003 2
1 2015 2007 0
2 2012 2004 2
2 2012 2005 0
2 2012 2013 1
3 2000 1997 0
3 2000 1999 6
3 2000 2002 4
3 2000 2005 6
4 2000 1998 0
4 2000 2008 1
;
run;

proc sql;
create table want(drop=t:) as
select *,score_year>start as t, (start-score_year) as t1
from have
group by id,t
having t or min(t1)=t1 and not t 
order by id, start, score_year;
quit;

hashman · Posted 11-12-2019 07:12 PM

@RTN:

This is a variation on the double DoW-loop scheme offered by @PeterClemmensen. The only essential difference is that the first DoW-loop reads only the records needed to determine the minimal difference in case WHERE score_year < start.

data have ;                                                                                                                                                                                                                                                     
  input id start score_year score ;                                                                                                                                                                                                                             
  cards ;                                                                                                                                                                                                                                                       
1 2015 2003 2                                                                                                                                                                                                                                                   
1 2015 2007 0                                                                                                                                                                                                                                                   
2 2012 2004 2                                                                                                                                                                                                                                                   
2 2012 2005 0                                                                                                                                                                                                                                                   
2 2012 2013 1                                                                                                                                                                                                                                                   
3 2000 1997 0                                                                                                                                                                                                                                                   
3 2000 1999 6                                                                                                                                                                                                                                                   
3 2000 2002 4                                                                                                                                                                                                                                                   
3 2000 2005 6                                                                                                                                                                                                                                                   
4 2000 1998 0                                                                                                                                                                                                                                                   
4 2000 2008 1                                                                                                                                                                                                                                                   
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want (drop = _:) ;                                                                                                                                                                                                                                         
  do until (last.id) ;                                                                                                                                                                                                                                          
    set have (where = (score_year < start)) ;                                                                                                                                                                                                                   
    by id ;                                                                                                                                                                                                                                                     
    _m = min (_m, start - score_year) ;                                                                                                                                                                                                                         
  end ;                                                                                                                                                                                                                                                         
  do until (last.id) ;                                                                                                                                                                                                                                          
    set have ;                                                                                                                                                                                                                                                  
    by id ;                                                                                                                                                                                                                                                     
    if start <= score_year or start - score_year = _m then output ;                                                                                                                                                                                             
  end ;                                                                                                                                                                                                                                                         
run ;

An interesting aspect for those who like think in DoW terms is that though the two loops read different number of records from their respective independent input streams, they remain in sync thanks to the second BY.

Kind regards

Paul D.

Tom · Posted 11-12-2019 07:16 PM

An interesting aspect for those who like think in DoW terms is that though the two loops read different number of records from their respective independent input streams, they remain in sync thanks to the second BY.

Unless there is a BY group with NO observations that meet the WHERE condition in the first DO loop. If that happens you will start comparing apples to oranges.

hashman · Posted 11-12-2019 08:44 PM

@Tom: A keen observation ... that's what I call think.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away