Dear SAS support community,
I want to keep observations with AT LEAST three years before with dummy variable =0, and AT LEAST three years after with Dummy variable =1. If there are more than three years of observations with Dummy =0 before certain year and more than three years of observations with Dummy =1 after, it is fine as well. The data sample is as follows:
Left: Data Have Right: Data Want
| Gvkey | year | Dummy | Gvkey | year | Dummy | |
| AAA | 1991 | 0 | AAA | 1991 | 0 | |
| AAA | 1992 | 0 | AAA | 1992 | 0 | |
| AAA | 1993 | 0 | AAA | 1993 | 0 | |
| AAA | 1994 | 0 | AAA | 1994 | 0 | |
| AAA | 1995 | 1 | AAA | 1995 | 1 | |
| AAA | 1996 | 1 | AAA | 1996 | 1 | |
| AAA | 1997 | 1 | AAA | 1997 | 1 | |
| AAA | 1998 | 1 | AAA | 1998 | 1 | |
| AAA | 1999 | 1 | AAA | 1999 | 1 | |
| AAA | 2000 | 0 | ||||
| AAA | 2001 | 1 | ||||
| AAA | 2002 | 0 | ||||
| AAA | 2003 | 0 | ||||
| AAA | 2004 | 1 |
I tried lead and lag, and also loop, but doesn't seem to work. Thank you so much! Any suggestion is appreciated!
I think SQL will provide an easy answer to your problem. Something like this should work:
proc sql;
create table want as
select A.*
,B.Year_Min
from have as A
inner join
(select Gvkey
,min(year) as Year_Min
from have
where Dummy = 1
group by Gvkey
) as B
on A.Gvkey = B.Gvkey
where abs(A.year - B.Year_Min) <= 4
;
quit;
I think SQL will provide an easy answer to your problem. Something like this should work:
proc sql;
create table want as
select A.*
,B.Year_Min
from have as A
inner join
(select Gvkey
,min(year) as Year_Min
from have
where Dummy = 1
group by Gvkey
) as B
on A.Gvkey = B.Gvkey
where abs(A.year - B.Year_Min) <= 4
;
quit;
Thank you! One more question: Is there a way to full fill both conditions? I saw your solution only meets the "at least four years of dummy = 1 observations".
I was trying something like this:
proc sql;
create table want1 as select A.*
from have as A inner join
(select gvkey,min(year) as Year_Min from have where dummy= 0
group by gvkey ) as B
on A.gvkey= B.gvkey
where abs(A.year - B.Year_Min) <= 3;
quit;
Then I can maybe inner join want and want1 tables?
More specifically the data should be something like this:
data have:
| key | year | Dummy | key | year | Dummy | |
| AAA | 1988 | 0 | AAA | 1991 | 0 | |
| AAA | 1989 | 1 | AAA | 1992 | 0 | |
| AAA | 1990 | 1 | AAA | 1993 | 0 | |
| AAA | 1991 | 0 | AAA | 1994 | 0 | |
| AAA | 1992 | 0 | AAA | 1995 | 1 | |
| AAA | 1993 | 0 | AAA | 1996 | 1 | |
| AAA | 1994 | 0 | AAA | 1997 | 1 | |
| AAA | 1995 | 1 | AAA | 1998 | 1 | |
| AAA | 1996 | 1 | AAA | 1999 | 1 | |
| AAA | 1997 | 1 | ||||
| AAA | 1998 | 1 | ||||
| AAA | 1999 | 1 | ||||
| AAA | 2000 | 0 | ||||
| AAA | 2001 | 1 | ||||
| AAA | 2002 | 0 | ||||
| AAA | 2003 | 0 | ||||
| AAA | 2004 | 1 |
How about something like this?
proc sql;
create table want as
select A.*
,B.Year_Min
,C.Year_Max
from have as A
left join
(select Gvkey
,min(year) as Year_Min
from have
where Dummy = 1
group by Gvkey
) as B
on A.Gvkey = B.Gvkey
left join
(select Gvkey
,max(year) as Year_Max
from have
where Dummy = 0
group by Gvkey
) as C
on A.Gvkey = C.Gvkey
where abs(A.year - B.Year_Min) <= 4
and abs(A.year - C.Year_Max) <= 4
;
quit;
data have;
input Gvkey $ year Dummy ;* Gvkey year Dummy ;
cards;
AAA 1991 0 AAA 1991 0
AAA 1992 0 AAA 1992 0
AAA 1993 0 AAA 1993 0
AAA 1994 0 AAA 1994 0
AAA 1995 1 AAA 1995 1
AAA 1996 1 AAA 1996 1
AAA 1997 1 AAA 1997 1
AAA 1998 1 AAA 1998 1
AAA 1999 1 AAA 1999 1
AAA 2000 0
AAA 2001 1
AAA 2002 0
AAA 2003 0
AAA 2004 1
;
data temp;
if _n_=1 then do;
if 0 then set have(rename=(year=_year dummy=_d));
dcl hash H (dataset:'have(rename=(year=_year dummy=_d))',ordered: "A") ;
h.definekey ("gvkey","_year") ;
h.definedata ("_year","_d") ;
h.definedone () ;
dcl hiter hi('h');
end;
n=0;
do until(last.year);
set have;
by gvkey year;
do rc=hi.setcur(key: gvkey,key:year) by 0 while(rc=0);
n+1;
if dummy=0 then do;
rc=hi.prev();
if dummy ne _d then leave;
end;
else do;
rc=hi.next();
if dummy ne _d then leave;
end;
end;
end;
do until(last.year);
set have;
by gvkey year;
if n>=4 then do;
do rc=hi.setcur(key: gvkey,key:year) by 0 while(rc=0);
output;
if dummy=0 then do;
rc=hi.prev();
if dummy ne _d then leave;
end;
else do;
rc=hi.next();
if dummy ne _d then leave;
end;
end;
end;
end;
keep gvkey _year dummy;
run;
proc sort data=temp out =want(rename=_year=year) nodupkey;
by gvkey _year dummy;
run;
data have;
input Gvkey $ year Dummy ;* Gvkey year Dummy ;
cards;
AAA 1991 0 AAA 1991 0
AAA 1992 0 AAA 1992 0
AAA 1993 0 AAA 1993 0
AAA 1994 0 AAA 1994 0
AAA 1995 1 AAA 1995 1
AAA 1996 1 AAA 1996 1
AAA 1997 1 AAA 1997 1
AAA 1998 1 AAA 1998 1
AAA 1999 1 AAA 1999 1
AAA 2000 0
AAA 2001 1
AAA 2002 0
AAA 2003 0
AAA 2004 1
;
proc sort data=have out=temp1;
by gvkey dummy year;
run;
data temp2;
set temp1;
by gvkey dummy ;
if dif(year) ne 1 the grp+1;
if first.dummy then grp=1;
run;
proc sql;
create table want(drop=grp) as
select *
from temp2
group by gvkey,dummy ,grp
having count(grp)>=4
order by gvkey,dummy, year;
quit;
data have;
input Gvkey $ year Dummy ;* Gvkey year Dummy ;
cards;
AAA 1988 0 AAA 1991 0
AAA 1989 1 AAA 1992 0
AAA 1990 1 AAA 1993 0
AAA 1991 0 AAA 1994 0
AAA 1992 0 AAA 1995 1
AAA 1993 0 AAA 1996 1
AAA 1994 0 AAA 1997 1
AAA 1995 1 AAA 1998 1
AAA 1996 1 AAA 1999 1
AAA 1997 1 AAA 2002 0
AAA 1998 1 AAA 2003 0
AAA 1999 1 AAA 2004 0
AAA 2000 0 AAA 2005 0
AAA 2001 1 AAA 2006 1
AAA 2002 0 AAA 2007 1
AAA 2003 0 AAA 2008 1
AAA 2004 0 AAA 2009 1
AAA 2005 0 AAA 2010 1
AAA 2006 1
AAA 2007 1
AAA 2008 1
AAA 2009 1
AAA 2010 1
AAA 2011 0
;
proc sort data=have out=temp1;
by gvkey dummy year;
run;
data want;
if _n_=1 then do;
dcl hash H (ordered: "A") ;
h.definekey ("gvkey","year") ;
h.definedata ("gvkey","year", "Dummy") ;
h.definedone () ;
dcl hiter hi('h');
end;
do until(last.gvkey);
set temp1;
by gvkey dummy ;
array t(9999) _temporary_;
_year=year;
if dif(_year) ne 1 then do;
grp+1;
count=1;
end;
else count+1;
if count=1 and n(of t(*))>=4 then do;
do year=min(of t(*)) to max(of t(*));
rc=h.add();
end;
call missing(of t(*));
end;
t(count)=_year;
if first.dummy then grp=1;
end;
if count>=4 then do year=min(of t(*)) to max(of t(*));
rc=h.add();
end;
do while(hi.next()=0);
output;
end;
h.clear();
call missing(of t(*),count,grp);
keep gvkey year Dummy;
run;
All of your suggestions are very helpful and I appreciate it!
Nearly 200 sessions are now available on demand in the Innovate Hub.
Watch Now →SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.