I have a dataset which looks like the following:
DATA have;
input individual investment_year firm individual_exit_year;
DATALINES;
1 2006 1004 .
1 2006 1005 .
1 2006 1006 .
1 2007 1004 .
1 2007 1005 .
1 2007 1007 .
1 2010 1003 .
1 2010 1006 .
1 2010 1007 .
1 2015 1004 .
1 2015 1005 .
1 2015 1007 .
2 2005 1003 2021
2 2005 1006 2021
2 2005 1008 2021
2 2008 1004 2021
2 2008 1005 2021
2 2008 1006 2021
2 2013 1005 2021
2 2013 1006 2021
2 2013 1008 2021
2 2019 1003 2021
2 2019 1004 2021
2 2019 1005 2021
3 2010 1003 2010
;
RUN;
I want to create the following dataset based on the above:
DATA want;
input firm year individual treatment;
DATALINES;
1003 2005 1 0
1003 2006 1 0
1003 2007 1 0
1003 2008 1 0
1003 2009 1 0
1003 2010 1 1
1003 2011 1 1
1003 2012 1 1
1003 2013 1 1
1003 2014 1 1
1003 2015 1 0
1003 2016 1 0
1003 2017 1 0
1003 2018 1 0
1003 2019 1 0
1003 2020 1 0
1003 2021 1 0
1003 2022 1 0
1003 2023 1 0
1004 2005 1 0
1004 2006 1 1
1004 2007 1 1
1004 2008 1 1
1004 2009 1 1
1004 2010 1 0
1004 2011 1 0
1004 2012 1 0
1004 2013 1 0
1004 2014 1 0
1004 2015 1 1
1004 2016 1 1
1004 2017 1 1
1004 2018 1 1
1004 2019 1 1
1004 2020 1 1
1004 2021 1 1
1004 2022 1 1
1004 2023 1 1
1005 2005 1 0
1005 2006 1 1
1005 2007 1 1
1005 2008 1 1
1005 2009 1 1
1005 2010 1 0
1005 2011 1 0
1005 2012 1 0
1005 2013 1 0
1005 2014 1 0
1005 2015 1 1
1005 2016 1 1
1005 2017 1 1
1005 2018 1 1
1005 2019 1 1
1005 2020 1 1
1005 2021 1 1
1005 2022 1 1
1005 2023 1 1
1006 2005 1 0
1006 2006 1 1
1006 2007 1 0
1006 2008 1 0
1006 2009 1 0
1006 2010 1 1
1006 2011 1 1
1006 2012 1 1
1006 2013 1 1
1006 2014 1 1
1006 2015 1 0
1006 2016 1 0
1006 2017 1 0
1006 2018 1 0
1006 2019 1 0
1006 2020 1 0
1006 2021 1 0
1006 2022 1 0
1006 2023 1 0
1007 2005 1 0
1007 2006 1 0
1007 2007 1 1
1007 2008 1 1
1007 2009 1 1
1007 2010 1 1
1007 2011 1 1
1007 2012 1 1
1007 2013 1 1
1007 2014 1 1
1007 2015 1 1
1007 2016 1 1
1007 2017 1 1
1007 2018 1 1
1007 2019 1 1
1007 2020 1 1
1007 2021 1 1
1007 2022 1 1
1007 2023 1 1
1003 2005 2 1
1003 2006 2 1
1003 2007 2 1
1003 2008 2 0
1003 2009 2 0
1003 2010 2 0
1003 2011 2 0
1003 2012 2 0
1003 2013 2 0
1003 2014 2 0
1003 2015 2 0
1003 2016 2 0
1003 2017 2 0
1003 2018 2 0
1003 2019 2 1
1003 2020 2 1
1003 2021 2 0
1003 2022 2 0
1003 2023 2 0
1004 2005 2 0
1004 2006 2 0
1004 2007 2 0
1004 2008 2 1
1004 2009 2 1
1004 2010 2 1
1004 2011 2 1
1004 2012 2 1
1004 2013 2 0
1004 2014 2 0
1004 2015 2 0
1004 2016 2 0
1004 2017 2 0
1004 2018 2 0
1004 2019 2 1
1004 2020 2 1
1004 2021 2 0
1004 2022 2 0
1004 2023 2 0
1005 2005 2 0
1005 2006 2 0
1005 2007 2 0
1005 2008 2 1
1005 2009 2 1
1005 2010 2 1
1005 2011 2 1
1005 2012 2 1
1005 2013 2 1
1005 2014 2 1
1005 2015 2 1
1005 2016 2 1
1005 2017 2 1
1005 2018 2 1
1005 2019 2 1
1005 2020 2 1
1005 2021 2 0
1005 2022 2 0
1005 2023 2 0
1006 2005 2 1
1006 2006 2 1
1006 2007 2 1
1006 2008 2 1
1006 2009 2 1
1006 2010 2 1
1006 2011 2 1
1006 2012 2 1
1006 2013 2 1
1006 2014 2 1
1006 2015 2 1
1006 2016 2 1
1006 2017 2 1
1006 2018 2 1
1006 2019 2 0
1006 2020 2 0
1006 2021 2 0
1006 2022 2 0
1006 2023 2 0
1008 2005 2 1
1008 2006 2 1
1008 2007 2 1
1008 2008 2 0
1008 2009 2 0
1008 2010 2 0
1008 2011 2 0
1008 2012 2 0
1008 2013 2 1
1008 2014 2 1
1008 2015 2 1
1008 2016 2 1
1008 2017 2 1
1008 2018 2 1
1008 2019 2 0
1008 2020 2 0
1008 2021 2 0
1008 2022 2 0
1008 2023 2 0
1003 2005 3 0
1003 2006 3 0
1003 2007 3 0
1003 2008 3 0
1003 2009 3 0
1003 2010 3 0
1003 2011 3 0
1003 2012 3 0
1003 2013 3 0
1003 2014 3 0
1003 2015 3 0
1003 2016 3 0
1003 2017 3 0
1003 2018 3 0
1003 2019 3 0
1003 2020 3 0
1003 2021 3 0
1003 2022 3 0
1003 2023 3 0
;
RUN;
I want to create a timeseries of a new variable called "treatment" for each firm from 2005 to 2023 based on the investment_year for each individual in the "have" dataset.
The rules to create this "treatment" variable is quite simple, but best illustrated using an example. Consider firm=1004 and individual=1 in "have". This individual invests in this firm in 2006 (i.e., in investment_year=2006, we see firm=1004 appears for individual=1), the individual invests in this firm again in 2007 (i.e., firm=1004 appears again for individual 1 when investment_year=2007), does NOT invest in it in 2010 (i.e., firm=1004 does NOT appear for individual 1 in investment_year=2010), and then invests in it again in 2015 (i.e., firm=1004 appears again for individual 1 when investment_year=2015).
The "treatment" variable in "want"is a binary variable that takes either the value of 0 or 1 based on when the individual invests in the given firm. So, for firm=1004 and individual=1, I want to create a timeseries from 2005 to 2023 where "treatment" starts with a value of 0 in 2005 but takes on a value of 1 from 2006 (because individual 1 invests in this firm), it stays on as 1 until the next round of investment, which is in 2007, it will stay as 1 from 2007 to 2010, but will switch back to 0 from 2010 since the individual does not invest in this firm from 2010, but it will switch back to 1 from 2015 onwards since the individual invests in this firm again. Note that if individual_exit_year is empty, then the value of treatment stays from the last investment_year until 2023. In this case, because individual 1 invests in firm=1004 in investment_year=2015 and this individual has an empty individual_exit_year, treatment will equal to 1 from 2015 all the way until 2023.
However, if individual_exit_year is NOT empty, then the value of treatment will equal to 0 from whenever the year in individual_exit_year is. As an example, consider firm=1003 and individual=2. This individual invests in this firm in 2005, so treatment will take a value of 1 from 2005 until 2008 when it will switch to 0 because the individual does NOT invest in this firm in investment_year=2008. Treatment will keep maintaining a value of 0 through to 2013 (since the individual still does NOT invest in this firm in investment_year=2013). But, treatment will take a value of 1 from 2019 (since the individual invests in this firm in 2019), but because individual_exit_year=2021 for individual 2, treatment will only take a value of 1 until 2021, where it will switch back to 0 until 2023. Note that if investment_year and individual_exit_year are the same (as is the case for individual=3), the individual_exit_year takes precedence, i.e., treatment will take a value of 0 until 2023.
Thank you! I managed to tweak your code to get it 🙂
data temp;
set have(keep=individual firm individual_exit_year);
run;
proc sort data=temp out=base nodup;
by individual firm individual_exit_year;
run;
proc sort data=have out=temp2(keep=individual investment_year) nodupkey;
by individual investment_year;
run;
data base2;
set base;
do year=2005 to 2023;
output;
end;
run;
proc sql;
create table base3
as select distinct a.*, b.investment_year, c.investment_year as inv_year
from base2 a
left join have b
on a.individual=b.individual
and a.firm=b.firm
and a.individual_exit_year=b.individual_exit_year
and a.year=b.investment_year
left join temp2 c
on a.individual=c.individual
and a.year=c.investment_year
order by firm, individual, year
;
quit;
data base3;
set base3;
if investment_year^=. then do;
if investment_year = inv_year then treat=1;
end;
if inv_year^=. and investment_year=. then treat=0;
run;
data base3;
set base3;
by firm individual;
if first.individual then treat_fill=treat;
else treat_fill=coalesce(treat,treat_fill); /** Note order matters treat, treat_fill **/
retain treat_fill;
run;
data base3;
set base3;
if treat_fill=. then treat_fill=0;
if individual_exit_year^=. then do;
if year>=individual_exit_year then treat_fill=0;
end;
run;
I can't get your data to match with your description of the rules. But this should get you started.
data temp;
set have(keep=individual firm individual_exit_year);
run;
proc sort data=temp out=base nodup;
by individual firm individual_exit_year;
run;
data base2;
set base;
do year=2005 to 2023;
output;
end;
run;
proc sql;
create table temp2
as select distinct investment_year as inv_year
from have
;
create table base3
as select distinct a.*, b.investment_year, c.inv_year
from base2 a
left join have b
on a.individual=b.individual
and a.firm=b.firm
and a.individual_exit_year=b.individual_exit_year
and a.year=b.investment_year
left join temp2 c
on a.year=c.inv_year
order by individual,year, firm
;
quit;
data want;
set base3;
retain treatment;
by individual year firm;
if first.indiviudal then
treatment=0;
if investment_year=inv_year and investment_year ne . then
treatment=1;
else if investment_year=. and inv_year ne . then treatment=0;
run;
Thank you! I managed to tweak your code to get it 🙂
data temp;
set have(keep=individual firm individual_exit_year);
run;
proc sort data=temp out=base nodup;
by individual firm individual_exit_year;
run;
proc sort data=have out=temp2(keep=individual investment_year) nodupkey;
by individual investment_year;
run;
data base2;
set base;
do year=2005 to 2023;
output;
end;
run;
proc sql;
create table base3
as select distinct a.*, b.investment_year, c.investment_year as inv_year
from base2 a
left join have b
on a.individual=b.individual
and a.firm=b.firm
and a.individual_exit_year=b.individual_exit_year
and a.year=b.investment_year
left join temp2 c
on a.individual=c.individual
and a.year=c.investment_year
order by firm, individual, year
;
quit;
data base3;
set base3;
if investment_year^=. then do;
if investment_year = inv_year then treat=1;
end;
if inv_year^=. and investment_year=. then treat=0;
run;
data base3;
set base3;
by firm individual;
if first.individual then treat_fill=treat;
else treat_fill=coalesce(treat,treat_fill); /** Note order matters treat, treat_fill **/
retain treat_fill;
run;
data base3;
set base3;
if treat_fill=. then treat_fill=0;
if individual_exit_year^=. then do;
if year>=individual_exit_year then treat_fill=0;
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.