I have a dataset which looks like the following:
DATA have;
input individual investment_year firm individual_exit_year;
DATALINES;
1 2006 1004 .
1 2006 1005 .
1 2006 1006 .
1 2007 1004 .
1 2007 1005 .
1 2007 1007 .
1 2010 1003 .
1 2010 1006 .
1 2010 1007 .
1 2015 1004 .
1 2015 1005 .
1 2015 1007 .
2 2005 1003 2021
2 2005 1006 2021
2 2005 1008 2021
2 2008 1004 2021
2 2008 1005 2021
2 2008 1006 2021
2 2013 1005 2021
2 2013 1006 2021
2 2013 1008 2021
2 2019 1003 2021
2 2019 1004 2021
2 2019 1005 2021
3 2010 1003 2010
;
RUN;
I want to create the following dataset based on the above:
DATA want;
input firm year individual treatment;
DATALINES;
1003 2005 1 0
1003 2006 1 0
1003 2007 1 0
1003 2008 1 0
1003 2009 1 0
1003 2010 1 1
1003 2011 1 1
1003 2012 1 1
1003 2013 1 1
1003 2014 1 1
1003 2015 1 0
1003 2016 1 0
1003 2017 1 0
1003 2018 1 0
1003 2019 1 0
1003 2020 1 0
1003 2021 1 0
1003 2022 1 0
1003 2023 1 0
1004 2005 1 0
1004 2006 1 1
1004 2007 1 1
1004 2008 1 1
1004 2009 1 1
1004 2010 1 0
1004 2011 1 0
1004 2012 1 0
1004 2013 1 0
1004 2014 1 0
1004 2015 1 1
1004 2016 1 1
1004 2017 1 1
1004 2018 1 1
1004 2019 1 1
1004 2020 1 1
1004 2021 1 1
1004 2022 1 1
1004 2023 1 1
1005 2005 1 0
1005 2006 1 1
1005 2007 1 1
1005 2008 1 1
1005 2009 1 1
1005 2010 1 0
1005 2011 1 0
1005 2012 1 0
1005 2013 1 0
1005 2014 1 0
1005 2015 1 1
1005 2016 1 1
1005 2017 1 1
1005 2018 1 1
1005 2019 1 1
1005 2020 1 1
1005 2021 1 1
1005 2022 1 1
1005 2023 1 1
1006 2005 1 0
1006 2006 1 1
1006 2007 1 0
1006 2008 1 0
1006 2009 1 0
1006 2010 1 1
1006 2011 1 1
1006 2012 1 1
1006 2013 1 1
1006 2014 1 1
1006 2015 1 0
1006 2016 1 0
1006 2017 1 0
1006 2018 1 0
1006 2019 1 0
1006 2020 1 0
1006 2021 1 0
1006 2022 1 0
1006 2023 1 0
1007 2005 1 0
1007 2006 1 0
1007 2007 1 1
1007 2008 1 1
1007 2009 1 1
1007 2010 1 1
1007 2011 1 1
1007 2012 1 1
1007 2013 1 1
1007 2014 1 1
1007 2015 1 1
1007 2016 1 1
1007 2017 1 1
1007 2018 1 1
1007 2019 1 1
1007 2020 1 1
1007 2021 1 1
1007 2022 1 1
1007 2023 1 1
1003 2005 2 1
1003 2006 2 1
1003 2007 2 1
1003 2008 2 0
1003 2009 2 0
1003 2010 2 0
1003 2011 2 0
1003 2012 2 0
1003 2013 2 0
1003 2014 2 0
1003 2015 2 0
1003 2016 2 0
1003 2017 2 0
1003 2018 2 0
1003 2019 2 1
1003 2020 2 1
1003 2021 2 0
1003 2022 2 0
1003 2023 2 0
1004 2005 2 0
1004 2006 2 0
1004 2007 2 0
1004 2008 2 1
1004 2009 2 1
1004 2010 2 1
1004 2011 2 1
1004 2012 2 1
1004 2013 2 0
1004 2014 2 0
1004 2015 2 0
1004 2016 2 0
1004 2017 2 0
1004 2018 2 0
1004 2019 2 1
1004 2020 2 1
1004 2021 2 0
1004 2022 2 0
1004 2023 2 0
1005 2005 2 0
1005 2006 2 0
1005 2007 2 0
1005 2008 2 1
1005 2009 2 1
1005 2010 2 1
1005 2011 2 1
1005 2012 2 1
1005 2013 2 1
1005 2014 2 1
1005 2015 2 1
1005 2016 2 1
1005 2017 2 1
1005 2018 2 1
1005 2019 2 1
1005 2020 2 1
1005 2021 2 0
1005 2022 2 0
1005 2023 2 0
1006 2005 2 1
1006 2006 2 1
1006 2007 2 1
1006 2008 2 1
1006 2009 2 1
1006 2010 2 1
1006 2011 2 1
1006 2012 2 1
1006 2013 2 1
1006 2014 2 1
1006 2015 2 1
1006 2016 2 1
1006 2017 2 1
1006 2018 2 1
1006 2019 2 0
1006 2020 2 0
1006 2021 2 0
1006 2022 2 0
1006 2023 2 0
1008 2005 2 1
1008 2006 2 1
1008 2007 2 1
1008 2008 2 0
1008 2009 2 0
1008 2010 2 0
1008 2011 2 0
1008 2012 2 0
1008 2013 2 1
1008 2014 2 1
1008 2015 2 1
1008 2016 2 1
1008 2017 2 1
1008 2018 2 1
1008 2019 2 0
1008 2020 2 0
1008 2021 2 0
1008 2022 2 0
1008 2023 2 0
1003 2005 3 0
1003 2006 3 0
1003 2007 3 0
1003 2008 3 0
1003 2009 3 0
1003 2010 3 0
1003 2011 3 0
1003 2012 3 0
1003 2013 3 0
1003 2014 3 0
1003 2015 3 0
1003 2016 3 0
1003 2017 3 0
1003 2018 3 0
1003 2019 3 0
1003 2020 3 0
1003 2021 3 0
1003 2022 3 0
1003 2023 3 0
;
RUN;
I want to create a timeseries of a new variable called "treatment" for each firm from 2005 to 2023 based on the investment_year for each individual in the "have" dataset.
The rules to create this "treatment" variable is quite simple, but best illustrated using an example. Consider firm=1004 and individual=1 in "have". This individual invests in this firm in 2006 (i.e., in investment_year=2006, we see firm=1004 appears for individual=1), the individual invests in this firm again in 2007 (i.e., firm=1004 appears again for individual 1 when investment_year=2007), does NOT invest in it in 2010 (i.e., firm=1004 does NOT appear for individual 1 in investment_year=2010), and then invests in it again in 2015 (i.e., firm=1004 appears again for individual 1 when investment_year=2015).
The "treatment" variable in "want"is a binary variable that takes either the value of 0 or 1 based on when the individual invests in the given firm. So, for firm=1004 and individual=1, I want to create a timeseries from 2005 to 2023 where "treatment" starts with a value of 0 in 2005 but takes on a value of 1 from 2006 (because individual 1 invests in this firm), it stays on as 1 until the next round of investment, which is in 2007, it will stay as 1 from 2007 to 2010, but will switch back to 0 from 2010 since the individual does not invest in this firm from 2010, but it will switch back to 1 from 2015 onwards since the individual invests in this firm again. Note that if individual_exit_year is empty, then the value of treatment stays from the last investment_year until 2023. In this case, because individual 1 invests in firm=1004 in investment_year=2015 and this individual has an empty individual_exit_year, treatment will equal to 1 from 2015 all the way until 2023.
However, if individual_exit_year is NOT empty, then the value of treatment will equal to 0 from whenever the year in individual_exit_year is. As an example, consider firm=1003 and individual=2. This individual invests in this firm in 2005, so treatment will take a value of 1 from 2005 until 2008 when it will switch to 0 because the individual does NOT invest in this firm in investment_year=2008. Treatment will keep maintaining a value of 0 through to 2013 (since the individual still does NOT invest in this firm in investment_year=2013). But, treatment will take a value of 1 from 2019 (since the individual invests in this firm in 2019), but because individual_exit_year=2021 for individual 2, treatment will only take a value of 1 until 2021, where it will switch back to 0 until 2023. Note that if investment_year and individual_exit_year are the same (as is the case for individual=3), the individual_exit_year takes precedence, i.e., treatment will take a value of 0 until 2023.
Thank you! I managed to tweak your code to get it 🙂
data temp;
set have(keep=individual firm individual_exit_year);
run;
proc sort data=temp out=base nodup;
by individual firm individual_exit_year;
run;
proc sort data=have out=temp2(keep=individual investment_year) nodupkey;
by individual investment_year;
run;
data base2;
set base;
do year=2005 to 2023;
output;
end;
run;
proc sql;
create table base3
as select distinct a.*, b.investment_year, c.investment_year as inv_year
from base2 a
left join have b
on a.individual=b.individual
and a.firm=b.firm
and a.individual_exit_year=b.individual_exit_year
and a.year=b.investment_year
left join temp2 c
on a.individual=c.individual
and a.year=c.investment_year
order by firm, individual, year
;
quit;
data base3;
set base3;
if investment_year^=. then do;
if investment_year = inv_year then treat=1;
end;
if inv_year^=. and investment_year=. then treat=0;
run;
data base3;
set base3;
by firm individual;
if first.individual then treat_fill=treat;
else treat_fill=coalesce(treat,treat_fill); /** Note order matters treat, treat_fill **/
retain treat_fill;
run;
data base3;
set base3;
if treat_fill=. then treat_fill=0;
if individual_exit_year^=. then do;
if year>=individual_exit_year then treat_fill=0;
end;
run;
I can't get your data to match with your description of the rules. But this should get you started.
data temp;
set have(keep=individual firm individual_exit_year);
run;
proc sort data=temp out=base nodup;
by individual firm individual_exit_year;
run;
data base2;
set base;
do year=2005 to 2023;
output;
end;
run;
proc sql;
create table temp2
as select distinct investment_year as inv_year
from have
;
create table base3
as select distinct a.*, b.investment_year, c.inv_year
from base2 a
left join have b
on a.individual=b.individual
and a.firm=b.firm
and a.individual_exit_year=b.individual_exit_year
and a.year=b.investment_year
left join temp2 c
on a.year=c.inv_year
order by individual,year, firm
;
quit;
data want;
set base3;
retain treatment;
by individual year firm;
if first.indiviudal then
treatment=0;
if investment_year=inv_year and investment_year ne . then
treatment=1;
else if investment_year=. and inv_year ne . then treatment=0;
run;
Thank you! I managed to tweak your code to get it 🙂
data temp;
set have(keep=individual firm individual_exit_year);
run;
proc sort data=temp out=base nodup;
by individual firm individual_exit_year;
run;
proc sort data=have out=temp2(keep=individual investment_year) nodupkey;
by individual investment_year;
run;
data base2;
set base;
do year=2005 to 2023;
output;
end;
run;
proc sql;
create table base3
as select distinct a.*, b.investment_year, c.investment_year as inv_year
from base2 a
left join have b
on a.individual=b.individual
and a.firm=b.firm
and a.individual_exit_year=b.individual_exit_year
and a.year=b.investment_year
left join temp2 c
on a.individual=c.individual
and a.year=c.investment_year
order by firm, individual, year
;
quit;
data base3;
set base3;
if investment_year^=. then do;
if investment_year = inv_year then treat=1;
end;
if inv_year^=. and investment_year=. then treat=0;
run;
data base3;
set base3;
by firm individual;
if first.individual then treat_fill=treat;
else treat_fill=coalesce(treat,treat_fill); /** Note order matters treat, treat_fill **/
retain treat_fill;
run;
data base3;
set base3;
if treat_fill=. then treat_fill=0;
if individual_exit_year^=. then do;
if year>=individual_exit_year then treat_fill=0;
end;
run;
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.