proc sql;
create table comp2_1 as
select *,(select count(cf) from comp2 where fyear between a.fyear-1 and a.fyear-5 and gvkey=a.gvkey) as count_cf
from comp2 as a;
quit;
Hello,
I first used the code above but as you can easily expect, it takes so long...
What I want is to obtain the number(count_cf) of observations(cf) between t-5 year and t-1 year for each firm(gvkey).
Since some "cf" observations are missing, I only want to use non-missing data, so I need to sort them out.
It is very sure that I am very close to hit the answer by searching for in this community but I have not reached that yet.
Please share your thoughts!
Thank you.
Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.
%let start = 1800;
%let end = 2100;
data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
years{_n_} = 0;
end;
do until (last.gvkey);
set have;
by gvkey;
years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
set have;
by gvkey;
count_cf = sum(of years{fyear- 5:fyear - 1});
output;
end;
run;
Untested, posted from my tablet.
Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.
%let start = 1800;
%let end = 2100;
data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
years{_n_} = 0;
end;
do until (last.gvkey);
set have;
by gvkey;
years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
set have;
by gvkey;
count_cf = sum(of years{fyear- 5:fyear - 1});
output;
end;
run;
Untested, posted from my tablet.
Nope. I am trying now. I know the code down there does not work. Can you suggest some? I am not sure where I need to create n(cf) in this self join.
proc sql ;
create table step5 as
select a.*, b.n(cf)
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between a.fyear-1 and a.fyear-5;
quit;
Thank you Reeze!
proc sql ;
create table step5 as
select a.*, n(b.cf) as n_cf
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between b.fyear-1 and b.fyear-5;
quit;
I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)
It seems that the number of n_cf should be bounded by on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;
But this is not the case.
Any ideas?
proc sql ;
create table step5 as
select a.*, n(b.cf) as n_cf
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between b.fyear-1 and b.fyear-5
group by gvkey;
quit;
@JKCho wrote:
Thank you Reeze!
proc sql ; create table step5 as select a.*, n(b.cf) as n_cf from step4 a left join step4 b on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5; quit;
I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)
It seems that the number of n_cf should be bounded by
on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;
But this is not the case.
Any ideas?
Thank you again.
I realized this programming code is not for what I want.
I just want to have is... if my raw data is...
Id year cf
1 2000 .
1 2001 .
1 2002 100
1 2003 200
1 2004 300
1 2005 150
1 2006 900
1 2007 .
1 2008 .
then....
Id year cf n_cf
1 2000 . 0
1 2001 . 0
1 2002 100 0
1 2003 200 1
1 2004 300 2
1 2005 150 3
1 2006 900 4
1 2007 . 5
1 2008 . 4
n_cf is the number of non-missing cf variables between t-1 year and t-5 year.
I thought
where a.fyear between b.fyear-1 and b.fyear-5;
the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).
Does any have thoughts on that? Thank you so much!
A query like this will get you the result you want, assuming you've already got the required HAVE input dataset
proc sql;
create table want as
select ID
,year
,cf
,count(cf)
from have
group by id
,year
,cf
;
quit;
After correcting a mistake, and adapting the variable names to your data, this works and creates the expected result:
data have;
input Id year cf;
datalines;
1 2000 .
1 2001 .
1 2002 100
1 2003 200
1 2004 300
1 2005 150
1 2006 900
1 2007 .
1 2008 .
;
%let start = 1800;
%let end = 2100;
data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
years{_n_} = 0;
end;
do until (last.id);
set have;
by id;
years{year} + (cf ne .);
end;
count_cf = 0;
do until (last.id);
set have;
by id;
count_cf = 0;
do fyear = year - 5 to year - 1;
count_cf + years{fyear};
end;
output;
end;
drop fyear;
run;
proc print data=want noobs;
run;
Result:
Id year cf count_cf 1 2000 . 0 1 2001 . 0 1 2002 100 0 1 2003 200 1 1 2004 300 2 1 2005 150 3 1 2006 900 4 1 2007 . 5 1 2008 . 4
@JKCho wrote:
Thank you again.
I realized this programming code is not for what I want.
I just want to have is... if my raw data is...
Id year cf
1 2000 .
1 2001 .
1 2002 100
1 2003 200
1 2004 300
1 2005 150
1 2006 900
1 2007 .
1 2008 .
then....
Id year cf n_cf
1 2000 . 0
1 2001 . 0
1 2002 100 0
1 2003 200 1
1 2004 300 2
1 2005 150 3
1 2006 900 4
1 2007 . 5
1 2008 . 4
n_cf is the number of non-missing cf variables between t-1 year and t-5 year.
I thought
where a.fyear between b.fyear-1 and b.fyear-5;
the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).
Does any have thoughts on that? Thank you so much!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.