proc sql;
create table comp2_1 as
select *,(select count(cf) from comp2 where fyear between a.fyear-1 and a.fyear-5 and gvkey=a.gvkey) as count_cf
from comp2 as a;
quit;
Hello,
I first used the code above but as you can easily expect, it takes so long...
What I want is to obtain the number(count_cf) of observations(cf) between t-5 year and t-1 year for each firm(gvkey).
Since some "cf" observations are missing, I only want to use non-missing data, so I need to sort them out.
It is very sure that I am very close to hit the answer by searching for in this community but I have not reached that yet.
Please share your thoughts!
Thank you.
Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.
%let start = 1800;
%let end = 2100;
data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
years{_n_} = 0;
end;
do until (last.gvkey);
set have;
by gvkey;
years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
set have;
by gvkey;
count_cf = sum(of years{fyear- 5:fyear - 1});
output;
end;
run;
Untested, posted from my tablet.
Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.
%let start = 1800;
%let end = 2100;
data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
years{_n_} = 0;
end;
do until (last.gvkey);
set have;
by gvkey;
years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
set have;
by gvkey;
count_cf = sum(of years{fyear- 5:fyear - 1});
output;
end;
run;
Untested, posted from my tablet.
Nope. I am trying now. I know the code down there does not work. Can you suggest some? I am not sure where I need to create n(cf) in this self join.
proc sql ;
create table step5 as
select a.*, b.n(cf)
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between a.fyear-1 and a.fyear-5;
quit;
Thank you Reeze!
proc sql ;
create table step5 as
select a.*, n(b.cf) as n_cf
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between b.fyear-1 and b.fyear-5;
quit;
I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)
It seems that the number of n_cf should be bounded by on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;
But this is not the case.
Any ideas?
proc sql ;
create table step5 as
select a.*, n(b.cf) as n_cf
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between b.fyear-1 and b.fyear-5
group by gvkey;
quit;
@JKCho wrote:
Thank you Reeze!
proc sql ; create table step5 as select a.*, n(b.cf) as n_cf from step4 a left join step4 b on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5; quit;
I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)
It seems that the number of n_cf should be bounded by
on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;
But this is not the case.
Any ideas?
Thank you again.
I realized this programming code is not for what I want.
I just want to have is... if my raw data is...
Id year cf
1 2000 .
1 2001 .
1 2002 100
1 2003 200
1 2004 300
1 2005 150
1 2006 900
1 2007 .
1 2008 .
then....
Id year cf n_cf
1 2000 . 0
1 2001 . 0
1 2002 100 0
1 2003 200 1
1 2004 300 2
1 2005 150 3
1 2006 900 4
1 2007 . 5
1 2008 . 4
n_cf is the number of non-missing cf variables between t-1 year and t-5 year.
I thought
where a.fyear between b.fyear-1 and b.fyear-5;
the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).
Does any have thoughts on that? Thank you so much!
A query like this will get you the result you want, assuming you've already got the required HAVE input dataset
proc sql;
create table want as
select ID
,year
,cf
,count(cf)
from have
group by id
,year
,cf
;
quit;
After correcting a mistake, and adapting the variable names to your data, this works and creates the expected result:
data have;
input Id year cf;
datalines;
1 2000 .
1 2001 .
1 2002 100
1 2003 200
1 2004 300
1 2005 150
1 2006 900
1 2007 .
1 2008 .
;
%let start = 1800;
%let end = 2100;
data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
years{_n_} = 0;
end;
do until (last.id);
set have;
by id;
years{year} + (cf ne .);
end;
count_cf = 0;
do until (last.id);
set have;
by id;
count_cf = 0;
do fyear = year - 5 to year - 1;
count_cf + years{fyear};
end;
output;
end;
drop fyear;
run;
proc print data=want noobs;
run;
Result:
Id year cf count_cf 1 2000 . 0 1 2001 . 0 1 2002 100 0 1 2003 200 1 1 2004 300 2 1 2005 150 3 1 2006 900 4 1 2007 . 5 1 2008 . 4
@JKCho wrote:
Thank you again.
I realized this programming code is not for what I want.
I just want to have is... if my raw data is...
Id year cf
1 2000 .
1 2001 .
1 2002 100
1 2003 200
1 2004 300
1 2005 150
1 2006 900
1 2007 .
1 2008 .
then....
Id year cf n_cf
1 2000 . 0
1 2001 . 0
1 2002 100 0
1 2003 200 1
1 2004 300 2
1 2005 150 3
1 2006 900 4
1 2007 . 5
1 2008 . 4
n_cf is the number of non-missing cf variables between t-1 year and t-5 year.
I thought
where a.fyear between b.fyear-1 and b.fyear-5;
the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).
Does any have thoughts on that? Thank you so much!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.