BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JKCho
Pyrite | Level 9
proc sql;
create table comp2_1 as
 select *,(select count(cf) from comp2 where fyear between a.fyear-1 and a.fyear-5 and gvkey=a.gvkey) as count_cf
  from comp2 as a;
quit;

 Hello,

 

I first used the code above but as you can easily expect, it takes so long...

What I want is to obtain the number(count_cf) of observations(cf) between t-5 year and t-1 year for each firm(gvkey).

Since some "cf" observations are missing, I only want to use non-missing data, so I need to sort them out.

 

It is very sure that I am very close to hit the answer by searching for in this community but I have not reached that yet.

Please share your thoughts!

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.

%let start = 1800;
%let end = 2100;

data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
do until (last.gvkey);
  set have;
  by gvkey;
  years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
  set have;
  by gvkey;
  count_cf = sum(of years{fyear- 5:fyear - 1});
  output;
end;
run;
  

Untested, posted from my tablet.

View solution in original post

13 REPLIES 13
Kurt_Bremser
Super User

Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.

%let start = 1800;
%let end = 2100;

data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
do until (last.gvkey);
  set have;
  by gvkey;
  years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
  set have;
  by gvkey;
  count_cf = sum(of years{fyear- 5:fyear - 1});
  output;
end;
run;
  

Untested, posted from my tablet.

Reeza
Super User
A self join may even be slightly faster than a subquery though....have you tried that instead?
JKCho
Pyrite | Level 9

Nope. I am trying now. I know the code down there does not work. Can you suggest some? I am not sure where I need to create n(cf) in this self join.

proc sql ;
  create table step5 as 
    select a.*, b.n(cf)
    from step4 a left join step4 b
    on a.gvkey = b.gvkey
    where a.fyear between a.fyear-1 and a.fyear-5;
quit;
Reeza
Super User
Functions go around the variable reference, n(b.cf)

proc sql ;
create table step5 as
select a.*, n(b.cf)
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between a.fyear-1 and a.fyear-5;
quit;
Reeza
Super User
You also need to change your reference on the between to b from a. ie a.fyear between b.fyear-1 and b.fyear-5.
Check that the a/b references are what you want.
JKCho
Pyrite | Level 9

Thank you Reeze!

 

 

proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5;
quit;

I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)

 

It seems that the number of n_cf should be bounded by on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;

But this is not the case. 

 

Any ideas?

 

tarheel13
Rhodochrosite | Level 12
If you want the count by firm you would have to do group by firm.
Reeza
Super User
proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5
group by gvkey;
quit;

@JKCho wrote:

Thank you Reeze!

 

 

proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5;
quit;

I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)

 

It seems that the number of n_cf should be bounded by on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;

But this is not the case. 

 

Any ideas?

 


 

JKCho
Pyrite | Level 9

Thank you again.

 

I realized this programming code is not for what I want. 

I just want to have is... if my raw data is... 

Id year cf 

1 2000  . 

1 2001  . 

1 2002  100

1 2003  200

1 2004  300

1 2005  150

1 2006  900    

1 2007   .        

1 2008   .

 

then....

 

Id year cf     n_cf

1 2000  .        0

1 2001  .        0

1 2002  100   0

1 2003  200   1

1 2004  300   2

1 2005  150   3    

1 2006  900   4

1 2007  .        5

1 2008  .        4 

 

n_cf is the number of non-missing cf variables between t-1 year and t-5 year.

I thought 

where a.fyear between b.fyear-1 and b.fyear-5;

 the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).

 

Does any have thoughts on that? Thank you so much!

 

Reeza
Super User
Note the GROUP BY in my last response.
JKCho
Pyrite | Level 9
well. I know you said that. What I said is... what you said also does not work. Seems no way and I have to approach in a different way. Thx!
SASKiwi
PROC Star

A query like this will get you the result you want, assuming you've already got the required HAVE input dataset

proc sql;
  create table want as
  select ID
        ,year
        ,cf
        ,count(cf)
  from have
  group by id
          ,year
          ,cf
;
quit;
Kurt_Bremser
Super User

After correcting a mistake, and adapting the variable names to your data, this works and creates the expected result:

data have;
input Id year cf;
datalines; 
1 2000  . 
1 2001  . 
1 2002  100
1 2003  200
1 2004  300
1 2005  150
1 2006  900    
1 2007   .        
1 2008   .
;

%let start = 1800;
%let end = 2100;

data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
do until (last.id);
  set have;
  by id;
  years{year} + (cf ne .);
end;
count_cf = 0;
do until (last.id);
  set have;
  by id;
  count_cf = 0;
  do fyear = year - 5 to year - 1;
    count_cf + years{fyear};
  end;
  output;
end;
drop fyear;
run;

proc print data=want noobs;
run;

Result:

Id	year	cf	count_cf
1	2000	.	0
1	2001	.	0
1	2002	100	0
1	2003	200	1
1	2004	300	2
1	2005	150	3
1	2006	900	4
1	2007	.	5
1	2008	.	4

@JKCho wrote:

Thank you again.

 

I realized this programming code is not for what I want. 

I just want to have is... if my raw data is... 

Id year cf 

1 2000  . 

1 2001  . 

1 2002  100

1 2003  200

1 2004  300

1 2005  150

1 2006  900    

1 2007   .        

1 2008   .

 

then....

 

Id year cf     n_cf

1 2000  .        0

1 2001  .        0

1 2002  100   0

1 2003  200   1

1 2004  300   2

1 2005  150   3    

1 2006  900   4

1 2007  .        5

1 2008  .        4 

 

n_cf is the number of non-missing cf variables between t-1 year and t-5 year.

I thought 

where a.fyear between b.fyear-1 and b.fyear-5;

 the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).

 

Does any have thoughts on that? Thank you so much!

 


hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 3343 views
  • 7 likes
  • 5 in conversation