BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JKCho
Pyrite | Level 9
proc sql;
create table comp2_1 as
 select *,(select count(cf) from comp2 where fyear between a.fyear-1 and a.fyear-5 and gvkey=a.gvkey) as count_cf
  from comp2 as a;
quit;

 Hello,

 

I first used the code above but as you can easily expect, it takes so long...

What I want is to obtain the number(count_cf) of observations(cf) between t-5 year and t-1 year for each firm(gvkey).

Since some "cf" observations are missing, I only want to use non-missing data, so I need to sort them out.

 

It is very sure that I am very close to hit the answer by searching for in this community but I have not reached that yet.

Please share your thoughts!

 

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.

%let start = 1800;
%let end = 2100;

data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
do until (last.gvkey);
  set have;
  by gvkey;
  years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
  set have;
  by gvkey;
  count_cf = sum(of years{fyear- 5:fyear - 1});
  output;
end;
run;
  

Untested, posted from my tablet.

View solution in original post

13 REPLIES 13
Kurt_Bremser
Super User

Run a double DO loop, in the first loop populate an array (indexed by year). In the second loop, calculate the value.

%let start = 1800;
%let end = 2100;

data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
do until (last.gvkey);
  set have;
  by gvkey;
  years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
  set have;
  by gvkey;
  count_cf = sum(of years{fyear- 5:fyear - 1});
  output;
end;
run;
  

Untested, posted from my tablet.

Reeza
Super User
A self join may even be slightly faster than a subquery though....have you tried that instead?
JKCho
Pyrite | Level 9

Nope. I am trying now. I know the code down there does not work. Can you suggest some? I am not sure where I need to create n(cf) in this self join.

proc sql ;
  create table step5 as 
    select a.*, b.n(cf)
    from step4 a left join step4 b
    on a.gvkey = b.gvkey
    where a.fyear between a.fyear-1 and a.fyear-5;
quit;
Reeza
Super User
Functions go around the variable reference, n(b.cf)

proc sql ;
create table step5 as
select a.*, n(b.cf)
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between a.fyear-1 and a.fyear-5;
quit;
Reeza
Super User
You also need to change your reference on the between to b from a. ie a.fyear between b.fyear-1 and b.fyear-5.
Check that the a/b references are what you want.
JKCho
Pyrite | Level 9

Thank you Reeze!

 

 

proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5;
quit;

I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)

 

It seems that the number of n_cf should be bounded by on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;

But this is not the case. 

 

Any ideas?

 

tarheel13
Rhodochrosite | Level 12
If you want the count by firm you would have to do group by firm.
Reeza
Super User
proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5
group by gvkey;
quit;

@JKCho wrote:

Thank you Reeze!

 

 

proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5;
quit;

I now use this code but there is an issue. the value of n_cf is the whole number of observations of cf, so it is like 2,000,000. ( I have a similar number of observations)

 

It seems that the number of n_cf should be bounded by on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;

But this is not the case. 

 

Any ideas?

 


 

JKCho
Pyrite | Level 9

Thank you again.

 

I realized this programming code is not for what I want. 

I just want to have is... if my raw data is... 

Id year cf 

1 2000  . 

1 2001  . 

1 2002  100

1 2003  200

1 2004  300

1 2005  150

1 2006  900    

1 2007   .        

1 2008   .

 

then....

 

Id year cf     n_cf

1 2000  .        0

1 2001  .        0

1 2002  100   0

1 2003  200   1

1 2004  300   2

1 2005  150   3    

1 2006  900   4

1 2007  .        5

1 2008  .        4 

 

n_cf is the number of non-missing cf variables between t-1 year and t-5 year.

I thought 

where a.fyear between b.fyear-1 and b.fyear-5;

 the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).

 

Does any have thoughts on that? Thank you so much!

 

Reeza
Super User
Note the GROUP BY in my last response.
JKCho
Pyrite | Level 9
well. I know you said that. What I said is... what you said also does not work. Seems no way and I have to approach in a different way. Thx!
SASKiwi
PROC Star

A query like this will get you the result you want, assuming you've already got the required HAVE input dataset

proc sql;
  create table want as
  select ID
        ,year
        ,cf
        ,count(cf)
  from have
  group by id
          ,year
          ,cf
;
quit;
Kurt_Bremser
Super User

After correcting a mistake, and adapting the variable names to your data, this works and creates the expected result:

data have;
input Id year cf;
datalines; 
1 2000  . 
1 2001  . 
1 2002  100
1 2003  200
1 2004  300
1 2005  150
1 2006  900    
1 2007   .        
1 2008   .
;

%let start = 1800;
%let end = 2100;

data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
do until (last.id);
  set have;
  by id;
  years{year} + (cf ne .);
end;
count_cf = 0;
do until (last.id);
  set have;
  by id;
  count_cf = 0;
  do fyear = year - 5 to year - 1;
    count_cf + years{fyear};
  end;
  output;
end;
drop fyear;
run;

proc print data=want noobs;
run;

Result:

Id	year	cf	count_cf
1	2000	.	0
1	2001	.	0
1	2002	100	0
1	2003	200	1
1	2004	300	2
1	2005	150	3
1	2006	900	4
1	2007	.	5
1	2008	.	4

@JKCho wrote:

Thank you again.

 

I realized this programming code is not for what I want. 

I just want to have is... if my raw data is... 

Id year cf 

1 2000  . 

1 2001  . 

1 2002  100

1 2003  200

1 2004  300

1 2005  150

1 2006  900    

1 2007   .        

1 2008   .

 

then....

 

Id year cf     n_cf

1 2000  .        0

1 2001  .        0

1 2002  100   0

1 2003  200   1

1 2004  300   2

1 2005  150   3    

1 2006  900   4

1 2007  .        5

1 2008  .        4 

 

n_cf is the number of non-missing cf variables between t-1 year and t-5 year.

I thought 

where a.fyear between b.fyear-1 and b.fyear-5;

 the above part can do what I want but it actually did is first count "cf" of all observations In my dataset and then group by gvkey. I instead want to count cf observations within t-1 and t-5 by firm(gvkey).

 

Does any have thoughts on that? Thank you so much!

 


SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2002 views
  • 7 likes
  • 5 in conversation