Solved: Re: counting observations within a certain period

JKCho · Posted 06-29-2021 01:04 PM

proc sql;
create table comp2_1 as
 select *,(select count(cf) from comp2 where fyear between a.fyear-1 and a.fyear-5 and gvkey=a.gvkey) as count_cf
  from comp2 as a;
quit;

Hello,

I first used the code above but, as you might expect, it takes a very long time.

What I want is the number (count_cf) of cf observations between years t-5 and t-1 for each firm (gvkey).

Since some cf values are missing, I only want to count the non-missing ones, so I need to filter the missing ones out.

I'm fairly sure I'm close to the answer after searching this community, but I haven't got there yet.

Please share your thoughts!

Thank you.
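
On the missing-value point in the question above: in PROC SQL, COUNT(column) and N(column) count only non-missing values, so rows with a missing cf should not need separate filtering before counting. A minimal sketch, assuming a made-up three-row table (DEMO and its values are illustrative only, not data from this thread):

```sas
/* Illustrative data: two non-missing cf values and one missing */
data demo;
  input cf;
  datalines;
100
.
300
;

proc sql;
  select count(*)  as n_rows,        /* all rows: 3 */
         count(cf) as n_nonmissing,  /* non-missing cf only: 2 */
         nmiss(cf) as n_missing      /* missing cf only: 1 */
  from demo;
quit;
```

The same holds for the correlated subquery in the original post: count(cf) there already skips missing cf values.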

Kurt_Bremser · Posted 06-29-2021 03:10 PM

Run a double DO loop. In the first loop, populate an array (indexed by year). In the second loop, calculate the value.

%let start = 1800;
%let end = 2100;

data want;
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
do until (last.gvkey);
  set have;
  by gvkey;
  years{fyear} + (cf ne .);
end;
count_cf = 0;
do until (last.gvkey);
  set have;
  by gvkey;
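  /* add up the five prior years; SUM(OF array{a:b}) does not accept a computed range */
  count_cf = 0;
  do _n_ = fyear - 5 to fyear - 1;
    count_cf + years{_n_};
  end;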
  output;
end;
run;

Untested, posted from my tablet.

Reeza · Posted 06-29-2021 05:11 PM

A self join may even be slightly faster than a subquery, though. Have you tried that instead?

JKCho · Posted 06-29-2021 05:59 PM

Nope, I am trying it now. I know the code below does not work. Can you suggest a fix? I am not sure where I need to create n(cf) in this self join.

proc sql ;
  create table step5 as 
    select a.*, b.n(cf)
    from step4 a left join step4 b
    on a.gvkey = b.gvkey
    where a.fyear between a.fyear-1 and a.fyear-5;
quit;

Reeza · Posted 06-29-2021 06:00 PM

Functions go around the variable reference, n(b.cf)

proc sql ;
create table step5 as
select a.*, n(b.cf)
from step4 a left join step4 b
on a.gvkey = b.gvkey
where a.fyear between a.fyear-1 and a.fyear-5;
quit;

Reeza · Posted 06-29-2021 06:06 PM

You also need to change the references in the BETWEEN from a to b, i.e. a.fyear between b.fyear-1 and b.fyear-5.
Check that the a/b references are what you want.

JKCho · Posted 06-29-2021 06:20 PM

Thank you, Reeza!

proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5;
quit;

I now use this code, but there is an issue: the value of n_cf is the total number of cf observations, about 2,000,000 (roughly the number of observations I have).

I expected n_cf to be limited by the conditions on a.gvkey = b.gvkey where a.fyear between b.fyear-1 and b.fyear-5;

But this is not the case.

Any ideas?

tarheel13 · Posted 06-29-2021 10:33 PM

If you want the count by firm, you need to group by firm.

Reeza · Posted 06-30-2021 10:12 AM

proc sql ;
  create table step5 as 
    select a.*, n(b.cf) as n_cf
	from step4 a left join step4 b
	on a.gvkey = b.gvkey
    where a.fyear between b.fyear-1 and b.fyear-5
group by gvkey;
quit;

JKCho · Posted 06-30-2021 03:07 PM

Thank you again.

I realized this code does not do what I want.

What I want is this: if my raw data is...

Id year cf
1 2000 .
1 2001 .
1 2002 100
1 2003 200
1 2004 300
1 2005 150
1 2006 900
1 2007 .
1 2008 .

then....

Id year cf n_cf
1 2000 . 0
1 2001 . 0
1 2002 100 0
1 2003 200 1
1 2004 300 2
1 2005 150 3
1 2006 900 4
1 2007 . 5
1 2008 . 4

n_cf is the number of non-missing cf values between year t-5 and year t-1.

I thought

where a.fyear between b.fyear-1 and b.fyear-5;

the part above would do what I want, but what it actually does is first count cf over all observations in my dataset and then group by gvkey. I instead want to count cf observations within t-5 to t-1 by firm (gvkey).

Does anyone have thoughts on that? Thank you so much!

Reeza · Posted 06-30-2021 03:37 PM

Note the GROUP BY in my last response.

JKCho · Posted 06-30-2021 07:30 PM

Well, I know you said that. What I meant is that what you suggested does not work either. It seems there is no way to do it like this, and I have to take a different approach. Thanks!

SASKiwi · Posted 06-30-2021 08:27 PM

A query like this will get you the result you want, assuming you've already got the required HAVE input dataset:

proc sql;
  create table want as
  select ID
        ,year
        ,cf
        ,count(cf)
  from have
  group by id
          ,year
          ,cf
;
quit;
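
For the rolling window described earlier in the thread (non-missing cf in years t-5 through t-1, per Id), a self-join seems to need two things: the window written from the a side, so that b supplies the prior years, and a GROUP BY on the a row rather than on the firm alone. A minimal sketch, assuming HAVE holds Id, year and cf with one row per Id and year (WANT_SQL is a hypothetical output name):

```sas
proc sql;
  create table want_sql as
    select a.id,
           a.year,
           a.cf,
           count(b.cf) as n_cf   /* counts non-missing b.cf only */
    from have as a
         left join have as b
           on  a.id = b.id
           /* b rows from the five years before the a row */
           and b.year between a.year - 5 and a.year - 1
    group by a.id, a.year, a.cf
    order by a.id, a.year;
quit;
```

Keeping the window in the ON clause rather than in WHERE preserves a rows that have no prior-year match, so they come out with n_cf = 0 instead of being dropped. If Id and year do not uniquely identify a row, the GROUP BY would collapse duplicates, so that assumption matters.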

Kurt_Bremser · Posted 07-06-2021 05:19 AM

After correcting a mistake, and adapting the variable names to your data, this works and creates the expected result:

data have;
input Id year cf;
datalines; 
1 2000  . 
1 2001  . 
1 2002  100
1 2003  200
1 2004  300
1 2005  150
1 2006  900    
1 2007   .        
1 2008   .
;

%let start = 1800;
%let end = 2100;

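data want;
/* one counter per calendar year, reset to zero for every id */
array years {&start.:&end.} _temporary_;
do _n_ = &start. to &end.;
  years{_n_} = 0;
end;
/* first pass over the id: count non-missing cf per year */
do until (last.id);
  set have;
  by id;
  years{year} + (cf ne .);
end;
/* second pass over the same id: add up the counts for year-5 through year-1 */
do until (last.id);
  set have;
  by id;
  count_cf = 0;
  do fyear = year - 5 to year - 1;
    count_cf + years{fyear};
  end;
  output;
end;
drop fyear;
run;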

proc print data=want noobs;
run;

Result:

Id	year	cf	count_cf
1	2000	.	0
1	2001	.	0
1	2002	100	0
1	2003	200	1
1	2004	300	2
1	2005	150	3
1	2006	900	4
1	2007	.	5
1	2008	.	4
