Hi guys,
I need to count the number of observations for "each firm" in "each year" to be able to ignore those years with less than (for example) 3 observations.
my data set is something like this (and the count column is what I need!)
year permno count
1 1 3
1 1 3
1 1 3
2 1 1
3 1 1
1 2 1
2 2 4
2 2 4
2 2 4
2 2 4
3 2 1
4 2 1
can any one help me to write this with SQL this, please? (and since I am a newbie in sas, would you please describe it a little ?)
Thanks a lot
proc sql;
create table want as
select * from have natural join
(select *, count(*) as count
from have
group by permno, year
/* having count(*) >=3 */ ) ;
select * from want;
quit;
Says : Count observations within each group defined by permno and year (that's the part within parenthesis) then join these counts with the original table, joining on the variables that they have in common (natural join).
The commented out (having clause) is where you would exclude years with less than 3 observations .
PG
proc sql;
create table want as
select * from have natural join
(select *, count(*) as count
from have
group by permno, year
/* having count(*) >=3 */ ) ;
select * from want;
quit;
Says : Count observations within each group defined by permno and year (that's the part within parenthesis) then join these counts with the original table, joining on the variables that they have in common (natural join).
The commented out (having clause) is where you would exclude years with less than 3 observations .
PG
That worked like a charm. Thanks a lot PG
Hi PG,
What is natural join? I have used left, right, inner join, but have never used natural join. I am not too lazy to read. Your expanation would help me to remeber better.
Updated after Art's comments:smileysilly:.
Thank you!
Linlin: don't be lazy! Read!
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473691.htm
natural join is like inner join with an implicit ON clause on all the variables that have the same name in both tables. You can also do a natural left join.
PG
Thank you Art and PG!
The expanation I get from you would help me to understand better.
if i also want to take the average of those specific observation and excluding the one which the average less than e.g. n. how can I do that?
Thanks
I think this might work for you:
assume you want to take an average for variable VAR1
Proc sql;
create want as
select *, avg(VAR1) as average
from have
where average gt N;
It would be great if expert guys mention whether it works or not!
thank you very much. But can i combine this
Proc sql;
create want as
select *, avg(VAR1) as average
from have
where average gt N;
and this eg
proc sql noprint ;
create table want as
select *,count(*) as count
from have
group by year, permno
having calculated count > 2
;
quit;
to have 1 proc insteads of two
Actually the SAS specific implementation of SQL helps with this type of query. Try this program.
proc sql noprint ;
create table want as
select *,count(*) as count
from have
group by year, permno
;
quit;
Notice the log message :
NOTE: The query requires remerging summary statistics back with the original data.
So you get all of the original data and the new COUNT variable is merged on so that it has the same value for all observations for that year/permno combination. You can add a HAVING clause to eliminate the observations from year/permno combinations with insufficient counts.
proc sql noprint ;
create table want as
select *,count(*) as count
from have
group by year, permno
having calculated count > 2
;
quit;
That doesn't work Tom if the dataset contains only variables year and permno. In that case you get normal summarization without remerging.
PG
In that case use the new COUNT variable as a FREQ or WEIGHT variable in future analysis and save the disk space of store multiple copies of the group by variables.
hey guys,
what does select *,do in the above example? Can someone please explain
i mean i know about select var1,var2,..... from table or select*from table but i haven't seen this above before
It is just asking for all of the variables in the table. Since it is followed by count(*) as count, between the two statements it is asking for all of the variables as well as a count of the number of records.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.