Good morning,
******************************
Below is a display of my data (just showing a few observations for a large data set).
gvkey year acode
1000 1990 10
1000 1990 20
1000 1991 10
1000 1991 11
1000 1991 12
1001 1990 10
1001 1990 15
1001 1991 11
1001 1991 13
1001 1991 14
Gvkey is firm identifier, acode is the analyst identifier, for firm 1000, it is followed by 2 financial analysts each is identified by its unique acode.
I want to compute how many firms each analyst cover in each year, expected outcome : for year 1990, acode=10, coverage=2 (it covers 1000, 1001 firms), for year 1990, acode=20, coverage=1. Etc.
I tried
proc freq data=a;
tables acode*year /out=counts;
run;
this will not work given that my data has 4 million records, computer froze while running.
can you suggest another way?
thanks,
Lan
You can use NOPRINT option to prevent the output that might be over whelming your computer.
You can use PROC SUMMARY to count.
You can SORT first to eliminate the need to store in memory all possible combinations.
proc summary data=have nway ;
by year acode ;
output out=want (rename=(_freq_=count)) ;
run;
data have;
input gvkey year acode;
datalines;
1000 1990 10
1000 1990 20
1000 1991 10
1000 1991 11
1000 1991 12
1001 1990 10
1001 1990 15
1001 1991 11
1001 1991 13
1001 1991 14
;
proc sort data=have out=h;
by year acode;
run;
data want;
set h;
by year acode;
if first.year then call missing(coverage);
if first.acode then coverage=0;
coverage+1;
if last.acode then output;
run;
HTH,
Naveen Srinivasan
L&T Infotech
proc freq;
tables year*acode / output=want ;
run;
Hi Tom,
Your code is the same as my original post. Like I said when I ran it, with 4 million observations, computer froze, I wonder if some changes needed .
thanks,
Lan
Well, Tom uses the year variable as row and acode as column which is the opposite of your code to make a nice cross tab , which I believe is what you are after. Make that change as per tom's and run again. 4 million obs aint that big at all.
You can use NOPRINT option to prevent the output that might be over whelming your computer.
You can use PROC SUMMARY to count.
You can SORT first to eliminate the need to store in memory all possible combinations.
proc summary data=have nway ;
by year acode ;
output out=want (rename=(_freq_=count)) ;
run;
Try
proc sql;
create table want as
select acode, year, count(gvkey) as coverage
from have
group by acode, year;
quit;
Any performance improvement will require indexes being defined on your data table.
PG
I disagree! With 4 million records, the number of combinations could easily require proc freq to need more memory than you might have. I'd use Naveen's suggested data step method. You should just change the sort and by variables to match the order you want in your resultant file. The following eliminates one unnecessary step in his suggested code. e.g.:
proc sort data=have out=h;
by acode year;
run;
data want;
set h;
by acode year;
if first.year then coverage=1;
else coverage+1;
if last.year then output;
run;
I want to thank naveen, Arthur, PGstats, and Tom. All of your suggestions work, in my mind, they are all correct.
When I use proc freq as Tom suggested (without using noprint option). the computer was overwhelmed.
The last post by Tom (using proc summary ) is the fastest way for my purpose. I compared the datasets using both Arthur's code and Tom's , they produce the same results.
now I have a data set with
cusip year acode count
they are firm identifier, year, analyst identifier, and the number of firms covered by each analyst.
I want to compute the median value of the number of firms covered by each analyst by each firm.
e.g.
cusip | year | acode | count |
1001 | 1990 | 10000 | 12 |
1001 | 1990 | 10010 | 13 |
1001 | 1990 | 10020 | 14 |
then I expect the median =13,
is this correct way to do it ?
proc means data=have noprint;
class cusip year;
output out=stats;
output out=median median=;
run;
Thanks,
Lan
Lan: In the future you really should ask such questions by starting a new discussion.
You've changed the names of your variables, as well as the level of data you are trying to analyze. According to your latest example, you are trying to analyze COUNTS (are they what were previously labeled COVERAGE?), for each CUSIP (are they what were previously called GVKEY?), for each year and ACODE.
That isn't what your first question produced.
Proc means and proc summary use the exact same algorithm but, since proc summary doesn't produce any listing, NOPRINT wouldn't be needed with it. Two critical things you are missing are the NWAY option and the identification of an analysis variable.
If your data (I'll refer to the file as WANT as that is what the first step created) are already sorted, you would be better off using a BY statement rather than a CLASS statement. Both will accomplish the same thing but, if your data are already in the right order, the BY statement will cause the procedure to run faster.
proc summary data=want nway;
by cusip year;
var count;
output out=median (drop=_:) median=;
run;
Thank you very much, Arthur.
I apologize for not practicing the best policy in terms of posting. In the future, I will separate my questions into new posts. I will also be consistent with variable naming.
Your code works great.
Sincerely,
Lan
Hi Lan, It would make it convenient for community members to search through the forum if you break your questions to new discussions that warrants a new subject or a new solution.
Anyway, no worries, Continuing on Art's code from the above, do another sort and typical transpose :
proc sort data=have out=h;
by acode year;
run;
data want;
set h;
by acode year;
if first.year then coverage=1;
else coverage+1;
if last.year then output;
run;
proc sort data=want out=w;
by year;
run;
data w;
array a(25) _temporary_ ;
call missing(of a{*});
do _n_= 1 by 1 until(last.year);
set w;
by year;
a(_n_)=coverage;
med_value=median(of a{*});
end;
run;
Sorry, i am too lazy to look through your "renamed additional requirement", so just took the names as it was in your original question.
Thanks so much to you Naveen as well !!! I am not familiar with array, will try to learn more.
Sincerely,
Lan
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.