Solved: count of the number of observations based on the value of a variable

LanMin · Posted 03-14-2015 11:26 AM

Good morning,

******************************

Below is a display of my data (just showing a few observations for a large data set).

gvkey year acode

1000 1990 10

1000 1990 20

1000 1991 10

1000 1991 11

1000 1991 12

1001 1990 10

1001 1990 15

1001 1991 11

1001 1991 13

1001 1991 14

Gvkey is firm identifier, acode is the analyst identifier, for firm 1000, it is followed by 2 financial analysts each is identified by its unique acode.

I want to compute how many firms each analyst cover in each year, expected outcome : for year 1990, acode=10, coverage=2 (it covers 1000, 1001 firms), for year 1990, acode=20, coverage=1. Etc.

I tried

proc freq data=a;

tables acode*year /out=counts;

run;

this will not work given that my data has 4 million records, computer froze while running.

can you suggest another way?

thanks,

Lan

Tom · Posted 03-14-2015 04:52 PM

You can use NOPRINT option to prevent the output that might be over whelming your computer.

You can use PROC SUMMARY to count.

You can SORT first to eliminate the need to store in memory all possible combinations.

proc summary data=have nway ;

by year acode ;

output out=want (rename=(_freq_=count)) ;

run;

View solution in original post

naveen_srini · Posted 03-14-2015 12:11 PM

data have;

input gvkey year acode;

datalines;

1000 1990 10

1000 1990 20

1000 1991 10

1000 1991 11

1000 1991 12

1001 1990 10

1001 1990 15

1001 1991 11

1001 1991 13

1001 1991 14

;

proc sort data=have out=h;

by year acode;

run;

data want;

set h;

by year acode;

if first.year then call missing(coverage);

if first.acode then coverage=0;

coverage+1;

if last.acode then output;

run;

HTH,

Naveen Srinivasan

L&T Infotech

Tom · Posted 03-14-2015 12:58 PM

proc freq;

tables year*acode / output=want ;

run;

LanMin · Posted 03-14-2015 04:11 PM

Hi Tom,

Your code is the same as my original post. Like I said when I ran it, with 4 million observations, computer froze, I wonder if some changes needed .

thanks,

Lan

naveen_srini · Posted 03-14-2015 04:17 PM

Well, Tom uses the year variable as row and acode as column which is the opposite of your code to make a nice cross tab , which I believe is what you are after. Make that change as per tom's and run again. 4 million obs aint that big at all.

Tom · Posted 03-14-2015 04:52 PM

You can use NOPRINT option to prevent the output that might be over whelming your computer.

You can use PROC SUMMARY to count.

You can SORT first to eliminate the need to store in memory all possible combinations.

proc summary data=have nway ;

by year acode ;

output out=want (rename=(_freq_=count)) ;

run;

PGStats · Posted 03-14-2015 04:28 PM

Try

proc sql;

create table want as

select acode, year, count(gvkey) as coverage

from have

group by acode, year;

quit;

Any performance improvement will require indexes being defined on your data table.

PG

art297 · Posted 03-14-2015 04:35 PM

I disagree! With 4 million records, the number of combinations could easily require proc freq to need more memory than you might have. I'd use Naveen's suggested data step method. You should just change the sort and by variables to match the order you want in your resultant file. The following eliminates one unnecessary step in his suggested code. e.g.:

proc sort data=have out=h;

by acode year;

run;

data want;

set h;

by acode year;

if first.year then coverage=1;

else coverage+1;

if last.year then output;

run;

LanMin · Posted 03-14-2015 09:22 PM

I want to thank naveen, Arthur, PGstats, and Tom. All of your suggestions work, in my mind, they are all correct.

When I use proc freq as Tom suggested (without using noprint option). the computer was overwhelmed.

The last post by Tom (using proc summary ) is the fastest way for my purpose. I compared the datasets using both Arthur's code and Tom's , they produce the same results.

now I have a data set with

cusip year acode count

they are firm identifier, year, analyst identifier, and the number of firms covered by each analyst.

I want to compute the median value of the number of firms covered by each analyst by each firm.

e.g.

cusip	year	acode	count
1001	1990	10000	12
1001	1990	10010	13
1001	1990	10020	14

then I expect the median =13,

is this correct way to do it ?

proc means data=have noprint;

class cusip year;

output out=stats;

output out=median median=;

run;

Thanks,

Lan

art297 · Posted 03-15-2015 11:32 AM

Lan: In the future you really should ask such questions by starting a new discussion.

You've changed the names of your variables, as well as the level of data you are trying to analyze. According to your latest example, you are trying to analyze COUNTS (are they what were previously labeled COVERAGE?), for each CUSIP (are they what were previously called GVKEY?), for each year and ACODE.

That isn't what your first question produced.

Proc means and proc summary use the exact same algorithm but, since proc summary doesn't produce any listing, NOPRINT wouldn't be needed with it. Two critical things you are missing are the NWAY option and the identification of an analysis variable.

If your data (I'll refer to the file as WANT as that is what the first step created) are already sorted, you would be better off using a BY statement rather than a CLASS statement. Both will accomplish the same thing but, if your data are already in the right order, the BY statement will cause the procedure to run faster.

proc summary data=want nway;

by cusip year;

var count;

output out=median (drop=_:) median=;

run;

LanMin · Posted 03-15-2015 05:09 PM

Thank you very much, Arthur.

I apologize for not practicing the best policy in terms of posting. In the future, I will separate my questions into new posts. I will also be consistent with variable naming.

Your code works great.

Sincerely,

Lan

naveen_srini · Posted 03-15-2015 12:40 PM

Hi Lan, It would make it convenient for community members to search through the forum if you break your questions to new discussions that warrants a new subject or a new solution.

Anyway, no worries, Continuing on Art's code from the above, do another sort and typical transpose :

proc sort data=have out=h;

by acode year;

run;

data want;

set h;

by acode year;

if first.year then coverage=1;

else coverage+1;

if last.year then output;

run;

proc sort data=want out=w;

by year;

run;

data w;

array a(25) _temporary_ ;

call missing(of a{*});

do _n_= 1 by 1 until(last.year);

set w;

by year;

a(_n_)=coverage;

med_value=median(of a{*});

end;

run;

Sorry, i am too lazy to look through your "renamed additional requirement", so just took the names as it was in your original question.

LanMin · Posted 03-15-2015 05:11 PM

Thanks so much to you Naveen as well !!! I am not familiar with array, will try to learn more.

Sincerely,

Lan

count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Re: count of the number of observations based on the value of a variable

Registration is open

SAS Training: Just a Click Away