## count of the number of observations based on the value of a variable

Solved
Frequent Contributor
Posts: 102

# count of the number of observations based on the value of a variable

Good morning,

******************************

Below is a display of my data (just showing a few observations for a large data set).

gvkey  year acode

1000 1990 10

1000 1990 20

1000 1991 10

1000 1991 11

1000 1991 12

1001 1990 10

1001 1990 15

1001 1991 11

1001 1991 13

1001 1991 14

Gvkey is firm identifier, acode is the analyst identifier, for firm 1000, it is followed by 2 financial analysts each is identified by its unique acode.

I want to compute how many firms each analyst cover in each year, expected outcome : for year 1990, acode=10, coverage=2 (it covers 1000, 1001 firms), for year 1990, acode=20, coverage=1. Etc.

I tried

proc freq data=a;

tables acode*year /out=counts;

run;

this will not work given that my data has 4 million records, computer froze while running.

can you suggest another way?

thanks,

Lan

Accepted Solutions
Solution
‎03-14-2015 04:52 PM
Super User
Posts: 8,127

## Re: count of the number of observations based on the value of a variable

You can use NOPRINT option to prevent the output that might be over whelming your computer.

You can use PROC SUMMARY to count.

You can SORT first to eliminate the need to store in memory all possible combinations.

proc summary data=have nway ;

by year acode ;

output out=want (rename=(_freq_=count)) ;

run;

All Replies
Frequent Contributor
Posts: 115

## Re: count of the number of observations based on the value of a variable

data have;

input gvkey  year acode;

datalines;

1000 1990 10

1000 1990 20

1000 1991 10

1000 1991 11

1000 1991 12

1001 1990 10

1001 1990 15

1001 1991 11

1001 1991 13

1001 1991 14

;

proc sort data=have out=h;

by year acode;

run;

data want;

set h;

by year acode;

if first.year then call missing(coverage);

if first.acode then coverage=0;

coverage+1;

if last.acode then output;

run;

HTH,

Naveen Srinivasan

L&T Infotech

Super User
Posts: 8,127

## Re: count of the number of observations based on the value of a variable

proc freq;

tables year*acode / output=want ;

run;

Frequent Contributor
Posts: 102

## Re: count of the number of observations based on the value of a variable

Hi Tom,

Your code is the same as my original post. Like I said when I ran it, with 4 million observations, computer froze, I wonder if some changes needed .

thanks,

Lan

Frequent Contributor
Posts: 115

## Re: count of the number of observations based on the value of a variable

Well, Tom uses the year variable as row and acode as column which is the opposite of your code to make a nice cross tab , which I believe is what you are after. Make that change as per tom's and run again. 4 million obs aint that big at all.

Solution
‎03-14-2015 04:52 PM
Super User
Posts: 8,127

## Re: count of the number of observations based on the value of a variable

You can use NOPRINT option to prevent the output that might be over whelming your computer.

You can use PROC SUMMARY to count.

You can SORT first to eliminate the need to store in memory all possible combinations.

proc summary data=have nway ;

by year acode ;

output out=want (rename=(_freq_=count)) ;

run;

Posts: 5,543

## Re: count of the number of observations based on the value of a variable

Try

proc sql;

create table want as

select acode, year, count(gvkey) as coverage

from have

group by acode, year;

quit;

Any performance improvement will require indexes being defined on your data table.

PG

PG
PROC Star
Posts: 8,169

## Re: count of the number of observations based on the value of a variable

I disagree! With 4 million records, the number of combinations could easily require proc freq to need more memory than you might have.  I'd use Naveen's suggested data step method. You should just change the sort and by variables to match the order you want in your resultant file. The following eliminates one unnecessary step in his suggested code. e.g.:

proc sort data=have out=h;

by acode year;

run;

data want;

set h;

by acode year;

if first.year then coverage=1;

else coverage+1;

if last.year then output;

run;

Frequent Contributor
Posts: 102

## Re: count of the number of observations based on the value of a variable

I want to thank naveen, Arthur, PGstats, and Tom. All of your suggestions work, in my mind, they are all correct.

When I use proc freq as Tom suggested (without using noprint option). the computer was overwhelmed.

The last post by Tom (using proc summary ) is the fastest way for my purpose. I compared the datasets using both Arthur's code and Tom's , they produce the same results.

now I have a data set with

cusip  year acode count

they are firm identifier, year, analyst identifier, and  the number of firms covered by each analyst.

I want to compute the median value of the number of firms covered by each analyst by each firm.

e.g.

 cusip year acode count 1001 1990 10000 12 1001 1990 10010 13 1001 1990 10020 14

then I expect the median =13,

is this correct way to do it ?

proc means data=have noprint;

class cusip year;

output out=stats;

output out=median median=;

run;

Thanks,

Lan

PROC Star
Posts: 8,169

## Re: count of the number of observations based on the value of a variable

Lan: In the future you really should ask such questions by starting a new discussion.

You've changed the names of your variables, as well as the level of data you are trying to analyze.  According to your latest example, you are trying to analyze COUNTS (are they what were previously labeled COVERAGE?), for each CUSIP (are they what were previously called GVKEY?), for each year and ACODE.

That isn't what your first question produced.

Proc means and proc summary use the exact same algorithm but, since proc summary doesn't produce any listing, NOPRINT wouldn't be needed with it. Two critical things you are missing are the NWAY option and the identification of an analysis variable.

If your data (I'll refer to the file as WANT as that is what the first step created) are already sorted, you would be better off using a BY statement rather than a CLASS statement. Both will accomplish the same thing but, if your data are already in the right order, the BY statement will cause the procedure to run faster.

proc summary data=want nway;

by cusip year;

var count;

output out=median (drop=_ median=;

run;

Frequent Contributor
Posts: 102

## Re: count of the number of observations based on the value of a variable

Thank you very much, Arthur.

I apologize for not practicing the best policy in terms of posting. In the future, I will separate my questions into new posts. I will also be consistent with variable naming.

Sincerely,

Lan

Frequent Contributor
Posts: 115

## Re: count of the number of observations based on the value of a variable

Hi Lan, It would make it convenient for community members to search through the forum if you break your questions to new discussions that warrants a new subject or a new solution.

Anyway, no worries, Continuing on  Art's code from the above, do another sort and typical transpose :

proc sort data=have out=h;

by acode year;

run;

data want;

set h;

by acode year;

if first.year then coverage=1;

else coverage+1;

if last.year then output;

run;

proc sort data=want out=w;

by year;

run;

data w;

array a(25) _temporary_ ;

call missing(of a{*});

do _n_= 1 by 1 until(last.year);

set w;

by year;

a(_n_)=coverage;

med_value=median(of a{*});

end;

run;

Sorry, i am too lazy to look through your "renamed additional requirement", so just took the names as it was in your original question.

Frequent Contributor
Posts: 102

## Re: count of the number of observations based on the value of a variable

Thanks so much to you Naveen as well !!!   I am not familiar with array, will try to learn more.

Sincerely,

Lan

🔒 This topic is solved and locked.